Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate BatchGenerator into standalone Slicer and Batcher components? #172

Open
weiji14 opened this issue Mar 7, 2023 · 2 comments
Open
Labels
question Further information is requested

Comments

@weiji14
Copy link
Member

weiji14 commented Mar 7, 2023

What is your issue?

Current state

Currently, xbatcher v0.3.0's BatchGenerator is this all-in-one class/function that does too many things, and there are more features planned. The 400+ lines of code at https://github.com/xarray-contrib/xbatcher/blob/v0.3.0/xbatcher/generators.py is not something easy for people to understand and contribute to without spending a few hours. To make things more maintainable and future proof, we might need a major refactor.

Proposal

Split BatchGenerator into 2 (or more) subcomponents. Specifically:

  1. A Slicer that does the slicing/subsetting/cropping/tiling/chipping from a multi-dimensional xarray object.
  2. A Batcher that groups together the pieces from the Slicer into batches of data.

These are the parameters from the current BatchGenerator that would be handled by each component:

Slicer:

  • input_dims
  • input_overlap

Batcher:

  • batch_dims
  • concat_input_dims
  • preload_batch

Benefits

Cons

  • May result in the current one-liner becoming a multi-liner
  • Could lead to some backwards incompatibility/breaking changes
@weiji14 weiji14 added the question Further information is requested label Mar 7, 2023
@maxrjones
Copy link
Member

Thanks for opening this issue @weiji14! Great idea for a refactor to simplify the code base, promote new contributions, and help solve the web of existing issues!

I think when using concat_input_dims=False, the division between Slicer and Batcher that you suggested makes a lot of sense and would be relatively simple to decouple (at least for those who've spent the time getting familiar with the current implementation).

When using concat_input_dims=True, it's a bit more complicated because batch_dims can impact slicing. Specifically, the input dataset is sliced on the union of input_dims and batch_dims in that case. There are a few options to account for this:

  1. Break backwards compatibility by not ever slicing on batch_dims, even when concat_input_dims==True
  2. batch_dims would need to also be included in Slicer
  3. A third component could handle slicing on batch_dims between the Slicer and Batcher components
  4. Additional slicing would happen in Batcher for this edge case

I expect that option 3 (a separate component for this edge case) would make the most sense. I'll work on this a bit now.

@cmdupuis3
Copy link

I think this setup would mimic what I'm doing now with my rolling/batching scheme outside of xbatcher. The important thing there is that I can explicitly control the batch sizes, even with predicates involved.

I think if we include predicates though, we need to have a map that can "unbatch" the results because the map may not be straightforward, especially if there are overlaps between the result chips. See #43

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants