
Provide an alternative to the YamlReader/FileHandler scheme for better synergy in multi-file datasets #2605

Open
mraspaud opened this issue Oct 20, 2023 · 5 comments · May be fixed by #2697
Assignees
Labels
component:readers · enhancement · future ideas

Comments

@mraspaud
Member

Feature Request

Is your feature request related to a problem? Please describe.
The current way to load data in Satpy is inherently per-file. While this works well in most cases, it becomes difficult and hacky when a dataset is based on multiple files that depend on each other, or when the files contain the same base metadata that we want to avoid loading multiple times.

Describe the solution you'd like
I would like an alternative to the YamlReader class that can take in multiple files at once and, by analogy with the current architecture, perhaps a MultipleFileHandler that handles multiple files at once, for better performance or smarter DataArray and metadata extraction.
Examples of readers that would benefit from this architecture:

  • FCI metadata caching could be done per dataset instead of per-file
  • HRIT PRO and EPI files would not need to be dependencies of the other files.
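
To make the idea concrete, here is a rough sketch of what such a handler's interface could look like. The `MultiFileHandler` name and all of its methods are hypothetical, not existing Satpy API:

```python
# Hypothetical sketch of a handler that receives all matching files at
# once, so shared metadata can be read a single time instead of once per
# file. None of these names exist in Satpy today.


class MultiFileHandler:
    """Handle a set of interdependent files as a single unit."""

    def __init__(self, filenames):
        self.filenames = list(filenames)
        self.shared_metadata = self._read_shared_metadata()

    def _read_shared_metadata(self):
        # Stand-in for reading common attributes once (e.g. FCI global
        # attributes shared by all segments of a time slot).
        return {"segment_count": len(self.filenames)}

    def get_dataset(self, dataset_id):
        # Assemble a variable across all files in one pass; here just a
        # placeholder reporting which files would contribute.
        return {"name": dataset_id, "sources": self.filenames}
```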

Describe any changes to existing user workflow
The user workflow will not change. However, developers will have two ways of creating a reader, which could be confusing if we don't document the difference clearly.

Additional context
One added benefit would be that this reader architecture could natively return an xarray DataTree.

@mraspaud mraspaud added the enhancement, component:readers, and future ideas labels Oct 20, 2023
@mraspaud mraspaud self-assigned this Oct 20, 2023
@djhoese
Member

djhoese commented Oct 22, 2023

I think this sounds good, reasonable, and not "that hard" to do given how modular we made the reader infrastructure already. Along these lines it'd be great to not need the YAML configuration at all, but maybe your request isn't the right time for this.

Other thoughts/ideas this makes me think of: I've always hated that to create a reader you need to call two or three classmethods on the reader class. I think it's: match files, create file handlers, create reader instance, then finally get_dataset or whatever it is called.

...and I forgot the other idea I had as I was typing the above. I think our YAML-based and very modular per-file readers are just a nice "mold" that technically fits every situation we've run into, in the sense that it is "you have files, break the problem down into these parts, go". But as you've pointed out, that doesn't make it the best design. We're at the point where we need faster, smarter readers, and they can't be that while trying to fit this mold (YAML-based per-file file handlers) without some really ugly coding.

👍 from me on experimenting with this.
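
For reference, the multi-step setup described above could be sketched like this. All names here are illustrative toys, not Satpy's actual classes or method names:

```python
# Toy illustration of the "two or three classmethods" pattern: match
# files, create file handlers, create the reader instance, then finally
# load data. Names are hypothetical, not Satpy API.


class ToyFileHandler:
    def __init__(self, filename):
        self.filename = filename


class ToyReader:
    @classmethod
    def match_files(cls, filenames, suffix=".nc"):
        # Step 1: select the files this reader understands.
        return [f for f in filenames if f.endswith(suffix)]

    @classmethod
    def create_file_handlers(cls, matched):
        # Step 2: one handler per file.
        return [ToyFileHandler(f) for f in matched]

    @classmethod
    def from_files(cls, filenames):
        # Step 3: build the reader instance from the handlers.
        reader = cls()
        reader.file_handlers = cls.create_file_handlers(
            cls.match_files(filenames))
        return reader

    def get_dataset(self, name):
        # Step 4: finally, load the data.
        return [fh.filename for fh in self.file_handlers]
```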

@pnuu
Member

pnuu commented Nov 8, 2023

I've been trying to speed up remote FCI reading by doing some caching in h5netcdf (h5netcdf/h5netcdf#221), and now the biggest bottleneck seems to be that there are a gazillion requests made to S3. Ideally the number of requests could go down by a factor of 40 (2 sets of 40 segments per time slot). Naturally not all attributes/metadata can be shared between the segments, but in any case this would save a ton of time. Currently just creating the Scene from local CEPH/S3 takes roughly 5 minutes. Even for local files the Scene creation takes 10+ seconds, so there'd be a big gain in that case too.
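
The kind of sharing meant here could look roughly like the following, a minimal sketch where `shared_metadata` stands in for an expensive per-file read (the function names and the counting scaffolding are illustrative, not Satpy code):

```python
# Sketch: read metadata shared by all segments of a time slot once,
# instead of once per segment. CALLS just counts the expensive reads.
from functools import lru_cache

CALLS = {"n": 0}


@lru_cache(maxsize=None)
def shared_metadata(time_slot):
    # Pretend this is an expensive request to S3; caching on the time
    # slot means 1 request instead of 40 (one per segment).
    CALLS["n"] += 1
    return {"time_slot": time_slot}


def open_segment(filename, time_slot):
    # Per-segment open reuses the cached shared portion.
    return {"filename": filename, **shared_metadata(time_slot)}
```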

@djhoese
Member

djhoese commented Nov 8, 2023

I was thinking about FCI recently and thought it might be easy to move file handler creation in the existing YAML reader to use dask delayed objects or maybe futures. There is no reason I can think of that file reading can't be done in parallel and for S3 access this would be really helpful.
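
A minimal sketch of the idea, using standard-library futures rather than dask delayed for brevity (`make_file_handler` here is a stand-in for whatever constructor does the actual file I/O, not an existing Satpy function):

```python
# Sketch: create all file handlers concurrently so that per-file
# latency (especially for S3-backed files) overlaps instead of
# accumulating serially.
from concurrent.futures import ThreadPoolExecutor


def make_file_handler(filename):
    # Stand-in for a file handler constructor that performs I/O
    # (opening the file, reading global attributes, etc.).
    return {"filename": filename}


def create_filehandlers_parallel(filenames, max_workers=8):
    # Executor.map preserves input order, so the handler list lines up
    # with the filename list.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(make_file_handler, filenames))
```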

@pnuu
Member

pnuu commented Nov 21, 2023

Forgot to link my more recent test with parallel file handler creation for FCI, so for future reference: https://pytroll.slack.com/archives/C011NR3LE20/p1699870659317759 There's also a link there to the branch I used.

So the parallelization works but doesn't give any benefit in Scene creation time for local files.

@djhoese
Member

djhoese commented Nov 21, 2023

I wonder if that would work better in a distributed/multiprocess environment (the dask distributed scheduler), since in the current threaded environment we are limited by the GIL... but that would also suggest that we aren't I/O bound, or that something is doing I/O while holding the GIL. Oh, or the NetCDF library, I suppose, and its internal locking.

@mraspaud mraspaud linked a pull request Dec 18, 2023 that will close this issue