
Provide an alternative to the YamlReader/FileHandler scheme for better synergy in multi-file datasets #2605

Open
mraspaud opened this issue Oct 20, 2023 · 5 comments · May be fixed by #2697
Assignees
Labels
component:readers · enhancement · future ideas

Comments

@mraspaud
Member

Feature Request

Is your feature request related to a problem? Please describe.
The current way to load data in Satpy is inherently per-file. While this works well in most cases, it becomes difficult and hacky when a dataset is based on multiple files that depend on each other, or when the files contain the same base metadata that we want to avoid loading multiple times.

Describe the solution you'd like
I would like an alternative to the YamlReader class that can take in multiple files at once and, by analogy with the current architecture, perhaps a MultipleFileHandler that handles multiple files at once, for better performance or smarter DataArray and metadata extraction.
Examples of readers that would benefit from this architecture:

  • FCI metadata caching could be done per dataset instead of per-file
  • HRIT PRO and EPI files would not need to be dependencies of the other files.
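
To make the idea concrete, here is a rough sketch of what such a handler's interface could look like. The `MultiFileHandler` name and all of its methods are hypothetical, not existing Satpy API:

```python
# Hypothetical sketch of a handler that receives all matching files at
# once, so shared metadata can be read a single time instead of once per
# file. None of these names exist in Satpy today.


class MultiFileHandler:
    """Handle a set of interdependent files as a single unit."""

    def __init__(self, filenames):
        self.filenames = list(filenames)
        self.shared_metadata = self._read_shared_metadata()

    def _read_shared_metadata(self):
        # Stand-in for reading common attributes once (e.g. FCI global
        # attributes shared by all segments of a time slot).
        return {"segment_count": len(self.filenames)}

    def get_dataset(self, dataset_id):
        # Assemble a variable across all files in one pass; here just a
        # placeholder reporting which files would contribute.
        return {"name": dataset_id, "sources": self.filenames}
```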

Describe any changes to existing user workflow
The user workflow will not change. However, developers will have two ways of creating a reader, which could be confusing if we don't document the difference clearly.

Additional context
One added benefit would be that this reader architecture could natively return an xarray DataTree.

@mraspaud mraspaud added the enhancement, component:readers, and future ideas labels Oct 20, 2023
@mraspaud mraspaud self-assigned this Oct 20, 2023
@djhoese
Member

djhoese commented Oct 22, 2023

I think this sounds good, reasonable, and not "that hard" to do given how modular we made the reader infrastructure already. Along these lines it'd be great to not need the YAML configuration at all, but maybe your request isn't the right time for this.

Other thoughts/ideas this makes me think of: I've always hated that to create a reader you need to call two or three classmethods on the reader class. I think it's: match files, create file handlers, create reader instance, then finally get_dataset or whatever it is called.

...and I forgot the other idea I had as I was typing the above. I think our YAML-based and very modular per-file readers are just a nice "mold" that technically fits every situation we've run into, in the sense that it is "you have files, break the problem down into these parts, go". But as you've pointed out, that doesn't make it the best design. We're at the point where we need faster, smarter readers, and they can't be that while trying to fit this mold (YAML-based per-file file handlers) without some really ugly coding.

👍 from me on experimenting with this.
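
For reference, the multi-step setup described above could be sketched like this. All names here are illustrative toys, not Satpy's actual classes or method names:

```python
# Toy illustration of the "two or three classmethods" pattern: match
# files, create file handlers, create the reader instance, then finally
# load data. Names are hypothetical, not Satpy API.


class ToyFileHandler:
    def __init__(self, filename):
        self.filename = filename


class ToyReader:
    @classmethod
    def match_files(cls, filenames, suffix=".nc"):
        # Step 1: select the files this reader understands.
        return [f for f in filenames if f.endswith(suffix)]

    @classmethod
    def create_file_handlers(cls, matched):
        # Step 2: one handler per file.
        return [ToyFileHandler(f) for f in matched]

    @classmethod
    def from_files(cls, filenames):
        # Step 3: build the reader instance from the handlers.
        reader = cls()
        reader.file_handlers = cls.create_file_handlers(
            cls.match_files(filenames))
        return reader

    def get_dataset(self, name):
        # Step 4: finally, load the data.
        return [fh.filename for fh in self.file_handlers]
```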

@pnuu
Member

pnuu commented Nov 8, 2023

I've been trying to speed up remote FCI reading by doing some caching in h5netcdf (h5netcdf/h5netcdf#221), and now the biggest bottleneck seems to be that there are a gazillion requests made to S3. Ideally the number of requests could go down by a factor of 40 (2 sets of 40 segments per time slot). Naturally not all attributes/metadata can be shared between the segments, but in any case this would save a ton of time. Currently just creating the Scene from local CEPH/S3 takes roughly 5 minutes. Even for local files the Scene creation takes 10+ seconds, so there'd be a big gain in that case too.
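
The kind of sharing meant here could look roughly like the following, a minimal sketch where `shared_metadata` stands in for an expensive per-file read (the function names and the counting scaffolding are illustrative, not Satpy code):

```python
# Sketch: read metadata shared by all segments of a time slot once,
# instead of once per segment. CALLS just counts the expensive reads.
from functools import lru_cache

CALLS = {"n": 0}


@lru_cache(maxsize=None)
def shared_metadata(time_slot):
    # Pretend this is an expensive request to S3; caching on the time
    # slot means 1 request instead of 40 (one per segment).
    CALLS["n"] += 1
    return {"time_slot": time_slot}


def open_segment(filename, time_slot):
    # Per-segment open reuses the cached shared portion.
    return {"filename": filename, **shared_metadata(time_slot)}
```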

@djhoese
Member

djhoese commented Nov 8, 2023

I was thinking about FCI recently and thought it might be easy to move file handler creation in the existing YAML reader to use dask delayed objects or maybe futures. There is no reason I can think of that file reading can't be done in parallel and for S3 access this would be really helpful.
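
A minimal sketch of the idea, using standard-library futures rather than dask delayed for brevity (`make_file_handler` here is a stand-in for whatever constructor does the actual file I/O, not an existing Satpy function):

```python
# Sketch: create all file handlers concurrently so that per-file
# latency (especially for S3-backed files) overlaps instead of
# accumulating serially.
from concurrent.futures import ThreadPoolExecutor


def make_file_handler(filename):
    # Stand-in for a file handler constructor that performs I/O
    # (opening the file, reading global attributes, etc.).
    return {"filename": filename}


def create_filehandlers_parallel(filenames, max_workers=8):
    # Executor.map preserves input order, so the handler list lines up
    # with the filename list.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(make_file_handler, filenames))
```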

@pnuu
Member

pnuu commented Nov 21, 2023

Forgot to link my more recent test with parallel file handler creation for FCI, so for future reference: https://pytroll.slack.com/archives/C011NR3LE20/p1699870659317759 There's also a link there to the branch I used.

So the parallelization works but doesn't give any benefit in Scene creation time for local files.

@djhoese
Member

djhoese commented Nov 21, 2023

I wonder if that would work better in a distributed/multiprocess environment (the dask distributed scheduler), since in the current threaded environment we are limited by the GIL... but that would also suggest that we aren't I/O bound, or that something is doing I/O while holding the GIL. Oh, or the NetCDF library, I suppose, and its internal locking.

@mraspaud mraspaud linked a pull request Dec 18, 2023 that will close this issue