
open_datatree performance improvement on NetCDF, H5, and Zarr files #9014

Open · wants to merge 28 commits into main
Conversation

aladinor

@aladinor aladinor commented May 7, 2024

open_datatree performance improvement on NetCDF files


welcome bot commented May 7, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label May 8, 2024
@TomNicholas TomNicholas added this to In progress in DataTree integration via automation May 8, 2024
@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label May 10, 2024
@aladinor changed the title from "open_datatree performance improvement on NetCDF files" to "open_datatree performance improvement on NetCDF and Zarr files" May 10, 2024
@@ -416,6 +415,104 @@ class ZarrStore(AbstractWritableDataStore):
"_close_store_on_close",
)

@classmethod
def open_store(
Member

Could you rewrite open_group to call open_store() internally? That would reduce the amount of duplicated code and make this easier to maintain going forward.
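A rough sketch of the shape this refactor could take, using toy classes rather than the real ZarrStore API (names here are illustrative only):

```python
# Hypothetical sketch of the suggested refactor: open_store() holds the
# (previously duplicated) setup logic once and returns a store per group,
# and open_group() just delegates to it. ToyZarrStore is a stand-in, not
# xarray's actual ZarrStore.

class ToyZarrStore:
    def __init__(self, group: str):
        self.group = group

    @classmethod
    def open_store(cls, groups: list[str]) -> dict[str, "ToyZarrStore"]:
        # Single place that performs the shared setup for every group.
        return {g: cls(group=g) for g in groups}

    @classmethod
    def open_group(cls, group: str) -> "ToyZarrStore":
        # Reuses open_store instead of repeating its body.
        return cls.open_store([group])[group]
```

The point of the pattern is that any future change to the setup logic lands in one method instead of two.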

Contributor

@flamingbear left a comment

I had thoughts about the legacyhdf5 api and how it might be incorporated.

@@ -16,7 +16,6 @@
BackendEntrypoint,
WritableCFDataStore,
_normalize_path,
_open_datatree_netcdf,
Contributor

Before this PR, _open_datatree_netcdf was used by both the netCDF4_.py and h5netcdf_.py backends. Would it be possible to move these changes back into the backends/common location and either remove or completely rewrite the _open_datatree_netcdf function?

Contributor

I think if you move these changes back to backends/common and leave them in _open_datatree_netcdf, you might be able to import the store for both the legacy HDF5 and netCDF4 libraries, and then call _open_datatree_netcdf for both h5netcdf and netCDF4.

Currently, _open_datatree_netcdf takes ncDataset: ncDataset | ncDatasetLegacyH5.

You might be able to include a new param with type cdfDataStore: NetCDF4DataStore | H5NetCDFStore and pass the appropriate one from both the h5netcdf_.py and netCDF4_.py backends.
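Sketched out with stand-in names (none of these are xarray's actual signatures), the suggestion might look like:

```python
# Hypothetical sketch of the generic-store idea: the common helper takes
# the store class (NetCDF4DataStore or H5NetCDFStore in the real code)
# as a parameter and builds one store per group from a shared manager.
# FakeStore and the helper name are illustrative only.

def _open_datatree_netcdf_common(store_cls, manager, groups, **kwargs):
    # The file/manager is opened once; each group gets its own store
    # object built from the same shared manager.
    return {g: store_cls(manager, group=g, **kwargs) for g in groups}

class FakeStore:
    def __init__(self, manager, group, **kwargs):
        self.manager = manager
        self.group = group

stores = _open_datatree_netcdf_common(FakeStore, "shared-manager", ["/", "/a"])
```

Either backend class would then plug into the same helper, keeping the tree-walking logic in one place.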

Author

Hi @flamingbear,

Following up on your comments, I would like to ask: does datatree support a to_h5 method? I was looking for its implementation or tests but didn't find them. Therefore, do we need to support the open_datatree method for H5 files even if we do not have a to_h5?

Please let me know your thoughts.

Contributor

@aladinor I think @flamingbear will have a much more detailed answer, so here are just some short thoughts in advance. Xarray is able to use netCDF4 and h5netcdf to read (almost) any HDF5 and netCDF4 files (a netCDF4 file is essentially an HDF5 file). netcdf4-python is able to read non-conforming files (as is h5netcdf). The same holds when writing: engine="netcdf4" and engine="h5netcdf" will create similar (netCDF4) files, but here the files will be standard-conforming (besides using engine h5netcdf with invalid_netcdf=True).

Contributor

Yes, Kai is correct. h5netcdf is just a legacy/alternative interface to netCDF4 files. Both libraries support a Dataset class with similar interfaces. The original datatree implementation just chose which Dataset class to use based on which library was available and used them the same way. Let me know if this helps or if you want more info.

Author

@aladinor May 29, 2024

Thanks, @kmuehlbauer and @flamingbear, for your feedback. I have one more question: can the proposed implementation for opening datatrees stored in NetCDF4 files also be used for H5 files?

The performance improvement relies on caching the store object for all nodes within the tree. However, we have a different store object for h5

https://github.com/pydata/xarray/blob/cd3ab8d5580eeb3639d38e1e884d2d9838ef6aa1/xarray/backends/h5netcdf_.py#L405C9-L415C10

As well as for netcdf4
https://github.com/pydata/xarray/blob/cd3ab8d5580eeb3639d38e1e884d2d9838ef6aa1/xarray/backends/netCDF4_.py#L646C8-L657C1.

Thus, should we create different implementations for each one or use one for both?

Contributor

Hi, I was just looking at this now. I was hoping that we could pass just the class NetCDF4DataStore or H5NetCDFStore to the common function and call GenericStore.open(), but it looks like the open functions take different keywords.
I was just checking to see if in NetCDF4_.open_datatree you can open the store, then pass it and the class to a common open_datatree:

```python
filename_or_obj = _normalize_path(filename_or_obj)
store = NetCDF4DataStore.open(...)
return _open_datatree_netcdf_common(NetCDF4DataStore, store, ...rest)
```

And in h5netcdf_.open_datatree:

```python
filename_or_obj = _normalize_path(filename_or_obj)
store = H5NetCDFStore.open(...)
return _open_datatree_netcdf_common(H5NetCDFStore, store, ...rest)
```

Though I haven't got it completed yet, or determined whether the different open() functions could be aligned with the same keywords.

Author

@aladinor May 29, 2024

Thanks, @flamingbear, for taking a look at it. I feel more confident about keeping them separate, as we must create a group_store when looping through each node:

```python
for group in _iter_nc_groups(store.ds):
    group_path = str(parent / group[1:])
    group_store = NetCDF4DataStore(manager, group=group_path, **kwargs)  # <--- here
    store_entrypoint = StoreBackendEntrypoint()
```

I created separate implementations to ensure all keywords are aligned with each store object.

Please let me know your thoughts.

Contributor

I'm trying not to harp on this too much, but that group_store you create has the same arguments whether it's a NetCDF4DataStore or an H5NetCDFStore, so it could be passed as a generic store. I'm still on the fence about whether this is easier to maintain because it's in one place, or whether it just adds complexity to try to reuse the code.

Contributor

I'll let someone else make this call.

The only diffs I found in the Store.open() keywords are:

- netcdf4: clobber, diskless, persist, autoclose
- h5netcdf: invalid_netcdf, phony_dims, decode_vlen_strings, driver, driver_kwds

But I'm a little naive about the named keywords vs **kwargs, and how I'd reconcile those.
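One hedged way to reconcile those keyword differences, sketched here with toy classes rather than xarray's real stores: name only the shared arguments in the common helper and forward the backend-specific ones through **kwargs untouched.

```python
# Illustrative sketch only: shared args (filename, mode) are named in
# the common helper; backend-specific ones (clobber/diskless/... vs
# invalid_netcdf/phony_dims/...) pass through **store_kwargs to
# whichever Store.open() is being called.

def open_store_common(store_cls, filename, mode="r", **store_kwargs):
    return store_cls.open(filename, mode=mode, **store_kwargs)

class ToyNetCDF4Store:
    @classmethod
    def open(cls, filename, mode="r", clobber=True, diskless=False):
        return ("netcdf4", filename, mode, clobber, diskless)

class ToyH5NetCDFStore:
    @classmethod
    def open(cls, filename, mode="r", invalid_netcdf=False, phony_dims=None):
        return ("h5netcdf", filename, mode, invalid_netcdf, phony_dims)

# Each call passes only the keywords its backend understands.
a = open_store_common(ToyNetCDF4Store, "f.nc", diskless=True)
b = open_store_common(ToyH5NetCDFStore, "f.nc", invalid_netcdf=True)
```

Whether this is worth it over two explicit implementations is exactly the maintainability trade-off discussed above.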

aladinor and others added 3 commits May 28, 2024 17:07
@aladinor aladinor requested a review from flamingbear May 29, 2024 01:34
aladinor

This comment was marked as outdated.

@aladinor changed the title from "open_datatree performance improvement on NetCDF and Zarr files" to "open_datatree performance improvement on NetCDF, H5, and Zarr files" May 29, 2024
drop_variables: str | Iterable[str] | None = None,
use_cftime=None,
decode_timedelta=None,
group=None,
Collaborator

Is there a reason why we have group here, or was that just copied over from open_dataset? If the latter, I think you could remove it, or we can define it as str | Iterable[str] | Callable if we think that would be useful. The callable would take paths and decide whether or not to include the group.

Author

Thanks, @keewis, for bringing this up. We added the group parameter in case someone decides to open just a specific group within the datatree (e.g. open_datatree(path2tree, group='group_0/subgroup_1')). I will define this as str | Iterable[str] | Callable.

Collaborator

if it turns out to be too difficult, we can also aim for group: str | None = None, and add Iterable[str] | Callable in a later PR (we would need additional code and tests for that new feature)

Author

I got a Mypy error. I am just rolling back to the original idea and then we can add a new PR in the future.

Collaborator

@keewis May 29, 2024

if I read the mypy output correctly, that was because the typing didn't include None, while the default was None. So group: str | None = None would work, but group: str = None would fail.

(but not sure how important typing is here)

Author

@keewis I will retry with the open_datatree function for h5 files to see if it works. I feel like I tested it using None, but I'm not sure. Let me see if I was making a mistake.

Author

I ran the following command locally to see what was happening:

```shell
python -m mypy --follow-imports=skip xarray/backends/h5netcdf_.py
```

and this is the output:

```
xarray/backends/h5netcdf_.py:477: error: Value of type "str | Iterable[str] | Callable[..., Any] | None" is not indexable  [index]
Found 2 errors in 1 file (checked 1 source file)
```

It seems like something is happening in this line:

```python
group_path = str(parent / group[1:])
```

I think it is because I am using group[1:] slicing. I need to concatenate the parent node with the subgroup. This subgroup is a string that starts with / (e.g. '/subgroup_1'), so if I use the whole string, it won't allow string concatenation. Any thoughts on how to handle this?
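One possible way around the mypy error (a sketch, not the PR's actual fix) is to narrow group to a plain str before touching it, and to strip the leading "/" explicitly rather than indexing; build_group_path is a hypothetical helper name:

```python
# Sketch: narrow the value to str first, then strip the leading "/"
# with removeprefix (Python 3.9+) instead of group[1:], which mypy
# rejects for the wider str | Iterable[str] | Callable | None type.
from __future__ import annotations

from pathlib import PurePosixPath

def build_group_path(parent: PurePosixPath, group: str | None) -> str:
    if group is None:
        group = "/"
    # After the None check, mypy knows group is a str, so stripping
    # and joining are both well-typed.
    return str(parent / group.removeprefix("/"))
```

The Iterable[str] and Callable variants would still need their own handling, which is consistent with restricting group to str | None in this PR.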

Collaborator

That makes sense, because for Iterable[str], for example, we need to do the same thing for multiple groups, and the callable would be applied to every existing group name. Since this would require additional code, let's restrict group to just str | None and figure out how exactly the iterable / callable should work in a separate PR.

Contributor

> open_datatree function for h5 files to see if it works

Remember, these are actually still netCDF4 files, just accessed with a different library: https://github.com/aladinor/xarray/blob/datatree-zarr/xarray/backends/h5netcdf_.py#L157-L164

I'm still looking to see if the open_datatree can be simplified, trying to understand where all of the keywords came from. That should be a separate comment.

Author

@keewis and @flamingbear, I think I solved the issue with type hints. It was caused by my using group[1:] slicing notation. I refactored it and submitted a new commit.

f"zarr version {zarr_version}. See also "
"https://github.com/zarr-developers/zarr-specs/issues/136"
)
gpaths = [str(group / i[1:]) for i in list(_iter_zarr_groups(zarr_group))]
Collaborator

no need to load the iterator into memory:

```diff
- gpaths = [str(group / i[1:]) for i in list(_iter_zarr_groups(zarr_group))]
+ gpaths = [str(group / i[1:]) for i in _iter_zarr_groups(zarr_group)]
```

Also, can we rename i to something else, like name or child? (or something, not sure those are good names either)
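A toy illustration of why the list() wrapper is redundant: the comprehension consumes the generator directly. Here _iter_groups_toy is a stand-in for _iter_zarr_groups, and the loop variable uses the suggested name instead of i:

```python
from pathlib import PurePosixPath

def _iter_groups_toy():
    # Stand-in for _iter_zarr_groups: yields "/name" strings lazily.
    yield "/a"
    yield "/b"

group = PurePosixPath("/root")
# The comprehension iterates the generator itself; list(...) would only
# materialize an intermediate list for no benefit.
gpaths = [str(group / name[1:]) for name in _iter_groups_toy()]
```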

@@ -431,11 +430,72 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
def open_datatree(
self,
filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
mask_and_scale=True,
Contributor

```diff
- mask_and_scale=True,
+ *,
+ mask_and_scale=True,
```

separate positional from keyword args like you do in netCDF4_.py

…g group variable typing hints (str | Iterable[str] | callable) under the open_datatree for h5 files. Finally, separating positional from keyword args
…ding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for netCDF files
…ding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for zarr files
drop_variables: str | Iterable[str] | None = None,
use_cftime=None,
decode_timedelta=None,
format=None,
Contributor

```diff
- format=None,
+ mode="r",
+ format=None,
```

I am still trying to figure out how you gathered these keyword args, but the h5netcdf store's open takes a mode.

filename_or_obj = _normalize_path(filename_or_obj)
store = H5NetCDFStore.open(
filename_or_obj,
format=format,
Contributor

```diff
- format=format,
+ mode=mode,
+ format=format,
```

Labels: io, run-benchmark, topic-backends, topic-DataTree, topic-performance

Successfully merging this pull request may close these issues.

Improving performance of open_datatree
7 participants