open_datatree performance improvement on NetCDF, H5, and Zarr files #9014
base: main
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
@@ -416,6 +415,104 @@ class ZarrStore(AbstractWritableDataStore):
        "_close_store_on_close",
    )

+   @classmethod
+   def open_store(
Could you rewrite `open_group` to call `open_store()` internally? That would reduce the amount of duplicated code and make this easier to maintain going forward.
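A minimal, hypothetical sketch of the suggested refactor (not xarray's actual `ZarrStore` API): `open_store` does the expensive setup once and returns a store per group, and `open_group` simply delegates to it.

```python
# Illustrative stand-in for the ZarrStore refactor; all names are assumptions.
class Store:
    def __init__(self, root, group):
        self.root = root
        self.group = group

    @classmethod
    def open_store(cls, root):
        # single place for the setup logic; yields one store per group
        groups = list(root) if isinstance(root, dict) else ["/"]
        return {g: cls(root, g) for g in groups}

    @classmethod
    def open_group(cls, root, group):
        # reuses open_store internally instead of duplicating the setup
        return cls.open_store(root)[group]
```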
…to datatree-zarr merging branches
I had some thoughts about the legacy HDF5 API and how it might be incorporated.
@@ -16,7 +16,6 @@
    BackendEntrypoint,
    WritableCFDataStore,
    _normalize_path,
-   _open_datatree_netcdf,
Before this PR, `_open_datatree_netcdf` was used by both the netCDF4_.py and h5netcdf_.py backends. Would it be possible to move these changes back into the backends/common location and completely rewrite the `_open_datatree_netcdf` function?
I think if you move these changes back to `backends/common` and leave them in `_open_datatree_netcdf`, you might be able to import the store for both the legacy HDF5 and the netCDF4 libraries, and then call `_open_datatree_netcdf` for both.

Currently `_open_datatree_netcdf` takes `ncDataset: ncDataset | ncDatasetLegacyH5`; you might be able to include a new param of type `cdfDataStore: NetCDF4DataStore | H5NetCDFStore` and pass the appropriate one from both the h5netcdf_.py and netCDF4_.py backends.
Hi @flamingbear,

Following up on your comments, I would like to ask: does `datatree` support a `to_h5` method? I was looking for this method's implementation or tests but didn't find it. Therefore, do we need to support the `open_datatree` method for h5 files even if we do not have a `to_h5`?
Please let me know your thoughts.
@aladinor I think @flamingbear will have a much more detailed answer, so here are just some short thoughts in advance. Xarray is able to use `netCDF4` and `h5netcdf` to read (almost) any HDF5 and netCDF4 files (a netCDF4 file is essentially an HDF5 file). `netcdf4-python` is able to read non-conforming files (as is `h5netcdf`). The same goes for writing: `engine="netcdf4"` and `engine="h5netcdf"` will create similar (netCDF4) files, and these files will be standard-conforming (unless you use engine `h5netcdf` with `invalid_netcdf=True`).
Yes, Kai is correct. h5netcdf is just a legacy/alternative interface to netCDF4 files. Both libraries support a `Dataset` class with similar interfaces. The original datatree implementation just chose which `Dataset` class to use based on which library was available, and used them the same way. Let me know if this helps or if you want more info.
Thanks, @kmuehlbauer and @flamingbear, for your feedback. I have one more question: can I use the proposed implementation for opening a datatree stored in NetCDF4 files for H5 files as well?

The performance improvement relies on caching the store object for all nodes within the tree. However, we have a different store object for h5 than for netCDF4: https://github.com/pydata/xarray/blob/cd3ab8d5580eeb3639d38e1e884d2d9838ef6aa1/xarray/backends/netCDF4_.py#L646C8-L657C1.

Thus, should we create different implementations for each one, or use one for both?
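A hypothetical sketch of the caching idea under discussion: open the underlying file once and reuse its manager for every group's store, instead of reopening the file per node. All names here are illustrative, not xarray's real classes.

```python
class FileManager:
    """Stand-in for a caching file manager: counts how often the file opens."""
    open_count = 0

    def __init__(self, path):
        FileManager.open_count += 1  # a real manager would open the file here
        self.path = path

class GroupStore:
    def __init__(self, manager, group):
        self.manager = manager  # shared manager, not a fresh file handle
        self.group = group

def open_tree_stores(path, groups):
    manager = FileManager(path)  # single open for the whole tree
    return {g: GroupStore(manager, g) for g in groups}
```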
Hi, I was just looking at this now. I was hoping that we could pass just the `NetCDF4DataStore` or `H5NetCDFStore` class to the common function and call `GenericStore.open()`, but it looks like the open functions take different keywords.

I was just checking to see if, in `NetCDF4_.open_datatree`, you could open the store and pass both it and the class to a common `open_datatree`:

```python
filename_or_obj = _normalize_path(filename_or_obj)
store = NetCDF4DataStore.open(...)
return _open_datatree_netcdf_common(NetCDF4DataStore, store, ...rest)
```

And in `h5netcdf_.open_datatree`:

```python
filename_or_obj = _normalize_path(filename_or_obj)
store = H5NetCDFStore.open(...)
return _open_datatree_netcdf_common(H5NetCDFStore, store, ...rest)
```

Though I haven't got it completed yet, or determined whether the different `open()` functions could be aligned to take the same keywords.
Thanks, @flamingbear, for taking a look at it. I feel more confident about keeping them separate, as we must create a `group_store` when looping through each node:

```python
for group in _iter_nc_groups(store.ds):
    group_path = str(parent / group[1:])
    group_store = NetCDF4DataStore(manager, group=group_path, **kwargs)  # <--- here
    store_entrypoint = StoreBackendEntrypoint()
```

I created separate implementations to ensure all keywords are aligned with each store object.
Please let me know your thoughts.
I'm trying not to harp on this too much, but that `group_store` you create has the same arguments whether it's a `NetCDF4DataStore` or an `H5NetCDFStore`, so it could be passed as a generic store. I'm still on the fence about whether this is easier to maintain because it's in one place, or whether it's just adding complexity to try to reuse the code.
I'll let someone else make this call.

The only diffs I found in the `Store.open()` keywords are:

- netcdf: `clobber`, `diskless`, `persist`, `autoclose`
- h5netcdf: `invalid_netcdf`, `phony_dims`, `decode_vlen_strings`, `driver`, `driver_kwds`

But I'm a little naive about named keywords vs. `**kwargs`, and how I'd reconcile those.
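A sketch of one way the differing `Store.open()` keywords could be reconciled: the shared helper names only the common arguments and forwards backend-specific ones untouched via `**kwargs`. All signatures below are illustrative, not xarray's real ones.

```python
# Stand-in openers with the backend-specific keywords listed above.
def open_netcdf4(filename, mode="r", *, clobber=True, diskless=False,
                 persist=False, autoclose=False):
    return {"backend": "netcdf4", "file": filename, "mode": mode,
            "clobber": clobber}

def open_h5netcdf(filename, mode="r", *, invalid_netcdf=None,
                  phony_dims=None, decode_vlen_strings=True,
                  driver=None, driver_kwds=None):
    return {"backend": "h5netcdf", "file": filename, "mode": mode,
            "invalid_netcdf": invalid_netcdf}

def open_common(opener, filename, mode="r", **backend_kwargs):
    # common args are explicit; anything backend-specific passes through
    return opener(filename, mode=mode, **backend_kwargs)
```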
renaming variables Co-authored-by: Tom Nicholas <tom@cworthy.org>
xarray/backends/zarr.py
Outdated
drop_variables: str | Iterable[str] | None = None,
use_cftime=None,
decode_timedelta=None,
group=None,
Is there a reason why we have `group` here, or was that just copied over from `open_dataset`? If the latter, I think you could remove it, or we can define it as `str | Iterable[str] | callable` if we think that would be useful. The callable would take paths and decide whether or not to include the group.
Thanks, @keewis, for bringing this up. We added the `group` parameter in case someone decides to open just a specific group within the datatree (e.g. `open_datatree(path2tree, group='group_0/subgroup_1')`). I will define this as `str | Iterable[str] | callable`.
If it turns out to be too difficult, we can also aim for `group: str | None = None` and add `Iterable[str] | Callable` in a later PR (we would need additional code and tests for that new feature).
I got a mypy error. I am just rolling back to the original idea, and then we can add a new PR in the future.
If I read the `mypy` output correctly, that was because the typing didn't include `None` while the default was `None`. So `group: str | None = None` would work, but `group: str = None` would fail.

(But I'm not sure how important typing is here.)
@keewis I will retry with the `open_datatree` function for h5 files to see if it works. I feel like I tested it using `None`, but I'm not sure. Let me see if I was making a mistake.
I ran the following command locally to see what was happening:

```shell
python -m mypy --follow-imports=skip xarray/backends/h5netcdf_.py
```

and this is the output:

```
xarray/backends/h5netcdf_.py:477: error: Value of type "str | Iterable[str] | Callable[..., Any] | None" is not indexable  [index]
Found 2 errors in 1 file (checked 1 source file)
```

It seems like something is happening in this line:

```python
group_path = str(parent / group[1:])
```

I think it is because I am using `group[1:]` slicing. I need to concatenate the parent node with the subgroup. This subgroup is a string that starts with `/` (e.g. `'/subgroup_1'`), so if I use the whole string, it won't allow string concatenation. Any thoughts on how to handle this?
That makes sense, because for `Iterable[str]`, for example, we'd need to do the same thing for multiple groups, and a `callable` would be applied to every existing group name. Since this would require additional code, let's restrict `group` to just `str | None` and figure out how exactly the iterable / callable should work in a separate PR.
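A speculative sketch of how the deferred `str | Iterable[str] | Callable` form of `group` could be handled later: normalize every variant into a predicate over group paths. Purely illustrative, not a proposed API.

```python
from typing import Callable

def group_predicate(group) -> Callable[[str], bool]:
    if group is None:
        return lambda path: True            # keep every group
    if isinstance(group, str):
        return lambda path: path == group   # single named group
    if callable(group):
        return group                        # user-supplied filter
    wanted = set(group)                     # iterable of group names
    return lambda path: path in wanted
```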
> open_datatree function for h5 files to see if it works

Remember, these are actually still netCDF4 files, just accessed using a different library: https://github.com/aladinor/xarray/blob/datatree-zarr/xarray/backends/h5netcdf_.py#L157-L164

I'm still looking to see if `open_datatree` can be simplified, trying to understand where all of the keywords came from. That should be a separate comment.
@keewis and @flamingbear, I think I solved the issue with the type hints. It was caused by my use of the `group[1:]` slicing notation. I refactored it and submitted a new commit.
xarray/backends/zarr.py
Outdated
        f"zarr version {zarr_version}. See also "
        "https://github.com/zarr-developers/zarr-specs/issues/136"
    )
gpaths = [str(group / i[1:]) for i in list(_iter_zarr_groups(zarr_group))]
No need to load the iterator into memory:

```diff
- gpaths = [str(group / i[1:]) for i in list(_iter_zarr_groups(zarr_group))]
+ gpaths = [str(group / i[1:]) for i in _iter_zarr_groups(zarr_group)]
```

Also, can we rename `i` to something else, like `name` or `child`? (Or something; I'm not sure those are good names either.)
@@ -431,11 +430,72 @@ def open_dataset(  # type: ignore[override]  # allow LSP violation, not supporti
    def open_datatree(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        mask_and_scale=True,
```diff
- mask_and_scale=True,
+ *,
+ mask_and_scale=True,
```

Separate positional from keyword args like you do in netCDF4_.py.
…g group variable typing hints (str | Iterable[str] | callable) under the open_datatree for h5 files. Finally, separating positional from keyword args
…ding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for netCDF files
…ding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for zarr files
drop_variables: str | Iterable[str] | None = None,
use_cftime=None,
decode_timedelta=None,
format=None,
```diff
+ mode="r",
  format=None,
```

I am still trying to figure out how you gathered these keyword args, but the h5netcdf store's `open` takes a `mode`.
filename_or_obj = _normalize_path(filename_or_obj)
store = H5NetCDFStore.open(
    filename_or_obj,
    format=format,
```diff
+ mode=mode,
  format=format,
```
open_datatree performance improvement on NetCDF files
whats-new.rst