
open_datatree performance improvement on NetCDF, H5, and Zarr files #9014

Open · wants to merge 28 commits into main
Conversation

aladinor

@aladinor aladinor commented May 7, 2024

open_datatree performance improvement on NetCDF files


welcome bot commented May 7, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label May 8, 2024
@TomNicholas TomNicholas added this to In progress in DataTree integration via automation May 8, 2024
@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label May 10, 2024
@aladinor changed the title from "open_datatree performance improvement on NetCDF files" to "open_datatree performance improvement on NetCDF and Zarr files" May 10, 2024
@@ -416,6 +415,104 @@ class ZarrStore(AbstractWritableDataStore):
"_close_store_on_close",
)

@classmethod
def open_store(
Member

Could you rewrite open_group to call open_store() internally? That would reduce the amount of duplicated code and make this easier to maintain going forward.
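A rough sketch of the shape this refactor could take, using toy classes rather than the real ZarrStore API (names here are illustrative only):

```python
# Hypothetical sketch of the suggested refactor: open_store() holds the
# (previously duplicated) setup logic once and returns a store per group,
# and open_group() just delegates to it. ToyZarrStore is a stand-in, not
# xarray's actual ZarrStore.

class ToyZarrStore:
    def __init__(self, group: str):
        self.group = group

    @classmethod
    def open_store(cls, groups: list[str]) -> dict[str, "ToyZarrStore"]:
        # Single place that performs the shared setup for every group.
        return {g: cls(group=g) for g in groups}

    @classmethod
    def open_group(cls, group: str) -> "ToyZarrStore":
        # Reuses open_store instead of repeating its body.
        return cls.open_store([group])[group]
```

The point of the pattern is that any future change to the setup logic lands in one method instead of two.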

Contributor

@flamingbear left a comment

I had thoughts about the legacyhdf5 api and how it might be incorporated.

@@ -16,7 +16,6 @@
BackendEntrypoint,
WritableCFDataStore,
_normalize_path,
_open_datatree_netcdf,
Contributor

Before this PR, _open_datatree_netcdf was used by both the netCDF4_.py and h5netcdf_.py backends. Would it be possible to move these changes back into the backends/common location and either remove or completely rewrite the _open_datatree_netcdf function?

Contributor

I think if you move these changes back to backends/common and leave them in _open_datatree_netcdf, you might be able to import the store for both the legacy HDF5 and netCDF4 libraries, and then call _open_datatree_netcdf for both h5netcdf and netCDF4.

Currently, _open_datatree_netcdf takes ncDataset: ncDataset | ncDatasetLegacyH5.

You might be able to include a new param with type cdfDataStore: NetCDF4DataStore | H5NetCDFStore and pass the appropriate one from both the h5netcdf_.py and netCDF4_.py backends.
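Sketched out with stand-in names (none of these are xarray's actual signatures), the suggestion might look like:

```python
# Hypothetical sketch of the generic-store idea: the common helper takes
# the store class (NetCDF4DataStore or H5NetCDFStore in the real code)
# as a parameter and builds one store per group from a shared manager.
# FakeStore and the helper name are illustrative only.

def _open_datatree_netcdf_common(store_cls, manager, groups, **kwargs):
    # The file/manager is opened once; each group gets its own store
    # object built from the same shared manager.
    return {g: store_cls(manager, group=g, **kwargs) for g in groups}

class FakeStore:
    def __init__(self, manager, group, **kwargs):
        self.manager = manager
        self.group = group

stores = _open_datatree_netcdf_common(FakeStore, "shared-manager", ["/", "/a"])
```

Either backend class would then plug into the same helper, keeping the tree-walking logic in one place.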

Author

Hi @flamingbear,

Following up on your comments, I would like to ask: does datatree support a to_h5 method? I was looking for its implementation or tests but didn't find them. Therefore, do we need to support the open_datatree method for H5 files even if we do not have a to_h5?

Please let me know your thoughts.

Contributor

@aladinor I think @flamingbear will have a much more detailed answer, so here are just some short thoughts in advance. Xarray is able to use netCDF4 and h5netcdf to read (almost) any HDF5 and netCDF4 files (a netCDF4 file is essentially an HDF5 file). netcdf4-python is able to read non-conforming files (as is h5netcdf). The same holds when writing: engine="netcdf4" and engine="h5netcdf" will create similar (netCDF4) files, but here the files will be standard-conforming (besides using engine h5netcdf with invalid_netcdf=True).

Contributor

Yes, Kai is correct. h5netcdf is just a legacy/alternative interface to netCDF4 files. Both libraries support a Dataset class with similar interfaces. The original datatree implementation just chose which Dataset class to use based on which library was available and used them the same way. Let me know if this helps or if you want more info.

Author

@aladinor May 29, 2024

Thanks, @kmuehlbauer and @flamingbear, for your feedback. I have one more question: can the proposed implementation for opening datatrees stored in NetCDF4 files also be used for H5 files?

The performance improvement relies on caching the store object for all nodes within the tree. However, we have a different store object for h5

https://github.com/pydata/xarray/blob/cd3ab8d5580eeb3639d38e1e884d2d9838ef6aa1/xarray/backends/h5netcdf_.py#L405C9-L415C10

As well as for netcdf4
https://github.com/pydata/xarray/blob/cd3ab8d5580eeb3639d38e1e884d2d9838ef6aa1/xarray/backends/netCDF4_.py#L646C8-L657C1.

Thus, should we create different implementations for each one or use one for both?

Contributor

Hi, I was just looking at this now. I was hoping that we could pass just the class NetCDF4DataStore or H5NetCDFStore to the common function and call GenericStore.open(), but it looks like the open functions take different keywords.
I was just checking to see if in NetCDF4_.open_datatree you can open the store, then pass it and the class to a common open_datatree:

```python
filename_or_obj = _normalize_path(filename_or_obj)
store = NetCDF4DataStore.open(...)
return _open_datatree_netcdf_common(NetCDF4DataStore, store, ...rest)
```

And in h5netcdf_.open_datatree:

```python
filename_or_obj = _normalize_path(filename_or_obj)
store = H5NetCDFStore.open(...)
return _open_datatree_netcdf_common(H5NetCDFStore, store, ...rest)
```

Though I haven't got it completed yet, or determined whether the different open() functions could be aligned with the same keywords.

Author

@aladinor May 29, 2024

Thanks, @flamingbear, for taking a look at it. I feel more confident about keeping them separate, as we must create a group_store when looping through each node:

```python
for group in _iter_nc_groups(store.ds):
    group_path = str(parent / group[1:])
    group_store = NetCDF4DataStore(manager, group=group_path, **kwargs)  # <--- here
    store_entrypoint = StoreBackendEntrypoint()
```

I created separate implementations to ensure all keywords are aligned with each store object.

Please let me know your thoughts.

Contributor

I'm trying not to harp on this too much, but that group_store you create has the same arguments whether it's a NetCDF4DataStore or an H5NetCDFStore, so it could be passed as a generic store. I'm still on the fence about whether this is easier to maintain because it's in one place, or whether it just adds complexity to try to reuse the code.

Contributor

I'll let someone else make this call.

The only diffs I found in the Store.open() keywords are:

- netcdf4: clobber, diskless, persist, autoclose
- h5netcdf: invalid_netcdf, phony_dims, decode_vlen_strings, driver, driver_kwds

But I'm a little naive about the named keywords vs **kwargs, and how I'd reconcile those.
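One hedged way to reconcile those keyword differences, sketched here with toy classes rather than xarray's real stores: name only the shared arguments in the common helper and forward the backend-specific ones through **kwargs untouched.

```python
# Illustrative sketch only: shared args (filename, mode) are named in
# the common helper; backend-specific ones (clobber/diskless/... vs
# invalid_netcdf/phony_dims/...) pass through **store_kwargs to
# whichever Store.open() is being called.

def open_store_common(store_cls, filename, mode="r", **store_kwargs):
    return store_cls.open(filename, mode=mode, **store_kwargs)

class ToyNetCDF4Store:
    @classmethod
    def open(cls, filename, mode="r", clobber=True, diskless=False):
        return ("netcdf4", filename, mode, clobber, diskless)

class ToyH5NetCDFStore:
    @classmethod
    def open(cls, filename, mode="r", invalid_netcdf=False, phony_dims=None):
        return ("h5netcdf", filename, mode, invalid_netcdf, phony_dims)

# Each call passes only the keywords its backend understands.
a = open_store_common(ToyNetCDF4Store, "f.nc", diskless=True)
b = open_store_common(ToyH5NetCDFStore, "f.nc", invalid_netcdf=True)
```

Whether this is worth it over two explicit implementations is exactly the maintainability trade-off discussed above.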

aladinor and others added 3 commits May 28, 2024 17:07
@aladinor aladinor requested a review from flamingbear May 29, 2024 01:34
aladinor

This comment was marked as outdated.

@aladinor changed the title from "open_datatree performance improvement on NetCDF and Zarr files" to "open_datatree performance improvement on NetCDF, H5, and Zarr files" May 29, 2024
drop_variables: str | Iterable[str] | None = None,
use_cftime=None,
decode_timedelta=None,
group=None,
Collaborator

Is there a reason why we have group here, or was that just copied over from open_dataset? If the latter, I think you could remove it, or we can define it as str | Iterable[str] | Callable if we think that would be useful. The callable would take paths and decide whether or not to include the group.

Author

Thanks, @keewis, for bringing this up. We added the group parameter in case someone decides to open just a specific group within the datatree (e.g. open_datatree(path2tree, group='group_0/subgroup_1')). I will define this as str | Iterable[str] | Callable.

Collaborator

if it turns out to be too difficult, we can also aim for group: str | None = None, and add Iterable[str] | Callable in a later PR (we would need additional code and tests for that new feature)

Author

I got a Mypy error. I am just rolling back to the original idea and then we can add a new PR in the future.

Collaborator

@keewis May 29, 2024

if I read the mypy output correctly, that was because the typing didn't include None, while the default was None. So group: str | None = None would work, but group: str = None would fail.

(but not sure how important typing is here)

Author

@keewis I will retry with the open_datatree function for h5 files to see if it works. I feel like I tested it using None, but I'm not sure. Let me see if I was making a mistake.

Author

I ran the following command locally to see what was happening:

```shell
python -m mypy --follow-imports=skip xarray/backends/h5netcdf_.py
```

and this is the output:

```
xarray/backends/h5netcdf_.py:477: error: Value of type "str | Iterable[str] | Callable[..., Any] | None" is not indexable  [index]
Found 2 errors in 1 file (checked 1 source file)
```

It seems like something is happening in this line:

```python
group_path = str(parent / group[1:])
```

I think it is because I am using group[1:] slicing. I need to concatenate the parent node with the subgroup. This subgroup is a string that starts with / (e.g. '/subgroup_1'), so if I use the whole string, it won't allow string concatenation. Any thoughts on how to handle this?
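One possible way around the mypy error (a sketch, not the PR's actual fix) is to narrow group to a plain str before touching it, and to strip the leading "/" explicitly rather than indexing; build_group_path is a hypothetical helper name:

```python
# Sketch: narrow the value to str first, then strip the leading "/"
# with removeprefix (Python 3.9+) instead of group[1:], which mypy
# rejects for the wider str | Iterable[str] | Callable | None type.
from __future__ import annotations

from pathlib import PurePosixPath

def build_group_path(parent: PurePosixPath, group: str | None) -> str:
    if group is None:
        group = "/"
    # After the None check, mypy knows group is a str, so stripping
    # and joining are both well-typed.
    return str(parent / group.removeprefix("/"))
```

The Iterable[str] and Callable variants would still need their own handling, which is consistent with restricting group to str | None in this PR.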

Collaborator

That makes sense, because for Iterable[str], for example, we need to do the same thing for multiple groups, and the callable would be applied to every existing group name. Since this would require additional code, let's restrict group to just str | None and figure out how exactly the iterable / callable should work in a separate PR.

Contributor

> open_datatree function for h5 files to see if it works

Remember, these are actually still netCDF4 files, just accessed with a different library: https://github.com/aladinor/xarray/blob/datatree-zarr/xarray/backends/h5netcdf_.py#L157-L164

I'm still looking to see if the open_datatree can be simplified, trying to understand where all of the keywords came from. That should be a separate comment.

Author

@keewis and @flamingbear, I think I solved the issue with type hints. It was caused by my using group[1:] slicing notation. I refactored it and submitted a new commit.

f"zarr version {zarr_version}. See also "
"https://github.com/zarr-developers/zarr-specs/issues/136"
)
gpaths = [str(group / i[1:]) for i in list(_iter_zarr_groups(zarr_group))]
Collaborator

no need to load the iterator into memory:

```diff
- gpaths = [str(group / i[1:]) for i in list(_iter_zarr_groups(zarr_group))]
+ gpaths = [str(group / i[1:]) for i in _iter_zarr_groups(zarr_group)]
```

Also, can we rename i to something else, like name or child? (or something, not sure those are good names either)
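A toy illustration of why the list() wrapper is redundant: the comprehension consumes the generator directly. Here _iter_groups_toy is a stand-in for _iter_zarr_groups, and the loop variable uses the suggested name instead of i:

```python
from pathlib import PurePosixPath

def _iter_groups_toy():
    # Stand-in for _iter_zarr_groups: yields "/name" strings lazily.
    yield "/a"
    yield "/b"

group = PurePosixPath("/root")
# The comprehension iterates the generator itself; list(...) would only
# materialize an intermediate list for no benefit.
gpaths = [str(group / name[1:]) for name in _iter_groups_toy()]
```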

@@ -431,11 +430,72 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
def open_datatree(
self,
filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
mask_and_scale=True,
Contributor

```diff
- mask_and_scale=True,
+ *,
+ mask_and_scale=True,
```

separate positional from keyword args like you do in netCDF4_.py

…g group variable typing hints (str | Iterable[str] | callable) under the open_datatree for h5 files. Finally, separating positional from keyword args
…ding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for netCDF files
…ding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for zarr files
drop_variables: str | Iterable[str] | None = None,
use_cftime=None,
decode_timedelta=None,
format=None,
Contributor

```diff
- format=None,
+ mode="r",
+ format=None,
```

I am still trying to figure out how you gathered these keyword args, but the h5netcdf store's open takes a mode.

filename_or_obj = _normalize_path(filename_or_obj)
store = H5NetCDFStore.open(
filename_or_obj,
format=format,
Contributor

```diff
- format=format,
+ mode=mode,
+ format=format,
```

Labels: io, run-benchmark, topic-backends, topic-DataTree, topic-performance

Successfully merging this pull request may close these issues.

Improving performance of open_datatree
7 participants