Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAS-2067 - Migrate datatree io.py and common.py #9011

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

owenlittlejohns
Copy link
Contributor

This PR migrates what I believe is the last actual source code out of the xarray/datatree_/datatree directory into xarray/core. This is another piece of #8572.

I made a couple of decisions that I'd love some feedback/extra eyes on:

  • I moved the remaining functions from xarray/datatree_/datatree/io.py into their own module, rather than xarray/backends/api.py. It felt like these were a little separate, but I can definitely be convinced otherwise.
  • I created a subclass for TreeAttrAccessMixin. I really just wanted to swap over to using AttrAccessMixin, but I ran into issues with __slots__. Looking at it, I think I would need to work on the DataTree, TreeNode and NamedNode classes. I can go down that path, but thought I'd put up this PR for discussion before beginning that effort.
  • Continues Track merging datatree into xarray #8572 (Contributed to last subitem of "Merge in code for DataTree class.")
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

xarray/core/types.py Show resolved Hide resolved
xarray/core/datatree_io.py Show resolved Hide resolved
@@ -347,6 +347,34 @@ def _ipython_key_completions_(self) -> list[str]:
return list(items)


class TreeAttrAccessMixin(AttrAccessMixin):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here's the biggest question I have... is this okay for now? (Maybe this is directed at @TomNicholas)

The use of __slots__ is a little new to me (but makes sense from reading up on it). I think to adopt __slots__ and remove __dict__ I would need to slightly rework DataTree, TreeNode and NamedNode. I wanted to check that the scope mainly seems to be self.children and self.parent attributes, and understand why those classes were implemented as they currently are before trying to alter things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to embarrass me, but why did you stop calling init_subclass on the Parent object? everything else with this merge seems reasonable to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I accidentally commented out that last line. Will fix!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, my last comment was true, but... uncommenting out that line means that the same method in the parent class (AttrAccessMixin) gets called by super in addition to this method, right? That would mean we end up calling the validation logic that this child method is trying to avoid for now (checking that __dict__ is not present on the class).

Now if only I'd said that the first time around 😉

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label May 7, 2024
@TomNicholas TomNicholas added this to In progress in DataTree integration via automation May 7, 2024
Copy link
Contributor

@flamingbear flamingbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just comments for now

xarray/core/types.py Show resolved Hide resolved
@@ -347,6 +347,34 @@ def _ipython_key_completions_(self) -> list[str]:
return list(items)


class TreeAttrAccessMixin(AttrAccessMixin):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to embarrass me, but why did you stop calling init_subclass on the Parent object? everything else with this merge seems reasonable to me.

xarray/core/datatree_io.py Show resolved Hide resolved
if kwargs.get("format", None) not in [None, "NETCDF4"]:
raise ValueError("to_netcdf only supports the NETCDF4 format")

engine = kwargs.get("engine", None)
if engine not in [None, "netcdf4", "h5netcdf"]:
if engine not in [None, *get_args(T_DataTreeNetcdfEngine)]:
Copy link
Contributor

@Illviljan Illviljan May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No stopper but I'm not sure I like the kwargs.get-style here, engine wont be typed and might hide typing issues downstream. Is engine correctly narrowed from Any after this if-case?

If we want good typing coverage it's better to prefer keyword-only arguments (with default values) rather than using kwargs.get.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting - yeah. I took a quick look at to_netcdf in xarray/backends/api.py, and it looks like making these explicitly named with defaults would be consistent with that, too.

Looks like the potential list of arguments is: format, engine, group and compute (with group and compute for _datatree_to_zarr). These changes would also presumably require similar updates to DataTree.to_netcdf and DataTree.to_zarr(https://github.com/pydata/xarray/blob/main/xarray/core/datatree.py#L1516).

Just to check @TomNicholas - am I missing a reason for keeping the kwargs.get implementation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see a valid reason not to replace them, but I'm not 100% familiar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of making them consistent with the other to_netcdf functions in xarray/backends/api.py. I'll make this change and push it up as I get chance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check @TomNicholas - am I missing a reason for keeping the kwargs.get implementation?

Not that I can think of.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made these changes in this commit

@@ -347,6 +347,34 @@ def _ipython_key_completions_(self) -> list[str]:
return list(items)


class TreeAttrAccessMixin(AttrAccessMixin):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this class necessary? Can't DataTree inherit directly from AttrAccessMixin? And override _attr_sources and so on?

I think the reason this class existed separately was just to do with hacking around limitations of datatree being in a separate repository...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially tried to use AttrAccessMixin. Just dropping it in as a replacement for TreeAttrAccessMixin doesn't work because of the checking done inside the AttrAccessMixin.__init_subclass__ method. The check for if __dict__ exists fails. DataTree, TreeNode and NamedNode do all define __slots__, but I think there are attributes being defined in places like DataTree.__init__, for example DataTree.parent and DataTree.children (versus the DataTree._parent and DataTree._children attributes defined in DataTree.__slots__).

I think we talked about this a little bit in one of our calls - things like DataTree.parent and DataTree.children seemed like tricky things to change over.

if kwargs.get("format", None) not in [None, "NETCDF4"]:
raise ValueError("to_netcdf only supports the NETCDF4 format")

engine = kwargs.get("engine", None)
if engine not in [None, "netcdf4", "h5netcdf"]:
if engine not in [None, *get_args(T_DataTreeNetcdfEngine)]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check @TomNicholas - am I missing a reason for keeping the kwargs.get implementation?

Not that I can think of.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
Development

Successfully merging this pull request may close these issues.

None yet

5 participants