Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement the PintMetaIndex #163

Open
wants to merge 31 commits into
base: main
Choose a base branch
from

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Mar 26, 2022

As mentioned in #162, it is possible to get the indexing functions to work, although there still is no public API.

I also still don't quite understand how other methods work since the refactor, so this only implements sel.

Usage, for anyone who wants to play around with it
import xarray as xr
from pint_xarray.index import PintMetaIndex

ds = xr.tutorial.open_dataset("air_temperature")
arr = ds.air

new_arr = xr.DataArray(
    arr.variable,
    coords={
        "lat": arr.lat.variable,
        "lon": arr.lon.variable,
        "time": arr.time.variable,
    },
    indexes={
        "lat": PintMetaIndex(arr.xindexes["lat"], {"lat": arr.lat.attrs.get("units")}),
        "lon": PintMetaIndex(arr.xindexes["lon"], {"lon": arr.lon.attrs.get("units")}),
        "time": arr.xindexes["time"],
    },
    fastpath=True,
)
new_arr.sel(
    lat=ureg.Quantity([75, 70, 65], "deg"),
    lon=ureg.Quantity([200, 202.5], "deg"),
)

This will fail at the moment because xarray treats dask arrays differently from duck-dask arrays, but passing single values works!



class PintMetaIndex(Index):
# TODO: inherit from MetaIndex once that exists
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm actually I'm not sure how a MetaIndex class would look like. So far we used the generic term "meta-index" to refer to indexes that would wrap one or several indexes, but I don't know if there will be a need to provide a generic class for that.

Copy link
Collaborator Author

@keewis keewis Mar 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it doesn't really look like we actually need a base class for that, but I noticed that a few methods don't make sense for meta-indexes, from_variables for example. It's probably fine to use the default for those, though.

@benbovy
Copy link
Member

benbovy commented Mar 28, 2022

I also still don't quite understand how other methods work since the refactor, so this only implements sel.

Here are a few comments. Happy to answer questions if any.

There are some Index methods of like isel, stack, rename that I guess do not depend on units and where you could just forward the call + args to the wrapped index.

For some other methods like concat, join, equals, reindex, the easiest would probably to require that the other indexes given as arguments must have the same units (and raise an error otherwise). An alternative option would be to implicitly convert the units of those indexes, but it's difficult to do that in an index-agnostic way unless we define and adopt in Xarray some protocol for that purpose.

The general approach used in the Xarray indexes refactor heavily relies on the type of the indexes (at least when we need to compare them together). That's not super flexible with the PintMetaIndex solution here, but I think it's reasonable enough. For example, alignment will only work when the indexes found for a common set of coordinates / dimensions are all PintMetaIndex objects. This might behave weirdly when the PintMetaIndex objects do not wrap indexes of the same type, although this would be easy to check.

I wonder whether whether PintMetaIndex should accept one unit or a map of units. The latter makes sense if, e.g., you want to reuse it to wrap multi-indexes where the corresponding coordinate variables (levels) may have different units.

Regarding Index methods like from_variables and create_variables, I guess PintMetaIndex would implement a lightweight wrapping layer to respectively get and assign a pint quantity from / to the Xarray variables.

You should also be careful when converting the units of indexed coordinates as it may get out of sync with their index. As there's no concept of "duck" index, the easiest would probably be to drop the index (and maybe reconstruct it from scratch) when the coordinates are updated.

@benbovy
Copy link
Member

benbovy commented Aug 29, 2023

@keewis I have been looking into this once again and now I think I better understand what you'd like to achieve with the PintMetaIndex wrapper. A few other suggestions:

Wrap index coordinate variables as unit-aware variables

I'm not familiar with pint, but if a pint.Quantity works with any duck-array without coercing it into a specific type, I think that something like this may work:

class PintMetaIndex:

    def create_variables(self, variables=None):
        index_vars = self.index.create_variables(variables)

        index_vars_units = {}
        for name, var in index_vars.items():
            data = array_attach_unit(var.data, self.units[name])
            var_units = xr.Variable(var.dims, data, attrs=var.attrs, encoding=var.encoding)
            index_vars_units[name] = var_units
        
        return index_var_units

We cannot use IndexVariable since (if I remember well) it coerces the data as a pd.Index. I think it is fine to stick with Variable for now, it will work in most cases except for, e.g., xr.testing.assert_identical() that maybe checks for the variable type (although maybe check_default_indexes=False may disable it). We'll need to change that in Xarray anyway.

Set new Pint index(es)

Since PintMetaIndex is meant to wrap any existing index, it could be done via the accessors while quantifying the variables. Something like this might work:

@register_dataset_accessor("pint")
class PintDatasetAccessor:

    def quantify(self, units=_default, unit_registry=None, **unit_kwargs)):
        ...

        ds_xindexes = self.ds.xindexes
        new_indexes, new_index_vars = ds_xindexes.copy_indexes()

        for idx, idx_vars in ds_xindexes.group_by_index():
            idx_units = {k: v for k, v in units.items() if k in idx_vars}
            new_idx = PintMetaIndex(idx, idx_units)
            new_indexes.update({k: new_idx for k in idx_vars})
            new_index_vars.update(new_idx.create_variables(idx_vars))

        new_coords = xr.Coordinates(new_index_vars, new_indexes)

        # needs https://github.com/pydata/xarray/pull/8094 to work properly
        ds_updated_temp = self.ds.assign_coords(new_coords)

        ...

It is still useful to implement PintMetaIndex.from_variables() with the common PandasIndex case so that it also works with Dataset.set_xindex:

class PintMetaIndex:
    
    @classmethod
    def from_variables(cls, variables, options):
        index = xr.indexes.PandasIndex.from_variables(variables)
        units_dict = {index.index.name: options.get("units")}
        return = cls(index, units_dict)


ds = xr.Dataset(coords={"x": [1, 2]})

ds_units = ds.drop_indexes("x").set_xindex("x", PintMetaIndex, units="m")

Data selection

Beware that Index.sel() may return new index objects along with the dimension integer indexers. the PintMetaIndex wrapper will need to wrap them before returning the query results.

@benbovy
Copy link
Member

benbovy commented Aug 29, 2023

nit: I would rename PintMetaIndex to just PintIndex.

@benbovy
Copy link
Member

benbovy commented Aug 29, 2023

Further comments:

Implementing Dataset.pint.dequantify() would look very much the same than in Dataset.pint.quantify() except that it would extract the wrapped index: new_idx = self.index.

For Dataset.pint.to(), I think it is possible to convert the variables first, then re-create the underlying index with converted_idx = type(self.index).from_variables(converted_vars) and then wrap the latter with PintMetaIndex(converted_idx, new_units). I don't think it is always a good idea, though, as re-building the underlying index from scratch may be expensive and may load lazy coordinate data.

@keewis
Copy link
Collaborator Author

keewis commented Sep 13, 2023

@benbovy, with a few tweaks to your suggestions this:

In [1]: import xarray as xr
   ...: import pint_xarray
   ...: 
   ...: ureg = pint_xarray.unit_registry
   ...: ds = xr.tutorial.open_dataset("air_temperature")
   ...: q = ds.pint.quantify({"lat": "degrees", "lon": "degrees"})
   ...: q.sel(lat=ureg.Quantity(75, "deg").to("rad"))
.../xarray/core/indexes.py:473: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  index = pd.Index(np.asarray(array), **kwargs)
Out[1]: 
<xarray.Dataset>
Dimensions:  (time: 2920, lon: 53)
Coordinates:
    lat      float32 [deg] 75.0
  * lon      (lon) float32 [deg] 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lon) float32 [K] 241.2 242.5 243.5 ... 241.48999 241.79
Indexes:
    lon      PintMetaIndex
    time     PintMetaIndex
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

does work 🎉

The only hickup is that somehow we seem to call safe_cast_to_index, which gives us the warning above. This doesn't seem to affect the functionality of the index, though (at least for sel, which is the only thing I tested).

@keewis
Copy link
Collaborator Author

keewis commented Sep 13, 2023

(the failing tests are expected, I will have to update some of the workaround code)

Copy link
Member

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keewis Excellent!

Hmm the warning is weird. safe_cast_to_index is called when a new PandasIndex object is created but there no reason to create one in your example.

Small suggestion: you could implement PintMetaIndex._repr_inline_ so that it displays the type of the wrapped index (e.g., PintMetaIndex(PandasIndex))

pint_xarray/conversion.py Outdated Show resolved Hide resolved
@benbovy
Copy link
Member

benbovy commented Sep 19, 2023

Regarding all the errors in CI TypeError: <pint_xarray.index.PintMetaIndex object at ...> cannot be cast to a pandas.Index object, an easy fix would be to implement PintMetaIndex.to_pandas_index() and just call that method for the wrapped index. This could be done for most of the rest of the Index API I think.

@keewis
Copy link
Collaborator Author

keewis commented Sep 19, 2023

would that just delegate to the underlying index, or also wrap it (probably the former, but I wanted to make sure)?

In any case, I wonder if we should just fix diff_* to not make use of obj.indexes? That way, we could avoid the conversion.

@benbovy
Copy link
Member

benbovy commented Sep 19, 2023

would that just delegate to the underlying index, or also wrap it (probably the former, but I wanted to make sure)?

In the case of .to_pandas_index() just delegate is enough. For other methods (.rename, .roll, .isel, etc.) wrapping the result will also be needed. For .sel it might also be needed to check and wrap any index returned within the IndexSelResults.

In any case, I wonder if we should just fix diff_* to not make use of obj.indexes? That way, we could avoid the conversion.

Yes definitely, I think I fixed it in one of my open PRs in the Xarray repo.

@keewis
Copy link
Collaborator Author

keewis commented Dec 9, 2023

Isn't it addressed by pint.Quantity.to_base_unit?

No, because to_base_unit will convert the index index's data to the base units (i.e. combinations of the base units m, s, kg, ...). Instead, the idea is to require exactly matching units for now, which can be relaxed at a later point.

@keewis
Copy link
Collaborator Author

keewis commented Dec 10, 2023

@benbovy, something I just noticed: where do we need PintIndex.from_variables? As far as I understand it, this is used for set_xindex, but since PintIndex is a meta index (or rather, a index that wraps other indexes) it might not really make sense to pass it to set_xindex... Would it be justified to not implement it, even though the base class says otherwise ("Every subclass must at least implement Index.from_variables")?

Other than that, I'm not sure what to do about stack / unstack: we're potentially stacking variables of different unit dimensions, so stack might not make too much sense. Maybe we should automatically dequantify the stacked indexes to at least preserve the units? unstack might work if we're wrapping a MultiIndex, but I'd be leaning towards not supporting it. That would leave us with concat, join, and reindex_like, which I'd skip for now.

I'll have a look at what needs to be changed to get the tests to work, although this would probably require changes to xarray (the diff_* functions still make use of .indexes that call the index's .to_pandas_index, but I don't want to implement .to_pandas_index here to avoid unintentionally losing a unit).

@keewis keewis linked an issue Dec 10, 2023 that may be closed by this pull request
@keewis
Copy link
Collaborator Author

keewis commented Dec 10, 2023

we also need to figure out how to handle set_xindex on top of a quantified dataset (so a PintIndex would be overwritten – if that makes sense?)

@benbovy
Copy link
Member

benbovy commented Dec 11, 2023

I think that in general it would be nice if we can keep a simple and predictable behavior for .set_xindex, i.e., we can use it with any Xarray Index leaf class and also it doesn't make any assumption about the index being overwritten (I think that currently this is not even possible to overwrite an existing index and .drop_indexes must be called first).

I wouldn't make a special case for PintIndex or any other "meta-index". Although it is likely that in most cases pint indexes won't be set directly via .set_xindex, it would make sense to me having PintIndex.from_variables return a pint index wrapping a PandasIndex (or an object of the default index type used by Xarray).

Other than that, I'm not sure what to do about stack / unstack: we're potentially stacking variables of different unit dimensions, so stack might not make too much sense.

Hmm stack / unstack only act on the dimensions so in theory it shouldn't invalidate units? Since the pint_index.units attribute is a mapping of coordinate name to unit, couldn't PintIndex already support multiple units? The core stack / unstack logic is executed by the wrapped index and PintIndex may be agnostic of the wrapped index type.

In the case of a wrapped PandasMultiIndex, the created dimension coordinate (tuple values) may not have a single unit. Perhaps best is to keep it unitless. Eventually that coordinate will be deprecated anyway.

That said, I think it's fine to leave stack / unstack unimplemented for now and implement them in follow-up PRs.

I don't want to implement .to_pandas_index here to avoid unintentionally losing a unit

Hmm that sounds a bit too restrictive to me. How about issuing a user warning instead? In the long-term we can probably get rid of .to_pandas_index when the index refactor will be complete Xarray-wise (this method is mostly used internally for features that still rely on pandas.Index).

@keewis
Copy link
Collaborator Author

keewis commented Dec 11, 2023

(or an object of the default index type used by Xarray)

is there a way to access this default index type? Otherwise it's probably fine to keep hard-coding PandasIndex for now and fail if that doesn't work.

(I think that currently this is not even possible to overwrite an existing index and .drop_indexes must be called first)

Makes sense, though in that case to be extra friendly I wonder if we can customize the error message? Instead of recommending .drop_indexes I'd like to recommend .pint.dequantify (though it's probably fine if we keep that for later or just rely on the docs)

How about issuing a user warning instead?

This is what pint does when cast to a numpy array, but I'm not really happy with that, warnings are just too easy to ignore.

I'll see how far I can get without .to_pandas_index, and if I encounter anything that truly wants to cast to a PandasIndex I may have to reconsider.

couldn't PintIndex already support multiple units?

That might be true, I might have been too tired yesterday when writing this. I'll still postpone the implementation to a later PR, though.

@benbovy
Copy link
Member

benbovy commented Dec 11, 2023

is there a way to access this default index type?

Not yet (I was thinking too much ahead). For now it is PandasIndex but maybe later we could add some option in Xarray to get / set the default index type (e.g., for cases where PandasIndex is too expensive and/or not really needed).

I'll see how far I can get without .to_pandas_index

You might want to see where it is called in Xarray internals. For example, if you leave PintIndex,to_pandas_index() unimplemented, the ds.indexes property will raise an error if there is any pint index in the dataset.

I wonder if we can customize the error message?

This would require some API entrypoint in Xarray, but I'm not sure it is really worth it since here you might recommend setting pint indexes "automatically" via .pint.quantify instead of manually via .set_xindex (but keep support the latter for consistency and advanced use cases).

@keewis
Copy link
Collaborator Author

keewis commented Dec 21, 2023

now that I have more time to work on this, I think the only things left to fix right now is:

  • figure out what to do with assert_identical. The point of using assert_identical was to make sure the units on dimension coordinates survived, but now that they are on indexes we can maybe just remove that check / use assert_equal after applying strip_units to actual
  • figure out how to get rid of the UnitsStrippedWarning emitted by .pint.to, which happens because Dataset / Coordinates tries to create default indexes on quantified variables when assembling the converted variables back into a dataset in dataset_from_variables. To resolve this, I think we need to somehow figure out how to recreate the underlying index from the converted variables.

The remaining methods I think we can leave to separate PRs: concat, stack, unstack, join, reindex_like.

@benbovy
Copy link
Member

benbovy commented Dec 21, 2023

figure out how to get rid of the UnitsStrippedWarning emitted by .pint.to, which happens because Dataset / Coordinates tries to create default indexes on quantified variables when assembling the converted variables back into a dataset in dataset_from_variables.

For this one it would be nice if dataset_from_variables could be refactored to allow passing a Coordinates object (coordinate variables + indexes) that we can then pass the the Dataset constructor (maybe after updating it) so that it will skip the creation of new default indexes. The only thing is that I don't remember if Coordinates was already part of Xarray's public API before this new behavior was implemented (you might need to temporarily rely on importing from xarray.core).

@keewis
Copy link
Collaborator Author

keewis commented Dec 21, 2023

we're already using some part of the private API of xarray (Coordinates._construct_direct), so I would be fine to use Coordinates either way. However, from my experiments it appears just passing the variables to the constructor of that class still creates the default indexes. Do you have any hints on how to avoid that? Is that Coordinates._construct_direct again?

Edit: oh, wait, maybe that's what you've been saying? I'll check.

@benbovy
Copy link
Member

benbovy commented Dec 21, 2023

You'd need to pass Coordinates(variables={...}, indexes={}) to skip the creation of default indexes, or Coordinates._contruct_direct should work too (see pydata/xarray#8107).

@keewis
Copy link
Collaborator Author

keewis commented Dec 21, 2023

Coordinates._construct_direct works, but Coordinates(coords=coords, indexes={}) still tries to create default indexes. Not sure why.

Edit: It appears as_variable converts to an index variable.

@benbovy
Copy link
Member

benbovy commented Dec 21, 2023

@keewis
Copy link
Collaborator Author

keewis commented Dec 22, 2023

a couple more notes:

  • the diff called by assert_identical appears to "work" just by changing a.indexes to a.xindexes (it doesn't raise, at least). However, Dataset.equals and Dataset.identical don't compare the indexes, yet, and the diff doesn't display differing indexes, either.
  • looking at dataset_from_variables, I think we can make that a lot simpler by passing along the original dataset, the new variables and the new indexes (if any)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for set_xindex? Wrong units when using da.integrate() PintMetaIndex
3 participants