Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve how we store metadata #2702

Open
sfinkens opened this issue Dec 20, 2023 · 4 comments
Open

Improve how we store metadata #2702

sfinkens opened this issue Dec 20, 2023 · 4 comments
Labels
enhancement code enhancements, features, improvements

Comments

@sfinkens
Copy link
Member

Feature Request

Is your feature request related to a problem? Please describe.

We repeatedly face encoding/decoding problems in the CF writer, so I have been thinking a bit about how we store metadata and possible alternatives.

Pros

Currently we store metadata in dataset attributes. This is very convenient, because we can just attach any Python type (string, array, dict etc) to a dataset.

Cons

Over time attributes have become more complex: From a simple sensor name to orbital parameters, time parameters, calibration coefficients etc. Now we're experiencing some problems with that approach

  • There's no way to attach dimensions to attributes, for example time-dependent attributes . So these attributes are dropped when concatenating datasets along the time dimension.
  • There's no (xarray-native) way to attach units to attributes
  • Encoding/decoding is complicated, especially when we want to achieve a smooth roundtrip. For example: Encode dataset attribute to netCDF attribute, then decode back to original Python type. A significant part of the CF reader/writer is about attribute encoding/decoding.

Describe the solution you'd like

I think many attributes could be stored as Data Arrays or Datasets. Then we could leverage the full potential of the xarray data model combined with CF conventions. We would still be using datasets attributes, but mainly strings (such as units, calendar etc).

When we are moving towards data trees (#2605), we could organize them in a "metadata" branch (either global or dataset-specific). For example:

DataTree('None', parent=None)
├── DataTree('ch1')
│   ├── DataTree('data')
│   │       Dimensions:  (time: 1, y: 2, x: 2)
│   │       Data variables:
│   │           ch1      (time, y, x) int64 1 2 3 4
│   └── DataTree('metadata')
│       └── DataTree('calibration_coefficients')
│               Dimensions:  (time: 1)
│               Data variables:
│                   gain     (time) float64 0.1
|                       Attributes:
|                           units:    W/m2
│                   offset   (time) int64 -51
|                       Attributes:
|                           units:    1
└── DataTree('metadata')
    ├── DataTree('orbital_parameters')
    │       Dimensions:                     (time: 1)
    │       Dimensions without coordinates: time
    │       Data variables:
    │           projection_longitude        (time) int64 0
    |               Attributes:
    |                   units:    degrees east
    │           satellite_actual_longitude  (time) float64 -0.01
    |               Attributes:
    |                   units:    degrees east
    └── DataTree('time_parameters')
            Dimensions:                 (time: 1)
            Dimensions without coordinates: time
            Data variables:
                nominal_start_time      (time) int64 0
                    Attributes:
                        units:    nanoseconds since 1970-01-01
                observation_start_time  (time) int64 1
                    Attributes:
                        units:    nanoseconds since 1970-01-01

Describe any changes to existing user workflow
Users would have to adapt to the new data structure.

Additional context
For units, we could use pint already.

@sfinkens sfinkens added the enhancement code enhancements, features, improvements label Dec 20, 2023
@djhoese
Copy link
Member

djhoese commented Dec 20, 2023

As an additional option for some of these fields, I wonder if we could store them as coordinate variables?

I'm not a huge fan of the "metadata" DataTree, but I also really don't have a good alternative except for maybe coordinate variables. Just so we have the whole picture, besides units can you think of other attributes that these metadata "things" need to keep track of?

In some sense this feels like a failing of the NetCDF model. Where we might want units or other information for attributes we can't do it. I feel like most NetCDF creators have leaned towards external documentation that describes the units of attributes. Not that I want to necessarily do that, but could that simplify this if we "forced" that all distances are in meters, all times are epoch nanoseconds, etc?

@sfinkens
Copy link
Member Author

sfinkens commented Dec 20, 2023

Indeed, coordinate variables would work, nice idea!

<xarray.Dataset>
Dimensions:               (time: 1, y: 2, x: 2, tle_dim: 2)
Coordinates:
  * time                  (time) int64 0
  * y                     (y) int64 0 1
  * x                     (x) int64 0 1
    projection_longitude  (time) int64 0
        Attributes:
            units:    degrees east
    nominal_start_time    (time) int64 0
        Attributes:
            units:    nanoseconds since 1970-01-01
    two_line_elements     (time, tle_dim) int64 1 2
Data variables:
    ch1                   (time, y, x) int64 1 2 3 4

Just so we have the whole picture, besides units can you think of other attributes that these metadata "things" need to keep track of?

Looking at the Metadata section in the documentation, there are also attributes for area definition and raw metadata. The area def can be stored as coordinate variable, too. And raw metadata is probably too special...

Not that I want to necessarily do that, but could that simplify this if we "forced" that all distances are in meters, all times are epoch nanoseconds, etc?

This would certainly help with units, but not with dimensions or encoding.

@mraspaud
Copy link
Member

mraspaud commented Jan 9, 2024

I think it's a good idea to improve the metadata storage indeed.

Coordinate variables are good for some things. For example, in rioxarray, projection information is store in a spatial_ref coordinate variable, which I think is interesting. As we have it now, its info is equivalent to what we have in our area attribute, but we could use the same idea for other info, eg we could have an orbital_ref for everything related to orbital information.

But then comes the question of what really belongs as a coordinate variable. I see coordinate variables as variables that are related in one way or another to dimension/coordinates, hence eg calibration coefficients would not really fit there.

In the first example you provide, the gain and offset are aligned to the time dimension, is this intentional, or should they be dimensionless?

@sfinkens
Copy link
Member Author

calibration coefficients would not really fit there

Good point.

In the first example you provide, the gain and offset are aligned to the time dimension, is this intentional, or should they be dimensionless?

I wanted to indicate that coefficients can be time-dependent, but it isn't strictly needed. xr.concat(datasets, dim="time") also produces a timeseries for scalars.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement code enhancements, features, improvements
Projects
None yet
Development

No branches or pull requests

3 participants