Zarr serialisation #48

martinfleis · 2023-07-03T18:27:42Z

Splitting this from #26 for a more focused discussion.

During a discussion with @rabernat earlier today, we talked about the possibility to serialise vector data cubes into a Zarr format, so I want to share a few notes and links.

Zarr is flexible enough to support this in many ways, so a few that has been discussed so far:

including a GeoParquet within Zarr
Using array-based of geometries serialisation of geometry that either mirrors or is as close as possible to the GeoArrow spec. Apart from the existing spec using array storing the interleaved coordinate values ([x, y, x, y, ...], pay also attention to Allow storing points/coordinates as struct arrays in addition to interleaved layout geoarrow/geoarrow#26.
Using serialisation of geometries following CF conventions. That, for points, implements cf_xarray. The spec lives here and is not that complex but it is a ragged array that may not be optimal.

It may be worth ensuring whatever we prototype is done in sync with the GeoZarr efforts.

The text was updated successfully, but these errors were encountered:

edzer · 2023-07-03T18:46:53Z

I think your first proposal makes it harder to read & write Zarr, as the de/encoder needs to read Zarr AND GeoParquet; we had this discussion before when the idea came up to use WKB. I don't think the GDAL Multidimensional Array API can do that, can xarray do it?

Given the geoarrow proposal you point to, there seem to be a strong reasons not to interleave ([x y x y ... ]). I think that when you would use separate arrays for x and y, your second and third proposal will be nearly identical: I think geoarrow lists the offset indices of each geometry, CF their lengths.

The CF convention arrays are not ragged but contiguous in memory/on disk: only when you're cutting them into the polygons or linestrings they (obviously) become ragged of some sort. CF seems the most natural to me, because those who write Zarr are aware of that spec.

rabernat · 2023-07-03T19:19:53Z

A good way to approach this would be to mirror Xarray's existing encoding / decoding mechanism. The vector indexes could exist in two states:

encoded: the raw x, y, offset arrays, plus associated metadata--these are just normal array, serializable to zarr, netCDF, whatever
decoded: the geometries are translated to xvec indexes

It would be nice if xarray allowed plugins to define custom coders. Until that happens, the coders could be implemented following the existing coding API and called manually from xvec, e.g.

ds = xr.open_dataset(netcdf_or_zarr_file)
ds = ds.xvec.decode() # -> create xvec GeometryIndex from x, y, offset arrays
ds = ds.xvec.encode() # -> back to raw form
ds.to_zarr(new_file)

edzer · 2023-07-03T19:26:42Z

Yes we can but should we? I wonder what the motivation is for drawing in non-native en/decoders: the payload is really the data cube cells, the vector geometries are only associated with (typically) a single dimension, meaning orders of magnitude smaller data than the array. Is this bikeshedding?

rabernat · 2023-07-03T19:31:12Z

I'm not sure what you mean by "non native". Could you elaborate? I'm proposing to mirror xarray's existing CF encoding / decoding mechanisms.

edzer · 2023-07-04T07:41:27Z

Maybe I'm overthinking things.

I'm proposing to mirror xarray's existing CF encoding / decoding mechanisms.

Then we're on the same page!

martinfleis · 2023-07-04T08:16:37Z

Agree with both of you.

the de/encoder needs to read Zarr AND GeoParquet

Yes, we just mentioned that it is theoretically possible but went quickly over to GeoArrow and CF conventions as a preferable solution.

when you would use separate arrays for x and y, your second and third proposal will be nearly identical

This would also be my preference. The NetCDF serialisation to match {stars} is still on the roadmap in parallel to Zarr and having both using either the same or very close representation of geometries would allow us to reuse big parts of the machinery for one in the other.

A good way to approach this would be to mirror Xarray's existing encoding / decoding mechanism.

+1

jsignell · 2023-07-15T23:29:55Z

I was taking a look at this at SciPy sprints today and tried to extend the shapely_to_cf function in cf-xarray. I ran into a bit of trouble with handling multilines (I'll try to put up a PR tomorrow).

I was eventually pointed to https://github.com/twhiteaker/CFGeom! I'm not sure what the maintenance situation is on that, but it could either be revived or used as a reference of how to convert back and forth between cf and shapely. And they already have a to_netcdf 🚀

martinfleis · 2023-07-16T06:15:04Z

We could probably use it to ensure that the CF representation of geometries is according to spec but due to its age, the implementation itself will likely be very different now that we have shapely 2. I assume that their code may still work but it will not be as efficient as we can be these days.

We could also generate reference NetCDF files using {stars} if that is useful. And ensure it is equal to one CFGeom generates.

jsignell · 2023-07-17T12:44:32Z

Ok yeah that makes sense. I'll open a PR with what I've got. Not sure when I'll get a chance to come back to it.

martinfleis · 2023-09-12T15:57:25Z

I have tried to go the simple path and encode geometries using geo-arrow-like encoding to store in Zarr and it seems to work nicely (plus it was approved by @MSanKeys963). Below is the draft of encode/decode functions, while Zarr IO on the encoded Dataset works flawlessly.

def decode(ds):
    """Decode Dataset with Xvec indexes into GeoArrow arrays

    Parameters
    ----------
    ds : xarray.Dataset
        Dataset with Xvec-backed coordinates

    Returns
    -------
    xarray.Dataset
        Dataset with geometries represented as GeoArrow-like arrays
    """
    decoded = ds.copy()
    decoded = decoded.assign_attrs(geometry={})
    for geom_coord in ds.xvec.geom_coords:
        geom_type, coords, offsets = shapely.to_ragged_array(ds[geom_coord])
        decoded.attrs["geometry"][geom_coord] = {
            "geometry_type": geom_type.name,
            "crs": ds[geom_coord].crs.to_json_dict(),
        }
        to_assign = {
            geom_coord: numpy.arange(ds[geom_coord].shape[0]),
            f"{geom_coord}_coords_x": coords[:, 0],
            f"{geom_coord}_coords_y": coords[:, 1],
        }
        if coords.shape[1] == 3:
            to_assign[f"{geom_coord}_coords_z"] = coords[:, 2]
        for i, offsets_arr in enumerate(offsets):
            to_assign[f"{geom_coord}_offsets_{i}"] = offsets_arr

    return decoded.assign(to_assign)

def encode(ds):
    """Encode Dataset indexed by geometries represented as GeoArrow arrays into Xvec indexes 

    Parameters
    ----------
    ds : xarray.Dataset
        Dataset with geometries represented as GeoArrow-like arrays

    Returns
    -------
    xarray.Dataset
        Dataset with Xvec-backed coordinates
    """
    encoded = ds.copy()
    for geom_coord in encoded.attrs["geometry"].keys():
        offsets = [c for c in encoded.coords if f"{geom_coord}_offsets_" in c]

        geoms = shapely.from_ragged_array(
            getattr(
                shapely.GeometryType,
                encoded.attrs["geometry"][geom_coord]["geometry_type"],
            ),
            coords=numpy.vstack(
                [encoded[f"{geom_coord}_coords_x"], encoded[f"{geom_coord}_coords_y"]]
            ).T,
            offsets=tuple([encoded[c] for c in offsets]),
        )
        encoded = encoded.drop_vars(
            [
                f"{geom_coord}_coords_x",
                f"{geom_coord}_coords_y",
            ]
            + offsets
        )
        encoded[geom_coord] = geoms
        encoded = encoded.xvec.set_geom_indexes(
            geom_coord, crs=encoded.attrs["geometry"][geom_coord]["crs"]
        )
    del encoded.attrs["geometry"]
    return encoded

Using the example from our docs:

import geopandas as gpd
import xarray as xr
import xvec
import shapely
import numpy

from geodatasets import get_path

counties = gpd.read_file(get_path("geoda.natregimes"))
cube = xr.Dataset(
    data_vars=dict(
        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
        divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
        age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
    ),
    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube

<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object POLYGON ((-95.34258270263672 48.54670333862...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

arrow_like = decode(cube)
arrow_like

<xarray.Dataset>
Dimensions:           (county: 3085, year: 4, county_coords_x: 80563,
                       county_coords_y: 80563, county_offsets_0: 3173,
                       county_offsets_1: 3160, county_offsets_2: 3086)
Coordinates:
  * county            (county) int64 0 1 2 3 4 5 ... 3080 3081 3082 3083 3084
  * year              (year) int64 1960 1970 1980 1990
  * county_coords_x   (county_coords_x) float64 -95.34 -95.34 ... -111.3 -111.4
  * county_coords_y   (county_coords_y) float64 48.55 48.72 ... 44.73 44.75
  * county_offsets_0  (county_offsets_0) int64 0 24 74 142 ... 80457 80489 80563
  * county_offsets_1  (county_offsets_1) int64 0 1 2 3 4 ... 3169 3170 3171 3172
  * county_offsets_2  (county_offsets_2) int64 0 1 2 3 4 ... 3156 3157 3158 3159
Data variables:
    population        (county, year) int64 4304 3987 3764 ... 43766 55800 65077
    unemployment      (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce           (county, year) float64 1.859 2.62 3.747 ... 4.782 7.415
    age               (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Attributes:
    geometry:  {'county': {'geometry_type': 'MULTIPOLYGON', 'crs': {'$schema'...

This form can be saved using arrow_like.to_zarr("xvec.zarr") and read back via xr.open_zarr("xvec.zarr"). Then you can convert it back.

encode(arrow_like)

<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object MULTIPOLYGON (((-95.34258270263672 48.54670...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

The only change during the roundtrip is casting of the Polygon to MultiPolygon since GeoArrow does not support mixed single and multi-part geometry types and all are casted to multi-type if there's at least one multi-type.

It does deviate from CF conventions but uses readily available tooling in shapely. And the format is derived from GeoArrow, so there is a spec to follow.

If you find this plausible, I will convert it to a PR.

martinfleis · 2023-09-18T09:25:28Z

Another option would be to encode geometries as WKB (well-known binary) instead of a Geo-Arrow-like ragged array. That allows us to keep the dimensionality and have the decoded object look like the original one.

USing the cube from above:

cube["county"] = shapely.to_wkb(cube["county"])
cube["county"].attrs["crs"] = cube["county"].attrs["crs"].to_json()
cube.to_zarr("wkb.zarr")
cube

<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x18\...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33

from_wkb = xr.open_zarr("wkb.zarr")
crs = from_wkb["county"].attrs["crs"]
from_wkb["county"] = shapely.from_wkb(from_wkb["county"])
from_wkb = from_wkb.xvec.set_geom_indexes("county", crs=crs)

Anyone has a reason why we should not use WKB here? The one reason I can think of is the need to have the WKB geometry reader but if you want to create geometries you probably already have that (it is part of GEOS).

benbovy · 2023-09-20T14:46:10Z

Anyone has a reason why we should not use WKB here?

I understand why this might not be ideal / optimal but I don't see any reason not to support it. It is convenient and it "scales well" (i.e., the number variables, dimensions & attributes doesn't explode) in the case of a dataset with several n-d geometry arrays (data variables), assuming that this case is plausible.

Looking at the geoparquet spec, WKB support doesn't seem to go anywhere anytime soon and will likely co-exist with other encoding options (geoarrow) in the near future. Unless it represents a big burden to maintain, IMHO it would make sense to support those options here too (and in the geozarr-spec if it includes vector data) in addition to the CF option (via cf-xarray).

It would be nice if xarray allowed plugins to define custom coders. Until that happens, the coders could be implemented following the existing coding API and called manually from xvec

Or Xvec could provide an I/O backend, which Xarray already allows as plugin. One thing I'm not sure is whether it could be built on top of Xarray's built-in zarr backend (I mean reusing the implementation of raw zarr I/O in a new backend, not extend the API of xarray's zarr backend).

rabernat mentioned this issue Jul 7, 2023

saving and reading vector data cubes #26

Open

jsignell mentioned this issue Jul 16, 2023

State of this repo twhiteaker/CFGeom#90

Open

jsignell mentioned this issue Jul 17, 2023

Add conversion between cf and shapely for lines xarray-contrib/cf-xarray#460

Merged

martinfleis mentioned this issue Sep 18, 2023

Are vector geometries in scope? zarr-developers/geozarr-spec#28

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zarr serialisation #48

Zarr serialisation #48

martinfleis commented Jul 3, 2023

edzer commented Jul 3, 2023

rabernat commented Jul 3, 2023

edzer commented Jul 3, 2023

rabernat commented Jul 3, 2023

edzer commented Jul 4, 2023

martinfleis commented Jul 4, 2023

jsignell commented Jul 15, 2023

martinfleis commented Jul 16, 2023

jsignell commented Jul 17, 2023

martinfleis commented Sep 12, 2023

martinfleis commented Sep 18, 2023 •

edited

benbovy commented Sep 20, 2023

Zarr serialisation #48

Zarr serialisation #48

Comments

martinfleis commented Jul 3, 2023

edzer commented Jul 3, 2023

rabernat commented Jul 3, 2023

edzer commented Jul 3, 2023

rabernat commented Jul 3, 2023

edzer commented Jul 4, 2023

martinfleis commented Jul 4, 2023

jsignell commented Jul 15, 2023

martinfleis commented Jul 16, 2023

jsignell commented Jul 17, 2023

martinfleis commented Sep 12, 2023

martinfleis commented Sep 18, 2023 • edited

benbovy commented Sep 20, 2023

martinfleis commented Sep 18, 2023 •

edited