Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zarr serialisation #48

Open
martinfleis opened this issue Jul 3, 2023 · 12 comments
Open

Zarr serialisation #48

martinfleis opened this issue Jul 3, 2023 · 12 comments

Comments

@martinfleis
Copy link
Member

Splitting this from #26 for a more focused discussion.

During a discussion with @rabernat earlier today, we talked about the possibility to serialise vector data cubes into a Zarr format, so I want to share a few notes and links.

Zarr is flexible enough to support this in many ways, so a few that has been discussed so far:

It may be worth ensuring whatever we prototype is done in sync with the GeoZarr efforts.

@edzer
Copy link

edzer commented Jul 3, 2023

I think your first proposal makes it harder to read & write Zarr, as the de/encoder needs to read Zarr AND GeoParquet; we had this discussion before when the idea came up to use WKB. I don't think the GDAL Multidimensional Array API can do that, can xarray do it?

Given the geoarrow proposal you point to, there seem to be a strong reasons not to interleave ([x y x y ... ]). I think that when you would use separate arrays for x and y, your second and third proposal will be nearly identical: I think geoarrow lists the offset indices of each geometry, CF their lengths.

The CF convention arrays are not ragged but contiguous in memory/on disk: only when you're cutting them into the polygons or linestrings they (obviously) become ragged of some sort. CF seems the most natural to me, because those who write Zarr are aware of that spec.

@rabernat
Copy link

rabernat commented Jul 3, 2023

A good way to approach this would be to mirror Xarray's existing encoding / decoding mechanism. The vector indexes could exist in two states:

  • encoded: the raw x, y, offset arrays, plus associated metadata--these are just normal array, serializable to zarr, netCDF, whatever
  • decoded: the geometries are translated to xvec indexes

It would be nice if xarray allowed plugins to define custom coders. Until that happens, the coders could be implemented following the existing coding API and called manually from xvec, e.g.

ds = xr.open_dataset(netcdf_or_zarr_file)
ds = ds.xvec.decode() # -> create xvec GeometryIndex from x, y, offset arrays
ds = ds.xvec.encode() # -> back to raw form
ds.to_zarr(new_file)

@edzer
Copy link

edzer commented Jul 3, 2023

Yes we can but should we? I wonder what the motivation is for drawing in non-native en/decoders: the payload is really the data cube cells, the vector geometries are only associated with (typically) a single dimension, meaning orders of magnitude smaller data than the array. Is this bikeshedding?

@rabernat
Copy link

rabernat commented Jul 3, 2023

I'm not sure what you mean by "non native". Could you elaborate? I'm proposing to mirror xarray's existing CF encoding / decoding mechanisms.

@edzer
Copy link

edzer commented Jul 4, 2023

Maybe I'm overthinking things.

I'm proposing to mirror xarray's existing CF encoding / decoding mechanisms.

Then we're on the same page!

@martinfleis
Copy link
Member Author

Agree with both of you.

the de/encoder needs to read Zarr AND GeoParquet

Yes, we just mentioned that it is theoretically possible but went quickly over to GeoArrow and CF conventions as a preferable solution.

when you would use separate arrays for x and y, your second and third proposal will be nearly identical

This would also be my preference. The NetCDF serialisation to match {stars} is still on the roadmap in parallel to Zarr and having both using either the same or very close representation of geometries would allow us to reuse big parts of the machinery for one in the other.

A good way to approach this would be to mirror Xarray's existing encoding / decoding mechanism.

+1

@jsignell
Copy link

I was taking a look at this at SciPy sprints today and tried to extend the shapely_to_cf function in cf-xarray. I ran into a bit of trouble with handling multilines (I'll try to put up a PR tomorrow).

I was eventually pointed to https://github.com/twhiteaker/CFGeom! I'm not sure what the maintenance situation is on that, but it could either be revived or used as a reference of how to convert back and forth between cf and shapely. And they already have a to_netcdf 🚀

@martinfleis
Copy link
Member Author

We could probably use it to ensure that the CF representation of geometries is according to spec but due to its age, the implementation itself will likely be very different now that we have shapely 2. I assume that their code may still work but it will not be as efficient as we can be these days.

We could also generate reference NetCDF files using {stars} if that is useful. And ensure it is equal to one CFGeom generates.

@jsignell
Copy link

Ok yeah that makes sense. I'll open a PR with what I've got. Not sure when I'll get a chance to come back to it.

@martinfleis
Copy link
Member Author

I have tried to go the simple path and encode geometries using geo-arrow-like encoding to store in Zarr and it seems to work nicely (plus it was approved by @MSanKeys963). Below is the draft of encode/decode functions, while Zarr IO on the encoded Dataset works flawlessly.

def decode(ds):
    """Decode Dataset with Xvec indexes into GeoArrow arrays

    Parameters
    ----------
    ds : xarray.Dataset
        Dataset with Xvec-backed coordinates

    Returns
    -------
    xarray.Dataset
        Dataset with geometries represented as GeoArrow-like arrays
    """
    decoded = ds.copy()
    decoded = decoded.assign_attrs(geometry={})
    for geom_coord in ds.xvec.geom_coords:
        geom_type, coords, offsets = shapely.to_ragged_array(ds[geom_coord])
        decoded.attrs["geometry"][geom_coord] = {
            "geometry_type": geom_type.name,
            "crs": ds[geom_coord].crs.to_json_dict(),
        }
        to_assign = {
            geom_coord: numpy.arange(ds[geom_coord].shape[0]),
            f"{geom_coord}_coords_x": coords[:, 0],
            f"{geom_coord}_coords_y": coords[:, 1],
        }
        if coords.shape[1] == 3:
            to_assign[f"{geom_coord}_coords_z"] = coords[:, 2]
        for i, offsets_arr in enumerate(offsets):
            to_assign[f"{geom_coord}_offsets_{i}"] = offsets_arr

    return decoded.assign(to_assign)

def encode(ds):
    """Encode Dataset indexed by geometries represented as GeoArrow arrays into Xvec indexes 

    Parameters
    ----------
    ds : xarray.Dataset
        Dataset with geometries represented as GeoArrow-like arrays

    Returns
    -------
    xarray.Dataset
        Dataset with Xvec-backed coordinates
    """
    encoded = ds.copy()
    for geom_coord in encoded.attrs["geometry"].keys():
        offsets = [c for c in encoded.coords if f"{geom_coord}_offsets_" in c]

        geoms = shapely.from_ragged_array(
            getattr(
                shapely.GeometryType,
                encoded.attrs["geometry"][geom_coord]["geometry_type"],
            ),
            coords=numpy.vstack(
                [encoded[f"{geom_coord}_coords_x"], encoded[f"{geom_coord}_coords_y"]]
            ).T,
            offsets=tuple([encoded[c] for c in offsets]),
        )
        encoded = encoded.drop_vars(
            [
                f"{geom_coord}_coords_x",
                f"{geom_coord}_coords_y",
            ]
            + offsets
        )
        encoded[geom_coord] = geoms
        encoded = encoded.xvec.set_geom_indexes(
            geom_coord, crs=encoded.attrs["geometry"][geom_coord]["crs"]
        )
    del encoded.attrs["geometry"]
    return encoded

Using the example from our docs:

import geopandas as gpd
import xarray as xr
import xvec
import shapely
import numpy

from geodatasets import get_path

counties = gpd.read_file(get_path("geoda.natregimes"))
cube = xr.Dataset(
    data_vars=dict(
        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
        divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
        age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
    ),
    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object POLYGON ((-95.34258270263672 48.54670333862...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)
arrow_like = decode(cube)
arrow_like
<xarray.Dataset>
Dimensions:           (county: 3085, year: 4, county_coords_x: 80563,
                       county_coords_y: 80563, county_offsets_0: 3173,
                       county_offsets_1: 3160, county_offsets_2: 3086)
Coordinates:
  * county            (county) int64 0 1 2 3 4 5 ... 3080 3081 3082 3083 3084
  * year              (year) int64 1960 1970 1980 1990
  * county_coords_x   (county_coords_x) float64 -95.34 -95.34 ... -111.3 -111.4
  * county_coords_y   (county_coords_y) float64 48.55 48.72 ... 44.73 44.75
  * county_offsets_0  (county_offsets_0) int64 0 24 74 142 ... 80457 80489 80563
  * county_offsets_1  (county_offsets_1) int64 0 1 2 3 4 ... 3169 3170 3171 3172
  * county_offsets_2  (county_offsets_2) int64 0 1 2 3 4 ... 3156 3157 3158 3159
Data variables:
    population        (county, year) int64 4304 3987 3764 ... 43766 55800 65077
    unemployment      (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce           (county, year) float64 1.859 2.62 3.747 ... 4.782 7.415
    age               (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Attributes:
    geometry:  {'county': {'geometry_type': 'MULTIPOLYGON', 'crs': {'$schema'...

This form can be saved using arrow_like.to_zarr("xvec.zarr") and read back via xr.open_zarr("xvec.zarr"). Then you can convert it back.

encode(arrow_like)
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object MULTIPOLYGON (((-95.34258270263672 48.54670...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

The only change during the roundtrip is casting of the Polygon to MultiPolygon since GeoArrow does not support mixed single and multi-part geometry types and all are casted to multi-type if there's at least one multi-type.

It does deviate from CF conventions but uses readily available tooling in shapely. And the format is derived from GeoArrow, so there is a spec to follow.

If you find this plausible, I will convert it to a PR.

@martinfleis
Copy link
Member Author

martinfleis commented Sep 18, 2023

Another option would be to encode geometries as WKB (well-known binary) instead of a Geo-Arrow-like ragged array. That allows us to keep the dimensionality and have the decoded object look like the original one.

USing the cube from above:

cube["county"] = shapely.to_wkb(cube["county"])
cube["county"].attrs["crs"] = cube["county"].attrs["crs"].to_json()
cube.to_zarr("wkb.zarr")
cube
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x18\...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
from_wkb = xr.open_zarr("wkb.zarr")
crs = from_wkb["county"].attrs["crs"]
from_wkb["county"] = shapely.from_wkb(from_wkb["county"])
from_wkb = from_wkb.xvec.set_geom_indexes("county", crs=crs)

Anyone has a reason why we should not use WKB here? The one reason I can think of is the need to have the WKB geometry reader but if you want to create geometries you probably already have that (it is part of GEOS).

@benbovy
Copy link
Member

benbovy commented Sep 20, 2023

Anyone has a reason why we should not use WKB here?

I understand why this might not be ideal / optimal but I don't see any reason not to support it. It is convenient and it "scales well" (i.e., the number variables, dimensions & attributes doesn't explode) in the case of a dataset with several n-d geometry arrays (data variables), assuming that this case is plausible.

Looking at the geoparquet spec, WKB support doesn't seem to go anywhere anytime soon and will likely co-exist with other encoding options (geoarrow) in the near future. Unless it represents a big burden to maintain, IMHO it would make sense to support those options here too (and in the geozarr-spec if it includes vector data) in addition to the CF option (via cf-xarray).

It would be nice if xarray allowed plugins to define custom coders. Until that happens, the coders could be implemented following the existing coding API and called manually from xvec

Or Xvec could provide an I/O backend, which Xarray already allows as plugin. One thing I'm not sure is whether it could be built on top of Xarray's built-in zarr backend (I mean reusing the implementation of raw zarr I/O in a new backend, not extend the API of xarray's zarr backend).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants