saving and reading vector data cubes #26

Open
martinfleis opened this issue Feb 7, 2023 · 33 comments
@martinfleis
Member

At the moment, we have a decent set of functionality for working with in-memory vector data cubes. What we do not have is a way of saving them to disk and reading them back again, apart from using pickle, which is not very interoperable.

We should find a way of saving to a file format that gives us interoperability with @edzer's stars in RSpatial. @edzer, you were mentioning that NetCDF is able to handle VDC. Does stars implement that? Can we create a data cube indexed by geometry and save it via GDAL to NetCDF? I tried only very briefly to play with the write_stars function, to no avail [1].

The spec for the NetCDF way of storing geometries is here, but as someone who has never worked with the file format, it is not super clear to me how it all works together.
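
If I read the spec correctly, the arrays for a single polygon with one hole would look roughly like this (a hand-written numpy sketch; names follow the spec's examples):

import numpy as np

# all vertices of all rings, concatenated along the node dimension
x = np.array([0.0, 10.0, 10.0, 0.0,  2.0, 4.0, 4.0, 2.0])
y = np.array([0.0,  0.0, 10.0, 10.0, 2.0, 2.0, 4.0, 4.0])

node_count = np.array([8])          # nodes per geometry (here a single geometry)
part_node_count = np.array([4, 4])  # nodes per ring/part
interior_ring = np.array([0, 1])    # 0 = exterior ring, 1 = hole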

Over in Python, there is some work @dcherian posted on conversion of shapely geometries to CF geometries and back (https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.shapely_to_cf.html) [2]. There is also https://github.com/twhiteaker/cfgeom, which may potentially be used (although it is only a reference implementation and is not maintained).

So it seems that NetCDF should be able to store vector data cubes, and we may just need to figure out some wrappers for convenient IO from xvec-backed xarray objects. But I'd welcome some help, or at least guidance, from someone more familiar with the CF conventions.

The other option seems to be Zarr, as discussed here, but that is at this stage only an idea and the (Geo)Zarr spec is not ready as of now.

The last option is to convert the cube to a long-form dataframe and save it as GeoParquet, but that kind of defeats the point of having a cube in the first place.

Footnotes

  1. Getting Error in !all.equal(match(xydims, names(d)), 1:2) : invalid argument type when trying to use write_stars(st, 'test.nc') on the example cube from https://r-spatial.github.io/stars/articles/stars1.html#vector-data-cube-example.

  2. Only points are implemented so far, but that may change.


@rabernat

rabernat commented Feb 8, 2023

I'm interested in following this. Do you have an example of an existing NetCDF file that has the sort of encoded geometries you'd like to produce? That would be a very useful place to start. Presumably Xarray can open that file; it just doesn't decode the data into anything useful for computation.

I think the general pattern we're looking for is to create an xarray accessor that manages the encoding / decoding of the geometry data from this "raw" state (which is serializable) to something useful, which is not. The pint-xarray package provides a template of what this might look like.

# Open the raw data,
ds = xr.open_dataset("encoded.nc")
# convert the geometry data to something useful 
ds_decoded = ds.xvec.decode_geometries()
# the inverse operation
ds_encoded = ds_decoded.xvec.encode_geometries()
xr.testing.assert_equal(ds, ds_encoded)
ds_encoded.to_netcdf("round_trip.nc")

To make this more convenient for the user, this could be wrapped in an xarray pluggable backend for I/O.
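
A minimal sketch of such a backend (decode_geometries is the hypothetical accessor method from above, not an existing API):

import xarray as xr
from xarray.backends import BackendEntrypoint

class XvecBackendEntrypoint(BackendEntrypoint):
    # open with a regular engine, then decode geometries via the accessor
    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        ds = xr.open_dataset(
            filename_or_obj, engine="netcdf4", drop_variables=drop_variables, **kwargs
        )
        return ds.xvec.decode_geometries()  # hypothetical accessor method from above

Registered under the xarray.backends entry point group, this would let users write xr.open_dataset("file.nc", engine="xvec").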

One problem with our ecosystem is that there is no one central package that manages this sort of decoding for geospatial data. Rioxarray does a bunch, leveraging gdal. Odc-geo does things a little differently. Metpy has its own handling of CRS and other geo stuff. So does CF Xarray.

Makes me wish geo-xarray had not turned into a zombie package. It's really needed.

Anyway, I recommend following the path I sketched above to prototype this functionality.

@dcherian

dcherian commented Feb 8, 2023

I think the general pattern we're looking for is to create an xarray accessor that manages the encoding / decoding of the geometry data from this "raw" state

@aulemahal already taught CF-xarray to do this but only for "point geometries". IMO we should extend it to support other things (PR welcome!). Granted it's a helper function right now, but we could easily add .cf.encode(shapely_to_geometry=True) and .cf.decode(geometry_to_shapely=True).

Part of the point of cf-xarray is to have a single place for these kinds of utilities that can be used in many places.
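
For points, the round trip already works along these lines (a rough sketch using the helpers linked above):

import cf_xarray
import shapely
import xarray as xr

geoms = xr.DataArray(shapely.points([[0, 0], [1, 1], [2, 2]]), dims="geometry")
ds_cf = cf_xarray.shapely_to_cf(geoms)       # shapely objects -> CF arrays
roundtrip = cf_xarray.cf_to_shapely(ds_cf)   # CF arrays -> shapely objects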

@e-kotov

e-kotov commented Feb 8, 2023

Just for reference, the other day I had a very similar question for @edzer in the stars repo: r-spatial/stars#611. For now, like in Python, the only way to save these cubes in R is R's pickle, RDS. It would be great to have an interoperable format, perhaps even based on arrow/parquet, rather than NetCDF.

@edzer

edzer commented Feb 8, 2023

For now, like in Python, the only way to save these cubes in R is R's pickle, RDS

@e-kotov no, that is not what I said, this only applies to the case where more than one dimension is associated with vector geometries.

For the (most common) use case of a single vector dimension, the CDL described in the CF conventions @martinfleis refers to can be used directly, leading to NetCDF and Zarr implementations (they both follow CDL / CF). Special cases such as weird raster cells specified by their vertices - think hexagons, DGGS - can also be covered by that.

If we want to write and read more than one vector geometry dimension we can still use CDL (NetCDF, Zarr) but would have to go beyond CF conventions. Nothing wrong with that, if we have a strong use case we can propose to extend CF.

In R, I implemented stars::write_mdim() and stars::read_mdim() to write & read vector data cubes (with one vector geometry dimension). It uses the GDAL C++ multidimensional array API, as it is already using GDAL for most I/O, but could just as well have used the NetCDF library. The encoding of points, multilinestrings and multipolygons in CDL arrays is pretty trivial: you just write out all points (vertices), then index arrays that point to the start of each linestring or polygon, and, for polygons with holes, indexes to the rings that are holes - you get the idea.
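
To illustrate with made-up numbers: computing offsets from the per-part counts once gives direct access to any part, e.g. the last one:

import numpy as np

part_node_count = np.array([5, 4, 4, 6])            # nodes per ring/part
offsets = np.concatenate([[0], np.cumsum(part_node_count)])
last_part = slice(offsets[-2], offsets[-1])         # jump straight to the last part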

@martinfleis
Member Author

martinfleis commented Feb 8, 2023

Thanks all!

We can split the issue into two problems:

  1. how to mimic the stars implementation of IO with one geometry dimension
  2. how to ensure we have a file format allowing N geometry dimensions

For the first one, I have created an example NetCDF file encoding polygons using this R code (in the rocker/geospatial container).

library(stars)

nc = st_read(system.file("gpkg/nc.gpkg", package="sf")) 
location = st_geometry(nc)
mode = c("car", "bike", "foot")
day = 1:100
hour = 0:23
dims = st_dimensions(location = location, mode = mode, day = day, hour = hour)
n = dim(dims)
traffic = array(rpois(prod(n), 10), dim = n) # simulated traffic counts
ds = st_as_stars(list(traffic = traffic),  dimensions = dims)
write_mdim(ds, "stars_cube.nc")

The file is attached (had to manually zip it to upload) -> stars_cube.nc.zip

Xarray does read it, and it seems that we should fairly easily be able to convert that information into an array of shapely geometries with a pyproj.CRS assigned via xvec.

ds_stars = xr.open_dataset("stars_cube.nc")
print(ds_stars)

<xarray.Dataset>
Dimensions:          (hour: 24, day: 100, mode: 3, location: 100, node: 2529,
                      part: 108)
Coordinates:
  * location         (location) float32 1.0 2.0 3.0 4.0 ... 97.0 98.0 99.0 100.0
  * mode             (mode) object 'car' 'bike' 'foot'
  * day              (day) float32 1.0 2.0 3.0 4.0 5.0 ... 97.0 98.0 99.0 100.0
  * hour             (hour) float32 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0
  * node             (node) float32 1.0 2.0 3.0 ... 2.528e+03 2.529e+03
    x                (node) float32 ...
    y                (node) float32 ...
Dimensions without coordinates: part
Data variables:
    traffic          (hour, day, mode, location) float32 ...
    crs              |S1 ...
    node_count       (location) float32 ...
    part_node_count  (part) float32 ...
    interior_ring    (part) float32 ...
    geometry         float32 ...
Attributes:
    Conventions:  CF-1.6

where

print(ds_stars.crs)

<xarray.DataArray 'crs' ()>
[1 values with dtype=|S1]
Attributes:
    grid_mapping_name:            latitude_longitude
    long_name:                    CRS definition
    longitude_of_prime_meridian:  0.0
    semi_major_axis:              6378206.4
    inverse_flattening:           294.978698213898
    spatial_ref:                  GEOGCS["NAD27",DATUM["North_American_Datum_...
    crs_wkt:                      GEOGCS["NAD27",DATUM["North_American_Datum_...

can be interpreted by pyproj, and x, y, node_count, part_node_count and interior_ring can be used to construct shapely geometries. The latter should ideally be implemented in cf_xarray.cf_to_shapely() and included in the decode logic @rabernat posted above.
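
To sketch what that decoding could look like (a hypothetical helper; for simplicity it assumes each geometry is a single polygon whose non-interior ring is the exterior, ignoring the multipolygon case):

import numpy as np
import shapely

def cf_polygons_to_shapely(x, y, node_count, part_node_count, interior_ring):
    coords = np.column_stack([x, y])
    ring_bounds = np.concatenate([[0], np.cumsum(part_node_count)]).astype(int)
    geom_bounds = np.concatenate([[0], np.cumsum(node_count)]).astype(int)
    polygons, r = [], 0
    for g in range(len(node_count)):
        shell, holes = None, []
        # consume rings until we reach the first node of the next geometry
        while r < len(part_node_count) and ring_bounds[r] < geom_bounds[g + 1]:
            ring = coords[ring_bounds[r]:ring_bounds[r + 1]]
            if interior_ring[r]:
                holes.append(ring)
            else:
                shell = ring
            r += 1
        polygons.append(shapely.Polygon(shell, holes))
    return np.array(polygons, dtype=object)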

And then do the same the other way.

For the second one (multiple geometry dimension in a file), I believe that should be discussed separately and probably somewhere else. We can then implement here or in other packages whatever spec a community agrees on.

@benbovy
Member

benbovy commented Feb 8, 2023

What about a custom codec (filter) for encoding / decoding the 1-d geometry coordinates of the vector data cubes?

Regarding zarr (python), this can be registered dynamically via numcodecs. For hdf5 / netcdf4 I'm less sure, but it looks like custom filters are supported too (e.g., https://github.com/silx-kit/hdf5plugin)?

The geoparquet format currently encodes geometries in the WKB format, which is used by geopandas to read/write parquet and feather files. For a Python/Zarr prototype, I guess we could adapt some logic from geopandas and register a custom numcodecs codec that encodes (decodes) to (from) WKB? Such a codec would nicely fit in Xvec actually. Since it is handled by zarr, there wouldn't be anything specific to do on the Xarray side to convert the raw vector data. Perhaps the same can be done for h5py (netcdf4)?
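
A minimal sketch of what such a codec could look like (WKBCodec is a made-up name; configuration, error handling and CRS metadata are left out):

import numpy as np
import shapely
from numcodecs import register_codec
from numcodecs.abc import Codec
from numcodecs.vlen import VLenBytes

class WKBCodec(Codec):
    codec_id = "wkb"

    def encode(self, buf):
        # object array of shapely geometries -> one buffer of WKB byte strings
        return VLenBytes().encode(shapely.to_wkb(np.asarray(buf, dtype=object)))

    def decode(self, buf, out=None):
        geoms = shapely.from_wkb(VLenBytes().decode(buf))
        if out is not None:
            out[...] = geoms
            return out
        return geoms

register_codec(WKBCodec)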

I'm not sure how best to deal with metadata, though. The geozarr spec seems a nice place to discuss / address this. For vector-specific metadata, it might be worth looking at the geoarrow specs.

I'm not familiar with the CF conventions for geometries, but at a first glance it looks a bit more convoluted (and maybe less flexible?) than the custom codec approach.

More generally, it would be nice if we could leverage the best from both Arrow/Parquet and Zarr (HDF5/NetCDF). Those formats are de facto pretty standard or on the way to being so. If that's possible, it looks like an option that would be both efficient and reasonably portable for vector data cube I/O.

@rabernat

rabernat commented Feb 8, 2023

It would be great to have an interoperable format, perhaps even based on arrow/parquet, rather than NetCDF.

I agree interoperability is a great goal to strive for. I have been assuming that data cubes cannot be serialized to Parquet / Arrow because those formats are fundamentally aimed at dataframes (and not ND-Arrays). Maybe vector data are different, however. If your data fit into Parquet, Pandas is probably the right choice, rather than Xarray.

Xarray's internal data model is based on NetCDF. However, we can serialize this data model to many different formats. Something like this.

graph TD;
    CF-conventions--> NetCDF-Data-Model;
    NetCDF-Data-Model-->NetCDF3-Binary-Format
    NetCDF-Data-Model-->HDF5;
    NetCDF-Data-Model-->Zarr;
    NetCDF-Data-Model-->Arrow?

That said, I'd love for someone to show me how to serialize the NetCDF data model to Arrow / Parquet.

What about a custom codec (filter) for encoding / decoding the 1-d geometry coordinates of the vector data cubes?

This is an interesting idea. Should all of Xarray's CF decoding be re-implemented as custom codecs? I tend to think of codecs as decompression-related. Packing too much logic into the codec concept feels a bit fragile. On the other hand, Xarray's convention encoding / decoding logic is remarkably similar to how codecs work, so maybe this is worth pursuing 🤔 .

@rabernat

rabernat commented Feb 8, 2023

I'm not familiar with the CF conventions for geometries, but at a first glance it looks a bit more convoluted (and maybe less flexible?) than the custom codec approach.

One additional point... If a standard already exists for this problem, we should absolutely try to use it rather than inventing a new one! I don't need to share the XKCD cartoon, which we all know by heart by now.

Edit: I read @edzer's comment more carefully:

If we want to write and read more than one vector geometry dimension we can still use CDL (NetCDF, Zarr) but would have to go beyond CF conventions.

...and now I understand that indeed a new standard is needed. My experience with the CF conventions is that the timescale for something new to be included is about 2 years. I assume we want to move quicker here, so coming up with an ad-hoc extension to those conventions and demonstrating interoperability between R and Python seems like a great place to start.

@edzer

edzer commented Feb 8, 2023

What about a custom codec (filter) for encoding / decoding the 1-d geometry coordinates of the vector data cubes?

I think the strength of NetCDF and Zarr is that you only need the NetCDF library to read everything, nothing else. Introducing something like WKB into it (assuming you can: with a bit array? But how to deal with the ragged nature?) would mean that someone just using the NetCDF library could no longer interpret the geometries. With the CF geometry conventions, they can. I wouldn't start thinking about giving up this level of interoperability: NetCDF will soon turn 50.

The CF geometry conventions are really, AFAICT, the simplest you can get. I believe that Arrow takes the same approach for speed reasons (right @paleolimbot?). The advantage over WKB is that with explicit indexes you can easily (directly) jump to the last sub-polygon, for instance; with WKB you'd have to go through the full set of rings before finding the last. These CF geometries give up on the difference between LINESTRING and MULTILINESTRING, and between POLYGON and MULTIPOLYGON, so no clean roundtrips of simple features but you could argue that those distinctions are a luxury (and often a hassle) anyway.

@martinfleis great example; for testing it would make sense to also have a polygon with holes, which this dataset doesn't have.

For the second one (multiple geometry dimension in a file), I believe that should be discussed separately and probably somewhere else. We can then implement here or in other packages whatever spec a community agrees on.

I'll see if I can think of an approach how this could be done with CDL / NetCDF building on the CF approach. Maybe adding one (top) level of nesting, indicating the dimension involved, would be enough.

@rabernat I also understand that Parquet/Arrow cater for columnar data, not arrays. You can put an array in a column, but you'll lose all its advantages.

@martinfleis
Member Author

I think that the only thing we could borrow from Arrow is the GeoArrow geometry encoding, but it is incompatible with the CF encoding. Arrow uses interleaved coordinate values ([x, y, x, y, ...]). Btw, this discussion on saving rasters as Arrow may be relevant as well: geoarrow/geoarrow#24

I would also try to stick as much as we can to existing standards. The CF geometry encoding looks straightforward to work with, and if we can find a way of saving multiple dimensions, we can implement it in R and Python as a de facto standard and, if going through CF first would slow things down, try to get it into the CF conventions later. We just need to ensure that our extension does not break reading of the same file elsewhere.

@benbovy
Member

benbovy commented Feb 8, 2023

I think the strength of NetCDF and Zarr is that you only need the NetCDF library to read everything, nothing else. Introducing something like WKB into it (assuming you can: with a bit array? But how to deal with the ragged nature?) would mean that someone just using the NetCDF library could no longer interpret the geometries. With the CF geometry conventions, they can. I wouldn't start thinking about giving up this level of interoperability: NetCDF will soon turn 50.

Fair enough. You have more experience than me with WKB, CF (geometry), etc. to know what would be the best option.

Arrow uses interleaved coordinate values ([x, y, x, y, ...])

Is it settled? There seems to be a long unresolved discussion here: geoarrow/geoarrow#26.

I was also naively wondering how compatible the CF conventions and geoarrow specs are for encoding geometries, and how concerning poor compatibility would be for performance and/or other aspects? Especially if we expect many back-and-forth conversions between data cubes <-> data frames in workflows that involve vector data (usually more than for gridded/raster data)?

@martinfleis
Member Author

Arrow uses interleaved coordinate values ([x, y, x, y, ...])
Is it settled? There seems to be a long unresolved discussion here: geoarrow/geoarrow#26.

Probably not entirely (I guess that is the main reason why GeoParquet 1.0 still uses WKB) but I lost track of that a bit. I think that at least GeoPolars and cuSpatial already use the interleaved option.

I was also naively wondering how compatible the CF conventions and geoarrow specs are for encoding geometries, and how concerning poor compatibility would be for performance and/or other aspects? Especially if we expect many back-and-forth conversions between data cubes <-> data frames in workflows that involve vector data (usually more than for gridded/raster data)?

We have three sides here, not two. One is the CF representation, another Geo-Arrow, but we also need to keep in mind that in most cases we need GEOS geometries to work with (that applies to both Python and R). So imho fast conversion between CF and GEOS (shapely) is more important in our ecosystem than CF to Geo-Arrow. And for shapely, having separate x and y arrays is better than an interleaved array that would need to be split into separate arrays anyway (or a 2D array, but that still involves reshuffling).
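
For illustration (made-up coordinates):

import numpy as np
import shapely

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
points = shapely.points(x, y)  # separate arrays go straight in

interleaved = np.array([1.0, 4.0, 2.0, 5.0, 3.0, 6.0])
points_too = shapely.points(interleaved.reshape(-1, 2))  # interleaved needs reshaping first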

@benbovy
Member

benbovy commented Feb 8, 2023

So imho fast conversion between CF and GEOS (shapely) is more important in our ecosystem than CF to Geo-Arrow.

Yes that's true for shapely. For S2 (s2geography / spherely) I think it is possible to reuse external, in-memory coordinate data. I vaguely remember about @paleolimbot sharing some ideas about using S2 directly with Arrow structures (https://dewey.dunnington.ca/post/2021/prototyping-an-apache-arrow-representation-of-geometry/).

@thomcom

thomcom commented Feb 9, 2023

I just wanted to chime in that if interleaved coordinates become a blocker for a lot of people adopting GeoArrow, that's a strong reason to drop that requirement.

@paleolimbot

The CF geometry conventions are really, AFAICT, the simplest you can get. I believe that Arrow takes the same approach for speed reasons (right paleolimbot?)

Other than just reading them now, I'm not all that familiar with the CF conventions for geometry...it looks like the only difference is that it stores "lengths" instead of "cumulative lengths", and as @edzer noted, there is no distinction between multi and non-multi types and no support for mixed types. I think Edzer's point about interoperability is huge: no need for a WKB parser! And nothing does ndarrays like NetCDF. From reading this thread I can't help but side with NetCDF + CF conventions (but I'm very new to this discussion).

I vaguely remember about paleolimbot sharing some ideas about using S2 directly with Arrow structures

Yes, S2 is very flexible about its data structure (basically every edge is a virtual method call...it makes the bet that many of those edges will never need to be resolved). One could write an S2Geography subclass that lazily reads from the file. That said, I don't think it's a good use of anybody's time to actually do that.

Is it settled? There seems to be a long unresolved discussion

My current prediction is that the GeoArrow specification will allow for both, and there will be a small vendorable library that will make it trivial to translate between WKB, WKT, and both forms of coordinate representation. That small vendorable library (https://github.com/geoarrow/geoarrow-cpp) is progressing as fast as my kids' capacity for napping/sleeping allows, which is at present very, very slow.

@dblodgett-usgs

dblodgett-usgs commented Feb 10, 2023

Hi All. I think the discussion here is largely on target. We designed the CF geometry spec with simplicity and support for 90% of use cases in mind. @edzer is totally on point with regard to the design theory and accessibility of the encoding here (https://github.com/martinfleis/xvec/issues/26#issuecomment-1422599146), but is not quite correct that multi types aren't supported. See the examples in the vignette linked below for details.

A couple of resources:

A poster that Tim and I prepared that talks about the use cases and implementation strategy.
https://www.unidata.ucar.edu/mailing_lists/archives/netcdf-java/2018/pdfxJp5EmUlti.pdf

A vignette for my R package ncdfgeom, which implements the core CF geometry spec. The package is built into the {stars} pure-NetCDF reader and also writes the NetCDF geometry format with time series. This vignette and the package test data might be a good place for someone to start. I think the same tests are in Tim's package.
https://doi-usgs.github.io/ncdfgeom/articles/geometry.html

Tim’s python reference implementation docs.
https://twhiteaker.github.io/CFGeom/tutorial.html

Happy to help clarify things or cross validate tests.

Note that I also worked with a student a few years back to get the core of the CF geometry spec into GDAL (for better or worse) 😅

The format really hasn't caught on, but I still think it has a place in the mix. It's dead simple and pretty easy to read / write from WKB/WKT structures.

Noting some comments about adding further dimensionality. No reason not to. Just call it a custom extension and move on. If it catches on, bring it to CF. I’d be happy to help evolve that part of the spec.

Cheers!

@kylebarron

kylebarron commented Mar 8, 2023

I'm still working on wrapping my head around the specifics of vector data cubes (and in general rather unfamiliar with CF etc), but naively it seems like you might still be able to use Arrow.

In a simple case, where the geometry column maps against the first axis of the data cube, you could store the data cube in a nested fixed size list column (essentially the raster/tensor type). Then as I understand it, the entire column's data buffer would be equivalent to a C-order multidimensional array. You could store the geometries in any format you'd like.
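
A quick sketch of that equivalence with made-up numbers:

import numpy as np
import pyarrow as pa

cube = np.arange(12, dtype=np.float64).reshape(3, 4)  # 3 geometries x 4 time steps
col = pa.FixedSizeListArray.from_arrays(pa.array(cube.ravel(order="C")), 4)
# the column's flat child array is exactly the C-order buffer of the cube
assert np.array_equal(np.asarray(col.values), cube.ravel(order="C"))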

I think that at least GeoPolars and cuSpatial already use the interleaved option.

Right now geopolars uses a separated/struct array because polars doesn't support the FixedSizeList type. I hope to implement that in polars at some point to enable using interleaved arrays.

@rabernat

rabernat commented Jul 7, 2023

I played around a bit with cf_xarray and xvec and made the following example work. The sample file I'm using is from the ncdfgeom R package: https://github.com/DOI-USGS/ncdfgeom/tree/main/inst/extdata

Note that this file doesn't actually conform to the CF geometries spec, so it has to be augmented before it can be used.

import xarray as xr
import cf_xarray
import xvec

# to download
# wget https://github.com/DOI-USGS/ncdfgeom/raw/main/inst/extdata/example_huc_eta.nc
ds = xr.open_dataset('example_huc_eta.nc')
 
# assign missing geometry_container variable
# Note: we omit node_count because these are point data
ds.coords['geometry_container'] = 0
ds.geometry_container.attrs['geometry_type'] = 'point'
ds.geometry_container.attrs['node_coordinates'] = 'lat lon'

# use cf_xarray to convert to shapely geometries
geometries = cf_xarray.cf_to_shapely(ds)
# (because node_count was missing, the dimension name gets auto-assigned to "features")

# now make this a coordinate variable on the original dataset
ds.coords['geometry'] = geometries.rename({"features": "station"})

# Now make this a dimension, which creates a new Pandas Index
# (It doesn't make sense to me to first create the pandas index. Is this efficient?)
ds_swapped = ds.swap_dims({"station": "geometry"})

# now set the geometry index
ds_cube = ds_swapped.xvec.set_geom_indexes("geometry", crs="4326")

This shows how it is plausible to decode CF geometries from any Xarray dataset into an xvec GeometryIndex. Here are some potential next steps.

  • PR to cf_xarray to cover geometries beyond points. Here I wonder if geoarrow could help with a very efficient decoding of the raw array data into shapely geometries. @paleolimbot - does that seem like the right place to plug in geoarrow?
  • PR to cf_xarray to implement decode_geometries and encode_geometries to facilitate going back and forth between the encoded / decoded representations. (This is just some syntactic sugar around code like what I wrote above, which is essentially decode_geometries.) At this point, we can serialize the geometries into any file format supported by Xarray (mainly NetCDF and Zarr [see Zarr serialisation #48]) using the same machinery.

Once those two are done, we would want to decide where to create the xvec indexes. It could possibly be done by cf_xarray during the decode_geometries step. Or it could be a separate step. For example

ds = xr.open_dataset("file_with_cf_geometries.nc")
ds = ds.cf.decode_geometries()
ds = ds.xvec.set_geom_indexes()

@paleolimbot

does that seem like the right place to plug in geoarrow?

I think geoarrow can currently get you a pyarrow.Array from the types of buffers you're describing (with the caveat that it's not on pypi or conda-forge yet):

python -m pip install "https://github.com/geoarrow/geoarrow-c/archive/refs/heads/main.zip#egg=geoarrow&subdirectory=python"
>>> import geoarrow.pyarrow as ga
>>> import numpy as np
>>> ga.linestring().from_geobuffers(None,np.array([0, 2, 5], dtype=np.int32), np.array([1, 2, 3, 4, 5], dtype=np.double), np.array([6, 7, 8, 9, 10], dtype=np.double))
LinestringArray:LinestringType(geoarrow.linestring)[2]
<LINESTRING (1 6, 2 7)>
<LINESTRING (3 8, 4 9, 5 10)>

Recent versions of pandas and pyarrow can get you a pandas.Series backed by a pyarrow.Array or pyarrow.ChunkedArray, or you could convert to geopandas (slow because of dtype=object). I don't know what the constraint is on the end value type for xarray...if you can avoid dtype=object you will probably be a lot happier.

@rabernat

rabernat commented Jul 7, 2023

Currently the way this works (sans any arrow) is something like:

  1. Data on disk are chunked, compressed arrays (either HDF5 or Zarr)
  2. Xarray can open the datasets lazily
  3. Eventually the arrays get loaded into memory as simple 1D numpy arrays (float64 dtype)
  4. cf_xarray.cf_to_shapely consumes those and produces a numpy array with object dtype. Each object is a shapely.geometry.point.Point. This definitely requires new memory allocation.
  5. ds.swap_dims turns that array into a PandasIndex. Its repr looks like PandasIndex(Index([POINT (36.488959 -80.399735), POINT (36.43594 -80.365249)], dtype='object', name='geometry')). Is a copy involved here? 🤷
  6. xvec.set_geom_indexes takes this and turns it into an xvec.index.GeometryIndex. Is a copy involved here? 🤷

It's unclear to me how best to plug in Arrow, since Xarray is not fundamentally a dataframe library. Presumably Arrow could help by reducing the number of memory copies? Or by making step 4 faster?

@paleolimbot

paleolimbot commented Jul 7, 2023

I'm rather new to the Python universe, so I'm not sure what the best way to plug in Arrow is either. It is true that the pyarrow.Array that from_geobuffers() returns does not copy the 1D numpy arrays it is given as input. pyarrow is also a convenience here...geoarrow is implemented in C and produces C structures that are converted to pyarrow at the Python level.

If the ultimate endpoint always has to be a numpy array with dtype=object, implementing from_ragged_array with non-interleaved coordinates will probably be a lot simpler than invoking geoarrow/arrow. If a pandas.Series is acceptable, there may be some optimizations possible to avoid copies.
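
For reference, a sketch of that route using the same buffers as the from_geobuffers() example above:

import numpy as np
import shapely
from shapely import GeometryType

coords = np.column_stack([
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),   # x
    np.array([6.0, 7.0, 8.0, 9.0, 10.0]),  # y
])
offsets = np.array([0, 2, 5], dtype=np.int64)
lines = shapely.from_ragged_array(GeometryType.LINESTRING, coords, offsets=(offsets,))
# -> <LINESTRING (1 6, 2 7)>, <LINESTRING (3 8, 4 9, 5 10)>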

@benbovy
Member

benbovy commented Jul 7, 2023

AFAIK there aren't many constraints on the Xarray side regarding the types of array supported. I think any type implementing the Python (Numpy) array API could in principle be wrapped in an Xarray variable.

If the goal is to use it with an Xarray index, there are even fewer constraints: Xarray internally relies on some adapters to reuse Pandas (multi-)index objects within both Xarray variables and indexes so that they share the same data. Likewise, it should also be possible to reuse a pyarrow.Array in an xvec.GeometryIndex and its corresponding Xarray variable. If that's not currently possible we should allow it!

@benbovy
Member

benbovy commented Jul 7, 2023

@rabernat I don't think steps 5 and 6 do any copy of the underlying shapely objects, but I might be wrong.

I agree that Arrow will probably not bring a lot of added value if at some point (step 4) we need to create new shapely (GEOS) objects that can't reuse the same arrow buffers.

@benbovy
Member

benbovy commented Jul 7, 2023

For Geographic indexes built on top of S2Geometry, I think @paleolimbot has shown that it could be possible to use S2 with Arrow without copying the data. So I guess it would make more sense to plug in Arrow in this case. However, there is still a lot of work to put all the pieces together, at least on the Python side.

@martinfleis
Member Author

Here I wonder if geoarrow could help with a very efficient decoding of the raw array data into shapely geometries

I don't think so. We should ensure that the way we encode geometries follows a similar logic to what geoarrow implements, but I don't think there's any reason to plug arrow into this process when the source is NetCDF or Zarr.

Once those two are done, we would want to decide where to create the xvec indexes.

That depends on what the cf_xarray folks want to do with their package. From my perspective it makes sense that decoding geometry gives you back an xvec.GeometryIndex rather than a vanilla array. If the additional dependency on xvec and pyproj (and hence PROJ) is an issue, we can potentially depend on it only optionally.

Likewise, it should also be possible to reuse a pyarrow.Array in an xvec.GeometryIndex and its corresponding Xarray variable. If that's not currently possible we should allow it!

Not an expert on pyarrow, but I am afraid that shapely's implementation of the GEOS bindings may not be compatible with this idea, as it assumes a numpy array. But I may be wrong here.

@dcherian

dcherian commented Jul 9, 2023

From my perspective it makes sense that decoding geometry gives you back an xvec.GeometryIndex rather than a vanilla array.

Do you construct the index from an array of shapely objects anyway? In which case, it seems preferable to have xvec depend on cf-xarray for decoding to shapely objects.

@martinfleis
Member Author

From my perspective it makes sense that decoding geometry gives you back an xvec.GeometryIndex rather than a vanilla array.

Do you construct the index from an array of shapely objects anyway? In which case, it seems preferable to have xvec depend on cf-xarray for decoding to shapely objects.

Yes, we do expect an array of shapely objects. I am fine depending on cf-xarray and rolling our own encode and decode logic that uses cf_to_shapely and shapely_to_cf and ensures the result is a GeometryIndex.

@kylebarron

With FixedShapeTensorArray, there are now methods for converting to/from a multidimensional numpy array, though it's not clear if you can get full xarray metadata to round trip.

I can convert an xarray.DataArray to pyarrow with:

import pyarrow as pa
import xarray as xr
import numpy as np

def data_array_to_pyarrow(arr: xr.DataArray) -> pa.FixedShapeTensorArray:
    value_type = pa.from_numpy_dtype(arr.dtype)
    shape = arr.shape[1:]
    size = arr.size // arr.shape[0]  # elements per tensor (integer division)

    arrow_type = pa.fixed_shape_tensor(value_type, shape, dim_names=arr.dims[1:])
    pa_arr = pa.FixedSizeListArray.from_arrays(np.ravel(arr, order="C"), size)

    return pa.ExtensionArray.from_storage(arrow_type, pa_arr)

But this seems to omit:

  • The coordinates of all dimensions (though presumably the first dimension's coordinates are saved as another column)
  • The dim name of the first dimension (though it could round trip from the other column's name)

It doesn't seem possible to round trip the other dimensions' coordinates after the first?

Testing with some example data from the xvec docs:

import geopandas as gpd
import numpy as np
import pandas as pd
import pyarrow as pa
import xarray as xr
import xvec
from geodatasets import get_path

counties = gpd.read_file(get_path("geoda.natregimes"))

pop1960 = xr.DataArray(counties.PO60, coords=[counties.geometry], dims=["county"])
pop1960 = pop1960.xvec.set_geom_indexes("county", crs=counties.crs)

population = xr.DataArray(
    counties[["PO60", "PO70", "PO80", "PO90"]],
    coords=(counties.geometry, [1960, 1970, 1980, 1990]),
    dims=("county", "year"),
).xvec.set_geom_indexes("county", crs=counties.crs)

the population DataArray has county coordinates, which could be saved as a WKB or GeoArrow geometry column in GeoArrow/GeoParquet, but it also has coordinates on year defined as array([1960, 1970, 1980, 1990]). I'm not sure if/how that could be serialized onto the arrow type. Maybe Arrow could extend the available metadata on the type?

@martinfleis
Member Author

Interesting! If we could save geometries as GeoArrow, could we potentially load this directly to lonboard for visualisation? Assuming we resolve the coordinate issue.

@martinfleis
Member Author

Also, this is limited to DataArray, right? So no flexibility to save a Dataset so far.

@kylebarron

Interesting! If we could save geometries as GeoArrow, could we potentially load this directly to lonboard for visualisation? Assuming we resolve the coordinate issue.

Yeah, that's the idea.

Also, this is limited to DataArray, right? So no flexibility to save a Dataset so far.

I'm still pretty new to xarray, so I'm still wrapping my head around some concepts, but you can have multiple FixedShapeTensorArray columns in a single Arrow table. So as long as all DataArrays are aligned on the same axis (i.e. the unit of observation of the table), you can store as many in a GeoArrow table as you wish.
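
For example, continuing with the (hypothetical) data_array_to_pyarrow helper and the counties / population example from above:

import pyarrow as pa

table = pa.table({
    "geometry": pa.array(counties.geometry.to_wkb()),  # WKB bytes, one per county
    "population": data_array_to_pyarrow(population),   # (county, year) tensor column
})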

@benbovy
Member

benbovy commented Nov 22, 2023

I'm not yet very familiar with arrow, but reading your comment @kylebarron it looks like reusing FixedShapeTensorArray in Xarray (or converting it to/from xarray objects) would be better addressed at the xarray.Variable level (or, later, the NamedArray level).

An Xarray Variable may wrap all sorts of array-like objects beyond numpy and dask arrays. You can see, for example, a number of wrapper classes in https://github.com/pydata/xarray/blob/main/xarray/core/indexing.py, e.g., PandasIndexingAdapter, which wraps a pandas.Index. Those wrapper classes are internal to Xarray, though. These will probably be heavily refactored while implementing NamedArray, although I don't know if there are plans to provide an extension mechanism similar to Pandas' or PyArrow's ExtensionArray (cc @andersy005).

Xarray DataArray and Dataset are essentially collections of aligned Variable objects (a DataArray can be viewed as a special case of a Dataset with a unique data variable). Both are closer to the pyarrow.Table concept I think.

Mid- or long-term, interoperability with Arrow is something we'll need to address in Xarray itself if we want to keep the same level of interoperability as we have with Pandas.
