Chunking support #1783

Open
eschnett opened this issue Apr 11, 2024 · 2 comments

Comments

@eschnett

When a large ndarray is stored as a compressed binary block, the whole block (or at least its beginning) needs to be read and decompressed even when only a small subarray is read. "Chunking" remedies this: instead of storing an ndarray as a single binary block, it is stored as a set of smaller blocks that are compressed and stored independently.
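
To illustrate the idea, here is a minimal sketch using only Python's standard library. It is not ASDF code: the names `write_chunks`, `read_slice`, and `CHUNK_SIZE` are hypothetical, and a flat byte string stands in for the ndarray. It shows that a small read only decompresses the chunks it overlaps.

```python
import zlib

CHUNK_SIZE = 1024  # bytes per chunk; an arbitrary choice for this sketch


def write_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Compress each fixed-size piece of `data` independently."""
    return [
        zlib.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def read_slice(chunks, start: int, stop: int, chunk_size: int = CHUNK_SIZE):
    """Return (data[start:stop], number of chunks that were decompressed)."""
    if start >= stop:
        return b"", 0
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    out = bytearray()
    for idx in range(first, last + 1):
        block = zlib.decompress(chunks[idx])  # only overlapping chunks are touched
        lo = max(start - idx * chunk_size, 0)
        hi = min(stop - idx * chunk_size, len(block))
        out += block[lo:hi]
    return bytes(out), last - first + 1


data = bytes(i % 251 for i in range(10_000))
chunks = write_chunks(data)
piece, touched = read_slice(chunks, 2000, 2100)
print(f"decompressed {touched} of {len(chunks)} chunks")
```

With a single compressed block, the same 100-byte read would have to decompress everything up to byte 2100.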

Are there plans to support this? Can this be implemented as an extension?

One simple approach would be to introduce a new YAML tag core/chunked-ndarray that consists of a YAML sequence of chunks, each pairing an offset with an ndarray, for example:

chunky: !core/chunked-ndarray-1.0.0
  - !core/ndarray-chunk-1.0.0
    offset: [0,0]
    data: !core/ndarray-1.0.0
      source: ... # the usual ndarray stuff here
  - !core/ndarray-chunk-1.0.0
    offset: [100,0]
    data: !core/ndarray-1.0.0
      source: ... # the usual ndarray stuff here
  - !core/ndarray-chunk-1.0.0
    offset: [0,100]
    data: !core/ndarray-1.0.0
      source: ... # the usual ndarray stuff here
  # possibly more chunks here
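
A reader for such a layout could work roughly as sketched below. This is a sketch under stated assumptions, not an existing ASDF API: `read_region` is a hypothetical helper, offsets are assumed to be (row, column) positions of each chunk's first element, and plain nested lists stand in for the ndarrays that would really be loaded lazily from their blocks.

```python
def read_region(chunks, start, shape):
    """Assemble a 2D region from the chunks that overlap it.

    `chunks` mirrors the proposed YAML: a list of {"offset": [r, c], "data": rows}.
    Elements not covered by any chunk are left as None.
    """
    r0, c0 = start
    nrows, ncols = shape
    out = [[None] * ncols for _ in range(nrows)]
    for chunk in chunks:
        cr, cc = chunk["offset"]
        data = chunk["data"]
        height = len(data)
        width = len(data[0]) if height else 0
        # Intersection of the requested region with this chunk
        top, bottom = max(r0, cr), min(r0 + nrows, cr + height)
        left, right = max(c0, cc), min(c0 + ncols, cc + width)
        if top >= bottom or left >= right:
            continue  # no overlap: this chunk would never be read or decompressed
        for r in range(top, bottom):
            out[r - r0][left - c0:right - c0] = data[r - cr][left - cc:right - cc]
    return out


# A 4x4 array with value 10*row + col, split into four 2x2 chunks
chunks = [
    {"offset": [r, c],
     "data": [[10 * (r + i) + (c + j) for j in range(2)] for i in range(2)]}
    for r in (0, 2) for c in (0, 2)
]
region = read_region(chunks, start=(1, 1), shape=(2, 2))
print(region)  # [[11, 12], [21, 22]]
```

The linear scan over all chunks is only for brevity; with a regular chunk grid, the overlapping chunks could be computed directly from the offsets.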

Has there been any work in this direction?

@braingram
Contributor

Thanks for opening this issue.

There has been some work on supporting the zarr storage format within ASDF. This is implemented via an extension: https://github.com/asdf-format/asdf-zarr. It's a new package, so please let me know if you plan to use it "in production" so we can give it another review. Also feel free to give it a try and open issues if you find anything. The extension offers a few options:

  • storing the zarr data inside ASDF blocks (one chunk per block, which I think is most similar to what you described)
  • referencing external zarr storage (either DirectoryStore "flat files", S3 stores, or any of the many formats zarr supports).

The use of zarr also opens up a second place where compression can be controlled (which can get a bit confusing).

@eschnett
Author

@braingram Nice! We are currently discussing storage formats, and both ASDF and Zarr are contenders that have various advantages and disadvantages. On the surface, using Zarr chunking with ASDF single-file storage seems like an excellent choice. I will have a look.
