[Data] Add support for objects to Arrow blocks #45272

Open
wants to merge 7 commits into
base: master


terraflops1048576
Contributor

Why are these changes needed?

Currently, Ray Data does not support blocks or batches that hold arbitrary Python objects in one column and multi-dimensional arrays in another. Providing such a batch raises an exception because:

  1. Because the batch contains an arbitrary Python object, conversion to the Arrow block format fails with ArrowNotImplemented with dtype 17. BlockAccessor.batch_to_block then falls back to return pd.DataFrame(dict(batch)).
  2. However, this particular DataFrame constructor does not support columns with numpy.ndarray objects, so it throws the exception listed in the linked issue.

This change enables Python object storage in Arrow blocks by defining an Arrow extension type that represents the Python objects as variable-sized large binary values. The performance benefits suggested in the comments would be a bonus.
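To illustrate the storage idea, here is a minimal pure-Python sketch (not the code in this PR) of how a column of arbitrary objects can be packed into a variable-sized binary layout: one contiguous byte buffer plus an offsets list with one more entry than there are values, which is how Arrow lays out binary arrays (large binary uses 64-bit offsets). The names `encode_objects` and `decode_object` are hypothetical, and using pickle as the serializer is an assumption. In pyarrow, a layout like this would typically sit behind an extension type whose storage type is `pa.large_binary()`.

```python
import pickle
from typing import Any, List, Tuple


def encode_objects(objs: List[Any]) -> Tuple[List[int], bytes]:
    """Pickle each object and pack the payloads into one buffer plus offsets.

    offsets[i] and offsets[i + 1] delimit the bytes of element i, mirroring
    the offsets/data split of a variable-sized binary column.
    """
    offsets = [0]
    chunks = []
    for obj in objs:
        payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        chunks.append(payload)
        offsets.append(offsets[-1] + len(payload))
    return offsets, b"".join(chunks)


def decode_object(offsets: List[int], data: bytes, i: int) -> Any:
    """Recover element i by slicing its byte range and unpickling it."""
    return pickle.loads(data[offsets[i]:offsets[i + 1]])


# Hypothetical column mixing types that a plain Arrow column cannot hold.
# Note: None is stored here as a pickled value, not as an Arrow null.
column = [{"a": 1}, (1, 2.5), "text", None]
offsets, data = encode_objects(column)
roundtrip = [decode_object(offsets, data, i) for i in range(len(column))]
```

Because every value becomes an opaque byte string, a column like this can hold any picklable object, at the cost of losing Arrow's typed, zero-copy access to the values.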

I'm not sure that this is the correct approach or that I've properly patched all of the places, so some help would be appreciated!

Related issue number

Resolves #45235

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Peter Wang <peter.wang9812@gmail.com>
Peter Wang added 2 commits May 11, 2024 21:56
Signed-off-by: Peter Wang <peter.wang9812@gmail.com>
Signed-off-by: Peter Wang <peter.wang9812@gmail.com>
@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch from abe7c58 to 516ff57 Compare May 12, 2024 05:46
…fix type annnotation

Signed-off-by: Peter Wang <peter.wang9812@gmail.com>
@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch from 516ff57 to 3b38131 Compare May 12, 2024 05:59
@terraflops1048576 terraflops1048576 changed the title Add support for objects to Arrow blocks [Data] Add support for objects to Arrow blocks May 12, 2024
@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch 2 times, most recently from 2d82297 to 978b30e Compare May 12, 2024 19:23
Signed-off-by: Peter Wang <peter.wang9812@gmail.com>
@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch from 978b30e to 014949f Compare May 12, 2024 20:26
Peter Wang added 2 commits May 12, 2024 16:55
Signed-off-by: Peter Wang <peter.wang9812@gmail.com>
Signed-off-by: Peter Wang <peter.wang9812@gmail.com>
@raulchen raulchen self-assigned this May 13, 2024
@anyscalesam
Collaborator

@terraflops1048576 this could be a big contribution, but we want to think a little deeper about this with you to properly co-develop it. Can you reach out to me on Ray Slack so we can set up some time to discuss further?

My handle is "Sam (Ray Team)"

We should get a formal REP in ray-enhancements for this as well: https://github.com/ray-project/enhancements

cc @bveeramani

Development

Successfully merging this pull request may close these issues.

[Data] Can't return array-like data from UDF if batch contains unsupported type
3 participants