Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend Arrow support to cover nullable data #4049

Draft
wants to merge 5 commits into
base: dev
Choose a base branch
from

Conversation

eddelbuettel
Copy link
Member

The arrowio header provides import export support from/to TileDB and Arrow with its interface of two void* pointers. This PR extends the support to cover 'nullable' aka 'validity map' data.

The PR is in need of some tests but the existing tests is a little involved between Python, pybind11 and C++ so @ihnorton has kindly 'volunteered' to add this.

The PR will remain a draft til we have tests.


TYPE: FEATURE
DESC: Nullable support for Arrow

@shortcut-integration
Copy link

This pull request has been linked to Shortcut Story #27472: Extend arrow i/o operations to nullable columns.

@Shelnutt2
Copy link
Member

Looks like a few msvc errors, https://github.com/TileDB-Inc/TileDB/actions/runs/4758463351/jobs/8456556083?pr=4049#step:10:593

 D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(619,18): message : see usage of 'n' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(621,18): message : see usage of 'n' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(624,15): error C3863: array type 'uint8_t [n]' is not assignable (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(650,39): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : see usage of 'arw_array' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(651,39): error C2131: expression did not evaluate to a constant (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : failure was caused by a read of a variable outside its lifetime (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(640,34): message : see usage of 'arw_array' (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]
D:\a\TileDB\TileDB\tiledb\sm\cpp_api\arrow_io_impl.h(657,24): error C3863: array type 'int64_t [num_offsets+1]' is not assignable (compiling source file D:\a\TileDB\TileDB\test\src\unit-arrow.cc) [D:\a\tdbbd\tiledb\test\tiledb_unit.vcxproj]

@eddelbuettel
Copy link
Member Author

I also saw an Ubuntu build fail but figure that would be spurious. My changes so far are quite gingerly and limited. Let's see what happens once new tests get injected.

@kounelisagis
Copy link
Member

@eddelbuettel if you see my last commit, we were basically never testing col sizes greater than zero.
My change cause the test to fail with the message:

  WriterBase: Invalid offsets for attribute utf_string1; the last offset: 0 is
  not equal to the size of the data buffer: 71

or something similar.

Do you have any idea? Spent some hours but couldn't find any solution.

@eddelbuettel
Copy link
Member Author

Howdy. Love that you are trying to resurrect the PR. Theses days it would probably be better to wrap some of the code in the nanoarrow (header-only) helper functions we already use in tiledb-r and tiledbsoma (mostly the R parts). See https://github.com/apache/arrow-nanoarrow and check out the recently expanded python side (to manipulate objects; nanoarrow is really three packages namely the header-only C/C++ helper, a somewhat mature R package to work on object s and a growing Python package).

I think I wrote this PR after I had similar work in the R package where I since have changed things. (Just took a look.) Hm, that is on the other hand quite motivated by SOMA's libtiledbsoma and its ColumnBuffers etc.

My pedestrian approach has usually been to setup a similar mock function standalone, or as an R extension, and ten abstract out the pure C++ side for Core.

Back to your PR: I honestly do not know what the col size is mad about. Is there something driving this now and/or urgency?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants