Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support categorical data types for Parquet #5706

Open
kklemon opened this issue Apr 4, 2023 · 17 comments · May be fixed by #6892
Open

Support categorical data types for Parquet #5706

kklemon opened this issue Apr 4, 2023 · 17 comments · May be fixed by #6892
Assignees
Labels
enhancement New feature or request

Comments

@kklemon
Copy link

kklemon commented Apr 4, 2023

Feature request

Huggingface datasets does not seem to support categorical / dictionary data types for Parquet as of now. There seems to be a TODO in the code for this feature but no implementation yet. Below you can find sample code to reproduce the error that is currently thrown when attempting to read a Parquet file with categorical columns:

import pandas as pd
import pyarrow.parquet as pq

from datasets import load_dataset

# Create categorical sample DataFrame
df = pd.DataFrame({'type': ['foo', 'bar']}).astype('category')
df.to_parquet('data.parquet')

# Read back as pyarrow table
table = pq.read_table('data.parquet')
print(table.schema)
# type: dictionary<values=string, indices=int32, ordered=0>

# Load with huggingface datasets
load_dataset('parquet', data_files='data.parquet')

Error:

Traceback (most recent call last):                                                                                                                                                                                                        
  File ".venv/lib/python3.10/site-packages/datasets/builder.py", line 1875, in _prepare_split_single
    writer.write_table(table)
  File ".venv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 566, in write_table
    self._build_writer(inferred_schema=pa_table.schema)
  File ".venv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 379, in _build_writer
    inferred_features = Features.from_arrow_schema(inferred_schema)                                                                                     
  File ".venv/lib/python3.10/site-packages/datasets/features/features.py", line 1622, in from_arrow_schema
    obj = {field.name: generate_from_arrow_type(field.type) for field in pa_schema}
  File ".venv/lib/python3.10/site-packages/datasets/features/features.py", line 1622, in <dictcomp>
    obj = {field.name: generate_from_arrow_type(field.type) for field in pa_schema}
  File ".venv/lib/python3.10/site-packages/datasets/features/features.py", line 1361, in generate_from_arrow_type
    raise NotImplementedError  # TODO(thom) this will need access to the dictionary as well (for labels). I.e. to the py_table
NotImplementedError

Motivation

Categorical data types, as offered by Pandas and implemented with the DictionaryType dtype in pyarrow can significantly reduce dataset size and are a handy way to turn textual features into numerical representations and back. Lack of support in Huggingface datasets greatly reduces compatibility with a common Pandas / Parquet feature.

Your contribution

I could provide a PR. However, it would be nice to have an initial complexity estimate from one of the core developers first.

@kklemon kklemon added the enhancement New feature or request label Apr 4, 2023
@lhoestq
Copy link
Member

lhoestq commented Apr 4, 2023

Hi ! We could definitely a type that holds the categories and uses a DictionaryType storage. There's a ClassLabel type that is similar with a 'names' parameter (similar to a id2label in deep learning frameworks) that uses an integer array as storage.

It can be added in features.py. Here are some pointers:

  • the conversion from HF type to PyArrow type is done in get_nested_type
  • the conversion from Pyarrow type to HF type is done in generate_from_arrow_type
  • encode_nested_example and decode_nested_example are used to do user's value (what users see) <-> storage value (what is in the pyarrow.array) if there's any conversion to do

@mhattingpete
Copy link

@kklemon did you implement this? Otherwise I would like to give it a try

@kklemon
Copy link
Author

kklemon commented May 10, 2023

@mhattingpete no, I hadn't time for this so far. Feel free to work on this.

@mhattingpete
Copy link

#self-assign

@mmcdermott
Copy link

This would be super useful, so +1.

Also, these prior issues/PRs seem relevant:
#1906
#1936

@amrit110
Copy link

Hi, this is a really useful feature, has this been implemented yet?

@mmcdermott
Copy link

Hey folks -- I'm thinking about trying a PR for this. As far as I can tell the only sticky point is that auto-generation of features from a pyarrow schema will fail under the current generate_from_arrow_type function because there is no encoding of the categorical string label -> int map in the pa.dictionary type itself; that is stored with the full array.

I see two ways to solve this. Option 1 is to require datasets with categorical types to use pyarrow schema metadata to encode the entire HF feature dictionary, that way categorical types don't ever need to be inferred from the pa type alone. The downside to this is that it means that these datasets will be a bit brittle, as if the feature encoding API ever changes, they will suddenly be unloadable.

The other option is to modify generate_from_arrow_type to take per-field metadata, and include just that metadata (the category labels) in the schema metadata.

Does anyone at HF have any preference on these two (or alternate) approaches?

@lhoestq
Copy link
Member

lhoestq commented Sep 8, 2023

Maybe we don't need to store the string label -> int map in the categorical for the corresponding datasets feature ?

@mmcdermott
Copy link

mmcdermott commented Sep 8, 2023 via email

@lhoestq
Copy link
Member

lhoestq commented Sep 8, 2023

Well IIRC you can concatenate two Arrow arrays with different dictionaries together. But for datasets would mean updating the datasets features when concatenating two arrays of the same type, which is not supported right now. That's why if there is a way to have it without storing the mapping in the feature object it would be nice.

For decoding we do have the string<->integer mapping from the array dictionary attribute so we're fine. For encoding I think it can work if we only encode when converting python objects to pyarrow in TypedSequence.__arrow_array__ in arow_writer.py. It can work by converting the python objects to a pyarrow array and then use the dictionary_encode method.

Another concern about concatenation: I noticed pyarrow creates the new dictionary and indices in memory when concatenating two dictionary encoded arrays. This can be a problem for big datastets, and we should probably use ChunkedArray objects instead. This can surely be taken care of in array_concat in table.py

cc @mariosasko in case you have other ideas

@mmcdermott
Copy link

mmcdermott commented Sep 8, 2023 via email

@mmcdermott
Copy link

@lhoestq @mariosasko just re-pinging on this so I can push forward further here. What are your thoughts on disallowing concatenation of categorical arrays for now such that the index to category map can be stored in the schema metadata? And/or other approaches that should be taken here?

@lhoestq
Copy link
Member

lhoestq commented Sep 22, 2023

I think the easiest for now would be to add a dictionary_decode argument to the parquet loaders that would convert the dictionary type back to strings when set to True, and make dictionary_decode=False raise NotImplementedError for now if there are dictionary type columns. Would that be ok as a first step ?

@mmcdermott
Copy link

I mean, that would certainly be easiest but I don't think it really solves this issue in a meaningful way. This just changes the burden from string conversion from the user to HF Datasets, but doesn't actually enable HF Datasets to take advantage of the (very significant) storage and associated speed/memory savings offered by using categorical types. Given that those savings are what is of real interest here, I think keeping it explicit that it is not supported (and forcing the user to do the conversion) might actually be better that way this problem stays top of mind.

Is there an objection with supporting categorical types explicitly through the medium I outlined above, where the error is raised if you try to concat two differently typed categorical columns?

@lhoestq
Copy link
Member

lhoestq commented Sep 22, 2023

This just changes the burden from string conversion from the user to HF Datasets, but doesn't actually enable HF Datasets to take advantage of the (very significant) storage and associated speed/memory savings offered by using categorical types.

There's already a ClassLabel type that does pretty much the same thing (store as integer instead of string) if it can help

Is there an objection with supporting categorical types explicitly through the medium I outlined above, where the error is raised if you try to concat two differently typed categorical columns?

Yea we do concatenation quite often (e.g. in map) so I don't think this is a viable option

@mmcdermott
Copy link

mmcdermott commented Sep 22, 2023 via email

@lhoestq
Copy link
Member

lhoestq commented Sep 22, 2023

Arrow IPC seems to require unified dictionaries anyway so actually we could surely focus only on this use case indeed @mmcdermott

So defining a new Feature type in datasets that contains the dictionary mapping should be fine (and concatenation would work out of the box), and it should also take care of checking that the data it encodes/decodes has the right dictionary. Do you think it can be done without impacting iterating speed for the other types @mariosasko ?

Right now we have little bandwidth to work in this kind of things though

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants