BUG: read_parquet gives TypeError if dtype is a pyarrow list #57411

Open
3 tasks done
kinianlo opened this issue Feb 14, 2024 · 7 comments
Assignees
Labels
Arrow (pyarrow functionality), Bug, IO Parquet (parquet, feather)

Comments

@kinianlo

kinianlo commented Feb 14, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pyarrow as pa
import pandas as pd

list_int = pa.list_(pa.int64())
s = pd.Series([[1, 1], [2, 2]], dtype=pd.ArrowDtype(list_int))

df = pd.DataFrame(s, columns=['col'])
df.to_parquet('ex.parquet')
pd.read_parquet('ex.parquet')

Issue Description

The example code produces TypeError: data type 'list<item: int64>[pyarrow]' not understood

Expected Behavior

No error should occur, and pd.read_parquet('ex.parquet') should give a DataFrame identical to df.

Installed Versions

INSTALLED VERSIONS

commit : fd3f571
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 6.1.58+
Version : #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.0
numpy : 1.25.2
pytz : 2023.4
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.8
pytest : 7.4.4
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.4
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 7.34.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.58.1
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.19.2
pyarrow : 10.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.25
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@kinianlo kinianlo added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 14, 2024
@rhshadrach
Member

Thanks for the report. Can you give this issue a meaningful title? Currently it is BUG:.

@kinianlo kinianlo changed the title BUG: BUG: read_parquet gives TypeError if dtype is a pyarrow list Feb 15, 2024
@kinianlo
Author

> Thanks for the report. Can you give this issue a meaningful title? Currently it is BUG:.

Thanks! Title updated.

@rhshadrach
Member

Thanks for the report - further investigations and PRs to fix are welcome!

@rhshadrach rhshadrach added IO Parquet parquet, feather Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 15, 2024
@diogojarodrigues

take

@kinianlo
Author

kinianlo commented Mar 8, 2024

I have done some digging into the problem. By inspecting the stack trace:

Traceback (most recent call last):
  File "/Users/kinianlo/github/pandas/scripts/debug.py", line 19, in <module>
    pd.read_parquet('ex.parquet', engine='auto')
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pandas/io/parquet.py", line 670, in read_parquet
    return impl.read(
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pandas/io/parquet.py", line 279, in read
    result = pa_table.to_pandas(**to_pandas_kwargs)
  File "pyarrow/array.pxi", line 867, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 4085, in pyarrow.lib.Table._to_pandas
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 764, in table_to_blockmanager
    ext_columns_dtypes = _get_extension_dtypes(
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 817, in _get_extension_dtypes
    pandas_dtype = _pandas_api.pandas_dtype(dtype)
  File "pyarrow/pandas-shim.pxi", line 140, in pyarrow.lib._PandasAPIShim.pandas_dtype
  File "pyarrow/pandas-shim.pxi", line 143, in pyarrow.lib._PandasAPIShim.pandas_dtype
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pandas/core/dtypes/common.py", line 1636, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type 'list<item: int64>[pyarrow]' not understood

it appears that the problem comes from the table_to_blockmanager function in pyarrow's pandas_compat.py. If pandas metadata exists in the pyarrow table (written into the parquet file by pandas) and the argument ignore_metadata=False, table_to_blockmanager tries to construct pandas extension dtypes from that metadata. The _get_extension_dtypes function that it calls then attempts to reconstruct a numpy dtype from the numpy_type field in the metadata, even when a types_mapper is supplied. This is harmless when numpy_type is object, but when it is e.g. list<item: int64>[pyarrow], np.dtype() cannot parse the string and raises the TypeError.

Example of metadata:

The metadata that I was referring to can be obtained through a pyarrow table, e.g. pq.read_table('ex.parquet').schema.metadata

  • Bad metadata that causes the reported error:

{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col", "field_name": "col", "pandas_type": "list[int64]", "numpy_type": "list<item: int64>[pyarrow]", "metadata": null}], "creator": {"library": "pyarrow", "version": "11.0.0"}, "pandas_version": "2.1.4"}'}

  • Good metadata:

{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col", "field_name": "col", "pandas_type": "list[int64]", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "11.0.0"}, "pandas_version": "2.1.4"}'}

temporary fix

Use pyarrow.parquet.read_table to read the parquet file from disk, then use .to_pandas(ignore_metadata=True) to convert the pyarrow table to a pandas DataFrame while ignoring the metadata.

potential long term fix

Ensure that numpy_type is set to object when appropriate, e.g. when the column's dtype is a pyarrow list type.

@diogojarodrigues

diogojarodrigues commented Mar 27, 2024

Hello everyone!
I managed to put together a fix for this problem, but I am not sure it is a good solution.
I added another elif branch (the last one below) to the pandas_dtype function in the common.py file:

if isinstance(dtype, np.ndarray):
    return dtype.dtype
elif isinstance(dtype, (np.dtype, ExtensionDtype)):
    return dtype
elif "list" in str(dtype) and "pyarrow" in str(dtype):
    return dtype

What do you think about this solution?

@kinianlo
Author

kinianlo commented May 14, 2024

After some further digging, I propose the following two potential solutions:

  1. Ensure that the relevant numpy_type in the metadata is set to object when the dataframe is saved to parquet. This has been proven to work, although I am not entirely sure how to do it, or whether it would affect other behaviour.
  2. When the parquet file is read, column dtypes are first looked up in pandas.core.dtypes.base._registry (inside the pandas_dtype function). If the registry fails to find a dtype, np.dtype(dtype) is called, which causes the error because numpy does not understand the arrow type string. The fix is to make sure the dtype can be found in the registry. The registry uses the ArrowDtype.construct_from_string method. Atomic types such as int64[pyarrow] are already recognised by construct_from_string, as they should be, but it fails to recognise list<item: int64>[pyarrow]. All we need to do is ensure that a string representation such as list<item: int64>[pyarrow] is recognised. This fix should not interfere with other behaviour.
