BUG: read_parquet gives TypeError if dtype is a pyarrow list #57411

Open
3 tasks done
kinianlo opened this issue Feb 14, 2024 · 7 comments
Assignees
Labels
Arrow (pyarrow functionality), Bug, IO Parquet (parquet, feather)

Comments

@kinianlo

kinianlo commented Feb 14, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pyarrow as pa
import pandas as pd

list_int = pa.list_(pa.int64())
s = pd.Series([[1, 1], [2, 2]], dtype=pd.ArrowDtype(list_int))

df = pd.DataFrame(s, columns=['col'])
df.to_parquet('ex.parquet')
pd.read_parquet('ex.parquet')

Issue Description

The example code produces TypeError: data type 'list<item: int64>[pyarrow]' not understood

Expected Behavior

No error should occur, and pd.read_parquet('ex.parquet') should give a DataFrame identical to df.

Installed Versions

INSTALLED VERSIONS

commit : fd3f571
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 6.1.58+
Version : #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.0
numpy : 1.25.2
pytz : 2023.4
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.8
pytest : 7.4.4
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.4
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 7.34.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.58.1
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.19.2
pyarrow : 10.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : 2.0.25
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@kinianlo kinianlo added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 14, 2024
@rhshadrach
Member

Thanks for the report. Can you give this issue a meaningful title? Currently it is BUG:.

@kinianlo kinianlo changed the title BUG: BUG: read_parquet gives TypeError if dtype is a pyarrow list Feb 15, 2024
@kinianlo
Author

> Thanks for the report. Can you give this issue a meaningful title? Currently it is BUG:.

Thanks! Title updated.

@rhshadrach
Member

Thanks for the report - further investigations and PRs to fix are welcome!

@rhshadrach rhshadrach added IO Parquet parquet, feather Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 15, 2024
@diogojarodrigues

take

@kinianlo
Author

kinianlo commented Mar 8, 2024

I have done some digging into the problem. By inspecting the stack trace:

Traceback (most recent call last):
  File "/Users/kinianlo/github/pandas/scripts/debug.py", line 19, in <module>
    pd.read_parquet('ex.parquet', engine='auto')
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pandas/io/parquet.py", line 670, in read_parquet
    return impl.read(
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pandas/io/parquet.py", line 279, in read
    result = pa_table.to_pandas(**to_pandas_kwargs)
  File "pyarrow/array.pxi", line 867, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 4085, in pyarrow.lib.Table._to_pandas
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 764, in table_to_blockmanager
    ext_columns_dtypes = _get_extension_dtypes(
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 817, in _get_extension_dtypes
    pandas_dtype = _pandas_api.pandas_dtype(dtype)
  File "pyarrow/pandas-shim.pxi", line 140, in pyarrow.lib._PandasAPIShim.pandas_dtype
  File "pyarrow/pandas-shim.pxi", line 143, in pyarrow.lib._PandasAPIShim.pandas_dtype
  File "/Users/kinianlo/.pyenv/versions/3.10.13/lib/python3.10/site-packages/pandas/core/dtypes/common.py", line 1636, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type 'list<item: int64>[pyarrow]' not understood

it appears that the problem comes from the table_to_blockmanager function in pyarrow's pandas_compat.py. If pandas metadata exists in the pyarrow table (written into the parquet file by pandas) and the argument ignore_metadata=False, table_to_blockmanager tries to construct pandas extension dtypes from that metadata. The _get_extension_dtypes function that it calls then attempts to reconstruct a numpy dtype from the numpy_type field in the metadata, even when a types_mapper is supplied. This is harmless when numpy_type is object, but when it is e.g. list<item: int64>[pyarrow], np.dtype() cannot parse the string and raises the TypeError.

Example of metadata:

The metadata that I was referring to can be obtained through a pyarrow table, e.g. pq.read_table('ex.parquet').schema.metadata

  • Bad metadata that causes the reported error:

{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col", "field_name": "col", "pandas_type": "list[int64]", "numpy_type": "list<item: int64>[pyarrow]", "metadata": null}], "creator": {"library": "pyarrow", "version": "11.0.0"}, "pandas_version": "2.1.4"}'}

  • Good metadata:

{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col", "field_name": "col", "pandas_type": "list[int64]", "numpy_type": "object", "metadata": null}], "creator": {"library": "pyarrow", "version": "11.0.0"}, "pandas_version": "2.1.4"}'}

temporary fix

Use pyarrow.parquet.read_table to read the parquet file from disk, then use .to_pandas(ignore_metadata=True) to convert the pyarrow table to a pandas DataFrame while ignoring the metadata.

potential long term fix

Ensure that numpy_type is set to object when appropriate, e.g. when the column's dtype is a pyarrow list type.

@diogojarodrigues

diogojarodrigues commented Mar 27, 2024

Hello everyone!
I managed to put together a fix for this problem, but I am not sure it is a good solution.
I added another elif branch (the last one below) to the pandas_dtype function in the common.py file:

if isinstance(dtype, np.ndarray):
    return dtype.dtype
elif isinstance(dtype, (np.dtype, ExtensionDtype)):
    return dtype
elif "list" in str(dtype) and "pyarrow" in str(dtype):
    return dtype

What do you think about this solution?

@kinianlo
Author

kinianlo commented May 14, 2024

After some further digging, I propose the following two potential solutions:

  1. Ensure that the relevant numpy_type in the metadata is set to object when the dataframe is saved to parquet. This has been proven to work, although I am not entirely sure how to do it, or whether it would affect other behaviour.
  2. When the parquet file is read, column dtypes are first looked up in pandas.core.dtypes.base._registry (inside the pandas_dtype function). If the registry fails to find a dtype, np.dtype(dtype) is called, which causes the error because numpy does not understand the arrow type string. The fix is to make sure the dtype can be found in the registry. The registry uses the ArrowDtype.construct_from_string method. Atomic types such as int64[pyarrow] are already recognised by construct_from_string, as they should be, but it fails to recognise list<item: int64>[pyarrow]. All we need to do is ensure that a string representation such as list<item: int64>[pyarrow] is recognised. This fix should not interfere with other behaviour.
