feat: add `pandera.io.to_pyarrow_schema` #1047

the-matt-morris · 2022-12-08T22:40:38Z

closes #689

Introduces pandera.io.to_pyarrow_schema

@cosmicBboy, one thing I was unsure about was this type hint. mypy correctly identifies that the code conflicts with this type hint. However, I'm not sure whether there are any circumstances in which a key in DataFrameSchema.columns is not a string. I'm making the assumption in this function that it is always a string. Tests pass, but perhaps there is a situation in which this would be a problem that isn't covered by the unit tests?

Other assumptions:

- pyarrow.date64() is used when pyarrow cannot infer the pandera date data type.
- A DataFrameSchema with field(s) that are not typed is not supported. I suppose we could force those to something, say pyarrow.string(), but I don't like the feel of doing something like that.
- The following types are not supported:
  - geopandas geometries
  - Float128. We could potentially implement this, but we would have to make an assumption about the precision and roll with it.
  - Complex64, Complex128, Complex256
- The preserve_index argument to pandera.io.to_pyarrow_schema functions similarly to the preserve_index argument to pyarrow.Schema.from_pandas.
- As mentioned in the issue discussion, there is no support for complex (nested) types like pyarrow.list_(pyarrow.float64()).

Let me know if you feel like I missed any use cases in the unit tests.

the-matt-morris · 2022-12-08T23:07:54Z

I didn't run those mypy unit tests locally, I'll have to see what's going on there.

The other thing to consider is that this may all be moot. I see the PR for DataFrameSchema.empty(), and this whole thing could potentially be simply refactored to:

import pyarrow
pyarrow.Schema.from_pandas(dataframe_schema.empty())

sam-goodwin · 2024-03-29T22:53:24Z

What's the status of this PR? I have a use-case that requires this. Is there a different supported way or are we still waiting on this?

cosmicBboy · 2024-04-01T21:14:32Z

will need to resolve the merge conflicts and probably rebase this onto the current main branch.

@the-matt-morris not sure if you want to pick this up again. I do think leveraging an empty method would make sense to fulfill this use case.

However, the PR that implements the empty method hasn't seen much movement, the issue is still open.

I do think a workaround for this would be:

import pandas as pd
import pandera as pa
import pyarrow

schema = pa.DataFrameSchema(..., coerce=True)
empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
pyarrow.Schema.from_pandas(empty_df)

sam-goodwin · 2024-04-02T00:40:02Z

@cosmicBboy this doesn't work for me. The schema infers every column as type null:

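# Imports assumed from context; the original snippet did not show them.
import logging
from typing import List

import pandera as pa
import pandera.typing as pdt  # assumption: pdt is an alias for pandera.typing
from pandera.typing import Series

logger = logging.getLogger(__name__)

class TodoList(pa.DataFrameModel):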
    int16: Series[pdt.Int16] = pa.Field()
    int_list: Series[list[int]] = pa.Field()
    str_list: Series[list[str]] = pa.Field()
    int16_list: Series[list[pdt.Int16]] = pa.Field()
    int16_List: Series[List[pdt.Int16]] = pa.Field()

def test_to_arrow():
    import pandas as pd
    import pyarrow

    schema = TodoList.to_schema()
    empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
    schema = pyarrow.Schema.from_pandas(empty_df)

    logger.info(schema)

Output:

int16: null
int_list: null
str_list: null
int16_list: null
int16_List: null

cosmicBboy · 2024-04-02T01:04:57Z

yeah, tried this out and I think the approach in this PR (i.e. a dedicated pandera schema -> pyarrow schema translation layer) is the way to go. This is because for any non-scalar type (struct, list, dictionary, etc) I don't think pyarrow.Schema.from_pandas will be able to infer the dtype from an object column. Any pandera column with generics like List[TYPE] will be represented as an object dtype in pandas.

happy to review this or another PR that takes a crack at this, not sure if you want to continue tackling this @the-matt-morris

sam-goodwin · 2024-04-02T01:08:24Z

FYI - I have a local copy of this where I am modifying it to work for my use-case. I probably need some guidance though as I had to do some custom reflection to handle the typing library and am relatively new to python. Gist here: https://gist.github.com/sam-goodwin/85c44d0241f6848e4a183a39c1abfb58

Happy to contribute this back if @the-matt-morris isn't available to finish this PR.

sam-goodwin · 2024-04-02T01:14:27Z

It wasn't clear to me whether I am supposed to use = pa.Field() in a NamedTuple or TypedDict:

class TodoItem(NamedTuple):
    name: str
    priority: int
    pd_uint8: pdt.UInt8

I instead am using reflection and mapping based on the python types.
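
As a rough illustration of what reflection over a NamedTuple can look like with the standard typing module, here is a sketch. It is not the code from the gist; the tags field is a hypothetical addition to show how a generic annotation is taken apart.

```python
from typing import List, NamedTuple, get_args, get_origin, get_type_hints


class TodoItem(NamedTuple):
    name: str
    priority: int
    tags: List[str]  # hypothetical field, added to show a generic annotation


# get_type_hints returns a mapping of field name to annotation.
hints = get_type_hints(TodoItem)

for field_name, annotation in hints.items():
    # get_origin is None for plain classes and `list` for List[...];
    # get_args returns the type parameters, e.g. (str,) for List[str].
    print(field_name, get_origin(annotation), get_args(annotation))
```

Each annotation can then be passed to whatever function maps Python types to pyarrow types.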

I see this in the original PR:

pandas_types = {
    pd.BooleanDtype(): pa.bool_(),
    pd.Int8Dtype(): pa.int8(),
    pd.Int16Dtype(): pa.int16(),
    pd.Int32Dtype(): pa.int32(),
    pd.Int64Dtype(): pa.int64(),
    pd.UInt8Dtype(): pa.uint8(),
    pd.UInt16Dtype(): pa.uint16(),
    pd.UInt32Dtype(): pa.uint32(),
    pd.UInt64Dtype(): pa.uint64(),
    pd.Float32Dtype(): pa.float32(),  # type: ignore[attr-defined]
    pd.Float64Dtype(): pa.float64(),  # type: ignore[attr-defined]
    pd.StringDtype(): pa.string(),
}

I am just doing this:

    elif python_type is pdt.UInt8:
        return pa.uint8()
    elif python_type is pdt.UInt16:
        return pa.uint16()
    elif python_type is pdt.UInt32:
        return pa.uint32()
    elif python_type is pdt.UInt64:
        return pa.uint64()
    elif python_type is pdt.Int8:
        return pa.int8()
    elif python_type is pdt.Int16:
        return pa.int16()
    elif python_type is pdt.Int32:
        return pa.int32()
    elif python_type is pdt.Int64:
        return pa.int64()
    elif python_type is pdt.Float32:
        return pa.float32()
    elif python_type is pdt.Float64:
        return pa.float64()
    elif python_type is pdt.String:
        return pa.string()
    elif python_type is pdt.Bool:

Not sure what the trade-offs are.

cosmicBboy · 2024-04-02T01:29:04Z

The mapping approach is faster and simpler (it's O(1) since it's a lookup table). This would probably work for most of the simple types. For things like lists and namedtuple types you'll have to use the if statements.

In any case, feel free to create a new PR and we can iterate there.

themattmorris · 2024-04-19T02:35:57Z

Hey @cosmicBboy, I am sorry about this one. It has been a long time and I have a new GitHub account (yes, the name is nearly exactly the same :)). Anyway, I can take a look at this one again, rebase, and get the tests to pass. I must have gotten distracted, but it looks like there is at least some interest in getting this working.

sam-goodwin · 2024-04-19T02:37:35Z

@the-matt-morris I recently forked and continued this work and have tested it in production. I just didn't find the time to contribute it back. I'd be happy to open a PR or share a gist here with where I landed.

themattmorris · 2024-04-19T02:43:36Z

> @the-matt-morris I recently forked and continued this work and have tested it in production. I just didn't find the time to contribute it back. I'd be happy to open a PR or share a gist here with where I landed.

Oh that's great, thanks for picking it up! If you're nearly there I will stay out of your way, but let me know if you want me to contribute to it at all or have any questions on the approach I was taking.

the-matt-morris added 4 commits December 8, 2022 15:22

test: add unit test for pandera.io.to_pyarrow_field

830d77c

test: add test for pandera.io.to_pyarrow_schema

6a7d9fb

style: make line more readable

559cdde

style: make lines more readable

a5bb4e4

Cakell mentioned this pull request Jan 11, 2024

Generate pyarrow schema from pandera schema #689

Open
