feat: add pandera.io.to_pyarrow_schema #1047

Open
wants to merge 4 commits into
base: main

@the-matt-morris
Contributor

the-matt-morris commented Dec 8, 2022

closes #689

Introduces pandera.io.to_pyarrow_schema

@cosmicBboy, one thing I was unsure about was this type hint. mypy correctly identifies that it conflicts with this type hint. However, I'm not sure whether there are any circumstances in which a key in DataFrameSchema.columns is not a string. I'm assuming in this function that it is always a string. Tests pass, but perhaps there is a situation, not covered by the unit tests, in which this would be a problem?

Other assumptions:

  • pyarrow.date64() type is used when the pandera date data type cannot be inferred by pyarrow
  • This does not support a DataFrameSchema with field(s) that are not typed. I suppose we could force those to something, say pyarrow.string(), but I don't like the feel of doing something like that.
  • No support for following types:
    • geopandas Geometries
    • Float128. We could potentially implement this, would just have to make an assumption about the precision and roll with it
    • Complex64, Complex128, Complex256
  • Argument preserve_index to pandera.io.to_pyarrow_schema functions similarly to preserve_index argument to pyarrow.Schema.from_pandas
  • As mentioned in the issue discussion, there is no support for complex (nested) types like pyarrow.list_(pyarrow.float64()).

Let me know if you feel like I missed any use cases in the unit tests.

@the-matt-morris
Contributor Author

I didn't run those mypy unit tests locally, I'll have to see what's going on there.

The other thing to consider is that this may all be moot. I see the PR for DataFrameSchema.empty(), and this whole thing could potentially be simply refactored to:

import pyarrow
pyarrow.Schema.from_pandas(dataframe_schema.empty())

@sam-goodwin
Contributor

What's the status of this PR? I have a use-case that requires this. Is there a different supported way or are we still waiting on this?

@cosmicBboy
Collaborator

cosmicBboy commented Apr 1, 2024

will need to resolve the merge conflicts and probably rebase this onto the current main branch.

@the-matt-morris not sure if you want to pick this up again. I do think leveraging an empty method would make sense to fulfill this use case.

However, the PR that implements the empty method hasn't seen much movement, the issue is still open.

I do think a workaround for this would be:

import pandas as pd
import pandera as pa
import pyarrow

schema = pa.DataFrameSchema(..., coerce=True)
empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
pyarrow.Schema.from_pandas(empty_df)

@sam-goodwin
Copy link
Contributor

@cosmicBboy this doesn't work for me. The schema infers all types as type null

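# Imports assumed by this snippet (not shown in the original comment)
import logging
from typing import List

import pandera as pa
import pandera.typing as pdt
from pandera.typing import Series

logger = logging.getLogger(__name__)

class TodoList(pa.DataFrameModel):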
    int16: Series[pdt.Int16] = pa.Field()
    int_list: Series[list[int]] = pa.Field()
    str_list: Series[list[str]] = pa.Field()
    int16_list: Series[list[pdt.Int16]] = pa.Field()
    int16_List: Series[List[pdt.Int16]] = pa.Field()

def test_to_arrow():
    import pandas as pd
    import pyarrow

    schema = TodoList.to_schema()
    empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
    schema = pyarrow.Schema.from_pandas(empty_df)

    logger.info(schema)

Output:

int16: null
int_list: null
str_list: null
int16_list: null
int16_List: null

@cosmicBboy
Collaborator

cosmicBboy commented Apr 2, 2024

yeah, tried this out and I think the approach in this PR (i.e. a dedicated pandera schema -> pyarrow schema translation layer) is the way to go. This is because for any non-scalar type (struct, list, dictionary, etc) I don't think pyarrow.Schema.from_pandas will be able to infer the dtype from an object column. Any pandera column with generics like List[TYPE] will be represented as an object dtype in pandas.

happy to review this or another PR that takes a crack at this, not sure if you want to continue tackling this @the-matt-morris

@sam-goodwin
Contributor

FYI - I have a local copy of this where I am modifying it to work for my use-case. I probably need some guidance though as I had to do some custom reflection to handle the typing library and am relatively new to python. Gist here: https://gist.github.com/sam-goodwin/85c44d0241f6848e4a183a39c1abfb58

Happy to contribute this back if @the-matt-morris isn't available to finish this PR.

@sam-goodwin
Contributor

sam-goodwin commented Apr 2, 2024

It wasn't clear to me if I am supposed to use = pa.Field() in NamedTuple or TypedDict:

class TodoItem(NamedTuple):
    name: str
    priority: int
    pd_uint8: pdt.UInt8

I instead am using reflection and mapping based on the python types.

I see this in the original PR:

pandas_types = {
    pd.BooleanDtype(): pa.bool_(),
    pd.Int8Dtype(): pa.int8(),
    pd.Int16Dtype(): pa.int16(),
    pd.Int32Dtype(): pa.int32(),
    pd.Int64Dtype(): pa.int64(),
    pd.UInt8Dtype(): pa.uint8(),
    pd.UInt16Dtype(): pa.uint16(),
    pd.UInt32Dtype(): pa.uint32(),
    pd.UInt64Dtype(): pa.uint64(),
    pd.Float32Dtype(): pa.float32(),  # type: ignore[attr-defined]
    pd.Float64Dtype(): pa.float64(),  # type: ignore[attr-defined]
    pd.StringDtype(): pa.string(),
}

I am just doing this:

    elif python_type is pdt.UInt8:
        return pa.uint8()
    elif python_type is pdt.UInt16:
        return pa.uint16()
    elif python_type is pdt.UInt32:
        return pa.uint32()
    elif python_type is pdt.UInt64:
        return pa.uint64()
    elif python_type is pdt.Int8:
        return pa.int8()
    elif python_type is pdt.Int16:
        return pa.int16()
    elif python_type is pdt.Int32:
        return pa.int32()
    elif python_type is pdt.Int64:
        return pa.int64()
    elif python_type is pdt.Float32:
        return pa.float32()
    elif python_type is pdt.Float64:
        return pa.float64()
    elif python_type is pdt.String:
        return pa.string()
    elif python_type is pdt.Bool:

Not sure what the trade-offs are.

@cosmicBboy
Collaborator

The mapping approach is faster and simpler (it's O(1) since it's a lookup table). This would probably work for most of the simple types. For things like lists and namedtuple types you'll have to use the if statements.

In any case, feel free to create a new PR and we can iterate there.

@themattmorris

Hey @cosmicBboy, I am sorry about this one. It has been a long time and I have a new GitHub account (yes, the name is nearly exactly the same :)). Anyway, I can take a look at this one again, rebase, and get the tests to pass. I must have gotten distracted, but it looks like there is at least some interest in getting this working.

@sam-goodwin
Contributor

@the-matt-morris i recently forked and continued this work and have tested in production. Just didn't find the time to contribute it back. I'd be happy to open a PR or share a gist here with where I landed

@themattmorris

> @the-matt-morris i recently forked and continued this work and have tested in production. Just didn't find the time to contribute it back. I'd be happy to open a PR or share a gist here with where I landed

Oh that's great, thanks for picking it up! If you're nearly there I will stay out of your way, but let me know if you want me to contribute to it at all or have any questions on the approach I was taking.

Development

Successfully merging this pull request may close these issues.

Generate pyarrow schema from pandera schema
4 participants