Generate pyarrow schema from pandera schema #689

Open
cristianmatache opened this issue Nov 25, 2021 · 14 comments · May be fixed by #1047

Comments

@cristianmatache
Contributor

Is your feature request related to a problem? Please describe.
Need to maintain the schema twice, once for the pandas dataframe and again for the pyarrow table. An example where we need both is writing partitioned parquet datasets.

Describe the solution you'd like
Generate pyarrow schema from pandera schema.

I plan to implement this over the Christmas holidays.

@cristianmatache cristianmatache added the enhancement New feature or request label Nov 25, 2021
@justinlboyer

@cristianmatache Any chance you made any headway on this?

@cristianmatache
Contributor Author

@justinlboyer Not really. I recently changed jobs, so I currently have a lot on my plate. Happy to guide you, though, if you're up for implementing it.

@the-matt-morris
Contributor

@justinlboyer, did you ever take a look at this? This would be useful, though I assume the implementation would be limited: for example, pyarrow.list_(pyarrow.float64()) would not be supported, since pandera has no implementation of complex types like this (that I'm aware of).

If a basic implementation is satisfactory (i.e. not able to handle complex types like the list example above), I'd be up for collaborating on this.

@justinlboyer

@the-matt-morris I did not, we don't need it much anymore, but I'm happy to help out, feel free to ping me.

@cosmicBboy
Collaborator

cosmicBboy commented Nov 29, 2022

Hey @the-matt-morris, the basic implementation would be a good first step (i.e. support for primitive/scalar data types)!

This is related to #260; support for things like pyarrow.list_(pyarrow.float64()) would be blocked by that.

@the-matt-morris
Contributor

@cosmicBboy, cool! Well, I can take a stab at a PR on this. My thinking would be a DataFrameSchema method that returns the pyarrow schema. Obviously will need to create data type mapping to pyarrow types somewhere. Am I oversimplifying this?

@cosmicBboy
Collaborator

> thinking would be a DataFrameSchema method that returns the pyarrow schema

I'd consider this part of the pandera[io] extra, with the additional pyarrow library dependency.

My recommendation would be to implement a to_pyarrow_schema function in the io module. For now I'd hesitate to add it as a DataFrameSchema method, so that the API surface of the class stays (relatively) small. I imagine more of these to/from_{schema_format} functions will be implemented in the future, and a reasonable UX for them would be pandera.io.to/from_{schema_format}(dataframe_schema).

> Obviously will need to create data type mapping to pyarrow types somewhere. Am I oversimplifying this?

Seems about right!

@the-matt-morris the-matt-morris linked a pull request Dec 8, 2022 that will close this issue
@louis-vines

Is this PR close to being merged? This is an excellent feature I would be keen to leverage!

@cosmicBboy
Collaborator

Hi @louis-vines, all current PRs are blocked by #913, which involves a significant rewrite of the pandera internals. Once that's merged (hopefully within the next 2 weeks) we'll circle back to incorporate all the recent PRs, including this one.

@the-matt-morris
Contributor

Excited for #913!

Even once that is merged, I will need to go back and make a few updates to the PR anyways. I'd like to try out DataFrameSchema.empty() in conjunction with pyarrow.Schema.from_pandas, as it might be more robust than hardcoding all the mappings of dtypes to pyarrow types that I did initially.

@louis-vines

louis-vines commented Feb 18, 2023

I see #913 is now merged (🥳). Any news on this one? Anything I could do to help?

@novemberkilo

Also checking in on the status of this please.

@Cakell

Cakell commented Jan 11, 2024

Hi @the-matt-morris

I'd also be happy to use this feature. Any chance you can update #1047 now that #913 is merged?

Thanks!

@sam-goodwin
Contributor

Checking in on the status. How can we further this along?


8 participants