Generate pyarrow schema from pandera schema #689

cristianmatache · 2021-11-25T13:39:52Z

Is your feature request related to a problem? Please describe.
Need to maintain the schema twice, once for the pandas dataframe and again for the pyarrow table. An example where we need both is writing partitioned parquet datasets.

Describe the solution you'd like
Generate pyarrow schema from pandera schema.

I plan to implement this over the Christmas holidays.

justinlboyer · 2022-06-27T15:45:31Z

@cristianmatache Any chance you made any headway on this?

cristianmatache · 2022-06-28T00:08:26Z

@justinlboyer Not really. I recently changed jobs, so I currently have a lot on my plate. Happy to guide you though, if you would be up for implementing it.

the-matt-morris · 2022-11-28T15:51:24Z

@justinlboyer, did you ever take a look at this? This would be useful, though I'm assuming it would be limited in its implementation, i.e. using pyarrow.list_(pyarrow.float64()) would not be supported, as there's no implementation of complex types like this in pandera (that I'm aware of?)

If a basic implementation is satisfactory (i.e. not able to handle complex types like the list example above), I'd be up for collaborating on this.

justinlboyer · 2022-11-29T13:54:26Z

@the-matt-morris I did not, we don't need it much anymore, but I'm happy to help out, feel free to ping me.

cosmicBboy · 2022-11-29T19:54:32Z

hey @the-matt-morris the basic implementation would be a good first step! (i.e. support for primitive/scalar data types)

This is related to #260; support for things like pyarrow.list_(pyarrow.float64()) would be blocked by that.

the-matt-morris · 2022-11-29T19:59:00Z

@cosmicBboy, cool! Well, I can take a stab at a PR on this... my thinking is that it would be a DataFrameSchema method that returns the pyarrow schema. Obviously I will need to create a data type mapping to pyarrow types somewhere. Am I oversimplifying this?

cosmicBboy · 2022-11-29T21:30:52Z

> thinking would be a DataFrameSchema method that returns the pyarrow schema

I'd consider this part of the pandera[io] extra, with the additional pyarrow library dependency.

My recommendation would be to implement a to_pyarrow_schema function in the io module. For now I'd hesitate to add it as a DataFrameSchema method so the API surface of the class stays (relatively) small -- I imagine more of these to/from_{schema_format} functions will be implemented in the future, and a reasonable UX for it would be pandera.io.to/from_{schema_format}(dataframe_schema)

> Obviously will need to create data type mapping to pyarrow types somewhere. Am I oversimplifying this?

Seems about right!

louis-vines · 2023-01-19T16:00:10Z

Is this PR close to being merged? This is an excellent feature I would be keen to leverage!

cosmicBboy · 2023-01-19T19:40:57Z

hi @louis-vines all current PRs are being blocked by #913, which involves a significant re-write of the pandera internals. Once that's merged (hopefully within the next 2 weeks) we'll circle back to incorporate all the recent PRs, including this one.

the-matt-morris · 2023-01-19T19:44:40Z

Excited for #913 !

Even once that is merged, I will need to go back and make a few updates to the PR anyways. I'd like to try out DataFrameSchema.empty() in conjunction with pyarrow.Schema.from_pandas, as it might be more robust than hardcoding all the mappings of dtypes to pyarrow types that I did initially.

louis-vines · 2023-02-18T15:00:57Z

I see #913 is now merged (🥳). Any news on this one? Anything I could do to help?

novemberkilo · 2023-11-06T01:08:23Z

Also checking in on the status of this please.

Cakell · 2024-01-11T10:03:51Z

Hi @the-matt-morris

I'd also be happy to use this feature. Any chance you can update #1047 now that #913 is merged?

Thanks!

sam-goodwin · 2024-03-29T22:53:50Z

Checking in on the status. How can we further this along?

cristianmatache added the enhancement New feature or request label Nov 25, 2021

cosmicBboy assigned cristianmatache Dec 7, 2021

the-matt-morris linked a pull request Dec 8, 2022 that will close this issue

feat: add pandera.io.to_pyarrow_schema #1047

Open
