Support for multi type (Unions) in schemas and validation #1152

Open
vianmixtkz opened this issue Apr 4, 2023 · 7 comments · May be fixed by #1227
Labels
enhancement New feature or request

Comments

@vianmixtkz

Is your feature request related to a problem? Please describe.

I would like pandera to support Union types. That is, validation of a Series/Column should be able to accept multiple types.
Pydantic allows this.

Here is an example of my issue:

from typing import Union
import pandas as pd
import pandera as pa
from pandera.typing import Series

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment : Series[Union[str, float]] = pa.Field()

class OutputSchema(InputSchema):
    revenue: Series[float]

df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
    "comment":["test", float("nan"), "test"]
})

InputSchema(df) # raises TypeError Cannot interpret 'typing.Union[str, float]' as a data type

Describe the solution you'd like

I think not allowing Unions is the desired behavior for now, but could you consider an option to allow them in the future?

Describe alternatives you've considered

Split each Union column into multiple columns, one per type, but this is not really something that I can control (see the next section).

Additional context

I have a valid use case for this. I am using pandas to handle CSVs where some columns contain hybrid data types.
I am using pandas for the preprocessing and pydantic for the validation, and I would like to use pandera to make this process (processing + validation) more robust

@vianmixtkz vianmixtkz added the enhancement New feature or request label Apr 4, 2023
@johnkangw

@vianmixtkz Great writeup. This is something that would be great for Pandera to support.

@cosmicBboy
Collaborator

Thanks @vianmixtkz this is an interesting use case: the way pandas handles mixed-type columns is to represent the data in an object dtype column.
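
To illustrate the object dtype point, here is a small sketch using only pandas. The sample values are made up for illustration: a column that mixes str and float values ends up with object dtype, while each element keeps its own Python type.

```python
import pandas as pd

# Illustrative data: one column mixing str and float values.
mixed = pd.Series(["test", 1.5, "other"])

# pandas falls back to the generic object dtype for mixed values.
is_object = mixed.dtype == object

# Each element keeps its original Python type.
element_types = [type(v) for v in mixed]
```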

One thing we should clarify in the semantics of this feature is the following: we can interpret Union[str, float] either as:

  1. the column is either a str column or a float column
  2. the column is an object column that contains either str or float values

Do we need special syntax to differentiate between these two cases, or is that something that we leave to the pandera type engine to handle? I.e.:

  • if the column is str dtype, then pass
  • if the column is float dtype, then pass
  • if the column is object data type, check that values are str or float. If so, then pass.
  • fail if none of the above conditions are met.
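
The decision steps above could be sketched roughly as follows. This is only an illustration written against plain pandas, not pandera's type engine: `check_union_column` and its `allowed` parameter are hypothetical names, and treating `pd.StringDtype` as "str dtype" is an assumption (older pandas versions store plain strings in object columns, which the third branch covers).

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_object_dtype


def check_union_column(series: pd.Series, allowed=(str, float)) -> bool:
    """Hypothetical helper: does the column satisfy Union[str, float]?"""
    # Whole column has a float dtype.
    if float in allowed and is_float_dtype(series.dtype):
        return True
    # Whole column has a dedicated string dtype (assumption: pd.StringDtype).
    if str in allowed and isinstance(series.dtype, pd.StringDtype):
        return True
    # Object column: every individual value must be one of the allowed types.
    if is_object_dtype(series.dtype):
        return all(isinstance(value, allowed) for value in series)
    # None of the conditions were met.
    return False


float_ok = check_union_column(pd.Series([1.0, 2.5]))
mixed_ok = check_union_column(pd.Series(["test", float("nan"), "test"], dtype=object))
int_ok = check_union_column(pd.Series([1, 2, 3]))
bad_object_ok = check_union_column(pd.Series(["a", 1], dtype=object))
```

Note that NaN is itself a float, so the mixed column from the original example passes the object branch.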

@vianmixtkz
Author

What I described here matches case 2. That is, in a given column, I'll have for example str on some rows and floats on other rows.
But it would be nice to support both cases anyway.

With something like:

Case 1

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment : Union[Series[str], Series[float]] = pa.Field() # comment is either only str or only float in a given DataFrame

Case 2

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment : Series[Union[str,float]] = pa.Field() # comment is a column containing str on some rows and float on other rows

And yeah, I think the behavior you are describing is what users would expect

Do we need special syntax to differentiate between these two cases, or is that something that we leave to the pandera type engine to handle? I.e.:

if the column is str dtype, then pass # passes in case 1 and 2
if the column is float dtype, then pass # passes in case 1 and 2
if the column is object data type, check that values are str or float. If so, then pass. # passes only in case 2
fail if none of the above conditions are met.

@karajan1001 karajan1001 linked a pull request Jun 20, 2023 that will close this issue
karajan1001 added a commit to karajan1001/pandera that referenced this issue Jul 19, 2023
fix: unionai-oss#1152
I would like pandera to support Union Type. That is the validation of a
Series/Column should allow multiple types.

1. Add a new PythonUnion type.
2. Add a new test to for the new UnionType.

Signed-off-by: karajan1001 <mishanyo1001@gmail.com>
@aaravind100
Contributor

Just bumping this thread.

Is there any consensus on how to proceed? It seems like #1227 is stale.

@cosmicBboy
Collaborator

cosmicBboy commented Mar 30, 2024

Revisiting this issue and thinking about it a little bit, here's another proposal for this issue:

from typing import Annotated, Union

import pandera as pa
from pandera.engines.pandas_engine import Object

class Model(pa.DataFrameModel):
    union_column : Union[str, float]  # the column data type must be either a str or float

    object_column: Object = pa.Field(dtype_kwargs={"allowable_types": [str, float]})
    # or use the annotated types
    object_column: Annotated[Object, [str, float]]

This syntax is less ambiguous as to what the actual type of the column is vs. the values within it are. However, it does require importing a special Object type.

I'm still open to the more ambiguous behavior where Union[str, float] would cover all of these cases though. Open to further discussion on this!

@cosmicBboy
Collaborator

Re: this proposal: #1152 (comment)

Unfortunately, col: Series[TYPE] and col: TYPE are equivalent in a DataFrameModel, so Union[Series[str], Series[float]] and Series[Union[str,float]] would effectively be equivalent too. Distinguishing them would also introduce more complexity to the handling of types in DataFrameModel, which I don't think would be worth it.
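
A rough sketch of why the two spellings collapse, assuming annotations are normalized by stripping the Series[...] wrapper. This is not pandera's actual implementation: the `Series` class below is a stand-in for pandera.typing.Series and `unwrap` is a hypothetical helper, both defined here only for illustration.

```python
from typing import Generic, TypeVar, Union, get_args, get_origin

T = TypeVar("T")


class Series(Generic[T]):
    """Stand-in for pandera.typing.Series, for illustration only."""


def unwrap(annotation):
    """Hypothetical: strip Series[...] wrappers, recursing into Union members."""
    origin = get_origin(annotation)
    if origin is Series:
        return unwrap(get_args(annotation)[0])
    if origin is Union:
        return Union[tuple(unwrap(arg) for arg in get_args(annotation))]
    return annotation


# Once Series[TYPE] is treated like TYPE, both spellings reduce to the same thing.
case_1 = unwrap(Union[Series[str], Series[float]])
case_2 = unwrap(Series[Union[str, float]])
```

Under that assumption both annotations end up as Union[str, float], so nothing is left to tell case 1 and case 2 apart.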

@aaravind100
Contributor

I'm not a fan of the Union[Series[str], Series[float]] case from this comment, where the series would consist of only strings or only floats. It's very ambiguous: the output would sort of change depending on what data you pass. These could very well be their own distinct schemas.

Series[Union[str, float]] or Union[str, float] or str | float # python 3.10+, where the output could be either string or float. This case is more consistent.

4 participants