Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default int type should be mapped to Int32 in Windows #726

Open
probberechts opened this issue Jan 9, 2022 · 6 comments · May be fixed by #1179
Open

Default int type should be mapped to Int32 in Windows #726

probberechts opened this issue Jan 9, 2022 · 6 comments · May be fixed by #1179
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@probberechts
Copy link
Contributor

Describe the bug

Pandas handles the default int type differently on Windows and Linux. On Linuxint is interpreted as int64 but on Windows as int32. Since Pandera always maps int to int64, you get unexpected SchemaErrors on Windows. You can read more about it in these issues:

Code Sample

import pandas as pd
import pandera as pa

from pandera.typing import DataFrame, Series

class Schema(pa.SchemaModel):
    price: Series[int]

df = pd.DataFrame({ 'price': [8, 12, 10, 16] }).astype(int)

InputSchema.validate(df)

This is fine on Linux, but gives a SchemaError: expected series 'price' to have type int64, got int32 on Windows.

@cosmicBboy
Copy link
Collaborator

Thanks for pointing this out @probberechts!

The relevant part of the code is here:
https://github.com/pandera-dev/pandera/blob/master/pandera/engines/pandas_engine.py#L211-L215

I think something like this will defer to the pandas default dtype for a particular python built-in type:

default_pd_dtype = pd.Series([1], dtype=builtin_name).dtype

assert np.dtype(int) == default_pd_dtype  # True
# Windows
assert np.dtype("int32") == default_pd_dtype
# non-Windows
assert np.dtype("int64") == default_pd_dtype

@jeffzi FYI, I forget why we decided to map int -> int64 regardless of OS, but I think we should defer to the behavior of the underlying platform for dtype defaults.

@probberechts would you be open to making a PR for this? Basically need to change that line of code and add an OS-specific unit test here: https://github.com/pandera-dev/pandera/blob/master/tests/core/test_dtypes.py

@jeffzi
Copy link
Collaborator

jeffzi commented Jan 10, 2022

I forget why we decided to map int -> int64 regardless of OS

I can't remember either. It could be a workaround from when we were fixing the windows CI.

I agree the default int should follow pandas and pandas_engine should not hardcode the size.

That said, we do test that default pandera int matches pandas: https://github.com/pandera-dev/pandera/blob/9448d0a80b8dd02910f9cc553ce00349584b107f/tests/core/test_dtypes.py#L406-L411

The problem is that the "implicit" default for a pandas Series is int64 but astype(int) will apply numpy's default, i.e. int32 on windows.

import platform
import sys
import numpy as np
import pandas as pd

print(platform.system())
#> Windows
print(sys.version)
#> 3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)]
print(pd.__version__)
#> 1.3.4
print(np.__version__)
#> 1.21.2

print(np.dtype(int))
#> int32
print(pd.Series([1]).dtype) # implicit dtype
#> int64
print(pd.Series([1], dtype=int).dtype) # explicit dtype
#> int32

If we have int32 as default on windows, then the validation will fail when the user does not explicitly cast the series to int (e.g.: pd.Dataframe(..., dtype=int) or astype(int)). I think that's fine but we should explain this inconsistency in the documentation.

@Midnighter
Copy link

This is quite annoying. I tried to fix it by explicitly defining

class Schema(pa.SchemaModel):
    price: Series[numpy.int64]

    class Config:
       coerce = True

but that didn't help either 🙁

@joaoe
Copy link

joaoe commented Apr 17, 2023

Howdy.
In my project we have people working on Linux and a few on Windows. This causes problems for us.
So far I worked around by monkeypatching Int

from pandera.dtypes import Int
Int.check = lambda self, pandera_dtype, data_container=None: isinstance(pandera_dtype, Int)

Having such work around in the project is ugly, but the fix itself is not too strange.

The default generic Int type shold allow any of int8, int16, int32 or int64. It is just Int. There are other IntN classes for the specific bid widths. But if I have a Schema with just column: int I'm not too interested in pandera checking for bid width.
Perhaps, the rules in the Int class should be relaxed ?

EDIT: for reference, this is the numpy bug numpy/numpy#9464

@DrShushen
Copy link

@joaoe
I agree, I've had to write custom workarounds in some of my projects to essentially accomplish this on Windows:
https://github.com/vanderschaarlab/temporai/blob/a6bb11d63341c4947f940e3d075e8a77be06b898/src/tempor/data/pandera_utils.py#L134-L140

@cosmicBboy cosmicBboy linked a pull request May 7, 2023 that will close this issue
@cosmicBboy
Copy link
Collaborator

cosmicBboy commented May 7, 2023

So the problem here is stated by @jeffzi #726 (comment):

On Windows:

print(np.dtype(int))
#> int32
print(pd.Series([1]).dtype) # implicit dtype
#> int64
print(pd.Series([1], dtype=int).dtype) # explicit dtype
#> int32

It seems reasonable that pd.Series([1], dtype=int) should match pd.SeriesSchema(dtype=int). But it's not ideal that the implicit integer type of pd.Series([1]) is int64, so pd.SeriesSchema(dtype=int).validate(pd.Series([1])) will fail, which I'm sure will still be surprise many users. I think this will just have to be documented clearly somewhere.

There are 130 failing tests as a result of this PR fix: https://github.com/unionai-oss/pandera/actions/runs/4909424667/jobs/8765820508?pr=1179

If anyone on this thread so far has the bandwidth to fix all of these breaking tests on windows, that would be much appreciated! The code changes to fix this issue are already on #1179, just need to update the tests to explicitly use dtype=int in tests cases that rely on the assumption that pd.Series([1]) does the right thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants