Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert result of group by agg to pyarrow if input is pyarrow #58129

Open
wants to merge 38 commits into
base: main
Choose a base branch
from

Conversation

undermyumbrella1
Copy link
Contributor

@undermyumbrella1 undermyumbrella1 commented Apr 3, 2024

Root cause:

  • agg_series always forces output dtype to be the same as input dtype, but depending on the lambda, the output dtype can be different

Fix:

  • replace all NA with nan
  • convert the `results' to respective pyarrow extension array, using pyarrow library methods
  • pyarrow library methods is used instead of maybe_convert_object, as maybe_convert_object does not check for NA, and forces dtype to float if NA is present (NA is not float in pyarrow),

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to a particular issue?

pandas/core/groupby/ops.py Outdated Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
@mroeschke mroeschke added Apply Apply, Aggregate, Transform Arrow pyarrow functionality labels Apr 3, 2024
@undermyumbrella1
Copy link
Contributor Author

I have added the issue in the pr, the pr is still work in progress

@undermyumbrella1 undermyumbrella1 marked this pull request as draft April 5, 2024 07:05
@undermyumbrella1 undermyumbrella1 marked this pull request as ready for review April 8, 2024 17:01
@undermyumbrella1
Copy link
Contributor Author

Hi, I have completed the implementation, may i check why linux test if failing with NameError: name 'pa' is not defined, but it works for other os?

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This appears to me to be a fairly far reaching, and I don't yet feel comfortable given that we have to consider many different cases since the user can provide an arbitrary UDF. It seems to me that the logic "convert to pyarrow dtypes when we can" could result in some surprising behaviors. For example:

df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")

result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]

result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object

While I experiment with this some more, a few questions.

pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
pandas/core/groupby/ops.py Outdated Show resolved Hide resolved
@undermyumbrella1
Copy link
Contributor Author

undermyumbrella1 commented Apr 12, 2024

df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")

result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]

result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object

Thank you for the review. For this example, it is expected as that is how pyarrow represents these data structures. E.g homogenous int list and heterogenous object. Alternatively, what would be the expected dtype in this case?

@undermyumbrella1 undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from 4021573 to 6dc40f5 Compare April 21, 2024 08:46
@undermyumbrella1
Copy link
Contributor Author

Sorry for the delay in resolving the test cases. Originally the approach is to always caste it back to input dtype, even if the output dtype is different. Currenlty the approach is to let pyarow infer the dtype of the output. However this can result in some unexpected cases as mentioned above. May I check what would be suggested output dtype to fix this issue?

For windows run, it seems to be failing as it is unable to import pyarrow in files unrelated to the code change in the pr, not sure if i have to change some configs.

@rhshadrach
Copy link
Member

May I check what would be suggested output dtype to fix this issue?

I think this is the same problem as the discussion here: #58258 (comment). This appears to me something that will need careful implementation on the EA side.

@undermyumbrella1 undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from e290535 to 3b6696b Compare April 25, 2024 16:00
@undermyumbrella1
Copy link
Contributor Author

undermyumbrella1 commented Apr 25, 2024

Noted, I have made the change according to the comments using convert_dtypes and a helper method to account for some unaccounted pyarrow dtypes. Although I am not sure if the change should be made here or in lib.maybe_convert_objects.

Some of the windows tests are failing due to unable to import pyarrow in other section of the code unrelated to this change. Not sure if this is an issue on my side.

@undermyumbrella1 undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from 8a95274 to 680e238 Compare April 29, 2024 06:57
@undermyumbrella1 undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from ed27650 to 3a8597e Compare April 29, 2024 08:29
@undermyumbrella1
Copy link
Contributor Author

Thank you for the review, I have updated the pr according to comments.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two general remarks about the tests:

  • Use other pandas methods as little as possible; try to construct what you want directly.
  • It looks like many of the tests added here can be parametrized, can you give that a shot.

For the first, you can do things like

df = pd.DataFrame(
    {
        "a": pd.array([1, 2, 3], dtype="..."),
        "b": pd.array([True, False, True], dtype="..."),
    },
    index=pd.Index([1, 2, 3]),
)

instead of using astype, set_index, and the like.

Comment on lines +928 to +931
if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct(
out.dtype.pyarrow_dtype
):
out = npvalues
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test that hits this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved, the test_agg_lambda_pyarrow_struct_to_object_dtype_conversion test hits this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel - I was surprised maybe_cast_pointwise_result was giving us back a Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array in a followup. Any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

giving us back a Arrow dtypes we don't have EAs for

Can you give an example? this confuses me.

should maybe go in dtypes.cast._maybe_cast_to_extension_array

_maybe_cast_to_extension_array is only used in maybe_cast_pointwise_result, so not a huge deal either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from pandas.core.dtypes.cast import maybe_cast_pointwise_result

arr = np.array([{"number": 1}])
result = maybe_cast_pointwise_result(
    arr, 
    dtype=pd.ArrowDtype(pa.int64()), 
    numeric_only=True, 
    same_dtype=False,
)
print(result)
# Length: 1, dtype: struct<number: int64>[pyarrow]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel - sorry for the noise, I was not aware we could support struct dtypes. I think everything is okay here.

@undermyumbrella1 - why go with NumPy object dtype instead of struct dtypes here?

pandas/tests/extension/test_arrow.py Outdated Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
@undermyumbrella1 undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from ad15c86 to 712c36a Compare May 2, 2024 05:23
@undermyumbrella1
Copy link
Contributor Author

Thank you foe the review, I have made changes according to the pr comments.

Comment on lines +928 to +931
if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct(
out.dtype.pyarrow_dtype
):
out = npvalues
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel - I was surprised maybe_cast_pointwise_result was giving us back a Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array in a followup. Any thoughts?

Comment on lines 1134 to 1139
if pa.types.is_date(pa_dtype):
return "date32[day][pyarrow]"
elif pa.types.is_time(pa_dtype):
return "time64[us][pyarrow]"
elif pa.types.is_decimal(pa_dtype):
return ArrowDtype(pa.decimal128(4, 3))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On closer look, I think this is a bug being introduced here. This test is using .first(), it should be preserving the dtype in all cases. The changes in this PR now ignore the preserve_dtype argument of agg_series. When that is true, we should be passing same_dtype=True to maybe_cast_pointwise_result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved

@@ -1125,6 +1125,27 @@ def test_comp_masked_numpy(self, masked_dtype, comparison_op):
expected = pd.Series(exp, dtype=ArrowDtype(pa.bool_()))
tm.assert_series_equal(result, expected)

def test_groupby_agg_extension(self, data_for_grouping):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test should behave the same as the one in the base class. If that's the case, this can be removed. Can you confirm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform Arrow pyarrow functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Groupby-aggregate on a boolean column returns a different datatype with pyarrow than with numpy
4 participants