Convert result of group by agg to pyarrow if input is pyarrow #58129

undermyumbrella1 · 2024-04-03T10:48:16Z

closes BUG: Groupby-aggregate on a boolean column returns a different datatype with pyarrow than with numpy #53030
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Root cause:

agg_series always forces the output dtype to be the same as the input dtype, but depending on the lambda, the output dtype can be different.

Fix:

replace all NA with nan
convert the `results` to the respective pyarrow extension array, using pyarrow library methods
pyarrow library methods are used instead of maybe_convert_objects, as maybe_convert_objects does not check for NA and forces the dtype to float if NA is present (NA is not a float in pyarrow).

mroeschke

Is this related to a particular issue?

pandas/core/groupby/ops.py

pandas/tests/groupby/aggregate/test_aggregate.py

undermyumbrella1 · 2024-04-05T06:49:36Z

I have added the issue to the PR. The PR is still a work in progress.

undermyumbrella1 · 2024-04-08T17:02:00Z

Hi, I have completed the implementation. May I check why the Linux test is failing with NameError: name 'pa' is not defined, while it works for other OSes?

rhshadrach

Thanks for the PR! This appears to me to be a fairly far-reaching change, and I don't yet feel comfortable with it, given that we have to consider many different cases since the user can provide an arbitrary UDF. It seems to me that the logic "convert to pyarrow dtypes when we can" could result in some surprising behaviors. For example:

df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")

result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]

result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object

While I experiment with this some more, a few questions.

pandas/core/dtypes/cast.py

pandas/core/groupby/ops.py

undermyumbrella1 · 2024-04-12T05:22:34Z

df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")

result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]

result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object

Thank you for the review. For this example, the result is expected, as that is how pyarrow represents these data structures, e.g. a homogeneous list of ints versus a heterogeneous list that falls back to object. Alternatively, what would be the expected dtype in this case?

undermyumbrella1 · 2024-04-21T09:21:28Z

Sorry for the delay in resolving the test cases. Originally, the approach was to always cast the result back to the input dtype, even if the output dtype was different. Currently, the approach is to let pyarrow infer the dtype of the output. However, this can result in some unexpected cases, as mentioned above. May I check what would be suggested output dtype to fix this issue?

For the Windows run, it seems to be failing because it is unable to import pyarrow in files unrelated to the code change in the PR. I am not sure if I have to change some configs.

rhshadrach · 2024-04-22T20:39:02Z

May I check what would be suggested output dtype to fix this issue?

I think this is the same problem as the discussion here: #58258 (comment). This appears to me to be something that will need careful implementation on the EA side.

undermyumbrella1 · 2024-04-25T16:40:04Z

Noted, I have made the change according to the comments, using convert_dtypes and a helper method to account for some unaccounted pyarrow dtypes. However, I am not sure if the change should be made here or in lib.maybe_convert_objects.

Some of the Windows tests are failing because pyarrow cannot be imported in other sections of the code unrelated to this change. I am not sure if this is an issue on my side.

pandas/core/groupby/ops.py

undermyumbrella1 · 2024-04-29T09:54:45Z

Thank you for the review. I have updated the PR according to the comments.

rhshadrach

Two general remarks about the tests:

Use other pandas methods as little as possible; try to construct what you want directly.
It looks like many of the tests added here can be parametrized; can you give that a shot?

For the first, you can do things like

df = pd.DataFrame(
    {
        "a": pd.array([1, 2, 3], dtype="..."),
        "b": pd.array([True, False, True], dtype="..."),
    },
    index=pd.Index([1, 2, 3]),
)

instead of using astype, set_index, and the like.

rhshadrach · 2024-04-29T20:47:57Z

pandas/core/groupby/ops.py

+            if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct(
+                out.dtype.pyarrow_dtype
+            ):
+                out = npvalues


Is there a test that hits this?

resolved, the test_agg_lambda_pyarrow_struct_to_object_dtype_conversion test hits this

@jbrockmendel - I was surprised maybe_cast_pointwise_result was giving us back Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array in a followup. Any thoughts?

giving us back Arrow dtypes we don't have EAs for

Can you give an example? this confuses me.

should maybe go in dtypes.cast._maybe_cast_to_extension_array

_maybe_cast_to_extension_array is only used in maybe_cast_pointwise_result, so not a huge deal either way.

import numpy as np
import pandas as pd
import pyarrow as pa

from pandas.core.dtypes.cast import maybe_cast_pointwise_result

arr = np.array([{"number": 1}])
result = maybe_cast_pointwise_result(
    arr,
    dtype=pd.ArrowDtype(pa.int64()),
    numeric_only=True,
    same_dtype=False,
)
print(result)
# Length: 1, dtype: struct<number: int64>[pyarrow]

@jbrockmendel - sorry for the noise, I was not aware we could support struct dtypes. I think everything is okay here.

@undermyumbrella1 - why go with NumPy object dtype instead of struct dtypes here?

pandas/tests/extension/test_arrow.py

pandas/tests/groupby/aggregate/test_aggregate.py

undermyumbrella1 · 2024-05-02T07:03:12Z

Thank you for the review. I have made changes according to the PR comments.

rhshadrach · 2024-05-04T12:22:58Z

pandas/core/groupby/ops.py

+            if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct(
+                out.dtype.pyarrow_dtype
+            ):
+                out = npvalues


@jbrockmendel - I was surprised maybe_cast_pointwise_result was giving us back Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array in a followup. Any thoughts?

rhshadrach · 2024-05-04T12:52:25Z

pandas/tests/extension/test_arrow.py

+            if pa.types.is_date(pa_dtype):
+                return "date32[day][pyarrow]"
+            elif pa.types.is_time(pa_dtype):
+                return "time64[us][pyarrow]"
+            elif pa.types.is_decimal(pa_dtype):
+                return ArrowDtype(pa.decimal128(4, 3))


On closer look, I think this is a bug being introduced here. This test is using .first(), it should be preserving the dtype in all cases. The changes in this PR now ignore the preserve_dtype argument of agg_series. When that is true, we should be passing same_dtype=True to maybe_cast_pointwise_result.

rhshadrach · 2024-05-08T22:06:43Z

pandas/tests/extension/test_arrow.py

@@ -1125,6 +1125,27 @@ def test_comp_masked_numpy(self, masked_dtype, comparison_op):
        expected = pd.Series(exp, dtype=ArrowDtype(pa.bool_()))
        tm.assert_series_equal(result, expected)

+    def test_groupby_agg_extension(self, data_for_grouping):


I think this test should behave the same as the one in the base class. If that's the case, this can be removed. Can you confirm?

rhshadrach · 2024-05-15T20:31:59Z

I think merging main once more should resolve the CI issues.

Kei added 5 commits April 1, 2024 19:04

Set preserve_dtype flag for bool type only when result is also bool

9faa460

Update implementation to change type to pyarrow only

969d5b1

Change import order

66114f3

Convert numpy array to pandas representation of pyarrow array

b0290ed

Add tests

20c8fa0

undermyumbrella1 requested review from rhshadrach and WillAyd as code owners April 3, 2024 10:48

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

97b3d54

mroeschke requested changes Apr 3, 2024

mroeschke added the Apply (Apply, Aggregate, Transform) and Arrow (pyarrow functionality) labels Apr 3, 2024

Kei added 3 commits April 5, 2024 14:19

Change pyarrow to optional import in agg_series() method

932d737

Seperate tests

82ddeb5

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

d510052

undermyumbrella1 marked this pull request as draft April 5, 2024 07:05

Kei added 5 commits April 8, 2024 20:41

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

62a31d9

Revert to old implementation

a54bf58

Update implementation to use pyarrow array method

64330f0

Update test_aggregate tests

0647711

Move pyarrow import to top of method

affde38

undermyumbrella1 marked this pull request as ready for review April 8, 2024 17:01

rhshadrach reviewed Apr 10, 2024

Kei added 5 commits April 12, 2024 13:36

Update according to pr comments

842f561

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

93b5bf3

Fallback convert to input dtype is output is all nan or empty array

6f35c0e

Strip na values when inferring pyarrow dtype

abd0adf

Update tests to check expected inferred dtype instead of inputy dtype

bebc442

Kei added 5 commits April 24, 2024 01:28

In agg series, convert to np values, then cast to pyarrow dtype, account for missing pyarrow dtypes

4ef96f7

Update tests

c6a98c0

Update rst docs

9181eaf

Update impl to fix tests

612d7d0

Declare variable in outer scope

3b6696b

undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from e290535 to 3b6696b Compare April 25, 2024 16:00

rhshadrach reviewed Apr 25, 2024

Update impl to use maybe_cast_pointwise_result instead of maybe_cast_to_pyarrow_array

680e238

undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from 8a95274 to 680e238 Compare April 29, 2024 06:57

Fix tests with nested array

3a8597e

undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from ed27650 to 3a8597e Compare April 29, 2024 08:29

rhshadrach requested changes Apr 29, 2024

Kei added 2 commits May 2, 2024 13:22

Update according to pr comments

6496b15

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

712c36a

undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from ad15c86 to 712c36a Compare May 2, 2024 05:23

rhshadrach requested changes May 4, 2024

Kei and others added 4 commits May 7, 2024 12:53

Preserve_dtype if argument is passed in, else don't preserve

e1ccef6

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

0ce083d

Update tests

a1d73f5

Merge branch 'fix/group_by_agg_pyarrow_bool_numpy_same_type' of github.com:undermyumbrella1/pandas into fix/group_by_agg_pyarrow_bool_numpy_same_type

57845a8

rhshadrach requested changes May 8, 2024

undermyumbrella1 and others added 3 commits May 12, 2024 15:39

Remove redundant tests

fa257b0

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

0a9b83f

retrigger pipeline

139319a
