New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert result of group by agg to pyarrow if input is pyarrow #58129
base: main
Are you sure you want to change the base?
Convert result of group by agg to pyarrow if input is pyarrow #58129
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related to a particular issue?
I have added the issue in the pr, the pr is still work in progress |
Hi, I have completed the implementation, may i check why linux test if failing with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! This appears to me to be a fairly far reaching, and I don't yet feel comfortable given that we have to consider many different cases since the user can provide an arbitrary UDF. It seems to me that the logic "convert to pyarrow dtypes when we can" could result in some surprising behaviors. For example:
df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")
result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]
result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object
While I experiment with this some more, a few questions.
Thank you for the review. For this example, it is expected as that is how pyarrow represents these data structures. E.g homogenous int list and heterogenous object. Alternatively, what would be the expected dtype in this case? |
4021573
to
6dc40f5
Compare
Sorry for the delay in resolving the test cases. Originally the approach is to always caste it back to input dtype, even if the output dtype is different. Currenlty the approach is to let pyarow infer the dtype of the output. However this can result in some unexpected cases as mentioned above. May I check what would be suggested output dtype to fix this issue? For windows run, it seems to be failing as it is unable to import pyarrow in files unrelated to the code change in the pr, not sure if i have to change some configs. |
I think this is the same problem as the discussion here: #58258 (comment). This appears to me something that will need careful implementation on the EA side. |
e290535
to
3b6696b
Compare
Noted, I have made the change according to the comments using Some of the windows tests are failing due to unable to import pyarrow in other section of the code unrelated to this change. Not sure if this is an issue on my side. |
8a95274
to
680e238
Compare
ed27650
to
3a8597e
Compare
Thank you for the review, I have updated the pr according to comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two general remarks about the tests:
- Use other pandas methods as little as possible; try to construct what you want directly.
- It looks like many of the tests added here can be parametrized, can you give that a shot.
For the first, you can do things like
df = pd.DataFrame(
{
"a": pd.array([1, 2, 3], dtype="..."),
"b": pd.array([True, False, True], dtype="..."),
},
index=pd.Index([1, 2, 3]),
)
instead of using astype
, set_index
, and the like.
if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct( | ||
out.dtype.pyarrow_dtype | ||
): | ||
out = npvalues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a test that hits this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved, the test_agg_lambda_pyarrow_struct_to_object_dtype_conversion
test hits this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel - I was surprised maybe_cast_pointwise_result
was giving us back a Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array
in a followup. Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
giving us back a Arrow dtypes we don't have EAs for
Can you give an example? this confuses me.
should maybe go in dtypes.cast._maybe_cast_to_extension_array
_maybe_cast_to_extension_array is only used in maybe_cast_pointwise_result, so not a huge deal either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from pandas.core.dtypes.cast import maybe_cast_pointwise_result
arr = np.array([{"number": 1}])
result = maybe_cast_pointwise_result(
arr,
dtype=pd.ArrowDtype(pa.int64()),
numeric_only=True,
same_dtype=False,
)
print(result)
# Length: 1, dtype: struct<number: int64>[pyarrow]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel - sorry for the noise, I was not aware we could support struct dtypes. I think everything is okay here.
@undermyumbrella1 - why go with NumPy object dtype instead of struct dtypes here?
ad15c86
to
712c36a
Compare
Thank you foe the review, I have made changes according to the pr comments. |
if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct( | ||
out.dtype.pyarrow_dtype | ||
): | ||
out = npvalues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel - I was surprised maybe_cast_pointwise_result
was giving us back a Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array
in a followup. Any thoughts?
pandas/tests/extension/test_arrow.py
Outdated
if pa.types.is_date(pa_dtype): | ||
return "date32[day][pyarrow]" | ||
elif pa.types.is_time(pa_dtype): | ||
return "time64[us][pyarrow]" | ||
elif pa.types.is_decimal(pa_dtype): | ||
return ArrowDtype(pa.decimal128(4, 3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On closer look, I think this is a bug being introduced here. This test is using .first()
, it should be preserving the dtype in all cases. The changes in this PR now ignore the preserve_dtype
argument of agg_series
. When that is true, we should be passing same_dtype=True
to maybe_cast_pointwise_result
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
…b.com:undermyumbrella1/pandas into fix/group_by_agg_pyarrow_bool_numpy_same_type
pandas/tests/extension/test_arrow.py
Outdated
@@ -1125,6 +1125,27 @@ def test_comp_masked_numpy(self, masked_dtype, comparison_op): | |||
expected = pd.Series(exp, dtype=ArrowDtype(pa.bool_())) | |||
tm.assert_series_equal(result, expected) | |||
|
|||
def test_groupby_agg_extension(self, data_for_grouping): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test should behave the same as the one in the base class. If that's the case, this can be removed. Can you confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Root cause:
agg_series
always forces output dtype to be the same as input dtype, but depending on the lambda, the output dtype can be differentFix:
maybe_convert_object
, asmaybe_convert_object
does not check for NA, and forces dtype to float if NA is present (NA is not float in pyarrow),