Convert result of group by agg to pyarrow if input is pyarrow #58129

Open
wants to merge 38 commits into
base: main
Changes from 14 commits
Commits
9faa460
Set preserve_dtype flag for bool type only when result is also bool
Apr 1, 2024
969d5b1
Update implementation to change type to pyarrow only
Apr 2, 2024
66114f3
Change import order
Apr 2, 2024
b0290ed
Convert numpy array to pandas representation of pyarrow array
Apr 3, 2024
20c8fa0
Add tests
Apr 3, 2024
97b3d54
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
Apr 3, 2024
932d737
Change pyarrow to optional import in agg_series() method
Apr 5, 2024
82ddeb5
Seperate tests
Apr 5, 2024
d510052
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
Apr 5, 2024
62a31d9
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
Apr 8, 2024
a54bf58
Revert to old implementation
Apr 8, 2024
64330f0
Update implementation to use pyarrow array method
Apr 8, 2024
0647711
Update test_aggregate tests
Apr 8, 2024
affde38
Move pyarrow import to top of method
Apr 8, 2024
842f561
Update according to pr comments
Apr 12, 2024
93b5bf3
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
Apr 20, 2024
6f35c0e
Fallback convert to input dtype is output is all nan or empty array
Apr 20, 2024
abd0adf
Strip na values when inferring pyarrow dtype
Apr 20, 2024
bebc442
Update tests to check expected inferred dtype instead of inputy dtype
Apr 20, 2024
bb6343b
Override test case for test_arrow.py
Apr 21, 2024
3a3f2a2
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
Apr 21, 2024
6dc40f5
Empty commit to trigger build run
Apr 21, 2024
4ef96f7
In agg series, convert to np values, then cast to pyarrow dtype, acco…
Apr 23, 2024
c6a98c0
Update tests
Apr 23, 2024
9181eaf
Update rst docs
Apr 25, 2024
612d7d0
Update impl to fix tests
Apr 25, 2024
3b6696b
Declare variable in outer scope
Apr 25, 2024
680e238
Update impl to use maybe_cast_pointwise_result instead of maybe_cast…
Apr 29, 2024
3a8597e
Fix tests with nested array
Apr 29, 2024
6496b15
Update according to pr comments
May 2, 2024
712c36a
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
May 2, 2024
e1ccef6
Preserve_dtype if argument is passed in, else don't preserve
May 7, 2024
0ce083d
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
undermyumbrella1 May 7, 2024
a1d73f5
Update tests
May 7, 2024
57845a8
Merge branch 'fix/group_by_agg_pyarrow_bool_numpy_same_type' of githu…
May 7, 2024
fa257b0
Remove redundant tests
undermyumbrella1 May 12, 2024
0a9b83f
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
undermyumbrella1 May 12, 2024
139319a
retrigger pipeline
undermyumbrella1 May 12, 2024
36 changes: 36 additions & 0 deletions pandas/core/dtypes/cast.py
@@ -478,6 +478,42 @@ def maybe_cast_pointwise_result(
     return result


+def maybe_cast_to_pyarrow_dtype(
+    result: ArrayLike, converted_result: ArrayLike
+) -> ArrayLike:
+    """
+    Try casting result of a pointwise operation to its pyarrow dtype if
+    appropriate.
+
+    Parameters
+    ----------
+    result : array-like
+        Result to cast.
+
+    Returns
+    -------
+    result : array-like
+        result maybe casted to the dtype.
+    """
+    try:
+        import pyarrow as pa
+        from pyarrow import (
+            ArrowInvalid,
+            ArrowNotImplementedError,
+        )
+
+        from pandas.core.construction import array as pd_array
+
+        result[isna(result)] = np.nan
+        pyarrow_result = pa.array(result)
+        pandas_pyarrow_dtype = ArrowDtype(pyarrow_result.type)
+        result = pd_array(result, dtype=pandas_pyarrow_dtype)
+    except (ArrowNotImplementedError, ArrowInvalid):
+        return converted_result
+
+    return result


 def _maybe_cast_to_extension_array(
     cls: type[ExtensionArray], obj: ArrayLike, dtype: ExtensionDtype | None = None
 ) -> ArrayLike:
10 changes: 8 additions & 2 deletions pandas/core/groupby/ops.py
@@ -36,6 +36,7 @@

 from pandas.core.dtypes.cast import (
     maybe_cast_pointwise_result,
+    maybe_cast_to_pyarrow_dtype,
     maybe_downcast_to_dtype,
 )
 from pandas.core.dtypes.common import (
@@ -51,6 +52,7 @@
 )

 from pandas.core.arrays import Categorical
+from pandas.core.arrays.arrow.array import ArrowExtensionArray
 from pandas.core.frame import DataFrame
 from pandas.core.groupby import grouper
 from pandas.core.indexes.api import (
@@ -914,18 +916,22 @@ def agg_series(
         np.ndarray or ExtensionArray
         """

-        if not isinstance(obj._values, np.ndarray):
+        if not isinstance(obj._values, np.ndarray) and not isinstance(
+            obj._values, ArrowExtensionArray
+        ):
             # we can preserve a little bit more aggressively with EA dtype
             # because maybe_cast_pointwise_result will do a try/except
             # with _from_sequence. NB we are assuming here that _from_sequence
             # is sufficiently strict that it casts appropriately.
             preserve_dtype = True

         result = self._aggregate_series_pure_python(obj, func)

         npvalues = lib.maybe_convert_objects(result, try_float=False)

         if preserve_dtype:
             out = maybe_cast_pointwise_result(npvalues, obj.dtype, numeric_only=True)
+        elif isinstance(obj._values, ArrowExtensionArray):
+            out = maybe_cast_to_pyarrow_dtype(result, npvalues)
         else:
             out = npvalues
         return out