BUG/PERF: groupby.transform with unobserved categories #58084

undermyumbrella1 · 2024-03-30T09:07:50Z

closes BUG: groupby.transform with a reducer and unobserved categories coerces dtype #55326
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

asishm · 2024-03-30T12:19:12Z

Is there an issue linked with this?

Aloqeely · 2024-03-30T20:21:36Z

Is there an issue linked with this?

No clue.
@undermyumbrella1 I'd appreciate an explanation of what this change accomplishes. Also, please make sure all the tests pass.

undermyumbrella1 · 2024-03-31T15:05:10Z

This is a work in progress for issue #55326; I have added the issue number.

undermyumbrella1 · 2024-04-05T06:50:10Z

OK, the PR implementation is complete.

rhshadrach

Thanks for the PR! In addition to the issue highlighted below, I think it might be a better approach to compute the result using only observed data for transforms. Not only would that fix this issue, but it would also give a good performance gain. This is on my radar to look into and may not work out, but I think it should be tried first before other approaches. If you would like to give this a shot, please feel free!

rhshadrach · 2024-04-06T15:21:04Z

pandas/core/groupby/ops.py

+        if remove_nan:
+            mask = np.zeros(shape=values.shape, dtype=bool)
+            result_mask = np.zeros(shape=(1, ngroups), dtype=bool)


We can't just ignore mask, e.g. this gives the wrong result

data = pd.array([pd.NA, 2, 3, 4], dtype="Int64")
df = DataFrame({"key": ["a", "a", "b", "b"], "col": data})
grouped = df.groupby("key", observed=False)
print(grouped.transform("min"))
#    col
# 0    1
# 1    1
# 2    3
# 3    3
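
To illustrate why the mask cannot be dropped, here is a minimal pure-Python sketch of a group-wise min that is broadcast back to the rows. It is not the pandas implementation. The function name `grouped_min_transform` is hypothetical, and the placeholder value 1 stored in the masked slot is an assumption chosen to match the output shown above.

```python
def grouped_min_transform(keys, values, mask=None):
    """Broadcast each group's min back to its rows, skipping masked entries."""
    if mask is None:
        # Treat every value as valid, which is what ignoring the mask amounts to.
        mask = [False] * len(values)
    mins = {}
    for key, value, is_na in zip(keys, values, mask):
        if is_na:
            continue
        if key not in mins or value < mins[key]:
            mins[key] = value
    return [mins.get(key) for key in keys]


keys = ["a", "a", "b", "b"]
raw = [1, 2, 3, 4]  # assumed placeholder 1 sits under the masked (NA) slot
mask = [True, False, False, False]

ignoring_mask = grouped_min_transform(keys, raw)
respecting_mask = grouped_min_transform(keys, raw, mask)
```

Under these assumptions, ignoring the mask lets the placeholder leak into group "a", while respecting it yields 2 for both rows of that group.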

pandas/tests/groupby/transform/test_transform.py

rhshadrach · 2024-04-06T15:24:07Z

pandas/core/groupby/groupby.py

@@ -3089,6 +3139,7 @@ def min(
        min_count: int = -1,
        engine: Literal["cython", "numba"] | None = None,
        engine_kwargs: dict[str, bool] | None = None,
+        **kwargs,


I think we should try very hard to avoid adding kwargs to a method for internal use.

undermyumbrella1 · 2024-04-17T16:47:15Z

Hi, thank you for the PR review. I have changed my implementation to temporarily set observed to True (and swap in the respective groupers), so that transform returns the correct result.

I initially tried to change the result of getattr(self, func)(*args, **kwargs) by using grouped reduce to map each result block to the out_dtype determined in _cython_operation. However, this implementation turned out to be far too complicated: the out_dtype, out_shape, and views of the original value block are determined by the entire nested sequence of method calls, and extracting that logic proved difficult.

rhshadrach

This is looking close to what I was envisioning, though more attributes appear to need to be modified than I was hoping. This introduces fragility (e.g. adding a new cached attribute could break things) and possibly hard-to-detect bugs (issues that would only show up if you reuse a groupby instance with two different operations in a certain order). It's still the best way I see to solve it.
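
The fragility described above can be sketched with a toy class. This is an illustration only: `ToyGroupBy`, `n_groups`, and the local `temp_setattr` helper are hypothetical stand-ins written for this example, not pandas code.

```python
from contextlib import contextmanager
from functools import cached_property


@contextmanager
def temp_setattr(obj, attr, value):
    """Temporarily set an attribute, restoring the old value on exit."""
    old = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield obj
    finally:
        setattr(obj, attr, old)


class ToyGroupBy:
    def __init__(self, observed):
        self.observed = observed

    @cached_property
    def n_groups(self):
        # Pretend there are 2 observed groups out of 5 categories.
        return 2 if self.observed else 5


gb = ToyGroupBy(observed=False)
with temp_setattr(gb, "observed", True):
    inside = gb.n_groups  # computed while observed is True, then cached

after = gb.n_groups  # observed is False again, but the cached value is reused
fresh = ToyGroupBy(observed=False).n_groups
```

Here `after` keeps the value computed inside the context manager, whereas a fresh instance gives a different answer. Whether the bug appears depends on which operation touches the cached attribute first, which is what makes this class of issue hard to detect.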

rhshadrach · 2024-04-18T21:12:56Z

pandas/core/groupby/groupby.py

+            grouper, exclusions, obj = get_grouper(
+                self.orig_obj,
+                self.keys,
+                level=self.level,
+                sort=self.sort,
+                observed=True,
+                dropna=self.dropna,
+            )


I think we'll want to cache this on the groupby instance - we do not want to have to recompute it if the groupby is reused.

Resolved; the groupby init now accepts observed_grouper and observed_exclusions params.

rhshadrach · 2024-04-18T21:13:14Z

pandas/core/groupby/groupby.py

+                com.temp_setattr(self, "observed", True),
+                com.temp_setattr(self, "_grouper", grouper),
+                com.temp_setattr(self, "exclusions", exclusions),
+                com.temp_setattr(self, "obj", obj, condition=obj_has_not_changed),


Why can't we unconditionally set obj here?

resolved, removed setting obj to obj

undermyumbrella1 · 2024-04-20T10:31:03Z

Thank you for the review, I have made the changes as requested

rhshadrach · 2024-04-20T11:19:35Z

pandas/core/groupby/groupby.py

+        "observed_grouper",
+        "observed_exclusions",


Instead of this, I recommend adding it as a cached method on the BaseGrouper class in ops.py.

@cache_readonly
def observed_grouper(self):
    if all(ping._observed for ping in self.groupings):
        return self
    grouper = BaseGrouper(...)
    return grouper

For this to work, you also need to do the same to Grouping:

@cache_readonly
def observed_grouping(self):
    if self._observed:
        return self
    grouping = Grouping(...)
    return grouping

and use the observed_groupings in the BaseGrouper call above. For BinGrouper, I think you can just always return self (its behavior doesn't change between observed=True and observed=False).

Also, you can ignore exclusions - this is independent of the grouping data stored in BaseGrouper/Grouping.
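
The caching pattern suggested above can be sketched with plain Python classes. This is a simplified illustration, not the code that was merged: `ToyGrouping` and `ToyGrouper` are hypothetical, `functools.cached_property` stands in for pandas' `cache_readonly`, and the recoding of categories is an assumed detail.

```python
from functools import cached_property


class ToyGrouping:
    def __init__(self, codes, categories, observed):
        self.codes = codes
        self.categories = categories
        self._observed = observed

    @cached_property
    def observed_grouping(self):
        if self._observed:
            return self
        # Keep only the categories that actually occur and renumber the codes.
        used = sorted(set(self.codes))
        remap = {old: new for new, old in enumerate(used)}
        return ToyGrouping(
            [remap[code] for code in self.codes],
            [self.categories[i] for i in used],
            observed=True,
        )


class ToyGrouper:
    def __init__(self, groupings):
        self.groupings = groupings

    @cached_property
    def observed_grouper(self):
        if all(ping._observed for ping in self.groupings):
            return self
        return ToyGrouper([ping.observed_grouping for ping in self.groupings])


grouper = ToyGrouper([ToyGrouping([0, 2, 0], ["x", "y", "z"], observed=False)])
observed = grouper.observed_grouper
```

Because the property is cached, reusing the same grouper for a second transform returns the same observed grouper object rather than rebuilding it, and an already-observed grouper simply returns itself.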

Ah my bad, I have made the changes as suggested

undermyumbrella1 · 2024-04-21T08:49:02Z

Thank you for the review, i have made the changes as suggested

rhshadrach · 2024-04-25T21:32:34Z

Thanks for the changes @undermyumbrella1 - this is looking good! I have some minor refactor/style requests, but I'd like to get another eye here before any more work is done.

@mroeschke - would you be able to take a look? In addition to the issue linked in the OP, this is fixing a regression caused by #55738:

N = 10**3
data = {
    "a1": Categorical(np.random.randint(100, size=N), categories=np.arange(N)),
    "a2": Categorical(np.random.randint(100, size=N), categories=np.arange(N)),
    "b": np.random.random(N),
}
df = DataFrame(data)
%timeit df.groupby(["a1", "a2"], observed=False)["b"].transform("sum")
# 6.83 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- main
# 687 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR

While it's undesirable to swap out the grouper as is done here, I do not see any better way. There may be more efficient ways of computing the observed codes / result_index, but that can be readily built upon this later on.
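
As a rough back-of-the-envelope check on why working only with observed groups can help in the benchmark above, the sketch below counts potential versus observed group combinations for data of the same shape. It uses the standard library `random` module instead of NumPy, and it is a plausible explanation of the gap rather than a measurement of it.

```python
import random

N = 10**3
random.seed(0)

# Two key columns drawing from 100 distinct values, each declared with N categories.
a1 = [random.randrange(100) for _ in range(N)]
a2 = [random.randrange(100) for _ in range(N)]
n_categories = N

# With observed=False, every pair of categories is a potential group.
potential_groups = n_categories * n_categories

# With observed data only, there can be no more groups than rows.
observed_groups = len(set(zip(a1, a2)))
```

The potential group count is a million, while the observed count cannot exceed the 1,000 rows, so any step that scales with the number of groups has far less work to do on the observed-only path.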

pandas/core/groupby/grouper.py

undermyumbrella1 · 2024-04-29T08:40:44Z

Thank you for the review, I have updated the pr according to comments.

mroeschke

Looks OK to me

rhshadrach

A few style requests, otherwise looks great!

pandas/tests/groupby/transform/test_transform.py

rhshadrach · 2024-04-30T21:28:23Z

pandas/tests/groupby/transform/test_transform.py

+
+
+# GH#58084
+def test_min_multiple_unobserved_categories_no_type_coercion():


This seems redundant to me - I think the above test is sufficient here.

rhshadrach · 2024-04-30T21:30:19Z

pandas/tests/groupby/transform/test_transform.py

+
+
+# GH#58084
+def test_min_float32_multiple_unobserved_categories_no_type_coercion():


Can you instead parametrize test_min_one_unobserved_category_no_type_coercion? Something like:

@pytest.mark.parametrize("dtype", ["int32", "float32"])
def test_min_one_unobserved_category_no_type_coercion(dtype):
    ...
    df["B"] = df["B"].astype(dtype)

rhshadrach · 2024-04-30T21:31:15Z

pandas/tests/groupby/transform/test_transform.py

+                categories=[
+                    1,
+                    "randomcat",
+                    100,
+                    333,
+                    "cat43543",
+                    -4325466,
+                    54665,
+                    -546767,
+                    "432945",
+                    767076,


I don't think there is a need for so many here - can you make it 1-3 categories (so the test is more compact).

rhshadrach · 2024-04-30T21:32:22Z

pandas/core/groupby/generic.py

@@ -2044,6 +2044,7 @@ def _gotitem(self, key, ndim: int, subset=None):
        elif ndim == 1:
            if subset is None:
                subset = self.obj[key]
+


Can you revert this line addition?

This still appears in the diff of this PR.

undermyumbrella1 · 2024-05-02T04:19:02Z

Thank you for the review, I have updated the pr according to comments.

rhshadrach

Looking really good - just some unintentional changes to core/generic.py and core/groupby/generic.py - I think you deleted a line from the former instead of the latter 😄

Also - a note about force pushing. Force pushing on your PR is okay, but do know it can make review a little harder. Namely, when you force push the "Show changes since your last review" option no longer works.

rhshadrach · 2024-05-04T12:06:24Z

pandas/core/groupby/generic.py

@@ -2044,6 +2044,7 @@ def _gotitem(self, key, ndim: int, subset=None):
        elif ndim == 1:
            if subset is None:
                subset = self.obj[key]
+


This still appears in the diff of this PR.

rhshadrach · 2024-05-04T12:06:39Z

pandas/core/generic.py

@@ -2055,7 +2055,6 @@ def __setstate__(self, state) -> None:
                object.__setattr__(self, "_attrs", attrs)
                flags = state.get("_flags", {"allows_duplicate_labels": True})
                object.__setattr__(self, "_flags", Flags(self, **flags))
-


Can you revert this line removal? There shouldn't be any diff in this file.

undermyumbrella1 · 2024-05-07T04:17:59Z

Thank you for the review, I have updated the pr according to comments. Noted on force pushing

rhshadrach

lgtm

rhshadrach · 2024-05-08T22:34:09Z

Thanks @undermyumbrella1 - very nice!

undermyumbrella1 requested a review from rhshadrach as a code owner March 30, 2024 09:07

rhshadrach requested changes Apr 6, 2024

mroeschke added the Groupby, Categorical (Categorical Data Type), and Apply (Apply, Aggregate, Transform) labels Apr 9, 2024

Kei added 2 commits April 17, 2024 17:01

Temporarily change observed=True, for groupby.transform

a52e7fe

Add tests

898fd12

undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from be71a4d to 898fd12 Compare April 17, 2024 09:02

Kei added 2 commits April 17, 2024 22:42

Add orig_obj in BaseGroupBy hidden attr

5311004

Update tests according to pr comments

fb548ad

undermyumbrella1 closed this Apr 17, 2024

undermyumbrella1 reopened this Apr 17, 2024

Move orig_obj arg in constructor to last param, to account for possible empty param

baa1b28

undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from 8c1cef0 to baa1b28 Compare April 17, 2024 16:09

rhshadrach reviewed Apr 18, 2024

Move calculation of observed grouper to when initialising groupby

30013ee

undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from af75b3a to 30013ee Compare April 20, 2024 08:38

Only calculate observed_grouper when grouper is absent to account to edge agg cases

3b9d27b

undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from 73a6fef to 3b9d27b Compare April 20, 2024 09:48

Merge branch 'main' into fix/type_coercion_for_unobserved_categories

4221c34

rhshadrach reviewed Apr 20, 2024

Kei added 3 commits April 20, 2024 19:59

Remove observed exclusions

8588a1e

Add observed grouper/grouping as cached method

8e669d9

change return type to grouping

0d9f89d

Kei added 2 commits April 25, 2024 20:32

Update rst docs

84f83ae

Update rst docs

cbabce0

mroeschke reviewed Apr 26, 2024

Cache observed grouping/grouper instead of self obj

bcca14f

mroeschke approved these changes Apr 30, 2024

View reviewed changes

rhshadrach requested changes Apr 30, 2024

Update according to pr comments

f3a3f63

undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from 49f5a1e to f3a3f63 Compare May 2, 2024 03:13

Merge main

58e759f

undermyumbrella1 force-pushed the fix/type_coercion_for_unobserved_categories branch from 64aa8cd to 58e759f Compare May 2, 2024 03:25

rhshadrach requested changes May 4, 2024

Kei and others added 3 commits May 7, 2024 10:47

Revert unintentional changes

7f99b71

Revert unintentional changes

4364440

Merge branch 'main' into fix/type_coercion_for_unobserved_categories

5d63405

rhshadrach approved these changes May 8, 2024

rhshadrach added the Bug and Performance (Memory or execution speed performance) labels May 8, 2024

rhshadrach changed the title ~~Use mask to create result_mask that filters nan categories~~ BUG/PERF: Use mask to create result_mask that filters nan categories May 8, 2024

rhshadrach changed the title ~~BUG/PERF: Use mask to create result_mask that filters nan categories~~ BUG/PERF: groupby.transform with unobserved categories May 8, 2024

rhshadrach added this to the 3.0 milestone May 8, 2024

rhshadrach merged commit 8d543ba into pandas-dev:main May 8, 2024
52 checks passed


