
When using PyArrow dtypes, aggregations create NaNs of unexpected type #11116

Open
nprihodko opened this issue May 10, 2024 · 1 comment · May be fixed by #11118
Labels
needs triage Needs a response from a contributor

Comments

@nprihodko

Describe the issue:

Under certain conditions (aggregating groups that contain only missing values, with PyArrow float dtypes), Dask creates NaNs of an unexpected type. PyArrow types are nullable, so the result should be <NA>, but Dask produces NaN, which is then incorrectly not treated as a missing value.

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    "x": range(15),
    "y": [pd.NA] * 10 + [1.0] * 5,
    "group": ["a"] * 5 + ["b"] * 5 + ["c"] * 5
})

ddf = dd.from_pandas(df, npartitions=2).astype({
    "y": "double[pyarrow]"
})
print(ddf.compute())
#      x     y group
# 0    0  <NA>     a
# 1    1  <NA>     a
# 2    2  <NA>     a
# 3    3  <NA>     a
# 4    4  <NA>     a
# 5    5  <NA>     b
# 6    6  <NA>     b
# 7    7  <NA>     b
# 8    8  <NA>     b
# 9    9  <NA>     b
# 10  10   1.0     c
# 11  11   1.0     c
# 12  12   1.0     c
# 13  13   1.0     c
# 14  14   1.0     c

ddf = ddf.groupby("group").agg(x=("x", "mean"), y=("y", "mean"))
print(ddf.compute())
#           x    y
# group           
# a       2.0  NaN
# b       7.0  NaN
# c      12.0  1.0
# (Note that groups with <NA> resulted in NaN rather than <NA>)

# These NaNs are then not recognized as missing values
print(ddf["y"].isnull().sum().compute())
# 0
# (While the expected result is 2).

# This breaks other functions, for example further aggregations in pandas
print(ddf["y"].compute().mean(skipna=True))
# nan
# (While the expected result is 1.0).

# And also further aggregations in Dask, albeit with a different unexpected result
print(ddf["y"].mean(skipna=True).compute())
# 0.0
# (While the expected result is 1.0)

Anything else we need to know?:

If we use NumPy dtypes, i.e. convert y to float64 rather than double[pyarrow], everything works as expected.

Environment:

  • Dask version: 2024.4.1
  • Python version: 3.10.13
  • Operating System: macOS on the client; Dask runs on the cluster inside Docker, based on the ghcr.io/dask/dask image.
  • Install method (conda, pip, source): pip on the client; the cluster uses ghcr.io/dask/dask as the base image.
@github-actions github-actions bot added the needs triage Needs a response from a contributor label May 10, 2024
@nprihodko
Author

Maybe related to pandas-dev/pandas#58151?

@phofl phofl linked a pull request May 13, 2024 that will close this issue