
When using PyArrow dtypes, aggregations create NaNs of unexpected type #11116

Open
nprihodko opened this issue May 10, 2024 · 1 comment · May be fixed by #11118
Labels
needs triage Needs a response from a contributor

Comments

@nprihodko

Describe the issue:

Under certain conditions (aggregating groups that contain only missing values, with PyArrow float dtypes), Dask creates NaNs of an unexpected type. PyArrow types are nullable, so the result should be <NA>, but Dask produces NaN, which is then incorrectly not treated as a missing value.

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    "x": range(15),
    "y": [pd.NA] * 10 + [1.0] * 5,
    "group": ["a"] * 5 + ["b"] * 5 + ["c"] * 5
})

ddf = dd.from_pandas(df, npartitions=2).astype({
    "y": "double[pyarrow]"
})
print(ddf.compute())
#      x     y group
# 0    0  <NA>     a
# 1    1  <NA>     a
# 2    2  <NA>     a
# 3    3  <NA>     a
# 4    4  <NA>     a
# 5    5  <NA>     b
# 6    6  <NA>     b
# 7    7  <NA>     b
# 8    8  <NA>     b
# 9    9  <NA>     b
# 10  10   1.0     c
# 11  11   1.0     c
# 12  12   1.0     c
# 13  13   1.0     c
# 14  14   1.0     c

ddf = ddf.groupby("group").agg(x=("x", "mean"), y=("y", "mean"))
print(ddf.compute())
#           x    y
# group           
# a       2.0  NaN
# b       7.0  NaN
# c      12.0  1.0
# (Note that groups with <NA> resulted in NaN rather than <NA>)

# These NaNs are then not recognized as missing values
print(ddf["y"].isnull().sum().compute())
# 0
# (While the expected result is 2).

# This breaks other functions, for example further aggregations in pandas
print(ddf["y"].compute().mean(skipna=True))
# nan
# (While the expected result is 1.0).

# And also further aggregations in Dask, albeit with a different unexpected result
print(ddf["y"].mean(skipna=True).compute())
# 0.0
# (While the expected result is 1.0)

Anything else we need to know?:

If we use NumPy dtypes, i.e. convert y to float64 rather than double[pyarrow], everything works as expected.

Environment:

  • Dask version: 2024.4.1
  • Python version: 3.10.13
  • Operating System: macOS on the client; Dask runs on the cluster inside Docker, based on the ghcr.io/dask/dask image.
  • Install method (conda, pip, source): pip on the client; the cluster uses ghcr.io/dask/dask as the base image.
@github-actions github-actions bot added the needs triage Needs a response from a contributor label May 10, 2024
@nprihodko
Author

Maybe related to pandas-dev/pandas#58151?

@phofl phofl linked a pull request May 13, 2024 that will close this issue