Describe the issue:
Under certain conditions (aggregations over groups that contain only missing values, when using PyArrow floats), Dask creates NaNs of an unexpected type. PyArrow dtypes are nullable, so it should create <NA>, but it creates NaN, which is then incorrectly not treated as a missing value.
Minimal Complete Verifiable Example:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    "x": range(15),
    "y": [pd.NA] * 10 + [1.0] * 5,
    "group": ["a"] * 5 + ["b"] * 5 + ["c"] * 5
})
ddf = dd.from_pandas(df, npartitions=2).astype({
    "y": "double[pyarrow]"
})
print(ddf.compute())
#      x     y group
# 0    0  <NA>     a
# 1    1  <NA>     a
# 2    2  <NA>     a
# 3    3  <NA>     a
# 4    4  <NA>     a
# 5    5  <NA>     b
# 6    6  <NA>     b
# 7    7  <NA>     b
# 8    8  <NA>     b
# 9    9  <NA>     b
# 10  10   1.0     c
# 11  11   1.0     c
# 12  12   1.0     c
# 13  13   1.0     c
# 14  14   1.0     c

ddf = ddf.groupby("group").agg(x=("x", "mean"), y=("y", "mean"))
print(ddf.compute())
#           x    y
# group
# a       2.0  NaN
# b       7.0  NaN
# c      12.0  1.0
# (Note that groups with only <NA> resulted in NaN rather than <NA>.)

# These NaNs are then not recognized as missing values:
print(ddf["y"].isnull().sum().compute())
# 0
# (While the expected result is 2.)

# This breaks other functions, for example further aggregations in pandas:
print(ddf["y"].compute().mean(skipna=True))
# nan
# (While the expected result is 1.0.)

# And also further aggregations in Dask, albeit with a different unexpected result:
print(ddf["y"].mean(skipna=True).compute())
# 0.0
# (While the expected result is 1.0.)
Anything else we need to know?:
If we use NumPy dtypes, i.e. convert y to float64 rather than double[pyarrow], everything works as expected.
Environment:
Dask version: 2024.4.1
Python version: 3.10.13
Operating System: MacOS on the client, dask running inside Docker based on ghcr.io/dask/dask image on the cluster.
Install method (conda, pip, source): pip on the client, using ghcr.io/dask/dask as the base image on the cluster.