
where dtype upcast with numpy 2 #8402

Open

djhoese opened this issue Nov 2, 2023 · 10 comments · May be fixed by #8946

@djhoese
Contributor

djhoese commented Nov 2, 2023

What happened?

I'm testing my code with numpy 2.0 and the current main branches of xarray and dask, and I ran into a behavior change. I suspect it is expected given the way xarray does things, but I want to confirm, since it could surprise many users.

Calling DataArray.where on an integer array narrower than 64 bits, with a plain Python int as the replacement value, upcasts the array to 64-bit integers. Older versions of numpy preserved the dtype of the array. As far as I can tell the relevant xarray code hasn't changed, so this seems to be about numpy making its promotion behavior more consistent.

The main problem seems to come down to:

arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]

This converts my scalar int input to a numpy array. Without that conversion, numpy behaves as expected. See the MCVE for the xarray-specific example; here's the numpy equivalent:

import numpy as np

a = np.zeros((2, 2), dtype=np.uint16)

# what I'm intending to do with my xarray `data_arr.where(cond, 2)`
np.where(a != 0, a, 2).dtype
# dtype('uint16')

# equivalent to what xarray does:
np.where(a != 0, a, np.asarray(2)).dtype
# dtype('int64')

# workaround, cast my scalar to a specific numpy type
np.where(a != 0, a, np.asarray(np.uint16(2))).dtype
# dtype('uint16')

From a numpy point of view, the second where call makes sense: two arrays should be promoted to a common dtype so they can be combined. But from an xarray user's point of view, I'm passing a scalar, so I expect the same result as the first where call above.
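
The dtype decision ultimately comes down to promotion, so here's a minimal promotion-only sketch of the difference (assuming NumPy 2's NEP 50 rules):

import numpy as np

a = np.zeros((2, 2), dtype=np.uint16)

# a bare Python int participates "weakly" in promotion, so the array dtype wins
print(np.result_type(a, 2))              # uint16

# wrapping the scalar first creates a 0-d int64 array, which drives the upcast
print(np.result_type(a, np.asarray(2)))  # int64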

What did you expect to happen?

See above.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

data_arr = xr.DataArray(np.array([1, 2], dtype=np.uint16))
print(data_arr.where(data_arr == 2, 3).dtype)
# int64

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

Numpy 1.x preserves the dtype.

In [1]: import numpy as np

In [2]: np.asarray(2).dtype
Out[2]: dtype('int64')

In [3]: a = np.zeros((2, 2), dtype=np.uint16)

In [4]: np.where(a != 0, a, np.asarray(2)).dtype
Out[4]: dtype('uint16')

In [5]: np.where(a != 0, a, np.asarray(np.uint16(2))).dtype
Out[5]: dtype('uint16')

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 6.4.6-76060406-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.10.2.dev21+gfcdc8102
pandas: 2.2.0.dev0+495.gecf449b503
numpy: 2.0.0.dev0+git20231031.42c33f3
scipy: 1.12.0.dev0+1903.18d0a2f
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7.post0.dev7
dask: 2023.10.1+4.g91098a63
distributed: 2023.10.1+5.g76dd8003
matplotlib: 3.9.0.dev0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.6.0
cupy: None
pint: 0.22
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.0.0
pip: 23.2.1
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.14.0
sphinx: 7.1.2
djhoese added the bug and needs triage labels on Nov 2, 2023
@dcherian
Contributor

dcherian commented Nov 2, 2023

Thanks for the well-written issue.

I'm not sure what to do here: asarray(x, xp=xp) for x in scalars_or_arrays quite intentionally converts scalars to 0-d arrays.

Perhaps @shoyer has thoughts.

dcherian removed the needs triage label on Nov 2, 2023
@djhoese
Contributor Author

djhoese commented Nov 2, 2023

FYI, it looks like this affects data_arr.where(cond, np.nan) too. This used to retain the float dtype of data_arr, but now 32-bit floats are upcast to 64-bit floats. I know a lot of work was done on NA handling in xarray, and this automatic np.nan 32/64-bit handling still worked after that; now it is back to upcasting.
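
The plain-numpy analogue for floats (again assuming NumPy 2 / NEP 50 promotion):

import numpy as np

f = np.array([1.0, 2.0], dtype=np.float32)

# a bare Python float stays "weak", so float32 is preserved
print(np.where(f != 0, f, np.nan).dtype)              # float32

# wrapping it first produces a 0-d float64 array, which forces float64
print(np.where(f != 0, f, np.asarray(np.nan)).dtype)  # float64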

@shoyer
Member

shoyer commented Nov 3, 2023

I think we should probably make an exception for Python built-in numbers when casting to arrays. Something closer to:

scalar_types = (int, float, complex)
arrays = [x if isinstance(x, scalar_types) else asarray(x, xp=xp) for x in scalars_or_arrays] 
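
As a rough standalone sketch of the effect (using plain np.asarray here in place of asarray(x, xp=xp)):

import numpy as np

scalar_types = (int, float, complex)
scalars_or_arrays = [np.zeros(2, dtype=np.uint16), 2]

# array-likes still go through asarray, but Python scalars pass through untouched,
# so NumPy's weak scalar promotion can keep the original array dtype downstream
arrays = [
    x if isinstance(x, scalar_types) else np.asarray(x)
    for x in scalars_or_arrays
]
print([type(x).__name__ for x in arrays])  # ['ndarray', 'int']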

@djhoese
Contributor Author

djhoese commented Nov 3, 2023

@shoyer do you recall why the scalar types are converted to arrays? I don't think your suggestion catches the case where np.nan is passed; if it has to stay that way I'll work around it, but if a similar change could handle it, that would make things easier for me. Before I can suggest anything, though, I'll need to figure out what the scalar -> array conversion is meant to catch/fix.

@keewis
Collaborator

keewis commented Nov 3, 2023

If I remember correctly, the array API does not support Python scalar types (or numpy scalars), so we have to convert them to 0-d arrays (see also the discussion in #7721).
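
For illustration, with the array_api_strict reference namespace (assuming a version tracking the 2022 revision of the standard, where where only accepts array arguments):

import array_api_strict as xp

a = xp.asarray([1, 2], dtype=xp.uint16)
cond = a == xp.asarray(2, dtype=xp.uint16)

# a bare Python scalar is not a valid argument under that standard revision, which
# is why xarray wraps every input with asarray(..., xp=xp) -- and the resulting
# 0-d array then participates in dtype promotion like any other array
filled = xp.where(cond, a, xp.asarray(3))
print(filled.dtype)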

@shoyer
Member

shoyer commented Nov 4, 2023 via email

@djhoese
Contributor Author

djhoese commented Nov 6, 2023

I started hacking around with your idea @shoyer and realized that np.nan is a Python builtin float, so it should work fine. This is also still in line with NEP 50's handling of scalars. I did get my simple np.nan case with DataArray.where working with the diff below:

diff --git a/xarray/core/duck_array_ops.py b/xarray/core/duck_array_ops.py
index 4f245e59..8fbe4731 100644
--- a/xarray/core/duck_array_ops.py
+++ b/xarray/core/duck_array_ops.py
@@ -202,13 +202,13 @@ def as_shared_dtype(scalars_or_arrays, xp=np):

         arrays = [asarray(x, xp=cp) for x in scalars_or_arrays]
     else:
-        arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]
+        arrays = [x if isinstance(x, (int, float, complex)) else asarray(x, xp=xp) for x in scalars_or_arrays]
     # Pass arrays directly instead of dtypes to result_type so scalars
     # get handled properly.
     # Note that result_type() safely gets the dtype from dask arrays without
     # evaluating them.
     out_type = dtypes.result_type(*arrays)
-    return [astype(x, out_type, copy=False) for x in arrays]
+    return [astype(x, out_type, copy=False) if hasattr(x, "dtype") else x for x in arrays]

Without the second modification, astype fails on scalars. I was surprised that xarray's result_type was OK with scalars being passed to it. Now for the tests:

65 failed, 15045 passed, 1637 skipped, 154 xfailed, 59 xpassed, 6494 warnings

I'll try investigating more tomorrow if I have the time, but I get 63 failures with no changes so... 🤷‍♂️
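
For reference, two quick checks behind the patch (plain NumPy under NumPy 2, with np.result_type standing in for xarray's dtypes.result_type):

import numpy as np

# np.nan really is a builtin Python float, so the isinstance check covers it
print(isinstance(np.nan, (int, float, complex)))  # True

# and result_type happily accepts Python scalars, keeping them "weak" under NEP 50
a = np.zeros(2, dtype=np.float32)
print(np.result_type(a, np.nan))                  # float32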

@djhoese
Contributor Author

djhoese commented Nov 6, 2023

OK, very interesting. I think this issue will need to be fixed to preserve existing behavior when numpy 2.0 comes out. This test fails right now with numpy 2 (not including my changes):

FAILED xarray/tests/test_rolling.py::TestDataArrayRolling::test_rolling_dask_dtype[float32] - AssertionError: assert dtype('float64') == dtype('float32')

As you can see, it specifically checks the result dtype, and the dtypes are not equal (the result was upcast to 64-bit float). This is WITHOUT my changes. The test passes when I rerun it with my hacky solution above.

As for tests that fail with my changes but didn't fail before:

FAILED xarray/tests/test_duck_array_ops.py::test_argmin_max[x-True-min-True-True-str-1] - numpy.exceptions.DTypePromotionError: The DType <class 'numpy.dtypes.StrDType'> could not be promoted by <class 'numpy._FloatAbstractDType'>. This means that no common DType exists for the given inputs. For example they cannot be s...
FAILED xarray/tests/test_duck_array_ops.py::test_argmin_max[x-True-min-True-False-str-1] - numpy.exceptions.DTypePromotionError: The DType <class 'numpy.dtypes.StrDType'> could not be promoted by <class 'numpy._FloatAbstractDType'>. This means that no common DType exists for the given inputs. For example they cannot be s...
FAILED xarray/tests/test_duck_array_ops.py::test_argmin_max[x-True-max-True-True-str-1] - numpy.exceptions.DTypePromotionError: The DType <class 'numpy.dtypes.StrDType'> could not be promoted by <class 'numpy._FloatAbstractDType'>. This means that no common DType exists for the given inputs. For example they cannot be s...
FAILED xarray/tests/test_duck_array_ops.py::test_argmin_max[x-True-max-True-False-str-1] - numpy.exceptions.DTypePromotionError: The DType <class 'numpy.dtypes.StrDType'> could not be promoted by <class 'numpy._FloatAbstractDType'>. This means that no common DType exists for the given inputs. For example they cannot be s...

This is kind of expected given how inelegant my update was.

@djhoese
Contributor Author

djhoese commented Apr 15, 2024

Are there any issues or milestones tracking general numpy 2 compatibility work? I'm running into this particular issue again in another of my projects (with xarray main). The little hack above suggested by @shoyer still seems to fix things, but I don't know enough about xarray's internals to know whether those changes are pull-request worthy or bound to conflict with other things in xarray.

@dcherian
Contributor

Good timing on the bump! We have some convo here: #8844

AFAIK no one has looked at this yet, though I nominally volunteered to do so.

It sounds like you've made some progress. Could you open a PR, please? It might be easier to help from there.

djhoese linked a pull request (#8946) on Apr 15, 2024 that will close this issue.