ENH: datasets: array API standard support #20594

Closed · lucascolley opened this issue Apr 27, 2024 · 6 comments
Labels: array types (array API support and input array validation, see gh-18286), enhancement, scipy.datasets

@lucascolley (Member) commented Apr 27, 2024

Towards gh-18867

I think this deserves its own issue. It should be quite simple to add support to scipy.datasets: just add a few xp.asarray wrappers.
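
A rough sketch of what such a wrapper could look like (the loader body here is a placeholder, not SciPy's actual implementation):

    import numpy as np

    def ascent(xp=None):
        # placeholder standing in for the existing loader, which reads the image from disk
        data = np.zeros((512, 512), dtype=np.uint8)
        if xp is None:
            return data          # current behaviour: return a plain numpy array
        return xp.asarray(data)  # wrap the loaded data in the requested namespace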

However, there are performance concerns with device transfers: will it be possible to create the arrays on the desired device, and of the desired type, in the first place? Do we want to follow the pattern of the array-creation functions fft.{r}fftfreq by adding xp and device kwargs?
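
For reference, the fftfreq pattern mentioned above is used roughly like this (a usage sketch; exact SciPy version requirements are an assumption):

    import numpy as np
    from scipy import fft

    # xp selects the namespace the array is created in; with a GPU namespace
    # (e.g. xp=cupy) plus a device argument, no host-to-device transfer occurs
    freqs = fft.rfftfreq(8, d=0.1, xp=np)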

This is a concern for downstream libraries as well, e.g. the second half of colour-science/colour#1244 (comment).

cc @rgommers @KelSolaar

lucascolley added the array types label and the github-actions bot added the enhancement label on Apr 27, 2024
@rgommers (Member) commented:

The data loaders in scipy.datasets are quite different from fftfreq & co. For the latter, arrays are being created with standard array API functions - and hence we can create them on non-CPU devices fairly easily. For the former, we're loading data from binary files - there is no way to avoid loading them in CPU memory first. If you want them anywhere else, the device transfer cost has to be paid.
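
Concretely (cupy here is a stand-in for any GPU library, and scipy.datasets needs the optional pooch dependency):

    import cupy
    from scipy import datasets  # requires the optional pooch dependency

    ascent_cpu = datasets.ascent()         # file is decoded into host (CPU) memory
    ascent_gpu = cupy.asarray(ascent_cpu)  # explicit host-to-device copy; this cost
                                           # is paid regardless of the API shape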

Looking at this more closely, I honestly don't know if there's much of a point in doing anything here. Writing

ascent = xp.asarray(datasets.ascent(), device=my_device)

is pretty much the same as (maybe even clearer than):

ascent = datasets.ascent(xp=xp, device=my_device)

@lucascolley (Member, Author) commented:

Right, so we can probably just check off datasets in the tracker - I agree that there isn't much point in changing the one-liner.

For the downstream issue, is the performance saving of python list -> xp array over python list -> np array -> xp array significant? If so, I suppose you just want to implement a getter function which takes in xp, right?
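
For concreteness, the two construction paths in question (with numpy standing in for an arbitrary array API namespace):

    import numpy as np

    xp = np  # stand-in for any array API namespace (cupy, torch, ...)
    data = [0.0, 0.5, 1.0]

    direct = xp.asarray(data)               # python list -> xp array
    via_np = xp.asarray(np.asarray(data))   # python list -> np array -> xp array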

@rgommers (Member) commented:

> For the downstream issue, is the performance saving of python list -> xp array over python list -> np array -> xp array significant? If so, I suppose you just want to implement a getter function which takes in xp, right?

Not really - or even when it is (very small lists), it's probably not performance-relevant. When constructing from a list, the data is already in host memory. The most expensive part is typically the transfer of data from CPU to GPU, and that has to be done no matter what. And for other CPU libraries, the conversion from a numpy array to another array type doesn't have to copy data, so it's a cheap operation.
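
For example, the numpy-to-torch conversion on CPU wraps the existing buffer rather than copying it (assumes torch is installed):

    import numpy as np
    import torch

    a = np.arange(4, dtype=np.float32)
    t = torch.from_numpy(a)   # shares a's memory: no copy is made

    a[0] = 99.0
    print(t[0])               # tensor(99.) - the tensor sees the in-place change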

@lucascolley (Member, Author) commented:

Okay. So as far as I can tell, the only thing one can do to help performance is to try to create the arrays on the desired device at non-performance-critical times, so that they can be used later.
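
That is, something like the following, with cupy again as a hypothetical target library:

    import cupy
    from scipy import datasets  # requires the optional pooch dependency

    # setup phase (not performance-critical): load once, transfer once
    face_gpu = cupy.asarray(datasets.face(gray=True))

    # hot path: reuse face_gpu with no further transfer cost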

@rgommers (Member) commented:

I'll close this, since I think the discussion covers it for the scipy.datasets module - nothing to do here.

> Okay. So as far as I can tell, the only thing one can do to help performance is to try to create the arrays on the desired device at non-performance-critical times, so that they can be used later.

That's part of how you optimize GPU-based workloads. But you also can't do it too early, because GPUs/accelerators are often memory-constrained.

@lucascolley (Member, Author) commented:

@KelSolaar I imagine that having most computation performed on a GPU could outweigh the cost of device transfers for datasets stored on the CPU. Hopefully this won't be a blocker for you downstream. (Even if some Colour operations ended up slower in some cases, being able to integrate into a GPU stack without forcing the user to manage which device their arrays are on still seems very valuable.)
