ENH: datasets: array API standard support #20594

Closed · lucascolley opened this issue Apr 27, 2024 · 6 comments
Labels: array types (array API support and input array validation, see gh-18286), enhancement, scipy.datasets

@lucascolley (Member) commented Apr 27, 2024

Towards gh-18867

I think this deserves its own issue. It should be quite simple to add support to scipy.datasets: just add a few xp.asarray wrappers.
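
A rough sketch of what such a wrapper could look like (the loader body here is a placeholder, not SciPy's actual implementation):

    import numpy as np

    def ascent(xp=None):
        # placeholder standing in for the existing loader, which reads the image from disk
        data = np.zeros((512, 512), dtype=np.uint8)
        if xp is None:
            return data          # current behaviour: return a plain numpy array
        return xp.asarray(data)  # wrap the loaded data in the requested namespace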

However, there are performance concerns with device transfers: will it be possible to create the arrays on the desired device, and of the desired type, in the first place? Do we want to follow the pattern of the array-creation functions fft.{r}fftfreq by adding xp and device kwargs?
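
For reference, the fftfreq pattern mentioned above is used roughly like this (a usage sketch; exact SciPy version requirements are an assumption):

    import numpy as np
    from scipy import fft

    # xp selects the namespace the array is created in; with a GPU namespace
    # (e.g. xp=cupy) plus a device argument, no host-to-device transfer occurs
    freqs = fft.rfftfreq(8, d=0.1, xp=np)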

This is a concern for downstream libraries as well, e.g. the second half of colour-science/colour#1244 (comment).

cc @rgommers @KelSolaar

lucascolley added the array types label and the github-actions bot added the enhancement label on Apr 27, 2024
@rgommers (Member) commented:

The data loaders in scipy.datasets are quite different from fftfreq & co. For the latter, arrays are being created with standard array API functions - and hence we can create them on non-CPU devices fairly easily. For the former, we're loading data from binary files - there is no way to avoid loading them in CPU memory first. If you want them anywhere else, the device transfer cost has to be paid.
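
Concretely (cupy here is a stand-in for any GPU library, and scipy.datasets needs the optional pooch dependency):

    import cupy
    from scipy import datasets  # requires the optional pooch dependency

    ascent_cpu = datasets.ascent()         # file is decoded into host (CPU) memory
    ascent_gpu = cupy.asarray(ascent_cpu)  # explicit host-to-device copy; this cost
                                           # is paid regardless of the API shape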

Looking at this more closely, I honestly don't know if there's much of a point in doing anything here. Writing

ascent = xp.asarray(datasets.ascent(), device=my_device)

is pretty much the same as (maybe even clearer than):

ascent = datasets.ascent(xp=xp, device=my_device)

@lucascolley (Member, Author) commented:

Right, so we can probably just check off datasets in the tracker - I agree that there isn't much point in changing the one-liner.

For the downstream issue, is the performance saving of python list -> xp array over python list -> np array -> xp array significant? If so, I suppose you just want to implement a getter function which takes in xp, right?
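
For concreteness, the two construction paths in question (with numpy standing in for an arbitrary array API namespace):

    import numpy as np

    xp = np  # stand-in for any array API namespace (cupy, torch, ...)
    data = [0.0, 0.5, 1.0]

    direct = xp.asarray(data)               # python list -> xp array
    via_np = xp.asarray(np.asarray(data))   # python list -> np array -> xp array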

@rgommers (Member) commented:

> For the downstream issue, is the performance saving of python list -> xp array over python list -> np array -> xp array significant? If so, I suppose you just want to implement a getter function which takes in xp, right?

Not really - or even when it is (very small lists), it's probably not performance-relevant. When constructing from a list, the data is already in host memory. The most expensive part is typically the transfer of data from CPU to GPU, and that has to be done no matter what. And for other CPU libraries, the conversion from a numpy array to another array type doesn't have to copy data, so it's a cheap operation.
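
For example, the numpy-to-torch conversion on CPU wraps the existing buffer rather than copying it (assumes torch is installed):

    import numpy as np
    import torch

    a = np.arange(4, dtype=np.float32)
    t = torch.from_numpy(a)   # shares a's memory: no copy is made

    a[0] = 99.0
    print(t[0])               # tensor(99.) - the tensor sees the in-place change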

@lucascolley (Member, Author) commented:

Okay. So as far as I can tell, the only thing one can do to help performance is to try to create the arrays on the desired device at non-performance-critical times, so that they can be used later.
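
That is, something like the following, with cupy again as a hypothetical target library:

    import cupy
    from scipy import datasets  # requires the optional pooch dependency

    # setup phase (not performance-critical): load once, transfer once
    face_gpu = cupy.asarray(datasets.face(gray=True))

    # hot path: reuse face_gpu with no further transfer cost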

@rgommers (Member) commented:

I'll close this, since I think the discussion covers it for the scipy.datasets module - nothing to do here.

> Okay. So as far as I can tell, the only thing one can do to help performance is to try to create the arrays on the desired device at non-performance-critical times, so that they can be used later.

That's part of how you optimize GPU-based workloads. But you also can't do it too early, because GPUs/accelerators are often memory-constrained.

@lucascolley (Member, Author) commented:

@KelSolaar I imagine that having most computation performed on a GPU could outweigh the cost of device transfers for datasets stored on the CPU. Hopefully this won't be a blocker for you downstream. (Even if some Colour operations ended up slower in some cases, being able to integrate into a GPU stack without forcing the user to manage which device their arrays are on still seems very valuable.)
