P2901 LEWG review: 2023/10/24 #420

mhoemmen (Contributor) commented Oct 24, 2023
P2901R0 LEWG review: 2023/10/24

LEWG reviewed P2901R0 on 2023/10/24. They didn't take polls, but would like to see the paper again. They gave the following feedback.

Should the batch dimension be required to expect independent data? For example, if users want to convince the implementation to vectorize with a custom accessor, will they need to specify *unseq? Answer: This likely relates to the definition of an "element function" in the parallel algorithms. An accessor's access function is an element function in that sense.
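For concreteness, a minimal sketch of a custom mdspan accessor follows; the name simd_hint_accessor and its vectorization intent are illustrative assumptions, not anything proposed in P2901. Its access function is the "element function" in question: if the batch dimension promises independent data, the implementation may invoke it in an unsequenced (vectorized) context.

    #include <cstddef>

    // Minimal custom accessor (hypothetical).  Satisfies the C++23 mdspan
    // accessor-policy requirements: element_type, reference,
    // data_handle_type, offset_policy, access, offset.
    template <class ElementType>
    struct simd_hint_accessor {
      using element_type     = ElementType;
      using reference        = ElementType&;
      using data_handle_type = ElementType*;
      using offset_policy    = simd_hint_accessor;

      // The "element function": must be safe to call unsequenced if the
      // implementation vectorizes across the batch dimension.
      constexpr reference access(data_handle_type p, std::size_t i) const noexcept {
        return p[i];
      }
      constexpr data_handle_type offset(data_handle_type p, std::size_t i) const noexcept {
        return p + i;
      }
    };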

Should users be able to customize algorithms based on execution policy?

  • The Standard does not currently permit users to define their own execution policies. (See https://eel.is/c++draft/execpol#type-note-1 .)

  • It would be useful if "users" (think "performance optimization experts") could define custom policies and customize standard algorithms for them. However, we would want to be able to do that for all the standard algorithms (at least the ones that take ExecutionPolicy overloads), not just the algorithms in P2901. (A hypothetical sketch follows this list.)

  • Specifically regarding schedulers, P2500 relates execution policies to P2300 schedulers. A reasonable and expected goal of P2500 and related efforts would be to support user-defined schedulers going into the standard parallel algorithms.
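As a purely hypothetical sketch of the customization idea (the Standard forbids user-defined policies today, per the note cited above), a customized algorithm for an expert's policy might be an extra overload found by argument-dependent lookup. Every name below is invented for illustration; the signature loosely mirrors P1673's std::linalg::matrix_product.

    namespace expert {  // hypothetical "performance optimization expert" code

    // NOT allowed today: users may not define their own execution policies
    // (https://eel.is/c++draft/execpol#type-note-1).  Sketch only.
    struct gpu_offload_policy {};
    inline constexpr gpu_offload_policy gpu_offload{};

    // A customization of the algorithm for that policy; it would dispatch
    // to a tuned kernel instead of the generic implementation.
    template <class InMat1, class InMat2, class OutMat>
    void matrix_product(gpu_offload_policy, InMat1 A, InMat2 B, OutMat C) {
      // ... launch a vendor-tuned GPU kernel on A, B, C ...
    }

    }  // namespace expert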

What about heterogeneous computation? Senders / receivers let you build a graph expressing asynchronous computation.

  • This relates to MAGMA (UTK) and other task-parallel heterogeneous linear algebra software.

  • The building blocks of heterogeneous computation are homogeneous computations.

  • Asynchronous linear algebra would be a different interface, even for the nonbatched (P1673-like) case. NVIDIA and others are working on developing the idioms for those interfaces. @brycelelbach et al. have an interest in this.
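As a purely illustrative sketch of what an asynchronous, sender-based linear algebra call could look like (nothing below is proposed wording; stdexec is the P2300 reference implementation, and composing std::linalg calls inside then is an assumption):

    #include <stdexec/execution.hpp>  // P2300 reference implementation
    #include <linalg>                 // std::linalg::matrix_product (P1673, C++26)
    #include <utility>

    // Chain two dependent matrix products as a sender graph.  A, B, C, D, E
    // are assumed to be mdspans over caller-owned storage; a GPU scheduler
    // could replace a CPU scheduler here for heterogeneous execution.
    void async_two_products(auto sched, auto A, auto B, auto C, auto D, auto E) {
      auto work = stdexec::schedule(sched)
                | stdexec::then([=] { std::linalg::matrix_product(A, B, C); })
                | stdexec::then([=] { std::linalg::matrix_product(C, D, E); });
      stdexec::sync_wait(std::move(work));
    }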

Explain how to port code from a for loop around calls to the nonbatched interface to a single call to the batched interface. Give code examples (see the sketch after this list).

  • For example, how would users set up the mdspan? Where would they put the batch dimension?

  • (More questions I could imagine coming up: What layouts should they use? How would they write an interleaved layout? How would vendors recognize an interleaved layout and optimize for it?)
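A sketch of the requested port, assuming (as one possibility) that the batch index is the leftmost mdspan extent; the batched matrix_product overload shown at the end is hypothetical, since P2901R0 does not pin down this interface. An interleaved layout would keep the same extents but give the batch extent the smallest stride (expressible with layout_stride), which is the kind of pattern a vendor could recognize and optimize for.

    #include <cstddef>
    #include <linalg>   // std::linalg::matrix_product (P1673, C++26)
    #include <mdspan>   // std::mdspan, std::submdspan (C++23/26)

    using batch_t = std::mdspan<float, std::dextents<std::size_t, 3>>;

    // Before: a loop of nonbatched calls, slicing out one problem at a time.
    void product_loop(batch_t A, batch_t B, batch_t C) {
      for (std::size_t b = 0; b < A.extent(0); ++b) {
        auto A_b = std::submdspan(A, b, std::full_extent, std::full_extent);
        auto B_b = std::submdspan(B, b, std::full_extent, std::full_extent);
        auto C_b = std::submdspan(C, b, std::full_extent, std::full_extent);
        std::linalg::matrix_product(A_b, B_b, C_b);
      }
    }

    // After: one call over the rank-3 mdspans.  This batched overload is
    // hypothetical; the batch dimension is assumed to be extent 0.
    void product_batched(batch_t A, batch_t B, batch_t C) {
      std::linalg::matrix_product(A, B, C);
    }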

How do users tell a batched problem / call from a nonbatched problem / call? Explore alternative ways to describe a batched problem, and to differentiate it from a nonbatched problem (see the sketch after this list).

  • This might just amount to explaining why it's less efficient to provide an array of problems ("P2P," the pointer-to-pointer style of existing batched BLAS C interfaces). Presumably we have, or can get, performance data for that in the C interface case.
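A small sketch of the two descriptions being contrasted (the rank-3, leftmost-batch convention is an assumption):

    #include <cstddef>
    #include <mdspan>
    #include <vector>

    using matrix_t = std::mdspan<float, std::dextents<std::size_t, 2>>;
    using batch_t  = std::mdspan<float, std::dextents<std::size_t, 3>>;

    // (a) "P2P" style: an array of independent problems.  The implementation
    // can assume nothing about how the matrices relate in memory.
    std::vector<matrix_t> problems;

    // (b) One rank-3 mdspan: the batch structure (count, layout, strides) is
    // visible in the type and the extents, so an implementation can exploit
    // regularity, e.g. batch_t all(ptr, num_batches, num_rows, num_cols);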

Regarding algorithm customization based on execution policy, and integration with senders / receivers: There was dissent on whether we should standardize some interfaces first and tie them all together later, or try to standardize a "total plan" first and derive specific interfaces from it. The latter is what we tried to do with executors; that effort was dropped and replaced by the more cut-down plan P2300. It's hard to get intuition for interfaces without being able to play with them. If we start too abstractly, we may not end up being able to put the pieces together.

LEWG requests information on:

  • Numbers of users for existing interfaces

  • Types of projects that use this. Does P2901 have the same broad applicability as P1673?

  • Performance of batched interfaces

Anonymous not-quite-quote: "It's likely I wouldn't use this interface in my product, because I have no way to guarantee that my use of the Standard Library would run on the GPU. The alternative is to use a vendor-specific "clone" of the Standard interface from a vendor-specific library (e.g., via using vendor::std)."

  • Even if the product uses an optimized implementation, that implementation might not necessarily target the GPU.

  • If the product developer wanted to be sure of using the GPU, the developer would still need to reach for a nonstandard library (that perhaps has the same interface, and is brought in via using vendor::std::matrix_product).

  • We already have this issue with the C++17 parallel algorithms. How could Vendor A get Vendor B's standard algorithms to run on Vendor A's hardware? The only way to do that would be through the ability to customize the standard algorithms on execution policy. (A sketch of the vendor-namespace workaround follows.)
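A sketch of the workaround described above; vendor::std is the placeholder already used in this issue for a vendor-specific clone, and the header and macro are invented for illustration:

    #include <linalg>  // std::linalg::matrix_product (P1673, C++26)

    #if defined(USE_VENDOR_LINALG)
      #include <vendor/linalg.hpp>        // hypothetical vendor header
      using vendor::std::matrix_product;  // guaranteed to target the GPU
    #else
      using std::linalg::matrix_product;  // standard; GPU use is unspecified
    #endif

    // Call sites stay the same either way; only the using-declaration above
    // decides which implementation runs.
    template <class InMat1, class InMat2, class OutMat>
    void product(InMat1 A, InMat2 B, OutMat C) {
      matrix_product(A, B, C);
    }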
