Guarantee parallel_scan to use two passes #6897
Comments
Does the documentation say anything about this? If the answer is "no we don't", does the doc include anything regarding guarantees with parallel_reduce or parallel_for (I understand parallel_scan is different).
We will see about that. Some might not be really happy about the serial parallel scan taking a 2x slowdown.
I'd be curious to hear under which circumstances users care about performance for the Serial backend that much.
ArborX, for instance.
…On Tue, Mar 26, 2024 at 9:50 PM Daniel Arndt ***@***.***> wrote:
We will see about that. Some might not be really happy about the serial
parallel scan taking a 2x slow down.
I'd be curious to hear under which circumstances users care about
performance for the Serial backend that much.
In general we want Kokkos Serial to have as low overhead as possible. Isn't this why a serial backend exists? Otherwise, why not just run OpenMP with 1 thread?
The |
Most systems have OpenMP support so I think the bar is pretty low. But my point is really that we want the Serial backend to be as close to zero overhead as possible. No atomic overhead, and no multiple loops in the |
@stanmoore1, @masterleinad, regarding "no atomic overhead": as of Kokkos-4.3, there's a new CMake keyword that might be useful for serial / host-only builds: |
Currently, we only guarantee that parallel_scan calls the functor with is_final==true, but not that it is also called with is_final==false. Note that all backends apart from Serial use a two-pass implementation.

If the user can't know/rely on two passes, it's not possible to avoid repeating expensive calculations, i.e., caching the result of expensive_condition in the first pass and reusing it in the final pass doesn't work if there is only one pass. Instead, we would need to launch another kernel to evaluate expensive_condition beforehand, so that we need to launch three kernels for most backends (one "expensive" one for parallel_for, and two "cheap" ones for parallel_scan).

It appears that the benefit of avoiding one pass in the Serial backend might not be sufficient to justify not allowing caching results in the first parallel_scan pass.
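The caching pattern discussed above can be sketched in plain C++ without Kokkos. This is only an illustration under stated assumptions: ExpensiveCondition, CachingFunctor, two_pass_scan and one_pass_scan are hypothetical names, and the two drivers are simplified serial emulations, not the actual backend implementations. The functor follows the (index, update, is_final) calling convention of a parallel_scan body.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a costly predicate; it counts its own evaluations.
struct ExpensiveCondition {
  int* calls;
  bool operator()(int i) const {
    ++*calls;
    return i % 3 == 0;
  }
};

// Scan body that evaluates the predicate only when is_final == false and
// reuses the cached answer when is_final == true.
struct CachingFunctor {
  ExpensiveCondition expensive_condition;
  std::vector<char>* cache;  // one flag per index, filled in the first pass
  std::vector<int>* out;     // compacted indices, written in the final pass

  void operator()(int i, int& update, bool is_final) const {
    if (!is_final) {
      (*cache)[i] = expensive_condition(i);
    }
    const bool keep = (*cache)[i] != 0;
    if (is_final && keep) {
      assert(static_cast<std::size_t>(update) < out->size());
      (*out)[update] = i;
    }
    if (keep) {
      ++update;
    }
  }
};

// Simplified emulation of a two-pass scan: a non-final pass, then a final pass.
template <class Functor>
int two_pass_scan(int n, const Functor& f) {
  int update = 0;
  for (int i = 0; i < n; ++i) f(i, update, false);
  update = 0;
  for (int i = 0; i < n; ++i) f(i, update, true);
  return update;
}

// Simplified emulation of a one-pass scan: only the final pass is executed.
template <class Functor>
int one_pass_scan(int n, const Functor& f) {
  int update = 0;
  for (int i = 0; i < n; ++i) f(i, update, true);
  return update;
}
```

With the two-pass driver, the predicate runs once per index and the final pass reads the cache. With the one-pass driver, the cache is never filled, so the same functor silently selects nothing. That is the reason the pattern is only safe if the non-final pass is guaranteed.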