Labels in fences create noticeable overhead #6894
Comments
With

    #include <cstdlib>
    #include <string>

    #include <Kokkos_Core.hpp>
    #include <benchmark/benchmark.h>

    // Instance fence through Kokkos, with an explicit label.
    void test_fence_with_kokkos(::benchmark::State& state)
    {
        using ExecutionSpace = Kokkos::DefaultExecutionSpace;
        ExecutionSpace exec_space;
        for (auto _ : state) {
            Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
            exec_space.fence("blabla"); // goes through Kokkos with an explicit label (+ expect some Kokkos Tools related overhead even when it's not used)
        }
    }

    // Instance fence that bypasses Kokkos and calls the backend directly.
    void test_fence_backend_native(::benchmark::State& state)
    {
        using ExecutionSpace = Kokkos::DefaultExecutionSpace;
        ExecutionSpace exec_space;
        for (auto _ : state) {
            Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
            KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamSynchronize(exec_space.cuda_stream())); // backend "raw" fence, here shown for Cuda
        }
    }

    // Global fence through Kokkos, with an explicit label.
    void test_global_fence_with_kokkos(::benchmark::State& state)
    {
        for (auto _ : state) {
            Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
            Kokkos::fence("bla"); // goes through Kokkos with an explicit label (+ expect some Kokkos Tools related overhead even when it's not used)
        }
    }

    // Global fence that bypasses Kokkos and calls the backend directly.
    void test_global_fence_backend_native(::benchmark::State& state)
    {
        for (auto _ : state) {
            Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
            KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize()); // backend "raw" fence, here shown for Cuda
        }
    }

    // Cost of building the default fence message as a std::string.
    void allocate_fence_message(::benchmark::State& state)
    {
        for (auto _ : state) {
            std::string test("Kokkos::Cuda::fence(): Unnamed Instance Fence");
            ::benchmark::ClobberMemory();
        }
    }

    int main(int argc, char *argv[])
    {
        Kokkos::ScopeGuard guard(argc, argv);
        benchmark::Initialize(&argc, argv);
        BENCHMARK(test_fence_with_kokkos);
        BENCHMARK(test_fence_backend_native);
        BENCHMARK(test_global_fence_with_kokkos);
        BENCHMARK(test_global_fence_backend_native);
        BENCHMARK(allocate_fence_message);
        benchmark::RunSpecifiedBenchmarks();
        return EXIT_SUCCESS;
    }

I am seeing
on an |
Without submitting any work, I'm seeing
Interestingly, the global fence is faster than the instance fence in that case. |
Not sure how I came up with my numbers... It seems I've run your code with
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
test_fence_with_kokkos                  442 ns          441 ns      9231328
test_fence_backend_native               323 ns          322 ns     12924138
test_global_fence_with_kokkos           612 ns          611 ns      7390350
test_global_fence_backend_native        345 ns          345 ns     12118007
allocate_fence_message                 11.9 ns         11.9 ns    353373974
which seem quite consistent with what I presented earlier. So I'm really seeing an overhead in the order of
However, I'm convinced that, depending on the CPU, GPU and motherboard considered, these results can vary significantly (for the same compile flags)... |
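To make the numbers in the table above easier to compare, here is a small sketch that subtracts the native timings from the Kokkos ones. The struct and function names are made up for illustration, and subtracting the `allocate_fence_message` time assumes the costs simply add up, which is a simplification.

```cpp
#include <cstdio>
#include <initializer_list>

// Timings in ns per iteration, copied from the benchmark table in this thread.
struct FencePair {
    const char* name;
    double with_kokkos_ns;
    double native_ns;
};

constexpr FencePair instance_fence{"instance fence", 442.0, 323.0};
constexpr FencePair global_fence{"global fence", 612.0, 345.0};
constexpr double allocate_message_ns = 11.9;

// Extra cost of going through Kokkos instead of calling the backend directly.
constexpr double overhead_ns(const FencePair& p) {
    return p.with_kokkos_ns - p.native_ns;
}

// Part of that extra cost not covered by the string allocation alone
// (assumes the two costs are additive).
constexpr double unexplained_ns(const FencePair& p) {
    return overhead_ns(p) - allocate_message_ns;
}

// Prints one line per fence flavour.
void print_overheads() {
    for (const FencePair& p : {instance_fence, global_fence}) {
        std::printf("%-15s overhead: %.1f ns (%.1f ns beyond the allocation)\n",
                    p.name, overhead_ns(p), unexplained_ns(p));
    }
}
```

Calling `print_overheads()` from `main` reports the two differences. On this machine the string allocation accounts for only a small part of either one.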
I'm arguing that the numbers without parallel regions don't really matter. We shouldn't fence if there is nothing to fence anyway. |
OK, I've run the benchmark on several machines ( It seems the overhead will be more or less, depending on CPU (caching e.g. of the message might influence our benchmarks here), GPU, Cuda/HIP version and compiler. Note that my initial concern was about the possible difference between a native backend fence (no message at all) and a |
@romintomasetti - many thanks for submitting this issue. Given the significant variation / difficulty in reproducing the pattern, is there any work to be done on this issue? In any case, @vlkale should be aware of the issue. |
@ajpowelsnl Thanks for pointing this out. I have been following this, and I am trying to understand the performance implications for Kokkos Tools connectors when the tool's global fences are supported and enabled by the Kokkos Tools user (the user enables this by typing The part of Roman's data that matters to me is the timing difference 612 ns - 345 ns = 267 ns between It may also be good to quantify 'noticeable overhead' with respect to a few Kokkos benchmarks or mini-apps. |
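One rough way to put a number on 'noticeable' is to express the extra fence cost as a fraction of one kernel-plus-fence iteration. The sketch below does that. The function name is made up, the kernel durations in the note after the code are hypothetical, and the model assumes the overhead adds linearly to the rest of the iteration.

```cpp
// Share of one iteration spent on the extra fence cost, assuming the overhead
// simply adds to the time of the kernel plus the native fence.
constexpr double overhead_fraction(double kernel_and_native_fence_ns,
                                   double fence_overhead_ns) {
    return fence_overhead_ns /
           (kernel_and_native_fence_ns + fence_overhead_ns);
}

// Global-fence overhead quoted in this thread: 612 ns - 345 ns.
constexpr double global_fence_overhead_ns = 612.0 - 345.0;
```

For a hypothetical iteration of 1 µs, `overhead_fraction(1000.0, global_fence_overhead_ns)` is roughly 0.21. For a hypothetical 1 ms iteration it falls below 0.001, so the answer depends heavily on how much work sits between fences.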
Thanks @vlkale -- any way to mitigate this overhead, significant variation / difficulty reproducing notwithstanding? |
While I am not certain this comes primarily from Kokkos Tools, lines 219-230 in the following file may be a culprit: https://github.com/kokkos/kokkos/blob/master/core/src/impl/Kokkos_Profiling.cpp Basically, fencing in Kokkos Tools should happen at a finer granularity, per execution space instance. |
I have put the associated Kokkos Tools GitHub issue here: kokkos/kokkos-tools#245
I will be working on a corresponding Kokkos Tools PR for review.
Also, I think the GitHub issue title should be changed to something like ‘Kokkos fences seem noticeably slower compared to the corresponding native fence’. Does that title capture this GitHub issue better?
|
TODO:
|
Results, for my AMPERE86 GPU, release mode (g++-12):

So, the overhead of using Kokkos to fence is somewhere around (68941-53325)/100 ≃ 156 ns. It seems this is explained by the allocation of the default message, but not only. I guess we can attribute the additional cost to the function calls that happen in Kokkos::<...>::fence (like calls to Kokkos Tools, happening e.g. in Kokkos::Impl::cuda_stream_synchronize).

Should we move this discussion somewhere else?
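As a sketch of why the label alone can cost something: if a fence takes its label as a `const std::string&` (my reading of the Kokkos fence signatures, so treat it as an assumption), passing a string literal builds a temporary string, and a message this long does not fit in the small-string buffer. The code below uses plain standard C++ with no Kokkos involved. `CountingAllocator`, `fence_with_label` and `fence_native` are hypothetical stand-ins that only count heap allocations.

```cpp
#include <cstddef>
#include <new>
#include <string>

// Number of heap allocations made through CountingAllocator so far.
inline std::size_t& allocation_count() {
    static std::size_t count = 0;
    return count;
}

// Minimal allocator that counts every allocation it performs.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++allocation_count();
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

using CountedString =
    std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;

// Hypothetical stand-ins: neither function synchronizes anything.
inline std::size_t fence_with_label(const CountedString& label) { return label.size(); }
inline std::size_t fence_native() { return 0; }

// Allocations caused by one labelled call made with a string literal.
inline std::size_t allocations_for_labelled_fence() {
    const std::size_t before = allocation_count();
    (void)fence_with_label("Kokkos::Cuda::fence(): Unnamed Instance Fence");
    return allocation_count() - before;
}

// Allocations caused by one unlabelled call.
inline std::size_t allocations_for_native_fence() {
    const std::size_t before = allocation_count();
    (void)fence_native();
    return allocation_count() - before;
}
```

The labelled call pays for at least one heap allocation per invocation and the unlabelled one pays for none. That is consistent with the allocation explaining part, but not all, of the difference discussed here.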
Originally posted by @romintomasetti in #5147 (comment)