
Labels in fences create noticeable overhead #6894

Open · masterleinad opened this issue Mar 22, 2024 · 11 comments
Labels: Performance (Code showing unusually slow performance for an architecture and/or backend), Question (For Kokkos internal and external contributors and users)
@masterleinad (Contributor)

@dalg24 I made a benchmark that looks like:

bool test_fence_with_kokkos(::benchmark::State&)
{
    ...
    for ( 100 times )
        space.fence(); // will use the default message, and will go through Kokkos (+ expect some Kokkos Tools related overhead even when no tool is in use)
    ...
}

bool test_fence_backend_native(::benchmark::State&)
{
    ...
    for ( 100 times )
        KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamSynchronize(space.cuda_stream())); // backend "raw" fence, shown here for Cuda
    ...
}

bool allocate_fence_message(::benchmark::State&)
{
    std::string test("Kokkos::Cuda::fence(): Unnamed Instance Fence");
    ::benchmark::ClobberMemory();
    return test.size() > 0;
}

Results, for my AMPERE86 GPU, release mode (g++-12):

----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
SpaceFenceBenchmark/fence_wrapped/manual_time      68941 ns        79466 ns         7887
SpaceFenceBenchmark/fence_raw/manual_time          53325 ns        63913 ns        13256
SpaceFenceBenchmark/allocate_fence_message          11.1 ns         11.1 ns     63001969

So the overhead of using Kokkos to fence is somewhere around (68941-53325)/100 ≃ 689 ns. It seems this is partly explained by the allocation of the default message, but not entirely. I guess we can attribute the additional cost to the function calls that happen in Kokkos::<...>::fence (like the calls into Kokkos Tools, happening e.g. in Kokkos::Impl::cuda_stream_synchronize).

Should we move this discussion somewhere else?

Originally posted by @romintomasetti in #5147 (comment)
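
For context, the comparison under discussion is roughly the following (a minimal sketch; apart from cudaStreamSynchronize, the function names are illustrative and are not the actual Kokkos internals):

#include <string>
#include <cuda_runtime.h>

// "Raw" backend fence: a single driver call, no label involved.
void raw_fence(cudaStream_t stream) {
  (void)cudaStreamSynchronize(stream);
}

// "Wrapped" fence: even without a user-supplied label, a default
// std::string is constructed and the Kokkos Tools begin/end fence
// hooks are consulted before the backend call is issued.
void wrapped_fence(cudaStream_t stream) {
  std::string label("Kokkos::Cuda::fence(): Unnamed Instance Fence");
  // tools_begin_fence(label);  // hypothetical profiling hook
  (void)cudaStreamSynchronize(stream);
  // tools_end_fence();         // hypothetical profiling hook
}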

@masterleinad (Contributor, Author)

@romintomasetti

With

#include <Kokkos_Core.hpp>
#include <benchmark/benchmark.h>

void test_fence_with_kokkos(::benchmark::State& state)
{
  using ExecutionSpace = Kokkos::DefaultExecutionSpace;
  ExecutionSpace exec_space;

  for (auto _ : state) {
    Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
    exec_space.fence("blabla"); // will use the default message, and will use Kokkos (+ expect some Kokkos Tools related overhead even when it's not used)
  }
}

void test_fence_backend_native(::benchmark::State& state)
{
  using ExecutionSpace = Kokkos::DefaultExecutionSpace;
  ExecutionSpace exec_space;

  for (auto _ : state) {
    Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
    KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamSynchronize(exec_space.cuda_stream())); // backend "raw" fence, here shown for Cuda
  }
}

void test_global_fence_with_kokkos(::benchmark::State& state)
{
  for (auto _ : state) {
    Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
    Kokkos::fence("bla"); // will use the default message, and will use Kokkos (+ expect some Kokkos Tools related overhead even when it's not used)
  }
}

void test_global_fence_backend_native(::benchmark::State& state)
{
  for (auto _ : state) {
    Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
    KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize()); // backend "raw" fence, here shown for Cuda
  }
}

void allocate_fence_message(::benchmark::State& state)
{
  for (auto _ : state) {
    std::string test("Kokkos::Cuda::fence(): Unnamed Instance Fence");
    ::benchmark::ClobberMemory();
  }
}

int main(int argc, char *argv[])
{
  Kokkos::ScopeGuard guard(argc, argv);
  benchmark::Initialize(&argc, argv);

  BENCHMARK(test_fence_with_kokkos);
  BENCHMARK(test_fence_backend_native);
  BENCHMARK(test_global_fence_with_kokkos);
  BENCHMARK(test_global_fence_backend_native);
  BENCHMARK(allocate_fence_message);

  benchmark::RunSpecifiedBenchmarks();

  return EXIT_SUCCESS;
}

I am seeing

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
test_fence_with_kokkos                 9487 ns         9486 ns        73989
test_fence_backend_native              9467 ns         9467 ns        73920
test_global_fence_with_kokkos          9575 ns         9574 ns        73161
test_global_fence_backend_native       9554 ns         9554 ns        73311
allocate_fence_message                 21.3 ns         21.3 ns     31312641

on an A100 (9487 − 9467 ≈ 20 ns), so the overhead is something like 20 ns if there is any kernel to fence.

@masterleinad (Contributor, Author)

Without submitting any work, I'm seeing

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
test_fence_with_kokkos                 1622 ns         1622 ns       432608
test_fence_backend_native              1534 ns         1534 ns       457689
test_global_fence_with_kokkos           813 ns          813 ns       865624
test_global_fence_backend_native        652 ns          652 ns      1070945
allocate_fence_message                 21.1 ns         21.1 ns     32601159

Interestingly, the global fence is faster than the instance fence in that case.

@romintomasetti (Contributor)

Not sure how I came up with my numbers... It seems (68941-53325)/100 ≃ 689 ns is wrong: (68941-53325)/100 ≃ 156 ns.

I've run your code with --benchmark_min_time=3s (with the parallel regions commented out) and got the following results:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
test_fence_with_kokkos                  442 ns          441 ns      9231328
test_fence_backend_native               323 ns          322 ns     12924138
test_global_fence_with_kokkos           612 ns          611 ns      7390350
test_global_fence_backend_native        345 ns          345 ns     12118007
allocate_fence_message                 11.9 ns         11.9 ns    353373974

which seems quite consistent with what I presented earlier. So I'm really seeing an overhead on the order of 100 ns per call. That still seems like quite a lot, right?

However, I'm convinced that these results can vary significantly depending on the CPU, GPU, and motherboard considered (for the same compile flags)...
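
As an aside, the longer measurement window can also be requested per benchmark in code rather than on the command line; a minimal sketch using Google Benchmark's standard MinTime API (nothing Kokkos-specific here):

// Equivalent to passing --benchmark_min_time on the command line,
// but fixed in code for this particular benchmark:
BENCHMARK(test_fence_with_kokkos)->MinTime(3.0);  // run for at least 3 s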

@masterleinad (Contributor, Author)

I'm arguing that the numbers without parallel regions don't really matter. We shouldn't fence if there is nothing to fence anyway.
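
If we follow that guidance, one cheap application-side pattern is to track whether anything was actually submitted since the last synchronization point; a hypothetical sketch (the tracking flag lives in the application, it is not a Kokkos feature):

#include <Kokkos_Core.hpp>

bool work_pending = false;

void submit_work(Kokkos::DefaultExecutionSpace const& exec) {
  Kokkos::parallel_for(Kokkos::RangePolicy<>(exec, 0, 1),
                       KOKKOS_LAMBDA(int) {});
  work_pending = true;  // remember that the instance has outstanding work
}

void fence_if_needed(Kokkos::DefaultExecutionSpace const& exec) {
  if (work_pending) {  // skip the fence entirely when nothing was enqueued
    exec.fence("app: fence_if_needed");
    work_pending = false;
  }
}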

@romintomasetti (Contributor)

OK, I've run the benchmark on several machines (VOLTA70, AMPERE86, and VEGA906) and, judging from the results, the overhead is not reproducible behavior (it is very run-to-run dependent), as you noticed on your A100.

It seems the overhead will be larger or smaller depending on the CPU (caching of the message, for example, might influence our benchmarks here), the GPU, the Cuda/HIP version, and the compiler.

Note that my initial concern was about the possible difference between a native backend fence (no message at all) and a Kokkos fence without a message (which will create the default message, even when Kokkos Tools is not enabled).
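
To picture one possible mitigation (a sketch only, assuming nothing about how Kokkos is actually organized internally): materialize the default label once instead of constructing a fresh std::string on every unlabeled fence.

#include <string_view>

// Zero-allocation alternative to building
//   std::string label("Kokkos::Cuda::fence(): Unnamed Instance Fence");
// on every call: a constant that exists once and is merely referenced.
std::string_view default_fence_label() {
  static constexpr std::string_view label =
      "Kokkos::Cuda::fence(): Unnamed Instance Fence";
  return label;
}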

@ajpowelsnl added the Question (For Kokkos internal and external contributors and users) and Performance (Code showing unusually slow performance for an architecture and/or backend) labels on Mar 25, 2024
@ajpowelsnl (Contributor)

@romintomasetti - many thanks for submitting this issue. Given the significant variation and the difficulty of reproducing the pattern, is there any work to be done on this issue? In any case, @vlkale should be aware of it.

@vlkale commented Mar 25, 2024

@ajpowelsnl Thanks for pointing this out. I have been following this, and I am trying to see the implications for the performance of Kokkos Tools connectors when a tool's global fences are supported and enabled by the Kokkos Tools user (the user enables this by setting export KOKKOS_TOOLS_GLOBALFENCES=1).

The part in Roman's data of importance to me is the timing difference 612 ns - 345 ns = 267 ns between test_global_fence_with_kokkos and test_global_fence_backend_native.

It also may be good to quantify 'noticeable overhead' with respect to a few Kokkos benchmarks or mini-apps.

@ajpowelsnl (Contributor)

> (quoting @vlkale's comment above in full)

Thanks @vlkale -- any way to mitigate this overhead, significant variation / difficulty reproducing notwithstanding?

@vlkale commented Mar 25, 2024

While I am not certain this comes primarily from Kokkos Tools, lines 219-230 in the following may be a culprit: https://github.com/kokkos/kokkos/blob/master/core/src/impl/Kokkos_Profiling.cpp

Basically, fencing in Kokkos Tools should happen at a finer granularity, per execution space instance. I need to think about this, but I will create a Kokkos Tools GitHub issue for it in any case.
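
To make that concrete, here is a hypothetical sketch of the distinction (illustrative names only; this is not the actual code in Kokkos_Profiling.cpp):

#include <Kokkos_Core.hpp>

// Coarse: a device-wide fence around a tools callback; this synchronizes
// every execution space instance.
void tool_fence_global() {
  Kokkos::fence("tools: global fence");
}

// Finer granularity: fence only the execution space instance on which the
// profiled event actually ran.
void tool_fence_instance(Kokkos::DefaultExecutionSpace const& exec) {
  exec.fence("tools: instance fence");
}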

@vlkale commented Mar 27, 2024 via email

@ajpowelsnl (Contributor) commented Apr 2, 2024

TODO:

  • Investigate the possibility of "over fencing" in Kokkos Tools / profiling capabilities (e.g., the kernel logger)
    • CAVEAT: "over fencing" can have performance costs
    • See the fencing-relevant tests in Kokkos Core to understand basic assumptions about fencing
  • @romintomasetti, do you have any other concerns about Kokkos performance overhead, beyond fences?
  • Is it of interest to quantify the overhead of Kokkos data structures, parallel constructs, atomics, etc., with respect to raw backend baselines, @crtrott, @dalg24, @masterleinad, as a Kokkos Tools exercise?
  • Close the issue if understanding Kokkos overheads (for the main backends in Kokkos Core) is not a worthwhile activity
