
Labels in fences create noticeable overhead #6894

Open · masterleinad opened this issue Mar 22, 2024 · 11 comments
Labels: Performance (Code showing unusually slow performance for an architecture and/or backend), Question (For Kokkos internal and external contributors and users)
@masterleinad (Contributor)

@dalg24 I made a benchmark that looks like:

bool test_fence_with_kokkos(::benchmark::State&)
{
    ...
    for ( 100 times )
        space.fence(); // will use the default message, and will go through Kokkos (+ expect some Kokkos Tools related overhead even when no tool is in use)
    ...
}

bool test_fence_backend_native(::benchmark::State&)
{
    ...
    for ( 100 times )
        KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamSynchronize(space.cuda_stream())); // backend "raw" fence, shown here for Cuda
    ...
}

bool allocate_fence_message(::benchmark::State&)
{
    std::string test("Kokkos::Cuda::fence(): Unnamed Instance Fence");
    ::benchmark::ClobberMemory();
    return test.size() > 0;
}

Results, for my AMPERE86 GPU, release mode (g++-12):

----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
SpaceFenceBenchmark/fence_wrapped/manual_time      68941 ns        79466 ns         7887
SpaceFenceBenchmark/fence_raw/manual_time          53325 ns        63913 ns        13256
SpaceFenceBenchmark/allocate_fence_message          11.1 ns         11.1 ns     63001969

So the overhead of using Kokkos to fence is somewhere around (68941-53325)/100 ≃ 689 ns. It seems this is partly explained by the allocation of the default message, but not entirely. I guess we can attribute the additional cost to the function calls that happen in Kokkos::<...>::fence (like the calls into Kokkos Tools, happening e.g. in Kokkos::Impl::cuda_stream_synchronize).

Should we move this discussion somewhere else?

Originally posted by @romintomasetti in #5147 (comment)
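
For context, the comparison under discussion is roughly the following (a minimal sketch; apart from cudaStreamSynchronize, the function names are illustrative and are not the actual Kokkos internals):

#include <string>
#include <cuda_runtime.h>

// "Raw" backend fence: a single driver call, no label involved.
void raw_fence(cudaStream_t stream) {
  (void)cudaStreamSynchronize(stream);
}

// "Wrapped" fence: even without a user-supplied label, a default
// std::string is constructed and the Kokkos Tools begin/end fence
// hooks are consulted before the backend call is issued.
void wrapped_fence(cudaStream_t stream) {
  std::string label("Kokkos::Cuda::fence(): Unnamed Instance Fence");
  // tools_begin_fence(label);  // hypothetical profiling hook
  (void)cudaStreamSynchronize(stream);
  // tools_end_fence();         // hypothetical profiling hook
}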

@masterleinad (Contributor, Author)

@romintomasetti

With

#include <Kokkos_Core.hpp>
#include <benchmark/benchmark.h>

void test_fence_with_kokkos(::benchmark::State& state)
{
  using ExecutionSpace = Kokkos::DefaultExecutionSpace;
  ExecutionSpace exec_space;

  for (auto _ : state) {
    Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
    exec_space.fence("blabla"); // will use the default message, and will use Kokkos (+ expect some Kokkos Tools related overhead even when it's not used)
  }
}

void test_fence_backend_native(::benchmark::State& state)
{
  using ExecutionSpace = Kokkos::DefaultExecutionSpace;
  ExecutionSpace exec_space;

  for (auto _ : state) {
    Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
    KOKKOS_IMPL_CUDA_SAFE_CALL(cudaStreamSynchronize(exec_space.cuda_stream())); // backend "raw" fence, here shown for Cuda
  }
}

void test_global_fence_with_kokkos(::benchmark::State& state)
{
  for (auto _ : state) {
    Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
    Kokkos::fence("bla"); // will use the default message, and will use Kokkos (+ expect some Kokkos Tools related overhead even when it's not used)
  }
}

void test_global_fence_backend_native(::benchmark::State& state)
{
  for (auto _ : state) {
    Kokkos::parallel_for(1, KOKKOS_LAMBDA(int i) {});
    KOKKOS_IMPL_CUDA_SAFE_CALL(cudaDeviceSynchronize()); // backend "raw" fence, here shown for Cuda
  }
}

void allocate_fence_message(::benchmark::State& state)
{
  for (auto _ : state) {
    std::string test("Kokkos::Cuda::fence(): Unnamed Instance Fence");
    ::benchmark::ClobberMemory();
  }
}

int main(int argc, char *argv[])
{
  Kokkos::ScopeGuard guard(argc, argv);
  benchmark::Initialize(&argc, argv);

  BENCHMARK(test_fence_with_kokkos);
  BENCHMARK(test_fence_backend_native);
  BENCHMARK(test_global_fence_with_kokkos);
  BENCHMARK(test_global_fence_backend_native);
  BENCHMARK(allocate_fence_message);

  benchmark::RunSpecifiedBenchmarks();

  return EXIT_SUCCESS;
}

I am seeing

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
test_fence_with_kokkos                 9487 ns         9486 ns        73989
test_fence_backend_native              9467 ns         9467 ns        73920
test_global_fence_with_kokkos          9575 ns         9574 ns        73161
test_global_fence_backend_native       9554 ns         9554 ns        73311
allocate_fence_message                 21.3 ns         21.3 ns     31312641

on an A100 (9487 − 9467 ≈ 20 ns), so the overhead is something like 20 ns if there is any kernel to fence.

@masterleinad (Contributor, Author)

Without submitting any work, I'm seeing

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
test_fence_with_kokkos                 1622 ns         1622 ns       432608
test_fence_backend_native              1534 ns         1534 ns       457689
test_global_fence_with_kokkos           813 ns          813 ns       865624
test_global_fence_backend_native        652 ns          652 ns      1070945
allocate_fence_message                 21.1 ns         21.1 ns     32601159

Interestingly, the global fence is faster than the instance fence in that case.

@romintomasetti (Contributor)

Not sure how I came up with my numbers... It seems (68941-53325)/100 ≃ 689 ns is wrong: (68941-53325)/100 ≃ 156 ns.

I've run your code with --benchmark_min_time=3s (with the parallel regions commented out) and got the following results:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
test_fence_with_kokkos                  442 ns          441 ns      9231328
test_fence_backend_native               323 ns          322 ns     12924138
test_global_fence_with_kokkos           612 ns          611 ns      7390350
test_global_fence_backend_native        345 ns          345 ns     12118007
allocate_fence_message                 11.9 ns         11.9 ns    353373974

which seems quite consistent with what I presented earlier. So I'm really seeing an overhead on the order of 100 ns per call. That still seems like quite a lot, right?

However, I'm convinced that these results can vary significantly depending on the CPU, GPU, and motherboard considered (for the same compile flags)...
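
As an aside, the longer measurement window can also be requested per benchmark in code rather than on the command line; a minimal sketch using Google Benchmark's standard MinTime API (nothing Kokkos-specific here):

// Equivalent to passing --benchmark_min_time on the command line,
// but fixed in code for this particular benchmark:
BENCHMARK(test_fence_with_kokkos)->MinTime(3.0);  // run for at least 3 s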

@masterleinad (Contributor, Author)

I'm arguing that the numbers without parallel regions don't really matter. We shouldn't fence if there is nothing to fence anyway.
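
If we follow that guidance, one cheap application-side pattern is to track whether anything was actually submitted since the last synchronization point; a hypothetical sketch (the tracking flag lives in the application, it is not a Kokkos feature):

#include <Kokkos_Core.hpp>

bool work_pending = false;

void submit_work(Kokkos::DefaultExecutionSpace const& exec) {
  Kokkos::parallel_for(Kokkos::RangePolicy<>(exec, 0, 1),
                       KOKKOS_LAMBDA(int) {});
  work_pending = true;  // remember that the instance has outstanding work
}

void fence_if_needed(Kokkos::DefaultExecutionSpace const& exec) {
  if (work_pending) {  // skip the fence entirely when nothing was enqueued
    exec.fence("app: fence_if_needed");
    work_pending = false;
  }
}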

@romintomasetti (Contributor)

OK, I've run the benchmark on several machines (VOLTA70, AMPERE86, and VEGA906) and, judging from the results, the overhead is not reproducible behavior (it is very run-to-run dependent), as you noticed on your A100.

It seems the overhead will be larger or smaller depending on the CPU (caching of the message, for example, might influence our benchmarks here), the GPU, the Cuda/HIP version, and the compiler.

Note that my initial concern was about the possible difference between a native backend fence (no message at all) and a Kokkos fence without a message (which will create the default message, even when Kokkos Tools is not enabled).
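
To picture one possible mitigation (a sketch only, assuming nothing about how Kokkos is actually organized internally): materialize the default label once instead of constructing a fresh std::string on every unlabeled fence.

#include <string_view>

// Zero-allocation alternative to building
//   std::string label("Kokkos::Cuda::fence(): Unnamed Instance Fence");
// on every call: a constant that exists once and is merely referenced.
std::string_view default_fence_label() {
  static constexpr std::string_view label =
      "Kokkos::Cuda::fence(): Unnamed Instance Fence";
  return label;
}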

@ajpowelsnl added the Question (For Kokkos internal and external contributors and users) and Performance (Code showing unusually slow performance for an architecture and/or backend) labels on Mar 25, 2024
@ajpowelsnl (Contributor)

@romintomasetti - many thanks for submitting this issue. Given the significant variation and the difficulty of reproducing the pattern, is there any work to be done on this issue? In any case, @vlkale should be aware of it.

@vlkale commented Mar 25, 2024

@ajpowelsnl Thanks for pointing this out. I have been following this, and I am trying to see the implications for the performance of Kokkos Tools connectors when a tool's global fences are supported and enabled by the Kokkos Tools user (the user enables this by setting export KOKKOS_TOOLS_GLOBALFENCES=1).

The part in Roman's data of importance to me is the timing difference 612 ns - 345 ns = 267 ns between test_global_fence_with_kokkos and test_global_fence_backend_native.

It also may be good to quantify 'noticeable overhead' with respect to a few Kokkos benchmarks or mini-apps.

@ajpowelsnl (Contributor)

> (quoting @vlkale's comment above in full)

Thanks @vlkale -- any way to mitigate this overhead, significant variation / difficulty reproducing notwithstanding?

@vlkale commented Mar 25, 2024

While I am not certain this comes primarily from Kokkos Tools, lines 219-230 in the following may be a culprit: https://github.com/kokkos/kokkos/blob/master/core/src/impl/Kokkos_Profiling.cpp

Basically, fencing in Kokkos Tools should happen at a finer granularity, per execution space instance. I need to think about this, but I will create a Kokkos Tools GitHub issue for it in any case.
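
To make that concrete, here is a hypothetical sketch of the distinction (illustrative names only; this is not the actual code in Kokkos_Profiling.cpp):

#include <Kokkos_Core.hpp>

// Coarse: a device-wide fence around a tools callback; this synchronizes
// every execution space instance.
void tool_fence_global() {
  Kokkos::fence("tools: global fence");
}

// Finer granularity: fence only the execution space instance on which the
// profiled event actually ran.
void tool_fence_instance(Kokkos::DefaultExecutionSpace const& exec) {
  exec.fence("tools: instance fence");
}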

@vlkale commented Mar 27, 2024 via email

@ajpowelsnl (Contributor) commented Apr 2, 2024

TODO:

  • Investigate the possibility of "over fencing" in Kokkos Tools / profiling capabilities (e.g., the kernel logger)
    • CAVEAT: "over fencing" can have performance costs
    • See the fencing-relevant tests in Kokkos Core to understand basic assumptions about fencing
  • @romintomasetti, do you have any other concerns about Kokkos performance overhead, beyond fences?
  • Is it of interest to quantify the overhead of Kokkos data structures, parallel constructs, atomics, etc., with respect to raw backend baselines, @crtrott, @dalg24, @masterleinad, as a Kokkos Tools exercise?
  • Close the issue if understanding Kokkos overheads (for the main backends in Kokkos Core) is not a worthwhile activity
