CUDA&HIP stream asynchronicity #163

DejvBayer · 2024-03-09T14:40:16Z

Hi,

This is a snippet showing the launch of a CUDA kernel from the DispatchPlan module.

...

if (app->configuration.num_streams >= 1) {
	result = cuLaunchKernel(axis->VkFFTKernel, ..., app->configuration.stream[app->configuration.streamID], args, 0);
}
else {
	result = cuLaunchKernel(axis->VkFFTKernel, ..., 0, args, 0);
}

// result check

if (app->configuration.num_streams > 1) {
	app->configuration.streamID = app->configuration.streamCounter % app->configuration.num_streams;
	if (app->configuration.streamCounter == 0) {
		cudaError_t res2 = cudaEventRecord(app->configuration.stream_event[app->configuration.streamID], app->configuration.stream[app->configuration.streamID]);
		if (res2 != cudaSuccess) return VKFFT_ERROR_FAILED_TO_EVENT_RECORD;
	}
	app->configuration.streamCounter++;
}

...

I do not understand several things about this code:

Why is each kernel launched into a different stream? I see that in the RunApp module you call VkFFTSync after each kernel launch. I think that is not necessary unless you want to execute the work in parallel.
Is it correct that only the event at index 0 is ever recorded into a stream, because of the streamCounter == 0 check? It looks like a mistake.

Next, here is a snippet from the VkFFTSync function.

...

if (app->configuration.num_streams > 1) {
  cudaError_t res = cudaSuccess;
  for (pfUINT s = 0; s < app->configuration.num_streams; s++) {
      res = cudaEventSynchronize(app->configuration.stream_event[s]);
      if (res != cudaSuccess) return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE;
  }
  app->configuration.streamCounter = 0;
}

...

This is where the multiple CUDA streams are synchronized. If I am not wrong, it synchronizes on events that were never recorded into a stream. It also makes the application synchronous, since cudaEventSynchronize blocks the host. I guess that cudaStreamWaitEvent would be more suitable in this case.

But overall I feel that the whole design of using multiple streams is wrong. What I think would be right is:

When the plan is created, as many events as there are streams should be created.
Then, when the VkFFTAppend function is called, this should happen:
1. An event should be recorded into each stream except the first via cudaEventRecord.
2. The first stream should wait for all of the work in the other streams to finish, by calling cudaStreamWaitEvent on every event except the first.
3. All of the work should be launched into the first stream.
4. When everything has been launched, the first event should be recorded into the first stream via cudaEventRecord.
5. All of the streams except the first should call cudaStreamWaitEvent on the first event.
The user can then launch more work into the streams.

This approach should work fine and even allow the use of CUDA Graphs via stream capture. The same applies to HIP.

Thanks!

David

DTolm · 2024-03-22T13:43:30Z

Hello,

Using multiple streams was an experiment to mimic the Vulkan behavior of shader dispatches to the pipeline, where, unless synchronized, dispatches launch without waiting for the previous shader to complete. This is unlike the CUDA kernel model, where kernels in the same stream wait for previous kernels. Its usefulness turned out to be very limited: it only applies when a kernel needs multiple dispatches because the grid dimensions exceed device limits (65k for y and z). However, these workloads are typically large and fully utilize the GPU by themselves with low CPU overhead, so using multiple streams was not useful at all. I think you are correct that the synchronization is currently broken in this version. I will need to check your changes in detail when I have more time.

Best regards,
Dmitrii

DejvBayer · 2024-03-22T15:18:43Z

Sure, the mechanism is described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

It is just extended to work with an arbitrary number of streams.

David

DejvBayer mentioned this issue Mar 9, 2024

Async multistream synchronization for CUDA and HIP backends #164

Open
