CUDA&HIP stream asynchronicity #163

Open
DejvBayer opened this issue Mar 9, 2024 · 2 comments

@DejvBayer

Hi,

this is a snippet showing the launch of a CUDA kernel in the DispatchPlan module.

...

if (app->configuration.num_streams >= 1) {
	result = cuLaunchKernel(axis->VkFFTKernel, ..., app->configuration.stream[app->configuration.streamID], args, 0);
}
else {
	result = cuLaunchKernel(axis->VkFFTKernel, ..., 0, args, 0);
}

// result check

if (app->configuration.num_streams > 1) {
	app->configuration.streamID = app->configuration.streamCounter % app->configuration.num_streams;
	if (app->configuration.streamCounter == 0) {
		cudaError_t res2 = cudaEventRecord(app->configuration.stream_event[app->configuration.streamID], app->configuration.stream[app->configuration.streamID]);
		if (res2 != cudaSuccess) return VKFFT_ERROR_FAILED_TO_EVENT_RECORD;
	}
	app->configuration.streamCounter++;
}

...

I do not understand several things about this code:

  1. Why is each kernel launched into a different stream? I see that in the RunApp module you call VkFFTSync after each kernel launch. I think that this is not necessary unless you want to execute the work in parallel.
  2. Is it correct that only the event at index 0 is ever recorded to a stream, because of the streamCounter == 0 check? It looks more like a mistake.

Then here is a snippet from the VkFFTSync function.

...

if (app->configuration.num_streams > 1) {
  cudaError_t res = cudaSuccess;
  for (pfUINT s = 0; s < app->configuration.num_streams; s++) {
      res = cudaEventSynchronize(app->configuration.stream_event[s]);
      if (res != cudaSuccess) return VKFFT_ERROR_FAILED_TO_SYNCHRONIZE;
  }
  app->configuration.streamCounter = 0;
}

...

Here is the synchronization of multiple CUDA streams. If I am not wrong, it synchronizes on events that were never recorded into a stream. It also makes the application synchronous, because cudaEventSynchronize blocks the host; I guess that the cudaStreamWaitEvent function would be more suitable in this case.

But overall I feel that the whole design of using multiple streams is wrong. What I think would be right is:

  1. When the plan is created, as many events as there are streams should be created.
  2. Then, when the VkFFTAppend function is called, this should happen:
    1. An event should be recorded into each stream except the first via cudaEventRecord.
    2. The first stream should wait for all of the work in the other streams to finish by calling cudaStreamWaitEvent on each event except the first.
    3. All of the work should be launched into the first stream.
    4. When everything has been launched, the first event should be recorded into the first stream via cudaEventRecord.
    5. All of the streams except the first one should call cudaStreamWaitEvent on the first event.
  3. The user launches more work into the streams.

This approach should work fine and even allow the usage of CUDA Graphs via stream capture. The exact same applies to HIP.

Thanks!

David

@DTolm
Owner

DTolm commented Mar 22, 2024

Hello,

Multiple streams were a test to mimic the Vulkan behavior of shader dispatches to the pipeline, where, unless synchronized, they launch without waiting for the previous shader to complete, unlike the kernel model of CUDA, where kernels in the same stream wait for previous kernels. Its usefulness turned out to be very limited: it only matters when a kernel needs multiple dispatches because the grid dimensions exceed the device limits (65k for y and z). However, these workloads are typically big and fully utilize the GPU by themselves with low CPU overhead, so using multiple streams was not useful at all. I think you are correct that the synchronization is currently messed up in this version; I will need to check your changes in detail when I have more time.

Best regards,
Dmitrii

@DejvBayer
Author

DejvBayer commented Mar 22, 2024

Sure, the mechanism is described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

It is just extended to work with an arbitrary number of streams.

David
