Replies: 1 comment 7 replies
-
Yes, this is possible. But don't use host tasks for this purpose: they are broken beyond repair for it, because they execute when the SYCL task graph is executed, not when it is submitted. You would therefore be enqueuing additional cuBLAS operations while your kernels are already running (note the additional synchronization needed at the end of the host task!). AdaptiveCpp has something better that can substantially outperform host-task-based code patterns. What you want is the custom operation extension: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/enqueue-custom-operation.md Note that upstream oneMKL already has AdaptiveCpp support for the CUDA and HIP backends (it then dispatches calls to cuBLAS or rocBLAS). I'm not sure whether it still works at the moment, as there were some CI issues. The code there does essentially what you want, and also uses our extension.
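A minimal sketch of what using the linked extension could look like, assuming the `AdaptiveCpp_enqueue_custom_operation` and `interop_handle::get_native_queue` names from that doc, an in-order queue, and a cuBLAS handle created elsewhere; the function name `gemm_in_graph` and the error-check-free style are illustrative only, not a definitive implementation:

```cpp
#include <sycl/sycl.hpp>
#include <cublas_v2.h>

// Sketch: enqueue a cuBLAS GEMM as a node in the SYCL task graph via the
// AdaptiveCpp custom-operation extension. The lambda runs at submission
// time and must only enqueue asynchronous backend operations.
void gemm_in_graph(sycl::queue& q, cublasHandle_t blas, int n,
                   const float* A, const float* B, float* C) {
  q.submit([&](sycl::handler& cgh) {
    cgh.AdaptiveCpp_enqueue_custom_operation([=](sycl::interop_handle& h) {
      // Obtain the native CUDA stream backing this queue and put the
      // cuBLAS call on it, so it is stream-ordered with the SYCL kernels.
      cudaStream_t stream = h.get_native_queue<sycl::backend::cuda>();
      cublasSetStream(blas, stream);
      const float alpha = 1.f, beta = 0.f;
      cublasSgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);
    });
  });
  // Work submitted to q afterwards is ordered after this operation; no
  // intermediate wait() is required.
}
```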
-
Hello,
In some cases, when porting a code, operations such as linear algebra have no portable equivalent, so one is left using, for example, cuBLAS/hipBLAS, which are non-portable.
In CUDA/HIP the BLAS (and other library) calls are asynchronous and can be associated with a stream, so one can launch kernels, make the BLAS call, launch more kernels, and only synchronize at the end.
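The pattern described above can be sketched as follows; this is a hypothetical example (the `scale` kernel and `pipeline` function are made up for illustration, and error checking is omitted):

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>

__global__ void scale(float* x, int n) {  // placeholder kernel
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= 2.f;
}

void pipeline(cublasHandle_t h, cudaStream_t s,
              float* A, float* B, float* C, int n) {
  scale<<<(n * n + 255) / 256, 256, 0, s>>>(A, n * n);  // kernel on stream
  cublasSetStream(h, s);                  // bind cuBLAS to the same stream
  const float alpha = 1.f, beta = 0.f;
  cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, A, n, B, n, &beta, C, n);         // async GEMM
  scale<<<(n * n + 255) / 256, 256, 0, s>>>(C, n * n);  // more work
  cudaStreamSynchronize(s);               // single synchronization at the end
}
```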
In SYCL one could do something like: launch kernels, synchronize, call BLAS, synchronize, launch the rest of the kernels. This is not optimal. I found this code in a CodePlay repository.
My understanding of the code is that the native stream can be obtained from a queue, and the BLAS calls are then associated with that stream and consequently with the queue. So one would be able to do a series of calls (launch kernels, call BLAS, launch kernels) and they would all run on the same queue.
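If I understand the approach correctly, it would look roughly like this; this is only a guess at the DPC++-style interop (the `ext_oneapi_cuda` backend enum and the cast are assumptions on my part, not verified against the CodePlay code):

```cpp
#include <sycl/sycl.hpp>
#include <cublas_v2.h>

// Assumed DPC++ interop: extract the native CUDA stream from the queue
// and bind cuBLAS to it, so BLAS calls are ordered with the queue's work.
void bind_blas_to_queue(sycl::queue& q, cublasHandle_t handle) {
  CUstream native = sycl::get_native<sycl::backend::ext_oneapi_cuda>(q);
  cublasSetStream(handle, reinterpret_cast<cudaStream_t>(native));
}
```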
The above code seems to be using oneAPI extensions. Is there something equivalent in AdaptiveCpp?
Best,
Cristian