Inconsistent results with CUDA backend #1408
While investigating this problem today, I've noticed that the

Also, it is still a mystery why a similar algorithm works in most cases in the bigger application. Maybe this is because that app uses a more complex algorithm which runs slower? Maybe this is some sort of synchronization problem? Although I tried to add explicit synchronization points using
Hi, I cannot reproduce with
Please provide more information: the AdaptiveCpp version, how you have built things, the compilation flow you are using, and the output of a run with

EDIT: Note that one important difference between the CPU backend and all other backends (including CUDA) is that the CPU backend will directly operate on the host-side allocation of
Thank you very much for your response and for taking the time to look into this issue. The version of AdaptiveCpp I am currently using is the latest, v24, with this commit: 5777309 (HEAD -> develop, origin/develop, origin/HEAD). Initially, I encountered this error while using v23.
This is the CMakeLists.txt for this test:
Here is the full output with
Thank you. I also cannot reproduce with
It's not unlikely that this is something that can only be reproduced on Windows, as things are pretty experimental there throughout. Nothing immediately stands out to me in your output. One thing that could perhaps help is if you could try to narrow down the issue further. If there is some UB on our side that does not show on Linux, the shorter your reproducer is, the easier it will be to spot potential issues.
Thank you. This is the new code that I'm testing:
This is what I get (should be
This is quite strange, since I have used member functions before without problems; the project where I encountered this problem stores its main buffers in a class and processes them using member functions, and I have never noticed any problems with these long-lived buffers. I am wondering if this issue could be related to the fact that I use CUDA 12.4 (or, until recently, 12.2). When I compile any AdaptiveCpp project, I get this warning:
And this issue isn't related to how buffers are passed and returned. If I add data output (using a host accessor) right into the body of the member

So something doesn't work right in the body of that function. This is the debug log of what occurs there:
I don't know what the options on Windows are - can you use
Thanks for the suggestion; I've got some interesting information! I've tried to use the Nsight VSE Debugger, which is integrated into VS. I've tried to compile my Release build with

So, to sum up, when I use
How have you built LLVM and AdaptiveCpp? I vaguely remember that the optimisation levels between LLVM and AdaptiveCpp have to match on Windows; maybe this is a similar problem?
I compiled them both using |
Some additional, probably related, information. Now I compile all projects with
Interestingly, I have synchronous SYCL exception handlers as well, but they catch nothing.
Hello,
I've been working with AdaptiveCpp and oneAPI, targeting multiple backends: CPU and CUDA through AdaptiveCpp, as well as CPU@OpenCL and Level Zero through oneAPI. Among these configurations, I'm encountering a peculiar issue only with the CUDA backend via AdaptiveCpp, where about 30% of runs produce incorrect results. The failure rate rises to 100% with a simplified reproducer program I've written for demonstration purposes.
One of the algorithms in the application I am working on involves constructing and deconstructing an image pyramid. When executing this process on the CUDA backend, I observe incorrect data manipulation, as shown in the sample output below. Notably, this issue does not manifest with the CPU backend or with oneAPI's alternatives.
The reproducer I've written constructs and deconstructs an image pyramid, and reads and prints the data at the (0, 0) coordinates of each image level to the console.
This is the (correct) output when using the CPU backend:
And this is the (incorrect) output when using the CUDA backend:
I've tried to add explicit synchronization points using q.wait() in various places in the program, but this didn't help.

Thank you in advance for your assistance and for the great work on AdaptiveCpp.
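For context, the synchronization pattern these experiments were aiming at can be sketched as follows (a sketch only, assuming a SYCL 2020 implementation such as AdaptiveCpp; this is not a fix for the reported bug):

```cpp
#include <cstdio>
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    sycl::buffer<int, 1> buf{sycl::range<1>{4}};

    q.submit([&](sycl::handler& cgh) {
        sycl::accessor acc{buf, cgh, sycl::write_only, sycl::no_init};
        cgh.parallel_for(sycl::range<1>{4},
                         [=](sycl::id<1> i) { acc[i] = int(i[0]) * 2; });
    });

    // With the buffer/accessor model, constructing a host_accessor both
    // waits for the kernel and makes its results visible on the host, so
    // an extra q.wait() is not needed here; q.wait() alone only orders
    // execution and does not hand out a synchronized host view.
    sycl::host_accessor host{buf, sycl::read_only};
    for (int i = 0; i < 4; ++i)
        std::printf("%d\n", host[i]);
    return 0;
}
```

Since the reproducer already reads results through host accessors, the SYCL runtime should be inserting the necessary dependencies automatically, which points at a runtime or backend issue rather than missing user-side synchronization.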
Code: