low performance on MI250X in certain cases #100
Dear @nschaeff, yes, this is another case of L2 cache port serialization on AMD GPUs. It has been present for more than 10 years: https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencl-optimization.html#channel-conflicts. In short, for large strides (like 20480), 32-byte coalescing alone does not help: the MMU is often not smart enough and, after computing the address values, issues load/store instructions that all land on the same memory channel. I don't know how to solve this without knowing the full port-calculation logic of AMD GPUs (I have been trying to come up with something for two years). What can be done is changing the batch number. A scan of the execution times for nearby batch sizes shows that 20480 just seems to be an unlucky number, though the other nearby batch sizes are still 2x slower than peak memory bandwidth. Another solution is batching in the outer dimension; this way each 1024x20480 system takes 0.5ms per FFT, which is the peak bandwidth of the MI250. Best regards,
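The channel-serialization effect described above can be sketched numerically. This is a toy model, not the real address decoder: the 256-byte interleave granularity, the 8-channel count, and the mapping `(byte_address // 256) % 8` are assumptions loosely based on the channel-conflict section of the linked OpenCL optimization guide, and real hardware may differ.

```python
from collections import Counter

def channel_hits(stride_elems, n_rows=64, elem_bytes=16,
                 interleave=256, n_channels=8):
    """Count how many of n_rows strided accesses land on each channel.

    Toy model: channel = (byte_address // interleave) % n_channels.
    elem_bytes=16 corresponds to a double-precision complex element.
    """
    return Counter(
        (row * stride_elems * elem_bytes // interleave) % n_channels
        for row in range(n_rows)
    )

# Stride of 20480 complex doubles: every access maps to the same channel.
bad = channel_hits(20480)
print(max(bad.values()))   # 64 -> all 64 accesses serialize on one channel

# Padding the stride to 20496 rotates accesses across all channels.
good = channel_hits(20496)
print(max(good.values()))  # 8 -> even spread over 8 channels
```

Under this model, a stride of 20480 elements steps by exactly 1280 interleave units per row, and 1280 is a multiple of the channel count, so every row starts on the same channel; a small padding breaks that alignment.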
Thanks a lot! (Batching in the outer dimension requires that I transpose my data, which would cost at least another 0.5ms, so in the end it is not much faster. It would have been without the stride control.)
Indeed, the stride approach is another solution. As more algorithms are ported to AMD GPUs, there will probably be more information about this in the future, since this behavior is not limited to FFTs.
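The stride approach mentioned above can be sketched as a small search for a padded stride. This is a hypothetical helper built on the same toy channel model as before (256-byte interleave, 8 channels, `channel = (byte_address // interleave) % n_channels`), not a documented recipe for any specific GPU:

```python
def pad_stride(stride_elems, elem_bytes=16, interleave=256, n_channels=8):
    """Smallest padded stride >= stride_elems whose rows rotate channels.

    Hypothetical helper under the toy model: if row starts are aligned to
    the interleave granularity and the per-row step in interleave units is
    odd, consecutive rows cycle through all (power-of-two) channels.
    """
    s = stride_elems
    while True:
        step_bytes = s * elem_bytes
        if step_bytes % interleave == 0 and (step_bytes // interleave) % 2 == 1:
            return s
        s += 1

print(pad_stride(20480))  # 20496
```

The price of padding is a slightly larger allocation (here 16 extra elements per row, under 0.1% overhead), which is usually cheap compared to a 2x bandwidth loss.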
Hello,
I observed slower execution times on the MI250X than on the MI100 for "strided" transforms.
Example: Nfft = 1024, with 20480 batched complex-to-complex transforms (double precision).
time on MI100 = 3.5ms
time on MI250X (1 GCD) = 4.0ms
Since the bandwidth is about 1.5 times larger on the MI250X (1 GCD), I would instead expect around 2-2.5ms.
Could it be a bank-conflict issue, for instance if the bank rules have changed since the MI100?
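The 2-2.5ms expectation follows from simple bandwidth scaling, and a raw-traffic estimate gives a lower bound. The 1.5x ratio and the timings are from the report above; the per-GCD peak bandwidth figure and the one-read-plus-one-write traffic model are assumptions for illustration:

```python
# Scale the measured MI100 time by the quoted ~1.5x bandwidth advantage.
mi100_ms = 3.5
bw_ratio = 1.5            # MI250X (1 GCD) vs MI100, from the report above
print(round(mi100_ms / bw_ratio, 2))  # 2.33, consistent with 2-2.5ms

# Lower bound from raw traffic: 1024 x 20480 complex doubles (16 bytes
# each), read once and written once for a memory-bound FFT pass.
bytes_moved = 1024 * 20480 * 16 * 2
peak_bw = 1.6e12          # ~1.6 TB/s per GCD: an assumption, not measured
print(round(bytes_moved / peak_bw * 1e3, 2))  # 0.42 ms at peak bandwidth
```

The gap between the ~0.4ms traffic lower bound and the measured 4.0ms is what makes a channel- or bank-conflict explanation plausible.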