
low performance on MI250X in certain cases #100

Open
nschaeff opened this issue Jan 14, 2023 · 3 comments

Comments

@nschaeff

Hello,

I observed slower execution times on MI250X than on MI100 for "strided" transforms.
Example: Nfft = 1024 with 20480 batched complex-to-complex transforms (double precision):
time on MI100 = 3.5 ms
time on MI250X (1 GCD) = 4.0 ms
Since the bandwidth of one MI250X GCD is about 1.5 times that of the MI100, I would instead expect around 2-2.5 ms.
Could it be a bank-conflict issue, for instance if the bank rules have changed since MI100?

@DTolm
Owner

DTolm commented Jan 14, 2023

Dear @nschaeff,

Yes, this is another case of L2 cache port serialization on AMD GPUs. It has been present for more than 10 years: https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencl-optimization.html#channel-conflicts. In short, for large strides (like 20480), 32-byte coalescing alone does not help: the MMU is often not smart about such address patterns and ends up computing addresses that issue load/store instructions to the same memory pin. I don't know how to solve this without knowing the full port calculation logic of AMD GPUs (I have been trying to come up with something for two years).
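To make the serialization mechanism concrete, here is a small sketch of how a large stride can pin every batch to one memory channel. The 256-byte interleave granularity and the 8-channel count below are illustrative assumptions for the sketch, not the actual (undisclosed) MI250X mapping:

```python
# Hypothetical model of channel interleaving; the real AMD mapping is unknown.
CHANNEL_BYTES = 256   # assumed interleave granularity
N_CHANNELS = 8        # assumed number of channels

def channel(addr: int) -> int:
    """Map a byte address to a memory channel (assumed linear interleave)."""
    return (addr // CHANNEL_BYTES) % N_CHANNELS

def channels_hit(stride_elems: int, elem_bytes: int = 8, n: int = 32):
    """Channels touched by the first element of n consecutive batches."""
    return {channel(b * stride_elems * elem_bytes) for b in range(n)}

# A stride of 20480 doubles (163840 bytes) is a multiple of
# CHANNEL_BYTES * N_CHANNELS, so every batch lands on the same channel:
print(channels_hit(20480))      # -> {0}

# Padding the stride by 8 doubles breaks the alignment and spreads
# accesses over all channels:
print(channels_hit(20480 + 8))  # -> {0, 1, 2, 3, 4, 5, 6, 7}
```

Under this model, any stride whose byte size is a multiple of (granularity x channel count) serializes on one channel, which matches the observation that "round" batch numbers are the unlucky ones.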

What can be done is changing the batch number. Here is the scan of the execution times of the nearby batches:

[Figure: scan of execution times for batch counts near 20480]

20480 just seems to be an unlucky number, though even the other nearby batch counts are still 2x slower than peak memory bandwidth would allow.

Another solution is batching in the outer dimension; this way each 1024x20480 system takes 0.5 ms per FFT, which corresponds to the peak bandwidth of the MI250.
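For illustration, the two layouts can be sketched in NumPy (not VkFFT; the array is shrunk from 1024x20480 to keep the example light):

```python
import numpy as np

n_fft, n_batch = 1024, 2048  # reduced from 20480 for the sketch

# Strided layout: the batch index is the fastest-varying axis, so
# consecutive elements of one transform are n_batch elements apart.
strided = np.zeros((n_fft, n_batch), dtype=np.complex128)
out_strided = np.fft.fft(strided, axis=0)

# Outer-dimension batching: each length-1024 transform is contiguous
# in memory, which avoids the large-stride access pattern entirely.
outer = np.zeros((n_batch, n_fft), dtype=np.complex128)
out_outer = np.fft.fft(outer, axis=1)
```

The trade-off, as noted below, is that reaching the contiguous layout may require a transpose of the data.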

Best regards,
Dmitrii

@nschaeff
Author

Thanks a lot!
Without changing the batch size, I can increase the stride just a little (+8 works well) and VkFFT delivers the transform in 1.1 ms.
A rule of thumb that seems to work OK: if the stride is a multiple of 256 elements (2048 bytes for the "double" data type), add 8.
It does not solve all cases, but it seems to keep times within 25% of optimal instead of 400%.
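The rule of thumb above can be written as a small helper (a sketch of the heuristic from this thread, not part of the VkFFT API; the default of 8 double-precision elements is the value that worked well here):

```python
def padded_stride(stride: int, elem_bytes: int = 8, pad: int = 8) -> int:
    """Pad a batch stride (in elements) if it is a multiple of 2048 bytes,
    i.e. 256 elements for double precision, to break channel alignment."""
    if (stride * elem_bytes) % 2048 == 0:
        return stride + pad
    return stride

print(padded_stride(20480))  # -> 20488 (20480 is a multiple of 256)
print(padded_stride(1000))   # -> 1000  (not aligned, left unchanged)
```

The cost is a slightly larger allocation (pad elements per batch) that is never read by the transform itself.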

(Batching in the outer dimension would require transposing my data, which would cost at least another 0.5 ms, so in the end it would not be much faster. It would have been, without the stride workaround.)

@DTolm
Owner

DTolm commented Jan 15, 2023

The stride approach is indeed another solution. As more algorithms get ported to AMD GPUs, there will probably be more information about this in the future, since this behavior is not specific to FFTs.
