
low performance on MI250X in certain cases #100

Open
nschaeff opened this issue Jan 14, 2023 · 3 comments

Comments

@nschaeff

Hello,

I observed slower execution times on MI250X than on MI100 for "strided" transforms.
Example: Nfft = 1024 with 20480 batched complex-to-complex transforms (double precision):
time on MI100 = 3.5 ms
time on MI250X (1 GCD) = 4.0 ms
Since the bandwidth of one MI250X GCD is about 1.5 times that of the MI100, I would instead expect around 2-2.5 ms.
Could it be a bank-conflict issue, for instance if the bank rules have changed since MI100?

@DTolm
Owner

DTolm commented Jan 14, 2023

Dear @nschaeff,

Yes, this is another case of L2 cache port serialization on AMD GPUs. It has been present for more than 10 years: https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencl-optimization.html#channel-conflicts. In short, for large strides (like 20480), 32-byte coalescing alone does not help: the MMU is often not smart about such address patterns and ends up computing addresses that issue load/store instructions to the same memory pin. I don't know how to solve this without knowing the full port calculation logic of AMD GPUs (I have been trying to come up with something for two years).
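To make the serialization mechanism concrete, here is a small sketch of how a large stride can pin every batch to one memory channel. The 256-byte interleave granularity and the 8-channel count below are illustrative assumptions for the sketch, not the actual (undisclosed) MI250X mapping:

```python
# Hypothetical model of channel interleaving; the real AMD mapping is unknown.
CHANNEL_BYTES = 256   # assumed interleave granularity
N_CHANNELS = 8        # assumed number of channels

def channel(addr: int) -> int:
    """Map a byte address to a memory channel (assumed linear interleave)."""
    return (addr // CHANNEL_BYTES) % N_CHANNELS

def channels_hit(stride_elems: int, elem_bytes: int = 8, n: int = 32):
    """Channels touched by the first element of n consecutive batches."""
    return {channel(b * stride_elems * elem_bytes) for b in range(n)}

# A stride of 20480 doubles (163840 bytes) is a multiple of
# CHANNEL_BYTES * N_CHANNELS, so every batch lands on the same channel:
print(channels_hit(20480))      # -> {0}

# Padding the stride by 8 doubles breaks the alignment and spreads
# accesses over all channels:
print(channels_hit(20480 + 8))  # -> {0, 1, 2, 3, 4, 5, 6, 7}
```

Under this model, any stride whose byte size is a multiple of (granularity x channel count) serializes on one channel, which matches the observation that "round" batch numbers are the unlucky ones.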

What can be done is changing the batch number. Here is the scan of the execution times of the nearby batches:

[Figure: scan of execution times for batch counts near 20480]

20480 just seems to be an unlucky number, though even the other nearby batch counts are still 2x slower than peak memory bandwidth would allow.

Another solution is batching in the outer dimension; this way each 1024x20480 system takes 0.5 ms per FFT, which corresponds to the peak bandwidth of the MI250.
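For illustration, the two layouts can be sketched in NumPy (not VkFFT; the array is shrunk from 1024x20480 to keep the example light):

```python
import numpy as np

n_fft, n_batch = 1024, 2048  # reduced from 20480 for the sketch

# Strided layout: the batch index is the fastest-varying axis, so
# consecutive elements of one transform are n_batch elements apart.
strided = np.zeros((n_fft, n_batch), dtype=np.complex128)
out_strided = np.fft.fft(strided, axis=0)

# Outer-dimension batching: each length-1024 transform is contiguous
# in memory, which avoids the large-stride access pattern entirely.
outer = np.zeros((n_batch, n_fft), dtype=np.complex128)
out_outer = np.fft.fft(outer, axis=1)
```

The trade-off, as noted below, is that reaching the contiguous layout may require a transpose of the data.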

Best regards,
Dmitrii

@nschaeff
Author

Thanks a lot!
Without changing the batch size, I can increase the stride just a little (+8 works well) and VkFFT delivers the transform in 1.1 ms.
A rule of thumb that seems to work OK: if the stride is a multiple of 256 elements (2048 bytes for the "double" data type), add 8.
It does not solve all cases, but it seems to keep times within 25% of optimal instead of 400%.
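The rule of thumb above can be written as a small helper (a sketch of the heuristic from this thread, not part of the VkFFT API; the default of 8 double-precision elements is the value that worked well here):

```python
def padded_stride(stride: int, elem_bytes: int = 8, pad: int = 8) -> int:
    """Pad a batch stride (in elements) if it is a multiple of 2048 bytes,
    i.e. 256 elements for double precision, to break channel alignment."""
    if (stride * elem_bytes) % 2048 == 0:
        return stride + pad
    return stride

print(padded_stride(20480))  # -> 20488 (20480 is a multiple of 256)
print(padded_stride(1000))   # -> 1000  (not aligned, left unchanged)
```

The cost is a slightly larger allocation (pad elements per batch) that is never read by the transform itself.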

(Batching in the outer dimension would require transposing my data, which would cost at least another 0.5 ms, so in the end it would not be much faster. It would have been, without the stride workaround.)

@DTolm
Owner

DTolm commented Jan 15, 2023

The stride approach is indeed another solution. As more algorithms get ported to AMD GPUs, there will probably be more information about this in the future, since this behavior is not specific to FFTs.
