
[FEA] Detect transposed tensor on reductions and switch to fast path #482

Open

cliffburdick (Collaborator) opened this issue Sep 8, 2023 · 0 comments

Currently, a reduction across columns (versus rows) is slow: the reduction naively takes in a transpose operator and indexes it through a random-access iterator, so adjacent threads issue strided accesses with a stride equal to the second dimension, and no loads coalesce.
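To make the access pattern concrete, here is a hypothetical CUDA sketch (kernel and names are illustrative, not MatX internals) of why this path is slow: the threads of a block cooperate on one output element and grab logically consecutive elements of the transposed view, but through the permute, `op(i, j)` maps to `in[j * N + i]`, so neighboring threads land N floats apart in memory.

```cpp
// Illustrative slow path: one block per output element, 256 threads per block.
__global__ void sum_transposed_naive(const float* __restrict__ in, float* out,
                                     int M, int N) {
  __shared__ float partial[256];
  const int i = blockIdx.x;  // output element (a row of the transposed view)

  float acc = 0.0f;
  for (int j = threadIdx.x; j < M; j += blockDim.x) {
    acc += in[j * N + i];    // stride between adjacent threads == N
  }

  // Tree-combine the per-thread partials in shared memory.
  partial[threadIdx.x] = acc;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[i] = partial[0];
}
```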

We experimented with an einsum transpose followed by a reduction; the results on an A30 were as follows:

permute/sum

|  T  | Tensor Size | NumElements |  DataSize   | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | GlobalMem BW | BWUtil | Samples | Batch GPU |
|-----|-------------|-------------|-------------|---------|-----------|-------|-----------|-------|---------|--------------|--------|---------|-----------|
| F32 |    2^6 = 64 |          64 |  64.000 MiB |    608x |  1.534 ms | 6.09% |  1.446 ms | 0.63% | 44.248K |  46.397 GB/s |  4.97% |    609x |  1.457 ms |
| F32 |   2^7 = 128 |         128 |   1.000 GiB |     17x | 31.070 ms | 0.31% | 30.981 ms | 0.09% |  4.132K |  34.658 GB/s |  3.71% |     18x | 30.965 ms |
| F64 |    2^6 = 64 |          64 | 128.000 MiB |    254x |  2.058 ms | 4.40% |  1.971 ms | 0.18% | 32.467K |  68.088 GB/s |  7.30% |    267x |  1.944 ms |
| F64 |   2^7 = 128 |         128 |   2.000 GiB |     16x | 32.480 ms | 0.29% | 32.391 ms | 0.07% |  3.952K |  66.298 GB/s |  7.11% |     17x | 32.367 ms |

einsum/sum

|  T  | Tensor Size | NumElements |  DataSize   | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  | GlobalMem BW | BWUtil | Samples | Batch GPU  |
|-----|-------------|-------------|-------------|---------|------------|--------|------------|--------|----------|--------------|--------|---------|------------|
| F32 |    2^6 = 64 |          64 |  64.000 MiB |    944x | 585.716 us | 14.30% | 577.762 us | 14.06% | 110.772K | 116.153 GB/s | 12.45% |    991x | 504.905 us |
| F32 |   2^7 = 128 |         128 |   1.000 GiB |   1846x |   8.025 ms |  0.82% |   8.017 ms |  0.81% |  15.966K | 133.933 GB/s | 14.35% |   1847x |   8.007 ms |
| F64 |    2^6 = 64 |          64 | 128.000 MiB |   3136x | 668.751 us |  1.46% | 661.259 us |  0.91% |  96.785K | 202.973 GB/s | 21.75% |   3137x | 649.249 us |
| F64 |   2^7 = 128 |         128 |   2.000 GiB |   1380x |  10.776 ms |  1.44% |  10.768 ms |  1.44% |  11.887K | 199.424 GB/s | 21.37% |   1381x |  10.760 ms |
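
For reference, the two-step experiment above corresponds roughly to the following MatX calls. The `einsum` call follows MatX's documented cuTENSOR transpose example, while the reduction-dims argument on `sum` is an assumption and may differ by version; treat this as a sketch, not the exact benchmark code.

```cpp
#include "matx.h"
using namespace matx;

int main() {
  constexpr index_t M = 128, N = 128;
  cudaStream_t stream = 0;

  auto a  = make_tensor<float>({M, N});  // row-major input
  auto at = make_tensor<float>({N, M});  // materialized transpose
  auto s  = make_tensor<float>({N});     // per-row sums of the transpose

  cutensor::einsum(at, "ij->ji", stream, a);  // step 1: SoL transpose
  (s = sum(at, {1})).run(stream);             // step 2: contiguous reduction

  cudaStreamSynchronize(stream);
  return 0;
}
```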

The einsum version is quite a bit faster since it hits speed-of-light (SoL) on the transpose. However, there is still room for improvement: a single kernel aware of both the transpose and the reduction could be faster by tiling, since it avoids the round trip through global memory for the materialized transpose. This issue is to implement that kernel and compare performance.
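A minimal sketch of what such a fused, transpose-aware kernel could look like (names, tile shape, and launch geometry are assumptions, not the planned implementation): each block owns a strip of columns, threads index the contiguous dimension so every load coalesces, and the transpose is never materialized.

```cpp
constexpr int BX = 32;  // columns per block
constexpr int BY = 8;   // row-partitions per block

// Column sums of a row-major MxN matrix == row sums of its transpose.
// Launch: dim3 grid((N + BX - 1) / BX), block(BX, BY).
__global__ void colsum_fused(const float* __restrict__ in, float* out,
                             int M, int N) {
  __shared__ float partial[BY][BX];
  const int col = blockIdx.x * BX + threadIdx.x;

  // Threads sharing a column split the rows BY ways; every load is coalesced
  // because threadIdx.x walks the contiguous dimension of the input.
  float acc = 0.0f;
  if (col < N) {
    for (int row = threadIdx.y; row < M; row += BY) {
      acc += in[row * N + col];
    }
  }
  partial[threadIdx.y][threadIdx.x] = acc;
  __syncthreads();

  // Tree-combine the BY partial sums for each column.
  for (int s = BY / 2; s > 0; s >>= 1) {
    if (threadIdx.y < s) {
      partial[threadIdx.y][threadIdx.x] += partial[threadIdx.y + s][threadIdx.x];
    }
    __syncthreads();
  }
  if (threadIdx.y == 0 && col < N) {
    out[col] = partial[0][threadIdx.x];
  }
}
```

Compared to the two-step einsum path, this reads each input element exactly once and writes only the N outputs, so it should approach the one-pass memory bound rather than paying for the transpose's extra write and read.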
