
auto vectorize std::count #4673

Closed

Conversation


@AlexGuteniev (Contributor) commented May 12, 2024

Auto-vectorize std::count, in addition to the existing manual vectorization.

Auto-vectorization is engaged when _USE_STD_VECTOR_ALGORITHMS=0 on x86 and x64, and possibly on ARM64.
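
For context, here is a minimal sketch (my illustration, not the PR's actual diff) of the loop shape this relies on: a branchless compare-and-accumulate over a contiguous range, which the optimizer can widen into SIMD on its own:

```cpp
// Minimal sketch (illustration only, not the PR's diff) of an
// auto-vectorization-friendly count: a branchless compare-and-accumulate
// loop with no early exits, which the optimizer can turn into SIMD.
#include <cstddef>

template <class T>
std::ptrdiff_t count_plain(const T* first, const T* last, const T value) {
    std::ptrdiff_t result = 0;
    for (; first != last; ++first) {
        result += static_cast<std::ptrdiff_t>(*first == value);
    }
    return result;
}
```

Compiled with /O2 (and optionally /arch:AVX2), MSVC can typically vectorize this pattern without any intrinsics.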

Towards #4653

I also tried count_if, but for some reason auto-vectorization only worked for 8-bit elements. I might look into this later.

Is this worth doing? Should we drop the manual vectorization or let the two coexist?


Benchmark results for auto vectorization and no vectorization were produced by editing benchmark\CMakeLists.txt accordingly, i.e. by adding the /arch:AVX2 option and the _USE_STD_VECTOR_ALGORITHMS=0 define. "Default" vectorization means SSE2 vectorization.

Note that uint64_t was auto-vectorized even before this change; the results reflect that.

| Benchmark | None | Manual | Auto, default | Auto, AVX2 |
|---|---|---|---|---|
| `bm<uint8_t, Op::Count>/8021/3056` | 1971 ns | 75.4 ns | 196 ns | 179 ns |
| `bm<uint8_t, Op::Count>/63/62` | 26.4 ns | 4.77 ns | 13.7 ns | 30.4 ns |
| `bm<uint8_t, Op::Count>/31/30` | 10.3 ns | 9.63 ns | 15.2 ns | 15.0 ns |
| `bm<uint8_t, Op::Count>/15/14` | 5.94 ns | 6.23 ns | 7.51 ns | 8.53 ns |
| `bm<uint8_t, Op::Count>/7/6` | 3.10 ns | 3.82 ns | 4.02 ns | 4.15 ns |
| `bm<uint16_t, Op::Count>/8021/3056` | 1984 ns | 132 ns | 245 ns | 126 ns |
| `bm<uint16_t, Op::Count>/63/62` | 30.1 ns | 4.93 ns | 8.67 ns | 16.8 ns |
| `bm<uint16_t, Op::Count>/31/30` | 19.4 ns | 4.75 ns | 7.92 ns | 14.3 ns |
| `bm<uint16_t, Op::Count>/15/14` | 5.06 ns | 5.03 ns | 8.26 ns | 8.49 ns |
| `bm<uint16_t, Op::Count>/7/6` | 2.91 ns | 3.74 ns | 3.80 ns | 4.06 ns |
| `bm<uint32_t, Op::Count>/8021/3056` | 1926 ns | 254 ns | 411 ns | 241 ns |
| `bm<uint32_t, Op::Count>/63/62` | 26.6 ns | 5.02 ns | 9.73 ns | 9.60 ns |
| `bm<uint32_t, Op::Count>/31/30` | 9.97 ns | 4.35 ns | 5.09 ns | 8.82 ns |
| `bm<uint32_t, Op::Count>/15/14` | 5.82 ns | 4.04 ns | 4.46 ns | 8.34 ns |
| `bm<uint32_t, Op::Count>/7/6` | 3.00 ns | 3.95 ns | 3.69 ns | 3.75 ns |
| `bm<uint64_t, Op::Count>/8021/3056` | 776 ns | 754 ns | 785 ns | 617 ns |
| `bm<uint64_t, Op::Count>/63/62` | 11.7 ns | 13.2 ns | 10.2 ns | 8.10 ns |
| `bm<uint64_t, Op::Count>/31/30` | 5.05 ns | 4.73 ns | 5.07 ns | 4.84 ns |
| `bm<uint64_t, Op::Count>/15/14` | 3.74 ns | 3.82 ns | 3.63 ns | 4.73 ns |
| `bm<uint64_t, Op::Count>/7/6` | 2.89 ns | 3.59 ns | 3.33 ns | 3.42 ns |

Results interpretation:

  • `bm<uint8_t, Op::Count>/8021/3056` is way faster with manual vectorization than with auto AVX2. Auto vectorization has a less efficient reduction, which is especially noticeable in the 8-bit case, where reduction is needed often. It is less efficient because of:
    • Inefficient instruction selection for the 8-bit case: in particular, no psadbw and no wider horizontal add (reported as DevCom-10657464)
    • Not utilizing the full AVX2 width; xmm registers are used most of the time
    • The need to reduce often: after every 0x80 elements rather than every 0xFF. Note that if a count like 0xFE were selected, there would be a scalar tail on each portion. Without knowing the vector register size and the loop unrolling factor, it is impossible to know the maximum count that avoids the scalar part, so the code resorts to a power-of-two count (see the psadbw sketch after this list).
  • The other 8021/3056 cases are a bit faster with auto AVX2 due to some loop unrolling.
  • Small-length tail benchmark results are terrible for auto vectorization, which handles the tail in a scalar way rather than with a mask. Due to loop unrolling, the tail can be longer than a vector register.
  • Default SSE2 is somewhat slower than AVX2, almost twice as slow for smaller elements.
  • For uint64_t, no vectorization is close to default vectorization, because both are equally vectorized.
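
To make the reduction points concrete, here is a hedged sketch (my illustration under SSE2/x64 assumptions, not the STL's actual code) of the psadbw-based reduction idea these bullets refer to: matches are accumulated in byte lanes, then flushed into 64-bit lanes with psadbw before any byte lane can overflow, processing a power-of-two portion at a time.

```cpp
// Sketch of psadbw-style reduction for 8-bit count (illustration only,
// not the STL's implementation). Matches accumulate per byte lane and
// are flushed into 64-bit lanes before a byte lane can exceed 255.
#include <emmintrin.h> // SSE2
#include <cstddef>
#include <cstdint>

std::size_t count_u8_sse2(const std::uint8_t* p, std::size_t n, std::uint8_t v) {
    const __m128i needle = _mm_set1_epi8(static_cast<char>(v));
    __m128i wide = _mm_setzero_si128(); // two 64-bit accumulators
    std::size_t i = 0;
    while (n - i >= 16) {
        __m128i narrow = _mm_setzero_si128(); // sixteen 8-bit accumulators
        // Power-of-two portion (0x80 vectors) so each portion has no scalar
        // tail and no byte lane can overflow (max 0x80 < 0x100 per lane).
        std::size_t portion = (n - i) / 16;
        if (portion > 0x80) {
            portion = 0x80;
        }
        for (std::size_t j = 0; j != portion; ++j, i += 16) {
            const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
            const __m128i mask = _mm_cmpeq_epi8(data, needle); // 0xFF per match
            narrow = _mm_sub_epi8(narrow, mask); // subtracting 0xFF adds 1
        }
        // psadbw against zero horizontally sums each 8-byte half into a 64-bit lane.
        wide = _mm_add_epi64(wide, _mm_sad_epu8(narrow, _mm_setzero_si128()));
    }
    std::size_t result = static_cast<std::size_t>(_mm_cvtsi128_si64(wide))
                       + static_cast<std::size_t>(_mm_cvtsi128_si64(_mm_unpackhi_epi64(wide, wide)));
    for (; i != n; ++i) { // scalar tail, shorter than one vector
        result += p[i] == v;
    }
    return result;
}
```

Hand-written code like this knows the register width, so it could afford the tighter 0xFF cadence; source written for the auto-vectorizer cannot, which is why it settles for the power-of-two portion described above.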

@AlexGuteniev requested a review from a team as a code owner May 12, 2024 09:43
@StephanTLavavej added the "performance (Must go faster)" label May 12, 2024
@StephanTLavavej added the "decision needed (We need to choose something before working on this)" label May 16, 2024
@StephanTLavavej (Member) commented:

Marking as "decision needed" for the next STL maintainers meeting. Thanks for looking into this - after reviewing the benchmark results and looking at the code, I think that we should have only the manual vectorization here. I don't think that the benefits for (very rare) users of the escape hatch are worth the maintenance costs of this codepath. (For iota and ranges::iota, I feel differently because (1) the auto-vectorization-friendly codepath is much simpler and (2) there is no manually vectorized version, so all users benefit).

While this may benefit ARM64, I think such performance work is premature, as we haven't really started looking into ARM64 benchmarking, nor have we added ARM64 manually vectorized algorithms.

@AlexGuteniev (Contributor, Author) commented:

The main difference between the iota auto vectorization and this one is that the iota auto vectorization does not attempt to handle large ranges, which would wrap the counter. This makes the iota auto vectorization simpler.

I don't think iota should handle large ranges, because wrapping the counter is squirrelly (and UB for a signed counter). Also, it is not very efficient to use large iota arrays instead of iota_view.

In contrast, counting a large number of small elements is just as reasonable as counting a small number of them.
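
To make the wrap concern concrete, a toy example (hypothetical, not from the PR): an iota over more elements than the counter type can represent wraps an unsigned counter, so the output repeats; overflowing a signed counter such as int would be undefined behavior.

```cpp
// Toy illustration of counter wrap in iota (hypothetical example).
#include <cstdint>
#include <numeric>
#include <vector>

int main() {
    std::vector<std::uint8_t> v(300);
    std::iota(v.begin(), v.end(), std::uint8_t{0});
    // v now holds 0, 1, ..., 255, 0, 1, ..., 43: the unsigned counter wrapped.
    // With a signed counter like int starting near INT_MAX, the increment would be UB.
}
```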

@StephanTLavavej removed the "decision needed" label May 22, 2024
@StephanTLavavej (Member) commented:

We talked about this at the weekly maintainer meeting and decided not to merge this, for the reasons that I previously explained. Thanks again for looking into this.

@AlexGuteniev deleted the counting_on_compiler branch May 23, 2024 04:40