
auto vectorize std::count #4673

Closed

Conversation


@AlexGuteniev (Contributor) commented May 12, 2024

Auto-vectorize std::count, in addition to the existing manual vectorization.

Auto-vectorization is engaged when _USE_STD_VECTOR_ALGORITHMS=0 on x86 and x64, and possibly on ARM64.
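
For context, here is a minimal sketch (my illustration, not the PR's actual diff) of the loop shape this relies on: a branchless compare-and-accumulate over a contiguous range, which the optimizer can widen into SIMD on its own:

```cpp
// Minimal sketch (illustration only, not the PR's diff) of an
// auto-vectorization-friendly count: a branchless compare-and-accumulate
// loop with no early exits, which the optimizer can turn into SIMD.
#include <cstddef>

template <class T>
std::ptrdiff_t count_plain(const T* first, const T* last, const T value) {
    std::ptrdiff_t result = 0;
    for (; first != last; ++first) {
        result += static_cast<std::ptrdiff_t>(*first == value);
    }
    return result;
}
```

Compiled with /O2 (and optionally /arch:AVX2), MSVC can typically vectorize this pattern without any intrinsics.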

Towards #4653

I also tried count_if, but for some reason auto-vectorization only worked for 8-bit elements. I might look into this later.

Is this worth doing? Should we drop the manual vectorization or let the two coexist?


Benchmark results for auto vectorization and no vectorization were produced by editing benchmark\CMakeLists.txt accordingly, i.e. by adding the /arch:AVX2 option and the _USE_STD_VECTOR_ALGORITHMS=0 define. "Default" vectorization means SSE2 vectorization.

Note that uint64_t was auto-vectorized even before this change; the results reflect that.

| Benchmark | None | Manual | Auto, default | Auto, AVX2 |
|---|---|---|---|---|
| `bm<uint8_t, Op::Count>/8021/3056` | 1971 ns | 75.4 ns | 196 ns | 179 ns |
| `bm<uint8_t, Op::Count>/63/62` | 26.4 ns | 4.77 ns | 13.7 ns | 30.4 ns |
| `bm<uint8_t, Op::Count>/31/30` | 10.3 ns | 9.63 ns | 15.2 ns | 15.0 ns |
| `bm<uint8_t, Op::Count>/15/14` | 5.94 ns | 6.23 ns | 7.51 ns | 8.53 ns |
| `bm<uint8_t, Op::Count>/7/6` | 3.10 ns | 3.82 ns | 4.02 ns | 4.15 ns |
| `bm<uint16_t, Op::Count>/8021/3056` | 1984 ns | 132 ns | 245 ns | 126 ns |
| `bm<uint16_t, Op::Count>/63/62` | 30.1 ns | 4.93 ns | 8.67 ns | 16.8 ns |
| `bm<uint16_t, Op::Count>/31/30` | 19.4 ns | 4.75 ns | 7.92 ns | 14.3 ns |
| `bm<uint16_t, Op::Count>/15/14` | 5.06 ns | 5.03 ns | 8.26 ns | 8.49 ns |
| `bm<uint16_t, Op::Count>/7/6` | 2.91 ns | 3.74 ns | 3.80 ns | 4.06 ns |
| `bm<uint32_t, Op::Count>/8021/3056` | 1926 ns | 254 ns | 411 ns | 241 ns |
| `bm<uint32_t, Op::Count>/63/62` | 26.6 ns | 5.02 ns | 9.73 ns | 9.60 ns |
| `bm<uint32_t, Op::Count>/31/30` | 9.97 ns | 4.35 ns | 5.09 ns | 8.82 ns |
| `bm<uint32_t, Op::Count>/15/14` | 5.82 ns | 4.04 ns | 4.46 ns | 8.34 ns |
| `bm<uint32_t, Op::Count>/7/6` | 3.00 ns | 3.95 ns | 3.69 ns | 3.75 ns |
| `bm<uint64_t, Op::Count>/8021/3056` | 776 ns | 754 ns | 785 ns | 617 ns |
| `bm<uint64_t, Op::Count>/63/62` | 11.7 ns | 13.2 ns | 10.2 ns | 8.10 ns |
| `bm<uint64_t, Op::Count>/31/30` | 5.05 ns | 4.73 ns | 5.07 ns | 4.84 ns |
| `bm<uint64_t, Op::Count>/15/14` | 3.74 ns | 3.82 ns | 3.63 ns | 4.73 ns |
| `bm<uint64_t, Op::Count>/7/6` | 2.89 ns | 3.59 ns | 3.33 ns | 3.42 ns |

Results interpretation:

  • `bm<uint8_t, Op::Count>/8021/3056` is way faster with manual vectorization than with auto AVX2. Auto vectorization has a less efficient reduction, which is especially noticeable in the 8-bit case, where reduction is needed often. It is less efficient because of:
    • Inefficient instruction selection for the 8-bit case: in particular, no psadbw and no wider horizontal add (reported as DevCom-10657464)
    • Not utilizing the full AVX2 width; xmm registers are used most of the time
    • The need to reduce often: after every 0x80 elements rather than every 0xFF. Note that if a count like 0xFE were selected, there would be a scalar tail on each portion. Without knowing the vector register size and the loop unrolling factor, it is impossible to know the maximum count that avoids the scalar part, so the code resorts to a power-of-two count (see the psadbw sketch after this list).
  • The other 8021/3056 cases are a bit faster with auto AVX2 due to some loop unrolling.
  • Small-length tail benchmark results are terrible for auto vectorization, which handles the tail in a scalar way rather than with a mask. Due to loop unrolling, the tail can be longer than a vector register.
  • Default SSE2 is somewhat slower than AVX2, almost twice as slow for smaller elements.
  • For uint64_t, no vectorization is close to default vectorization, because both are equally vectorized.
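
To make the reduction points concrete, here is a hedged sketch (my illustration under SSE2/x64 assumptions, not the STL's actual code) of the psadbw-based reduction idea these bullets refer to: matches are accumulated in byte lanes, then flushed into 64-bit lanes with psadbw before any byte lane can overflow, processing a power-of-two portion at a time.

```cpp
// Sketch of psadbw-style reduction for 8-bit count (illustration only,
// not the STL's implementation). Matches accumulate per byte lane and
// are flushed into 64-bit lanes before a byte lane can exceed 255.
#include <emmintrin.h> // SSE2
#include <cstddef>
#include <cstdint>

std::size_t count_u8_sse2(const std::uint8_t* p, std::size_t n, std::uint8_t v) {
    const __m128i needle = _mm_set1_epi8(static_cast<char>(v));
    __m128i wide = _mm_setzero_si128(); // two 64-bit accumulators
    std::size_t i = 0;
    while (n - i >= 16) {
        __m128i narrow = _mm_setzero_si128(); // sixteen 8-bit accumulators
        // Power-of-two portion (0x80 vectors) so each portion has no scalar
        // tail and no byte lane can overflow (max 0x80 < 0x100 per lane).
        std::size_t portion = (n - i) / 16;
        if (portion > 0x80) {
            portion = 0x80;
        }
        for (std::size_t j = 0; j != portion; ++j, i += 16) {
            const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
            const __m128i mask = _mm_cmpeq_epi8(data, needle); // 0xFF per match
            narrow = _mm_sub_epi8(narrow, mask); // subtracting 0xFF adds 1
        }
        // psadbw against zero horizontally sums each 8-byte half into a 64-bit lane.
        wide = _mm_add_epi64(wide, _mm_sad_epu8(narrow, _mm_setzero_si128()));
    }
    std::size_t result = static_cast<std::size_t>(_mm_cvtsi128_si64(wide))
                       + static_cast<std::size_t>(_mm_cvtsi128_si64(_mm_unpackhi_epi64(wide, wide)));
    for (; i != n; ++i) { // scalar tail, shorter than one vector
        result += p[i] == v;
    }
    return result;
}
```

Hand-written code like this knows the register width, so it could afford the tighter 0xFF cadence; source written for the auto-vectorizer cannot, which is why it settles for the power-of-two portion described above.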

@AlexGuteniev requested a review from a team as a code owner May 12, 2024 09:43
@StephanTLavavej added the "performance (Must go faster)" label May 12, 2024
@StephanTLavavej added the "decision needed (We need to choose something before working on this)" label May 16, 2024
@StephanTLavavej (Member) commented:

Marking as "decision needed" for the next STL maintainers meeting. Thanks for looking into this - after reviewing the benchmark results and looking at the code, I think that we should have only the manual vectorization here. I don't think that the benefits for (very rare) users of the escape hatch are worth the maintenance costs of this codepath. (For iota and ranges::iota, I feel differently because (1) the auto-vectorization-friendly codepath is much simpler and (2) there is no manually vectorized version, so all users benefit).

While this may benefit ARM64, I think such performance work is premature, as we haven't really started looking into ARM64 benchmarking, nor have we added ARM64 manually vectorized algorithms.

@AlexGuteniev (Contributor, Author) commented:

The main difference between the iota auto vectorization and this one is that the iota auto vectorization does not attempt to handle large ranges, which would wrap the counter. This makes the iota auto vectorization simpler.

I don't think iota should handle large ranges, because wrapping the counter is squirrelly (and UB for a signed counter). Also, it is not very efficient to use large iota arrays instead of iota_view.

In contrast, counting a large number of small elements is just as reasonable as counting a small number of them.
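
To make the wrap concern concrete, a toy example (hypothetical, not from the PR): an iota over more elements than the counter type can represent wraps an unsigned counter, so the output repeats; overflowing a signed counter such as int would be undefined behavior.

```cpp
// Toy illustration of counter wrap in iota (hypothetical example).
#include <cstdint>
#include <numeric>
#include <vector>

int main() {
    std::vector<std::uint8_t> v(300);
    std::iota(v.begin(), v.end(), std::uint8_t{0});
    // v now holds 0, 1, ..., 255, 0, 1, ..., 43: the unsigned counter wrapped.
    // With a signed counter like int starting near INT_MAX, the increment would be UB.
}
```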

@StephanTLavavej removed the "decision needed" label May 22, 2024
@StephanTLavavej (Member) commented:

We talked about this at the weekly maintainer meeting and decided not to merge this, for the reasons that I previously explained. Thanks again for looking into this.

@AlexGuteniev deleted the counting_on_compiler branch May 23, 2024 04:40