Auto vectorize `std::count` in addition to manual vectorization.

Auto vectorization is engaged with `_USE_STD_VECTOR_ALGORITHMS=0` on x86, x64, and possibly on Arm64.

Towards #4653
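To illustrate the kind of loop a compiler's auto-vectorizer can typically handle, here is a minimal sketch of a branch-free counting loop. The names `count_plain` and `check_count_plain` are hypothetical and this is not the STL's actual implementation; whether a given compiler vectorizes the loop depends on the element type and on flags such as `/arch:AVX2`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper for illustration only, not the code from this PR.
// Adding the comparison result (0 or 1) instead of branching on it gives
// the loop a simple reduction shape that auto-vectorizers tend to recognize.
template <class T>
std::size_t count_plain(const T* first, const T* last, const T value) {
    std::size_t result = 0;
    for (; first != last; ++first) {
        result += (*first == value);
    }
    return result;
}

// Usage example: count the occurrences of 1 in a small array.
inline void check_count_plain() {
    const int data[] = {1, 2, 1, 3, 1, 4};
    assert(count_plain(data, data + 6, 1) == 3);
}
```

The manually vectorized path would stay in front of a loop like this when `_USE_STD_VECTOR_ALGORITHMS` is enabled; the plain loop is what the compiler gets to optimize otherwise.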
I also tried `count_if`, but for some reason auto-vectorization only worked for 8-bit elements. I might look into this later.

Is this worth doing? Should manual vectorization be dropped, or should the two coexist?
Benchmark results for auto vectorization and no vectorization were produced by editing `benchmark\CMakeFile.txt` accordingly, i.e. by adding the `/arch:AVX2` option and the `_USE_STD_VECTOR_ALGORITHMS=0` define. "Default vectorization" means SSE2 vectorization.

Note that `uint64_t` was auto-vectorized even before the change, and the results reflect that.

Results interpretation:
- `bm<uint8_t, Op::Count>/8021/3056` is way faster with manual vectorization than with auto AVX2. Auto vectorization has a less efficient reduction, which is especially noticeable in the 8-bit case, where reduction is needed often. It is less efficient because:
  - sadbw and no wider horizontal add, reported as DevCom-10657464
  - xmm# regs most of the time
- 8021/3056 cases are a bit faster for auto AVX2 due to some loop unrolling
- `uint64_t` is close to default vectorization, because both are equally vectorized.
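To make the reduction point concrete, here is a hedged SSE2 sketch of why the 8-bit case needs frequent reduction: each 8-bit lane counter can hold at most 255 matches, so the per-lane counts must be widened and summed at least every 255 vectors. The sketch assumes that "sadbw" above refers to the PSADBW instruction, exposed as the `_mm_sad_epu8` intrinsic. The function names are hypothetical, and this is not the STL's manually vectorized code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics; x86/x64 only

// Hypothetical illustration, not the implementation discussed in this PR.
std::size_t count_u8_sse2(const std::uint8_t* data, std::size_t size, std::uint8_t value) {
    const __m128i needle = _mm_set1_epi8(static_cast<char>(value));
    const __m128i zero   = _mm_setzero_si128();
    std::size_t total = 0;
    std::size_t i = 0;
    while (size - i >= 16) {
        // Process at most 255 vectors per round so no 8-bit lane counter wraps.
        std::size_t blocks = (size - i) / 16;
        if (blocks > 255) {
            blocks = 255;
        }
        __m128i lane_counts = zero;
        for (std::size_t b = 0; b < blocks; ++b, i += 16) {
            const __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
            // The compare yields 0xFF (-1) in matching lanes; subtracting it adds 1.
            lane_counts = _mm_sub_epi8(lane_counts, _mm_cmpeq_epi8(chunk, needle));
        }
        // Reduction: sum of absolute differences against zero produces two
        // partial sums, one in the low bits of each 64-bit half.
        const __m128i sums = _mm_sad_epu8(lane_counts, zero);
        total += static_cast<std::size_t>(_mm_cvtsi128_si32(sums));
        total += static_cast<std::size_t>(_mm_cvtsi128_si32(_mm_srli_si128(sums, 8)));
    }
    // Scalar tail for the remaining 0 to 15 bytes.
    for (; i < size; ++i) {
        total += (data[i] == value);
    }
    return total;
}

// Usage example: 40 bytes covers two full vectors plus an 8-byte tail.
inline void check_count_u8_sse2() {
    std::uint8_t data[40] = {};
    data[0] = data[17] = data[39] = 9;
    assert(count_u8_sse2(data, 40, 9) == 3);
}
```

With 64-bit elements the lane counters are already wide enough, so this periodic widening step is not needed, which fits the observation that the reduction cost shows up mostly in the 8-bit case.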