Replace bitsConsumed with bitsLeft to optimize BitStream #4047
base: dev
Conversation
This change improves performance on x86 with BMI2: previously, every read had to recompute the shift amount as a subtraction from 64 (of bitsConsumed), whereas tracking bitsLeft lets us subtract only once per read. https://gcc.godbolt.org/z/dMY3j6rEa
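To make the difference concrete, here is a minimal C sketch of the two formulations (my own simplification, not zstd's actual bitstream code; the struct and function names are illustrative, and it assumes a 64-bit container, 0 < nbBits < 64, and no over-read):

```c
#include <stdint.h>

typedef struct { uint64_t bitContainer; unsigned bitsConsumed; } DStreamOld;
typedef struct { uint64_t bitContainer; unsigned bitsLeft;     } DStreamNew;

/* Old formulation: the shift amount (64 - bitsConsumed) must be
 * recomputed on every read, on top of updating bitsConsumed. */
static uint64_t readBits_old(DStreamOld* s, unsigned nbBits)
{
    s->bitsConsumed += nbBits;
    return (s->bitContainer >> (64 - s->bitsConsumed))
           & ((1ull << nbBits) - 1);
}

/* New formulation: bitsLeft is itself the shift amount, so a single
 * subtraction both updates the state and yields the shift count. */
static uint64_t readBits_new(DStreamNew* s, unsigned nbBits)
{
    s->bitsLeft -= nbBits;
    return (s->bitContainer >> s->bitsLeft)
           & ((1ull << nbBits) - 1);
}
```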
Friendly ping.
Some holistic decompression speed benchmarks to begin this analysis.

As usual, it is pretty difficult to make sense of them, due to the sheer quantity of signal. What's clear is that the change is not always positive. So let's summarize:

There is an interesting inversion between compiler versions: some are clearly better with this change, but others aren't, hence it's not a clear win. The best argument in favor of this PR so far is the […]
By the way, I also noticed the following codegen differences:

```asm
// new formulation, with BMI2
BIT_readBits(BIT_DStream_t*, unsigned int):
        movl    8(%rdi), %ecx        # load bitsLeft
        subl    %esi, %ecx           # bitsLeft -= nbBits
        shrxq   %rcx, (%rdi), %rax   # bitContainer >> bitsLeft
        bzhiq   %rsi, %rax, %rax     # keep the low nbBits
        movl    %ecx, 8(%rdi)        # store updated bitsLeft
        retq
```

```asm
// old formulation, with BMI2
BIT_readBits(BIT_DStream_t*, unsigned int):
        movl    8(%rdi), %ecx        # load bitsConsumed
        addl    %esi, %ecx           # bitsConsumed += nbBits
        movl    %ecx, %eax
        negb    %al                  # 64 - bitsConsumed (mod 64)
        shrxq   %rax, (%rdi), %rax   # bitContainer >> (64 - bitsConsumed)
        bzhiq   %rsi, %rax, %rax     # keep the low nbBits
        movl    %ecx, 8(%rdi)        # store updated bitsConsumed
        retq
```

```asm
// old formulation, no BMI support
BIT_readBits(BIT_DStream_t*, unsigned int):
        movq    (%rdi), %rdx         # load bitContainer
        movl    8(%rdi), %r8d        # load bitsConsumed
        addl    %esi, %r8d           # bitsConsumed += nbBits
        movl    %r8d, %ecx
        negb    %cl                  # 64 - bitsConsumed (mod 64)
        shrq    %cl, %rdx            # bitContainer >> (64 - bitsConsumed)
        movq    $-1, %rax
        movl    %esi, %ecx
        shlq    %cl, %rax            # -1 << nbBits
        movl    %r8d, 8(%rdi)        # store updated bitsConsumed
        notq    %rax                 # mask = (1 << nbBits) - 1
        andq    %rdx, %rax           # apply mask
        retq

BIT_readBitsFast(BIT_DStream_t*, unsigned int):
        movq    (%rdi), %rax         # load bitContainer
        movl    8(%rdi), %edx        # load bitsConsumed
        movl    %edx, %ecx
        shlq    %cl, %rax            # drop the already-consumed high bits
        movl    %esi, %ecx
        negb    %cl                  # 64 - nbBits (mod 64)
        shrq    %cl, %rax            # keep only the top nbBits
        addl    %esi, %edx           # bitsConsumed += nbBits
        movl    %edx, 8(%rdi)        # store updated bitsConsumed
        retq
```

```asm
// new formulation, no BMI support
BIT_readBits(BIT_DStream_t*, unsigned int):
        movq    (%rdi), %r8          # load bitContainer
        movl    8(%rdi), %edx        # load bitsLeft
        subl    %esi, %edx           # bitsLeft -= nbBits
        movl    %edx, %ecx
        shrq    %cl, %r8             # bitContainer >> bitsLeft
        movq    $-1, %rax
        movl    %esi, %ecx
        shlq    %cl, %rax            # -1 << nbBits
        notq    %rax                 # mask = (1 << nbBits) - 1
        andq    %r8, %rax            # apply mask
        movl    %edx, 8(%rdi)        # store updated bitsLeft
        retq
```

It's about on par with the old formulation of […]. Anyway, the question is: what happens when […]

edit: As a verification test, I compared […]
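For reference, the double-shift trick behind the old-formulation BIT_readBitsFast above can be written as a small C sketch (my own simplification, not zstd's exact code; it assumes 1 <= nbBits <= 63 and bitsConsumed < 64):

```c
#include <stdint.h>

/* Shifting left discards the already-consumed bits, shifting right keeps
 * only the top nbBits, so no mask constant ever needs to be materialized.
 * This is why the BIT_readBitsFast listing is shorter than BIT_readBits. */
static uint64_t readBitsFast_old(uint64_t bitContainer,
                                 unsigned* bitsConsumed, unsigned nbBits)
{
    uint64_t const v = (bitContainer << *bitsConsumed) >> (64 - nbBits);
    *bitsConsumed += nbBits;
    return v;
}
```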
I've looked into this as well, and I think something else worth discussing is the usage pattern of bitstreams and reloads. Specifically (for example in FSE / Huffman) we'd make ~4 reads from the stream and then reload it (when decoding we sometimes make fewer reads per reload). Here (godbolt) we can see an example that reads 4 elements from the bitstream and reloads it, and the base version is significantly shorter, mainly due to the reload. Interestingly, the reading operation actually takes the same number of opcodes in both versions, as the base version can add two registers into a third one using lea.

Eventually, it's not clear to me that one is better than the other, and my guess would be that it really depends on context and pattern of usage.
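To illustrate why the reload gets longer, here is a hedged sketch of the two reload paths (modeled loosely on zstd's reload step but not copied from it; reload_old/reload_new and the memcpy-based little-endian refill are my own simplifications, and end-of-buffer handling is ignored):

```c
#include <stdint.h>
#include <string.h>

/* Reload, old formulation: bitsConsumed maps directly onto byte arithmetic. */
static void reload_old(uint64_t* bitContainer, unsigned* bitsConsumed,
                       const uint8_t** ptr)
{
    *ptr -= *bitsConsumed >> 3;        /* rewind over fully consumed bytes */
    *bitsConsumed &= 7;                /* keep the sub-byte remainder */
    memcpy(bitContainer, *ptr, 8);     /* refill (assumes little-endian) */
}

/* Reload, new formulation: bitsLeft must first be converted back into a
 * consumed count, which is the extra arithmetic that lengthens the reload. */
static void reload_new(uint64_t* bitContainer, unsigned* bitsLeft,
                       const uint8_t** ptr)
{
    unsigned const consumed = 64 - *bitsLeft;
    *ptr -= consumed >> 3;
    *bitsLeft = 64 - (consumed & 7);
    memcpy(bitContainer, *ptr, 8);
}
```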
(Before/after codegen: see the Godbolt link above.)
Results: on Arm processors I got a regression of up to 1%, but on Intel Xeon I got really nice uplifts. AMD was less sensitive but also gained 1-2%. The effect is much more visible on well-compressed data, where we change FSE states a lot. clang is clang 16, gcc is gcc 13.2.0.
Intel(R) Xeon(R) CPU @ 2.00GHz (Skylake)
Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
AMD EPYC 7B13 Zen3
In https://github.com/google/fleetbench, where we share our production corpora, compression ratios, levels, and statistics for our top 10 biggest workloads, we have the following benchmarks (CPU time per byte):
Intel Skylake:
AMD Zen3:
Hope you can benchmark this on your own hardware and validate that it's better :)