Replace bitsConsumed with bitsLeft to optimize BitStream #4047
base: dev
Conversation
This change improves performance on x86 with BMI2: previously, every read had to recompute the shift amount as a subtraction from 64 (of bitsConsumed), whereas tracking bitsLeft lets us subtract only once per read. https://gcc.godbolt.org/z/dMY3j6rEa
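To make the difference concrete, here is a minimal C sketch of the two formulations (my own simplification, not zstd's actual bitstream code; the struct and function names are illustrative, and it assumes a 64-bit container, 0 < nbBits < 64, and no over-read):

```c
#include <stdint.h>

typedef struct { uint64_t bitContainer; unsigned bitsConsumed; } DStreamOld;
typedef struct { uint64_t bitContainer; unsigned bitsLeft;     } DStreamNew;

/* Old formulation: the shift amount (64 - bitsConsumed) must be
 * recomputed on every read, on top of updating bitsConsumed. */
static uint64_t readBits_old(DStreamOld* s, unsigned nbBits)
{
    s->bitsConsumed += nbBits;
    return (s->bitContainer >> (64 - s->bitsConsumed))
           & ((1ull << nbBits) - 1);
}

/* New formulation: bitsLeft is itself the shift amount, so a single
 * subtraction both updates the state and yields the shift count. */
static uint64_t readBits_new(DStreamNew* s, unsigned nbBits)
{
    s->bitsLeft -= nbBits;
    return (s->bitContainer >> s->bitsLeft)
           & ((1ull << nbBits) - 1);
}
```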
Friendly ping.
Some holistic decompression speed benchmarks to begin this analysis.

As usual, it is pretty difficult to make sense of them, due to the sheer quantity of signal. What's clear is that the change is not always positive. So let's summarize:

There is an interesting inversion between compiler versions: some are clearly better with this change, but others aren't, hence it's not a clear win. The best argument in favor of this PR so far is the […]
By the way, I also noticed the following codegen differences:

```asm
// new formulation, with BMI2
BIT_readBits(BIT_DStream_t*, unsigned int):
        movl    8(%rdi), %ecx        # load bitsLeft
        subl    %esi, %ecx           # bitsLeft -= nbBits
        shrxq   %rcx, (%rdi), %rax   # bitContainer >> bitsLeft
        bzhiq   %rsi, %rax, %rax     # keep the low nbBits
        movl    %ecx, 8(%rdi)        # store updated bitsLeft
        retq
```

```asm
// old formulation, with BMI2
BIT_readBits(BIT_DStream_t*, unsigned int):
        movl    8(%rdi), %ecx        # load bitsConsumed
        addl    %esi, %ecx           # bitsConsumed += nbBits
        movl    %ecx, %eax
        negb    %al                  # 64 - bitsConsumed (mod 64)
        shrxq   %rax, (%rdi), %rax   # bitContainer >> (64 - bitsConsumed)
        bzhiq   %rsi, %rax, %rax     # keep the low nbBits
        movl    %ecx, 8(%rdi)        # store updated bitsConsumed
        retq
```

```asm
// old formulation, no BMI support
BIT_readBits(BIT_DStream_t*, unsigned int):
        movq    (%rdi), %rdx         # load bitContainer
        movl    8(%rdi), %r8d        # load bitsConsumed
        addl    %esi, %r8d           # bitsConsumed += nbBits
        movl    %r8d, %ecx
        negb    %cl                  # 64 - bitsConsumed (mod 64)
        shrq    %cl, %rdx            # bitContainer >> (64 - bitsConsumed)
        movq    $-1, %rax
        movl    %esi, %ecx
        shlq    %cl, %rax            # -1 << nbBits
        movl    %r8d, 8(%rdi)        # store updated bitsConsumed
        notq    %rax                 # mask = (1 << nbBits) - 1
        andq    %rdx, %rax           # apply mask
        retq

BIT_readBitsFast(BIT_DStream_t*, unsigned int):
        movq    (%rdi), %rax         # load bitContainer
        movl    8(%rdi), %edx        # load bitsConsumed
        movl    %edx, %ecx
        shlq    %cl, %rax            # drop the already-consumed high bits
        movl    %esi, %ecx
        negb    %cl                  # 64 - nbBits (mod 64)
        shrq    %cl, %rax            # keep only the top nbBits
        addl    %esi, %edx           # bitsConsumed += nbBits
        movl    %edx, 8(%rdi)        # store updated bitsConsumed
        retq
```

```asm
// new formulation, no BMI support
BIT_readBits(BIT_DStream_t*, unsigned int):
        movq    (%rdi), %r8          # load bitContainer
        movl    8(%rdi), %edx        # load bitsLeft
        subl    %esi, %edx           # bitsLeft -= nbBits
        movl    %edx, %ecx
        shrq    %cl, %r8             # bitContainer >> bitsLeft
        movq    $-1, %rax
        movl    %esi, %ecx
        shlq    %cl, %rax            # -1 << nbBits
        notq    %rax                 # mask = (1 << nbBits) - 1
        andq    %r8, %rax            # apply mask
        movl    %edx, 8(%rdi)        # store updated bitsLeft
        retq
```

It's about on par with the old formulation of […]. Anyway, the question is: what happens when […]

edit: As a verification test, I compared […]
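For reference, the double-shift trick behind the old-formulation BIT_readBitsFast above can be written as a small C sketch (my own simplification, not zstd's exact code; it assumes 1 <= nbBits <= 63 and bitsConsumed < 64):

```c
#include <stdint.h>

/* Shifting left discards the already-consumed bits, shifting right keeps
 * only the top nbBits, so no mask constant ever needs to be materialized.
 * This is why the BIT_readBitsFast listing is shorter than BIT_readBits. */
static uint64_t readBitsFast_old(uint64_t bitContainer,
                                 unsigned* bitsConsumed, unsigned nbBits)
{
    uint64_t const v = (bitContainer << *bitsConsumed) >> (64 - nbBits);
    *bitsConsumed += nbBits;
    return v;
}
```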
I've looked into this as well, and I think something else worth discussing is the usage pattern of bitstreams and reloads. Specifically (for example in FSE / Huffman) we'd make ~4 reads from the stream and then reload it (when decoding we sometimes make fewer reads per reload). Here (godbolt) we can see an example that reads 4 elements from the bitstream and reloads it, and the base version is significantly shorter, mainly due to the reload. Interestingly, the reading operation actually takes the same number of opcodes in both versions, as the base version can add two registers into a third one using lea.

Eventually, it's not clear to me that one is better than the other, and my guess would be that it really depends on context and pattern of usage.
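To illustrate why the reload gets longer, here is a hedged sketch of the two reload paths (modeled loosely on zstd's reload step but not copied from it; reload_old/reload_new and the memcpy-based little-endian refill are my own simplifications, and end-of-buffer handling is ignored):

```c
#include <stdint.h>
#include <string.h>

/* Reload, old formulation: bitsConsumed maps directly onto byte arithmetic. */
static void reload_old(uint64_t* bitContainer, unsigned* bitsConsumed,
                       const uint8_t** ptr)
{
    *ptr -= *bitsConsumed >> 3;        /* rewind over fully consumed bytes */
    *bitsConsumed &= 7;                /* keep the sub-byte remainder */
    memcpy(bitContainer, *ptr, 8);     /* refill (assumes little-endian) */
}

/* Reload, new formulation: bitsLeft must first be converted back into a
 * consumed count, which is the extra arithmetic that lengthens the reload. */
static void reload_new(uint64_t* bitContainer, unsigned* bitsLeft,
                       const uint8_t** ptr)
{
    unsigned const consumed = 64 - *bitsLeft;
    *ptr -= consumed >> 3;
    *bitsLeft = 64 - (consumed & 7);
    memcpy(bitContainer, *ptr, 8);
}
```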
(Before/after codegen: see the Godbolt link above.)
Results: on Arm processors I got a regression of up to 1%, but on Intel Xeon I got really nice uplifts. AMD was less sensitive but also gained 1-2%. The effect is much more visible on well-compressed data, where we change FSE states a lot. clang is clang 16, gcc is gcc 13.2.0.
Intel(R) Xeon(R) CPU @ 2.00GHz (Skylake)
Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
AMD EPYC 7B13 Zen3
In https://github.com/google/fleetbench, where we share our production corpora, compression ratios, levels, and statistics for our top 10 biggest workloads, we have the following benchmarks (CPU time per byte):
Intel Skylake:
AMD Zen3:
Hope you can benchmark this on your own hardware and validate that it's better :)