
Improve calculation of when to use wide residual computation #700

Merged on May 15, 2024

Conversation

ktmf01 (Collaborator) commented on May 14, 2024

This change should make 24-bit encoding faster, because the limit_residual variant of residual computation is used less often.
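For context, the decision being tuned here has roughly the following shape (a minimal sketch with illustrative names, not the actual libFLAC code): the encoder may use the fast 32-bit residual path only when the LPC accumulator provably cannot overflow an `int32`. A worst-case bound derived from the coefficient precision forces the wide/limit paths far more often than a bound computed from the quantized coefficients actually chosen.

```c
#include <stdlib.h>

/* Sketch of the datapath-width decision; names and bounds are
 * illustrative, not the actual libFLAC code. */

/* floor(log2(n)) for n >= 1 */
static unsigned ilog2u(unsigned long n)
{
	unsigned l = 0;
	while(n >>= 1)
		l++;
	return l;
}

/* Worst-case bound: assume every quantized coefficient sits at its
 * precision limit, so up to bps + precision + log2(order) bits could
 * be needed. Simple, but pessimistic for typical coefficients. */
static int needs_wide_worstcase(unsigned bps, unsigned precision, unsigned order)
{
	return bps + precision + ilog2u(order) > 32;
}

/* Tighter bound from the coefficients actually chosen: the prediction
 * accumulator magnitude is bounded by max|sample| * sum|qlp_coeff|,
 * so roughly bps + log2(sum|coeff|) + 1 bits suffice. */
static int needs_wide_actual(unsigned bps, const int qlp_coeff[], unsigned order)
{
	unsigned long sum = 0;
	unsigned i;
	for(i = 0; i < order; i++)
		sum += (unsigned long)labs((long)qlp_coeff[i]);
	return bps + ilog2u(sum) + 1 > 32;
}
```

With, say, 25-bit samples, precision 15 and order 8, the worst-case bound always selects the wide path (25 + 15 + 3 > 32), whereas a set of coefficients whose absolute values sum to less than 2^6 still fits comfortably in 32 bits; checking the real coefficients is what lets the cheaper paths run more often.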

ktmf01 (Collaborator, Author) commented on May 15, 2024

Results for an Intel Xeon E-2224G. The 16-bit input is this set of samples concatenated into a single file. The 24-bit input is created from it with sox, upsampling to 96 kHz at 24 bits; the upsampling process 'fills' the 8 extra bits.

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./mulbits -5 -j2 -c ../Rarewares-16bit.flac` | 1.210 ± 0.018 | 1.189 | 1.242 | 1.00 |
| `./current -5 -j2 -c ../Rarewares-16bit.flac` | 1.214 ± 0.025 | 1.189 | 1.253 | 1.00 ± 0.03 |
| `./mulbits -8 -j2 -c ../Rarewares-16bit.flac` | 1.988 ± 0.055 | 1.933 | 2.089 | 1.64 ± 0.05 |
| `./current -8 -j2 -c ../Rarewares-16bit.flac` | 2.008 ± 0.047 | 1.948 | 2.068 | 1.66 ± 0.05 |
| `./mulbits -8p -j2 -c ../Rarewares-16bit.flac` | 6.319 ± 0.198 | 6.111 | 6.786 | 5.22 ± 0.18 |
| `./current -8p -j2 -c ../Rarewares-16bit.flac` | 6.306 ± 0.199 | 6.171 | 6.851 | 5.21 ± 0.18 |
| `./mulbits -5 -j2 -c ../Rarewares-24bit.flac` | 2.638 ± 0.062 | 2.567 | 2.772 | 2.18 ± 0.06 |
| `./current -5 -j2 -c ../Rarewares-24bit.flac` | 2.917 ± 0.069 | 2.828 | 3.040 | 2.41 ± 0.07 |
| `./mulbits -8 -j2 -c ../Rarewares-24bit.flac` | 10.796 ± 0.199 | 10.622 | 11.180 | 8.92 ± 0.21 |
| `./current -8 -j2 -c ../Rarewares-24bit.flac` | 10.860 ± 0.135 | 10.728 | 11.171 | 8.98 ± 0.18 |
| `./mulbits -8p -j2 -c ../Rarewares-24bit.flac` | 76.075 ± 1.140 | 75.110 | 78.038 | 62.88 ± 1.34 |
| `./current -8p -j2 -c ../Rarewares-24bit.flac` | 83.079 ± 1.171 | 81.800 | 85.769 | 68.67 ± 1.43 |

The largest difference is for preset -5 with 24-bit input: 11% faster. For -8p the difference is also large: 9% faster.

I'm somewhat baffled that the difference is so large for preset -5; I am not sure why that is. A few weeks ago I was under the impression that the limit-residual functions were only seldom used, and only at higher presets (because of the higher orders). So I reran this benchmark with preset -5p, and indeed the difference is huge:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./mulbits -5p -j2 -c ../Rarewares-24bit.flac` | 5.941 ± 0.060 | 5.883 | 6.050 | 1.00 |
| `./current -5p -j2 -c ../Rarewares-24bit.flac` | 8.780 ± 0.036 | 8.737 | 8.836 | 1.48 ± 0.02 |
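For context on why being steered into the limit-residual variant costs so much: that fallback has to do its accumulation in 64 bits and clamp each residual into `int32` range, whereas the plain path can work in 32-bit arithmetic throughout (and has SIMD implementations). A rough per-sample sketch, with illustrative names rather than the actual libFLAC code:

```c
#include <stdint.h>

/* Sketch of a "limit residual" style computation: 64-bit accumulation
 * plus a clamp of each residual into int32 range. Illustrative names,
 * not the actual libFLAC code. */
static int32_t clamp_to_int32(int64_t v)
{
	if(v > INT32_MAX)
		return INT32_MAX;
	if(v < INT32_MIN)
		return INT32_MIN;
	return (int32_t)v;
}

/* One LPC residual sample: data[i] minus the order-tap prediction,
 * with the quantized-coefficient shift applied to the accumulator. */
static int32_t residual_sample_limited(const int32_t *data, long i,
                                       const int32_t coeff[], unsigned order,
                                       unsigned shift)
{
	int64_t sum = 0;
	unsigned j;
	for(j = 0; j < order; j++)
		sum += (int64_t)coeff[j] * data[i - 1 - j];
	return clamp_to_int32((int64_t)data[i] - (sum >> shift));
}
```

The extra 64-bit multiply-accumulates and the range check on every sample are pure overhead whenever a tighter bound could have proven the plain 32-bit path safe.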

ktmf01 (Collaborator, Author) commented on May 15, 2024

Results of running gprof against the -5p preset mentioned above.

Without this patch (running over the input thrice):

```
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 71.02     36.51    36.51  3122034     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual
 [...]
  0.56     49.75     0.29   218268     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2
 [...]
  0.21     50.87     0.11    25962     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
```

With this patch:

```
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 42.65     14.50    14.50  1274172     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual
 11.15     18.29     3.79  1376313     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
 [...]
  3.03     27.57     1.03   715779     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2
```

So, without the patch, the `wide_intrin_avx2` function variant accounts for only about 1% of the residual calls (25,962 of roughly 3.37 million), but with the patch it is called for about 40% of them (1,376,313 of roughly 3.37 million). The number of times the non-wide variant is called more than triples (218,268 to 715,779).

@ktmf01 ktmf01 merged commit 1ab3c8e into xiph:master May 15, 2024
14 checks passed