When is criterion stable between runs, how can it be made more so? #115

Open
rrnewton opened this issue Aug 16, 2016 · 1 comment

@rrnewton (Member)

Criterion gives highly precise measurements. Given two measurements A1 and A2 of a simple microbenchmark A, if:

  • both were taken starting from a similar machine state,
  • both report very high R^2 values, and
  • both were run with a long time limit (-L20 or higher),

then A1 and A2 should be close estimates of each other, right? No, unfortunately.
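
For concreteness, here is a minimal sketch of the kind of microbenchmark meant here (a hypothetical example, not one of the Stackage benchmarks mentioned below). Building it as an executable and running it twice with something like -L20 gives the two estimates A1 and A2 above:

```haskell
-- Minimal criterion microbenchmark (hypothetical example, not one of the
-- Stackage benchmarks discussed below).
import Criterion.Main

main :: IO ()
main = defaultMain
  [ bench "sum-to-1000" $ whnf (\n -> sum [1 .. n]) (1000 :: Int) ]
```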

When measuring 1279 benchmarks from Stackage, we have found that it is very common to see greater than 10% variation between consecutive runs of the same small, deterministic benchmark.

Anecdotally, we seem to get more stable numbers from individual high --iters runs than from linear regression. I don't have a good explanation yet. Perhaps the non-determinism in the selection of data points (on the X axis) is having more of an effect than we expected? Certainly, when there is a bad R^2, we've seen that exactly where it starts running has a big effect.
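
To illustrate what "data points on the X axis" means: each regression point pairs an iteration count with the total time measured for that many iterations, and the iteration counts grow roughly geometrically. The sketch below just generates such a series; the 1.05 growth factor is an assumption for illustration, not a verified detail of criterion's internals.

```haskell
-- Sketch of a geometrically growing series of iteration counts, i.e. the
-- X-axis points a regression run might sample.  The 1.05 growth factor is
-- an assumed placeholder, not a verified detail of criterion.
import Data.List (nub)

iterCounts :: [Int]
iterCounts = nub (map round (iterate (* 1.05) (1 :: Double)))

main :: IO ()
main = print (take 20 iterCounts)
```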

@RyanGlScott and @vollmerm have been working on this.

(On a related note, it would be great to have some assistance when using criterion with the kinds of things Krun controls, like waiting for the machine to cool down to a baseline temperature before starting a run.)
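
For what it's worth, the kind of helper meant here could be as simple as polling the CPU temperature until it drops to a baseline before launching the run. A rough Linux-only sketch follows; the sysfs path and the 50 °C threshold are made-up placeholders, not anything criterion or Krun actually provides.

```haskell
-- Rough sketch of a "wait until the machine has cooled down" helper,
-- assuming a Linux sysfs thermal zone.  The path and the 50 °C threshold
-- are placeholders, not anything criterion or Krun actually provides.
import Control.Concurrent (threadDelay)
import Control.Monad (unless)

waitForCoolCpu :: Double -> IO ()
waitForCoolCpu maxCelsius = do
  -- sysfs reports the temperature in millidegrees Celsius
  milli <- read <$> readFile "/sys/class/thermal/thermal_zone0/temp"
  let celsius = milli / 1000 :: Double
  unless (celsius <= maxCelsius) $ do
    threadDelay 5000000  -- sleep 5 seconds, then re-check
    waitForCoolCpu maxCelsius

main :: IO ()
main = waitForCoolCpu 50  -- block until the CPU is at or below 50 °C
```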

@pkkm commented Apr 16, 2018

> we seem to get more stable numbers from individual high --iters runs than from linear regression.

This matches my experience. As a simple experiment, I ran 60 benchmarks of bash -c "a=0; for i in {1..500000}; do (( a += RANDOM )); done" with the bench tool (which uses criterion under the hood); the code is in a gist. Results are from a computer with no CPU frequency scaling, nonessential daemons, or desktop environment running:

[boxplot of the statistics listed below, across the 60 runs]

Statistic              Interquartile range / Median   Range / Median
Least-squares slope    0.3%                           1.7%
Theil-Sen slope        0.3%                           1.2%
Mean                   0.2%                           0.9%
Median of means        0.2%                           0.9%
Minimum of means       0.1%                           0.5%
Quartile 1 of means    0.1%                           0.5%
Quartile 3 of means    0.3%                           1.0%

(Note that the minimum, median and quartiles are not of individual runs, but of the mean loop iteration times. Getting the true quartiles is currently impossible — #165 tracks this.)

I remember the relative reliability of various statistics in other, less contrived benchmarks being similar to this. So while R² can be useful for checking whether there are anomalies, the slope from linear regression seems useless since the mean provides the same information but with much less variation between benchmarks. Am I missing something?
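
To make that comparison concrete, here is a small self-contained sketch with synthetic data (not criterion's actual code): the ordinary least-squares slope of total time against iteration count and the plain mean time per iteration are estimating the same per-iteration cost.

```haskell
-- Self-contained sketch (synthetic data, not criterion's actual code):
-- both the least-squares slope of total time vs. iteration count and the
-- plain mean time per iteration estimate the same per-iteration cost.
main :: IO ()
main = do
  let iters   = [1, 2, 4, 8, 16, 32, 64] :: [Double]
      -- pretend each iteration costs ~1.0 time unit plus a little noise
      times   = [1.3, 2.1, 4.2, 8.1, 16.3, 32.2, 64.1]
      n       = fromIntegral (length iters)
      mx      = sum iters / n
      my      = sum times / n
      -- ordinary least-squares slope of time against iterations
      slope   = sum (zipWith (\x y -> (x - mx) * (y - my)) iters times)
              / sum (map (\x -> (x - mx) ^ (2 :: Int)) iters)
      -- mean time per iteration, weighting every run by its iteration count
      meanPer = sum times / sum iters
  putStrLn $ "least-squares slope:    " ++ show slope
  putStrLn $ "mean time / iteration:  " ++ show meanPer
```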
