When is criterion stable between runs, how can it be made more so? #115

Open
rrnewton opened this issue Aug 16, 2016 · 1 comment

@rrnewton (Member)

Criterion gives highly precise measurements. Given two measurements A1 and A2 of a simple microbenchmark A, if:

  • both were taken starting from a similar machine state,
  • both report very high R^2 values, and
  • both were run with a long time limit (-L20 or higher),

then A1 and A2 should be close estimates of each other, right? No, unfortunately.
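
For concreteness, here is a minimal sketch of the kind of microbenchmark meant here (a hypothetical example, not one of the Stackage benchmarks mentioned below). Building it as an executable and running it twice with something like -L20 gives the two estimates A1 and A2 above:

```haskell
-- Minimal criterion microbenchmark (hypothetical example, not one of the
-- Stackage benchmarks discussed below).
import Criterion.Main

main :: IO ()
main = defaultMain
  [ bench "sum-to-1000" $ whnf (\n -> sum [1 .. n]) (1000 :: Int) ]
```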

When measuring 1279 benchmarks from Stackage, we have found that it is very common to see greater than 10% variation between consecutive runs of the same small, deterministic benchmark.

Anecdotally, we seem to get more stable numbers from individual high --iters runs than from linear regression. I don't have a good explanation yet. Perhaps the non-determinism in the selection of data points (on the X axis) is having more of an effect than we expected? Certainly, when there is a bad R^2, we've seen that exactly where it starts running has a big effect.
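
To illustrate what "data points on the X axis" means: each regression point pairs an iteration count with the total time measured for that many iterations, and the iteration counts grow roughly geometrically. The sketch below just generates such a series; the 1.05 growth factor is an assumption for illustration, not a verified detail of criterion's internals.

```haskell
-- Sketch of a geometrically growing series of iteration counts, i.e. the
-- X-axis points a regression run might sample.  The 1.05 growth factor is
-- an assumed placeholder, not a verified detail of criterion.
import Data.List (nub)

iterCounts :: [Int]
iterCounts = nub (map round (iterate (* 1.05) (1 :: Double)))

main :: IO ()
main = print (take 20 iterCounts)
```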

@RyanGlScott and @vollmerm have been working on this.

(On a related note, it would be great to have some assistance when using criterion with the kinds of things Krun controls, like waiting for the machine to cool down to a baseline temperature before starting a run.)
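
For what it's worth, the kind of helper meant here could be as simple as polling the CPU temperature until it drops to a baseline before launching the run. A rough Linux-only sketch follows; the sysfs path and the 50 °C threshold are made-up placeholders, not anything criterion or Krun actually provides.

```haskell
-- Rough sketch of a "wait until the machine has cooled down" helper,
-- assuming a Linux sysfs thermal zone.  The path and the 50 °C threshold
-- are placeholders, not anything criterion or Krun actually provides.
import Control.Concurrent (threadDelay)
import Control.Monad (unless)

waitForCoolCpu :: Double -> IO ()
waitForCoolCpu maxCelsius = do
  -- sysfs reports the temperature in millidegrees Celsius
  milli <- read <$> readFile "/sys/class/thermal/thermal_zone0/temp"
  let celsius = milli / 1000 :: Double
  unless (celsius <= maxCelsius) $ do
    threadDelay 5000000  -- sleep 5 seconds, then re-check
    waitForCoolCpu maxCelsius

main :: IO ()
main = waitForCoolCpu 50  -- block until the CPU is at or below 50 °C
```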

@pkkm commented Apr 16, 2018

> we seem to get more stable numbers from individual high --iters runs than from linear regression.

This matches my experience. As a simple experiment, I ran 60 benchmarks of bash -c "a=0; for i in {1..500000}; do (( a += RANDOM )); done" with the bench tool (which uses criterion under the hood); the code is in a gist. Results are from a computer with no CPU frequency scaling, nonessential daemons, or desktop environment running:

[boxplot of the statistics listed below, across the 60 runs]

Statistic              Interquartile range / Median   Range / Median
Least-squares slope    0.3%                           1.7%
Theil-Sen slope        0.3%                           1.2%
Mean                   0.2%                           0.9%
Median of means        0.2%                           0.9%
Minimum of means       0.1%                           0.5%
Quartile 1 of means    0.1%                           0.5%
Quartile 3 of means    0.3%                           1.0%

(Note that the minimum, median and quartiles are not of individual runs, but of the mean loop iteration times. Getting the true quartiles is currently impossible — #165 tracks this.)

I remember the relative reliability of various statistics in other, less contrived benchmarks being similar to this. So while R² can be useful for checking whether there are anomalies, the slope from linear regression seems useless since the mean provides the same information but with much less variation between benchmarks. Am I missing something?
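
To make that comparison concrete, here is a small self-contained sketch with synthetic data (not criterion's actual code): the ordinary least-squares slope of total time against iteration count and the plain mean time per iteration are estimating the same per-iteration cost.

```haskell
-- Self-contained sketch (synthetic data, not criterion's actual code):
-- both the least-squares slope of total time vs. iteration count and the
-- plain mean time per iteration estimate the same per-iteration cost.
main :: IO ()
main = do
  let iters   = [1, 2, 4, 8, 16, 32, 64] :: [Double]
      -- pretend each iteration costs ~1.0 time unit plus a little noise
      times   = [1.3, 2.1, 4.2, 8.1, 16.3, 32.2, 64.1]
      n       = fromIntegral (length iters)
      mx      = sum iters / n
      my      = sum times / n
      -- ordinary least-squares slope of time against iterations
      slope   = sum (zipWith (\x y -> (x - mx) * (y - my)) iters times)
              / sum (map (\x -> (x - mx) ^ (2 :: Int)) iters)
      -- mean time per iteration, weighting every run by its iteration count
      meanPer = sum times / sum iters
  putStrLn $ "least-squares slope:    " ++ show slope
  putStrLn $ "mean time / iteration:  " ++ show meanPer
```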
