Open Bug 1269489 Opened 4 years ago Updated 2 years ago

Automatically track standard deviation in performance tests over time

Categories

(Tree Management :: Perfherder, defect, P5)

Tracking

(Not tracked)

People

(Reporter: wlach, Unassigned)

Details

It might be interesting to automatically track the standard deviation in performance tests over time with Perfherder. Might it be possible to automatically detect when a test or platform has gotten noisier? Could we then generate "noise" alerts?
The current method for doing this is to:
* push to try and collect 12 data points for all talos tests
* push to try again (same revision) and collect 12 more data points for all talos tests

Then compare the two try pushes against each other:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=e93ebef962dd&newProject=try&newRevision=0ad2d3cc82086352841c758010ff08f0e73714fa&framework=1&showOnlyImportant=0

I found that summing the abs(deltaPercentages) and the stddev values across all tests, and averaging them, yields a decent metric:
platform   delta % (sum)   avg delta %   stddev (sum)   avg stddev
linux64    20.02           0.4           71.59          1.43

^ as seen in https://bugzilla.mozilla.org/show_bug.cgi?id=1253341#c98
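
A minimal sketch of how the numbers in that table could be computed from two pushes of the same revision. The function name, the input shape (a dict from test name to its data points), and the choice to express stddev as a percentage of the second push's mean are all assumptions for illustration, not how the compare view is actually implemented. It needs at least one test present in both pushes, two or more data points per test, and nonzero means.

```python
from statistics import mean, stdev


def noise_metrics(original, new):
    """Summarize noise between two pushes of the same revision.

    `original` and `new` map a test name to its list of data points
    (hypothetical input shape).
    """
    deltas = []
    stddevs = []
    for test in sorted(original.keys() & new.keys()):
        base_avg = mean(original[test])
        new_avg = mean(new[test])
        # Absolute percentage change between the two pushes.
        deltas.append(abs(new_avg - base_avg) / base_avg * 100.0)
        # Assumption: stddev is expressed as a percentage of the mean.
        stddevs.append(stdev(new[test]) / new_avg * 100.0)
    return {
        "delta_sum": sum(deltas),
        "delta_avg": mean(deltas),
        "stddev_sum": sum(stddevs),
        "stddev_avg": mean(stddevs),
    }
```

Since both pushes run the same code, any nonzero delta here is attributable to noise rather than to a real change.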

I would imagine we should be tracking one of the above numbers, or maybe all of them as subtests, and using a geometric mean to generate a noise summary.


The challenge I see here is that we would need to collect this data automatically on a regular basis. Possibly we could push to try once, collect 24 data points, and compare the first 12 against the last 12? If so, we could do this whenever we see 24+ data points on the same revision, and we could have a nightly job that does it automatically.
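
The single-push idea above could be sketched roughly as follows. The function name and the `group_size` parameter are hypothetical; the sketch only assumes that the data points for one test on one revision are available as a list in collection order.

```python
from statistics import mean


def split_half_delta(values, group_size=12):
    """Compare the first `group_size` data points against the last ones.

    Returns the absolute percentage difference between the two group
    means, or None when there are fewer than 2 * group_size data points.
    """
    if len(values) < 2 * group_size:
        return None
    first = values[:group_size]
    last = values[-group_size:]
    first_avg = mean(first)
    return abs(mean(last) - first_avg) / first_avg * 100.0
```

A nightly job could call this for every revision that has accumulated enough data points and skip the ones that return None.
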
Another thought here is that we could track the stddev for each signature when we have enough data and summarize it as the noise level for a platform. The summary would be the geomean of all the stddevs (or just the sum), and each subtest would be the stddev of one summary signature.

This way we don't have to compare against a previous push or set of data. The question then becomes: how do we get enough data for a single revision to help benchmark the noise?
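
A rough sketch of the per-signature idea, assuming the data points for each summary signature on a platform are already grouped in a dict. The function name and the `min_points` threshold for "enough data" are hypothetical. Signatures with a stddev of zero are left out of the geomean, since a geometric mean is only meaningful for positive values.

```python
from statistics import geometric_mean, stdev


def platform_noise(signature_values, min_points=12):
    """Return (summary, subtests) for one platform.

    `signature_values` maps a signature name to its data points.
    `subtests` holds the stddev of every signature with enough data;
    `summary` is the geomean of the positive stddevs, or None.
    """
    subtests = {}
    for signature, values in signature_values.items():
        # stdev() itself needs at least two data points.
        if len(values) >= max(min_points, 2):
            subtests[signature] = stdev(values)
    positive = [s for s in subtests.values() if s > 0]
    summary = geometric_mean(positive) if positive else None
    return summary, subtests
```

Raw stddev values carry each test's own units, so a real implementation might prefer stddev as a percentage of the mean before combining signatures.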
I wouldn't run an entirely new set of jobs to get this data; we should be able to derive a result from what we're currently running. Disambiguating noise from changes that are actually regressions/improvements might be a little tricky, but I don't think it's impossible.
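
One possible way to separate noise from real changes in existing data, offered purely as an illustration and not something proposed in this bug: estimate noise from the differences between consecutive data points rather than from the overall stddev. A regression or improvement shifts the series once and so affects only one difference, which the median ignores. The function name and input shape (one signature's values in push order) are assumptions.

```python
from statistics import median


def successive_diff_noise(values):
    """Median absolute difference between consecutive data points.

    A single step change (regression or improvement) contributes only
    one large difference, so it barely moves the median.
    """
    if len(values) < 2:
        return None
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return median(diffs)
```

For a series such as `[10, 11, 10, 11, 10, 20, 21, 20, 21, 20]`, the plain stddev is dominated by the jump from 10 to 20, while this estimate stays at the level of the point-to-point jitter.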
Priority: -- → P5