Closed Bug 1196419 Opened 9 years ago Closed 8 years ago

Figure out how to detect/prevent differences in data that aren't due to a change in the product

Categories

(Tree Management :: Perfherder, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: wlach, Unassigned)

References

Details

Attachments

(2 files)

We spent quite a bit of time chasing ghosts in bug 1190877 because the numbers produced by talos were different depending on what time the tests were run. It is currently unclear whether this was due to a machine configuration change or something else.

The individual replicates for a test run should form a "signature" and we should be able to compare them between runs. In theory, I feel like we should be able to detect when there are large variations in data produced for the same revision, and drop runs which seem to follow a different pattern from what we'd expect.

I wrote up a simple script to compare the replicates for tresize against each other and try to figure out which ones were outside the norm using a t-test. Unfortunately the results were pretty uninspiring, even with a very large number of retriggers: we found only one result which seemed off (p-value of 0.44 or something). Eliminating it did seem to bring the two sets of results closer together, but I think that's mostly coincidence:

http://nbviewer.ipython.org/url/people.mozilla.org/~wlachance/tresize%20original%20comparison.ipynb

I suspect there is something else we could be doing here, but I'm not sure what it is. Filing my work to date and CCing a few people, hopefully we can figure something out in the next couple weeks.
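A minimal sketch (not the notebook's actual code) of the kind of replicate comparison described above, assuming the replicates are available as a dict mapping run id to a list of values; it uses Welch's t-test from scipy to flag runs whose replicates look different from the pooled replicates of every other run:

# Hedged sketch: the data layout (run_id -> list of replicate values) is assumed.
from scipy import stats

def flag_outlier_runs(runs, alpha=0.01):
    """runs: dict mapping run id -> list of replicate measurements."""
    flagged = []
    for run_id, replicates in runs.items():
        # Pool the replicates from every other run as the reference sample.
        others = [v for other_id, vals in runs.items()
                  if other_id != run_id for v in vals]
        # Welch's t-test: don't assume equal variance between the samples.
        t_stat, p_value = stats.ttest_ind(replicates, others, equal_var=False)
        if p_value < alpha:
            flagged.append((run_id, p_value))
    return flagged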
Attached image 2015-08-23 00-12-06.png
perfherder
Attached image 2015-08-23 01-34-40.png
Test performance did regress around the given time, and it did so by adding a second (slower) mode; the distribution became bi-modal. This bi-modal behavior invalidates the assumptions of the t-test, which makes the t-test prone to blaming the wrong push: it will often treat samples in the new mode as outliers until, by chance, enough of the samples land in that new mode.

Looking at the attached chart: we are charting -log(t-test) to give a feel for how the t-test behaves as the new (slower) mode becomes more prevalent. The double peak on the right starts with 5e130ad70aa7 (the push that was blamed), but it is an illusion.

First, as a minor point, this confidence only exists because the same push was tested multiple times in rapid succession; that gave the results a consistency and weight which the t-test reads as a regression, because it assumes samples are not related. This can be resolved by picking only one statistic per push: combine the samples from multiple test runs on the same push into a single statistic. When this is done, the regression at 5e130ad70aa7 disappears. But it does not matter where the t-test detected the regression in this case, because it will often be wrong anyway.
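A small sketch of that "one statistic per push" step, assuming the raw replicates live in a pandas DataFrame with hypothetical push_id, push_date, and value columns:

# Hedged sketch: column names are assumed, not Perfherder's actual schema.
import pandas as pd

def one_statistic_per_push(df):
    # Pool every replicate from every run of a push and reduce it to a
    # single median, so retriggers of the same push don't get extra weight.
    return (df.groupby(["push_id", "push_date"])["value"]
              .median()
              .reset_index()
              .sort_values("push_date"))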

Second, and most important, we see test results from the new (slower) mode much earlier than 5e130ad70aa7; as early as July 27th (ad59ccd84735). Due to the nature of bi-modal results, the regression could have started even earlier.

The solution to this particular problem is to use a mixture of normals to characterize the bi-modal behavior, and to account for the fact that rare modes are not outliers. It can also tell you how many pushes you need to go back to find the true culprit.
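A sketch of that idea using scikit-learn's GaussianMixture (the data layout is assumed): fit a two-component mixture of normals to one push's replicates and report the mean and weight of each mode. Tracking the weight of the slower mode across pushes shows where it first appears, rather than treating its early samples as outliers.

# Hedged sketch: values is a flat list of replicate measurements for one push.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_two_modes(values):
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    # Return (mean, weight) pairs, fastest mode first.
    order = np.argsort(gmm.means_.ravel())
    return [(float(gmm.means_.ravel()[i]), float(gmm.weights_[i])) for i in order]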

All of the above assumes the product code is at fault.
It would be nice if all our tests were annotated with the revision of the test harness used to perform the run. Combined with the push date, we could slice the test results by harness revision over time and get a clearer picture of regressions caused by the harness itself.
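A sketch of what that slicing could look like, assuming results were annotated with a (currently hypothetical) harness_revision field and loaded into pandas:

# Hedged sketch: harness_revision, push_date, and value are assumed column names.
import pandas as pd

def summarize_by_harness(df):
    return (df.groupby("harness_revision")
              .agg(first_seen=("push_date", "min"),
                   last_seen=("push_date", "max"),
                   median_value=("value", "median"),
                   samples=("value", "size")))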
Roberto might also have good ideas, adding him as CC.

> and drop runs which seem to follow a different pattern from what we'd expect.

"what we'd expect" is not trivial IMO. I think it's also important here to not assume that the earlier results are good and the later ones are broken (though I'm not saying someone did assume that).

So I think this means a few things:

1. Check if the results cluster, e.g. bi-modal or other clustering (earlier/later, etc.), either between revisions, between runs of the same revision, or even between cycles of a single run* (see the detection sketch after this list and its footnote).

2. Raise some flag once it's detected.

3. Understand where it comes from. While test setup/environment/hardware/phase-of-the-moon might have an effect, other possibilities are that the test itself behaves inconsistently under some conditions, or that new code causes clustered behavior.


* There's already at least one case we know of and already act on: between the cycles of the same run, the first is typically worse, hence most of our filters are of the form "drop the first N cycles, then use the rest". This is not an outlier in the general sense, since it behaves pretty consistently, but we still choose to ignore it to get more stable data.
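A sketch for point 1, under the assumption that a set of results can be tested for clustering by comparing a one-component and a two-component Gaussian mixture via BIC (scikit-learn):

# Hedged sketch: flags a set of results as "clustered" when two components
# fit clearly better than a single normal, judged by BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

def looks_clustered(values, min_bic_gain=10.0):
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    bic = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
           for k in (1, 2)]
    return (bic[0] - bic[1]) > min_bic_gain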


Off topic - I really like that data correctness is being pursued actively.
Blocks: 1194333
I used the old dzAlerts ETL function to add all Talos perf results to ActiveData.

http://activedata.allizom.org/tools/query.html#query_id=AJLTEpYi
Summary: Figure out how to detect/prevent differences in data → Figure out how to detect/prevent differences in data that aren't due to a change in the product
There is not really anything actionable here. Much experimentation has been done on this topic; k-means and other methods work for some cases, but not for all.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID