Closed Bug 879903 Opened 11 years ago Closed 11 years ago

Edge effects cause spooky action at a distance, messing up Talos regression blame

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mbrubeck, Assigned: mbrubeck)

References

Details

(Whiteboard: [SfN][regression-detection][mentor=mbrubeck][lang=python])

Attachments

(1 file, 1 obsolete file)

In this dataset, it's clear that 627af04199a5 caused a regression.  But the regression finder blamed the previous push, 8dc5770ad79d:

http://graphs.mozilla.org/graph.html#tests=[[29,63,8]]&sel=1370399938255.4023,1370451655771.7488&displayrange=7&datatype=running

This happened because MaxHeap is bimodal, with occasional outliers significantly below the general trend.  In this time range there were two outliers: one was 7 pushes after the regression, and the other 12 pushes after the regression.

Because that second outlier was just at the edge of the 12-push window (bug 858877), it was included in the stats for the actual regression changeset, but not for the preceding push.  This made the actual push look like a less likely candidate, and so the preceding push got blamed instead.

We could mostly avoid edge effects like these by using a smooth window function to weight the items in the backward and forward statistics for each datapoint.  Basically, give heavier weights to points near the regression, with a gradual falloff for more distant points.  I think this would have other beneficial effects too, though I'm not sure how the math would work out for performing the t-test or interpreting the score.
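Roughly, that means replacing the uniform per-point weights with a smooth falloff and computing weighted means and variances for the backward and forward windows.  A minimal sketch of the idea (this is not the analyze.py code; `gaussian_weight` and `weighted_stats` are made-up names, and the Gaussian falloff is just one possible choice of window):

    import math

    def gaussian_weight(distance, width):
        # Weight decreases smoothly with distance from the point being
        # analyzed, so a datapoint sliding past the edge of the window
        # changes the statistics a little at a time, not all at once.
        return math.exp(-(float(distance) / width) ** 2)

    def weighted_stats(values, width):
        # values[0] is the datapoint nearest the push being analyzed.
        weights = [gaussian_weight(i, width) for i in range(len(values))]
        total = sum(weights)
        mean = sum(w * v for w, v in zip(weights, values)) / total
        var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
        return mean, var, total

The backward and forward windows would each get stats like these, with the weighted mean and variance replacing the plain ones in the t-test; how the weighted sample size should feed into the degrees of freedom is exactly the open math question above.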
Whiteboard: [SfN] → [SfN][regression-detection]
I probably won't have time to work on this soon.  If anyone else is interested, I'm happy to give some pointers.  The code is here:

http://hg.mozilla.org/graphs/file/tip/server/analysis/analyze.py
Whiteboard: [SfN][regression-detection] → [SfN][regression-detection][mentor=mbrubeck][lang=python]
Detecting a change within a noisy sequence with absolute certainty is hard. Fine-tuning the algorithm to get better detection over the same data set could go on forever (and could indeed improve things statistically), but it can never be 100% bulletproof: a perfectly placed, within-normal-range noisy data point has a good chance of defeating any absolute detection.


Two suggestions:

1. If detection certainty doesn't go above some threshold while pointing at a specific changeset as the cause, then widen the range of changesets that the algorithm points at.

E.g. if the sequence is 10,12,10,11,10,12*,13*,12,13,12, then the cause is highly likely to be one of the '*' points, but picking just one of them has a good chance of being wrong. So we shouldn't pick one, since we really can't. Pointing at those two together would drastically increase the chance of reporting a correct range, even if it's bigger than we'd hope for. That's still better, IMO, than pointing at the wrong one (a rough sketch follows after these suggestions).

2. Gather more data: i.e. re-trigger the Talos (or whatever) tests around detection points with low certainty. This depends on how consistent the values would be across more runs on identical builds. If the values don't really change, then there's not much point here, but if another run or two would increase the detection certainty above our threshold, then it's a win.
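
For what it's worth, a rough sketch of suggestion 1, assuming we already have some per-changeset confidence score (the names and the threshold below are hypothetical, not anything that exists in analyze.py today):

    CONFIDENCE_THRESHOLD = 0.9  # hypothetical cut-off

    def blame(changesets, scores, threshold=CONFIDENCE_THRESHOLD):
        # scores[i] is the detection confidence that changesets[i] caused the change.
        best = max(range(len(scores)), key=lambda i: scores[i])
        if scores[best] >= threshold:
            # Confident enough to blame a single changeset.
            return [changesets[best]]
        # Not confident enough: widen the blame to the neighbouring candidates,
        # so the reported range very likely contains the real culprit.
        lo = max(0, best - 1)
        hi = min(len(changesets), best + 2)
        return changesets[lo:hi]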
(In reply to Avi Halachmi (:avih) from comment #2)
> Absolutely detecting a change within a noisy sequence is hard. Fine tuning
> the algorithm to get better detection over an identical data-set could go on
> forever (and could indeed improve statistically) but can never be 100%
> bullet proof

Well, of course this sort of statistical analysis will never have 100% certainty, but I think I have a good track record at this point of making specific improvements to the algorithm that result in noticeably higher precision and recall.  I do test all my changes against real-world data to make sure they result in net improvements.

This bug is about a flaw in the algorithm that prevents it from working in specific cases that occur in real-world data, sort of like the flaw I fixed in bug 858735.  The latest real-world case is:
https://groups.google.com/d/topic/mozilla.dev.tree-management/BwXpmvAQzng/discussion

The flaw is in the current window function, which is a rectangular step function.  This function is discontinuous in two places: The point being analyzed, and another place 12 points away.  This means that it "cares" about changes in both of these locations.  For example, when there are two real changes approximately 12 pushes apart, there is about a 50% chance that the algorithm will produce a false positive (emailing an innocent patch author) and/or a false negative (completely missing a real regression).

What we actually need is a window function that is discontinuous at the point being analyzed, and continuous elsewhere.  Using a window with continuous fall-off is a simple way of making the analysis more accurate and robust.

Your two suggestions in comment 2 are good, and we sort of do both of them manually using self-service tools and human inspection.  But they are orthogonal to fixing this flaw.  Fixing this bug will improve our results whether or not we implement your suggestions too.
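
To make the edge effect concrete, here is a toy calculation (not analyze.py code; the weight functions are the same kind of thing sketched in comment 0).  A single low outlier sitting exactly at the 12-point boundary makes the rectangular window report very different stats for two adjacent candidate pushes, while a window with continuous fall-off barely notices the difference:

    def rect_weight(i, width):
        # Current behaviour: full weight inside the window, nothing outside.
        return 1.0 if i < width else 0.0

    def linear_weight(i, width):
        # Continuous fall-off: weight shrinks to zero at the edge of the window.
        return max(0.0, 1.0 - float(i) / width)

    def weighted_mean(values, weight, width=12):
        ws = [weight(i, width) for i in range(len(values))]
        return sum(w * v for w, v in zip(ws, values)) / sum(ws)

    window_a = [10.0] * 11 + [2.0]  # push A: the outlier is the 12th point forward, just inside the window
    window_b = [10.0] * 12          # push B, one push earlier: the outlier is the 13th point, just outside

    for name, fn in (("rect", rect_weight), ("linear", linear_weight)):
        print(name, round(weighted_mean(window_a, fn), 2), round(weighted_mean(window_b, fn), 2))
    # rect   9.33 10.0  -> the single outlier swings the comparison between two adjacent pushes
    # linear 9.9  10.0  -> the outlier has only a small, gradual effect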
Here's a simple implementation of the weighted average idea.  It fixes the failures mentioned in comment 0 and comment 3, and adds data from the latter to the regression test suite.

The results for the existing test data are unchanged except for one possible regression in a11y.json which was previously barely above the detection threshold and is now barely under.  It looks like that change was in fact a transient fluctuation [1], so this may actually be an improvement.  I'm going to take a closer look to double-check before finishing this patch, though.

[1]: https://groups.google.com/d/msg/mozilla.dev.tree-management/57IAPzOoA2o/fCJwOb56UDkJ
Assignee: nobody → mbrubeck
Status: NEW → ASSIGNED
Attached patch: patch v2
I've seen several more cases of this bug; the latest is:
https://groups.google.com/d/topic/mozilla.dev.tree-management/SjmYKuKu20Y/discussion

So I think this is a big enough problem that it's worth fixing.  Here's a cleaned up version of the code from the previous patch.
Attachment #773649 - Attachment is obsolete: true
Attachment #781140 - Flags: review?(catlee)
Product: mozilla.org → Release Engineering
Comment on attachment 781140 [details] [diff] [review]
patch v2

Review of attachment 781140 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! Just a nit about parameter names.

::: server/analysis/analyze.py
@@ +8,5 @@
> +
> +    `window` is a function that takes a list index and a window width, and
> +    returns a weight that is used to calculate a weighted average.  For example,
> +    see `default_window` or `linear_window` below.  If no function is passed,
> +    `default_window` is used and the average will be uniformly weighted.

I think window as a parameter name is a bit confusing. Perhaps something like 'weight_func' would be better?
Attachment #781140 - Flags: review?(catlee) → review+
(In reply to Chris AtLee [:catlee] from comment #6)
> I think window as a parameter name is a bit confusing. Perhaps something
> like 'weight_func' would be better?

Changed the name to weight_fn:
http://hg.mozilla.org/graphs/rev/f273330aa8da

Thanks for the reviews!  Next I'll start work on a blog post about this and other analysis improvements.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Tools → General