Closed Bug 879903 Opened 11 years ago Closed 11 years ago

Edge effects cause spooky action at a distance, messing up Talos regression blame

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mbrubeck, Assigned: mbrubeck)

References

Details

(Whiteboard: [SfN][regression-detection][mentor=mbrubeck][lang=python])

Attachments

(1 file, 1 obsolete file)

In this dataset, it's clear that 627af04199a5 caused a regression.  But the regression finder blamed the previous push, 8dc5770ad79d:

http://graphs.mozilla.org/graph.html#tests=[[29,63,8]]&sel=1370399938255.4023,1370451655771.7488&displayrange=7&datatype=running

This happened because MaxHeap is bimodal, with occasional outliers significantly below the general trend.  In this time range there were two outliers: one was 7 pushes after the regression, and the other 12 pushes after the regression.

Because that second outlier was just at the edge of the 12-push window (bug 858877), it was included in the stats for the actual regression changeset, but not for the preceding push.  This made the actual push look like a less likely candidate, and so the preceding push got blamed instead.

We could mostly avoid edge effects like these by using a smooth window function to weight the items in the backward and forward statistics for each datapoint.  Basically, give heavier weights to points near the regression, with a gradual falloff for more distant points.  I think this would have other beneficial effects too, though I'm not sure how the math would work out for performing the t-test or interpreting the score.
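Roughly, that means replacing the uniform per-point weights with a smooth falloff and computing weighted means and variances for the backward and forward windows.  A minimal sketch of the idea (this is not the analyze.py code; `gaussian_weight` and `weighted_stats` are made-up names, and the Gaussian falloff is just one possible choice of window):

    import math

    def gaussian_weight(distance, width):
        # Weight decreases smoothly with distance from the point being
        # analyzed, so a datapoint sliding past the edge of the window
        # changes the statistics a little at a time, not all at once.
        return math.exp(-(float(distance) / width) ** 2)

    def weighted_stats(values, width):
        # values[0] is the datapoint nearest the push being analyzed.
        weights = [gaussian_weight(i, width) for i in range(len(values))]
        total = sum(weights)
        mean = sum(w * v for w, v in zip(weights, values)) / total
        var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
        return mean, var, total

The backward and forward windows would each get stats like these, with the weighted mean and variance replacing the plain ones in the t-test; how the weighted sample size should feed into the degrees of freedom is exactly the open math question above.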
Whiteboard: [SfN] → [SfN][regression-detection]
I probably won't have time to work on this soon.  If anyone else is interested, I'm happy to give some pointers.  The code is here:

http://hg.mozilla.org/graphs/file/tip/server/analysis/analyze.py
Whiteboard: [SfN][regression-detection] → [SfN][regression-detection][mentor=mbrubeck][lang=python]
Detecting a change within a noisy sequence with absolute certainty is hard. Fine-tuning the algorithm to get better detection over the same data set could go on forever (and could indeed improve things statistically), but it can never be 100% bulletproof: a perfectly placed, within-normal-range noisy data point has a good chance of defeating any absolute detection.


Two suggestions:

1. If detection certainty doesn't go above some threshold while pointing at a specific changeset as the cause, then widen the range of changesets that the algorithm points at.

E.g. if the sequence is 10,12,10,11,10,12*,13*,12,13,12, then the cause is highly likely to be one of the '*' points, but picking just one of them has a good chance of being wrong. So we shouldn't pick one, since we really can't. Pointing at those two together would drastically increase the chance of reporting a correct range, even if it's bigger than we'd hope for. That's still better, IMO, than pointing at the wrong one (a rough sketch follows after these suggestions).

2. Gather more data: i.e. re-trigger the Talos (or whatever) tests around detection points with low certainty. This depends on how consistent the values would be across more runs on identical builds. If the values don't really change, then there's not much point here, but if another run or two would increase the detection certainty above our threshold, then it's a win.
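
For what it's worth, a rough sketch of suggestion 1, assuming we already have some per-changeset confidence score (the names and the threshold below are hypothetical, not anything that exists in analyze.py today):

    CONFIDENCE_THRESHOLD = 0.9  # hypothetical cut-off

    def blame(changesets, scores, threshold=CONFIDENCE_THRESHOLD):
        # scores[i] is the detection confidence that changesets[i] caused the change.
        best = max(range(len(scores)), key=lambda i: scores[i])
        if scores[best] >= threshold:
            # Confident enough to blame a single changeset.
            return [changesets[best]]
        # Not confident enough: widen the blame to the neighbouring candidates,
        # so the reported range very likely contains the real culprit.
        lo = max(0, best - 1)
        hi = min(len(changesets), best + 2)
        return changesets[lo:hi]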
(In reply to Avi Halachmi (:avih) from comment #2)
> Absolutely detecting a change within a noisy sequence is hard. Fine tuning
> the algorithm to get better detection over an identical data-set could go on
> forever (and could indeed improve statistically) but can never be 100%
> bullet proof

Well, of course this sort of statistical analysis will never have 100% certainty, but I think I have a good track record at this point of making specific improvements to the algorithm that result in noticeably higher precision and recall.  I do test all my changes against real-world data to make sure they result in net improvements.

This bug is about a flaw in the algorithm that prevents it from working in specific cases that occur in real-world data, sort of like the flaw I fixed in bug 858735.  The latest real-world case is:
https://groups.google.com/d/topic/mozilla.dev.tree-management/BwXpmvAQzng/discussion

The flaw is in the current window function, which is a rectangular step function.  This function is discontinuous in two places: The point being analyzed, and another place 12 points away.  This means that it "cares" about changes in both of these locations.  For example, when there are two real changes approximately 12 pushes apart, there is about a 50% chance that the algorithm will produce a false positive (emailing an innocent patch author) and/or a false negative (completely missing a real regression).

What we actually need is a window function that is discontinuous at the point being analyzed, and continuous elsewhere.  Using a window with continuous fall-off is a simple way of making the analysis more accurate and robust.

Your two suggestions in comment 2 are good, and we sort of do both of them manually using self-service tools and human inspection.  But they are orthogonal to fixing this flaw.  Fixing this bug will improve our results whether or not we implement your suggestions too.
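
To make the edge effect concrete, here is a toy calculation (not analyze.py code; the weight functions are the same kind of thing sketched in comment 0).  A single low outlier sitting exactly at the 12-point boundary makes the rectangular window report very different stats for two adjacent candidate pushes, while a window with continuous fall-off barely notices the difference:

    def rect_weight(i, width):
        # Current behaviour: full weight inside the window, nothing outside.
        return 1.0 if i < width else 0.0

    def linear_weight(i, width):
        # Continuous fall-off: weight shrinks to zero at the edge of the window.
        return max(0.0, 1.0 - float(i) / width)

    def weighted_mean(values, weight, width=12):
        ws = [weight(i, width) for i in range(len(values))]
        return sum(w * v for w, v in zip(ws, values)) / sum(ws)

    window_a = [10.0] * 11 + [2.0]  # push A: the outlier is the 12th point forward, just inside the window
    window_b = [10.0] * 12          # push B, one push earlier: the outlier is the 13th point, just outside

    for name, fn in (("rect", rect_weight), ("linear", linear_weight)):
        print(name, round(weighted_mean(window_a, fn), 2), round(weighted_mean(window_b, fn), 2))
    # rect   9.33 10.0  -> the single outlier swings the comparison between two adjacent pushes
    # linear 9.9  10.0  -> the outlier has only a small, gradual effect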
Here's a simple implementation of the weighted average idea.  It fixes the failures mentioned in comment 0 and comment 3, and adds data from the latter to the regression test suite.

The results for the existing test data are unchanged except for one possible regression in a11y.json which was previously barely above the detection threshold and is now barely under.  It looks like that change was in fact a transient fluctuation [1], so this may actually be an improvement.  I'm going to take a closer look to double-check before finishing this patch, though.

[1]: https://groups.google.com/d/msg/mozilla.dev.tree-management/57IAPzOoA2o/fCJwOb56UDkJ
Assignee: nobody → mbrubeck
Status: NEW → ASSIGNED
Attached patch: patch v2
I've seen several more cases of this bug; the latest is:
https://groups.google.com/d/topic/mozilla.dev.tree-management/SjmYKuKu20Y/discussion

So I think this is a big enough problem that it's worth fixing.  Here's a cleaned up version of the code from the previous patch.
Attachment #773649 - Attachment is obsolete: true
Attachment #781140 - Flags: review?(catlee)
Product: mozilla.org → Release Engineering
Comment on attachment 781140 [details] [diff] [review]
patch v2

Review of attachment 781140 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! Just a nit about parameter names.

::: server/analysis/analyze.py
@@ +8,5 @@
> +
> +    `window` is a function that takes a list index and a window width, and
> +    returns a weight that is used to calculate a weighted average.  For example,
> +    see `default_window` or `linear_window` below.  If no function is passed,
> +    `default_window` is used and the average will be uniformly weighted.

I think window as a parameter name is a bit confusing. Perhaps something like 'weight_func' would be better?
Attachment #781140 - Flags: review?(catlee) → review+
(In reply to Chris AtLee [:catlee] from comment #6)
> I think window as a parameter name is a bit confusing. Perhaps something
> like 'weight_func' would be better?

Changed the name to weight_fn:
http://hg.mozilla.org/graphs/rev/f273330aa8da

Thanks for the reviews!  Next I'll start work on a blog post about this and other analysis improvements.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Tools → General