Bug 858877 Opened 11 years ago Closed 11 years ago

Tune the values of fore_window, back_window, and threshold in analysis.cfg

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mbrubeck, Assigned: mbrubeck)

References

Details

(Whiteboard: [regression-detection])

Attachments

(3 files, 2 obsolete)

Attached file test script (obsolete) —
One cause of false positives in analyze_talos.py seems to be the specific window sizes that are used.  The current defaults are:

back_window = 30
fore_window = 5
threshold = 9

Having back_window much larger than fore_window creates a lot of false positives.  Conceptually this makes sense, since the window can slide several points past the actual culprit and still appear to have a good fit for a regression.  A few "bad" points in the back_window section are easily overwhelmed by the large number of "good" points.

In the example datasets I've played with, a window of (30, 5) almost always produces multiple false positives, while a window of (10, 10) or (12, 12) produces a single true positive and no false positives.
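
To make the window mechanics concrete, here is a minimal sketch of this kind of two-window detector, assuming a Welch-style t statistic over the two windows; the statistic and alert bookkeeping in the real analyze.py differ in detail:

import math

def detect_regressions(values, back_window=12, fore_window=12, threshold=9):
    # Slide a (back, fore) window pair over the series; flag index i when
    # the fore-window mean over [i, i + fore) differs from the back-window
    # mean over [i - back, i) by more than `threshold` standard errors.
    alerts = []
    for i in range(back_window, len(values) - fore_window + 1):
        back = values[i - back_window:i]
        fore = values[i:i + fore_window]
        back_mean = sum(back) / float(len(back))
        fore_mean = sum(fore) / float(len(fore))
        back_var = sum((v - back_mean) ** 2 for v in back) / (len(back) - 1)
        fore_var = sum((v - fore_mean) ** 2 for v in fore) / (len(fore) - 1)
        sem = math.sqrt(back_var / len(back) + fore_var / len(fore))
        t = abs(fore_mean - back_mean) / sem if sem else float("inf")
        if t > threshold:
            alerts.append((i, t))  # first fore-window index and its score
    return alerts

With a long back window and a short fore window, a few post-regression points can sit in the back window as the pair slides past the true change point and still leave t above the threshold, which is exactly the false-positive mode described above.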

For reference, here's a script (attached) that I'm using to experiment with different window sizes.  Save it alongside analyze.py from http://hg.mozilla.org/graphs/ and use it as follows:

wget -O runs.json 'http://graphs.mozilla.org/api/test/runs?id=232&branchid=63&platformid=14'

python test.py 30 5 runs.json
Attached patch patch — Splinter Review
I downloaded a handful of example datasets from recent real-life regressions in several different Talos benchmarks, and analyzed them with a range of settings.

With the patches from bug 858735 and bug 858756 applied, I get perfect results on my example datasets (zero false positives or false negatives) using any window sizes between (10, 10) and (17, 17) inclusive.  I think that a good default setting is about (12, 12) since a smaller fore_window will let us detect regressions slightly earlier.
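
For illustration, the tuned defaults in analysis.cfg would then read as below (threshold is shown at its old value, since only the window sizes are discussed here; see the attached patch for the actual template):

back_window = 12
fore_window = 12
threshold = 9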

I'm working on testing this with more sample data, and I'll let you know if I find any more interesting results.  I'd also like to add some automated regression tests that use real-world data.
Attachment #734183 - Attachment is obsolete: true
Attachment #734199 - Flags: review?(catlee)
Attached patch real-world test data (obsolete) — Splinter Review
Here are some regression tests based on some of the datasets I've been using.  All four of these datasets had incorrect changesets blamed without these patches, and now have the correct blame as far as I can tell.
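
The attached tests are not inlined in this comment, but in spirit each one looks something like the sketch below; the dataset file name, the flattened-list format, and the detect_regressions helper (from the sketch in comment 0) are all assumptions for illustration:

import json
import unittest

# hypothetical import: detect_regressions is the two-window sketch above
from detect_sketch import detect_regressions

class TestRealWorldData(unittest.TestCase):
    def test_single_true_positive(self):
        # values.json stands in for one of the four attached datasets,
        # pre-flattened to a plain list of per-run values
        with open("values.json") as f:
            values = json.load(f)
        alerts = detect_regressions(values, back_window=12, fore_window=12)
        # exactly one regression in this dataset, and no false positives
        self.assertEqual(len(alerts), 1)

if __name__ == "__main__":
    unittest.main()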
Attachment #734203 - Flags: review?(catlee)
Update the test config and add some named constants for clarity.
Attachment #734203 - Attachment is obsolete: true
Attachment #734203 - Flags: review?(catlee)
Attachment #734697 - Flags: review?(catlee)
The reason for the larger back_window and smaller fore_window was to be able to have a reasonable representation of the "correct" mean/variance while at the same time being able to detect regressions quickly. Increasing the fore_window in particular will make us slower to detect regressions. Is this ok?
(In reply to Chris AtLee [:catlee] from comment #4)
> The reason for the larger back_window and smaller fore_window was to be able
> to have a reasonable representation of the "correct" mean/variance while at
> the same time being able to detect regressions quickly.

Unfortunately, a fore_window of 5 just seems too short to avoid false alarms from random noise, regardless of the back_window size.  In my test datasets, any fore_window setting below 10 produced frequent incorrect results.  So I think this trade-off is unfortunate but necessary.

> Increasing the fore_window in particular will make us slower to detect
> regressions. Is this ok?

Yes, I think the improved accuracy is well worth the added delay for consumers of these alerts.  (An alert can only fire once fore_window data points exist after the regression, so the delay is roughly fore_window pushes.)  Also, we are now pushing changes more than twice as fast as when this code was written, so the actual latency might not be any worse than it was then.  :)
Comment on attachment 734199 [details] [diff] [review]
patch

Review of attachment 734199 [details] [diff] [review]:
-----------------------------------------------------------------

alright, let's give it a shot!
Attachment #734199 - Flags: review?(catlee) → review+
Comment on attachment 734199 [details] [diff] [review]
patch

http://hg.mozilla.org/graphs/rev/c9dc630d0749

This changes only the example analysis.cfg.template.  To deploy this, someone will need to edit the production analysis.cfg file (which is not part of the Mercurial repo).
Attachment #734199 - Flags: checked-in+
deployed
Attachment #734697 - Flags: review?(catlee) → review+
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Attached file test script (v2)
This is an updated version of the test script from comment 0.
Product: mozilla.org → Release Engineering
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/640cefe17a69af8dcf7970b73d0de7f3d143209b
Bug 858877 - Tune fore_window and back_window config in the analysis.cfg template [r=catlee]

https://github.com/mozilla/treeherder/commit/ae7d940e69b9deccfa4e1b6f003932d096174eac
Bug 858877 (2/2) - Add regression tests using real-world Talos data [r=catlee]