analyze.py blames the wrong regression in Talos tests with very low variance

RESOLVED FIXED

Status

RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: mbrubeck, Assigned: mbrubeck)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [regression-detection])

Attachments

(4 attachments)

(Assignee)

Description

6 years ago
Created attachment 734018 [details] [diff] [review]
1/4: basic tests

There are some specific bugs in analyze.py that make it return incorrect results when dealing with non-noisy tests, like "Number of Constructors" (see bug 721387 comment 0).  I wrote some patches to improve this.

This first patch just adds some very basic tests so I won't change anything unintentionally.
Attachment #734018 - Flags: review?(catlee)
(Assignee)

Comment 1

6 years ago
Created attachment 734020 [details] [diff] [review]
2/4: minor cleanup

Just removing some unused code, and simplifying the analyze() function.  This patch is completely optional and should not change any results.  It's mostly just to help me understand the code.
Attachment #734020 - Flags: review?(catlee)
(Assignee)

Comment 2

6 years ago
Created attachment 734022 [details] [diff] [review]
3/4: don't ignore the first j datapoints

analyze_t should detect a potential regression at timestamp 6, but it doesn't because it totally ignores the first 5 datapoints and so it can't perform t-tests for points 6-10.

We ought to process from the very beginning of the results.  We still can't be perform t-tests for the first j points, but it can add them to the good_data array so that the next j points can be tested correctly.
Attachment #734022 - Flags: review?(catlee)
(Assignee)

Comment 3

6 years ago
Created attachment 734025 [details] [diff] [review]
4/4: fix t-test when both series have variance == 0

Here's the main bug:  The actual regression in my test data happens at timestamp 8, but analyze_t return "good" for that timestamp and blames the surrounding datapoints instead!

This is because calc_t bails out to avoid a divide-by-zero in this case.  Instead of returning 0 when bailing out, it should return infinity, because this should give us the maximum possible confidence of a regression.
Attachment #734025 - Flags: review?(catlee)
Attachment #734018 - Flags: review?(catlee) → review+
Attachment #734020 - Flags: review?(catlee) → review+
Attachment #734022 - Flags: review?(catlee) → review+
Comment on attachment 734025 [details] [diff] [review]
4/4: fix t-test when both series have variance == 0

Review of attachment 734025 [details] [diff] [review]:
-----------------------------------------------------------------

this looks fine, but how does the rest of the code handle inf? for example, does it make sense to append di to good_data at the end of analyze_t in the case of a regression?
(Assignee)

Comment 5

6 years ago
(In reply to Chris AtLee [:catlee] from comment #4)
> this looks fine, but how does the rest of the code handle inf? for example,
> does it make sense to append di to good_data at the end of analyze_t in the
> case of a regression?

These "t" values are never used in except in comparisons.  (The current code compares the t score to a threshold, and bug 858756 adds code that compares t scores to other t scores.)  So "inf" should be as safe as any value.

One quirk, which I think is harmless: If two infinite values end up next to each other (I don't think this can even happen in practice), the code from bug 858756 will find that neither of them is greater than the other (because "float('inf') > float('inf')" is false), so it will flag both of them as regressions rather than choosing one as the "highest" regression.

It still makes sense to append di to good_data (as the current code does), since di.value should still form part of the new "baseline" average.
Attachment #734025 - Flags: review?(catlee) → review+
(Assignee)

Comment 6

6 years ago
Thanks for the reviews!  Since these patches are all in easily-tested code, I'm confident enough to just push them.  In some other bugs where I make changes to analyze_talos.py, I'd like some advice on testing/staging/deployment.

https://hg.mozilla.org/graphs/rev/d4af2bc04ce7
https://hg.mozilla.org/graphs/rev/f8ce18a0db88
https://hg.mozilla.org/graphs/rev/baad446523cf
https://hg.mozilla.org/graphs/rev/6caa2530736d
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
This has been deployed now.
Product: mozilla.org → Release Engineering

Comment 8

4 years ago
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/8a091b499bb164ae271b054f1bef07ae151eb32e
Bug 858735 (1/4) - Add some tests for analyze.py [r=catlee]

https://github.com/mozilla/treeherder/commit/71046ff403b43dad9a58f934789e1da38537c51a
Bug 858735 (2/4) - Some minor code cleanup in analyze.py [r=catlee]

https://github.com/mozilla/treeherder/commit/7993d122dcd62a616dff5e2c4e76bc43dff585ae
Bug 858735 (3/4) - Don't skip the first j results in analyze_t [r=catlee]

https://github.com/mozilla/treeherder/commit/ceb5cedbe27d19f42cb2749dabbf9e0f8e11ec99
Bug 858735 (4/4) - Correctly process t-tests for series with zero variance [r=catlee]
You need to log in before you can comment on or make changes to this bug.