When a regression is detected, blame the changeset with the highest t-test score



6 years ago
4 years ago


(Reporter: mbrubeck, Assigned: mbrubeck)


Firefox Tracking Flags

(Not tracked)


(Whiteboard: [regression-detection])


(2 attachments, 1 obsolete attachment)



6 years ago
Currently, analyze.py walks through a data series and flags any points where regressions might have occurred according to a t-test.  Then analyze_talos.py looks at each of those points, and sends an alert about the *first* point where each regression might have occurred.  It ignores later possible points for the same regression, even if the later point has a higher t-value:

Instead, the alert should either flag *all* of the points where the t-value is over the threshold, or it should choose the point with the *highest* t-value (instead of the first point with a "high enough" t-value).


6 years ago
Summary: When a regression is detected, blame find the changeset with the highest t-test score → When a regression is detected, blame the changeset with the highest t-test score

Comment 1

6 years ago
There's also a similar case where we blame a *later* changeset than we should, because the changeset with the highest score is in the warning_history.json file, but some later changeset has a score over the threshold.  Each time we process new data, we may warn on yet another changeset for the same regression, until we fall back below the threshold.

Comment 2

6 years ago
Created attachment 734197 [details] [diff] [review]
1/2: add tests

This refactors analyze_talos.py to make it possible to add some very basic tests.
Attachment #734197 - Flags: review?(catlee)

Comment 3

6 years ago
Created attachment 734198 [details] [diff] [review]
2/2: find local maxima in the t-tests

Don't blame a changeset if there's better candidate right next to it.  Basically this finds the "high point" within any region where a regression is expected.

Now we no longer need to set "skip = True" for consecutive 'regression' changesets in analyze_talos.py, since the maximum-t-value logic in analyze.py has already reduced each of these ranges to a single changeset.
Attachment #734198 - Flags: review?(catlee)
Attachment #734197 - Flags: review?(catlee) → review+

Comment 4

6 years ago
Created attachment 736590 [details] [diff] [review]
2/2: find local maxima in the t-tests (v2)

From IRC:

  <catlee> I'm not sure how attachment 734198 [details] [diff] [review] will behave as it processes new data. wouldn't we possibly get multiple regression notices?

  <mbrubeck> We use warning_history.json to prevent re-sending the exact same warning email -- that will still work the same after this patch.  In fact, it will work better, since currently we might flag several changesets in a row for the same regression.  On the first run, we'll send an email for the first changeset.  On the next run, we skip that but send an email for the second changset... and so on.  With the patch, we will choose just one of those changesets to flag, send an email for it, and put it in warning_history.json

  <catlee> right, but won't the worst t-value change as you get more data?

I thought about this some more.  For the case where new data arrives out of order (i.e. retriggers add new datapoints for an old changeset), the new data can definitely change the score of existing changesets.  This is expected and desired, and we don't need to prevent this, since it may help us identify previously-missed regressions.

For the case where new data arrives *in order* (i.e. only for new changesets), it can change a previous score only if it's within the fore_window of the previous changeset.  But we don't perform t-tests for a changeset until it already has a full fore_window datapoints following it, so new data added after those can't change its t-score.

However, with the old version of this patch, new data could *could* change the final state of the very last changeset in the good_data array.  We can't yet compare its t-score to its "next neighbor" because the next neighbor hasn't been tested yet.  Instead, we should ignore it for this run.  When new data is added and the next neighbor is tested, then we'll be able to make a decision on it that won't change.

I modified the patch to set di.state only when di *and both its neighbors* has had its t-test (which means that di and its next neighbor already have at least k datapoints after them).
Attachment #734198 - Attachment is obsolete: true
Attachment #734198 - Flags: review?(catlee)
Attachment #736590 - Flags: review?(catlee)
Comment on attachment 736590 [details] [diff] [review]
2/2: find local maxima in the t-tests (v2)

Review of attachment 736590 [details] [diff] [review]:

Great, thanks!
Attachment #736590 - Flags: review?(catlee) → review+

Comment 6

6 years ago

This has less automated test coverage than my other patches and I couldn't fully test it locally, so please do what you can to verify this before or after deploying it.  Let me know if I can help!
Last Resolved: 6 years ago
Resolution: --- → FIXED
How confident are you in the results? I can deploy now and we can keep an eye on things.

Or, I can set up a parallel instance with the new code and have it mail just you the results.

Comment 8

6 years ago
I'm confident in the results; I'm more concerned that I broke some of the unrelated code (like the code to send emails or read the config file) without knowing it.  So I think deploying now and keeping an eye on it seems reasonable.
Ok, I've just deployed the new code.

Comment 10

6 years ago
It's getting to be suspiciously long since the last regression email (though there's always a chance we just haven't regressed anything since this morning).  Is there any way to check logs to make sure analyze_talos.py is running successfully?  Or can it be run manually on the graph server and its output checked?
I was just looking at the logs when it sent mail
 (Improvement) Mozilla-Inbound - Tp5 Optimized (Main RSS) - Ubuntu HW 12.04 x64 - 2.66%

Can't see any errors there.
Product: mozilla.org → Release Engineering
Commits pushed to master at https://github.com/mozilla/treeherder

Bug 858756 (1/2): Add basic tests for analyze_talos.py [r=catlee]

Bug 858756 (2/2) - Blame the changeset whose t score is a local maximum [r=catlee]

When several changesets in a row are potential causes of a regression,
blame the one with the highest t-test score rather than the first one.


4 years ago
QA Contact: mshal
You need to log in before you can comment on or make changes to this bug.