Closed Bug 1119444 Opened 9 years ago Closed 8 years ago

write a tool to simulate alternative window sizes for calculating regressions

Categories

(Testing :: Talos, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Unassigned)

References

Details

Attachments

(3 files, 2 obsolete files)

In bug 1109243 we started down the path of making Talos turn orange on try server.  This is something new and experimental.

Doing this brings up the question: how many data points do we need before we can detect a regression?

Currently we generate alerts based on 12 previous data points and 12 future data points.  We know that a data point or two can sometimes randomly misrepresent the code being tested, so multiple data points are needed.  The question is how many?

In bug 1109243, the assertion is that with 3 data points we have high enough confidence to notify the user.  That assertion should have data to back it up.  We do know there is noise in our existing system of 12 future data points (it would be nice to quantify this; probably 10-20% noise).
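
To make the question concrete, the detection is essentially a comparison of the window of points before a push against the window after it.  A minimal sketch of that idea in Python (this is not the actual graph server code; the score formula and the threshold of 7.0 are made up for illustration):

from statistics import mean, stdev

def looks_like_regression(values, i, back=12, fwd=12, threshold=7.0):
    # Compare the `back` points before index i against the `fwd` points
    # starting at i; flag if the shift is large relative to the spread.
    before = values[max(0, i - back):i]
    after = values[i:i + fwd]
    if len(before) < back or len(after) < fwd:
        return False  # not enough data yet; this is also why alerts lag behind landings
    spread = stdev(before + after) or 1e-9
    score = abs(mean(after) - mean(before)) / (spread * (1.0 / back + 1.0 / fwd) ** 0.5)
    return score > threshold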

What accuracy do we get with 6 data points?  With 3?

Some issues to be aware of:
1) we might have missing data points; this means regressions could go undetected or get attributed to the wrong (later) data point.
2) we might over-alert; this creates more noise, with people investigating regressions that aren't real.

Graph server has an API from which we can get the data for the last year.  I would like to pull this data down, run the calculations on it to get a baseline, then adjust the window size and see how much damage the smaller windows cause.
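
A sketch of that fetch step, assuming a JSON endpoint along the lines of what graph server exposes (the URL, query parameter names and payload layout here are from memory and should be verified against the real API; the IDs are placeholders):

import json
from urllib.request import urlopen

GRAPH_API = "http://graphs.mozilla.org/api/test/runs"   # assumed endpoint

def fetch_series(testid, branchid, platformid):
    # query parameter names are assumptions; verify against the real API
    url = "{}?id={}&branchid={}&platformid={}".format(
        GRAPH_API, testid, branchid, platformid)
    data = json.loads(urlopen(url).read().decode("utf-8"))
    # assumed payload layout: each run is roughly [testrunid, [build info], timestamp, value, ...]
    return [(run[2], run[3]) for run in data["test_runs"]]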

We don't need 100% accuracy since the existing data is already error prone.  

The graph server analysis code is here:
http://hg.mozilla.org/graphs/file/b00f49d8a764/server/analysis

We could copy the files, take the code/logic as a starting point for a script, or just put a script in graph server that uses these files.
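
Whichever packaging we pick, the comparison itself could look something like this sketch, which reuses looks_like_regression() and fetch_series() from the earlier sketches (again, not the real analysis code) to replay the alert calculation with the 12/12 baseline window and with a smaller window, then diff the two sets of flagged points:

def alerts(values, back, fwd):
    # indexes of data points that the looks_like_regression sketch would flag
    return {i for i in range(len(values))
            if looks_like_regression(values, i, back=back, fwd=fwd)}

def compare_windows(values, small=(6, 6), baseline=(12, 12)):
    base = alerts(values, *baseline)
    test = alerts(values, *small)
    missed = base - test   # alerts the smaller window never fires
    extra = test - base    # alerts the smaller window adds (likely noise)
    return missed, extra

# Example (IDs are placeholders):
# values = [v for _, v in fetch_series(testid=16, branchid=131, platformid=12)]
# missed, extra = compare_windows(values)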

A few caveats:
* graph server suppresses duplicates via a cache, so there is a good chance that recalculating alerts offline will generate more than one alert for the same real regression in many cases.
* we should stick to a single branch (mozilla-inbound is best)
* ideally each test should have a list of dates/revisions and a link to graph server; this would make it easier to verify the list.

When this patch is done we might be able to:
* shorten the regression window to speed up alerts (this is critical on lower volume branches, and it will help patch authors who land on integration branches to get notified much faster)
* have higher confidence in what is required to validate a fix or report on a regression/fix when running on try server
Attached file compare_windows.py (obsolete)
Attachment #8549699 - Attachment is obsolete: true
Attached file compare_windows.py
Refactored the code. Usage is now: python compare_windows.py TESTNAME. Without a TESTNAME it will generate a table covering all the tests in about 1 hour.
Attachment #8549839 - Attachment is obsolete: true
Can be used to find the worst tests to run with compare_windows.py. Keep in mind that 98% is actually terrible precision here, because of the low incidence of real regressions.
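To illustrate with made-up numbers, reading the 98% figure as overall accuracy: if roughly 1 push in 100 carries a real regression, a detector that is right 98% of the time flags about 2 false positives (0.02 * 99) for every true regression (0.98 * 1), so only about a third of its alerts point at real regressions.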
Using this locally I have determined that 6 is the magic number of data points.  We have now changed our tooling to use Perfherder, and have applied what we learned here to the code added to Perfherder.

Closing this out!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED