Closed Bug 1119444 Opened 9 years ago Closed 8 years ago

write a tool to simulate alternative window sizes for calculating regressions

Categories

(Testing :: Talos, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Unassigned)

References

Details

Attachments

(3 files, 2 obsolete files)

In bug 1109243 we started down the path of making Talos turn orange on try server.  This is something new and experimental.

Doing this brings up the question: how many data points do we need before we can detect a regression?

Currently we generate alerts based on 12 previous data points and 12 future data points.  We know that a data point or two can sometimes randomly misrepresent the code being tested, so multiple data points are needed.  The question is how many?

In bug 1109243, the assertion is that with 3 data points we have high enough confidence to notify the user.  That assertion should have data to back it up.  We do know there is noise in our existing system of 12 future data points (it would be nice to quantify this; probably 10-20% noise).
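
To make the question concrete, the detection is essentially a comparison of the window of points before a push against the window after it.  A minimal sketch of that idea in Python (this is not the actual graph server code; the score formula and the threshold of 7.0 are made up for illustration):

from statistics import mean, stdev

def looks_like_regression(values, i, back=12, fwd=12, threshold=7.0):
    # Compare the `back` points before index i against the `fwd` points
    # starting at i; flag if the shift is large relative to the spread.
    before = values[max(0, i - back):i]
    after = values[i:i + fwd]
    if len(before) < back or len(after) < fwd:
        return False  # not enough data yet; this is also why alerts lag behind landings
    spread = stdev(before + after) or 1e-9
    score = abs(mean(after) - mean(before)) / (spread * (1.0 / back + 1.0 / fwd) ** 0.5)
    return score > threshold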

What accuracy do we get with 6 data points?  With 3?

Some issues to be aware of:
1) we might have missing data points; this means regressions could go undetected or get attributed to the wrong (later) data point.
2) we might over-alert; this creates more noise, with people investigating regressions that aren't real.

Graph server has an API from which we can get the data for the last year.  I would like to pull this data down, run the calculations on it to get a baseline, then adjust the window size and see how much damage the smaller windows cause.
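
A sketch of that fetch step, assuming a JSON endpoint along the lines of what graph server exposes (the URL, query parameter names and payload layout here are from memory and should be verified against the real API; the IDs are placeholders):

import json
from urllib.request import urlopen

GRAPH_API = "http://graphs.mozilla.org/api/test/runs"   # assumed endpoint

def fetch_series(testid, branchid, platformid):
    # query parameter names are assumptions; verify against the real API
    url = "{}?id={}&branchid={}&platformid={}".format(
        GRAPH_API, testid, branchid, platformid)
    data = json.loads(urlopen(url).read().decode("utf-8"))
    # assumed payload layout: each run is roughly [testrunid, [build info], timestamp, value, ...]
    return [(run[2], run[3]) for run in data["test_runs"]]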

We don't need 100% accuracy since the existing data is already error prone.  

The graph server analysis code is here:
http://hg.mozilla.org/graphs/file/b00f49d8a764/server/analysis

We could copy the files, take the code/logic as a starting point for a script, or just put a script in graph server that uses these files.
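
Whichever packaging we pick, the comparison itself could look something like this sketch, which reuses looks_like_regression() and fetch_series() from the earlier sketches (again, not the real analysis code) to replay the alert calculation with the 12/12 baseline window and with a smaller window, then diff the two sets of flagged points:

def alerts(values, back, fwd):
    # indexes of data points that the looks_like_regression sketch would flag
    return {i for i in range(len(values))
            if looks_like_regression(values, i, back=back, fwd=fwd)}

def compare_windows(values, small=(6, 6), baseline=(12, 12)):
    base = alerts(values, *baseline)
    test = alerts(values, *small)
    missed = base - test   # alerts the smaller window never fires
    extra = test - base    # alerts the smaller window adds (likely noise)
    return missed, extra

# Example (IDs are placeholders):
# values = [v for _, v in fetch_series(testid=16, branchid=131, platformid=12)]
# missed, extra = compare_windows(values)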

A few caveats:
* graph server suppresses duplicates via a cache, so there is a good chance that recalculating alerts offline will generate more than one alert for the same real regression in many cases.
* we should stick to a single branch (mozilla-inbound is best)
* ideally each test should have a list of dates/revisions and a link to graph server; this would make it easier to verify the list.

When this patch is done we might be able to:
* shorten the regression window to speed up alerts (this is critical on lower volume branches, and it will help patch authors who land on integration branches to get notified much faster)
* have higher confidence in what is required to validate a fix or report on a regression/fix when running on try server
Attached file compare_windows.py (obsolete)
Attachment #8549699 - Attachment is obsolete: true
Attached file compare_windows.py
Refactored the code. Usage is now: python compare_windows.py TESTNAME. Without a TESTNAME it will generate a table covering all the tests in about 1 hour.
Attachment #8549839 - Attachment is obsolete: true
Can be used to find the worst tests to run with compare_windows.py. Keep in mind that 98% is actually terrible precision here, because of the low incidence of real regressions.
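To illustrate with made-up numbers, reading the 98% figure as overall accuracy: if roughly 1 push in 100 carries a real regression, a detector that is right 98% of the time flags about 2 false positives (0.02 * 99) for every true regression (0.98 * 1), so only about a third of its alerts point at real regressions.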
Using this locally I have determined that 6 is the magic number of data points.  We have now changed our tooling to use Perfherder, and have applied what we learned here to the code added to Perfherder.

Closing this out!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED