create a script which tells us if we have confidence that the regression we see includes all related regressions/improvements

RESOLVED WONTFIX

Status

Testing
Talos
RESOLVED WONTFIX
3 years ago
2 months ago

People

(Reporter: jmaher, Unassigned, Mentored)

Tracking

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

3 years ago
This is an odd request, but an important one.  When we investigate a performance regression, it is important to document all the tests and platforms that regressed or improved.

Ideally this script would not be part of graph server (although if it is, maybe we can make an api that returns the data we need), but would query graph server.

I imagine this:
python get_alert_confidence.py --test tp5o --branch mozilla-inbound --revision 31415926535 --platform linux32

This would query all the tp5 results for mozilla-inbound-non-pgo on all platforms, then generate a date range covering +-12 revisions around the input revision for the target platform.  This would be repeated for all the other platforms, and finally we would determine whether the date ranges are similar.

example:
linux32: 2015-01-15 12:00 -> 2015-01-15 20:00 <- baseline
linux64: 2015-01-15 12:00 -> 2015-01-15 22:00 <- +2 hours, less confidence
winxp:   2015-01-15 11:30 -> 2015-01-15 21:00 <- -.5,+1 hour, similar confidence
osx10.8: 2015-01-15 10:15 -> 2015-01-15 20:30 <- -1.75,+1.5, not confident
osx10.6: 2015-01-15 10:15 -> 2015-01-15 23:30 <- -1.75,+3.5, not confident
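
A minimal sketch of how the date ranges above might be derived and compared. It assumes each platform's results are already available as a time-ordered list of (revision, datetime) pairs and that the input revision has a result on that platform; the function names and the data shape are hypothetical, and nothing here talks to graph server.

```python
from datetime import datetime


def revision_window(results, revision, span=12):
    """Return (start, end) timestamps covering +-span results around revision.

    results is an assumed structure: a list of (revision, datetime) pairs,
    sorted by time. The window is clamped at either end of the list.
    """
    revisions = [rev for rev, _ in results]
    idx = revisions.index(revision)  # raises ValueError if the revision is missing
    start = results[max(0, idx - span)][1]
    end = results[min(len(results) - 1, idx + span)][1]
    return start, end


def deviation_minutes(baseline, other):
    """Return (start, end) offsets of other from baseline, in whole minutes."""
    return tuple(
        int((o - b).total_seconds() // 60) for b, o in zip(baseline, other)
    )


# Ranges copied from the linux32 and winxp lines of the example.
linux32 = (datetime(2015, 1, 15, 12, 0), datetime(2015, 1, 15, 20, 0))
winxp = (datetime(2015, 1, 15, 11, 30), datetime(2015, 1, 15, 21, 0))
print(deviation_minutes(linux32, winxp))  # (-30, 60)
```

How to pick a window on a platform that never ran the input revision (for example because of coalescing) is left open here, as it is in the proposal.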

the idea here is that any deviation from start/end by 1 hour or more reduces the confidence, probably in 30 minute increments.  The output in addition to the above could be:
linux32: 1.0
linux64: 0.6 (.4 off for the end)
winxp:   0.7 (.1 off for the start, .2 off for the end)
osx10.8: 0.4 (.3 off for the start, .3 off for the end)
osx10.6: 0.0 (.3 off for the start, .7 off for the end)

summary: 3.5 / 5.0 (accept a score of 0.8 as good, so linux64 reduces it by .2, winxp by .1, 10.8 by .4, 10.6 by .8, total of 1.5 points deducted).

total confidence would be:
5.0  = full confidence
4.5+ = enough confidence - should verify each platform
<4.0 = need to backfill data.
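
The scoring above can be sketched as follows. This is only an illustration under assumptions: the rate of 0.1 per full 30 minutes of deviation (applied separately to start and end) is inferred from the example scores rather than stated anywhere, the deviations are the hour offsets annotated in the example converted to minutes, and the function names are hypothetical. Scores are computed in tenths to avoid floating point drift.

```python
STEP_MINUTES = 30       # assumed size of one penalty increment
GOOD_SCORE_TENTHS = 8   # "accept a score of 0.8 as good"


def platform_score(start_dev_min, end_dev_min):
    """Score one platform: 1.0 minus 0.1 per full 30 minutes off, floored at 0."""
    steps = abs(start_dev_min) // STEP_MINUTES + abs(end_dev_min) // STEP_MINUTES
    return max(0, 10 - steps) / 10


def summary_score(scores):
    """Start from 1.0 per platform and deduct each shortfall below 0.8."""
    scores = list(scores)
    deducted = sum(max(0, GOOD_SCORE_TENTHS - round(s * 10)) for s in scores)
    return (len(scores) * 10 - deducted) / 10


# (start, end) deviation from the linux32 baseline, in minutes,
# taken from the annotations in the example.
deviations = {
    "linux32": (0, 0),
    "linux64": (0, 120),
    "winxp": (-30, 60),
    "osx10.8": (-105, 90),
    "osx10.6": (-105, 210),
}

scores = {name: platform_score(*dev) for name, dev in deviations.items()}
for name, score in scores.items():
    print(f"{name}: {score}")
print(f"summary: {summary_score(scores.values())} / {float(len(scores))}")
```

With these inputs the sketch reproduces the per-platform scores and the 3.5 / 5.0 summary listed above.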
(Reporter)

Updated

3 years ago
Blocks: 1088251
No longer blocks: 1093942

Comment 1

3 years ago
Adding a bit of background for this task:

When filing a regression bug, we usually state the one or few per-platform regression values which we have (e.g. tscrollx regression: 5% on linux 32 and 4% on linux 64).

However, this information is often incomplete: some platforms/branches are built much less frequently than others, so we don't always have enough data points (test runs) to declare a detected change on all platforms.

With the tscrollx example above, it could, for instance, be possible that some time later we also discover other regressions or other improvements on windows XP.

So in order for the developers to have better context for the initial bug report (and possibly later updated reports once more results come in), which will allow a more informed decision on what action should be taken, it would be very valuable to be able to express the scope at which the report was generated.

Which is where this bug comes in.

I think the ultimate output of this process (without any implementation considerations), is that whenever we report a regression status (be it at a new bug, or elsewhere), it's accompanied, or possibly replaced, with a table of all the data we have right now.

This could be a matrix of platforms/tests, where each item is either:
- Improvement percentage value (e.g. always denoted with a '-' sign)
- Regression percentage value (e.g. always denoted with a '+' sign)
- Indication that we don't yet have enough data to post a value (e.g. "--")

To make this table easier to look at, we could decide to always display rounded values (it doesn't matter _that_ much if a regression is 3.75% or 3.98%, so +4 would be fine for this)

For instance, if we had only 4 platforms and 3 tests, and we have enough data points on 3 of the platforms and still don't have enough data for windows XP:

          test1  test2  test3
Win 8 32    0     -8      0
Win XP     ..     ..     ..
Linux 32   +3      0      0
Linux 64   +4     +1      0

(we could decide on maybe better visual representation of this, e.g. maybe not using + for positive numbers, etc).
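
A rough sketch of how such a matrix could be rendered, assuming the percentage changes are already available as a nested dict keyed by platform and then test, with missing entries meaning "not enough data yet". The function names and the input shape are hypothetical; the sample values are the ones from the table above.

```python
def format_cell(change):
    """Format one cell: '..' for missing data, otherwise a rounded signed value."""
    if change is None:
        return ".."
    rounded = round(change)
    return "0" if rounded == 0 else f"{rounded:+d}"


def render_matrix(tests, results):
    """Render a platforms x tests table as plain text.

    results is an assumed structure: {platform: {test: percent change}}.
    """
    width = max(len(platform) for platform in results)
    lines = [" " * width + "".join(f"{test:>7}" for test in tests)]
    for platform, row in results.items():
        cells = "".join(f"{format_cell(row.get(test)):>7}" for test in tests)
        lines.append(f"{platform:<{width}}{cells}")
    return "\n".join(lines)


tests = ["test1", "test2", "test3"]
results = {
    "Win 8 32": {"test1": 0, "test2": -8, "test3": 0},
    "Win XP": {},  # no data yet on this platform
    "Linux 32": {"test1": 3, "test2": 0, "test3": 0},
    "Linux 64": {"test1": 4, "test2": 1, "test3": 0},
}
print(render_matrix(tests, results))
```

Rounding happens only at display time, so a 3.75% and a 3.98% regression both show up as +4.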

This essentially represents the exact context of the data we report. Most importantly, it says that we still don't have numbers on windows XP, and it says that test2 actually improved very nicely on windows 8 - something which we might not have reported so far.

This is very valuable information when a developer needs to decide what to do about this regression.
(Reporter)

Comment 2

3 years ago
Avi, I like what you have here.  It is a lot of background information.

My goal for this script is a subset of what you are looking for.  Having the confidence for a given alert will help us determine what needs backfilling and what to do with the bug (file, wait, add a comment, etc.)

I suspect once we have full confidence in the alert, we could file the bug.  Right now we have the ability to view regressions/improvements in a tabular form:
http://alertmanager.allizom.org:8080/alerts.html?rev=150c9fed433b&table=1

This data can come from the list of alerts and could be converted into a cut/paste format for a nice bug comment.

Showing the missing data would only tell us whether we were missing data for that specific revision.  While that is important, it is also important to know if there is data missing from previous and future revisions, as that helps determine the generation of an alert.

So one way to do this is to forget about historical/future data and just run the calculation for all tests and all platforms.  This is roughly 150 data points to consolidate into a graph/report.  We could narrow the scope down to make it more digestible.  My concern with doing this is that we will only get data and results as we would normally from graph server, so the alerts would have already been generated.  The only difference is that it would say if we didn't collect data for this specific revision.

I am afraid there is no right or easy answer to solving this mystery of "what is the total impact of a given patch".  I do believe that we need to do something and my original proposal would help us quantify the effects of coalescing/build breakage/scheduling/noise.
(Reporter)

Comment 3

2 months ago
good idea, but hasn't been needed.
Status: NEW → RESOLVED
Last Resolved: 2 months ago
Resolution: --- → WONTFIX