Open Bug 1265381 Opened 8 years ago Updated 1 year ago

Consider alerting with less than 12 future data points if we're very confident there is a regression


(Tree Management :: Perfherder, enhancement, P3)



(Not tracked)


(Reporter: wlach, Unassigned)


(Depends on 1 open bug)


(Whiteboard: [fxp])

We currently have the problem that it takes too long to get a complete set of alerts for a push that has regressed Firefox. Some tests and platforms are only run very infrequently (e.g. winxp or pgo) due to resource constraints. This is annoying for perf sheriffs as it potentially means spending extra time waiting for everything to come in before filing a bug. 

Joel and I have been brainstorming ways to work around this and have come up with several ideas. 

The first is to just alert with less data. In some cases, we can be very confident that there is a regression even when we don't have the "full" set of future datapoints. For these cases, maybe we can generate an alert if the confidence is very high? This would get us to the point of filing a bug and getting a developer to act on the full set of data sooner.

Currently our setting is to alert with 12 future datapoints if the value of the t-statistic is at least 7.

I'd propose perhaps "smoothly" reducing that to as few as 6 revisions if the confidence is very, very high (perhaps a t-statistic of 14 or so).
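As a minimal sketch of what "smoothly" could mean here, assuming a linear interpolation between the two settings mentioned above (t=14 at 6 future datapoints, t=7 at 12). The function name and parameters are hypothetical and are not part of the existing alert code:

```python
def required_t(num_future, min_points=6, full_points=12, t_full=7.0, t_min=14.0):
    """Return the t-statistic needed to alert with num_future datapoints.

    Returns None when there are too few datapoints to alert at all.
    Linear interpolation is an assumption; any monotonic curve would do.
    """
    if num_future < min_points:
        return None
    if num_future >= full_points:
        return t_full
    # Fraction of the way from the full window down to the minimum window.
    frac = (full_points - num_future) / (full_points - min_points)
    return t_full + frac * (t_min - t_full)


for n in (5, 6, 9, 12):
    print(n, required_t(n))
```

With this shape, the bar for alerting drops gradually as more future data arrives, instead of jumping between two fixed settings.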

The second idea we had was to automatically retrigger jobs if we are somewhat confident there is an alert -- perhaps retriggering at 3 if we're fairly confident, then again at 6 if we still think something might be there (but not yet extremely confident). I think this approach is complementary to the first, so we should probably implement it afterwards.

Kyle, Saptarshi, would be curious to hear your thoughts on this if you have time.
Flags: needinfo?(sguha)
Flags: needinfo?(klahnakoski)
If we can get a "suspected" regression to retrigger testing then I believe we will have the start of a good system: we will continue to get false positives with fewer samples, but if the machine can generate more samples for itself, then this problem is mitigated. The biggest problem will be prioritizing all the possible suspected regressions, and retriggering tests on the most likely candidates.

Using the alert database (with size and confidence of the level shifts) should give us enough information to make this prioritization.
Flags: needinfo?(klahnakoski)
We fully expect some regressions to show up and be invalid; that is the case with 12 future data points, and it will be with whatever we choose.

What I like is that for 6 data points with a high t_threshold (t=12), we can get about 1/3 of the alerts matched with only 2.5% misses (alerting when it really isn't an issue), with almost all extra alerts seen on linux. This is assuming we ignore kraken. Bumping up to t=14, we ignore 10 of the extra alerts on linux64, although we could have a lower hit rate on the 32% of existing alerts.

Regarding re-triggers, I think we can come up with a system like this:
1) on 3 future data points and t=14, retrigger suspect revision 3 times (so we end up with 6 data points on about 20% of the real alerts, and 6 data points on about 12% of revisions that don't really have alerts)
** we need to be really careful here on 3 retriggers: many jobs have >1 test, so we could end up retriggering a job 20 times. Our goal is to ensure a job has 4 data points.
2) after retriggers or future data, we will have 6 data points much faster
3) on 6 future data points and t=12, send out alerts
4) on 12 future data points and t=7, send out remaining alerts

Overall, we will waste a chunk of jobs on #1, and some randomization on #2, but will get all alerts faster due to additional retriggers.  I think once we have this, we can look at the system and see how many alerts are showing up after 24 hours and analyze those to determine if we can apply some other technique.
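The tiers above could be sketched roughly as follows. This is only an illustration of the decision order, using the thresholds from the list; the names `TIERS` and `decide` are hypothetical and nothing here exists in treeherder:

```python
# (minimum future datapoints, minimum t-statistic, action),
# checked from the most data to the least.
TIERS = [
    (12, 7.0, "alert"),
    (6, 12.0, "alert"),
    (3, 14.0, "retrigger"),
]


def decide(num_future, t_value):
    """Return "alert", "retrigger" or "wait" for a suspect revision."""
    for min_points, min_t, action in TIERS:
        if num_future >= min_points and t_value >= min_t:
            return action
    return "wait"


print(decide(3, 15.0))   # few datapoints, very high t
print(decide(6, 13.0))   # half window, high t
print(decide(12, 8.0))   # full window, normal threshold
```

A real implementation would also need to cap retriggers per job, since one job can carry several tests.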

One other thing I am looking into is scheduling duplicate jobs for pgo builds, and possibly winxp.  This would reduce the time for future data points greatly.
[de-needinfo'ing saptarshi, I think we're ok here]

Let's revisit this.

I think it would be interesting to start by changing the alert code to alert on 6 or 9 future data points and a high t threshold (say 12), and scaling up from there. I think this should be a fairly trivial change.
Flags: needinfo?(sguha)
Hey Akhilesh, thanks for looking at this. Here's how I might start tackling this bug:

* Set up a Jupyter notebook and install the treeherder-client into it. I recommend a python virtualenv for this.
* Import the existing set of performance alerting functions into the notebook (they are in `treeherder/perfalert/perfalert/` and have no external dependencies, you should be able to just copy and paste the contents into a cell)
* Download an example of performance data with a very visible regression with the treeherder client (e.g.,34025020e068f8204ca2174832747ef815aa3b65,1,1%5D) and load it into the Jupyter notebook. You can see an example of this here. Try and see if you can run the alerts code on this data and generate a comparable set of alerts.
* Experiment with modifying the alerting code to have a "minimum" forward window as well as a default one, and a parameter for the lower confidence.
* Try and see if you can reproduce the regression detection with fewer datapoints by slicing off the end of the data

Once you've got that working, we can see about modifying treeherder itself and testing on more data. :)
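To make the last two steps concrete, here is a self-contained sketch of the experiment on synthetic data. It does not use the perfalert code: `t_statistic` is a plain Welch-style statistic standing in for it, and `detects_step`, `min_fore_window` and `high_t_threshold` are hypothetical names for the "minimum" forward window and its stricter threshold:

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(before, after):
    """Welch-style t-statistic for a shift in mean between two windows."""
    var_b = variance(before) if len(before) > 1 else 0.0
    var_a = variance(after) if len(after) > 1 else 0.0
    denom = sqrt(var_b / len(before) + var_a / len(after))
    diff = abs(mean(after) - mean(before))
    if denom == 0:
        return float("inf") if diff else 0.0
    return diff / denom


def detects_step(series, index, back_window=12, fore_window=12,
                 min_fore_window=6, t_threshold=7.0, high_t_threshold=14.0):
    """Check for a step at index, allowing a short forward window."""
    before = series[max(0, index - back_window):index]
    after = series[index:index + fore_window]
    if len(before) < 2 or len(after) < min_fore_window:
        return False
    # A shorter forward window must clear the stricter threshold.
    threshold = t_threshold if len(after) >= fore_window else high_t_threshold
    return t_statistic(before, after) >= threshold


# Synthetic series: 12 points around 100, then 12 points shifted by `step`.
noise = [0, 1, -1, 0, 2, -2, 0, 1, -1, 0, 1, -1]


def make_series(step):
    return [100 + n for n in noise] + [100 + step + n for n in noise]


# Slice off the end of the data to simulate having fewer future datapoints.
for keep in (12, 9, 6, 3):
    big = make_series(20)[:12 + keep]
    small = make_series(5)[:12 + keep]
    print(keep, detects_step(big, 12), detects_step(small, 12))
```

On this made-up data, the large step is still caught with only 6 future points, while the small one needs the full window, which is the trade-off the experiment is meant to measure on real series.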
Assignee: nobody → akhileshpillai
Thanks Will, sounds like a plan.  I will start working on it.
Got the data into jupyter notebook, will now try to run the alerts code against it.
Just an update.

To generate alerts against the data, I was trying to do the following:

signature = PerformanceSignature.objects.get(id=82922)

However, I got this error:
DoesNotExist: PerformanceSignature matching query does not exist.
I realized that my local models need the relevant data.

I tried this to populate data for the above (this did not work):
./ import_perf_data mozilla-inbound --time-interval=12909600 -v 3 --filter-props="signatures:'34025020e068f8204ca2174832747ef815aa3b65'"

So now I am doing this (this seems to be working):
./ import_perf_data mozilla-inbound --time-interval=12909600 -v 3
Yes, that should be fine, though it might be slow. I think you might do better to just use the REST API to download the data for testing purposes.

From a jupyter notebook, you should be able to do that with python requests. 

Helpful hint: You can use the web console from Firefox or Chrome to see which API endpoints are being hit to display any given set of data. :)
I have been using the api end point to get the data into the jupyter notebook.

signature1 = '34025020e068f8204ca2174832747ef815aa3b65'
series2 = cli.get_performance_data('mozilla-inbound', interval=12909600, signatures=signature1)[signature1]

However, when I tried to generate new alerts via generate_new_alerts_in_series(signature), it seemed an object of type PerformanceSignature was needed; passing signature1 to generate_new_alerts_in_series did not work.

Is there any other way to generate alerts than calling generate_new_alerts_in_series?
Ah yes, if you want to actually generate new alerts in the series, you'd need to import stuff into the database. I was thinking you'd just call the various functions in the standalone alert code and inspect the results from the jupyter notebook. 

If we're happy with the results that's giving us, I think we can just go ahead and tweak the code (making sure that the treeherder unit tests still pass, of course).
Hi Akhilesh, are you still working on this? If you don't have time, no worries, but I'd like to be able to give this to someone else in that case.
Flags: needinfo?(akhileshpillai)
Please reassign it, I should have pinged you earlier, sorry about that.
Flags: needinfo?(akhileshpillai)
(In reply to Akhilesh Pillai from comment #12)
> Please reassign it, I should have pinged you earlier, sorry about that.

No problem! Thanks for your work on it.
Assignee: akhileshpillai → nobody
Priority: -- → P5
Priority: P5 → P3
Type: defect → enhancement

Have we resumed similar discussions on this matter? I'm asking because of the recent interest in improving the alerting algorithms.

Flags: needinfo?(klahnakoski)
Flags: needinfo?(dave.hunt)

Reducing the number of datapoints can be done with the Mann-Whitney-U (MWU) test, but it will not be as fabulous as you might imagine. The new MWU bases its decision on p-value, and that will dictate the minimum number of points it needs to detect a step.

Furthermore, if we use numpy, then we should be able to remove the upper bound too with little cpu increase.
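To illustrate the point about the p-value dictating a minimum number of points, here is a small calculation under stated assumptions: an exact two-sided MWU test, no ties, and a fixed window of 12 past datapoints. The best case is complete separation of the two windows, which has a two-sided p-value of 2 / C(m+n, n). The function names are hypothetical, and the actual MWU implementation may use an approximation instead:

```python
from math import comb


def min_two_sided_p(n_before, n_after):
    """Smallest p-value an exact two-sided MWU test can return.

    Reached when every value in one window exceeds every value in the
    other (assumes no ties).
    """
    return min(1.0, 2 / comb(n_before + n_after, n_before))


def min_future_points(alpha, n_before=12, max_points=50):
    """Fewest future datapoints for which p < alpha is achievable at all."""
    for n_after in range(1, max_points + 1):
        if min_two_sided_p(n_before, n_after) < alpha:
            return n_after
    return None


for alpha in (0.05, 0.01, 0.001):
    print(alpha, min_future_points(alpha))
```

These are lower bounds only: real series with noise and overlap will need more points than this to reach the same p-value.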

Assignee: nobody → klahnakoski
Depends on: 1601952
Flags: needinfo?(klahnakoski)
Flags: needinfo?(dave.hunt)
Assignee: klahnakoski → nobody
Whiteboard: [fxp]