Open
Bug 1265381
Opened 8 years ago
Updated 5 months ago
Consider alerting with less than 12 future data points if we're very confident there is a regression
Categories
(Tree Management :: Perfherder, enhancement, P3)
Tracking
(Not tracked)
NEW
People
(Reporter: wlach, Unassigned)
References
(Depends on 1 open bug)
Details
(Whiteboard: [fxp])
We currently have the problem that it takes too long to get a complete set of alerts for a push that has regressed Firefox. Some tests and platforms are only run very infrequently (e.g. winxp or pgo) due to resource constraints. This is annoying for perf sheriffs, as it potentially means spending extra time waiting for everything to come in before filing a bug. Joel and I have been brainstorming ways to work around this and have come up with several ideas.

The first is to just alert with less data. In some cases we can be very confident that there is a regression even when we don't have the "full" set of future datapoints, so maybe we can generate an alert when the confidence is very high. This would get us to the point of filing a bug and getting a developer to act on the full set of data sooner. Currently we alert with 12 future datapoints when the t-statistic exceeds 7 (see https://github.com/mozilla/treeherder/blob/4a357b297fde5d5ba3f93c27a53aea53292f53a9/treeherder/perfalert/perfalert/__init__.py#L100). I'd propose "smoothly" reducing that to as few as 6 revisions when the confidence is very high (perhaps a t-statistic of 14 or so); a rough sketch follows below.

The second idea we had was to automatically retrigger jobs if we are somewhat confident there is an alert: perhaps retriggering at 3 future datapoints if we're fairly confident, then again at 6 if we still think something might be there (but are not yet extremely confident). I think this approach is complementary to the first, so we should probably implement it afterwards.

Kyle, Saptarshi, I would be curious to hear your thoughts on this if you have time.
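A rough sketch of the kind of rule I have in mind, in Python (the numbers and the helper name are illustrative only, not the current Perfherder implementation):

# Illustrative sketch only -- not the actual Perfherder code. Assumed rule:
# the higher the observed t-statistic, the fewer future datapoints we require
# before generating an alert.
def required_fore_window(t_value,
                         default_window=12,    # current setting
                         min_window=6,         # floor we would reduce to
                         default_threshold=7,  # current t threshold
                         high_threshold=14):   # "very confident" threshold
    """How many future datapoints to require before alerting."""
    if t_value >= high_threshold:
        return min_window
    if t_value >= default_threshold:
        # Interpolate between the two thresholds so the requirement
        # shrinks "smoothly" as confidence grows.
        fraction = (t_value - default_threshold) / (high_threshold - default_threshold)
        return round(default_window - fraction * (default_window - min_window))
    return default_window

# required_fore_window(7) -> 12, required_fore_window(10.5) -> 9,
# required_fore_window(14) -> 6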
Flags: needinfo?(sguha)
Flags: needinfo?(klahnakoski)
Comment 1•8 years ago
If we can get a "suspected" regression to retrigger testing, then I believe we will have the start of a good system: we will continue to get false positives with fewer samples, but if the machine can generate more samples for itself, then this problem is mitigated. The biggest problem will be prioritizing all the possible suspected regressions and retriggering tests on the most likely candidates. Using the alert database (with the size and confidence of the level shifts) should give us enough information to make this prioritization.
Flags: needinfo?(klahnakoski)
Comment 2•8 years ago
We fully expect some regressions to show up and be invalid; that is already the case with 12 future data points, and it will be the case with whatever we choose. What I like is that with 6 data points and a high t_threshold (t=12), we can match about 1/3 of the alerts with only 2.5% misses (alerting when it really isn't an issue), with almost all of the extra alerts seen on Linux. This assumes we ignore Kraken. Bumping up to t=14, we avoid 10 of the extra alerts on linux64, although we could end up with a lower hit rate than the 32% of existing alerts.

Regarding retriggers, I think we can come up with a system like this (sketched in code below):
1) On 3 future data points and t=14, retrigger the suspect revision 3 times (so we end up with 6 data points on about 20% of the real alerts, and 6 data points on about 12% of revisions that don't really have alerts).
** We need to be really careful here on 3 retriggers: many jobs have >1 test, so we could end up retriggering a job 20 times. Our goal is to ensure a job has 4 data points.
2) After retriggers or future data, we will have 6 data points much faster.
3) On 6 future data points and t=12, send out alerts.
4) On 12 future data points and t=7, send out the remaining alerts.

Overall, we will waste a chunk of jobs on #1, and some randomization on #2, but we will get all alerts faster due to the additional retriggers. Once we have this, we can look at the system, see how many alerts are showing up after 24 hours, and analyze those to determine whether we can apply some other technique.

One other thing I am looking into is scheduling duplicate jobs for pgo builds, and possibly winxp. This would greatly reduce the time to accumulate future data points.
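Here is the staged policy above as a small Python sketch (purely illustrative; only the thresholds come from this comment, the function itself is not part of the Perfherder code):

# Purely illustrative sketch of the staged policy described above; the
# thresholds come from this comment, not from the actual Perfherder code.
def staged_action(future_points, t_value):
    """Decide what to do for a suspected changepoint, given how many future
    datapoints we have and the t-statistic observed so far."""
    if future_points >= 12 and t_value >= 7:
        return "alert"       # step 4: the normal alerting rule
    if future_points >= 6 and t_value >= 12:
        return "alert"       # step 3: early alert with high confidence
    if future_points >= 3 and t_value >= 14:
        return "retrigger"   # step 1: very high confidence, gather more data
    return "wait"            # keep collecting future datapoints

# staged_action(3, 15) -> "retrigger", staged_action(6, 13) -> "alert",
# staged_action(12, 8) -> "alert", staged_action(6, 8) -> "wait"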
Reporter
Comment 3•7 years ago
[de-needinfo'ing saptarshi, I think we're ok here] Let's revisit this. I think it would be interesting to start by changing the alert code to alert on 6 or 9 future data points and a high t threshold (say 12), and scaling up from there. I think this should be a fairly trivial change.
Flags: needinfo?(sguha)
Reporter
Comment 4•7 years ago
Hey Akhilesh, thanks for looking at this. Here's how I might start tackling this bug:

* Set up a Jupyter notebook (http://jupyter.org/) and install the treeherder-client into it. I recommend a Python virtualenv for this.
* Import the existing set of performance alerting functions into the notebook (they are in `treeherder/perfalert/perfalert/__init__.py` and have no external dependencies; you should be able to just copy and paste the contents into a cell).
* Download an example of performance data with a very visible regression with the treeherder client (e.g. https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,34025020e068f8204ca2174832747ef815aa3b65,1,1%5D) and load it into the Jupyter notebook. You can see an example of this here: http://nbviewer.jupyter.org/url/people.mozilla.org/~wlachance/Bokeh%20test.ipynb. Then see if you can run the alerts code on this data and generate a comparable set of alerts.
* Experiment with modifying the alerting code to have a "minimum" forward window as well as a default one, and a parameter for the lower confidence.
* See if you can reproduce the regression detection with fewer datapoints by slicing off the end of the data (a rough sketch of the idea follows below).

Once you've got that working, we can see about modifying Treeherder itself and testing on more data. :)
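Something along these lines, using a synthetic series so it runs standalone (the t-test below is a simplified stand-in for the functions in `treeherder/perfalert/perfalert/__init__.py`, and the window/threshold values are just illustrative):

import math
import random

# Synthetic series with an obvious step at index 30, standing in for real data.
random.seed(0)
values = [100 + random.gauss(0, 1) for _ in range(30)]   # baseline
values += [110 + random.gauss(0, 1) for _ in range(30)]  # regressed values

def t_stat(back, fore):
    """Simplified two-sample t-statistic between the back and fore windows."""
    mean_b = sum(back) / len(back)
    mean_f = sum(fore) / len(fore)
    var_b = sum((x - mean_b) ** 2 for x in back) / (len(back) - 1)
    var_f = sum((x - mean_f) ** 2 for x in fore) / (len(fore) - 1)
    denom = math.sqrt(var_b / len(back) + var_f / len(fore)) or 1e-9
    return abs(mean_f - mean_b) / denom

def find_changepoints(series, back_window=12, fore_window=12, t_threshold=7):
    """Indices where the fore window differs strongly from the back window."""
    return [i for i in range(back_window, len(series) - fore_window + 1)
            if t_stat(series[i - back_window:i],
                      series[i:i + fore_window]) >= t_threshold]

# Current settings on the full series, then a truncated series (fewer future
# datapoints available) with a smaller forward window and a higher threshold.
print(find_changepoints(values, fore_window=12, t_threshold=7))
print(find_changepoints(values[:-6], fore_window=6, t_threshold=12))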
Assignee: nobody → akhileshpillai
Comment 5•7 years ago
Thanks Will, sounds like a plan. I will start working on it.
Comment 6•7 years ago
Got the data into the Jupyter notebook; I will now try to run the alerts code against it.
Comment 7•7 years ago
Just an update. To generate alerts against the data, I was trying the following:

signature = PerformanceSignature.objects.get(id=82922)
generate_new_alerts_in_series(signature)

However, I got an error ("DoesNotExist: PerformanceSignature matching query does not exist"), so I realized that my local models need the relevant data.

I first tried this to populate the data for the above (this did not work):

./manage.py import_perf_data mozilla-inbound --time-interval=12909600 -v 3 --filter-props="signatures:'34025020e068f8204ca2174832747ef815aa3b65'"

So now I am doing this, which seems to be working:

./manage.py import_perf_data mozilla-inbound --time-interval=12909600 -v 3
Reporter
Comment 8•7 years ago
Yes, that should be fine, though it might be slow. I think you might do better to just use the REST API to download the data for testing purposes: https://treeherder.mozilla.org/api/project/mozilla-inbound/performance/data/?framework=1&interval=2592000&signatures=34025020e068f8204ca2174832747ef815aa3b65

From a Jupyter notebook, you should be able to do that with Python requests; something like the snippet below. Helpful hint: you can use the web console in Firefox or Chrome to see which API endpoints are being hit to display any given set of data. :)
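(Just a sketch; the JSON field names are whatever the endpoint actually returns, so check the response if they differ:)

import requests

# Fetch the datapoints for one signature straight from the REST API.
sig = "34025020e068f8204ca2174832747ef815aa3b65"
url = ("https://treeherder.mozilla.org/api/project/mozilla-inbound/"
       "performance/data/?framework=1&interval=2592000&signatures=" + sig)

# The response is keyed by signature hash; "push_timestamp" and "value"
# are assumed field names here.
datapoints = requests.get(url).json()[sig]
values = [d["value"] for d in sorted(datapoints, key=lambda d: d["push_timestamp"])]
print(len(values), values[:5])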
Comment 9•7 years ago
I have been using the API endpoint to get the data into the Jupyter notebook:

signature1 = '34025020e068f8204ca2174832747ef815aa3b65'
series2 = cli.get_performance_data('mozilla-inbound', interval=12909600, signatures=signature1)[signature1]

However, when trying to generate new alerts via generate_new_alerts_in_series(), it seemed an object of type PerformanceSignature was needed; passing signature1 to it was not working. Is there any other way to generate alerts than calling generate_new_alerts_in_series?
Reporter
Comment 10•7 years ago
Ah yes, if you want to actually generate new alerts in the series, you'd need to import the data into the database. I was thinking you'd just call the various functions in the standalone alert code and inspect the results from the Jupyter notebook. If we're happy with the results that gives us, I think we can just go ahead and tweak the code (making sure that the Treeherder unit tests still pass, of course: http://treeherder.readthedocs.io/common_tasks.html#running-the-tests).
Reporter
Comment 11•7 years ago
Hi Akhilesh, are you still working on this? If you don't have time, no worries, but I'd like to be able to give this to someone else in that case.
Flags: needinfo?(akhileshpillai)
Comment 12•7 years ago
Please reassign it, I should have pinged you earlier, sorry about that.
Flags: needinfo?(akhileshpillai)
Reporter
Comment 13•7 years ago
(In reply to Akhilesh Pillai from comment #12)
> Please reassign it, I should have pinged you earlier, sorry about that.

No problem! Thanks for your work on it.
Assignee: akhileshpillai → nobody
Updated•6 years ago
Priority: -- → P5
Updated•4 years ago
Priority: P5 → P3
Updated•4 years ago
Type: defect → enhancement
Comment 14•4 years ago
Have we resumed similar discussions on this matter? I'm asking because of the recent interest in improving the alerting algorithms.
Flags: needinfo?(klahnakoski)
Flags: needinfo?(dave.hunt)
Comment 15•4 years ago
Reducing the number of datapoints can be done with the Mann-Whitney U test (MWU), but it will not be as fabulous as you might imagine. The new MWU approach bases its decision on the p-value, and that will dictate the minimum number of points it needs to detect a step.
Furthermore, if we use numpy, then we should be able to remove the upper bound too with little CPU increase.
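For illustration, a minimal version of the idea using scipy (just a sketch; the windows and the p-value cutoff are illustrative, not what Perfherder uses):

from scipy.stats import mannwhitneyu

def is_step(back, fore, p_cutoff=0.01):
    """Return True if the fore window looks like a different distribution
    than the back window, according to the MWU test."""
    if len(back) < 3 or len(fore) < 3:
        # With too few points the test cannot reach a small p-value;
        # this is what dictates the minimum number of points needed.
        return False
    _, p_value = mannwhitneyu(back, fore, alternative="two-sided")
    return p_value < p_cutoff

# is_step([100, 101, 99, 98, 102], [110, 111, 109, 112, 113]) -> True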
Updated•4 years ago
Flags: needinfo?(dave.hunt)
Updated•3 years ago
Assignee: klahnakoski → nobody
Updated•5 months ago
Whiteboard: [fxp]