perf-reftests should fail if any of the individual subtests fail
Categories
(Testing :: Talos, task, P3)
Tracking
(firefox70 fixed)
| | Tracking | Status |
|---|---|---|
| firefox70 | --- | fixed |
People
(Reporter: heycam, Assigned: heycam)
Details
Attachments
(1 file)
Currently subtest_alerts = False for perf-reftest and perf-reftest-singletons. I think that means we only report regressions in the single value that combines the results from all of the subtests (is that the geomean of those values?). Which I think means that as we add more subtests, each one's influence on the final result is diminished, and if a single subtest starts regressing, we probably won't notice. The subtests all test very different, individual optimizations that we want to ensure don't regress.
Is the solution to this just to flip subtest_alerts to True?
Comment 1•6 years ago
The suite level result is a geometric mean, which normalizes the differently-ranged values and therefore allows us to be alerted when any of the subtest values regress or improve rather than needing to monitor them individually. We can turn on alerting for individual subtests, however this will increase the volume of alerts and the workload of the perf sheriffs, potentially without any improvement to the rate of detection for regressions. Do you have evidence to suggest that regressions to subtests have been missed by monitoring the geometric mean?
Updated•6 years ago
Comment 2 (Assignee)•6 years ago
I do know that in the past I have noticed Talos try runs I've done where the overall regression didn't count as significant on Perfherder, but when I dug into the subtests, there was one with a significant regression.
So let's say we have 100 existing perf-reftest singletons, and they all have a reported time of 100 ms. The geomean is 100 ms. If one test regresses so that it takes 120 ms -- so that's a 20% regression, which should count as significant -- then the geomean is 100.18 ms, which is only a 0.18% regression, and so won't be flagged.
Or say we have 99 existing tests reporting 1000 ms and one test reporting 100 ms. The geomean is 977.24 ms. If the 100 ms test regresses and now takes 200 ms, the geomean is 984.03 ms, which is a 0.69% regression, and also won't be flagged.
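The arithmetic in both scenarios can be checked with a short script. `geomean` here is just the plain geometric mean; this is an illustration of the dilution effect, not the actual Perfherder alerting code:

```python
import math

def geomean(values):
    # Geometric mean: exp of the average of the logs.
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Scenario 1: 100 subtests at 100 ms; one regresses by 20% to 120 ms.
before = geomean([100.0] * 100)                 # 100.0
after = geomean([120.0] + [100.0] * 99)         # ~100.18
print(round((after / before - 1) * 100, 2))     # 0.18 (% suite-level change)

# Scenario 2: 99 subtests at 1000 ms, one at 100 ms; the 100 ms test
# doubles to 200 ms.
before = geomean([1000.0] * 99 + [100.0])       # ~977.24
after = geomean([1000.0] * 99 + [200.0])        # ~984.03
print(round(before, 2))
print(round(after, 2))
```

A 20% regression in one of 100 subtests moves the suite-level value by only ~0.2%, well below a typical alert threshold.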
Comment 3 (Assignee)•6 years ago
Anyway, it feels like if I add a new test, I don't have any confidence that a regression in that test will be caught, because of the averaging going on. (Unless the regression is large enough to overwhelm the effect of the averaging.)
Is there work that the perf sheriffs need to do to manage these alerts beyond what I would normally see in bugs being filed for regressions? If it results in more bugs being filed for these subtest regressions, then I think that's a good thing, as it really indicates a regression we want to address. If there is additional work that happens before that step that doesn't result in bugs being filed (e.g. for regressions below the threshold? testing for intermittents?) then it's understandable that we'd like to keep this under control. Do you have any sense of the increase in workload this would bring?
Comment 4•6 years ago
It depends on a number of things. If we enable alerting for 100 tests, and a push causes a regression that affects all of them then we'll have 100 alerts for the sheriffs to deal with. Hopefully these will all be grouped in a single alert summary, but that's not guaranteed. I think in this particular case it's probably worth looking into the historic data for these results to see how alerting on subtests would affect the number of alerts. Another option is to use something other than geomean for the suite level value.
A quick look over the perf_reftest* results in Perfherder for linux64-shippable suggests that there's been 1 alert based on geomean (https://treeherder.mozilla.org/perf.html#/alerts?id=21642), and if we enabled all subtests for perf_reftest* then we would have seen 21 alerts in the last 90 days. This doesn't seem excessive, so perhaps it's worth considering if we feel there's value.
Comment 5•6 years ago
FYI I used https://github.com/davehunt/randomtools/tree/master/doalerts to see what alerts may have been generated if we enabled subtest alerting, with the command line: `pipenv run python perfherder_alerts.py --framework talos --branch autoland --platform linux64-shippable --days 90 --test perf_reftest --subtests`
Comment 6 (Assignee)•6 years ago
(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #4)
> It depends on a number of things. If we enable alerting for 100 tests, and a push causes a regression that affects all of them then we'll have 100 alerts for the sheriffs to deal with. Hopefully these will all be grouped in a single alert summary, but that's not guaranteed. I think in this particular case it's probably worth looking into the historic data for these results to see how alerting on subtests would affect the number of alerts. Another option is to use something other than geomean for the suite level value.
I'm not really sure what value it makes sense to report at the suite level. If we do indeed care about individual subtests regressing, and the suite-level number regresses if and only if at least one individual subtest regresses, then it might not really be useful to look at the one top-level value at all. (But maybe, with these set up as subtests, we're required to report a single number?)
> A quick look over the perf_reftest* results in Perfherder for linux64-shippable suggests that there's been 1 alert based on geomean (https://treeherder.mozilla.org/perf.html#/alerts?id=21642), and if we enabled all subtests for perf_reftest* then we would have seen 21 alerts in the last 90 days. This doesn't seem excessive, so perhaps it's worth considering if we feel there's value.
Thanks for digging in to the numbers. That one geomean alert there is just because I added a new perf-reftest singleton in bug 1554571.
If this doesn't seem excessive, can we try it? And if we find that the burden on the perf sheriffs is too great we can re-think.
Comment 7•6 years ago
(In reply to Cameron McCormack (:heycam) (away 5 July) from comment #6)
> If this doesn't seem excessive, can we try it? And if we find that the burden on the perf sheriffs is too great we can re-think.
Go ahead and enable shouldAlert for the subtests. We'll monitor the volume of alerts and revisit as needed.
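For reference, a minimal sketch of the kind of change being discussed, assuming the Talos test classes in testing/talos/talos/test.py carry a `subtest_alerts` flag as comment 0 describes. The stand-in base class and exact attribute layout here are illustrative, not the real Talos source:

```python
# Illustrative stand-in for the Talos pageloader test base class; the
# real definitions live in testing/talos/talos/test.py and differ.
class PageloaderTest:
    subtest_alerts = False  # default: alert only on the suite-level geomean

class perf_reftest(PageloaderTest):
    # Flip subtest_alerts so each subtest is marked shouldAlert in the
    # reported data, and Perfherder alerts on it individually.
    subtest_alerts = True

class perf_reftest_singletons(PageloaderTest):
    subtest_alerts = True
```

With this flipped, the suite-level geomean alert remains, but a regression in any single subtest can now generate its own alert regardless of how diluted it is in the geomean.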
Comment 8 (Assignee)•6 years ago
Updated•6 years ago
Comment 10 (bugherder)•6 years ago