Automatically retrigger performance measurements until statistical significance is reached
Categories: Testing :: Performance, enhancement, P4
Tracking: Not tracked
People: Reporter: padenot; Unassigned
References: Blocks 1 open bug
Details: Whiteboard: [fxp]
Description
If I make a change, I'd like to not have to guess how many retriggers I need, and instead have the number of iterations automatically determined.
This is what statistical benchmarking tools such as Haskell's criterion and its Rust port criterion-rs do, and it's great.
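A rough sketch of the kind of stopping rule being asked for (illustrative Python only, not how criterion implements it; all names and thresholds here are made up): keep retriggering until a bootstrap confidence interval on the median is narrow relative to the median itself, with a hard cap as a safety net.

```python
import random
import statistics

def bootstrap_ci_halfwidth(samples, n_boot=1000, alpha=0.05):
    # Bootstrap-resample the median and return half the width of the
    # (1 - alpha) percentile interval.
    medians = sorted(
        statistics.median(random.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    lo = medians[int(n_boot * alpha / 2)]
    hi = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return (hi - lo) / 2

def retrigger_until_stable(run_task, rel_tolerance=0.02, min_runs=5, max_runs=30):
    # run_task is a placeholder for "schedule one more task and wait for its
    # result"; stop once the median estimate is stable to within rel_tolerance.
    samples = []
    while len(samples) < max_runs:
        samples.append(run_task())
        if len(samples) >= min_runs:
            if bootstrap_ci_halfwidth(samples) / statistics.median(samples) <= rel_tolerance:
                break
    return samples
```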
Comment 1•3 years ago
This is something we looked into at one point and found that it's not valuable to implement.
One major issue we have is that the number of iterations you do will depend on some statistic. We tried using a simple metric like the Z-score, but the issue was with bimodal/multimodal tests. In those cases, it's impossible for us to determine when we are in the "right" mode, since any Z-score or similar statistical technique will filter out one mode or the other, but it won't keep both. This means that our bimodal tests would get even worse with a change here.
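To illustrate the problem (a toy example, not our actual harness code): with a bimodal set of measurements, a Z-score cut-off keeps whichever mode dominates and silently drops the other one.

```python
import statistics

def zscore_filter(samples, threshold=2.0):
    # Keep only values within `threshold` standard deviations of the mean.
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return [x for x in samples if abs(x - mean) / stdev <= threshold]

# Hypothetical bimodal measurements: a fast path around 100 ms and a slower
# path around 300 ms that only gets hit occasionally.
samples = [99, 101, 100, 98, 102, 100, 99, 101, 100, 100, 299, 303]
print(zscore_filter(samples))  # the ~300 ms mode is dropped entirely
```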
I was looking into the criterion module you linked, and they only gather measurements up to a threshold, so they don't calculate this automatically either. It's only dynamic based on the total of the measurements produced so far, not on their statistical significance: https://github.com/haskell/criterion/blob/master/criterion-measurement/src/Criterion/Measurement.hs#L245-L253
They use this technique to estimate the point at which they expect to have enough data for a significant comparison, so they could still be wrong.
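For illustration, a rough Python rendering of that kind of threshold-based loop (simplified; the names here are made up): measurements are collected until a fixed time budget is spent, not until some significance criterion is met.

```python
import time

def measure_until_budget(run_once, time_budget_s=5.0, min_iterations=5):
    # Collect at least `min_iterations` samples, then keep going only until
    # the wall-clock budget runs out -- no statistics involved.
    samples = []
    start = time.monotonic()
    while len(samples) < min_iterations or time.monotonic() - start < time_budget_s:
        samples.append(run_once())
    return samples
```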
Something to note is that a lot of these issues come from the fact that we only use a single median value per task for the statistical comparisons. We have some work underway to improve this system and use the replicates themselves (15-25 points per task), which will give you a better comparison. You can find the start of that work here, which we hope to integrate into CI: https://github.com/mozilla/mozperftest-tools/blob/master/mozperftest_tools/mozperftest_tools/regression_detector.py
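To make the difference concrete (hypothetical numbers, not output from the linked tool): today the comparison only sees one median per task, whereas using the replicates directly gives many more points per side.

```python
import statistics

# Replicates from two hypothetical tasks per side.
base_tasks = [[101, 99, 103, 100, 98], [102, 100, 101, 99, 100]]
new_tasks = [[105, 103, 107, 104, 102], [106, 104, 105, 103, 104]]

# Current approach: one median per task, so only two points per side.
base_medians = [statistics.median(t) for t in base_tasks]
new_medians = [statistics.median(t) for t in new_tasks]

# Replicate-based approach: pool every replicate, giving far more points
# per side for the statistical comparison.
base_all = [x for task in base_tasks for x in task]
new_all = [x for task in new_tasks for x in task]
print(len(base_medians), len(base_all))  # 2 vs 10
```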
Comment 2•3 years ago (Reporter)
Non-unimodal distributions need to be compared manually and, more importantly, highlighted and investigated; this is where we'd need bug 1814359. The same goes for when there are outliers in the data.
It's clear that having a lot more data per run will help; that's cool to hear.
Comment 3•3 years ago
I agree, but this issue arises while we're running the test itself. At that point we can't tell whether we're in one mode or another, and it's impossible for us to know which mode is the "right" one. This means we'd inadvertently make the statistical comparisons worse, since some tasks will end up in one mode and others will end up in another.
EDIT: Another way of looking at this: suppose we start by analyzing a group of 5 data points to determine their distribution. That initial set then determines the distribution of the rest of the data points, because we'll be removing "bad" data points that don't match it. In other words, it's entirely possible for the first set of data points to be the outliers and for the rest to be the actual good data points. It's certainly possible to increase the size of our initial dataset from 5 to some number X, but then the question becomes: what X should we start with to get a representative sample, and will that sample size always be representative?
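Here's a toy sketch of that failure mode (made-up filtering logic, not the harness's actual code): if the filter is seeded with the first few points and those happen to land in one mode, every later point from the other mode gets rejected as an "outlier".

```python
import statistics

def seeded_filter(stream, seed_size=5, threshold=2.0):
    # Treat the first `seed_size` points as the reference distribution and
    # reject anything that looks like an outlier relative to that seed.
    seed = stream[:seed_size]
    mean = statistics.mean(seed)
    stdev = statistics.stdev(seed) or 1.0
    kept = list(seed)
    for x in stream[seed_size:]:
        if abs(x - mean) / stdev <= threshold:
            kept.append(x)
    return kept

# The run happens to start in the slow mode, so every fast-mode measurement
# afterwards is discarded.
stream = [300, 302, 299, 301, 300, 100, 101, 99, 100, 102, 100]
print(seeded_filter(stream))  # keeps only the ~300 ms seed points
```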
Comment 4•3 years ago
Side note: we may be able to reduce the number of retriggers if we also compared via the Mann-Whitney U test, which is better suited to bimodal and other non-normal distributions; see bug 1689373.
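For reference, a minimal sketch of that comparison using scipy (the replicate values here are made up):

```python
from scipy.stats import mannwhitneyu

# Hypothetical bimodal replicates from a base and a new revision.
base = [100, 101, 99, 300, 302, 100, 98, 301]
new = [105, 104, 106, 310, 312, 105, 103, 311]

# Rank-based test: no assumption that the data is normal or unimodal.
stat, p_value = mannwhitneyu(base, new, alternative="two-sided")
print(stat, p_value)
```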