Automatically retrigger performance measurements until statistical significance is reached
Categories: Testing :: Performance, enhancement, P4
Tracking: Not tracked
People: Reporter: padenot; Unassigned
References: Blocks 1 open bug
Details: Whiteboard: [fxp]
Description
If I make a change, I'd like to not have to guess how many retriggers I need, and instead have the number of iterations automatically determined.
This is what statistical benchmarking tools such as Haskell's criterion and its Rust port criterion-rs do, and it's great.
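A rough sketch of the kind of stopping rule being asked for (illustrative Python only, not how criterion implements it; all names and thresholds here are made up): keep retriggering until a bootstrap confidence interval on the median is narrow relative to the median itself, with a hard cap as a safety net.

```python
import random
import statistics

def bootstrap_ci_halfwidth(samples, n_boot=1000, alpha=0.05):
    # Bootstrap-resample the median and return half the width of the
    # (1 - alpha) percentile interval.
    medians = sorted(
        statistics.median(random.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    lo = medians[int(n_boot * alpha / 2)]
    hi = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return (hi - lo) / 2

def retrigger_until_stable(run_task, rel_tolerance=0.02, min_runs=5, max_runs=30):
    # run_task is a placeholder for "schedule one more task and wait for its
    # result"; stop once the median estimate is stable to within rel_tolerance.
    samples = []
    while len(samples) < max_runs:
        samples.append(run_task())
        if len(samples) >= min_runs:
            if bootstrap_ci_halfwidth(samples) / statistics.median(samples) <= rel_tolerance:
                break
    return samples
```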
Comment 1•3 years ago
This is something we looked into at one point and found that it's not valuable to implement.
One major issue we have is that the number of iterations you do will depend on some statistic. We tried using a simple metric like the Z-score, but the issue was with bimodal/multimodal tests. In those cases, it's impossible for us to determine when we are in the "right" mode, since any Z-score or similar statistical technique will filter out one mode or the other, but it won't keep both. This means that our bimodal tests would get even worse with a change here.
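To illustrate the problem (a toy example, not our actual harness code): with a bimodal set of measurements, a Z-score cut-off keeps whichever mode dominates and silently drops the other one.

```python
import statistics

def zscore_filter(samples, threshold=2.0):
    # Keep only values within `threshold` standard deviations of the mean.
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return [x for x in samples if abs(x - mean) / stdev <= threshold]

# Hypothetical bimodal measurements: a fast path around 100 ms and a slower
# path around 300 ms that only gets hit occasionally.
samples = [99, 101, 100, 98, 102, 100, 99, 101, 100, 100, 299, 303]
print(zscore_filter(samples))  # the ~300 ms mode is dropped entirely
```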
I was looking into the criterion module you linked, and they only gather measurements up to a threshold, so they don't calculate this automatically either. It's only dynamic based on the total of the measurements produced so far, not on their statistical significance: https://github.com/haskell/criterion/blob/master/criterion-measurement/src/Criterion/Measurement.hs#L245-L253
They use this technique to estimate the point at which they expect to have enough data for a significant comparison, so they could still be wrong.
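For illustration, a rough Python rendering of that kind of threshold-based loop (simplified; the names here are made up): measurements are collected until a fixed time budget is spent, not until some significance criterion is met.

```python
import time

def measure_until_budget(run_once, time_budget_s=5.0, min_iterations=5):
    # Collect at least `min_iterations` samples, then keep going only until
    # the wall-clock budget runs out -- no statistics involved.
    samples = []
    start = time.monotonic()
    while len(samples) < min_iterations or time.monotonic() - start < time_budget_s:
        samples.append(run_once())
    return samples
```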
Something to note is that a lot of these issues come from the fact that we only use a single median value per task for the statistical comparisons. We have some work underway to improve this system and use the replicates themselves (15-25 points per task), which will give you a better comparison. You can find the start of that work here, which we hope to integrate into CI: https://github.com/mozilla/mozperftest-tools/blob/master/mozperftest_tools/mozperftest_tools/regression_detector.py
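To make the difference concrete (hypothetical numbers, not output from the linked tool): today the comparison only sees one median per task, whereas using the replicates directly gives many more points per side.

```python
import statistics

# Replicates from two hypothetical tasks per side.
base_tasks = [[101, 99, 103, 100, 98], [102, 100, 101, 99, 100]]
new_tasks = [[105, 103, 107, 104, 102], [106, 104, 105, 103, 104]]

# Current approach: one median per task, so only two points per side.
base_medians = [statistics.median(t) for t in base_tasks]
new_medians = [statistics.median(t) for t in new_tasks]

# Replicate-based approach: pool every replicate, giving far more points
# per side for the statistical comparison.
base_all = [x for task in base_tasks for x in task]
new_all = [x for task in new_tasks for x in task]
print(len(base_medians), len(base_all))  # 2 vs 10
```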
Comment 2•3 years ago (Reporter)
Non-unimodal distributions need to be compared manually and, more importantly, highlighted and investigated; this is where we'd need bug 1814359. The same goes for when there are outliers in the data.
It's clear that having a lot more data per run will help; that's cool to hear.
Comment 3•3 years ago
I agree, but this issue arises while we're running the test itself. At that point we can't tell whether we're in one mode or another, and it's impossible for us to know which mode is the "right" one. This means we'd inadvertently make the statistical comparisons worse, since some tasks will end up in one mode and others will end up in another.
EDIT: Another way of looking at this: suppose we start by analyzing a group of 5 data points to determine their distribution. That initial set then determines the distribution of the rest of the data points, because we'll be removing "bad" data points that don't match it. In other words, it's entirely possible for the first set of data points to be the outliers and for the rest to be the actual good data points. It's certainly possible to increase the size of our initial dataset from 5 to some number X, but then the question becomes: what X should we start with to get a representative sample, and will that sample size always be representative?
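Here's a toy sketch of that failure mode (made-up filtering logic, not the harness's actual code): if the filter is seeded with the first few points and those happen to land in one mode, every later point from the other mode gets rejected as an "outlier".

```python
import statistics

def seeded_filter(stream, seed_size=5, threshold=2.0):
    # Treat the first `seed_size` points as the reference distribution and
    # reject anything that looks like an outlier relative to that seed.
    seed = stream[:seed_size]
    mean = statistics.mean(seed)
    stdev = statistics.stdev(seed) or 1.0
    kept = list(seed)
    for x in stream[seed_size:]:
        if abs(x - mean) / stdev <= threshold:
            kept.append(x)
    return kept

# The run happens to start in the slow mode, so every fast-mode measurement
# afterwards is discarded.
stream = [300, 302, 299, 301, 300, 100, 101, 99, 100, 102, 100]
print(seeded_filter(stream))  # keeps only the ~300 ms seed points
```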
Comment 4•3 years ago
Side note: we may be able to reduce the number of retriggers if we also compared via the Mann-Whitney U test, which is better suited to bimodal and other non-normal distributions; see bug 1689373.
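For reference, a minimal sketch of that comparison using scipy (the replicate values here are made up):

```python
from scipy.stats import mannwhitneyu

# Hypothetical bimodal replicates from a base and a new revision.
base = [100, 101, 99, 300, 302, 100, 98, 301]
new = [105, 104, 106, 310, 312, 105, 103, 311]

# Rank-based test: no assumption that the data is normal or unimodal.
stat, p_value = mannwhitneyu(base, new, alternative="two-sided")
print(stat, p_value)
```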