Bug 1581533 (Open) - Opened 5 years ago, Updated 3 years ago

Discussion: Improving performance regression detection

Categories

(Tree Management :: Perfherder, enhancement, P3)

Tracking

(Not tracked)

People

(Reporter: ekyle, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

igoldan, sclements, kthiessen, davehunt:

Perfherder uses the t-test for detecting performance regressions. This test assumes the performance results have a Gaussian distribution; it assumes a (mean, standard deviation) pair can reasonably describe the performance results. This is NOT true; many of our performance results are bi-modal, or multi-modal, and have some number of outliers.

I have a small project to detect "deviant noise": https://github.com/mozilla/measure-noise The original intent was to measure deviation-from-Gaussian and use that as a "noise statistic", or yet another statistic we can track and alert on. The idea is that performance results with large deviant noise are almost useless for detecting regressions.
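
For concreteness, here is a minimal sketch of the kind of score I have in mind. This is illustrative only, assuming NumPy/SciPy and a D'Agostino-Pearson normality statistic; it is NOT the actual measure-noise calculation.

    # Illustrative sketch only -- not the measure-noise implementation. It shows
    # one way a per-series "deviation from Gaussian" score could be computed.
    import numpy as np
    from scipy import stats

    def deviant_noise_score(values):
        """Larger score = the series looks less like a single Gaussian."""
        values = np.asarray(values, dtype=float)
        # D'Agostino-Pearson test combines skew and kurtosis; skewed, heavy-tailed,
        # and multi-modal replicate distributions all push the statistic up.
        statistic, _pvalue = stats.normaltest(values)
        return statistic

    rng = np.random.default_rng(0)
    well_behaved = rng.normal(500, 10, size=200)                 # uni-modal
    bi_modal = np.concatenate([rng.normal(480, 5, size=100),
                               rng.normal(530, 5, size=100)])    # two modes
    print(deviant_noise_score(well_behaved))  # small
    print(deviant_noise_score(bi_modal))      # much larger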

Actually, while collecting tests for this statistic, I noticed that the t-threshold was set to 7. This is VERY insensitive, literally "off the chart", but it is necessary because many of our performance tests have large deviant noise, which makes the t-test generate false positives.

I would like your opinion on how to proceed:

  1. Add deviant noise as another statistic to be reported by Treeherder - This was the original plan; it allows us to identify performance tests that are near-useless for detecting regressions. Unfortunately, this will effectively double the number of series being tracked by the performance sheriffs. Maybe this is not a problem: the deviant noise statistic may itself be deviant, in which case regression detection on it will not work and we will not care to report deviant noise regressions.
  2. Replace the t-test - The assumptions of the t-test are a problem. We can use the Mann-Whitney U test instead: it is based on rank (rather than value) and is less sensitive to outliers and modality. This would allow us to increase the sensitivity without increasing the false positive rate, and it would increase the number of legitimate performance regressions detected (see the sketch after this list). Is this a good option? Can the sheriffs handle more performance regressions? Also, notice that my use of "false positive" and "legitimate performance regression" is in the statistical sense only: I know we are plagued by performance regressions that, when bisected, point to changes that cannot possibly have caused a slowdown; this project cannot reduce those, only increase them.
  3. Do nothing - Any change in the code will have a business process impact. We should ensure the people and process are in place before we increase the number of regressions detected.
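
To make option 2 concrete, here is a minimal sketch of the behaviour difference on outlier-heavy data. It assumes SciPy and made-up replicate counts; it is not Perfherder's actual alert code.

    # Minimal sketch, not Perfherder's alert code: compare the replicates before
    # and after a push with both tests. A genuine ~5% regression is hidden from
    # the t-test by the outliers (they inflate the variance), while the rank-based
    # Mann-Whitney U test still sees that almost every "after" value outranks the
    # "before" values.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    before = np.append(rng.normal(500, 10, size=11), 900.0)  # one outlier replicate
    after = np.append(rng.normal(525, 10, size=11), 900.0)   # real shift + one outlier

    t_stat, t_p = stats.ttest_ind(before, after, equal_var=False)           # Welch's t-test
    u_stat, u_p = stats.mannwhitneyu(before, after, alternative="two-sided")

    print(f"t-test:       t={t_stat:+.2f}  p={t_p:.3f}")  # p tends to stay large here
    print(f"Mann-Whitney: U={u_stat:.1f}  p={u_p:.4f}")    # p tends to be small here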

Other options? Thoughts?

Flags: needinfo?(sclements)
Flags: needinfo?(kthiessen)
Flags: needinfo?(igoldan)
Flags: needinfo?(dave.hunt)

I'll let Ionut and Dave comment on this, per the discussion in the TH meeting. If interested, we can set up an experiment on prototype over x number of days (but probably point to the production replica or at least update prototype's RDS snapshot to match latest prod) to determine what the impact of changing this would be.

Flags: needinfo?(sclements)

Good discussion in the Treeherder meeting this morning. I agree that Dave and Ionuț are at the sharp end of this, so we should definitely solicit their opinions.

Flags: needinfo?(kthiessen)

(In reply to Kyle Lahnakoski [:ekyle] from comment #0)

> I would like your opinion on how to proceed:
>
>   1. Add deviant noise as another statistic to be reported by Treeherder - This was the original plan; it allows us to identify performance tests that are near-useless for detecting regressions. Unfortunately, this will effectively double the number of series being tracked by the performance sheriffs. Maybe this is not a problem: the deviant noise statistic may itself be deviant, in which case regression detection on it will not work and we will not care to report deviant noise regressions.

I need more examples to better understand how this could integrate into Perfherder. It's good that this allows us to identify near-useless perf tests. What I think is even better is that this is an objective tool that mechanically states "Test X should be revisited": mathematical evidence that a test misbehaves and is likely out of date. This is a very valuable asset for us.
I believe that with a tool like this, we'll quickly burn through the unreliable perf tests, disable them temporarily or permanently, and in time decrease the number of useless alerts.
I understand the t-test isn't the best approach, but deviant noise could aid it quite a lot, right?

>   2. Replace the t-test - The assumptions of the t-test are a problem. We can use the Mann-Whitney U test instead: it is based on rank (rather than value) and is less sensitive to outliers and modality. This would allow us to increase the sensitivity without increasing the false positive rate, and it would increase the number of legitimate performance regressions detected. Is this a good option? Can the sheriffs handle more performance regressions? Also, notice that my use of "false positive" and "legitimate performance regression" is in the statistical sense only: I know we are plagued by performance regressions that, when bisected, point to changes that cannot possibly have caused a slowdown; this project cannot reduce those, only increase them.

I like that we found an alternative to the existing t-test. Still, I believe multi-modality is a perf test illness; data like this shouldn't appear in our charts in the first place. The Mann-Whitney U test sounds like a good workaround, but I would prefer we cut the problem off at the root (by updating or completely removing perf tests). Ill perf tests will give Firefox engineers a really hard time when they try to fix perf regressions, even if we as perf sheriffs correctly identify the culprit bug.

Questions I have here:

  • Assuming we employ the Mann-Whitney U test: could deviant noise go hand-in-hand with it, the way it can with the t-test? If yes, then this is an extra reason to go with the 1st option and postpone the Mann-Whitney U test to a later quarter.
  • What do you mean by increasing the sensitivity? Does this mean sustaining thresholds even lower than 2%? If so, this doesn't weigh in too much, mainly because small regressions tend to be ignored, WONTFIXed, or postponed for a later fix. Developers are often puzzled by them and question our perf sheriffing process and precision.
>   3. Do nothing - Any change in the code will have a business process impact. We should ensure the people and process are in place before we increase the number of regressions detected.

> Other options? Thoughts?

My main requirement is to identify the problematic perf tests, so we can either stabilize or remove them. Perf sheriffs have a very hard time mainly because of the noise, which very likely comes from unstable/unreliable perf tests.
Given this, I'm inclined to go with the 1st option: enhance our existing t-test and clean things up.

Flags: needinfo?(igoldan)
Flags: needinfo?(klahnakoski)
Blocks: 1582711

One complication is that perf tests change behaviour over time; a perf test may exhibit deviant noise one day and be well-behaved the next. Any process that removes problematic tests should be paired with a process that brings them back once they stabilize.

Almost all our performance tests are deviant, but that does not mean they are useless; we are still using the t-test to detect egregious regressions.

We had a meeting to determine some action items.

  1. Create a one-time report that lists performance tests with the most deviant noise, in order from most deviant to least - This will use the TH-readonly database, and may be implemented with reDash as a complicated query that can generate the Perfherder URL, or as a Python program that does the same. This list will be used to get a sense of what deviant noise looks like and to tell us whether deviant noise is useful. It may inform us which tests are currently providing no value.
  2. Create a script that compares the t-test alerts to the Mann-Whitney U alerts - This can be implemented as a React/JSX chart using the Perfherder API to show existing alerts; it could be a modified Perfherder frontend with additional calculation of Mann-Whitney alerts; or it could be a Python script that uses the TH-readonly DB and throws up a simple chart highlighting both types of alerts (see the sketch after this list).
  3. Track deviant noise over time, and alert when it changes - This may be difficult because deviant noise is a higher-order statistic and we do not understand its behaviour; it may be unstable. Larger time ranges will be required, probably days. This will be a Python script run over the TH-readonly database, outputting a list of possible deviant noise "regressions". These "regressions" will be reviewed to determine whether such alerts provide value.
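
As a starting point for item 2, here is a rough sketch of the comparison. It assumes the series has already been extracted from the TH-readonly database into an ordered list of per-push values; the window size and cutoffs are placeholders, not Perfherder's real settings.

    # Rough sketch for action item 2, not production code: slide a window across
    # an ordered series of per-push values and record where each test would alert.
    import numpy as np
    from scipy import stats

    WINDOW = 12          # pushes on each side of the split (placeholder)
    T_THRESHOLD = 7.0    # the current, very insensitive t cutoff
    P_THRESHOLD = 0.001  # placeholder cutoff for the U test

    def candidate_alerts(values):
        values = np.asarray(values, dtype=float)
        t_alerts, u_alerts = [], []
        for i in range(WINDOW, len(values) - WINDOW + 1):
            before = values[i - WINDOW:i]
            after = values[i:i + WINDOW]
            t_stat, _ = stats.ttest_ind(before, after, equal_var=False)
            _, u_p = stats.mannwhitneyu(before, after, alternative="two-sided")
            if abs(t_stat) >= T_THRESHOLD:
                t_alerts.append(i)
            if u_p <= P_THRESHOLD:
                u_alerts.append(i)
        return t_alerts, u_alerts

Plotting both alert lists over the same series (matplotlib, or the React/JSX chart mentioned above) would show where the two tests agree and where the U test fires but the t-test does not.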
Flags: needinfo?(klahnakoski)
Flags: needinfo?(dave.hunt)

We will include a focus on tp6, because that is what we are tracking most.

Priority: -- → P3

There is a Python program that scans Perfherder series, calculates a number of statistics, and assembles a small database. This database can be queried for the best, or the worst, of each statistic, and the corresponding series can be shown.

Code: https://github.com/mozilla/measure-noise/blob/dev/measure_noise/analysis.py
Readme: https://github.com/mozilla/measure-noise/blob/dev/README.md#running-analysis

I have started a new document to summarize the patterns I have seen:
https://docs.google.com/document/d/1TMUhY4zaOA_DV6hzmNY3Z2v9Hm6XF3zlKvWOTRL4xJs/edit

Blocks: 1601952
Blocks: 1543952