Open Bug 1601952 Opened 5 years ago Updated 4 years ago

Use The Mann-Whitney-U test

Categories

(Tree Management :: Perfherder, enhancement, P3)

Tracking

(Not tracked)

People

(Reporter: ekyle, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

As per https://bugzilla.mozilla.org/show_bug.cgi?id=1581533, we plan to replace the t-test with Mann-Whitney-U.

This will act as a drop-in replacement; the business processes should not need to change. The number of alerts will increase, if only because more steps can be detected.
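To illustrate why a rank-based test can detect steps that a mean-based test misses, here is a minimal sketch in plain Python. It is not the Treeherder code: the function names, the sample data, and the choice of a Welch-style t statistic for comparison are all assumptions made for illustration, and the p-value uses the normal approximation without a tie correction.

```python
import math
from statistics import mean, stdev


def mann_whitney_u(before, after):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for `before`, approximate two-sided p-value).
    Tied values receive average ranks; the variance is not tie-corrected,
    so the p-value is only approximate on heavily tied data.
    """
    n1, n2 = len(before), len(after)
    pooled = sorted([(v, 0) for v in before] + [(v, 1) for v in after])
    rank_sum_before = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of equal values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        in_before = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_before += avg_rank * in_before
        i = j
    u = rank_sum_before - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p


def welch_t(before, after):
    """Welch-style t statistic, used here only as a mean-based comparison."""
    v1 = stdev(before) ** 2 / len(before)
    v2 = stdev(after) ** 2 / len(after)
    return (mean(after) - mean(before)) / math.sqrt(v1 + v2)


# Hypothetical samples: a small step upward, with one outlier before the step.
before = [100, 101, 99, 102, 100, 101, 98, 100, 150, 99, 101, 100]
after = [104, 105, 103, 106, 104, 105, 103, 104, 105, 104, 106, 103]

u, p = mann_whitney_u(before, after)
t = welch_t(before, after)
print(f"U={u}, p={p:.5f}, t={t:.3f}")
```

In this made-up data the single outlier pulls the two means almost together, so the t statistic is close to zero, while the ranks still separate the two windows cleanly and the U test reports a small p-value. With scipy available, `scipy.stats.mannwhitneyu(before, after, alternative="two-sided")` performs the same test; its p-value may differ slightly because it applies tie and continuity corrections.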

Blocks: 1265381

I want to understand the scope of work required to implement this. Kyle, could you elaborate on how much work you think is required? Maybe highlight areas that could require discussion or decisions?

Flags: needinfo?(klahnakoski)

This bug represents a single Treeherder patch that adds the algorithm and the supporting libraries (scipy) to replace the t-test logic. The coding effort is no more than one week. The total time will probably be about a month, as we coordinate reviews, work through the PR process, and step through staging.

There may be other work we want to do while we are in that part of the code, so some triage of related bugs should be done before we start. The only formal blocker to this work is an OK from the perf sheriffs, after they have reviewed the comparisons provided by https://github.com/mozilla/measure-noise.

Flags: needinfo?(klahnakoski)

:davehunt, can you determine the priority for this and schedule time to review the comparison that Kyle provided?

Flags: needinfo?(dave.hunt)

:igoldan :sparky can you provide your review feedback for measure-noise? How do you feel that introducing MWU will impact performance sheriffing?

Flags: needinfo?(igoldan)
Flags: needinfo?(gmierz2)
Flags: needinfo?(dave.hunt)

r+ for MWU! I've reviewed Kyle's work on this last quarter and it looks great.

Flags: needinfo?(gmierz2)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #4)

> :igoldan :sparky can you provide your review feedback for measure-noise? How do you feel that introducing MWU will impact performance sheriffing?

I spent almost a day comparing the old Raptor perf alerts with the MWU ones. As Kyle hinted some time ago, MWU generates considerably more alerts. Gut feeling: around 10 times more.

If we have not yet taken any action on expiring perf alerts and alert summaries, we need to consider prioritizing that too. Some of the backend queries (even frequent ones) assume that the tables for these two are smallish (several tens of thousands of rows). Switching to MWU will very likely grow them to more than 100,000 rows, which in turn will increase Perfherder's response times.

Flags: needinfo?(igoldan)
Priority: -- → P3

The larger number of alerts can be mitigated by prioritizing the important alerts. Alerts can be graded on two scales:

  1. The amount of the regression
  2. The statistical confidence there is a regression

Perfherder has per-suite thresholds for #1; they can be raised to capture only the large regressions, and adjusting them is the easiest mitigation. The confidence threshold for #2 can also be tightened to favor the tests with the least noise and the clearest signal, but this is not a useful lever, since MWU is all about seeing performance regressions despite the noise.

A third option, outside the scope of this project, is to sort the performance regressions by some combination of the two scales and let the sheriffs work through the list on a best-effort basis, from the largest, clearest regression on down. This means accepting that a majority of the regressions found will never be looked at.
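The mitigations described above, per-suite magnitude thresholds followed by a combined ranking, could be sketched roughly as follows. Everything here is hypothetical: the `Alert` fields, the suite names, the threshold values, and the scoring formula (magnitude weighted by `-log10(p)`) are illustrative assumptions, not Perfherder's data model or an agreed policy.

```python
import math
from dataclasses import dataclass


@dataclass
class Alert:
    suite: str
    test: str
    pct_change: float  # scale 1: size of the regression, in percent
    p_value: float     # scale 2: statistical confidence (smaller is clearer)


# Hypothetical per-suite minimum regression sizes, in percent.
SUITE_THRESHOLDS = {"suite_a": 2.0, "suite_b": 5.0}
DEFAULT_THRESHOLD = 2.0


def score(alert):
    """Illustrative priority: regression size weighted by -log10(p-value)."""
    # Clamp the p-value so that an exact zero does not break the logarithm.
    return abs(alert.pct_change) * -math.log10(max(alert.p_value, 1e-300))


def triage(alerts):
    """Drop alerts below their suite threshold, then sort by descending score."""
    kept = [
        a for a in alerts
        if abs(a.pct_change) >= SUITE_THRESHOLDS.get(a.suite, DEFAULT_THRESHOLD)
    ]
    return sorted(kept, key=score, reverse=True)


alerts = [
    Alert("suite_a", "test_1", 12.0, 1e-6),
    Alert("suite_a", "test_2", 1.0, 1e-9),   # clear signal, but below threshold
    Alert("suite_b", "test_3", 8.0, 0.01),
    Alert("suite_b", "test_4", 20.0, 0.04),
]

ranked = triage(alerts)
for a in ranked:
    print(a.suite, a.test, round(score(a), 2))
```

Sheriffs would then work from the top of `ranked` downward. The weighting between the two scales is a policy decision; any monotonic combination could be substituted for `score`.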
