Closed Bug 1437895 Opened 6 years ago Closed 6 years ago

Increase PerfStrip alert threshold

Categories

(Tree Management :: Perfherder, enhancement, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: igoldan, Assigned: igoldan)

References

Details

(Whiteboard: [PI:April])

Attachments

(2 files)

Investigate the history of false PerfStrip alerts to come up with a reasonable threshold.

Set it up so as to reduce the huge number of invalid alerts that flow in.
Blocks: 1425845
Whiteboard: [PI:February]
I queried the production database for invalid PerfStrip* alerts, both regressions and improvements.
I came up with this percentage distribution:

<=5%    : 56  invalid alerts
<=8%    : 119 invalid alerts
<=10%   : 131 invalid alerts
<=15%   : 146 invalid alerts
<=20%   : 174 invalid alerts
<=100%  : 320 invalid alerts

I'd say that if we increase the Strings PerfStrip* percentage threshold to 8%, we will silence a big chunk of the false positives and
still keep an eye on the remaining ones. If this change isn't enough, we will further increase the threshold to 20%.

Joel, does this sound reasonable?
Flags: needinfo?(jmaher)
Assignee: nobody → igoldan
either 8 or 10- is this specific to a platform, or for the most part all platforms?
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #2)
> either 8 or 10- is this specific to a platform, or for the most part all
> platforms?

This is for all platforms.
You may also consider increasing the number of previous/next datapoints required via the minBackWindow/maxBackWindow/foreWindow options -- since the distribution of results is bimodal, using a larger number of datapoints increases the chances that the mean result will be somewhere between the two modes.

https://github.com/mozilla/treeherder/blob/master/schemas/performance-artifact.json#L47
(In reply to William Lachance (:wlach) (use needinfo!) from comment #5)
> You may also consider increasing the number of previous/next datapoints
> required via the minBackWindow/maxBackWindow/foreWindow options -- since the
> distribution of results is bimodal, using a larger number of datapoints
> increases the chances that the mean result will be somewhere between the two
> modes.
> 
> https://github.com/mozilla/treeherder/blob/master/schemas/performance-
> artifact.json#L47

To be specific, I would suggest the following values:

* minBackWindow: Increase to 24 (from 12)
* maxBackWindow: Increase to 48 (from 24)
* foreWindow: Increase to 24 (from 12)
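As a sketch of where those options live in the PERFHERDER_DATA payload (field names per the schema linked above; the framework name, suite/subtest names, and value below are illustrative, not taken from an actual log):

```json
{
  "framework": { "name": "platform_microbench" },
  "suites": [
    {
      "name": "Strings",
      "minBackWindow": 24,
      "maxBackWindow": 48,
      "foreWindow": 24,
      "subtests": [
        { "name": "PerfStripWhitespace", "value": 42 }
      ]
    }
  ]
}
```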
(In reply to William Lachance (:wlach) (use needinfo!) from comment #6)
> To be specific, I would suggest the following values:
> 
> * minBackWindow: Increase to 24 (from 12)
> * maxBackWindow: Increase to 48 (from 24)
> * foreWindow: Increase to 24 (from 12)

I've updated the PR with the new window values. Is it OK to merge?
Flags: needinfo?(wlachance)
Priority: -- → P1
Attached file Github PR
Attachment #8952721 - Flags: review?(wlachance)
Comment on attachment 8952721 [details] [review]
Github PR

We should modify the tests themselves to submit this data:

https://searchfox.org/mozilla-central/rev/47cb352984bac15c476dcd75f8360f902673cb98/testing/gtest/mozilla/MozGTestBench.cpp#49

I think generally it is probably correct to increase the window size for all the platform microbench tests, as they tend to be rather unstable (and my guess is that they are sheriffed infrequently enough that it won't matter if we're a little slower to alert, which is all that increasing the window size will cause).

Not sure about how to increase the alerting threshold specifically for perfstrip -- maybe we should increase it across the board here as well? Should probably look into which alertsummary alerts have led to actionable bug reports at this point, and what the change % threshold was there.
Flags: needinfo?(wlachance)
Attachment #8952721 - Flags: review?(wlachance) → review-
:igoldan- is this something you can finish up?
Flags: needinfo?(igoldan)
Whiteboard: [PI:February] → [PI:March]
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #10)
> :igoldan- is this something you can finish up?

To my current understanding, Perfherder doesn't support alert specific thresholds. I need to look more into this.

Still, even if we don't support this yet, we have to. Each alert varies to a certain degree. Because one of them is noisy and misbehaves, it should cause a threshold raise for the others.
Flags: needinfo?(igoldan)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #11)
> misbehaves, it should cause a threshold raise for the others.

Typo: it should not cause a threshold raise
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #11)
> To my current understanding, Perfherder doesn't support alert specific
> thresholds. I need to look more into this.
> 
> Still, even if we don't support this yet, we have to. Each alert varies to a
> certain degree. Because one of them is noisy and misbehaves, it should cause
> a threshold raise for the others.

Perfherder supports this per suite/test via the alertThreshold property:

https://github.com/mozilla/treeherder/blob/master/schemas/performance-artifact.json#L41
here is an example of a build job changing the alert threshold:
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/building/buildbase.py#1565

perfstrip uses a function for printing out the perfherder_data:
https://searchfox.org/mozilla-central/source/testing/gtest/mozilla/MozGTestBench.cpp#49

I would recommend extending that function to include alertThreshold (use a default value) and then for PerfStrip tests you could adjust it to a higher threshold. This is C++; do ask for help if you are not familiar with it.
Comment on attachment 8963889 [details]
Bug 1437895 - Increase *PerfStrip* alert threshold

https://reviewboard.mozilla.org/r/232744/#review238186

This doesn't feel right to me, this sort of pattern always leads to an unsustainable number of special cases in my experience. It also obscures the fact that a particular test may have a non-default threshold by placing the information outside of its definition.

Instead, I think we should modify the function to take a parameter called `alertThreshold`. We could then modify the existing macro to call it with whatever threshold we like (I guess we can start with 2.0, the current default) and add a new macro like this: 

```cpp
#define MOZ_GTEST_BENCH_THRESHOLD_F(suite, test, alertThreshold, lambdaOrFunc) \
TEST_F(suite, test) { \
  mozilla::GTestBench(#suite, #test, alertThreshold, lambdaOrFunc); \
}
```

Then the test macro can just be updated like this:

```cpp
MOZ_GTEST_BENCH_THRESHOLD_F(Strings, PerfStripWhitespace, 8.0, [this] {
```

Incidentally, it looks like the test that motivated this work may have been fixed to be more stable in bug 1449148.
Attachment #8963889 - Flags: review?(wlachance) → review-
Whiteboard: [PI:March] → [PI:April]
Priority: P1 → P2
(In reply to William Lachance (:wlach) (use needinfo!) from comment #16)
> Incidentally it looks like the test that motivated this work may have have
> been fixed to be more stable in bug 1449148.

I agree. I monitored the platform_microbench tests for a week and they indeed seem more stable now.
For the time being, I'm marking this as WONTFIX. If lots of false positives land again, I'll reopen this.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX