Closed Bug 1437895 Opened 6 years ago Closed 6 years ago

Increase PerfStrip alert threshold

Categories

(Tree Management :: Perfherder, enhancement, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: igoldan, Assigned: igoldan)

References

Details

(Whiteboard: [PI:April])

Attachments

(2 files)

Investigate the history of false PerfStrip alerts to come up with a reasonable threshold.

Set it up so as to reduce the huge number of invalid alerts that flow in.
Blocks: 1425845
Whiteboard: [PI:February]
I queried the production database for invalid PerfStrip* alerts, both regressions and improvements.
I came up with this percentage distribution:

<=5%    : 56  invalid alerts
<=8%    : 119 invalid alerts
<=10%   : 131 invalid alerts
<=15%   : 146 invalid alerts
<=20%   : 174 invalid alerts
<=100%  : 320 invalid alerts

I'd say that if we increase the Strings PerfStrip* percentage threshold to 8%, we will silence a big chunk of the false positives and
still keep an eye on the remaining ones. If this change isn't enough, we will further increase the threshold to 20%.

Joel, does this sound reasonable?
Flags: needinfo?(jmaher)
Assignee: nobody → igoldan
either 8 or 10- is this specific to a platform, or for the most part all platforms?
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #2)
> either 8 or 10- is this specific to a platform, or for the most part all
> platforms?

This is for all platforms.
You may also consider increasing the number of previous/next datapoints required via the minBackWindow/maxBackWindow/foreWindow options -- since the distribution of results is bimodal, using a larger number of datapoints increases the chances that the mean result will be somewhere between the two modes.

https://github.com/mozilla/treeherder/blob/master/schemas/performance-artifact.json#L47
(In reply to William Lachance (:wlach) (use needinfo!) from comment #5)
> You may also consider increasing the number of previous/next datapoints
> required via the minBackWindow/maxBackWindow/foreWindow options -- since the
> distribution of results is bimodal, using a larger number of datapoints
> increases the chances that the mean result will be somewhere between the two
> modes.
> 
> https://github.com/mozilla/treeherder/blob/master/schemas/performance-
> artifact.json#L47

To be specific, I would suggest the following values:

* minBackWindow: Increase to 24 (from 12)
* maxBackWindow: Increase to 48 (from 24)
* foreWindow: Increase to 24 (from 12)
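As a sketch of where those options live in the PERFHERDER_DATA payload (field names per the schema linked above; the framework name, suite/subtest names, and value below are illustrative, not taken from an actual log):

```json
{
  "framework": { "name": "platform_microbench" },
  "suites": [
    {
      "name": "Strings",
      "minBackWindow": 24,
      "maxBackWindow": 48,
      "foreWindow": 24,
      "subtests": [
        { "name": "PerfStripWhitespace", "value": 42 }
      ]
    }
  ]
}
```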
(In reply to William Lachance (:wlach) (use needinfo!) from comment #6)
> To be specific, I would suggest the following values:
> 
> * minBackWindow: Increase to 24 (from 12)
> * maxBackWindow: Increase to 48 (from 24)
> * foreWindow: Increase to 24 (from 12)

I've updated the PR with the new window values. Is it OK to merge?
Flags: needinfo?(wlachance)
Priority: -- → P1
Attached file Github PR
Attachment #8952721 - Flags: review?(wlachance)
Comment on attachment 8952721 [details] [review]
Github PR

We should modify the tests themselves to submit this data:

https://searchfox.org/mozilla-central/rev/47cb352984bac15c476dcd75f8360f902673cb98/testing/gtest/mozilla/MozGTestBench.cpp#49

I think generally it is probably correct to increase the window size for all the platform microbench tests, as they tend to be rather unstable (and my guess is that they are sheriffed infrequently enough that it won't matter if we're a little slower to alert, which is all that increasing the window size will cause).

Not sure about how to increase the alerting threshold specifically for perfstrip -- maybe we should increase it across the board here as well? Should probably look into which alertsummary alerts have led to actionable bug reports at this point, and what the change % threshold was there.
Flags: needinfo?(wlachance)
Attachment #8952721 - Flags: review?(wlachance) → review-
:igoldan- is this something you can finish up?
Flags: needinfo?(igoldan)
Whiteboard: [PI:February] → [PI:March]
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #10)
> :igoldan- is this something you can finish up?

To my current understanding, Perfherder doesn't support alert specific thresholds. I need to look more into this.

Still, even if we don't support this yet, we have to. Each alert varies to a certain degree. Because one of them is noisy and misbehaves, it should cause a threshold raise for the others.
Flags: needinfo?(igoldan)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #11)
> misbehaves, it should cause a threshold raise for the others.

Typo: it should not cause a threshold raise
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #11)
> To my current understanding, Perfherder doesn't support alert specific
> thresholds. I need to look more into this.
> 
> Still, even if we don't support this yet, we have to. Each alert varies to a
> certain degree. Because one of them is noisy and misbehaves, it should cause
> a threshold raise for the others.

Perfherder supports this per suite/test via the alertThreshold property:

https://github.com/mozilla/treeherder/blob/master/schemas/performance-artifact.json#L41
here is an example of a build job changing the alert threshold:
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/building/buildbase.py#1565

perfstrip uses a function for printing out the perfherder_data:
https://searchfox.org/mozilla-central/source/testing/gtest/mozilla/MozGTestBench.cpp#49

I would recommend extending that function to include alertThreshold (use a default value) and then for PerfStrip tests you could adjust it to a higher threshold. This is C++; do ask for help if you are not familiar with it.
Comment on attachment 8963889 [details]
Bug 1437895 - Increase *PerfStrip* alert threshold

https://reviewboard.mozilla.org/r/232744/#review238186

This doesn't feel right to me, this sort of pattern always leads to an unsustainable number of special cases in my experience. It also obscures the fact that a particular test may have a non-default threshold by placing the information outside of its definition.

Instead, I think we should modify the function to take a parameter called `alertThreshold`. We could then modify the existing macro to call it with whatever threshold we like (I guess we can start with 2.0, the current default) and add a new macro like this: 

```cpp
#define MOZ_GTEST_BENCH_THRESHOLD_F(suite, test, alertThreshold, lambdaOrFunc) \
TEST_F(suite, test) { \
  mozilla::GTestBench(#suite, #test, alertThreshold, lambdaOrFunc); \
}
```

Then the test macro can just be updated like this:

```cpp
MOZ_GTEST_BENCH_THRESHOLD_F(Strings, PerfStripWhitespace, 8.0, [this] {
```

Incidentally, it looks like the test that motivated this work may have been fixed to be more stable in bug 1449148.
Attachment #8963889 - Flags: review?(wlachance) → review-
Whiteboard: [PI:March] → [PI:April]
Priority: P1 → P2
(In reply to William Lachance (:wlach) (use needinfo!) from comment #16)
> Incidentally it looks like the test that motivated this work may have have
> been fixed to be more stable in bug 1449148.

I agree. I monitored the platform_microbench tests for a week and they indeed seem more stable now.
For the time being, I'm marking this as WONTFIX. If lots of false positives land again, I'll reopen this.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX