Closed
Bug 1437895
Opened 7 years ago
Closed 7 years ago
Increase PerfStrip alert threshold
Categories
(Tree Management :: Perfherder, enhancement, P2)
Tree Management
Perfherder
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: igoldan, Assigned: igoldan)
References
Details
(Whiteboard: [PI:April])
Attachments
(2 files)
Investigate the history of false PerfStrip alerts, to come up with a reasonable threshold.
Set it up, so to reduce the huge amount of invalid alerts that flow in.
Updated•7 years ago
|
Whiteboard: [PI:February]
Assignee | ||
Comment 1•7 years ago
|
||
I queried the production database for invalid PerfStrip* alerts, both regressions and improvements.
I came up with this percentage distribution:
<=5% : 56 invalid alerts
<=8% : 119 invalid alerts
<=10% : 131 invalid alerts
<=15% : 146 invalid alerts
<= 20% : 174 invalid alerts
<= 100% : 320 invalid alerts
I'd say that if we increase the Strings PerfStrip* percentage threshold to 8%, we will silence a big chunk of falseys and
still keep an eye on other ones. If this change won't be enough, then we will further increase the threshold to 20%.
Joel, does this sound reasonable?
Flags: needinfo?(jmaher)
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → igoldan
Comment 2•7 years ago
|
||
either 8 or 10- is this specific to a platform, or for the most part all platforms?
Flags: needinfo?(jmaher)
Assignee | ||
Comment 3•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #2)
> either 8 or 10- is this specific to a platform, or for the most part all
> platforms?
This is for all platforms.
Assignee | ||
Comment 4•7 years ago
|
||
The PR is here https://github.com/mozilla/treeherder/pull/3243
Comment 5•7 years ago
|
||
You may also consider increasing the number of previous/next datapoints required via the minBackWindow/maxBackWindow/foreWindow options -- since the distribution of results is bimodal, using a larger number of datapoints increases the chances that the mean result will be somewhere between the two modes.
https://github.com/mozilla/treeherder/blob/master/schemas/performance-artifact.json#L47
Comment 6•7 years ago
|
||
(In reply to William Lachance (:wlach) (use needinfo!) from comment #5)
> You may also consider increasing the number of previous/next datapoints
> required via the minBackWindow/maxBackWindow/foreWindow options -- since the
> distribution of results is bimodal, using a larger number of datapoints
> increases the chances that the mean result will be somewhere between the two
> modes.
>
> https://github.com/mozilla/treeherder/blob/master/schemas/performance-
> artifact.json#L47
To be specific, I would suggest the following values:
* minBackWindow: Increase to 24 (from 12)
* maxBackWindow: Increase to 48 (from 24)
* foreWindow: Increase to 24 (from 12)
Assignee | ||
Comment 7•7 years ago
|
||
(In reply to William Lachance (:wlach) (use needinfo!) from comment #6)
> (In reply to William Lachance (:wlach) (use needinfo!) from comment #5)
> > You may also consider increasing the number of previous/next datapoints
> > required via the minBackWindow/maxBackWindow/foreWindow options -- since the
> > distribution of results is bimodal, using a larger number of datapoints
> > increases the chances that the mean result will be somewhere between the two
> > modes.
> >
> > https://github.com/mozilla/treeherder/blob/master/schemas/performance-
> > artifact.json#L47
>
> To be specific, I would suggest the following values:
>
> * minBackWindow: Increase to 24 (from 12)
> * maxBackWindow: Increase to 48 (from 24)
> * foreWindow: Increase to 24 (from 12)
I've updated the PR with the new window values. Is it OK for merge?
Flags: needinfo?(wlachance)
Assignee | ||
Updated•7 years ago
|
Priority: -- → P1
Assignee | ||
Comment 8•7 years ago
|
||
Attachment #8952721 -
Flags: review?(wlachance)
Comment 9•7 years ago
|
||
Comment on attachment 8952721 [details] [review]
Github PR
We should modify the tests themselves to submit this data:
https://searchfox.org/mozilla-central/rev/47cb352984bac15c476dcd75f8360f902673cb98/testing/gtest/mozilla/MozGTestBench.cpp#49
I think generally it is probably correct to increase the window size for all the platform microbench tests, as they tend to be rather unstable (and my guess is that they are sheriffed infrequently enough that it won't matter if we're a little slower to alert, which is all that increasing the window size will cause).
Not sure about how to increase the alerting threshold specifically for perfstrip -- maybe we should increase it across the board here as well? Should probably look into which alertsummary alerts have led to actionable bug reports at this point, and what the change % threshold was there.
Flags: needinfo?(wlachance)
Attachment #8952721 -
Flags: review?(wlachance) → review-
Comment 10•7 years ago
|
||
:igoldan- is this something you can finish up?
Flags: needinfo?(igoldan)
Whiteboard: [PI:February] → [PI:March]
Assignee | ||
Comment 11•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #10)
> :igoldan- is this something you can finish up?
To my current understanding, Perfherder doesn't support alert specific thresholds. I need to look more into this.
Still, even if we don't support this yet, we have to. Each alert varies to a certain degree. Because one of them is noisy and misbehaves, it should cause a threshold raise for the others.
Flags: needinfo?(igoldan)
Assignee | ||
Comment 12•7 years ago
|
||
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #11)
> misbehaves, it should cause a threshold raise for the others.
Typo: it should not cause a threshold raise
Comment 13•7 years ago
|
||
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #11)
> (In reply to Joel Maher ( :jmaher) (UTC-5) from comment #10)
> > :igoldan- is this something you can finish up?
>
> To my current understanding, Perfherder doesn't support alert specific
> thresholds. I need to look more into this.
>
> Still, even if we don't support this yet, we have to. Each alert varies to a
> certain degree. Because one of them is noisy and misbehaves, it should cause
> a threshold raise for the others.
Perfherder supports this per suite/test via the alertThreshold property:
https://github.com/mozilla/treeherder/blob/master/schemas/performance-artifact.json#L41
Comment 14•7 years ago
|
||
here is an example of a build job changing the alert threshold:
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/building/buildbase.py#1565
perfstrip uses a function for printing out the perfherder_data:
https://searchfox.org/mozilla-central/source/testing/gtest/mozilla/MozGTestBench.cpp#49
I would recommend extending that function to include alertThresholds (use a default value) and then for PerfStrip tests you could adjust it to a higher threshold. This is C++, do ask for help if you are not familiar with it.
Comment hidden (mozreview-request) |
Comment 16•7 years ago
|
||
mozreview-review |
Comment on attachment 8963889 [details]
Bug 1437895 - Increase *PerfStrip* alert threshold
https://reviewboard.mozilla.org/r/232744/#review238186
This doesn't feel right to me, this sort of pattern always leads to an unsustainable number of special cases in my experience. It also obscures the fact that a particular test may have a non-default threshold by placing the information outside of its definition.
Instead, I think we should modify the function to take a parameter called `alertThreshold`. We could then modify the existing macro to call it with whatever threshold we like (I guess we can start with 2.0, the current default) and add a new macro like this:
```cpp
#define MOZ_GTEST_BENCH_THRESHOLD_F(suite, test, alertThreshold, lambdaOrFunc) \
TEST_F(suite, test) { \
mozilla::GTestBench(#suite, #test, alertThreshold, lambdaOrFunc); \
}
```
Then the test macro can just be updated like this:
```cpp
MOZ_GTEST_BENCH_F(Strings, PerfStripWhitespace, 8.0, [this] {
```
Incidentally it looks like the test that motivated this work may have have been fixed to be more stable in bug 1449148.
Attachment #8963889 -
Flags: review?(wlachance) → review-
Updated•7 years ago
|
Whiteboard: [PI:March] → [PI:April]
Assignee | ||
Updated•7 years ago
|
Priority: P1 → P2
Assignee | ||
Comment 17•7 years ago
|
||
(In reply to William Lachance (:wlach) (use needinfo!) from comment #16)
> Incidentally it looks like the test that motivated this work may have have
> been fixed to be more stable in bug 1449148.
I agree. I monitored the platform_microbench tests for a week and indeed seem more stable now.
For the time being, marking this as WONTFIX. In case lots of falseys land again, I'll reopen this.
Assignee | ||
Updated•7 years ago
|
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•