Bug 1581533 Comment 3 Edit History

(In reply to Kyle Lahnakoski [:ekyle] from comment #0)
> I would like your opinion on how to proceed:
> 
> 1. Add deviant noise as another statistic to be reported by Treeherder - This was the original plan, and it allows us to identify performance tests that are near-useless for detecting regressions. Unfortunately, this will effectively double the number of series being tracked by the performance sheriffs. Maybe this is not a problem: the deviant noise statistic may also be deviant, which means regression detection does not work, and we do not care to report on deviant noise regressions.

I need more examples to better understand how this could integrate into Perfherder. It's good that this allows us to identify near-useless perf tests. What I think is even better is that this is an objective tool that mechanically states "Test X should be revisited": mathematical evidence that a test misbehaves & is likely out of date. This is a very valuable asset for us.
I believe that with a tool like this, we'll quickly burn through the unreliable perf tests, disabling them temporarily or permanently, & in time decrease the number of useless alerts.
I understand the t-test isn't the best approach, but deviant noise could aid it quite a lot, right? A rough sketch of the kind of check I have in mind is below.
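
To make option 1 more concrete for myself, here's a rough sketch of the kind of per-series check I'm imagining, using scipy's normality test purely as a stand-in for the actual deviant-noise statistic (the function name, threshold & data below are hypothetical, not the measure-noise API):

```python
# A sketch only: scipy's normality test as a stand-in for a real
# deviant-noise score; names, thresholds and data are hypothetical.
import numpy as np
from scipy import stats

def looks_deviant(series, alpha=0.05):
    """Flag a perf series whose noise looks far from the roughly-normal
    noise the t-test assumes."""
    _, p_value = stats.normaltest(series)  # D'Agostino-Pearson normality test
    return p_value < alpha

# Hypothetical data: a well-behaved series vs. a clearly bimodal one.
rng = np.random.default_rng(0)
clean = rng.normal(100, 2, size=200)
bimodal = np.concatenate([rng.normal(95, 1, 100), rng.normal(105, 1, 100)])
print(looks_deviant(clean), looks_deviant(bimodal))  # typically: False True
```

A per-test score along these lines seems like what would let us mechanically generate the "Test X should be revisited" list.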

> 2. Replace the t-test - The assumptions of the t-test are a problem. We can use the [Mann-Whitney U test](https://github.com/mozilla/measure-noise/issues/6) instead: It is based on rank (rather than value) and is less sensitive to outliers and modality. This will allow us to increase the sensitivity without increasing the false positive rate. It will increase the number of legitimate performance regressions detected. Is this a good option? Can the sheriffs handle more performance regressions? Also, notice my use of "false positive" and "legitimate performance regressions" is in the statistical sense only: I know we are plagued by performance regressions that, when bisected, point to changes that cannot possibly have caused a slowdown - this project cannot reduce those, only increase them.

I like that we found an alternative to the existing t-test. Still, I believe multi-modality is a perf test illness: data like this shouldn't appear in our charts in the first place. The Mann-Whitney U test sounds like a good workaround *(see the sketch right after this paragraph)*, but I'd prefer we simply cut the problem off at the root *(by updating or completely removing perf tests)*. Unhealthy perf tests will give Firefox engineers a really hard time when trying to fix perf regressions, even if we as perf sheriffs correctly identify the culprit bug.
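
To check my own understanding of the rank-vs-value point, here's a minimal comparison on made-up data (plain scipy, not Perfherder code) showing how a single wild measurement can hide a real shift from the t-test but not from the U test:

```python
# Hypothetical data: a real ~3% regression plus one wild outlier in the
# "after" sample; compare Welch's t-test with the Mann-Whitney U test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
before = rng.normal(100.0, 2.0, size=30)   # baseline pushes
after = rng.normal(103.0, 2.0, size=30)    # ~3% slower
after = np.append(after, 400.0)            # one wild measurement

t_p = stats.ttest_ind(before, after, equal_var=False).pvalue
u_p = stats.mannwhitneyu(before, after, alternative="two-sided").pvalue
print(f"t-test p={t_p:.3f}, Mann-Whitney U p={u_p:.2g}")
# The outlier inflates the variance and can wash out the t-test's signal,
# while the rank-based U test still sees that "after" values rank higher.
```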

Questions I have here:
* Assuming we employ the Mann-Whitney U test: could deviant noise go hand-in-hand with it, the way it can with the t-test? If yes, then this is an extra reason to go with the 1st option & postpone the Mann-Whitney U test to a later quarter.
* What do you mean by *increase the sensitivity*? Does this mean sustaining thresholds even lower than 2% (see the toy example after this list)? If yes, this doesn't carry much weight for me, mainly because small regressions tend to be ignored, WONTFIXed or postponed for a later fix. Developers are often puzzled by them & question our perf sheriffing process & precision.
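
As a toy illustration of the threshold question in the second bullet (the deltas and thresholds are made up; this is not Perfherder's actual alerting logic):

```python
# Hypothetical observed percentage changes for a handful of pushes.
observed_pct_changes = [0.4, 0.8, 1.2, 1.7, 2.3, 3.1, 5.0]

def alerting(deltas, threshold_pct):
    """Return the deltas that would raise an alert at the given magnitude threshold."""
    return [d for d in deltas if abs(d) >= threshold_pct]

print(len(alerting(observed_pct_changes, 2.0)))  # 3 alerts at the current ~2% bar
print(len(alerting(observed_pct_changes, 1.0)))  # 5 alerts if the bar drops to 1%
```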

> 3. Do nothing - Any change in the code will have a business process impact. We should ensure the people and process are in place before we increase the number of regressions detected.
> 
> Other options? Thoughts?

My main requirement is to identify the problematic perf tests, so we can either stabilize or remove them. Perf sheriffs have a very hard time mainly because of the noise, which very likely comes from unstable/unreliable perf tests.
Given this, I'm inclined to go with the 1st option: enhance our existing t-test & clean things out.
