Closed Bug 1406878 Opened 3 years ago Closed 3 years ago

17.32 - 25.65% perf_reftest_singletons (linux64, osx-10-10, windows10-64, windows7-32) regression on push 3e85f0761fc9ec42f8cc0ef57ad3e27e8127323b (Sat Oct 7 2017)

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set
normal

Tracking

RESOLVED WONTFIX
Tracking Status
firefox58 --- affected

People

(Reporter: igoldan, Unassigned)

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=3e85f0761fc9ec42f8cc0ef57ad3e27e8127323b

As you are the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 26%  perf_reftest_singletons summary linux64 opt e10s      24.68 -> 31.02
 25%  perf_reftest_singletons summary linux64 pgo e10s      23.07 -> 28.92
 23%  perf_reftest_singletons summary windows7-32 opt e10s  25.86 -> 31.67
 19%  perf_reftest_singletons summary windows7-32 pgo e10s  22.54 -> 26.86
 19%  perf_reftest_singletons summary windows10-64 pgo e10s 23.45 -> 27.92
 19%  perf_reftest_singletons summary windows10-64 opt e10s 25.71 -> 30.60
 17%  perf_reftest_singletons summary osx-10-10 opt e10s    25.44 -> 29.85


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=9869

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → Layout: Text
Product: Firefox → Core
Isn't that simply because a new test is added? I think this is an INVALID or WONTFIX.
I think for perf_reftest_singletons, the subtests should be tracked separately, rather than bundling together like this. This is both useless and misleading, and hard to catch real regressions.
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #1)
> Isn't that simply because a new test is added? I think this is an INVALID or
> WONTFIX.
I was just about to ask for that. I'm marking this as WONTFIX then.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #2)
> I think for perf_reftest_singletons, the subtests should be tracked
> separately, rather than bundling together like this. This is both useless
> and misleading, and hard to catch real regressions.

Thanks for this suggestion. I agree and will stick with it.
:xidorn, we discussed this, and it would be too much noise to track individual test results- this gives us some signal and we use a geometric mean which catches most of the real sustained regressions.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #5)
> :xidorn, we discussed this, and it would be too much noise to track
> individual test results- this gives us some signal and we use a geometric
> mean which catches most of the real sustained regressions.

I don't quite understand. Are you calculating the geometric mean of all subtests and then using the percentage difference between the two means? That doesn't make sense: we currently have 16 subtests, so a single subtest can regress by up to about 17% (1.01^16 - 1) without triggering a 1% regression alert. Maybe you could raise the ratio of the summary results to the power of the number of subtests; that would probably make more sense.

I wonder why it is too much noise. Perhaps the few very quick tests vary significantly?
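
To make the arithmetic above concrete, here is a minimal Python sketch. It assumes the summary is a plain geometric mean over 16 subtests, as described in this thread. The helper `geometric_mean` and the baseline value of 25.0 are hypothetical and are not taken from Talos or Perfherder.

```python
import math

def geometric_mean(values):
    """Geometric mean computed via logs to avoid overflow on long lists."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical baseline: 16 subtests that all report the same value.
baseline = [25.0] * 16

# Regress exactly one subtest by a factor of 1.01^16 (about +17.3%).
regressed = list(baseline)
regressed[0] *= 1.01 ** 16

single_change = regressed[0] / baseline[0] - 1.0
summary_ratio = geometric_mean(regressed) / geometric_mean(baseline)
summary_change = summary_ratio - 1.0

# Raising the summary ratio to the number of subtests recovers the size of
# the regression if it were concentrated in a single subtest.
equivalent_single_change = summary_ratio ** len(baseline) - 1.0

print(f"single subtest change: {single_change:.1%}")
print(f"summary change:        {summary_change:.1%}")
print(f"equivalent single:     {equivalent_single_change:.1%}")
```

Under these assumptions the summary moves by only about 1% while one subtest moves by about 17%, which is the dilution effect described above.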
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #7)
> noise == pestering developers more frequently, lack of sheriffing resources
> - we have 900 tests we track so far it is not sustainable for Mozilla to add
> 240 more tests that we track.

I think you misunderstood me. I said that subtests should be tracked separately specifically for perf_reftest_singletons, so that's 16 more, not 240.

Other talos tests are basically general performance tests, and we care about them as a whole. But each single test in perf_reftest_singletons is pretty much testing a very specific optimization, and it doesn't make much sense to mix them together.

I'm not sure I understand the additional burden of having finer-grained tracking here. I suppose alerts are triggered by the CI directly, and perf sheriffs would track them in some dashboard thing? I guess we can have some script scan the history of the subtests of perf_reftest_singletons and see whether they are really noisy and would add significantly more tracking work for perf sheriffs. I suspect they are not.

And given that a single test may regress by 17% without triggering even a 1% alert at the moment, I think as a compromise we can set a larger tolerance range for those subtests so that they alert less.
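
As a rough illustration of that compromise, the sketch below flags individual subtests only when they exceed a wider threshold. Everything here is an assumption for illustration: the function name `subtests_over_threshold`, the 10% threshold, the subtest names, and the sample values are hypothetical, and this is not how Perfherder generates alerts. It also assumes that higher values are worse, as in the summary numbers in this bug.

```python
def subtests_over_threshold(before, after, threshold=0.10):
    """Return {subtest name: relative change} for subtests that got worse
    by more than `threshold`. Subtests missing from `before` (newly added
    ones) are skipped, since there is no baseline to compare against."""
    flagged = {}
    for name, old in before.items():
        new = after.get(name)
        if new is None or old <= 0:
            continue
        change = new / old - 1.0
        if change > threshold:
            flagged[name] = change
    return flagged

# Hypothetical subtest results (names and values are made up).
before = {"subtest-a": 20.0, "subtest-b": 30.0, "subtest-c": 40.0}
after = {
    "subtest-a": 23.4,    # +17%: flagged
    "subtest-b": 30.9,    # +3%: within tolerance
    "subtest-c": 40.0,    # unchanged
    "subtest-new": 90.0,  # newly added: no baseline, not flagged
}

flagged = subtests_over_threshold(before, after)
print(flagged)
```

A per-subtest check like this would not raise an alert merely because a new subtest appeared, at the cost of ignoring small regressions below the chosen threshold.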

> here is an example of adding 1 test to perf_reftests which generated a 25%
> regression alert:
> https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=52748bb525f2a7aac2d82647a6d41b16c873a245&newProject=autoland&newRevision=3e85f0761fc9ec42f8cc0ef57ad3e27e8127323b&originalSignature=d816936ecd2474b13579e9e9426c4e92c0c4d3a7&newSignature=d816936ecd2474b13579e9e9426c4e92c0c4d3a7&framework=1

This is a pretty good example that the current approach is problematic for perf_reftest_singletons.

There isn't any real regression here. It's just a new test added. And this kind of annoyance would happen whenever someone adds a new subtest, which is expected to happen more in the future as we make more optimizations.

And having perf sheriffs file regression alerts for this kind of thing is a waste of both sheriffing resources and developer time.
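
The effect of adding a subtest on the summary can be sketched as follows. This assumes the summary is a geometric mean over all subtests. The values are hypothetical, chosen only so that the result is easy to check by hand, and are not the real subtest results from this push.

```python
import math

def geometric_mean(values):
    """Geometric mean computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical: 15 existing subtests, none of which change.
existing = [10.0] * 15

# Hypothetical new subtest that is much slower than the existing ones.
new_subtest = 10.0 * 1.25 ** 16

summary_before = geometric_mean(existing)
summary_after = geometric_mean(existing + [new_subtest])
summary_shift = summary_after / summary_before - 1.0

print(f"summary before: {summary_before:.2f}")
print(f"summary after:  {summary_after:.2f}")
print(f"summary shift:  {summary_shift:.1%}")
```

With these made-up numbers the summary rises by 25% even though no existing subtest changed, which is the kind of alert discussed in this bug.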
We have 15 for perf_reftest + 15 for singletons, so 30 additional data points. We run these on:
* linux64-stylo, linux64-stylo-disabled
* macosx-stylo, macosx-stylo-disabled
* win7-stylo, win7-stylo-disabled
* win10-stylo, win10-stylo-disabled

That is 8 configurations * 30 == 240 new tests to track.