Closed Bug 1317048 Opened 8 years ago Closed 8 years ago

2.41 - 19.34% glterrain / tp5o / tp5o Main_RSS / tsvgx (linux64, windows7-32, windows8-64, windowsxp) regression on push b2b359340a84 (Fri Nov 11 2016)

Categories

(Core :: Layout, defect)

52 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
firefox50 --- unaffected
firefox51 --- unaffected
firefox52 --- fixed
firefox53 --- fixed

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression)

Attachments

(2 obsolete files)

Talos has detected a Firefox performance regression from push b2b359340a84. As author of one of the patches included in that push, we need your help to address this regression. Regressions: 19% tsvgx summary windows8-64 opt e10s 124.41 -> 148.47 14% glterrain summary windows7-32 opt 5.17 -> 5.89 6% tp5o summary linux64 opt e10s 365.92 -> 388.3 6% tp5o summary windowsxp opt e10s 361.63 -> 382.72 6% tp5o summary windows7-32 pgo e10s 261.03 -> 275.41 5% tp5o summary windows7-32 opt e10s 361.95 -> 381.21 5% tp5o summary windowsxp pgo e10s 263.62 -> 277.02 3% tp5o Main_RSS windows7-32 opt e10s 117090362.53 -> 120678008.11 2% tp5o Main_RSS linux64 pgo e10s 217991940.95 -> 223251491.31 You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=4087 On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format. To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running *** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! *** Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:dholbert- can you comment on this issue. I know there are some expected regressions to be seen on talos with this landing, but are all of these expected? I am concerned that almost all of these are e10s vs non-e10s.
Flags: needinfo?(dholbert)
(In reply to Joel Maher ( :jmaher) from comment #1) > :dholbert- can you comment on this issue. I know there are some expected > regressions to be seen on talos with this landing, but are all of these > expected? I think they're expected. I wasn't involved with the investigation of these the last time this landed, but here's what I can say: - This patch is known to regress pageload time a bit (but brings an improvement in perceived performance), since we now start painting sooner while we're loading content. (bug 1283302 comment 16) So content *appears* sooner, but it finishes loading slightly later. - This slight-pageload-slowdown caused talos regressions when we tried to land it before, and I believe the outcome of the investigation last time was that they're an acceptable cost for the improved-perceived-perf win. (Specifically, the reported talos regressions before were ("4% tp5o regression", "13% glterrain regression on win7" -- bug 1283302 comment 9, bug 1283302 comment 11; and the decision was noted in bug 1283302 comment 37.) - I think we still need to get feedback from jgilbert on what the glterrain regression means. Looks like jet had brought this up to him, per bug 1244978 comment 20, but I'm not sure he ever weighted in. But absent extra information from him, I'm willing to bet that (as with the other tests) we've simply been completing part of the test without actually doing any painting at all, which may have artificially inflated our performance. And now we'll be spending cycles on painting a bit earlier in the test-duration, and hence look like we're doing not quite as well. - The tsvgx regression (which wasn't reported or at least wasn't mentioned the last time this patch landed AFAIK) is extremely believable & acceptable for this change. As noted at https://wiki.mozilla.org/Buildbot/Talos/Tests#tsv , it tests SVG paint performance, "in unlimited frame-rate mode" (i.e. no 60-frames-per-second clamp). It's very believable that paint-suppression was *artificially* making us look great on that test, by skipping frequent paints at the beginning of each pageload in the test -- an now that artificial boost is removed. > I am concerned that almost all of these are e10s vs non-e10s. That is an interesting observation, but I think it makes sense. My hypothesis: With e10s (as compared to non-e10s)... - With e10s, we get to the "business" of these tests quicker (and have stuff ready to be painted quicker). - ...so e10s was skipping more paints during the paint-suppression time period than non-e10s. - ...so when we remove paint suppression, e10s configurations will experience a bigger relative regression on these tests.
Flags: needinfo?(dholbert) → needinfo?(jgilbert)
(jgilbert, the needinfo for you here is: does it make sense that glterrain would have (kinda-artificially) improved results, as a result of us skipping paints for the first 250ms of pageload-time? Because that seems to be the implication here. If that makes sense, then I think we've got nothing to worry about, basically -- though we should of course watch for feedback from users as bug 1283302 makes its way through the trains.)
(In reply to Daniel Holbert [:dholbert] from comment #3) > (jgilbert, the needinfo for you here is: does it make sense that glterrain > would have (kinda-artificially) improved results, as a result of us skipping > paints for the first 250ms of pageload-time? Because that seems to be the > implication here. If that makes sense, then I think we've got nothing to > worry about, basically -- though we should of course watch for feedback from > users as bug 1283302 makes its way through the trains.) Note that the GL test shouldn't be measuring anything for at least 500ms after page load completes: http://hg.mozilla.org/mozilla-central/file/tip/testing/talos/talos/tests/webgl/benchmarks/terrain/perftest.html#l297 The test sequence is: - Page load - Render the scene once - wait (currently) 500ms - animate the scene for 130 frames in ASAP mode (no vsync, unlimited framerate). - Take the measurement over the last 100 rendered frames (so the first 30 are not measured).
Also, Look at the subtests comparison: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=affa5fe3fa4f87e68167ef2d89cf59d69cbc9165&newProject=mozilla-inbound&newRevision=b2b359340a84abda881e038d535dd90ac2fa58aa&originalSignature=b932eaa1fe05753b0a217489e875feb2783b6267&newSignature=b932eaa1fe05753b0a217489e875feb2783b6267&framework=1 0.WebGL-terrain-alpha-no-AA-no opt 4.17 ± 2.60% < 4.63 ± 5.61% 10.78% 1.WebGL-terrain-alpha-no-AA-yes opt 7.12 ± 2.71% < 7.49 ± 5.61% 5.20% 2.WebGL-terrain-alpha-yes-AA-no opt 3.67 ± 2.68% < 4.33 ± 7.44% 18.01% 3.WebGL-terrain-alpha-yes-AA-yes opt 6.61 ± 2.45% < 6.97 ± 5.51% 5.47% All of the four subtests run on the same page at the order they're displayed above, so the load event fires only before the first of those. Seems the biggest regression is when AA is disabled (10%, 18%), and between those, the bigger is when alpha is enabled (18%). When AA is enabled, regardless of alpha, the regression is ~5%. If I had to guess, I'd have said the regressions are not uniform enough on one hand to suggest a per-frame regression (fake or not), and don't happen only before the first subtest after the page loads on the other hand, so I'd guess it's not a penalty for something right after the page loads.
thanks for all the comments in this bug.
I see the tsvgx regression on aurora now: https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=%5Bmozilla-inbound,2a94379c1bce14b854c6fddec17d4e319b2f2199,1,1%5D&series=%5Bmozilla-aurora,801468cb00bf0ca29ad9135a05a3bcfcdba8d480,1,1%5D&selected=%5Bmozilla-inbound,2a94379c1bce14b854c6fddec17d4e319b2f2199,143858,39087043,1%5D I also noticed that tsvgx is very noisy as a result of this bug when looking at the above graph. I haven't looked over the other regressions, but ran across this while investigating another issue.
Given comment 5 (particularly the bits about the 500ms delay and the dropped frames), I'm at a loss for explaining the glterrain regression. I've been playing around with it locally and I *think* I can reproduce the regression (on Windows 10), though only in a debug build. (Maybe because it depends on "slowish hardware", and a debug build gets me more in the direction of slowish hardware.) It pains me to say this, but I suspect we should: (a) Backout bug 1283302 on Nightly and Aurora. (b) Come up with & land patches to fix* these talos tests (at least tsvgx and glterrain). (c) Re-land bug 1283302. Part (b) is a bit hand-wavy -- "fixing" these tests might involve: - adding a short delay after test startup (though glterrain already has this) - making all of the content initially "display:none" so that we get to an "all frames constructed" state pretty quickly [which IIRC cancels the paint-suppression timer]. Superficially, it seems like this should remove any possibility for paint suppression to influence behavior. - OR, finding & fixing actual bugs in the code that might be causing these regressions here. (Hopefully there aren't any such bugs, but that's always a possibility). These changes might trigger talos regressions themselves, but hopefully they'll be better-understood & in the direction of making the tests more reliable/stable-in-what-they-depend-on.
And to be clear, the reason I think we should backout is: - I don't think we understand these regressions well enough to be sure they're expected (particularly glterrain, where my naive explanation from comment 2 seems insufficient given avih's response). - Given that lack-of-understanding, we can't rule out the possibility that this is a legitimate regression (perhaps due to second-order downstream effects of painting sooner). - I think we need more dedicated poking/tweaking of the glterrain talos test to determine whether the regression is expected or not. - I suspect/hope it's possible to fix the tests (and/or code) such that this change *won't* regress/noisify glterrain & tsvgx -- and it'd be much cleaner to make those tweaks *first*, and then land bug 1283302 on top of that new talos baseline, to be able to follow the talos trends, rather than the strawman alternative option of keeping bug 1283302 in the tree & landing tweaks on top of it.
FWIW, you can run this test in your browser also outside of talos. Just visit http://hg.mozilla.org/mozilla-central/raw-file/tip/testing/talos/talos/tests/webgl/benchmarks/terrain/perftest.html to run all 4 subtests (reload to re-run). Note that in talos it runs in ASAP mode so it won't hit a 16.67 ms wall as it's likely to in normal vsync mode. But ASAP can be set in a normal browser too. In talos, ASAP mode is the following pref values: layout.frame_rate = 0 docshell.event_starvation_delay_hint = 1
We do not want to back out this change for WebGL reasons. We will have a look at the potential regression here, but from WebGL's standpoint, we'd prefer leaving it in-tree.
Flags: needinfo?(jgilbert)
(In reply to Jeff Gilbert [:jgilbert] from comment #11) > We do not want to back out this change for WebGL reasons. We will have a > look at the potential regression here, but from WebGL's standpoint, we'd > prefer leaving it in-tree. Ah! You're too late -- already backed out over on bug 1283302. --> resolving this as FIXED by that backout. (hopefully at least, unless something else crept in) In any case, per comment 8 / 9, I'd be much more comfortable if we had a better understanding of the regression, or at least a tweak to the test (e.g. adding some extra delay) so that it doesn't report a regression. And I think/hope we can get to a better understanding in the short term -- I'll plan on spending some time over the next week with some Try runs to learn more, and hopefully we'll be in a position to re-land next week. (and then maybe backport to Aurora) Independent of the glterrain concerns, the backout still gives us an opportunity to tweak tsvgx to hopefully address comment 7 & then perhaps re-land this without having a regression reported there.
(In reply to Daniel Holbert [:dholbert] from comment #12) > (In reply to Jeff Gilbert [:jgilbert] from comment #11) > > We do not want to back out this change for WebGL reasons. We will have a > > look at the potential regression here, but from WebGL's standpoint, we'd > > prefer leaving it in-tree. > > Ah! You're too late -- already backed out over on bug 1283302. --> resolving > this as FIXED by that backout. (hopefully at least, unless something else > crept in) > > In any case, per comment 8 / 9, I'd be much more comfortable if we had a > better understanding of the regression, or at least a tweak to the test > (e.g. adding some extra delay) so that it doesn't report a regression. And > I think/hope we can get to a better understanding in the short term -- I'll > plan on spending some time over the next week with some Try runs to learn > more, and hopefully we'll be in a position to re-land next week. (and then > maybe backport to Aurora) > > Independent of the glterrain concerns, the backout still gives us an > opportunity to tweak tsvgx to hopefully address comment 7 & then perhaps > re-land this without having a regression reported there. That's fine! I meant I don't want to back out any WebGL changes at this point.
Fixed by backout.
Status: NEW → RESOLVED
Closed: 8 years ago
Component: Untriaged → Layout
Product: Firefox → Core
Resolution: --- → FIXED
For future reference, dholbert, jmaher and myself had a discussion on IRC regarding the tsvgx side of the regression. Joel helped with some new results and we noticed that the regressions are only with the "plain" page load tests (the ones which don't animate anything). dholbert mentioned that the issue is mainly due to ASAP mode which tsvgx uses, which combined with the "offending" patch induce a very big number of paints during page load which affect the plain page load numbers, and also add noise. Since we only need ASAP for animation tests (in tsvgx and in general), it was suggested that we should move the plain page load subtests of tsvgx into their own new test - where ASAP is not enabled. If that still proves too noisy for our taste due to the added intermediate paints, and while keeping in mind that the main objective here is to measure svg rendering/processing performance, we could go further and implement an approach where instead of plain load time measurement, the pages will deploy a script which reports its own measurement, where it first renders a blank page, then renders the svg content, then measures and reports the result. We haven't discussed the glterrain regressions.
I have to run, but please file a new bug for the tsvgx changes ("Fork tsvgx into static & non-static"), as this bug's already closed (as fixed-by-backout). And please s/non-hixie/static/ in the commit message. Thanks!
Flags: needinfo?(npancholi)
Attachment #8812000 - Attachment is obsolete: true
Attachment #8812000 - Flags: review?(dholbert)
Attachment #8812028 - Attachment is obsolete: true
Attachment #8812028 - Flags: review?(dholbert)
(In reply to Daniel Holbert [:dholbert] from comment #17) > I have to run, but please file a new bug for the tsvgx changes ("Fork tsvgx > into static & non-static"), as this bug's already closed (as > fixed-by-backout). > > And please s/non-hixie/static/ in the commit message. Thanks! As per dholbert's request, the new bug that I filed is: Bug 1318530 Thanks!
Flags: needinfo?(npancholi)
See Also: → 1319458
Summarizing the final outcome for comment 0's regressions here: * The "regressing" patch was backed out, but I intend to re-land it soon. * tsvgx regression: dealt with (compartmentalized + hopefully-reduced) via bug 1318530 * glterrain regressions: will be dealt with (to the extent that we do anything) in bug 1319458 * Small tpo regressions: expected / OK
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: