Closed Bug 1469258 Opened 6 years ago Closed 6 years ago

12.6 - 51.94% displaylist_mutate / glterrain (linux64-qr, windows10-64-qr) regression on push 7fbf8a9126e5 (Fri Jun 15 2018)

Categories

(Core :: Graphics: WebRender, defect, P2)

62 Branch
defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox60 --- unaffected
firefox61 --- unaffected
firefox62 --- disabled
firefox63 --- affected

People

(Reporter: igoldan, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression, Whiteboard: [gfx-noted])

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=c1d664fefdff80e066dde0140c22e8ff74fe8694&tochange=7fbf8a9126e54addb86d01ff6ec1e62e7fda18e5

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 52%  glterrain windows10-64-qr opt e10s stylo     1.41 -> 2.14
 19%  displaylist_mutate windows10-64-qr opt e10s stylo4,115.04 -> 4,911.79
 13%  displaylist_mutate linux64-qr opt e10s stylo 4,832.74 -> 5,441.80


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=13840

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: General → Graphics: WebRender
Product: Testing → Core
The perf regressions were caused by this landing: https://bugzilla.mozilla.org/show_bug.cgi?id=1467096#c13

:kats What are the plans for these? Can we fix them now?
Flags: needinfo?(bugmail)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #1)
> And here are the Gecko profiles:
> 
> For Windows 10 Quantum Render:
> 
> Before:
> https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.
> net%2Fv1%2Ftask%2FEHVHvaCgQoeQz1n-
> 47LfOQ%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_displaylist_muta
> te.zip
> 
> After:
> https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.
> net%2Fv1%2Ftask%2FAZh-2Z3rQE-
> 7pUKEkbVwZg%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_displaylist
> _mutate.zip

These are for the displaylist_mutate perf test.
The regression was expected (see bug 1467096 comment 10) and the webrender issue at servo/webrender#2807 will help with this. I plan to work on that this week.
Blocks: 1426770, 1416645
Flags: needinfo?(bugmail)
Quick update: I wrote a patch for servo/webrender#2807 and it helped with displaylist_mutate but regressed tresize. I'm going to investigate that a bit more.
I looked into the tresize regression and found a the root of the problem. tresize doesn't run in ASAP mode, so the thing it's measuring (average time between window.resize call to the following MozAfterPaint) is affected in large part by the amount of work that happens between the vsync that triggers composition, and the end of the composition. If we spend extra time doing stuff *before* that vsync it doesn't really factor into the end result, except if it causes a whole frame delay.

So, currently the sequence of events is approximately this:
(a) window.resize
(b) DL build
(c) WR scene build
(d) schedule composite
(e) WR render (for rebuilding hit-testing info)
(f) (wait for vsync)
(g) WR render (triggered via GenerateFrame transaction in WRBP::CompositeToTarget)
(h) composite and fire MozAfterPaint

With my patches, step (e) changes so that instead of doing a full render we just rebuild the hit-testing info. I verified that this does take an order of magnitude less time at step (e). However, it also *increases* the time spent in step (g), presumably because that render has to do more work and can't just reuse the results from (e). So while the amount of work done *before* the vsync is reduced, that doesn't affect the end result, but the small increase in work done *after* the vsync has a disproportionate impact on the test result.

I verified that turning on ASAP mode for this test locally removes the regression, but I'll do a try push to confirm that since I don't know how far I can trust my local results.
On tryserver there's still a small regression with the test in ASAP mode when comparing with my patches [1] and without [2]. Not sure why yet.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=52fc13269c5d050d325526f649590d063957eaaf
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=57055a302cb756ff5935ae80b7a6d46a2c5ede72
assuming this is high priority since kats is working on it
Assignee: nobody → bugmail
Priority: -- → P1
Whiteboard: [gfx-noted]
It's not that high.
Blocks: stage-wr-trains
No longer blocks: stage-wr-nightly
Priority: P1 → P2
(flipping bits, on the assumption this WR nightly-only for the time being)
I rebased my patches for servo/webrender#2807 and kicked off another try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca9520ec0b1ea14f97f7260d58ff5ccfd7860200. I'll see if there's still talos regressions from this.
It still hurts tresize on windows.
Kat is on leave for 3 more weeks.  Un-assigning so we know it needs a new owner.
Assignee: kats → nobody
Priority: P2 → P3
Priority: P3 → P2
Kats, is there anything we want to still do with this bug?
Flags: needinfo?(kats)
Not raelly, I think we can close it. We have existing bugs tracking the talos regressions compared to non-WR. And the patches I was working on earlier were obsoleted by stuff nical landed which accomplished the same thing, I believe.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(kats)
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.