Closed Bug 1234932 Opened 8 years ago Closed 8 years ago

2-7% Linux 64/Win7 tp5o_scroll/tscrollx e10s regression on Mozilla-Inbound on dec 23, 2015 from push bb5becd378f4

Categories

(Core :: Layout, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX
Tracking Status
e10s ? ---
firefox46 --- unaffected

People

(Reporter: jmaher, Assigned: mstange)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][e10s])

Talos has detected a Firefox performance regression from your commit bb5becd378f4 in bug 1147673.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=bb5becd378f40a9be14e4e635d7034f2835fc7b5&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5o_scroll

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux64,win32 -u none -t g1  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a tp5o_scroll

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
here is a compare view of this push and the previous:
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=ccc320a19060&newProject=mozilla-inbound&newRevision=bb5becd378f4&framework=1

you can see a lot of noise there, but if you look at the alerts we actually got:
http://alertmanager.allizom.org:8080/alerts.html?rev=bb5becd378f40a9be14e4e635d7034f2835fc7b5&showAll=1

you can see those reflected in the related graphs and subtests.

:mstange, can you look into this and determine why we are seeing the scrolling regressions?
Flags: needinfo?(mstange)
marking as e10s only regression.
Summary: 2-7% Linux 64/Win7 tp5o_scroll/tscrollx regression on Mozilla-Inbound on dec 23, 2015 from push bb5becd378f4 → 2-7% Linux 64/Win7 tp5o_scroll/tscrollx e10s regression on Mozilla-Inbound on dec 23, 2015 from push bb5becd378f4
Whiteboard: [talos_regression] → [talos_regression][e10s]
I'm investigating.

Part of this regression is expected: Whenever there is text in a transparent layer, we're now using a component alpha layer, as we were without APZ. Component alpha layers are about twice as expensive to render as regular RGBA layers. The fact that APZ wasn't using component alpha layers in the past was a bug and hurt the visual quality of the text.

What surprises me about this regression is that it also seems to affect pages that don't have component alpha layers. That shouldn't happen and is something we should be able to fix.
Assignee: nobody → mstange
Status: NEW → ASSIGNED
Flags: needinfo?(mstange)
oh great.  Do ask for help on anything related to talos as needed!
Any updates on this?
Flags: needinfo?(mstange)
Not much - I haven't really worked on this bug in the past days. Today I triggered two try builds that skip the sites that have component alpha layers during their tp5o_scroll talos tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=512db7c3f129 [before]
https://treeherder.mozilla.org/#/jobs?repo=try&revision=612a57e35723 [after]
If this still shows a regression in tp5o_scroll, I need to investigate more. If not, then the regression is entirely due to the fact that we now have better text rendering, and we can wontfix this bug.
Flags: needinfo?(mstange)
(In reply to Markus Stange [:mstange] from comment #6)
> Not much - I haven't really worked on this bug in the past days. Today I
> triggered two try builds that skip the sites that have component alpha
> layers during their tp5o_scroll talos tests:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=512db7c3f129 [before]
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=612a57e35723 [after]
> If this still shows a regression in tp5o_scroll, I need to investigate more.
> If not, then the regression is entirely due to the fact that we now have
> better text rendering, and we can wontfix this bug.

FWIW, if you want to dig into which sites regressed on a measure like tp5o_scroll, just hover over the test you're interested in on the main compare view and click "subtests". It will show a breakdown of how each test changed. Here's what the original regression looks like:

https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=ccc320a19060&newProject=mozilla-inbound&newRevision=bb5becd378f4&originalSignature=9762e35db1360d056dedab42265a11812c0e111c&newSignature=9762e35db1360d056dedab42265a11812c0e111c
Right, yes, I didn't write down the whole story.

The breakdown shows that lots of pages are affected, not only those with component alpha layers. I don't have any explanation for that, other than that the component alpha pages might somehow, through delayed overhead, have affectected the measurements of the non-component-alpha pages. The new try pushes test that hypothesis.
I see tscrollx on winxp on aurora as a regression related to this bug.  still validating the other regressions.
Bug 1249976, which I just duped over, was initially filed for the tscrollx regression, and was marked a P1 e10s perf bug because we don't want to regress scroll performance if we can. Putting this back in the triage queue.
Just to confirm, is this regression caused by APZ-related patches?
Flags: needinfo?(mstange)
Yes, this is an APZ regression. The patch that caused it made APZ quality match non-APZ quality.

So the regression that we should track is the one between non-e10s and post-bug 1147673 e10s+apz. The comparison between non-e10s and pre-bug 1147673 e10s+apz was not valid because the latter didn't have subpixel text antialiasing.

I've confirmed that this regression here is really only due to component alpha layers (necessary for subpixel text antialiasing in moving layers on transparent backgrounds). If I exclude all pages that have component alpha layers from being part of the test, then there is no regression. Thus, this regression is an expected effect of bug 1147673 and I'm going to WONTFIX it.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Flags: needinfo?(mstange)
Resolution: --- → WONTFIX
I don't understand what comment 13 means. Could you clarify which of the following you mean:

A: This was a transient regression, and with the landing of bug 1147673 there is no more regression, so there's nothing left to track.

B: There is a real regression caused by e10s+apz, only in pages with component alpha layers. We have no plans to fix this regression, so resolving WONTFIX.

Or something else?
Flags: needinfo?(mstange)
Sorry about the confusion.

C: This regression was part of a real regression caused by e10s+apz. This real regression is caused by the bigger display port and we should try to fix it by downsizing the display port, which is being done in other bugs. However, this part of the regression is not something that we can fix on its own.

I'll try to explain some more, I hope this will make it a little clearer.

There are four states we can compare here:
A1: non-e10s, component alpha layers disabled
A2: non-e10s, component alpha layers enabled
B1: e10s+APZ, component alpha layers disabled
B2: e10s+APZ, component alpha layers enabled

If we want to look at regressions caused by APZ, we need to either look at the comparison A1/B1 or at the comparison B1/B2.
A1 never existed. Pre-APZ, A2 was the current state. Post-APZ and pre-bug 1147673, B1 was the current state due to that bug, and it was never a state we were considering to ship. And now post-bug 1147673, B2 is the current state.

This bug here is about the regression caused by B1->B2. It's not a very meaningful regression to look at. We should be looking at A2->B2 instead.

Does that clear things up?
Flags: needinfo?(mstange)
We also need to consider this config:

C2: e10s without APZ, component alpha enabled

The e10s team is considering whether we should ship e10s without APZ at all, because of numerous and unexplained talos regressions. So if this bug is part of the regression in a comparison C2->B2 then we'd want to either reopen this or just turn off APZ completely until the basic regression is fixed.
Yes, we definitely need to address the regression C2->B2. We're not going to address it by turning off component alpha layers. That's why having this particular bug around is not useful.
ok, that makes sense, thanks
Marking unaffected for 46 since e10s will not be enabled on 46 release.
You need to log in before you can comment on or make changes to this bug.