Closed Bug 1266755 Opened 8 years ago Closed 8 years ago

2.02 - 55.46% damp / tp5o / tp5o % Processor Time / tp5o Main_RSS / tp5o Private Bytes / tps (linux64, osx-10-10, windows7-32, windows8-64, windowsxp) regression on push 1fd853200c99 (Mon Apr 18 2016)

Categories

(Core :: Panning and Zooming, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox48 --- fixed

People

(Reporter: jmaher, Assigned: kats)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][gfx-noted])

Talos has detected a Firefox performance regression from push 1fd853200c99 . As author of one of the patches included in that push, we need your help to address this regression.

This is a list of all known regressions and improvements related to the push:
https://treeherder.mozilla.org/perf.html#/alerts?id=904

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see:
https://wiki.mozilla.org/Buildbot/Talos/Tests#DAMP
https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5o_scroll
https://wiki.mozilla.org/Buildbot/Talos/Tests#tscrollx
https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5
https://wiki.mozilla.org/Buildbot/Talos/Tests#tps

Reproducing and debugging the regression:

If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux64,win32,win64,macosx64 -u none -t g2-e10s[Windows XP,Windows 7,Windows 8,10.10],g1-e10s[Windows XP,Windows 7,Windows 8,10.10],svgr-e10s[Windows XP,Windows 7,Windows 8,10.10],tp5o-e10s[Windows XP,Windows 7,Windows 8,10.10] --rebuild 5  # add "mozharness: --spsProfile" to generate profile data

(we suggest --rebuild 5 to be more confident in the results)

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.lorg/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e [path]/firefox -a damp:tp5o_scroll:tscrollx:tp5o:tps

(add --e10s to run tests in e10s mode)

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations:

https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → Panning and Zooming
Product: Firefox → Core
:kats, I had done some retriggers on the pushes and compare view matches up fairly well with the above alerts:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=c2239c623e86&newProject=mozilla-inbound&newRevision=1fd853200c99&framework=1

My concern here is the e10s regressions- we are tracking e10s and starting to sign off on the comparisons of e10s vs non e10s.

can you take a look at this and help us figure it out.  There is a nice tp5o_scroll win :)
Flags: needinfo?(bugmail.mozilla)
Yeah, I'll take a look. My first guess is that sending a repaint request instead of the scroll acknowledgement actually triggers some repaints in some cases, which might explain the regression. I'll do some try pushes to isolate this.
Assignee: nobody → bugmail.mozilla
Flags: needinfo?(bugmail.mozilla)
Whiteboard: [talos_regression] → [talos_regression][gfx-noted]
great :kats, thanks for jumping on this!
I think I might just back out bug 1248078 for now because we're just a weekend away from 48 going up to aurora. Since 48 is a potential e10s release train I don't want to leave these regressions on that train in case it takes me a while to track this down. And bug 1248078 isn't a critical fix, so I don't feel strongly that it must ride in 48, I can reland it in 49.
I backed out bug 1248078, but will leave this open until I track down where the regression came from.
First try push [1] (linux only because every other platform has giant wait times) is based on the regressing cset (1fd853200c99) and comments out the two new needContentRepaint=true lines. This causes some failures, and a lot of the tests are noisy, but we do see that tp5o e10s goes back down to close to it's original value (regression was 8.35% [2], win was -8.24% [3]). So that validates my theory in comment 2, but there's two "needContentRepaint = true" lines, so I need to figure out which one is causing the problem and then see if I can avoid the repaint there.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b553f1b0dc0
[2] https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=c2239c623e86&newProject=mozilla-inbound&newRevision=1fd853200c99&framework=1&filter=linux64%20tp5o%20opt%20e10s&showOnlyImportant=0
[3] https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=1fd853200c99&newProject=try&newRevision=8b553f1b0dc0b2f20882827651528f1e037be0a4&framework=1&filter=linux64%20tp5o%20opt%20e10s&showOnlyConfident=1
:kats you are quick
Second try push [1] again is based on the regressing cset, just comments out one of the needContentRepaint=true lines. Unlike the previous try push, the tp5o score remains regressed, so this change doesn't appear to be responsible. The other needContentRepaint=true (the one causing the regression) is the one at [2], and I think it should be safe to remove entirely.

This line gets run when the APZ first pulls in metrics via NotifyLayersUpdated. Sending an AcknowledgeScrollUpdate was necessary here, because after this point APZC might want to send repaint requests for new scroll offsets, and if the scroll generation wasn't in sync those would get rejected. However, if the repaint request automatically carries a new generation counter (which is what the patch does) then the AcknowledgeScrollUpdate isn't needed at all. Also there should be no need to do an actual repaint at this point, except in the condition explicitly handled a few lines down (where the displayport is stale). So I think we should be able to just remove that repaint request.

Try push at [3] to verify the talos improvements. I'll add some jobs there to verify correctness as well. Assuming all is well, I'll reland the changes next week (on 49) with the offending lines removed.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=2ac2624f91ee&selectedJob=19867715
[2] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/AsyncPanZoomController.cpp?rev=d430b697c76d#3459
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=2b9b7699723c
tracking-e10s: --- → ?
I fixed this by backing out the regressing cset, so I'm closing this bug. I'll reland that cset with the modification described above which should avoid the regression.
Status: NEW → RESOLVED
Closed: 8 years ago
tracking-e10s: ? → ---
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Version: unspecified → Trunk
You need to log in before you can comment on or make changes to this bug.