Closed Bug 1256667 Opened 5 years ago Closed 5 years ago

2.18 - 5.4% sessionrestore_no_auto_restore / tart / ts_paint (windows7-32, windows8-64, windowsxp) regression on push b5621f52feab (Mon Mar 7 2016)

Categories

(Firefox :: Session Restore, defect)

Version: 48 Branch
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
Firefox 48
Tracking Status
firefox48 --- fixed

People

(Reporter: jmaher, Assigned: myk)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(1 file)

Talos has detected a Firefox performance regression from push b5621f52feab. As the author of one of the patches in that push, we need your help to address this regression.

This is a list of all known regressions and improvements related to the push:

https://treeherder.mozilla.org/perf.html#/alerts?id=299

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see:

https://wiki.mozilla.org/Buildbot/Talos/Tests#ts_paint
https://wiki.mozilla.org/Buildbot/Talos/Tests#TART.2FCART
https://wiki.mozilla.org/Buildbot/Talos/Tests#sessionrestore.2Fsessionrestore_no_auto_restore

Reproducing and debugging the regression:

If you would like to re-run this Talos test on a potential fix, use try with the following syntax:

try: -b o -p win32,win64 -u none -t other-e10s[Windows 7,Windows 8,Windows XP],svgr-e10s[Windows 7,Windows 8,Windows XP] --rebuild 5  # add "mozharness: --spsProfile" to generate profile data

(we suggest --rebuild 5 to be more confident in the results)

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:

https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:

talos --develop -e [path]/firefox -a ts_paint:tart:sessionrestore_no_auto_restore

(add --e10s to run tests in e10s mode)

Making a decision:

As the patch author, we need your feedback to help us handle this regression.
*** Please let us know your plans by Friday, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations:

https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:myk, I love code removal!  I am surprised that we ended up with some regressions.  The ts_paint ones caught me at first, then I found session_restore, and then tart.  What is nice is that our startup file I/O bytes have been reduced!

Here is a breakdown of the tests; you can see the subtests from here:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=31655275a522&newProject=mozilla-inbound&newRevision=b5621f52feab&framework=1

Do you have any thoughts on this?  It seems to affect e10s more than non-e10s.
Flags: needinfo?(myk)
Component: Untriaged → Session Restore
(In reply to Joel Maher (:jmaher) from comment #1)
> :myk, I love code removal!  I am surprised that we ended up with some
> regressions.  the ts_paint ones caught me at first, then I found
> session_restore, and then tart.  What is nice is that our startup file io
> bytes have been reduced!
> 
> here is a breakdown of the tests, you can see the subtests from here:
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-
> inbound&originalRevision=31655275a522&newProject=mozilla-
> inbound&newRevision=b5621f52feab&framework=1
> 
> do you have any thoughts on this?  this seems to affect e10s, and not the
> non-e10s as much.

Hmm, it's puzzling, as I wouldn't have expected that patch to produce any perf regressions, much less those specific ones!  I'll investigate further, and I've cc:ed two colleagues who may also be able to help.
(In reply to Joel Maher (:jmaher) from comment #0)
> If you would like to re-run this Talos test on a potential fix, use try with
> the following syntax:
> 
> try: -b o -p win32,win64 -u none -t other-e10s[Windows 7,Windows 8,Windows
> XP],svgr-e10s[Windows 7,Windows 8,Windows XP] --rebuild 5  # add
> "mozharness: --spsProfile" to generate profile data

I re-read through the patch, and I didn't see anything that could account for the regressions.  In the meantime, I also scheduled a tryserver run, using your suggested syntax, on the tip of central (as of a couple days ago, the last time I pulled it):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9cf657accbe8

And then I scheduled one on a branch off the tip that reverts the removal of the desktop web runtime:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0b4a790dc4ff

Only a single run for each test suite has completed so far, but the comparison between the two revisions shows no significant difference on any of the tests:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newRevision=0b4a790dc4ff&framework=1&showOnlyImportant=0

If the rest of the test runs confirm those results, then the desktop web runtime removal is unlikely to have caused this regression.  Unfortunately, I don't have a good alternative explanation!  Perhaps the runtime's removal did cause these regressions, but one of the half-dozen-or-so followup patches we've landed since then has clawed back that performance?

I also wonder if there's any chance of an infrastructure issue here.  I don't know talos/treeherder/perfherder well, and the perfherder comparison in question clearly compares the runtime removal revision to its parent revision, which should mean that no other change can be involved.  But its grandparent revision sure looks like the kind of change that could affect paint perf, upon superficial examination, anyway:

https://hg.mozilla.org/integration/mozilla-inbound/rev/65f225f7d0db

Is there any chance that the comparison might have entrained that change?

(FWIW, I'm also building and testing the original revision and its parent on my Windows 7 x64 VM to see if I can reproduce the regression there, although I'm not sure how reliable its results can be.)
Assignee: nobody → myk
Status: NEW → ASSIGNED
Flags: needinfo?(myk)
Myk, thanks for jumping on this and getting try pushes.  Now that we have data, I see the backout shows a clear win on ts_paint for win7/win8 with e10s.

There were concerns that this could be infrastructure or random noise; thanks for thinking of that.  To rule that out, I retriggered the jobs on the original push a few times, as well as on the push before it.  This gives us a full set of Talos data with plenty of data points tied to just the change in question.

What I don't see are the larger sessionrestore or tart improvements.  I do see what might be related improvements on the backout, but either there isn't enough data, or other things have landed in the last week that improved whatever this change influenced.

As to why this removal of code is causing ts_paint issues, I am not clear.
(In reply to Joel Maher (:jmaher) from comment #4)
> As to why this removal of code is causing ts_paint issues, I am not clear.

IIRC there were some concerns that ts_paint wasn't being measured correctly with e10s; was that fixed or dismissed?
:avih, could you comment on e10s + ts_paint?
Flags: needinfo?(avihpit)
(In reply to Marco Castelluccio [:marco] from comment #5)
> (In reply to Joel Maher (:jmaher) from comment #4)
> > As to why this removal of code is causing ts_paint issues, I am not clear.
> 
> IIRC there were some concerns that ts_paint wasn't correctly measured on
> e10s, was that fixed/dismissed?

I'm not really familiar with ts_paint, but according to bug 1250620 comment 11, I don't think there are such concerns, unless you know otherwise.
Flags: needinfo?(avihpit)
(In reply to Avi Halachmi (:avih) from comment #7)
> I'm not really familiar with ts_paint, but according to bug 1250620 comment
> 11, I don't think there are such concerns, unless you know otherwise.

Indeed.  I just read through bug 1174767 and bug 1254628, and there was a question about what ts_paint should be measuring (time to paint just browser chrome or time to paint both browser chrome and the initial web page).  Currently, it measures the latter, but some folks think it should measure the former (or that there should be measures of both).  And that might affect the meaning of ts_paint in the future, but it shouldn't affect a comparison between ts_paint results that are both using e10s today.

There was also a concern about whether the use of MozAfterPaint in ts_paint (and other Talos tests) would correctly measure the time to paint the initial web page with e10s.  And jimm confirmed that it does in bug 1254628, comment 2.  So ts_paint should function as expected with e10s.
Although I removed browser chrome and didn't add any new chrome, I theorized that the removal might somehow trigger a chrome paint regression.  So I reverted those browser chrome removals:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d5fd199c2a5

So far only Windows 8 has run tests six times, but it throws cold water on the theory, showing no change in ts_paint:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newRevision=9d5fd199c2a5&framework=1&showOnlyImportant=0

Another theory is that removing WebappManager triggered the regression.  nsBrowserGlue.js, which is evaluated on app-startup, lazily imports WebappManager.jsm on final-ui-startup in order to call WebappManager.init().

WebappManager.jsm then eagerly imports a variety of other modules.  I'm not sure when Talos starts timing ts_paint, but if it somehow manages to happen during/after final-ui-startup without e10s and before final-ui-startup with e10s (because it happens in the child process with e10s), then that could account for this.
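The timing theory above can be illustrated with a small sketch (plain Node.js, not Firefox chrome code; the names are hypothetical stand-ins for WebappManager.jsm and its init): a lazy module getter defers a module's evaluation cost until the first property access, so removing such an import shifts *when* that work happens during startup relative to what a test like ts_paint measures.

```javascript
"use strict";

// Minimal lazy-getter pattern: the property loads its module on first access,
// then replaces itself with the loaded value so the cost is paid exactly once.
function defineLazyGetter(obj, name, loader) {
  Object.defineProperty(obj, name, {
    configurable: true,
    get() {
      const value = loader(); // the "module evaluation" cost happens here, on demand
      Object.defineProperty(obj, name, { value, writable: false });
      return value;
    },
  });
}

let loads = 0;
const globals = {};
defineLazyGetter(globals, "WebappManager", () => {
  loads += 1; // stands in for evaluating WebappManager.jsm and its own imports
  return { init() { return "initialized"; } };
});

console.log(loads);                        // 0: defining the getter costs nothing
console.log(globals.WebappManager.init()); // first access triggers the load
console.log(loads);                        // 1: loaded exactly once
globals.WebappManager.init();
console.log(loads);                        // still 1: subsequent accesses are free
```

With a pattern like this, deleting the `init()` call doesn't just remove work; it removes a load that used to happen at a specific point in startup, which is consistent with a timing-sensitive test moving even though code was only removed.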

So I reverted WebappManager removal in this tryserver run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f4e7db088e59

I'm still waiting on the results, which will show up here:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newRevision=f4e7db088e59&framework=1&showOnlyImportant=0
(In reply to Myk Melez [:myk] [@mykmelez] from comment #9)
> I'm still waiting on the results, which will show up here:
> 
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newR
> evision=f4e7db088e59&framework=1&showOnlyImportant=0

It looks like reintroducing WebappManager.jsm does claw back the perf hit, although it still isn't clear why.  It could be that it eagerly imports a module that is otherwise imported at a time when it affects ts_paint more.

Here's a try run on a branch that makes BrowserGlue._finalUIStartup() import the same modules that WebappManager.jsm used to import (modulo the ones that nsBrowserGlue.js has already imported):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=15a51c0d5921
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newRevision=15a51c0d5921&framework=1&showOnlyImportant=0

If that doesn't claw back the regression, then I'll look into WebappManager.init() next.
(In reply to Myk Melez [:myk] [@mykmelez] from comment #10)
> Here's a try run on a branch that makes BrowserGlue._finalUIStartup() import
> the same modules that WebappManager.jsm used to import (modulo the ones that
> nsBrowserGlue.js has already imported):
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=15a51c0d5921
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newR
> evision=15a51c0d5921&framework=1&showOnlyImportant=0

The results confirm that importing those modules claws back the regression.  Now to figure out which one(s) in particular are responsible:

AppsUtils:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7c40e45bacdc
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newRevision=7c40e45bacdc&framework=1&showOnlyImportant=0

PrivateBrowsingUtils:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=037a9ba8c776
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newRevision=037a9ba8c776&framework=1&showOnlyImportant=0

Promise:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ff19a56665de
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newRevision=ff19a56665de&framework=1&showOnlyImportant=0

Task:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=16777f4ac80e
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newRevision=16777f4ac80e&framework=1&showOnlyImportant=0

Webapps:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f41524351204
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newRevision=f41524351204&framework=1&showOnlyImportant=0
(In reply to Myk Melez [:myk] [@mykmelez] from comment #11)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=f41524351204&filter-tier=1&filter-tier=2&filter-tier=3
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf657accbe8&newProject=try&newRevision=f41524351204&framework=1

The various try runs all came back inconclusive except for this one, which appears to account for all (or at least a significant portion) of the regression.  So here's a patch that evaluates Webapps.jsm early in order to claw it back.

I'd still like to understand why the regression occurs.  And I wonder whether it's possible to do more than claw it back and actually improve ts_paint by avoiding the evaluation of Webapps.jsm entirely.  But that's a separate concern from the regression itself, so I'm requesting review on this patch to get us back to where we were, after which I'll investigate Webapps.jsm evaluation further in a followup.
Attachment #8731889 - Flags: review?(dtownsend)
Thanks Myk for looking into this and finding an area to focus on!
Comment on attachment 8731889 [details] [diff] [review]
evaluate Webapps.jsm early

Review of attachment 8731889 [details] [diff] [review]:
-----------------------------------------------------------------

Confusing :s
Attachment #8731889 - Flags: review?(dtownsend) → review+
Blocks: 1257898
(In reply to Dave Townsend [:mossop] from comment #14)
> Confusing :s

Indeed!  I filed followup bug 1257898 to figure out and resolve the underlying issue.
https://hg.mozilla.org/mozilla-central/rev/e4efc904501f
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 48
As a note, we see the improvements we expected:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=77caf1897d7a&newProject=mozilla-inbound&newRevision=e4efc904501f&framework=1

and a few regressions, but these effectively back out the improvements seen when the original patch landed.  So I think there are no concerns here.
Version: unspecified → 48 Branch