1356138 - 2.42 - 10.26% sessionrestore / sessionrestore_no_auto_restore (windows7-32, windows8-64) regression on push e8efc3c5d61f386cc64f8ee82d6739cee71fd205 (Wed Apr 12 2017)

Reporter

Description

•

7 years ago

Talos has detected a Firefox performance regression from push e8efc3c5d61f386cc64f8ee82d6739cee71fd205. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 10%  sessionrestore windows8-64 opt e10s     738.92 -> 814.75
  9%  sessionrestore_no_auto_restore windows8-64 opt e10s766.67 -> 838.33
  2%  sessionrestore_no_auto_restore windows7-32 pgo e10s686.12 -> 702.75


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=5972

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling

Ionuț Goldan [:igoldan]

Reporter

Updated

•

7 years ago

Component: Untriaged → Graphics

Product: Firefox → Core

Ionuț Goldan [:igoldan]

Reporter

Comment 1

•

7 years ago

:gerald Please confirm this issue is related to changeset: https://hg.mozilla.org/integration/autoland/rev/e8efc3c5d61f

Flags: needinfo?(gsquelart)

Ionuț Goldan [:igoldan]

Reporter

Updated

•

7 years ago

Blocks: 1346783

Milan Sreckovic [:milan] (needinfo for best results)

Assignee

Updated

•

7 years ago

Whiteboard: [gfx-noted]

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 2

•

7 years ago

(In reply to igoldan from comment #1)
> :gerald Please confirm this issue is related to changeset:
> https://hg.mozilla.org/integration/autoland/rev/e8efc3c5d61f

I don't believe it can be related to bug 1337062, as that one was backed out 90 minutes later on autoland (see bug 1337062 comment 50), but the regression doesn't go away after that, and even spreads to inbound and central.


From the Alerts page, the pushlog [1] points at a bunch of bugs:

There is bug 1337063, which is similar to 1337062: Preemptively pushing data when launching a child process, instead of letting the child process sync-request soon after launch -- But overall the time should be the same or shorter; I'd need to investigate further...

What about the other ones?
Gregory Szorc — Bug 1355624 - Make Mercurial require TLS 1.2+ connections; r=dustin
Doug Thayer — Bug 1348396 - Only show a badge on PanelUI while in fullscreen r=enndeakin+6102
Drew Willcoxon — Bug 1295458 - Rework key and mouse handling for the one-off search buttons. r=florian
Jared Wein — Bug 1355078 - Restore scroll position after updating select popup. r=mossop
Jared Wein — Bug 1354196 - Forward the text-shadow CSS property to the select popup for styling. r=mossop
(And could there be more?)

Could someone who understands the Talos tests better, give some hints as to which one(s) of these bugs could possibly have an impact?
Could we run Talos tests against each of these pushes to find the most-likely culprit?


[1] https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=9929b37fea6884a758f3b5fee384bb7ea91792fa&tochange=e8efc3c5d61f386cc64f8ee82d6739cee71fd205

No longer blocks: 1337062

Flags: needinfo?(gsquelart)

Ionuț Goldan [:igoldan]

Reporter

Comment 3

•

7 years ago

:gerald Thank you for the quick notice!

Could you then investigate bug 1337063? I've filed this bug by looking over this chart:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bautoland,5f783f35de9134018990e631fb6ff725b2f3a9da,1,1%5D&series=%5Bautoland,555ac79a588637a3ec5752d5b9b3ee769a55d7f6,1,1%5D&series=%5Bautoland,5a96e86df99a2eb1d406bde937a4e673c58dfd8f,1,1%5D&series=%5Bautoland,cc4be2ad4c55588a87acfe014c9c55cdde0528f6,1,1%5D&series=%5Bautoland,ff7fe8783ea3a7d6feb2000be0142b1b11351206,1,1%5D&zoom=1491888561828.816,1492057619000,500,1192.193308550186&selected=%5Bautoland,ff7fe8783ea3a7d6feb2000be0142b1b11351206,191522,90726272,1%5D

This chart highlights these 2 changesets better:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=26b7053440a5cf15eacd10b9cbf16597eb75916b&tochange=e8efc3c5d61f386cc64f8ee82d6739cee71fd205

I also noticed these 2 changesets are related to e10s, so I ignored the other ones.

Flags: needinfo?(gsquelart)

Ionuț Goldan [:igoldan]

Reporter

Comment 4

•

7 years ago

Could you state if the changesets listed in comment 2 are related to this issue?

Flags: needinfo?(jaws)

Flags: needinfo?(gps)

Flags: needinfo?(dothayer)

Flags: needinfo?(adw)

Joel Maher ( :jmaher ) (UTC -8)

Comment 5

•

7 years ago

the sessionrestore* regressions seem to be related to Bug 1337063:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=26b7053440a5cf15eacd10b9cbf16597eb75916b&tochange=a6300a521f9c0438c1688007c2889530a0facce2


that is narrowing down the data from comment 3:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bautoland,5f783f35de9134018990e631fb6ff725b2f3a9da,1,1%5D&series=%5Bautoland,cc4be2ad4c55588a87acfe014c9c55cdde0528f6,1,1%5D&zoom=1491955015425.928,1491956863018.8486,694.662793050537,887.0891082589441&selected=%5Bautoland,5f783f35de9134018990e631fb6ff725b2f3a9da,191520,91316049,1%5D


I believe :gerald should be looking at this as the data clearly shows that is the culprit.

:gerald, what help do you need here?

Flags: needinfo?(jaws)

Flags: needinfo?(gps)

Flags: needinfo?(dothayer)

Flags: needinfo?(adw)

(no longer active)

Comment 6

•

7 years ago

It seems like this regression only happens on Windows.  My best guess is that the reason is the call that bug 1337063 added to this function: gfxPlatform::BuildContentDeviceData <http://searchfox.org/mozilla-central/rev/313e5199bf58200f158c6fcbe193e41b88ed58a6/gfx/thebes/gfxPlatform.cpp#2696>

Unfortunately this function waits on the GPU process to starts up, and it could be that when we run ContentParent::LaunchSubprocess() for the first time we don't yet have a GPU process, so we end up launching on here and waiting for it to be launched.  If launching the GPU process is anything close to as expensive as launching the content process itself (see bug 1348361) then there is a great chance that we are adding a real regression here by inflating the already slow code here due to bug 1348361 and basically making it doubly as slow by making us wait on two processes to start up rather than 1.)

Gerald, do you think my guess above has any relation to reality here?

David, is it possible to do this work without waiting on the GPU process to start up?

Flags: needinfo?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 7

•

7 years ago

In theory the GPU process is faster to start up since it doesn't init most of XPCOM, though it's still going to be loading xul.dll. When we expect Direct3D 11 to work, it is not possible to skip this step unfortunately. We'd have to initialize devices in the UI process which is exactly what we want to avoid.

First, I would confirm this is what actually happens. Putting a log print or NS_WARNING after here [1] should do it. It might also be that the GPU process has already sent a reply by then, but we haven't given the UI message loop a chance to run and dispatch the message. We could put a log message here [2] (in addition to the other one) to see if that was the case. If it is, we could probably resort to something that bypasses IPDL.

If we're legitimately just blocking waiting for the GPU process to start up, things get a little harder, but I have a few ideas.

[1] http://searchfox.org/mozilla-central/source/gfx/ipc/GPUChild.cpp#99
[2] http://searchfox.org/mozilla-central/source/gfx/ipc/GPUParent.cpp#209

Flags: needinfo?(dvander)

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 8

•

7 years ago

Thank you Joel, Ehsan, and David for your help.

Unfortunately I don't know that much about the Gfx/GPU side of things, so this is getting harder for me to deal with this efficiently. I'm happy to continue, with David's help (and I do have some ideas of my own too), but if we need this fixed correctly soon, I would suggest assigning this bug to someone much more knowledgeable than me!

In the meantime, I will follow David's advice in comment 7, and try to dive a bit deeper in case I can find a solution.

And if needed, we could easily back out bug 1337063 to get Talos back to a base level, before we experiment further in a new bug.

Blocks: 1337063

Flags: needinfo?(gsquelart)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 9

•

7 years ago

Diagnostic try run:  https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca526ed2e3d3470d2eba0de1bd05d96dfbf84461

Chris Peterson [:cpeterson]

Comment 10

•

7 years ago

We should track Talos sessionrestore regressions for Quantum Flow.

Whiteboard: [gfx-noted] → [gfx-noted][qf]

(no longer active)

Comment 11

•

7 years ago

Bas, this needs to be tracked somehow, it's a regression from bug 1337063.  Can you please track it to make sure we don't end up shipping it?  Thanks!

status-firefox55: --- → affected

tracking-firefox55: --- → ?

Flags: needinfo?(bas)

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 12

•

7 years ago

I haven't had luck with this, so I'll just prepare a back-out patch. If it lands, I'll open a new bug to try to fix bug 1337063 again...

Sorry for making you lie (twice!) in your newsletter, Ehsan.

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/137358/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/137358/

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 14

•

7 years ago

mozreview-review

Comment on attachment 8865735 [details]
Bug 1356138 - Backout bug 1337063 (Msg_GetGraphicsDeviceInitData opt) for Talos regression -

https://reviewboard.mozilla.org/r/137358/#review140432

I'm sad to see this reverted. I'm a little suspicious that the regression might not be this patch's fault - your other sync-removal patch changed timings and exposed bugs in network code. Also when I did a diagnostic run, I put a MOZ_CRASH on the sync path in your patch, and it didn't hit for sessionrestore tests. That would suggest we pushed a startup delay elsewhere which is the intended outcome, and suggests the "blocking on two processes instead of one" scenario isn't common.

The diagnostic run did show evidence that sometimes when we go to block on the init message, it's stuck in the event queue, so we're doing a round-trip for no reason. We might be able to fix that with CrossProcessMutex or a background pipe. (Or something more invasive to layers, like delaying content device creation.)

Attachment #8865735 - Flags: review?(dvander) → review+

Gerald Squelart (he/him) (not at Mozilla since 2022-09-15)

Comment 15

•

7 years ago

mozreview-review-reply

Comment on attachment 8865735 [details]
Bug 1356138 - Backout bug 1337063 (Msg_GetGraphicsDeviceInitData opt) for Talos regression -

https://reviewboard.mozilla.org/r/137358/#review140432

Agreed that it's sad, but hopefully it will at least add a bit of certainty as to whether bug 1337063 was the cause of the Talos regression -- if not, we can backout the backout! And if it was the cause, then we can work on a different solution...

I'd be happy to delay landing if you'd prefer, as we have a few weeks left before the next merge window.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 16

•

7 years ago

Nah it's fine to revert now. I don't doubt that it caused a talos regression - I'm just doubting whether the root cause is really this patch's fault.

Pulsebot

Comment 17

•

7 years ago

Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c73219703d72
Backout bug 1337063 (Msg_GetGraphicsDeviceInitData opt) for Talos regression - r=dvander

Joel Maher ( :jmaher ) (UTC -8)

Comment 18

•

7 years ago

this fixed all of the win8 regressions, but not the win7 regressions:
== Change summary for alert #6476 (as of May 09 2017 16:37 UTC) ==

Improvements:

 11%  sessionrestore windows8-64 pgo e10s     633.71 -> 563.00
 11%  sessionrestore windows8-64 opt e10s     763.54 -> 681.22
 11%  sessionrestore_no_auto_restore windows8-64 pgo e10s654.58 -> 585.50
 10%  sessionrestore_no_auto_restore windows8-64 opt e10s787.38 -> 706.67

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=6476

Marcia Knous [:marcia]

Comment 19

•

7 years ago

Tracking 55+ per Comment 11.

tracking-firefox55: ? → +

(no longer active)

Comment 20

•

7 years ago

(In reply to Joel Maher ( :jmaher) from comment #18)
> this fixed all of the win8 regressions, but not the win7 regressions:

What does that mean?!  That seems like the worst of both worlds, we got bug 1337063 backed out and our regression isn't completely fixed either.  Not sure what to read out of this!

Bas Schouten (:bas.schouten)

Comment 21

•

7 years ago

(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #20)
> (In reply to Joel Maher ( :jmaher) from comment #18)
> > this fixed all of the win8 regressions, but not the win7 regressions:
> 
> What does that mean?!  That seems like the worst of both worlds, we got bug
> 1337063 backed out and our regression isn't completely fixed either.  Not
> sure what to read out of this!

The indicated Win7 regression was tiny, I wonder if there ever was a Win7 regression in the first place, as our Win7 devices don't even get a content device iirc. And this is probably a regression due to us doing something with the device during the period of time tested by session restore rather than outside it.

Even if there was a small win7 regression, it might be small enough that due to subtly different noise the small nudge back to the old value isn't picked up as statistically significant.

Flags: needinfo?(bas)

Joel Maher ( :jmaher ) (UTC -8)

Comment 22

•

7 years ago

it could be that win7 was related to something else and we accidentally miscategorized it as this root cause.

Guilherme Lima

Comment 23

•

7 years ago

https://hg.mozilla.org/mozilla-central/rev/c73219703d72 - landed almost 24 hours ago.

Andrew Overholt [:overholt]

Comment 24

•

7 years ago

The root bug is still marked qf:p1 so let's not track this backout.

Whiteboard: [gfx-noted][qf] → [gfx-noted][qf-]

Mike Taylor [:miketaylr]

Updated

•

7 years ago

status-firefox55: affected → fix-optional

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 25

•

7 years ago

Tracking "-" due to comment 24 and the fact that it is fix-optional for 55.

tracking-firefox55: + → -

Milan Sreckovic [:milan] (needinfo for best results)

Assignee

Comment 26

•

7 years ago

I've relanded patches in bug 1337063.  You can read through the comments there, or the conclusion in bug 1337063 comment 41 that the session restore worsening numbers do not indicate a regression, just happen to measure things at the point where more time is now spent.  Mostly because content process is not blocked and can get further ahead.

So, when this bug shows back up, I'm going to suggest we mark it as resolved and start tracking the new numbers.

Assignee: gsquelart → milan

Ionuț Goldan [:igoldan]

Reporter

Comment 27

•

7 years ago

The expected regression arrived:

== Change summary for alert #7494 (as of June 22 2017 18:34 UTC) ==

Regressions:

 12%  sessionrestore windows7-32 opt e10s     757.29 -> 850.08
 11%  sessionrestore_no_auto_restore windows7-32 opt e10s836.04 -> 929.42
 10%  sessionrestore windows7-32 pgo e10s     611.33 -> 674.00
 10%  sessionrestore_no_auto_restore windows7-32 pgo e10s661.46 -> 728.42

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7494

Milan Sreckovic [:milan] (needinfo for best results)

Assignee

Comment 28

•

7 years ago

I expect we will won't fix this.

On a related note, I found sessionrestore_no_auto_restore  improves 5% or so if we turn off content multi-process (e.g., run talos with a single content process instead of 4.)  Wonder if that gives us some clues about this one.

Milan Sreckovic [:milan] (needinfo for best results)

Assignee

Updated

•

7 years ago

Comment 29

•

7 years ago

See bug 1377229 comment 5.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → WONTFIX

Joel Maher ( :jmaher ) (UTC -8)

Comment 30

•

7 years ago

thanks Milan!

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

2 years ago

Performance Impact: --- → -

Whiteboard: [gfx-noted][qf-] → [gfx-noted]