Closed Bug 1356138 Opened 3 years ago Closed 2 years ago

2.42 - 10.26% sessionrestore / sessionrestore_no_auto_restore (windows7-32, windows8-64) regression on push e8efc3c5d61f386cc64f8ee82d6739cee71fd205 (Wed Apr 12 2017)

Categories

(Core :: Graphics, defect)

Unspecified
Windows
defect
Not set

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox55 - fix-optional

People

(Reporter: igoldan, Assigned: milan)

References

Details

(Keywords: perf, regression, talos-regression, Whiteboard: [gfx-noted][qf-])

Attachments

(1 file)

Talos has detected a Firefox performance regression from push e8efc3c5d61f386cc64f8ee82d6739cee71fd205. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 10%  sessionrestore windows8-64 opt e10s     738.92 -> 814.75
  9%  sessionrestore_no_auto_restore windows8-64 opt e10s766.67 -> 838.33
  2%  sessionrestore_no_auto_restore windows7-32 pgo e10s686.12 -> 702.75


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=5972

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → Graphics
Product: Firefox → Core
:gerald Please confirm this issue is related to changeset: https://hg.mozilla.org/integration/autoland/rev/e8efc3c5d61f
Flags: needinfo?(gsquelart)
Whiteboard: [gfx-noted]
(In reply to igoldan from comment #1)
> :gerald Please confirm this issue is related to changeset:
> https://hg.mozilla.org/integration/autoland/rev/e8efc3c5d61f

I don't believe it can be related to bug 1337062, as that one was backed out 90 minutes later on autoland (see bug 1337062 comment 50), but the regression doesn't go away after that, and even spreads to inbound and central.


From the Alerts page, the pushlog [1] points at a bunch of bugs:

There is bug 1337063, which is similar to 1337062: Preemptively pushing data when launching a child process, instead of letting the child process sync-request soon after launch -- But overall the time should be the same or shorter; I'd need to investigate further...

What about the other ones?
Gregory Szorc — Bug 1355624 - Make Mercurial require TLS 1.2+ connections; r=dustin
Doug Thayer — Bug 1348396 - Only show a badge on PanelUI while in fullscreen r=enndeakin+6102
Drew Willcoxon — Bug 1295458 - Rework key and mouse handling for the one-off search buttons. r=florian
Jared Wein — Bug 1355078 - Restore scroll position after updating select popup. r=mossop
Jared Wein — Bug 1354196 - Forward the text-shadow CSS property to the select popup for styling. r=mossop
(And could there be more?)

Could someone who understands the Talos tests better, give some hints as to which one(s) of these bugs could possibly have an impact?
Could we run Talos tests against each of these pushes to find the most-likely culprit?


[1] https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=9929b37fea6884a758f3b5fee384bb7ea91792fa&tochange=e8efc3c5d61f386cc64f8ee82d6739cee71fd205
No longer blocks: 1337062
Flags: needinfo?(gsquelart)
Could you state if the changesets listed in comment 2 are related to this issue?
Flags: needinfo?(jaws)
Flags: needinfo?(gps)
Flags: needinfo?(dothayer)
Flags: needinfo?(adw)
It seems like this regression only happens on Windows.  My best guess is that the reason is the call that bug 1337063 added to this function: gfxPlatform::BuildContentDeviceData <http://searchfox.org/mozilla-central/rev/313e5199bf58200f158c6fcbe193e41b88ed58a6/gfx/thebes/gfxPlatform.cpp#2696>

Unfortunately this function waits on the GPU process to starts up, and it could be that when we run ContentParent::LaunchSubprocess() for the first time we don't yet have a GPU process, so we end up launching on here and waiting for it to be launched.  If launching the GPU process is anything close to as expensive as launching the content process itself (see bug 1348361) then there is a great chance that we are adding a real regression here by inflating the already slow code here due to bug 1348361 and basically making it doubly as slow by making us wait on two processes to start up rather than 1.)

Gerald, do you think my guess above has any relation to reality here?

David, is it possible to do this work without waiting on the GPU process to start up?
Flags: needinfo?(dvander)
In theory the GPU process is faster to start up since it doesn't init most of XPCOM, though it's still going to be loading xul.dll. When we expect Direct3D 11 to work, it is not possible to skip this step unfortunately. We'd have to initialize devices in the UI process which is exactly what we want to avoid.

First, I would confirm this is what actually happens. Putting a log print or NS_WARNING after here [1] should do it. It might also be that the GPU process has already sent a reply by then, but we haven't given the UI message loop a chance to run and dispatch the message. We could put a log message here [2] (in addition to the other one) to see if that was the case. If it is, we could probably resort to something that bypasses IPDL.

If we're legitimately just blocking waiting for the GPU process to start up, things get a little harder, but I have a few ideas.

[1] http://searchfox.org/mozilla-central/source/gfx/ipc/GPUChild.cpp#99
[2] http://searchfox.org/mozilla-central/source/gfx/ipc/GPUParent.cpp#209
Flags: needinfo?(dvander)
Thank you Joel, Ehsan, and David for your help.

Unfortunately I don't know that much about the Gfx/GPU side of things, so this is getting harder for me to deal with this efficiently. I'm happy to continue, with David's help (and I do have some ideas of my own too), but if we need this fixed correctly soon, I would suggest assigning this bug to someone much more knowledgeable than me!

In the meantime, I will follow David's advice in comment 7, and try to dive a bit deeper in case I can find a solution.

And if needed, we could easily back out bug 1337063 to get Talos back to a base level, before we experiment further in a new bug.
Blocks: 1337063
Flags: needinfo?(gsquelart)
We should track Talos sessionrestore regressions for Quantum Flow.
Whiteboard: [gfx-noted] → [gfx-noted][qf]
Bas, this needs to be tracked somehow, it's a regression from bug 1337063.  Can you please track it to make sure we don't end up shipping it?  Thanks!
Flags: needinfo?(bas)
I haven't had luck with this, so I'll just prepare a back-out patch. If it lands, I'll open a new bug to try to fix bug 1337063 again...

Sorry for making you lie (twice!) in your newsletter, Ehsan.
Comment on attachment 8865735 [details]
Bug 1356138 - Backout bug 1337063 (Msg_GetGraphicsDeviceInitData opt) for Talos regression -

https://reviewboard.mozilla.org/r/137358/#review140432

I'm sad to see this reverted. I'm a little suspicious that the regression might not be this patch's fault - your other sync-removal patch changed timings and exposed bugs in network code. Also when I did a diagnostic run, I put a MOZ_CRASH on the sync path in your patch, and it didn't hit for sessionrestore tests. That would suggest we pushed a startup delay elsewhere which is the intended outcome, and suggests the "blocking on two processes instead of one" scenario isn't common.

The diagnostic run did show evidence that sometimes when we go to block on the init message, it's stuck in the event queue, so we're doing a round-trip for no reason. We might be able to fix that with CrossProcessMutex or a background pipe. (Or something more invasive to layers, like delaying content device creation.)
Attachment #8865735 - Flags: review?(dvander) → review+
Comment on attachment 8865735 [details]
Bug 1356138 - Backout bug 1337063 (Msg_GetGraphicsDeviceInitData opt) for Talos regression -

https://reviewboard.mozilla.org/r/137358/#review140432

Agreed that it's sad, but hopefully it will at least add a bit of certainty as to whether bug 1337063 was the cause of the Talos regression -- if not, we can backout the backout! And if it was the cause, then we can work on a different solution...

I'd be happy to delay landing if you'd prefer, as we have a few weeks left before the next merge window.
Nah it's fine to revert now. I don't doubt that it caused a talos regression - I'm just doubting whether the root cause is really this patch's fault.
Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c73219703d72
Backout bug 1337063 (Msg_GetGraphicsDeviceInitData opt) for Talos regression - r=dvander
this fixed all of the win8 regressions, but not the win7 regressions:
== Change summary for alert #6476 (as of May 09 2017 16:37 UTC) ==

Improvements:

 11%  sessionrestore windows8-64 pgo e10s     633.71 -> 563.00
 11%  sessionrestore windows8-64 opt e10s     763.54 -> 681.22
 11%  sessionrestore_no_auto_restore windows8-64 pgo e10s654.58 -> 585.50
 10%  sessionrestore_no_auto_restore windows8-64 opt e10s787.38 -> 706.67

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=6476
(In reply to Joel Maher ( :jmaher) from comment #18)
> this fixed all of the win8 regressions, but not the win7 regressions:

What does that mean?!  That seems like the worst of both worlds, we got bug 1337063 backed out and our regression isn't completely fixed either.  Not sure what to read out of this!
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #20)
> (In reply to Joel Maher ( :jmaher) from comment #18)
> > this fixed all of the win8 regressions, but not the win7 regressions:
> 
> What does that mean?!  That seems like the worst of both worlds, we got bug
> 1337063 backed out and our regression isn't completely fixed either.  Not
> sure what to read out of this!

The indicated Win7 regression was tiny, I wonder if there ever was a Win7 regression in the first place, as our Win7 devices don't even get a content device iirc. And this is probably a regression due to us doing something with the device during the period of time tested by session restore rather than outside it.

Even if there was a small win7 regression, it might be small enough that due to subtly different noise the small nudge back to the old value isn't picked up as statistically significant.
Flags: needinfo?(bas)
it could be that win7 was related to something else and we accidentally miscategorized it as this root cause.
The root bug is still marked qf:p1 so let's not track this backout.
Whiteboard: [gfx-noted][qf] → [gfx-noted][qf-]
Tracking "-" due to comment 24 and the fact that it is fix-optional for 55.
I've relanded patches in bug 1337063.  You can read through the comments there, or the conclusion in bug 1337063 comment 41 that the session restore worsening numbers do not indicate a regression, just happen to measure things at the point where more time is now spent.  Mostly because content process is not blocked and can get further ahead.

So, when this bug shows back up, I'm going to suggest we mark it as resolved and start tracking the new numbers.
Assignee: gsquelart → milan
The expected regression arrived:

== Change summary for alert #7494 (as of June 22 2017 18:34 UTC) ==

Regressions:

 12%  sessionrestore windows7-32 opt e10s     757.29 -> 850.08
 11%  sessionrestore_no_auto_restore windows7-32 opt e10s836.04 -> 929.42
 10%  sessionrestore windows7-32 pgo e10s     611.33 -> 674.00
 10%  sessionrestore_no_auto_restore windows7-32 pgo e10s661.46 -> 728.42

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7494
I expect we will won't fix this.

On a related note, I found sessionrestore_no_auto_restore  improves 5% or so if we turn off content multi-process (e.g., run talos with a single content process instead of 4.)  Wonder if that gives us some clues about this one.
See bug 1377229 comment 5.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.