Tabs opened by clicking links sometimes result in a perma tabswitch spinner, even though content process is not hung

VERIFIED FIXED in Firefox 51

Status

()

Core
Graphics
--
critical
VERIFIED FIXED
a year ago
a year ago

People

(Reporter: mconley, Assigned: dvander)

Tracking

(Blocks: 2 bugs)

unspecified
mozilla53
Unspecified
Windows 7
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox51 verified, firefox52+ verified, firefox53 verified)

Details

Attachments

(2 attachments)

STR:

(This takes time and patience, as there seems to be some kind of race here)
0) Make sure e10s is enabled and you're in an e10s window.
1) Have some pre-existing tabs opened and loaded.
2) In one of those tabs, visit http://waitbutwhy.com/2016/11/its-going-to-be-okay.html
3) Once you're able, scroll down to the bottom of the article, where the text says "Want to feel hopeful? Read about Elon."
4) Left-click on Elon.
5) When a new tab opens, see if it's a perma-spinner. If not, close the tab.
6) Repeat steps 4 and 5 until you see a perma-spinner.

I've hit this a few times, and it usually occurs after Firefox has been in the background for a while. I've only started to notice this in the past few days, and only on my Windows 7 machine.

When I have hit it, I've noticed that the content process that the tab belongs to is not hung. The async tab switching just doesn't think (or hasn't been told) that the layers for the newly opened tab are ready and available.

Switching away from the tab and switching back doesn't address the issue (since the async tab switcher never "destructs", it thinks that the spinning tab is in a constant state of LOADING while it waits for the layers).

I've noticed an uptick in long spinners on Telemetry: https://mzl.la/2fFFZ1I

And this is starting on the 5th of November.
Created attachment 8809465 [details]
Async tab switcher log

I set browser.tabs.remote.logSwitchTiming and hit the bug, and this is what I saw in the Browser Console.
Blocks: 1300411, 1131167
I'll note that in the log from comment 1, the tab I switched to (#133) isn't actually in the state that's logged to the console.

billm, any idea how that could be?
Flags: needinfo?(wmccloskey)
Note that this only affects e10s, so our ESR users should not be affected.

However, this is awful enough behaviour (and we're close enough to uplift time) that I'm pre-emptively setting tracking flags so we don't somehow lose this.
tracking-firefox52: --- → ?
Here's the set of commits that seem to map to the spike we're seeing in Telemetry:

https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2016-11-03&enddate=2016-11-05
Looking at Telemetry for the FX_TAB_SWITCH_SPINNER_VISIBLE_LONG_MS

OS X: https://mzl.la/2g1sGJi
Linux (64): https://mzl.la/2g1pJZn
Windows: https://mzl.la/2g1pfCt

Breaking down Windows, I only see the spike on Windows 7:

https://mzl.la/2g1pJbE

Although Windows 8.1 has fewer samples, I don't see the spike there: https://mzl.la/2g1pEF6
I'll note that the compositor process was enabled for Windows in bug 1314133 within that changeset range.
Hey dvander,

I'm sniffing around bug 1314133 as the possible regressor here simply because this only seems to affect Windows. The async tab switcher depends on the MozLayerTreeReady event to know when it's okay to complete the async tab switch and present the painted tab content to the user. Do you know if, with the new compositor process, there's potential for some kind of race when creating a new tab (and therefore a new TabParent, and layer tree ID) where the MozLayerTreeReady event may not ever be fired?
Flags: needinfo?(dvander)
Tracking 52+ we should try to fix this before the merge.
tracking-firefox52: ? → +
I'm unfortunately away from my Windows 7 machine until Tuesday, and it was only on that machine where I was able to reproduce this (which is consistent with the data from Telemetry). If anybody else is able to reproduce this, can you please try setting layers.gpu-process.dev.enabled to false and seeing if you can still reproduce?

If not, then this seems pretty connected to bug 1314133, which means that the bug will likely stay Nightly only (since that's the only place where layers.gpu-process.dev.enabled is set to true).
Hey RyanVM, got any suggestions for who is best to test this from QA? Both my STR in comment 0, and my suspicions of a likely "fix" in comment 9?
Flags: needinfo?(ryanvm)
Flags: needinfo?(ryanvm) → needinfo?(rares.bologa)
(Assignee)

Updated

a year ago
Blocks: 1307578

Updated

a year ago
Flags: needinfo?(rares.bologa) → needinfo?(ciprian.muresan)
User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
Build ID:   20161113030203

I haven't been able to reproduce the issue. I have tried on 3 different machines with different hardware. I tried Nightly x32 & x64 on Windows 7 x64 and Nightly x32 on Windows 7 x32. 

I've tried to reproduce with the steps from comment 0, I've also tried leaving the machines alone for a while and tested the steps after that, I've tried opening the page from Step 3 in multiple new tabs (40+) and they've all loaded as expected.

When I've tried doing the same on Windows 7 x32 I've got inconclusive results, some random OOM crashes, tabs going black, but the tabs loaded and finished loading.

Let me know if I can do anything more.
Flags: needinfo?(ciprian.muresan)
(Assignee)

Comment 12

a year ago
(In reply to Ciprian Muresan [:cmuresan] from comment #11)
> User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101
> Firefox/52.0
> Build ID:   20161113030203
> 
> I haven't been able to reproduce the issue. I have tried on 3 different
> machines with different hardware. I tried Nightly x32 & x64 on Windows 7 x64
> and Nightly x32 on Windows 7 x32. 
> 
> I've tried to reproduce with the steps from comment 0, I've also tried
> leaving the machines alone for a while and tested the steps after that, I've
> tried opening the page from Step 3 in multiple new tabs (40+) and they've
> all loaded as expected.
> 
> When I've tried doing the same on Windows 7 x32 I've got inconclusive
> results, some random OOM crashes, tabs going black, but the tabs loaded and
> finished loading.
> 
> Let me know if I can do anything more.

Did these machines have the Windows 7 Platform Update installed?

(In reply to Mike Conley (:mconley) (PTO - Nov 11 - Nov 14) from comment #0)

Same question - you can tell via about:support, if Direct2D 1.1 is enabled.

My guess is yes, but it's worth checking.
Flags: needinfo?(dvander) → needinfo?(mconley)
(Assignee)

Comment 13

a year ago
Mike, I'm also wondering whether you were able to pin the blame on layers.gpu-process.dev.enabled during your test session?

I just did some Telemetry and don't see any obvious correlation - histogram distribution looks the same between sessions with and without the GPU process having been launched. What's also weird is that Windows 7 is much more likely to *not* have the GPU process, so I'd expect Windows 8 to have more regressions than less.

That said... I'm totally willing to believe we regressed something because we touched exactly that area of code.
On the 3 machines on which I have tested the issue I can see Direct2D: True in about:support, but nothing on Direct2D 1.1.
(Assignee)

Comment 15

a year ago
Okay, that means you have Direct2D 1.1 and the Platform Update installed, then.
I'm in the same boat as cmuresan - I see Direct2D: True, but nothing about Direct2D 1.1.
Flags: needinfo?(mconley)
(In reply to Mike Conley (:mconley) from comment #2)
> I'll note that in the log from comment 1, the tab I switched to (#133) isn't
> actually in the state that's logged to the console.
> 
> billm, any idea how that could be?

I think maybe this could happen when there are preloaded tabs. It used to be that these tabs appeared under tabbrowser.mPanelContainer but they weren't in tabbrowser.tabs. That would explain the difference. I'm not sure if we still preload tabs this way though.

If that's the case, it's probably not related to this problem.
Flags: needinfo?(wmccloskey)

Comment 18

a year ago
str
¡Hola!

Commenting here per mconley's request on #nightly...

STR
- Load https://help.softtek.com/sm/ess.do

Result:
A never ending e10s spinner at the center of the viewport

Expected result:
Page loads

Fix:
Set layers.gpu-process.dev.enabled to false and restart Nightly

Environment:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0 ID:20161117030212 CSet: 13f49da109ea460665ad27c8497cb1489548450c

¡Gracias!
Alex
I think this is a more precise regression range, based on the build IDs of the Nightlies:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8002299dfc3fb3b2c53e50e40a8e16f8dff2c3e7&tochange=a7c654513f2ffd9d9ef38fa2bf512b9e8dae3cdd
Notably, the GPU process work is not on that list. It didn't make it into the 11/5 Nightly.
I'm afraid I haven't seen this again since I originally filed it. :/
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → INCOMPLETE
Temporarily re-opening this because I went back and found the build I was using on the day that I originally experienced it. I downloaded the Nightly build, and have reproduced the issue. I've also gotten symbols and the source for that day, and am able to step around in the code. I'll be working today to see what is going on here.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
(In reply to Bill McCloskey (:billm) from comment #20)
> Notably, the GPU process work is not on that list. It didn't make it into
> the 11/5 Nightly.

Unfortunately, the Telemetry dashboard was lying to us. See bug 1319845 comment 18 and https://github.com/mozilla/telemetry-dashboard/issues/265.

The long and the short of it is that the regression _did_ appear in the Nov 6th build, which puts the GPU process back on the table as a candidate cause.
Whoah, that's awful. Can you try disabling the GPU process and see if it still reproduces?
In comment 22, I described how I'd gotten into the state with the perma-spinner despite the content process not being hung.

Today, I walked through the code to see what was going on. Here's what I know:

1) The content process is indeed responsive. Other tabs within the same content process paint and are interactive.

2) At least part of Gecko thinks that the "spinny" tab is showing it's content. If, for example, the page has a text area on it, even when I'm spinny, if I put my mouse into the region where the textarea would normally be, the mouse cursor changes to the text select cursor. If the spinny tab is supposed to be showing an image, if I click and drag the region where the image is supposed to be displayed, I can get the drag effect on the cursor for the image.

3) I caused the spinny tab to force paint by setting gBrowser.selectedBrowser.docShellIsActive = false (the async tab switcher will soon force it to active again by itself). I was able to follow through our graphics machinery and we're definitely painting the tab content. We also forward it along to the compositor process.

4) The compositor process receives the layer transaction, but this is where it falls over:

http://searchfox.org/mozilla-central/rev/b4e6fa3527436bdbcd9dd2e1982382fe423431f3/gfx/layers/ipc/LayerTransactionParent.cpp#272

The LayerManager is nullptr, so we just return IPC_OK() without doing any kind of compositing. That means that the TabParent never hears about the successful compositing of the tab contents, which means that the MozLayerTreeReady event never fires, and so we perma-spin.

So the question is: how come the LayerManager is nullptr here?
(In reply to Bill McCloskey (:billm) from comment #24)
> Whoah, that's awful. Can you try disabling the GPU process and see if it
> still reproduces?

I'll try it later today. It was tricky to get into this state (it took me about twenty minutes to a half an hour yesterday), and so I want to get as much data out of it while I have it in this state. Once I've exhausted it, I'll try with the GPU process disabled.
It doesn't look like LayerTransactionParent ever clears mLayerManager / layer_manager() itself. It is, however, periodically constructed with a nullptr one:

http://searchfox.org/mozilla-central/rev/b4e6fa3527436bdbcd9dd2e1982382fe423431f3/gfx/layers/ipc/CompositorBridgeParent.cpp#1420
http://searchfox.org/mozilla-central/rev/b4e6fa3527436bdbcd9dd2e1982382fe423431f3/gfx/layers/ipc/CrossProcessCompositorBridgeParent.cpp#97
I just spent some time attempting to reproduce this bug with the GPU process disabled, and I wasn't able to.

That's not super conclusive, especially considering how hard it was to reproduce it with the GPU process enabled. But it's a datapoint.
I've also had no luck reproducing this on Linux, where I forced the GPU process active. I was hoping I could capture this problem with rr, but no dice.

It'd be really nice to be able to say with a high degree of confidence that the GPU process is contributing to this, or eliminate it as the culprit entirely. Would it be possible to turn off the GPU process on Nightly for a few days to see if we see the regression in Telemetry disappear? That's the simplest way I can think of showing a connection.

dvander - would you have any objections if I disabled the GPU process on Nightly for a few days?
Flags: needinfo?(dvander)
(Assignee)

Comment 30

a year ago
No objection here, but let's ping ashughes since we're about to run an A/B test. If that's coming Very Soon we can use that instead.

Anthony, see comment #29
Flags: needinfo?(dvander) → needinfo?(anthony.s.hughes)

Comment 31

a year ago
str
¡Hola!

Here's another use case that got me never ending spinners:

- Load http://www.dhl.com/en/contact_center/contact_express.html
- Pick "Mexico"
- Click "Go>>"

mconley suggest there might be something specific to this laptop.

This reproduces in a clean profile.

This works with layers.gpu-process.dev.enabled;false.

Below find the relevant section of about:support.

¡Gracias!
Alex

Gráficas
Características
Composición	Direct3D 11
Encuadro/zoom asíncrono	entrada de rueda de ratón activada; entrada táctil habilitada
Procesador WebGL	Google Inc. -- ANGLE (Intel(R) HD Graphics 3000 Direct3D11 vs_4_1 ps_4_1)
Procesador WebGL2	Google Inc. -- ANGLE (Intel(R) HD Graphics 3000 Direct3D11 vs_4_1 ps_4_1)
Decodificación H264 por hardware	No; Hardware video decoding disabled or blacklisted
Backend de audio	wasapi
Direct2D	true
DirectWrite	true (6.2.9200.17568)
GPU #1
Activo	Sí
Descripción	Intel(R) HD Graphics 3000
ID de vendedor	0x8086
ID de dispositivo	0x0126
Versión del controlador	8.15.10.2696
Fecha del controlador	3-19-2012
Controladores	igdumd64 igd10umd64 igd10umd64 igdumd32 igd10umd32 igd10umd32
ID del subsistema	04931028
RAM	Unknown
Diagnósticos
AzureCanvasAccelerated	0
AzureCanvasBackend	direct2d 1.1
AzureContentBackend	direct2d 1.1
AzureFallbackCanvasBackend	cairo
Registro de decisiones
D3D9_COMPOSITING	
disabled by default: Disabled by default
(In reply to alex_mayorga from comment #31)
> This works with layers.gpu-process.dev.enabled;false.

I clarified this with alex_mayorga over IRC. To avoid ambiguity, what is being said here is that with the GPU process disabled, there are no perma-spinners with his STR.
(Assignee)

Comment 33

a year ago
Thanks for persisting with this, Mike! I think based on comment #27 this is what could go wrong:

 (1) TabParent::Show sends a message, via the UI's PCompositorBridge, to associate a layers id with the UI compositor.
 (2) TabParent::Show sends a Show message to its TabChild.
 (3) TabChild::RecvShow calls TabChild::InitRenderingState.
 (4) InitRenderingState sends a PLayerTransactionParent constructor via CrossProcessCompositorBridge.
 (5) If the message from (1) has not resolved before the message from (4), LayerTransactionParent::mLayerManager will be null.

Of course... there is no ordering at all between (1) and (4). The only reason it worked so well before the GPU process is because the message was posted to a thread in the same process, whereas now it goes over IPC.

The fix is to make the association atomic with allocating a PLayerTransaction, to eliminate these kinds of problems.
Assignee: nobody → dvander
Status: REOPENED → ASSIGNED
Flags: needinfo?(anthony.s.hughes)
(Assignee)

Comment 34

a year ago
Created attachment 8815079 [details] [diff] [review]
fix

After talking this over with Bill, we think making NotifyChildCreated sync is the best short-term solution. It's also easy to uplift. It's not ideal, but at the moment we don't have clear alternatives.

One idea was to send a parent compositor key to the child process, but this would let a compromised process associate tabs with the wrong window.

Changing AllocPLayerTransaction and NotifyChildCreated so that they can happen out of order does not seem easy either. The former is already sync, since it needs a TextureFactoryIdentifier. 

Another idea would be to create the PLayerTransaction as part of the association, by sending a message down via CrossProcessCompositorBridge. Then we would delay InitRenderingState until this message was received.

But for now this patch should work fine.
Attachment #8815079 - Flags: review?(matt.woodrow)
Attachment #8815079 - Flags: review?(matt.woodrow) → review+

Comment 35

a year ago
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5ec0f21b006c
Ensure an ordering between cross-process compositor association and PLayerTransaction construction. (bug 1316632, r=mattwoodrow)

Updated

a year ago
Component: Tabbed Browser → Graphics
Product: Firefox → Core

Comment 36

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/5ec0f21b006c
Status: ASSIGNED → RESOLVED
Last Resolved: a year agoa year ago
status-firefox53: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Hey alex_mayorga,

dvander's fix for this bug made it into today's Nightly (2016-11-29). Can you re-enable the GPU process and test to see if the perma-spinner issue has been fixed for you?
Flags: needinfo?(alex_mayorga)
(In reply to David Anderson [:dvander] from comment #30)
> No objection here, but let's ping ashughes since we're about to run an A/B
> test. If that's coming Very Soon we can use that instead.
> 
> Anthony, see comment #29

Sorry for the delayed response and sorry if this comment is no longer relevant. I missed this need-info because I've been heads down trying to solve a bug in my Telemetry Experiment before pushing it live. I was hoping to go live before the All-hands but that's not going to happen. My best hope at this point is to get the issue resolved at the All-hands and to go live soon after.

Comment 39

a year ago
¡Hola Mike!

I can't reproduce the spinners mentioned on https://bugzilla.mozilla.org/show_bug.cgi?id=1316632#c18 nor  https://bugzilla.mozilla.org/show_bug.cgi?id=1316632#c31 on Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0 ID:20161129030222 CSet: f8107cf961444a8d7fdc2c0f446238af9893f875 with layers.gpu-process.dev.enabled;true

¡Bravo David! =)

¡Gracias!
Alex
Status: RESOLVED → VERIFIED
Flags: needinfo?(alex_mayorga)
(Assignee)

Comment 40

a year ago
Comment on attachment 8815079 [details] [diff] [review]
fix

Approval Request Comment
[Feature/Bug causing the regression]: None/e10s
[User impact if declined]: In very rare cases tabs may spin forever and never display
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: 
[Is the change risky?]: No
[Why is the change risky/not risky?]: This patch forces us to wait until a critical message has been received by the compositor. The surrounding code already assumed we did this. The biggest risk is maybe delaying tab display by a millisecond or two, but it's unlikely. We already have other sync messages on this path.
[String changes made/needed]:
Attachment #8815079 - Flags: approval-mozilla-beta?
Attachment #8815079 - Flags: approval-mozilla-aurora?
Comment on attachment 8815079 [details] [diff] [review]
fix

Fix a tab switch spinner issue. Beta51+ and Aurora52+. Should be in 51 beta 6.
Attachment #8815079 - Flags: approval-mozilla-beta?
Attachment #8815079 - Flags: approval-mozilla-beta+
Attachment #8815079 - Flags: approval-mozilla-aurora?
Attachment #8815079 - Flags: approval-mozilla-aurora+
status-firefox51: --- → affected
status-firefox52: --- → affected
Flags: qe-verify+

Comment 42

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/11ba6a74739c
status-firefox52: affected → fixed

Comment 43

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/060ac6a33892
status-firefox51: affected → fixed
Outstanding - the results of this patch are having an impact. Check out the last few graphs for the Nov 29th build (which is where this fix was first made available in):

https://mikeconley.github.io/bug1310250/

Nice fall-off in the worst buckets (64000ms+, 27856ms - 63999ms, 12124ms - 27855ms), and (at least with the current snapshot), we saw an overall drop in number of e10s users affected by spinners on Nightly by a full percentage point (from ~43% down to ~42%)

Great work!
Duplicate of this bug: 1319845
I didn't managed to reproduce this issue on Windows 7 (x64 and x86) following the Str from Mike's comment 0. I used an old Nightly build from 2016-11-10.

However I think it's safe to say based on comment 44, that this issue is verified fixed on:
- 51.0b10 (20161222080852)
- latest Aurora 52.0a2 (2016-12-28)
- latest Nightly 53.0a1 (2016-12-28)
I've also verified with the builds mentioned above and I can confirm that, I always get tabs loaded on the sites mentioned in this bug report. If you have any question please let me know.
status-firefox51: fixed → verified
status-firefox52: fixed → verified
status-firefox53: fixed → verified
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.