Closed Bug 1350408 Opened 3 years ago Closed 2 years ago

Crash in mozilla::layers::PWebRenderBridgeChild::SendCreate

Categories

(Core :: Graphics: WebRender, defect, P3, critical)

x86
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 - disabled
firefox56 --- fixed

People

(Reporter: marcia, Assigned: aosmond)

References

(Depends on 1 open bug)

Details

(Keywords: crash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-d5f3289f-d7d3-48a5-9d63-591ef2170324.
=============================================================

Seen while looking at crash stats - crashes started using 20170324030205: http://bit.ly/2mZGT8I.

Possible regression range based on crash stats: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7513b3f42058e9bcf9950d4acf4647d4ad2240f0&tochange=01d1dedf400d4be413b1a0d48090dca7acf29637
I enabled gfx.webrender.enabled in today's Nightly and here are the results:

1. The browser doesn't start with NVIDIA 540m graphics. It shows the small crash-box during startup. Intel HD 3000 iGPU boots up fine.

2. The browser crashes when it's minimized then maximized. When I maximize it, the whole browser turns white and can only be stopped from the Task Manager.

3. Scrolling has jank. 

4. Switching tabs is slow. When I open a new tab, sometimes it displays a spinning icon in center.

5. When I open a new tab successfully, the web page becomes completely red for a second before displaying the page.

Here is the full list of crashes. There are two types of errors: mozilla::layers::PWebRenderBridgeChild::SendCreate and IPCError-browser | ShutDownKill

bp-88eee366-3e02-41b1-92d5-0679d2170324
bp-265fc923-18f0-4f93-9c44-4a4e22170324
bp-9464269a-0d2a-4c7b-9de3-cb58c2170324
bp-69669d9a-0ca3-48aa-a187-f969a2170324
bp-d822ccc9-666c-4e1b-91f6-0f9ac2170324
bp-b93d58c5-bad0-413f-9c32-4420c2170324
bp-da697adb-8bc2-4825-b51f-cfffe2170324
bp-eb9d4e16-8b2c-487d-ac7c-d9b6e2170324
bp-188b9131-851b-401c-ad83-88f9d2170324
bp-e313eda1-172c-4929-9257-2c2682170324
bp-cec2f82a-b7b6-438e-96b9-14b872170324
bp-bdfd5cdd-f617-403c-8113-958f32170324
bp-cd12ffad-a54e-42e3-a82f-c1b292170324
bp-3ab09a72-4306-43cc-be6a-c82e12170324
bp-dc4e3911-ff85-4510-9225-c3b7c2170324
bp-0a857074-5a01-4230-ba0b-5e71b2170324
bp-f08e9d77-9e86-4ec9-8fc1-cdb882170324
bp-b549be60-c59c-46b7-8973-5a4192170324
bp-badf260a-a7bd-4eb3-836f-bb6aa2170324
As with bug 1350404, we can probably do a better job here to handle this failure. It looks like WrBridge() at [1] is null, which means the SendPWebRenderBridgeConstructor call returned null.

[1] http://searchfox.org/mozilla-central/rev/7419b368156a6efa24777b21b0e5706be89a9c2f/gfx/layers/wr/WebRenderLayerManager.cpp#217
Assignee: nobody → bugmail
Blocks: 1342450
See Also: → 1350404
(In reply to smartfon.reddit from comment #1)
> I enabled gfx.webrender.enabled in today's Nightly and here are the results:

Thanks for the feedback. I'll (eventually) file separate bugs for these things. Right now we're still getting things stabilized.
In Comment 1 I made a mistake. The reverse is true. Nightly starts fine with NVIDIA but crashes during startup with Intel HD 3000.

The top portion of these characters are not being rendered: f l 5 0
I'm also seeing these kind of crashes as secondary crash in the content process after the gfx process crashed without crash reports being created, it spawns winfault.exe instead.
STR on Windows 10 x64 (no bug on Debian Testing): 
1. about:preferences > "when Nightly starts: Show home page"
2. browser.tabs.warnOnClose;true (=default)
3. you have multiple tabs open
4. gpu process force-enabled, webrender enabled, hardware accelleration (force-)enabled where possible
5. close window with X
6. the popup which asks you if you want to close all tabs is completely black inside, it seems the gpu process crashes and restarts, you may press enter to proceed and close those tabs and the browser.

Meldungs-ID 	Sendedatum
bp-0d5df607-92be-423f-bc09-02a2b0170521	21.05.2017	15:01
bp-fc2d5f64-4374-4ddd-90cb-30d580170521	21.05.2017	15:01
= [@ mozilla::layers::PWebRenderBridgeChild::SendCreate ]
= this bug

bp-e5c01613-c63b-42e5-8adf-c28ff0170521	21.05.2017	15:01
= [@ mozilla::layers::CrossProcessCompositorBridgeParent::AllocPWebRenderBridgeParent ]
= bug 1350404

bp-2fe29d7c-5579-4a47-a986-f75860170521	21.05.2017	15:01
= this bug
Another case where I saw this crash signature:

Windows 10 x64: crash while doing a Radeon RX480 driver update. (After my whole desktop turned black for a second when windows activated the new driver).
gpu process force-enabled, webrender enabled, hardware accelleration (force-)enabled where possible

Meldungs-ID 	Sendedatum
bp-897e9aee-9f80-435d-a90f-9acf90170521  21.05.2017	15:09
= [@ mozilla::gl::GLContextEGLFactory::Create ]
= bug 1341478 <- Please consider to reopen.

bp-8177c9fe-a27e-4313-9276-30e500170521	21.05.2017	15:08
= [@ mozilla::layers::PWebRenderBridgeChild::SendCreate ]
= this bug
Assignee: nobody → aosmond
We've had a big spike in Nightly 20170602030204. 87 crashes, making it the #3 Window topcrash.

Crash address is always 0x8, which suggests a null pointer dereference.
This is sort of a generic crash signature, it happens if the GPU process crashes and then the content process tries to reinit and then crashes itself. I described this flow in bug 1365009 comment 7 for anybody interested. I suspect the root cause of this big spike is the shader compilation failure in bug 1370104 which will crash the GPU process.
Actually upon further investigation I don't see why the content process would fail in this way. I tried killing the GPU process on linux+webrender and it reinitialized fine. I'll try some more things.
At least some of them are related to this:

https://crash-stats.mozilla.com/report/index/1b833ecd-df38-45c7-9937-a72dd0170604

Where the parent compositor is missing from the table. What I was thinking on trying is making the content process reinit fail gracefully on the IPC call itself, and reinitialize *that* process as well. But it doesn't fix why the parent compositor is missing, it *should* be added and from all of my local testing, it is.
(In reply to Andrew Osmond [:aosmond] from comment #12)
> At least some of them are related to this:
> 
> https://crash-stats.mozilla.com/report/index/1b833ecd-df38-45c7-9937-
> a72dd0170604
> 
> Where the parent compositor is missing from the table. What I was thinking
> on trying is making the content process reinit fail gracefully on the IPC
> call itself, and reinitialize *that* process as well. But it doesn't fix why
> the parent compositor is missing, it *should* be added and from all of my
> local testing, it is.

Agreed. I also tried on the latest windows nightly - I got GPU crashes from the shader compilation failures but then Firefox fell back to basic compositing and recovered. about:crashes lists 4 GPU process crashes and nothing else. So most likely it is a race condition somewhere that is causing the parent compositor to be missing from the table.
According to the documentation at [1] the compositor should get instantiated as part of the SendEnsureLayersConnected codepath. However RenderFrameParent::EnsureLayersConnected has a bunch of early exits and I'm wondering if we're bailing out one of them. If so then there's no fallback to ensure the compositor gets created.

[1] http://searchfox.org/mozilla-central/rev/507743376d1ba753d14ab6b9305b7c6358570be8/gfx/ipc/GPUProcessManager.cpp#453-455
Tracking as a recent regression in 55.
Note that this is an opt-in feature that's disabled by default on Nightly and cannot be enabled on any other channel. So feel free to track it, but these crashes should not jeopardize the upcoming merge. If you want to eliminate the crashes anyway to prevent it masking other problems, we can make it so WebRender remains off regardless of the pref, but I would prefer to not do that.
Crashes together with bug 1372880 when clicking on an webextension icon inside the address bar with extensions.webextensions.remote;true.
[Tracking Requested - why for this release]: This should only affect WebRender builds. Given WebRender will not be enabled in this release (nor 56), I don't think we should be tracking this for 55. I scanned the crash reports, and there are no cases of this happening on 55 beta (as one would expect).
This is the #2 topcrash on Linux in Nightly 20170623100152. Over the past 7 days there have been 76 occurrences across 9 installations.
Blocks: 1381095
Status: NEW → ASSIGNED
Priority: -- → P3
Whiteboard: [gfx-noted]
Comment on attachment 8886672 [details] [diff] [review]
Gracefully handle failures in WebRenderLayerManager::Initialize to allow reinit., v1

Review of attachment 8886672 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/wr/WebRenderLayerManager.cpp
@@ +709,5 @@
>  
>  void
>  WebRenderLayerManager::SetLayerObserverEpoch(uint64_t aLayerObserverEpoch)
>  {
> +  if (WrBridge()->IPCOpen()) {

Don't you need to check if (WrBridge()) also here?

::: widget/PuppetWidget.h
@@ +182,5 @@
>                    LayersBackend aBackendHint = mozilla::layers::LayersBackend::LAYERS_NONE,
>                    LayerManagerPersistence aPersistence = LAYER_MANAGER_CURRENT) override;
>  
>    // This is used after a compositor reset.
> +  bool RecreateLayerManager(const std::function<bool(LayerManager*)>& aFunc);

This needs some documentation as to what aFunc is for. A better name might be good too.
Attachment #8886672 - Flags: review?(bugmail) → review+
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #22)
> Comment on attachment 8886672 [details] [diff] [review]
> Gracefully handle failures in WebRenderLayerManager::Initialize to allow
> reinit., v1
> 
> Review of attachment 8886672 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/layers/wr/WebRenderLayerManager.cpp
> @@ +709,5 @@
> >  
> >  void
> >  WebRenderLayerManager::SetLayerObserverEpoch(uint64_t aLayerObserverEpoch)
> >  {
> > +  if (WrBridge()->IPCOpen()) {
> 
> Don't you need to check if (WrBridge()) also here?
> 

No, but maybe yes if other things break in the future :). With this patch, we keep the old WebRenderLayerManager around longer than we used to, that's why I had to check if the IPC is still open here -- but we still have mWrChild, just a stale/closed one.

> ::: widget/PuppetWidget.h
> @@ +182,5 @@
> >                    LayersBackend aBackendHint = mozilla::layers::LayersBackend::LAYERS_NONE,
> >                    LayerManagerPersistence aPersistence = LAYER_MANAGER_CURRENT) override;
> >  
> >    // This is used after a compositor reset.
> > +  bool RecreateLayerManager(const std::function<bool(LayerManager*)>& aFunc);
> 
> This needs some documentation as to what aFunc is for. A better name might
> be good too.

Done.
Attachment #8886672 - Attachment is obsolete: true
Attachment #8887466 - Flags: review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/649a35887b30
Gracefully handle failures in WebRenderLayerManager::Initialize to allow reinit. r=kats
https://hg.mozilla.org/mozilla-central/rev/649a35887b30
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
See Also: → 1384548
You need to log in before you can comment on or make changes to this bug.