Closed Bug 1569131 Opened 5 years ago Closed 5 years ago

document splitting: Crash in [@ mozilla::layers::WebRenderBridgeParent::ProcessWebRenderParentCommands]

Categories

(Core :: Graphics: WebRender, defect, P3)

Desktop
All
defect

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox68 --- unaffected
firefox69 --- unaffected
firefox70 --- disabled

People

(Reporter: jan, Assigned: Gankra)

References

(Regression)

Details

(Keywords: crash, nightly-community, regression)

Crash Data

Attachments

(1 file)

Gankro got the same in bug 1568887 comment 4.

This bug is for crash report bp-576ff374-b47b-484e-94f2-86e4e0190726.

MOZ_RELEASE_ASSERT(mIsSome)

Top 10 frames of crashing thread:

0 libxul.so mozilla::layers::WebRenderBridgeParent::ProcessWebRenderParentCommands gfx/layers/wr/WebRenderBridgeParent.cpp
1 libxul.so mozilla::layers::WebRenderBridgeParent::RecvParentCommands gfx/layers/wr/WebRenderBridgeParent.cpp:1532
2 libxul.so mozilla::layers::PWebRenderBridgeParent::OnMessageReceived ipc/ipdl/PWebRenderBridgeParent.cpp:678
3 libxul.so mozilla::layers::PCompositorManagerParent::OnMessageReceived ipc/ipdl/PCompositorManagerParent.cpp:200
4 libxul.so mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2184
5 libxul.so mozilla::ipc::MessageChannel::RunMessage ipc/glue/MessageChannel.cpp:1955
6 libxul.so mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1986
7 libxul.so MessageLoop::DoWork ipc/chromium/src/base/message_loop.cc:523
8 libxul.so base::MessagePumpDefault::Run ipc/chromium/src/base/message_pump_default.cc:35
9 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:290

OS: Linux → All

HW_COMPOSITING
blocked by runtime: GPU Process is disabled, fallback to software solution.
GPU_PROCESS
failed by runtime: GPU process disabled after 4 attempts
WEBRENDER
opt-in by default: WebRender is an opt-in feature
available by user: Force enabled by pref
unavailable by runtime: GPU Process is disabled

Fehlerprotokoll
(#0) GP+[GFX1-]: DataSourceSurface of SharedSurfaces does not exist for extId:128849018901
(#52) Error Receive IPC close with reason=AbnormalShutdown
(#53) Error GPU process disabled after 4 attempts
(#54) Error Receive IPC close with reason=AbnormalShutdown
(#55) Error Receive IPC close with reason=AbnormalShutdown
(#56) CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#57) CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#58) CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#59) CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#60) CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#61) CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#62) CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#63) CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#64) CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#65) CP+[GFX1-]: Receive IPC close with reason=AbnormalShutdown
(#66) Error Compositors might be mixed (5,1)

Since the status are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.

According to build IDs and bug 1568887 comment 4 this regression was likely caused by bug 1547351.

Also this only happens with an off-by-default-even-on-nightly feature that we were trying to dogfood.

Gonna make this the main bug since it has the "real" cause (I think) of the crashes (the others are just failing to recover from this crash).

Crash Signature: [@ mozilla::layers::WebRenderBridgeParent::ProcessWebRenderParentCommands] → [@ mozilla::layers::WebRenderBridgeParent::ProcessWebRenderParentCommands] [@ RetainedDisplayListBuilder::MergeDisplayLists]
Crash Signature: [@ mozilla::layers::WebRenderBridgeParent::ProcessWebRenderParentCommands] [@ RetainedDisplayListBuilder::MergeDisplayLists] → [@ mozilla::layers::WebRenderBridgeParent::ProcessWebRenderParentCommands]
Crash Signature: [@ mozilla::layers::WebRenderBridgeParent::ProcessWebRenderParentCommands] → [@ mozilla::layers::WebRenderBridgeParent::ProcessWebRenderParentCommands] [@ RetainedDisplayListBuilder::MergeDisplayLists]

Jan, you haven't randomly found a really reliable way to trigger this, have you? It only seems to happen really randomly for me, and I haven't been able to reproduce it on a local build (with a fresh profile) at all. I wonder if it's tied to something addons are doing to pages?

Crash Signature: [@ mozilla::layers::WebRenderBridgeParent::ProcessWebRenderParentCommands] [@ RetainedDisplayListBuilder::MergeDisplayLists] → [@ mozilla::layers::WebRenderBridgeParent::ProcessWebRenderParentCommands] [@ RetainedDisplayListBuilder::MergeDisplayLists]

(nvm, seems to happen more reliably in debug builds, hooray)

Hey :aosmond, I'm need-info'ing you on this because I'm almost certainly going to ask you on monday about the TODO at the end of this comment. So feel free to ignore this or read this, but here's all the context for when that happens.


OK SO! Here's what I've figured out so far. First, context:

With the current design of document splitting a (non-root) WRBP may currently be associated with one of two documents: the "content" document or the "popover" document (the thing that shows up when you go fullscreen on e.g. a video, aiui). Which document you're associated with is tracked by mRenderRoot. (gecko calls Documents RenderRoots)

We started seeing this issue because of this change: https://phabricator.services.mozilla.com/D37078.

A significant part of that change was to make mRenderRoot into an Maybe. Essentially, in some situations a WRBP may exist and even have display lists associated with it, but not actually know what document it's associated with yet. (I need to remind myself exactly why this happens, but for now we will take it for granted). If this happens, it doesn't know what WebrenderAPI it's supposed to be talking to. In such a case it defers any work it needs to do to the Root WRBP, who possibly buffers it up until It Is Known. Note that some things are ultimately "API agnostic", and for those things the Root WRBP can just do it right away. Also note: a WRBP's Api starts as Nothing, and once it becomes Some it never changes back.

These comments describe this situation a bit: https://searchfox.org/mozilla-central/source/gfx/layers/wr/WebRenderBridgeParent.h#517


So, our crash: there are several places in the code where we call Api(renderRoot) to resolve our WebrenderAPI. For non-root WRBP's, the argument is ignored and this just resolves mRenderRoot. So if we call Api and mRenderRoot is still Nothing, then we crash. Specifically, we crash on the second line of WRBP::ProcessWebRenderParentCommands where we grab our Api to stuff it in an AutoTransactionSender.

Why is ProcessWebRenderParentCommands being called? I wasn't able to verify this, but our hunch is that our child is executing EndClearCachedResources. This theory is supported by the fact that all of the commands that I saw when I caught the crash were ReleaseTextureOfImage. So a content process is getting convinced that the parent has some images, tries to tell it to clear them, and then we blow up trying to grab our non-existent API.

Note that WRBP::RecvClearCachedResources, which is called from WRBC::BeginClearCachedResources and is supposed to purge any display lists (scenes?) that might reference the resources we want to purge, is aware of the problem of missing renderRoots, and properly handles the situation: https://searchfox.org/mozilla-central/source/gfx/layers/wr/WebRenderBridgeParent.cpp#1867,1878-1880

So it's possible everything that's happening here is totally fine and correct, we just need to change ProcessWebRenderParentCommands to handle the missing renderRoot case like RecvClearCachedResources does.

I'll look into this on monday.

Big TODO: workout how exactly the ImageKeys we're being told to delete got created in the first place. This will probably tell me exactly how we're supposed to handle ReleaseTextureOfImage when mRenderRoot doesn't exist yet. (I expect that the root WRBP just did it for us, but I'm fuzzy on the whole image pipeline.)

Flags: needinfo?(aosmond)

(In reply to Alexis Beingessner [:Gankro] from comment #8)

So, our crash: there are several places in the code where we call Api(renderRoot) to resolve our WebrenderAPI. For non-root WRBP's, the argument is ignored and this just resolves mRenderRoot. So if we call Api and mRenderRoot is still Nothing, then we crash.

Why is ProcessWebRenderParentCommands being called?

Is bug 1561675 (https://phabricator.services.mozilla.com/D33710?id=120057#inline-211076) about this?

Jan: yeah exactly, with that refactoring implemented it would be a lot clearer what can/can't be done without mRenderRoot.

Assignee: nobody → a.beingessner

We defer ParentCommands when they come in attached to a display list,
but we forgot to defer them when they're sent directly. This was
causing us to crash when we defer updates that register textures
and the content process then suddenly decides we need to delete the
textures and only tells us through RecvParentCommands. We have no
idea what the textures are yet, so we need to defer these commands
along with everything else so that it's processed in the intended order.

Flags: needinfo?(aosmond)

hrm, looks like I'm probably still hitting the hit testing version of the crash, and now it's leaving the browser in a wonky state instead of crashing all out.

Also, reasonably quick STR: load up a bunch of youtube videos in different tabs, and switch between them. Happens very quickly.

Ok so writing out what I'm now debugging.


STR:

  • start up a (debug) build with webrender and split-render-roots enabled
  • load up a bunch of youtube videos in different tabs
  • toggle through them over and over until you hit a GPU crash (window flickers black/white)

In release builds I'm still getting GPU process crashes, just somewhere else. I'm pretty sure the debug crash is more instructive since it's catching the problem earlier.


Problem description:

Something somewhere is causing APZ's and WR's understanding of the page to fall out of sync. Specifically I am seeing a crash in APZCTreeManager::GetAPZCAtPointWR when processing ReceiveMouseInput. We seemingly successfully call WebRenderAPI::HitTest and get back a pipelineId/scrollId, but GetTargetNode fails to resolve them and we crash on an assert that claims that this should only happen if scrollId==0 (it was 2 in my last crash).

This isn't a particularly surprising bug since document splitting makes it significantly more complex to keep APZ/WR in sync, but it is a difficult one for me to debug. I'm not really sure where to begin, or how to try to isolate the issue. Something's getting out of sync... somehow.

Working under the theory that this was also caused by this patch probably helps isolate the problem a bit. Certainly deferring WRBP messages needs to properly cooperate with epoch management. I'm just very new to the epoch management code, so I'm not sure what to look for.

While walking through some of the APZ update queue code, I ran across this comment that was a bit troubling, since it specifically calls out this approach as something which could mess up update order, but asserts it's fine since we only have two RenderRoots -- except we have 3 now. Although I wouldn't expect this to be an issue because the new third RenderRoot is just the "you're fullscreen now" overlay, which I don't ever trigger in my STR.

ni?sotaro -- you reviewed a lot of the work here, any ideas for how to proceed on the hunt for this bug?

Flags: needinfo?(sotaro.ikeda.g)
Depends on: 1570593

(In reply to Alexis Beingessner [:Gankro] from comment #14)


STR:

  • start up a (debug) build with webrender and split-render-roots enabled
  • load up a bunch of youtube videos in different tabs
  • toggle through them over and over until you hit a GPU crash (window flickers black/white)

Today, I tried the above STR. I saw the 2 asserts

  • [1] Asserts around external image lock failure
  • [2] Asserts during apz handling

[1] was addressed by Bug 1570593 for me.

Currently very suspicious of the code in RecvEmptyTransaction.

https://searchfox.org/mozilla-central/source/gfx/layers/wr/WebRenderBridgeParent.cpp#1497-1503

Normally while processing the RenderRootUpdates we determine if we're allowed to immediately NotifyPipelineRendered to the CompositorBridgeParent. But when we defer processing RenderRootUpates, we skip the loop that checks for that, and default to assuming it's fine to Notify.

I don't know this code well but it's extremely believable to me that this could cause Gecko and WR to desync on the state of the APZ tree.

Doug has got me relatively convinced the APZ desync is just unrelated bitrotting, so I want to land this patch now so things are less crashy and it's easier to isolate the APZ issue.

Keywords: leave-open
Pushed by abeingessner@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f6e400fb14a2
Also defer RecvParentCommands. r=jrmuizel

actually no wait i do want to just close this one, and just start fresh on a new bug.

Keywords: leave-open
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Flags: needinfo?(sotaro.ikeda.g)
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: