Recreate GPU endpoints when the GPU process dies

RESOLVED FIXED in Firefox 52

Status

()

Core
Graphics: Layers
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: dvander, Assigned: dvander)

Tracking

(Blocks: 1 bug)

unspecified
mozilla52
Points:
---

Firefox Tracking Flags

(firefox52 fixed)

Details

Attachments

(4 attachments, 1 obsolete attachment)

The goal of this bug is to recreate the rendering stack when the GPU process dies. It will not have any logic for re-instantiating the GPU process - this is just about making the endpoints live again, and recreating all the Compositor/ImageBridge objects.
Created attachment 8789198 [details] [diff] [review]
part 1, send endpoints atomically

This sends mandatory endpoints to the compositor as a bundle rather than individual messages. Not strictly necessary but it makes resets easier, and it's nice to have the reset/init messages look the same.
Attachment #8789198 - Flags: review?(wmccloskey)
Created attachment 8789199 [details] [diff] [review]
part 2, reset widgets

This revokes RemoteCompositorSessions from widgets when the GPU process unexpectedly shuts down. Note, the GPU process is not respawned. When the next repaint occurs, the first widget to request a layer manager will end up creating an in-process compositor.

We'll address respawning in a separate bug - this is to just get re-rendering working at all.
Attachment #8789199 - Flags: review?(matt.woodrow)
Created attachment 8789200 [details] [diff] [review]
part 3, reset content WIP

This "works" in that we don't assert or crash (i.e., everything is shut down in an orderly fashion and recreated), but I'm not seeing paints appear yet. Probably something obvious is missing like rebuilding the LayerTransactionChild.
Comment on attachment 8789199 [details] [diff] [review]
part 2, reset widgets

Review of attachment 8789199 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/ipc/GPUProcessManager.cpp
@@ +246,5 @@
> +
> +  // Build a list of sessions to notify, since notification might delete
> +  // entries from the list.
> +  nsTArray<RefPtr<RemoteCompositorSession>> sessions;
> +  for (auto& session : mRemoteSessions) {

You can just copy construct/assign this to make a deep copy, or even use SwapElement to steal the list content. Hardly matters though.

::: gfx/ipc/RemoteCompositorSession.cpp
@@ +31,5 @@
> +{
> +  MOZ_ASSERT(!mCompositorBridgeChild);
> +  if (mCompositorBridgeChild) {
> +    // The compositor session was not properly shutdown.
> +    Shutdown();

Should we unregister the session in this case too?
Attachment #8789199 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8789199 [details] [diff] [review]
part 2, reset widgets

Review of attachment 8789199 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/ipc/RemoteCompositorSession.cpp
@@ +31,5 @@
> +{
> +  MOZ_ASSERT(!mCompositorBridgeChild);
> +  if (mCompositorBridgeChild) {
> +    // The compositor session was not properly shutdown.
> +    Shutdown();

Sorry, of course we don't need to, the GPUProcessManager holds a strong reference to us, so we can't hit the destructor while we're registered.

It seems like this code is dead too though, for the same reason. If we don't get Shutdown() called, then we'll be leaked.
Created attachment 8789652 [details] [diff] [review]
part 3, reset content bridges

This just re-establishes singleton bridges. Matt, could you look at the gfx changes?
Attachment #8789200 - Attachment is obsolete: true
Attachment #8789652 - Flags: review?(matt.woodrow)
Comment on attachment 8789652 [details] [diff] [review]
part 3, reset content bridges

Bill, could you look at the IPC/dom changes?
Attachment #8789652 - Flags: review?(wmccloskey)
Created attachment 8789654 [details] [diff] [review]
part 4, reconnect layers

With this patch, I can kill -s 9 the GPU process, and both the content and UI redraw immediately. That's cool. But this patch is kind of gross. When the GPU process dies, we lose the sIndirectLayerTrees map. This map associates PBrowsers (via layers ids) to a top-level window compositor. Thus, a CompositorBridge in a content process can host tabs across multiple windows.

We can't re-establish this without going through RenderFrameParent, since it knows the widget, and I don't really understand that code. It seems to have a lot of edge cases for various levels of connected-ness to graphics and DOM. I also don't know how well this works with inactive tabs or tabs that are in the middle of loading.

It's also kind of gross that we have to send multiple sync messages for every tab. On one hand it probably doesn't matter, they're tiny messages and the GPU process dying is a catastrophic event. On the other hand, it feels like this could be lazy.

Suggestions welcome on whether this could be better.
Attachment #8789654 - Flags: review?(wmccloskey)
Attachment #8789652 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8789198 [details] [diff] [review]
part 1, send endpoints atomically

Review of attachment 8789198 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/PContent.ipdl
@@ +463,5 @@
>  child:
>      // Give the content process its endpoints to the compositor.
> +    async InitRendering(
> +      Endpoint<PCompositorBridgeChild> compositor,
> +      Endpoint<PImageBridgeChild> bridge,

Maybe call it imageBridge?
Attachment #8789198 - Flags: review?(wmccloskey) → review+
Comment on attachment 8789652 [details] [diff] [review]
part 3, reset content bridges

Review of attachment 8789652 [details] [diff] [review]:
-----------------------------------------------------------------

I'd like to see more comments about how this is supposed to work. We've talked about it before, but I forget some of the details. And it would definitely help other people. Mostly I think you just need to say, at each point, what we know has happened so far. As an example, I tried to follow what CompositorBridgeChild::Destroy() will do, and I'm not sure. Do we know for sure whether or not ActorDestroy() has been called on the old CompositorBridgeChild by the time RecvReinitRendering has happened? That seems to have a big effect on what CompositorBridgeChild::Destroy does (via the mCanSend flag).

I'm also having trouble understand where the process is being restarted. The name OnCompositorRestarted() suggests that the restart has already happened, but I can't figure out where. And comment 0 says about this bug: "It will not have any logic for re-instantiating the GPU process". Is that still true? If so, how can you kill -9 the GPU process and have it paint again? Do we start using an in-process compositor? When do we actually create the CompositorBridgeParent in that case?

::: dom/ipc/PContent.ipdl
@@ +466,5 @@
>        Endpoint<PCompositorBridgeChild> compositor,
>        Endpoint<PImageBridgeChild> bridge,
>        Endpoint<PVRManagerChild> vr);
>  
> +    // Re-create the rendering stack using the given endpoints.

Please mention that this would happen when the GPU process crashes.
Comment on attachment 8789652 [details] [diff] [review]
part 3, reset content bridges

Review of attachment 8789652 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentParent.cpp
@@ +2381,5 @@
>    return true;
>  }
>  
>  void
> +ContentParent::OnCompositorRestarted()

Please add a comment somewhere here saying that, while we're creating the channels for a new compositor, we're not necessarily starting a new GPU process or anything like that. These channels may end up connecting to an in-process CrossProcessCompositorBridgeParent. And we may not create a CompositorBridgeParent for it until later.

::: dom/ipc/PContent.ipdl
@@ +466,5 @@
>        Endpoint<PCompositorBridgeChild> compositor,
>        Endpoint<PImageBridgeChild> bridge,
>        Endpoint<PVRManagerChild> vr);
>  
> +    // Re-create the rendering stack using the given endpoints.

Please mention that this would happen when the GPU process crashes.

::: gfx/ipc/GPUProcessManager.cpp
@@ +258,5 @@
>    }
> +
> +  // Notify content.
> +  for (const auto& listener : mListeners) {
> +    listener->OnCompositorRestarted();

Maybe OnCompositorUnexpectedShutdown?

::: gfx/layers/ipc/CompositorBridgeChild.cpp
@@ +200,5 @@
> +{
> +  MOZ_ASSERT(NS_IsMainThread());
> +
> +  if (RefPtr<CompositorBridgeChild> old = sCompositorBridge.forget()) {
> +    old->Destroy();

Please mention that ActorDestroy may or may not have been called on |old| at this point. Destroy needs to work either way. It may end up sending a message to the old GPU process. In that case we'll end up with MsgDropped, which will be ignored.

::: gfx/layers/ipc/ImageBridgeChild.cpp
@@ +963,5 @@
>  {
>    MOZ_ASSERT(NS_IsMainThread());
>  
>    if (RefPtr<ImageBridgeChild> child = GetSingleton()) {
>      child->WillShutdown();

Please make a similar comment to the one about CompositorBridgeChild? Also, it would be nice if we didn't send this message if !mCanSend.
Attachment #8789652 - Flags: review?(wmccloskey) → review+
Comment on attachment 8789654 [details] [diff] [review]
part 4, reconnect layers

Review of attachment 8789654 [details] [diff] [review]:
-----------------------------------------------------------------

This seems okay to me, but I don't feel qualified to review it. I think Matt should also have a look too (maybe you intended that based on the commit summary).

I don't care so much about efficiency here. I'm much more concerned about whether it will work if the original failure in the GPU process happens in the middle of initializing or shutting down something. But that's the sort of thing that you can only really discover in testing. Maybe we should try to fuzz this somehow by crashing the GPU process at random times?

::: dom/ipc/ContentChild.cpp
@@ +1199,5 @@
> +      tabChild->InvalidateLayers();
> +    }
> +  }
> +
> +  // Re-establish singelton bridges to the compositor.

singleton
Attachment #8789654 - Flags: review?(wmccloskey)
Attachment #8789654 - Flags: review?(matt.woodrow)
Attachment #8789654 - Flags: review+
Attachment #8789654 - Flags: review?(matt.woodrow) → review+
To help address comment #10 I've added a big step-by-step comment to GPUProcessManager::OnProcessUnexpectedShutdown that explains the exact sequence of events.

Comment 14

2 years ago
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/320c9cc8fc52
Send content compositor bridges atomically rather than individually. (bug 1300936 part 1, r=billm)
https://hg.mozilla.org/integration/mozilla-inbound/rev/5ac08fb77360
Recreate widget compositors when the GPU process dies. (bug 1300936 part 2, r=mattwoodrow)
https://hg.mozilla.org/integration/mozilla-inbound/rev/9a6f5fac405f
Recreate content compositor endpoints when the GPU process dies. (bug 1300936 part 3, r=mattwoodrow,billm)
https://hg.mozilla.org/integration/mozilla-inbound/rev/101e81606402
Recreate PLayerTransactions for TabChildren when the compositor restarts. (bug 1300936 part 4, r=mattwoodrow, r=billm)

Comment 15

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/320c9cc8fc52
https://hg.mozilla.org/mozilla-central/rev/5ac08fb77360
https://hg.mozilla.org/mozilla-central/rev/9a6f5fac405f
https://hg.mozilla.org/mozilla-central/rev/101e81606402
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox52: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
You need to log in before you can comment on or make changes to this bug.