Closed Bug 1363126 Opened 3 years ago Closed 3 years ago

Unify device reset handling

Categories

(Core :: Graphics: Layers, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: dvander, Assigned: dvander)

References

Details

(Whiteboard: [gfx-noted])

Attachments

(7 files)

Currently, we have three ways of handling device resets:
 1. In-process resets, driven by gfxWindowsPlatform and propagated via PCompositorBridge.
 2. Remote resets, driven by the compositor and propagated via PGPU+PCompositorBridge.
 3. GPU process crashes, which propagate through PContent.

Methods #1 and #2 don't fully recreate the rendering stack - they create a new Compositor, swap it out of the compositor's LayerManager, and send a message asking tabs to repaint. There are a few problems with this approach. One is it's super complicated, and we still have bugs where we composite old textures/layers. Two, it doesn't work at all with WebRender or Advanced Layers.

If we can't re-acquire an accelerated device, we have to default back to software rendering which requires a LayerManagerComposite. There is no way to switch LayerManagers, and doing so looks very complicated. It'd have to propagate through AsyncCompositionManager, LayerTransactionParent, and LayerTreeState. And we'd be open to a whole new class of latent bugs.

For WR/AL, it makes the most sense to just tear down the LayerTransactionParent - and at the widget level, maybe even the PCompositorBridge. This would match what a GPU process kill does, and we already know that works. (This will also make TDR handling come full circle - it worked this way originally, but we changed it since killing layers wasn't reliable at the time.)
This switches the GPU process device reset path to shutdown and recreate widget and tab PLayerTransactions. In the widget case this also creates a new PCompositorBridge.

The only annoying part of this patch is we have to work around a message ordering problem. If we just try to create a new PLayerTransaction, the compositor will  reject the association since it will see two PLayerTransactions with the same layer id. This is due to the fact that ClientLayerManager is destroyed asynchronously, and even then, only when the last reference is dropped (which is not guaranteed by DestroyLayerManager).

I worked around this by adding a new sync shutdown message to PLayerTransaction, which we forcefully call before re-creating the layer manager. This synchronously triggers ActorDestroy on the compositor side, which disassociates the PLayerTransaction from the compositor.
Attachment #8866209 - Flags: review?(rhunt)
Comment on attachment 8866209 [details] [diff] [review]
part 1, switch remote case

Bill, r? for adding a new sync message. This is only used on D3D11 device reset, which is catastrophic and rare.
Attachment #8866209 - Flags: review?(wmccloskey)
Handle the in-process case. I tested this both with and without e10s and it seems to work.
Attachment #8866210 - Flags: review?(rhunt)
All this stuff is unused now.
Attachment #8866211 - Flags: review?(rhunt)
Add a crash annotation right before handling a device reset. We do keep a log in critical notes/app notes, but something direct and searchable would be nice.
Attachment #8866214 - Flags: review?(milan)
Device resets are very spammy. This at least stops all but the first "BeginFrame" message.
Attachment #8866229 - Flags: review?(rhunt)
There's a risk these patches stress a rare scenario, where a child layer tree sends updates to a destroyed PCompositorBridge. I think we do handle this for the most part, LayerTransactionParent will reject messages when the layer manager is destroyed.

But I'm hoping they also make resets much more stable... since we synchronously shut down *everything* now, there's much less risk that we'll composite or buffer upload to a dead device, and mixing devices should be almost impossible.
Comment on attachment 8866214 [details] [diff] [review]
part 4, crash annotations

Review of attachment 8866214 [details] [diff] [review]:
-----------------------------------------------------------------

This can run on any process?  Do we know that it works properly from the compositor process, given the !XRE_IsParentProcess() check in CrashReporter::AnnotateCrashReport()?
Attachment #8866214 - Flags: review?(milan) → review+
Yup, it works, the GPU process wouldn't have shipped otherwise. The parent process check is because child processes need to send their metadata over shmem.
Attachment #8866209 - Flags: review?(wmccloskey) → review+
Attachment #8866209 - Flags: review?(rhunt) → review+
Comment on attachment 8866210 [details] [diff] [review]
part 2, switch in-process case

Review of attachment 8866210 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/ipc/RemoteCompositorSession.h
@@ -32,5 @@
>               uint64_t aSeqNo,
>               TextureFactoryIdentifier* aOutIdentifier) override;
>    void Shutdown() override;
>  
> -  void NotifyDeviceReset(uint64_t aSeqNo);

nit: I think this was meant to be removed in the next patch, the GPUProcessManager code isn't touched
Attachment #8866210 - Flags: review?(rhunt) → review+
Comment on attachment 8866211 [details] [diff] [review]
part 3, rm dead code

Beautiful
Attachment #8866211 - Flags: review?(rhunt) → review+
Attachment #8866229 - Flags: review?(rhunt) → review+
I missed some code to remove in part 3.
Attachment #8866977 - Flags: review?(rhunt)
This code was missing before, and though this whole mechanism is fairly bogus when compositors are mixed, it's better to have it than to not. It runs when we normally InitRenderingState on tabs.
Attachment #8866978 - Flags: review?(rhunt)
Attachment #8866977 - Flags: review?(rhunt) → review+
Attachment #8866978 - Flags: review?(rhunt) → review+
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/063adb4deaf5
Handle remote device resets by recreating the entire rendering stack. (bug 1363126 part 1, r=rhunt, ipc_r=billm)
https://hg.mozilla.org/integration/mozilla-inbound/rev/ff60b177a442
Handle in-process device resets by recreating the entire rendering stack. (bug 1363126 part 2, r=rhunt)
https://hg.mozilla.org/integration/mozilla-inbound/rev/b8d1153cb941
Remove old device reset and compositor swapping code. (bug 1363126 part 3, r=rhunt)
https://hg.mozilla.org/integration/mozilla-inbound/rev/96669e08cf8f
Add crash annotations before handling device resets. (bug 1363126 part 4, r=milan)
https://hg.mozilla.org/integration/mozilla-inbound/rev/d376fc7d074a
Cut down on compositor spam after a device reset. (bug 1363126 part 5, r=rhunt)
https://hg.mozilla.org/integration/mozilla-inbound/rev/c78b979c5c59
Make sure to re-identify TextureFactoryIdentifiers when reinitializing rendering. (bug 1363126 part 6, r=rhunt)
Depends on: 1364433
Comment on attachment 8866209 [details] [diff] [review]
part 1, switch remote case

Review of attachment 8866209 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/TabChild.cpp
@@ +3131,5 @@
> +
> +  RefPtr<LayerManager> lm = mPuppetWidget->GetLayerManager();
> +  ClientLayerManager* clm = lm->AsClientLayerManager();
> +  if (!clm) {
> +    return;

If lm is a WebRenderLayerManager, this is going to skip the ReinitRendering() call at the bottom of this function. Is that intentional?
Discussed this with kats on irc, but posting here for posterity.

This code path won't be run by WebRender because it always runs in the GPU process and the remote case is only triggered by CompositorD3D11. Which is a good question about what work needs to be done for a reset with ANGLE.
You need to log in before you can comment on or make changes to this bug.