Closed Bug 1245765 Opened 4 years ago Closed 4 years ago

Improve reliability of Firefox under device resets

Categories

(Core :: Graphics: Layers, defect)

Unspecified
Windows
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla47
Tracking Status
e10s m9+ ---
firefox47 --- fixed

People

(Reporter: dvander, Assigned: dvander)

References

Details

Attachments

(9 files, 2 obsolete files)

4.37 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
14.05 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
5.68 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
1.18 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
16.10 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
6.82 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
28.34 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
5.22 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
7.15 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
Our device reset (TDR) handling is still pretty flaky on E10s. I can easily reproduce problems with our fake-TDR pref or by installing a restartless driver update.

The main problem is that a device reset causes open windows to destroy and recreate their CompositorParent (this itself is untested and even asserts in a debug build, but seems like it works). This action is not explicitly mimicked on the child side. The ClientLayerManager, even if it invalidates everything, will happily continue to forward transactions to the old compositor.
See Also: → 1245843
This is to make iterating sIndirectLayerTrees about 2% easier.
Attachment #8717109 - Flags: review?(matt.woodrow)
The D3D9 compositor handles device resets locally, and sends an invalidation message to its CompositorChild. We should send an invalidation message to children with remote layers as well.

This is completely separate from D3D11 device resets which actually destroy the compositor. I renamed "InvalidateAll" to "InvalidateLayers" so this will be clearer later on.
Attachment #8717110 - Flags: review?(matt.woodrow)
The TDR testing pref sends a message to child processes, which isn't a function of our actual TDR handling code. This causes us to appear OK when in reality we're not handling true resets correctly.

This patch changes the TDR testing pref so that it only affects the parent process. Later it might make sense to add a separate pref for just child processes.
Attachment #8717116 - Flags: review?(matt.woodrow)
Attached patch part 4, remove "bogus" assert (obsolete) — Splinter Review
When we encounter a device reset, we call nsBaseWidget::DestroyLayerManager. This sets mLayerManager to null, calls mCompositorChild->Destroy, but does *not* set mCompositorParent or mCompositorChild to null.

Then, when we call CreateCompositor again, this assert fires because mCompositorChild and mCompositorParent are still lingering around.

One of the following must be true:
 (1) The assert is valid and we should be nullptr'ing in DestroyLayerManager.
 (2) The assert is valid and none of this is safe.
 (3) The assert is bogus and we can remove it.

Bas says #2 is not true. #3 seems most likely to me. What do you think, nical?
Attachment #8717121 - Flags: review?(nical.bugzilla)
Attached patch part 5, WIP (obsolete) — Splinter Review
This patch tries to address the issue in comment #0. Right before we start handling a device reset in the parent process, we have the CompositorParent broadcast an "InvalidateLayers" message to all remote layers. Each TabChild then throws away its ClientLayerManager. When a tab wants to paint again, it must request a new Layers ID and PLayerTransaction.
Comment on attachment 8717109 [details] [diff] [review]
part 1, indirect layers iterator

Review of attachment 8717109 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +125,5 @@
>    }
>  }
>  
> +// The indirect layer tree lock must be held before calling this function.
> +// Callback should take (LayerTreeState* aState, const uint64_t& aLayersId)

This comment would be better in the header file.
Attachment #8717109 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8717110 [details] [diff] [review]
part 2, invalidate d3d9 remote layers

Review of attachment 8717110 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.h
@@ +321,5 @@
> +   * invalidated to clear any cached TextureSources.
> +   *
> +   * This must be called on the compositor thread.
> +   */
> +  void SendInvalidateLayers();

I'd prefer just 'InvalidateLayers'. The 'Send' prefix is usually used for autogenerated ipdl methods.
Attachment #8717110 - Flags: review?(matt.woodrow) → review+
Attachment #8717116 - Flags: review?(matt.woodrow) → review+
OS: Unspecified → Windows
Comment on attachment 8717124 [details] [diff] [review]
part 5, WIP

This approach - migrating to a new layers id - doesn't really work and there are subtle races involved with opening new tabs. The destroy-compositor approach might still be salvageable if we instead reassociated the existing layers id with the new CompositorParent, and skipped creating a new PLayerTransaction. But we'd have to be careful, racing with WillStop or tab creation still looks complicated.

I'd like to see if we can get away with  destroying *only* the Compositor, and not the CompositorParent. LayerManagerComposite, PCompositor, and PLayerTransaction are all backend independent (in theory). We should be able to invalidate layers, swap out the compositor reference, and broadcast a new TextureFactoryIdentifier. It might be simpler and safer than destroying the entire compositor stack.
Attachment #8717124 - Attachment is obsolete: true
Attachment #8717121 - Attachment is obsolete: true
Attachment #8717121 - Flags: review?(nical.bugzilla)
Randomly: I have a feeling that a few of the issues are due to us coming back from a reset without D2D or  DWrite support, and trouble ensues in gfxDWriteFont code.
If we lose the D3D11 device, we may still have LAYERS_D3D11 in texture factory identifiers. Those callsites need to check that there is a valid device first. Luckily they all do this except for the path async drawing uses.
Attachment #8718643 - Flags: review?(matt.woodrow)
Attachment #8718643 - Flags: review?(matt.woodrow) → review+
I noticed a lot of unnecessary static_casts while touching this code. This patch removes static_casting to CrossProcessCompositorParent and ClientLayerManager.
Attachment #8719636 - Flags: review?(matt.woodrow)
This factors compositor initialization out from LayerManagerComposite, so we can reuse it for device resets.
Attachment #8719637 - Flags: review?(matt.woodrow)
Attachment #8719636 - Flags: review?(matt.woodrow) → review+
Attachment #8719637 - Flags: review?(matt.woodrow) → review+
This patch changes how resets are handled. Instead of destroying the CompositorParent and waiting for it to be recreated, we now force the CompositorParent to throw away its Compositor and create a new one. The new compositor is propagated to all layer managers. This avoids all the complexity involved with rebuilding the IPDL stack.

With this patch I now survive TDRs in all tabs on E10S. It needs more testing and there are potential followup issues to address (like preventing calls that might touch the old D3D11 device), but this is the bare minimum to get things moving.
Attachment #8719665 - Flags: review?(matt.woodrow)
Brad, something you want to block E10S (to a certain degree?)  Probably at least track it?
Flags: needinfo?(blassey.bugs)
tracking-e10s: --- → ?
Flags: needinfo?(blassey.bugs)
Comment on attachment 8719665 [details] [diff] [review]
part 7, new reset handling

Review of attachment 8719665 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/TabChild.h
@@ +572,5 @@
>                             const TimeStamp& aCompositeReqEnd);
>  
>    void ClearCachedResources();
>    void InvalidateLayers();
> +  void UpdateCompositor(const TextureFactoryIdentifier& aNewIdentifier);

I'd prefer CompositorUpdated() or similar as a name, since this is a notification, not a request.

::: gfx/layers/composite/LayerManagerComposite.cpp
@@ +1472,5 @@
> +void
> +LayerManagerComposite::ChangeCompositor(Compositor* aNewCompositor)
> +{
> +  mCompositor = aNewCompositor;
> +  mTextRenderer = new TextRenderer(aNewCompositor);

Do we need to make sure we dump any cached ContainerLayer surfaces too?

::: gfx/layers/ipc/CompositorParent.h
@@ +277,5 @@
> +   * Request that the compositor be recreated due to a shared device reset.
> +   * This must be called on the main thread, and blocks until a task posted
> +   * to the compositor thread has completed.
> +   */
> +  bool ResetCompositor(const nsTArray<LayersBackend>& aBackendHints,

As discussed, this posts a task directly (rather than using IPDL) to try get ahead as many IPDL messages as possible and reset the compositor immediately.

Can you please add a detailed comment about what we're doing, and why.
Attachment #8719665 - Flags: review?(matt.woodrow) → review+
Before notifying tabs that they should invalidate layers, we need to notify each content process that there is probably a device reset, otherwise nothing will cause us to recreate devices. A single message to each process would be best, but unfortunately there's no ordering between PContent and PCompositor. It's easier to just check the device every time a tab gets a CompositorUpdated message.

This exposes a pre-existing oddity: ClientLayerManager doesn't match the rendering device to the layer manager, so in theory if we have two layer managers (for example, one content process covering two windows), only the first might get invalidated. I'll address this in part 9.

Note that we could probably replace the SchedulePaintIfDeviceReset call with the new UpdateRenderModeIfDeviceReset. It looks like it's okay to flush gfxPlatform from within nsRefreshDriver::Tick, so there's no need to schedule and wait for a WM_PAINT. But I'll save that for another bug.
Attachment #8722910 - Flags: review?(matt.woodrow)
Comment on attachment 8722910 [details] [diff] [review]
part 8, reset content devices

Review of attachment 8722910 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxPlatform.h
@@ +612,5 @@
>      virtual void FlushContentDrawing() {}
>  
> +    // If a device reset has occurred, schedule any necessary paints in the
> +    // widget. This should only be used within nsRefreshDriver.
> +    virtual bool SchedulePaintIfDeviceReset() {

Any need for these two to be bools? Doesn't look like we check it.
Attachment #8722910 - Flags: review?(matt.woodrow) → review+
This covers the loophole mentioned earlier, where a client-side reset after layers have been built could result in invalid layers.
Attachment #8723381 - Flags: review?(matt.woodrow)
Attachment #8723381 - Flags: review?(matt.woodrow) → review+
David, what remaining work do we have to do here (as indicated by the leave-open tag)? Wondering if we can untrack this from e10s M9.
Flags: needinfo?(dvander)
Sorry, yup - nothing more planned here.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(dvander)
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
You need to log in before you can comment on or make changes to this bug.