Closed Bug 1256517 Opened 8 years ago Closed 8 years ago

EXCEPTION_BREAKPOINT crashes in mozilla::layers::BasicCompositor::DrawQuad related to device resets

Categories

(Core :: Graphics, defect)

Unspecified
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox47 --- fixed
firefox48 --- fixed

People

(Reporter: dbaron, Assigned: dvander)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(4 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-51594f25-9661-4b27-ae7f-c41412160314.
=============================================================

Related to bug 1255711, but at lower frequency:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-01-01&reason=%3DEXCEPTION_BREAKPOINT&signature=mozilla%3A%3Alayers%3A%3ABasicCompositor%3A%3ADrawQuad&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#aggregations
is a topcrash that's been happening in nightly at decent frequency since about March 2, although there were a few crashes before then (although not much before).
Flags: needinfo?(dvander)
It looks like we've reset the compositor after a device reset, but we're trying to composite an old layer tree. This should be safe in theory, but there are prerelease asserts guarding that it doesn't happen, so I guess we should block compositing until the root of the tree changes or something.
Assignee: nobody → dvander
Status: NEW → ASSIGNED
Flags: needinfo?(dvander)
I can reproduce this (and other asserts) by flipping gfx.testing.device-fail and then gfx.testing.device-reset.
This patch tracks which LayerTransactionParents have fully acknowledged compositor resets. Since I think it's possible to send a reset in between the layer ID being allocated and the PLayerTransactionParent getting allocated[?], this state is also tracked in LayerTreeState.

The next few patches will make use of this to deny texture updates.
Attachment #8730606 - Flags: review?(matt.woodrow)
Same as previous patch with bug fix.
Attachment #8730606 - Attachment is obsolete: true
Attachment #8730606 - Flags: review?(matt.woodrow)
Attachment #8730892 - Flags: review?(matt.woodrow)
This patch should guarantee that we don't create new textures with stale compositor data. Unfortunately we can't just return null in AllocPTextureParent - it's an async message, so child and parent will have mismatched state. This will mess up deserialization later and cause the parent process to kill the child. (In fact, we should *never* return nullptr from this function, so we may want to go back and fix other cases later).

Instead, we now return a TextureParent that has no compositable manager and no TextureHost, and ensure in-process SurfaceDescriptor references are released early.

This breaks callers of TextureHost::AsTextureHost, which the next patch will address.
Attachment #8730895 - Flags: review?(matt.woodrow)
This patch moves the always-duplicated "compositable" field out of compositable operations by wrapping sub-ops into a "CompositableOperationDetail" struct.

I'm not sure whether this was worth it, but it does simplify part 4 a bit.
Attachment #8731024 - Flags: review?(matt.woodrow)
This patch ensures that we don't perform any operations on stale compositables, and do not attempt to attach any stale compositables to the layer tree. Together with part 3, and the fact that we already block compositing stale textures in ContainerLayerComposite, should close remaining holes where we could DrawQuad() something old.

With this I no longer hit any prerelease asserts when I switch a window from D3D11 to software.

Unfortunately video does stop working. I don't think media code handles device resets even under single-process mode, so that would be expected. Most of the asserts in comment #0's crash reports are from ImageHosts, where (1) ImageContainers are probably re-used across layer tree invalidations and (2) updates to async ImageHosts are much more frequent than normal layer tree updates, either of which would explain why asserts there were much more prevalent.

Anyway, I'll address that in a separate bug.
Attachment #8731027 - Flags: review?(matt.woodrow)
Also, expanding on comment #7 - it is still possible to call PAllocTextureParent with the wrong layers backend. ImageBridge doesn't know about device resets, so it may try to allocate the wrong texture kind after we've acknowledged the compositor update. That's okay even though I'd like ot fix it later - at least now we won't be able to attach the texture anywhere.
Comment on attachment 8730892 [details] [diff] [review]
part 1, acknowledge compositor resets

Review of attachment 8730892 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.h
@@ +241,5 @@
>    virtual bool RecvFlushRendering() override;
>    virtual bool RecvForcePresent() override;
>  
> +  virtual bool RecvAcknowledgeCompositorUpdate(const uint64_t& aLayersId) override {
> +    // This message is only sent cross-process.

Might as well assert here then.

::: gfx/layers/ipc/PCompositor.ipdl
@@ +98,5 @@
>     */
>    async RemotePluginsReady();
>  
> +  // Confirmation that the child has invalidated all its layers, and will not
> +  // request layers against and old compositor.

-d
Attachment #8730892 - Flags: review?(matt.woodrow) → review+
Attachment #8731024 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8730895 [details] [diff] [review]
part 3, block stale texture creation

Review of attachment 8730895 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, but nical should be aware of this.
Attachment #8730895 - Flags: review?(nical.bugzilla)
Attachment #8730895 - Flags: review?(matt.woodrow)
Attachment #8730895 - Flags: review+
Comment on attachment 8731027 [details] [diff] [review]
part 4, block stale compositable updates

Review of attachment 8731027 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/Compositor.h
@@ +518,5 @@
>  
> +  // A stale Compositor has no CompositorParent; it will not process frames
> +  // and should not be used.
> +  void SetStale();
> +  bool IsStale() const;

I would be ok with a slightly stronger name than 'Stale' too.

::: gfx/layers/ipc/CompositableTransactionParent.cpp
@@ +74,5 @@
>  CompositableParentManager::ReceiveCompositableUpdate(const CompositableOperation& aEdit,
>                                                       EditReplyVector& replyv)
>  {
> +  // Ignore all operations on compositables created on stale compositors. We
> +  // return false because the child is unable to handle errors.

we return true. Note that we've already notified the child to invalidate all layers and start again, so we don't need it to handle this error either.
Attachment #8731027 - Flags: review?(nical.bugzilla)
Attachment #8731027 - Flags: review?(matt.woodrow)
Attachment #8731027 - Flags: review+
Comment on attachment 8730895 [details] [diff] [review]
part 3, block stale texture creation

Review of attachment 8730895 [details] [diff] [review]:
-----------------------------------------------------------------

R- because of the potential double-free and leaks, see comment below. I am not firmly opposed to this approach, but I would prefer that we create a TextureHost which is in a state where it will always return false in Lock, BindTextureSource, etc. The first reason is that we already do this when we receive a shmem and fail to map it on the parent process, so the code that uses TextureHost is already resistant in this kind of situation. The other reason is that depending on how the texture is setup, deallocation can happen in either the content or compositor process. In any case there's a handshake sequence that ensures one side never deallocates shared data while the other side might be using it and this patch introduces the possibility for this kind of race to happen. I think it would be easier to create the TextureHost around the shared data with an extra state that forces it to be invalid and let it do the deallocation logic, than plumb some deallocation logic for this special case ourselves. unless you add logic for all cases you will have double-deletes or leaks (depending on which side was initially responsible for using the data).

::: gfx/layers/composite/TextureHost.cpp
@@ +128,5 @@
> +      DIBTextureHost::CallRelease(aDesc.get_SurfaceDescriptorDIB());
> +      break;
> +    }
> +#endif
> +    case SurfaceDescriptor::TSurfaceDescriptorBuffer: {

BufferTextureHost can work with all compositor backends. We don't need to create an empty actor in this case I think.

@@ +134,5 @@
> +      const MemoryOrShmem& data = bufferDesc.data();
> +      if (data.type() == MemoryOrShmem::Tuintptr_t) {
> +        if (uint8_t* buffer = reinterpret_cast<uint8_t*>(data.get_uintptr_t())) {
> +          GfxMemoryImageReporter::WillFree(buffer);
> +          delete[] buffer;

You can only delete the buffer if the texture flags do not have TextureFlags::DEALLOCATE_CLIENT. Other wise the content process might try to deallocate the buffer too.

In the case of a ShmemTextureHost, we are leaking the shmem if the texture flags do not have TextureFlags::DEALLOCATE_CLIENT.
Attachment #8730895 - Flags: review?(nical.bugzilla) → review-
Comment on attachment 8731027 [details] [diff] [review]
part 4, block stale compositable updates

Review of attachment 8731027 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. You could generalize this to IsValid or something of the like, that would encompass any kind of reason the compositor gets into a state where it should not be used.

For the other patch, you could make sure all TextureHost implementations properly check that they have a valid compositor with the expected backend type and have them bail out early in the important methods if it's not the case.
Attachment #8731027 - Flags: review?(nical.bugzilla) → review+
This adds a new TextureFlag that instructs TextureHosts not to access their underlying device. This shouldn't be strictly needed, but it does make things slightly safer. Part 4 prevents us from attaching these textures so they should never get used in the first place.

Unfortunately while writing this patch, I realized that GetD3D11Device is not thread-safe. I'll file a follow-up bug for that soon.
Attachment #8730895 - Attachment is obsolete: true
Attachment #8732570 - Flags: review?(nical.bugzilla)
Attachment #8732570 - Flags: review?(nical.bugzilla) → review+
Backed out for bustage.

Backouts:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6d86e841099d
https://hg.mozilla.org/integration/mozilla-inbound/rev/db3309c8fe36
https://hg.mozilla.org/integration/mozilla-inbound/rev/712d5ccb76cd
https://hg.mozilla.org/integration/mozilla-inbound/rev/8127138e3146

Push with bustage: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=27a8a01abf66
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=24446301&repo=mozilla-inbound

10:50:57     INFO -  In file included from /builds/slave/m-in-lx-0000000000000000000000/build/src/obj-firefox/gfx/layers/Unified_cpp_gfx_layers6.cpp:20:0:
10:50:57     INFO -  /builds/slave/m-in-lx-0000000000000000000000/build/src/gfx/layers/ipc/LayerTransactionParent.cpp:992:84: error: macro "gfxDevCrash" passed 2 arguments, but takes just 1
10:50:57     INFO -       gfxDevCrash(LogReason::PAllocTextureBackendMismatch, "Texture backend is wrong");
10:50:57     INFO -                                                                                      ^
10:50:57     INFO -  gmake[5]: *** [Unified_cpp_gfx_layers6.o] Error 1
10:50:57     INFO -  gmake[5]: *** Waiting for unfinished jobs....
Flags: needinfo?(dvander)
For some reason my local build didn't see this typo as an error. Re-landing with that line fixed.
Flags: needinfo?(dvander)
Depends on: 1260391
Depends on: 1260441
See Also: → 1187466
No longer depends on: 1264341
Comment on attachment 8730892 [details] [diff] [review]
part 1, acknowledge compositor resets

Approval Request Comment
[Feature/regressing bug #]: bug 1245765
[User impact if declined]: Crashes/graphics problems on device resets
[Describe test coverage new/current, TreeHerder]: Nightly
[Risks and why]: bug 1245765 changed how we handled graphics resets to make them more reliable. However, that exposed further problems in our reset handling, and this bug addresses a few of those problems. These patches are low risk. They have been on Nightly for a while without any obvious crashes or new rendering problems as a result. We suspect these patches might help bug 1187466 as well.
[String/UUID change made/needed]:
Attachment #8730892 - Flags: approval-mozilla-aurora?
Comment on attachment 8731024 [details] [diff] [review]
part 2, refactor CompositableOperation

Approval Request Comment

See comment #21
Attachment #8731024 - Flags: approval-mozilla-aurora?
Comment on attachment 8731027 [details] [diff] [review]
part 4, block stale compositable updates

Approval Request Comment

See comment #21
Attachment #8731027 - Flags: approval-mozilla-aurora?
Comment on attachment 8732570 [details] [diff] [review]
part 3, block stale texture creation

Approval Request Comment

See comment #21
Attachment #8732570 - Flags: approval-mozilla-aurora?
I looked at the latest Nightly data for this crash signature, I can still see crashes from 04-14 build even though this landed in Nightly on 03-28, here is an example: https://crash-stats.mozilla.com/report/index/8162c15e-1c9c-48ab-aba6-344a72160415. Though the frequency of these crashes are extremely low on 48.0 I think it's still worthwhile uplifting to Aurora47.
Comment on attachment 8730892 [details] [diff] [review]
part 1, acknowledge compositor resets

Top crash fix that has baked on Nightly for a month, Aurora47+
Attachment #8730892 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8731024 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8731027 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8732570 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Given the data in crash-stats (see link in comment 0), it's not clear to me that the fix improved the problem for which this bug was filed.
Flags: needinfo?(dvander)
This doesn't appear to apply cleanly to aurora.
(In reply to David Baron [:dbaron] ⌚️UTC-7 from comment #27)
> Given the data in crash-stats (see link in comment 0), it's not clear to me
> that the fix improved the problem for which this bug was filed.

Maybe not, it'll be easier to tell on Aurora. It does fix bugs I was able to reproduce locally that had the same stack.
Flags: needinfo?(dvander)
(In reply to Wes Kocher (:KWierso) from comment #28)
> This doesn't appear to apply cleanly to aurora.

Two of the files that it touched got renamed on 48 :( It applied okay by hand, though. Try run to just make sure:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7b3221061e15
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: