Closed
Bug 1256517
Opened 9 years ago
Closed 9 years ago
EXCEPTION_BREAKPOINT crashes in mozilla::layers::BasicCompositor::DrawQuad related to device resets
Categories
(Core :: Graphics, defect)
Tracking
()
RESOLVED
FIXED
mozilla48
People
(Reporter: dbaron, Assigned: dvander)
References
Details
(Keywords: crash, topcrash)
Crash Data
Attachments
(4 files, 2 obsolete files)
11.18 KB,
patch
|
mattwoodrow
:
review+
ritu
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
22.69 KB,
patch
|
mattwoodrow
:
review+
ritu
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
18.04 KB,
patch
|
mattwoodrow
:
review+
nical
:
review+
ritu
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
6.40 KB,
patch
|
nical
:
review+
ritu
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-51594f25-9661-4b27-ae7f-c41412160314.
=============================================================
Related to bug 1255711, but at lower frequency:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-01-01&reason=%3DEXCEPTION_BREAKPOINT&signature=mozilla%3A%3Alayers%3A%3ABasicCompositor%3A%3ADrawQuad&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#aggregations
is a topcrash that's been happening in nightly at decent frequency since about March 2, although there were a few crashes before then (although not much before).
Reporter | ||
Updated•9 years ago
|
Flags: needinfo?(dvander)
Assignee | ||
Comment 1•9 years ago
|
||
It looks like we've reset the compositor after a device reset, but we're trying to composite an old layer tree. This should be safe in theory, but there are prerelease asserts guarding that it doesn't happen, so I guess we should block compositing until the root of the tree changes or something.
Assignee: nobody → dvander
Status: NEW → ASSIGNED
Flags: needinfo?(dvander)
Assignee | ||
Comment 2•9 years ago
|
||
I can reproduce this (and other asserts) by flipping gfx.testing.device-fail and then gfx.testing.device-reset.
Assignee | ||
Comment 3•9 years ago
|
||
This patch tracks which LayerTransactionParents have fully acknowledged compositor resets. Since I think it's possible to send a reset in between the layer ID being allocated and the PLayerTransactionParent getting allocated[?], this state is also tracked in LayerTreeState.
The next few patches will make use of this to deny texture updates.
Assignee | ||
Updated•9 years ago
|
Attachment #8730606 -
Flags: review?(matt.woodrow)
Assignee | ||
Comment 4•9 years ago
|
||
Same as previous patch with bug fix.
Attachment #8730606 -
Attachment is obsolete: true
Attachment #8730606 -
Flags: review?(matt.woodrow)
Attachment #8730892 -
Flags: review?(matt.woodrow)
Assignee | ||
Comment 5•9 years ago
|
||
This patch should guarantee that we don't create new textures with stale compositor data. Unfortunately we can't just return null in AllocPTextureParent - it's an async message, so child and parent will have mismatched state. This will mess up deserialization later and cause the parent process to kill the child. (In fact, we should *never* return nullptr from this function, so we may want to go back and fix other cases later).
Instead, we now return a TextureParent that has no compositable manager and no TextureHost, and ensure in-process SurfaceDescriptor references are released early.
This breaks callers of TextureHost::AsTextureHost, which the next patch will address.
Attachment #8730895 -
Flags: review?(matt.woodrow)
Assignee | ||
Comment 6•9 years ago
|
||
This patch moves the always-duplicated "compositable" field out of compositable operations by wrapping sub-ops into a "CompositableOperationDetail" struct.
I'm not sure whether this was worth it, but it does simplify part 4 a bit.
Attachment #8731024 -
Flags: review?(matt.woodrow)
Assignee | ||
Comment 7•9 years ago
|
||
This patch ensures that we don't perform any operations on stale compositables, and do not attempt to attach any stale compositables to the layer tree. Together with part 3, and the fact that we already block compositing stale textures in ContainerLayerComposite, should close remaining holes where we could DrawQuad() something old.
With this I no longer hit any prerelease asserts when I switch a window from D3D11 to software.
Unfortunately video does stop working. I don't think media code handles device resets even under single-process mode, so that would be expected. Most of the asserts in comment #0's crash reports are from ImageHosts, where (1) ImageContainers are probably re-used across layer tree invalidations and (2) updates to async ImageHosts are much more frequent than normal layer tree updates, either of which would explain why asserts there were much more prevalent.
Anyway, I'll address that in a separate bug.
Attachment #8731027 -
Flags: review?(matt.woodrow)
Assignee | ||
Comment 8•9 years ago
|
||
Also, expanding on comment #7 - it is still possible to call PAllocTextureParent with the wrong layers backend. ImageBridge doesn't know about device resets, so it may try to allocate the wrong texture kind after we've acknowledged the compositor update. That's okay even though I'd like ot fix it later - at least now we won't be able to attach the texture anywhere.
Comment 9•9 years ago
|
||
Comment on attachment 8730892 [details] [diff] [review]
part 1, acknowledge compositor resets
Review of attachment 8730892 [details] [diff] [review]:
-----------------------------------------------------------------
::: gfx/layers/ipc/CompositorParent.h
@@ +241,5 @@
> virtual bool RecvFlushRendering() override;
> virtual bool RecvForcePresent() override;
>
> + virtual bool RecvAcknowledgeCompositorUpdate(const uint64_t& aLayersId) override {
> + // This message is only sent cross-process.
Might as well assert here then.
::: gfx/layers/ipc/PCompositor.ipdl
@@ +98,5 @@
> */
> async RemotePluginsReady();
>
> + // Confirmation that the child has invalidated all its layers, and will not
> + // request layers against and old compositor.
-d
Attachment #8730892 -
Flags: review?(matt.woodrow) → review+
Updated•9 years ago
|
Attachment #8731024 -
Flags: review?(matt.woodrow) → review+
Comment 10•9 years ago
|
||
Comment on attachment 8730895 [details] [diff] [review]
part 3, block stale texture creation
Review of attachment 8730895 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, but nical should be aware of this.
Attachment #8730895 -
Flags: review?(nical.bugzilla)
Attachment #8730895 -
Flags: review?(matt.woodrow)
Attachment #8730895 -
Flags: review+
Comment 11•9 years ago
|
||
Comment on attachment 8731027 [details] [diff] [review]
part 4, block stale compositable updates
Review of attachment 8731027 [details] [diff] [review]:
-----------------------------------------------------------------
::: gfx/layers/Compositor.h
@@ +518,5 @@
>
> + // A stale Compositor has no CompositorParent; it will not process frames
> + // and should not be used.
> + void SetStale();
> + bool IsStale() const;
I would be ok with a slightly stronger name than 'Stale' too.
::: gfx/layers/ipc/CompositableTransactionParent.cpp
@@ +74,5 @@
> CompositableParentManager::ReceiveCompositableUpdate(const CompositableOperation& aEdit,
> EditReplyVector& replyv)
> {
> + // Ignore all operations on compositables created on stale compositors. We
> + // return false because the child is unable to handle errors.
we return true. Note that we've already notified the child to invalidate all layers and start again, so we don't need it to handle this error either.
Attachment #8731027 -
Flags: review?(nical.bugzilla)
Attachment #8731027 -
Flags: review?(matt.woodrow)
Attachment #8731027 -
Flags: review+
Comment 12•9 years ago
|
||
Comment on attachment 8730895 [details] [diff] [review]
part 3, block stale texture creation
Review of attachment 8730895 [details] [diff] [review]:
-----------------------------------------------------------------
R- because of the potential double-free and leaks, see comment below. I am not firmly opposed to this approach, but I would prefer that we create a TextureHost which is in a state where it will always return false in Lock, BindTextureSource, etc. The first reason is that we already do this when we receive a shmem and fail to map it on the parent process, so the code that uses TextureHost is already resistant in this kind of situation. The other reason is that depending on how the texture is setup, deallocation can happen in either the content or compositor process. In any case there's a handshake sequence that ensures one side never deallocates shared data while the other side might be using it and this patch introduces the possibility for this kind of race to happen. I think it would be easier to create the TextureHost around the shared data with an extra state that forces it to be invalid and let it do the deallocation logic, than plumb some deallocation logic for this special case ourselves. unless you add logic for all cases you will have double-deletes or leaks (depending on which side was initially responsible for using the data).
::: gfx/layers/composite/TextureHost.cpp
@@ +128,5 @@
> + DIBTextureHost::CallRelease(aDesc.get_SurfaceDescriptorDIB());
> + break;
> + }
> +#endif
> + case SurfaceDescriptor::TSurfaceDescriptorBuffer: {
BufferTextureHost can work with all compositor backends. We don't need to create an empty actor in this case I think.
@@ +134,5 @@
> + const MemoryOrShmem& data = bufferDesc.data();
> + if (data.type() == MemoryOrShmem::Tuintptr_t) {
> + if (uint8_t* buffer = reinterpret_cast<uint8_t*>(data.get_uintptr_t())) {
> + GfxMemoryImageReporter::WillFree(buffer);
> + delete[] buffer;
You can only delete the buffer if the texture flags do not have TextureFlags::DEALLOCATE_CLIENT. Other wise the content process might try to deallocate the buffer too.
In the case of a ShmemTextureHost, we are leaking the shmem if the texture flags do not have TextureFlags::DEALLOCATE_CLIENT.
Attachment #8730895 -
Flags: review?(nical.bugzilla) → review-
Comment 13•9 years ago
|
||
Comment on attachment 8731027 [details] [diff] [review]
part 4, block stale compositable updates
Review of attachment 8731027 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good. You could generalize this to IsValid or something of the like, that would encompass any kind of reason the compositor gets into a state where it should not be used.
For the other patch, you could make sure all TextureHost implementations properly check that they have a valid compositor with the expected backend type and have them bail out early in the important methods if it's not the case.
Attachment #8731027 -
Flags: review?(nical.bugzilla) → review+
Assignee | ||
Comment 14•9 years ago
|
||
This adds a new TextureFlag that instructs TextureHosts not to access their underlying device. This shouldn't be strictly needed, but it does make things slightly safer. Part 4 prevents us from attaching these textures so they should never get used in the first place.
Unfortunately while writing this patch, I realized that GetD3D11Device is not thread-safe. I'll file a follow-up bug for that soon.
Attachment #8730895 -
Attachment is obsolete: true
Attachment #8732570 -
Flags: review?(nical.bugzilla)
Updated•9 years ago
|
Attachment #8732570 -
Flags: review?(nical.bugzilla) → review+
Comment 15•9 years ago
|
||
Comment 16•9 years ago
|
||
Backed out for bustage.
Backouts:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6d86e841099d
https://hg.mozilla.org/integration/mozilla-inbound/rev/db3309c8fe36
https://hg.mozilla.org/integration/mozilla-inbound/rev/712d5ccb76cd
https://hg.mozilla.org/integration/mozilla-inbound/rev/8127138e3146
Push with bustage: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=27a8a01abf66
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=24446301&repo=mozilla-inbound
10:50:57 INFO - In file included from /builds/slave/m-in-lx-0000000000000000000000/build/src/obj-firefox/gfx/layers/Unified_cpp_gfx_layers6.cpp:20:0:
10:50:57 INFO - /builds/slave/m-in-lx-0000000000000000000000/build/src/gfx/layers/ipc/LayerTransactionParent.cpp:992:84: error: macro "gfxDevCrash" passed 2 arguments, but takes just 1
10:50:57 INFO - gfxDevCrash(LogReason::PAllocTextureBackendMismatch, "Texture backend is wrong");
10:50:57 INFO - ^
10:50:57 INFO - gmake[5]: *** [Unified_cpp_gfx_layers6.o] Error 1
10:50:57 INFO - gmake[5]: *** Waiting for unfinished jobs....
Flags: needinfo?(dvander)
Assignee | ||
Comment 17•9 years ago
|
||
For some reason my local build didn't see this typo as an error. Re-landing with that line fixed.
Flags: needinfo?(dvander)
Comment 18•9 years ago
|
||
Comment 19•9 years ago
|
||
Comment 20•9 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/3b995a72599c
https://hg.mozilla.org/mozilla-central/rev/eb55f7b93b3d
https://hg.mozilla.org/mozilla-central/rev/524be4de6a5a
https://hg.mozilla.org/mozilla-central/rev/e63011d83825
https://hg.mozilla.org/mozilla-central/rev/13d8c68828ca
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Assignee | ||
Comment 21•9 years ago
|
||
Comment on attachment 8730892 [details] [diff] [review]
part 1, acknowledge compositor resets
Approval Request Comment
[Feature/regressing bug #]: bug 1245765
[User impact if declined]: Crashes/graphics problems on device resets
[Describe test coverage new/current, TreeHerder]: Nightly
[Risks and why]: bug 1245765 changed how we handled graphics resets to make them more reliable. However, that exposed further problems in our reset handling, and this bug addresses a few of those problems. These patches are low risk. They have been on Nightly for a while without any obvious crashes or new rendering problems as a result. We suspect these patches might help bug 1187466 as well.
[String/UUID change made/needed]:
Attachment #8730892 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 22•9 years ago
|
||
Comment on attachment 8731024 [details] [diff] [review]
part 2, refactor CompositableOperation
Approval Request Comment
See comment #21
Attachment #8731024 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 23•9 years ago
|
||
Comment on attachment 8731027 [details] [diff] [review]
part 4, block stale compositable updates
Approval Request Comment
See comment #21
Attachment #8731027 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 24•9 years ago
|
||
Comment on attachment 8732570 [details] [diff] [review]
part 3, block stale texture creation
Approval Request Comment
See comment #21
Attachment #8732570 -
Flags: approval-mozilla-aurora?
I looked at the latest Nightly data for this crash signature, I can still see crashes from 04-14 build even though this landed in Nightly on 03-28, here is an example: https://crash-stats.mozilla.com/report/index/8162c15e-1c9c-48ab-aba6-344a72160415. Though the frequency of these crashes are extremely low on 48.0 I think it's still worthwhile uplifting to Aurora47.
status-firefox47:
--- → affected
Comment on attachment 8730892 [details] [diff] [review]
part 1, acknowledge compositor resets
Top crash fix that has baked on Nightly for a month, Aurora47+
Attachment #8730892 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8731024 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8731027 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8732570 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Reporter | ||
Comment 27•9 years ago
|
||
Given the data in crash-stats (see link in comment 0), it's not clear to me that the fix improved the problem for which this bug was filed.
Flags: needinfo?(dvander)
This doesn't appear to apply cleanly to aurora.
Assignee | ||
Comment 29•9 years ago
|
||
(In reply to David Baron [:dbaron] ⌚️UTC-7 from comment #27)
> Given the data in crash-stats (see link in comment 0), it's not clear to me
> that the fix improved the problem for which this bug was filed.
Maybe not, it'll be easier to tell on Aurora. It does fix bugs I was able to reproduce locally that had the same stack.
Flags: needinfo?(dvander)
Assignee | ||
Comment 30•9 years ago
|
||
(In reply to Wes Kocher (:KWierso) from comment #28)
> This doesn't appear to apply cleanly to aurora.
Two of the files that it touched got renamed on 48 :( It applied okay by hand, though. Try run to just make sure:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7b3221061e15
Assignee | ||
Comment 31•9 years ago
|
||
Assignee | ||
Comment 32•9 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/18a82d8398787ae952f027ee6a551daeadc2d21b
https://hg.mozilla.org/releases/mozilla-aurora/rev/a24cd849856c50cbaa01cf7fa514a9d328f222b5
https://hg.mozilla.org/releases/mozilla-aurora/rev/8be24a8a9cf659ec04111b64950b26836488adda
https://hg.mozilla.org/releases/mozilla-aurora/rev/3c3cb171a6ccc6dbda66bf513b0d33e131d6cfea
Assignee | ||
Comment 33•9 years ago
|
||
Backed out, of course, I accidentally qref'd into the try push and not the patch itself.
https://hg.mozilla.org/releases/mozilla-aurora/rev/9465609ef7a612b20d8d557deee38feaff32409b
https://hg.mozilla.org/releases/mozilla-aurora/rev/76a54862123c8d95d4245d777a779304f674df7b
https://hg.mozilla.org/releases/mozilla-aurora/rev/dce0cba4377facfc42f4bbdacfd5ca803803e047
https://hg.mozilla.org/releases/mozilla-aurora/rev/104bef115f31497dd411582328eef6ed375d0c72
New run to be safe: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4a5f1e95ea5
Comment 34•9 years ago
|
||
Looks like this was relanded by David.
https://hg.mozilla.org/releases/mozilla-aurora/rev/b577f69f24286bbadb9be2ba1345b2ff18f58338
https://hg.mozilla.org/releases/mozilla-aurora/rev/6a570bddbec9d6f941cca89c242c938508dfb0da
https://hg.mozilla.org/releases/mozilla-aurora/rev/d40958a0fe8511f927253b6d8b26766aa9870a86
https://hg.mozilla.org/releases/mozilla-aurora/rev/922334c70324e4295dcfcfafe6da4ae2e1154834
You need to log in
before you can comment on or make changes to this bug.
Description
•