Closed Bug 1256517 Opened 8 years ago Closed 8 years ago

EXCEPTION_BREAKPOINT crashes in mozilla::layers::BasicCompositor::DrawQuad related to device resets

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla48

Tracking Flags:

Tracking

Status

firefox47

---

fixed

firefox48

---

fixed

People

(Reporter: dbaron, Assigned: dvander)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(4 files, 2 obsolete files)

part 1, acknowledge compositor resets 8 years ago David Anderson [:dvander] - inactive, e-mail if emergency 10.36 KB, patch		Details \| Diff \| Splinter Review
part 1, acknowledge compositor resets 8 years ago David Anderson [:dvander] - inactive, e-mail if emergency 11.18 KB, patch	mattwoodrow : review+ ritu : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
part 3, block stale texture creation 8 years ago David Anderson [:dvander] - inactive, e-mail if emergency 10.35 KB, patch	mattwoodrow : review+ nical : review-	Details \| Diff \| Splinter Review
part 2, refactor CompositableOperation 8 years ago David Anderson [:dvander] - inactive, e-mail if emergency 22.69 KB, patch	mattwoodrow : review+ ritu : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
part 4, block stale compositable updates 8 years ago David Anderson [:dvander] - inactive, e-mail if emergency 18.04 KB, patch	mattwoodrow : review+ nical : review+ ritu : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
part 3, block stale texture creation 8 years ago David Anderson [:dvander] - inactive, e-mail if emergency 6.40 KB, patch	nical : review+ ritu : approval-mozilla-aurora+	Details \| Diff \| Splinter Review

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Description

•

8 years ago

This bug was filed from the Socorro interface and is 
report bp-51594f25-9661-4b27-ae7f-c41412160314.
=============================================================

Related to bug 1255711, but at lower frequency:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-01-01&reason=%3DEXCEPTION_BREAKPOINT&signature=mozilla%3A%3Alayers%3A%3ABasicCompositor%3A%3ADrawQuad&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#aggregations
is a topcrash that's been happening in nightly at decent frequency since about March 2, although there were a few crashes before then (although not much before).

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Updated

•

8 years ago

Flags: needinfo?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 1

•

8 years ago

It looks like we've reset the compositor after a device reset, but we're trying to composite an old layer tree. This should be safe in theory, but there are prerelease asserts guarding that it doesn't happen, so I guess we should block compositing until the root of the tree changes or something.

Assignee: nobody → dvander

Status: NEW → ASSIGNED

Flags: needinfo?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 2

•

8 years ago

I can reproduce this (and other asserts) by flipping gfx.testing.device-fail and then gfx.testing.device-reset.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 3

•

8 years ago

Attached patch part 1, acknowledge compositor resets (obsolete) — Details — Splinter Review

This patch tracks which LayerTransactionParents have fully acknowledged compositor resets. Since I think it's possible to send a reset in between the layer ID being allocated and the PLayerTransactionParent getting allocated[?], this state is also tracked in LayerTreeState.

The next few patches will make use of this to deny texture updates.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Updated

•

8 years ago

Attachment #8730606 - Flags: review?(matt.woodrow)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 4

•

8 years ago

Attached patch part 1, acknowledge compositor resets — Details — Splinter Review

Same as previous patch with bug fix.

Attachment #8730606 - Attachment is obsolete: true

Attachment #8730606 - Flags: review?(matt.woodrow)

Attachment #8730892 - Flags: review?(matt.woodrow)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 5

•

8 years ago

Attached patch part 3, block stale texture creation (obsolete) — Details — Splinter Review

This patch should guarantee that we don't create new textures with stale compositor data. Unfortunately we can't just return null in AllocPTextureParent - it's an async message, so child and parent will have mismatched state. This will mess up deserialization later and cause the parent process to kill the child. (In fact, we should *never* return nullptr from this function, so we may want to go back and fix other cases later).

Instead, we now return a TextureParent that has no compositable manager and no TextureHost, and ensure in-process SurfaceDescriptor references are released early.

This breaks callers of TextureHost::AsTextureHost, which the next patch will address.

Attachment #8730895 - Flags: review?(matt.woodrow)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 6

•

8 years ago

Attached patch part 2, refactor CompositableOperation — Details — Splinter Review

This patch moves the always-duplicated "compositable" field out of compositable operations by wrapping sub-ops into a "CompositableOperationDetail" struct.

I'm not sure whether this was worth it, but it does simplify part 4 a bit.

Attachment #8731024 - Flags: review?(matt.woodrow)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 7

•

8 years ago

Attached patch part 4, block stale compositable updates — Details — Splinter Review

This patch ensures that we don't perform any operations on stale compositables, and do not attempt to attach any stale compositables to the layer tree. Together with part 3, and the fact that we already block compositing stale textures in ContainerLayerComposite, should close remaining holes where we could DrawQuad() something old.

With this I no longer hit any prerelease asserts when I switch a window from D3D11 to software.

Unfortunately video does stop working. I don't think media code handles device resets even under single-process mode, so that would be expected. Most of the asserts in comment #0's crash reports are from ImageHosts, where (1) ImageContainers are probably re-used across layer tree invalidations and (2) updates to async ImageHosts are much more frequent than normal layer tree updates, either of which would explain why asserts there were much more prevalent.

Anyway, I'll address that in a separate bug.

Attachment #8731027 - Flags: review?(matt.woodrow)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 8

•

8 years ago

Also, expanding on comment #7 - it is still possible to call PAllocTextureParent with the wrong layers backend. ImageBridge doesn't know about device resets, so it may try to allocate the wrong texture kind after we've acknowledged the compositor update. That's okay even though I'd like ot fix it later - at least now we won't be able to attach the texture anywhere.

Matt Woodrow (:mattwoodrow)

Comment 9

•

8 years ago

Comment on attachment 8730892 [details] [diff] [review]
part 1, acknowledge compositor resets

Review of attachment 8730892 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.h
@@ +241,5 @@
>    virtual bool RecvFlushRendering() override;
>    virtual bool RecvForcePresent() override;
>  
> +  virtual bool RecvAcknowledgeCompositorUpdate(const uint64_t& aLayersId) override {
> +    // This message is only sent cross-process.

Might as well assert here then.

::: gfx/layers/ipc/PCompositor.ipdl
@@ +98,5 @@
>     */
>    async RemotePluginsReady();
>  
> +  // Confirmation that the child has invalidated all its layers, and will not
> +  // request layers against and old compositor.

-d

Attachment #8730892 - Flags: review?(matt.woodrow) → review+

Matt Woodrow (:mattwoodrow)

Updated

•

8 years ago

Attachment #8731024 - Flags: review?(matt.woodrow) → review+

Matt Woodrow (:mattwoodrow)

Comment 10

•

8 years ago

Comment on attachment 8730895 [details] [diff] [review]
part 3, block stale texture creation

Review of attachment 8730895 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, but nical should be aware of this.

Attachment #8730895 - Flags: review?(nical.bugzilla)

Attachment #8730895 - Flags: review?(matt.woodrow)

Attachment #8730895 - Flags: review+

Matt Woodrow (:mattwoodrow)

Comment 11

•

8 years ago

Comment on attachment 8731027 [details] [diff] [review]
part 4, block stale compositable updates

Review of attachment 8731027 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/Compositor.h
@@ +518,5 @@
>  
> +  // A stale Compositor has no CompositorParent; it will not process frames
> +  // and should not be used.
> +  void SetStale();
> +  bool IsStale() const;

I would be ok with a slightly stronger name than 'Stale' too.

::: gfx/layers/ipc/CompositableTransactionParent.cpp
@@ +74,5 @@
>  CompositableParentManager::ReceiveCompositableUpdate(const CompositableOperation& aEdit,
>                                                       EditReplyVector& replyv)
>  {
> +  // Ignore all operations on compositables created on stale compositors. We
> +  // return false because the child is unable to handle errors.

we return true. Note that we've already notified the child to invalidate all layers and start again, so we don't need it to handle this error either.

Attachment #8731027 - Flags: review?(nical.bugzilla)

Attachment #8731027 - Flags: review?(matt.woodrow)

Attachment #8731027 - Flags: review+

Nicolas Silva [:nical]

Comment 12

•

8 years ago

Comment on attachment 8730895 [details] [diff] [review]
part 3, block stale texture creation

Review of attachment 8730895 [details] [diff] [review]:
-----------------------------------------------------------------

R- because of the potential double-free and leaks, see comment below. I am not firmly opposed to this approach, but I would prefer that we create a TextureHost which is in a state where it will always return false in Lock, BindTextureSource, etc. The first reason is that we already do this when we receive a shmem and fail to map it on the parent process, so the code that uses TextureHost is already resistant in this kind of situation. The other reason is that depending on how the texture is setup, deallocation can happen in either the content or compositor process. In any case there's a handshake sequence that ensures one side never deallocates shared data while the other side might be using it and this patch introduces the possibility for this kind of race to happen. I think it would be easier to create the TextureHost around the shared data with an extra state that forces it to be invalid and let it do the deallocation logic, than plumb some deallocation logic for this special case ourselves. unless you add logic for all cases you will have double-deletes or leaks (depending on which side was initially responsible for using the data).

::: gfx/layers/composite/TextureHost.cpp
@@ +128,5 @@
> +      DIBTextureHost::CallRelease(aDesc.get_SurfaceDescriptorDIB());
> +      break;
> +    }
> +#endif
> +    case SurfaceDescriptor::TSurfaceDescriptorBuffer: {

BufferTextureHost can work with all compositor backends. We don't need to create an empty actor in this case I think.

@@ +134,5 @@
> +      const MemoryOrShmem& data = bufferDesc.data();
> +      if (data.type() == MemoryOrShmem::Tuintptr_t) {
> +        if (uint8_t* buffer = reinterpret_cast<uint8_t*>(data.get_uintptr_t())) {
> +          GfxMemoryImageReporter::WillFree(buffer);
> +          delete[] buffer;

You can only delete the buffer if the texture flags do not have TextureFlags::DEALLOCATE_CLIENT. Other wise the content process might try to deallocate the buffer too.

In the case of a ShmemTextureHost, we are leaking the shmem if the texture flags do not have TextureFlags::DEALLOCATE_CLIENT.

Attachment #8730895 - Flags: review?(nical.bugzilla) → review-

Nicolas Silva [:nical]

Comment 13

•

8 years ago

Comment on attachment 8731027 [details] [diff] [review]
part 4, block stale compositable updates

Review of attachment 8731027 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. You could generalize this to IsValid or something of the like, that would encompass any kind of reason the compositor gets into a state where it should not be used.

For the other patch, you could make sure all TextureHost implementations properly check that they have a valid compositor with the expected backend type and have them bail out early in the important methods if it's not the case.

Attachment #8731027 - Flags: review?(nical.bugzilla) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 14

•

8 years ago

Attached patch part 3, block stale texture creation — Details — Splinter Review

This adds a new TextureFlag that instructs TextureHosts not to access their underlying device. This shouldn't be strictly needed, but it does make things slightly safer. Part 4 prevents us from attaching these textures so they should never get used in the first place.

Unfortunately while writing this patch, I realized that GetD3D11Device is not thread-safe. I'll file a follow-up bug for that soon.

Attachment #8730895 - Attachment is obsolete: true

Attachment #8732570 - Flags: review?(nical.bugzilla)

Nicolas Silva [:nical]

Updated

•

8 years ago

Attachment #8732570 - Flags: review?(nical.bugzilla) → review+

Pulsebot

Comment 15

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/46866237f27b
https://hg.mozilla.org/integration/mozilla-inbound/rev/8e5cb6d1d1a0
https://hg.mozilla.org/integration/mozilla-inbound/rev/9d33171eb4e9
https://hg.mozilla.org/integration/mozilla-inbound/rev/27a8a01abf66

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 16

•

8 years ago

Backed out for bustage.

Backouts:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6d86e841099d
https://hg.mozilla.org/integration/mozilla-inbound/rev/db3309c8fe36
https://hg.mozilla.org/integration/mozilla-inbound/rev/712d5ccb76cd
https://hg.mozilla.org/integration/mozilla-inbound/rev/8127138e3146

Push with bustage: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=27a8a01abf66
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=24446301&repo=mozilla-inbound

10:50:57     INFO -  In file included from /builds/slave/m-in-lx-0000000000000000000000/build/src/obj-firefox/gfx/layers/Unified_cpp_gfx_layers6.cpp:20:0:
10:50:57     INFO -  /builds/slave/m-in-lx-0000000000000000000000/build/src/gfx/layers/ipc/LayerTransactionParent.cpp:992:84: error: macro "gfxDevCrash" passed 2 arguments, but takes just 1
10:50:57     INFO -       gfxDevCrash(LogReason::PAllocTextureBackendMismatch, "Texture backend is wrong");
10:50:57     INFO -                                                                                      ^
10:50:57     INFO -  gmake[5]: *** [Unified_cpp_gfx_layers6.o] Error 1
10:50:57     INFO -  gmake[5]: *** Waiting for unfinished jobs....

Flags: needinfo?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 17

•

8 years ago

For some reason my local build didn't see this typo as an error. Re-landing with that line fixed.

Flags: needinfo?(dvander)

Pulsebot

Comment 18

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/3b995a72599c
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb55f7b93b3d
https://hg.mozilla.org/integration/mozilla-inbound/rev/524be4de6a5a
https://hg.mozilla.org/integration/mozilla-inbound/rev/e63011d83825

Pulsebot

Comment 19

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/13d8c68828ca

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 20

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/3b995a72599c
https://hg.mozilla.org/mozilla-central/rev/eb55f7b93b3d
https://hg.mozilla.org/mozilla-central/rev/524be4de6a5a
https://hg.mozilla.org/mozilla-central/rev/e63011d83825
https://hg.mozilla.org/mozilla-central/rev/13d8c68828ca

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox48: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla48

Gregor Wagner [:gwagner]

Updated

•

8 years ago

Depends on: 1260391

Thomas Zimmermann [:tzimmermann] [:tdz]

Updated

•

8 years ago

Updated

•

8 years ago

Depends on: 1260441

Sotaro Ikeda [:sotaro]

Updated

•

8 years ago

Updated

•

8 years ago

Depends on: 1264341

Jared Wein [:jaws] (please needinfo? me)

Updated

•

8 years ago

No longer depends on: 1264341

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 21

•

8 years ago

Comment on attachment 8730892 [details] [diff] [review]
part 1, acknowledge compositor resets

Approval Request Comment
[Feature/regressing bug #]: bug 1245765
[User impact if declined]: Crashes/graphics problems on device resets
[Describe test coverage new/current, TreeHerder]: Nightly
[Risks and why]: bug 1245765 changed how we handled graphics resets to make them more reliable. However, that exposed further problems in our reset handling, and this bug addresses a few of those problems. These patches are low risk. They have been on Nightly for a while without any obvious crashes or new rendering problems as a result. We suspect these patches might help bug 1187466 as well.
[String/UUID change made/needed]:

Attachment #8730892 - Flags: approval-mozilla-aurora?

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 22

•

8 years ago

Comment on attachment 8731024 [details] [diff] [review]
part 2, refactor CompositableOperation

Approval Request Comment

See comment #21

Attachment #8731024 - Flags: approval-mozilla-aurora?

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 23

•

8 years ago

Comment on attachment 8731027 [details] [diff] [review]
part 4, block stale compositable updates

Approval Request Comment

See comment #21

Attachment #8731027 - Flags: approval-mozilla-aurora?

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 24

•

8 years ago

Comment on attachment 8732570 [details] [diff] [review]
part 3, block stale texture creation

Approval Request Comment

See comment #21

Attachment #8732570 - Flags: approval-mozilla-aurora?

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 25

•

8 years ago

I looked at the latest Nightly data for this crash signature, I can still see crashes from 04-14 build even though this landed in Nightly on 03-28, here is an example: https://crash-stats.mozilla.com/report/index/8162c15e-1c9c-48ab-aba6-344a72160415. Though the frequency of these crashes are extremely low on 48.0 I think it's still worthwhile uplifting to Aurora47.

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Updated

•

8 years ago

status-firefox47: --- → affected

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 26

•

8 years ago

Comment on attachment 8730892 [details] [diff] [review]
part 1, acknowledge compositor resets

Top crash fix that has baked on Nightly for a month, Aurora47+

Attachment #8730892 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Updated

•

8 years ago

Attachment #8731024 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Updated

•

8 years ago

Attachment #8731027 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Updated

•

8 years ago

Attachment #8732570 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 27

•

8 years ago

Given the data in crash-stats (see link in comment 0), it's not clear to me that the fix improved the problem for which this bug was filed.

Flags: needinfo?(dvander)

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 28

•

8 years ago

This doesn't appear to apply cleanly to aurora.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 29

•

8 years ago

(In reply to David Baron [:dbaron] ⌚️UTC-7 from comment #27)
> Given the data in crash-stats (see link in comment 0), it's not clear to me
> that the fix improved the problem for which this bug was filed.

Maybe not, it'll be easier to tell on Aurora. It does fix bugs I was able to reproduce locally that had the same stack.

Flags: needinfo?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 30

•

8 years ago

(In reply to Wes Kocher (:KWierso) from comment #28)
> This doesn't appear to apply cleanly to aurora.

Two of the files that it touched got renamed on 48 :( It applied okay by hand, though. Try run to just make sure:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7b3221061e15

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 31

•

8 years ago

 https://treeherder.mozilla.org/#/jobs?repo=try&revision=826d7a335e09

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 32

•

8 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/18a82d8398787ae952f027ee6a551daeadc2d21b
https://hg.mozilla.org/releases/mozilla-aurora/rev/a24cd849856c50cbaa01cf7fa514a9d328f222b5
https://hg.mozilla.org/releases/mozilla-aurora/rev/8be24a8a9cf659ec04111b64950b26836488adda
https://hg.mozilla.org/releases/mozilla-aurora/rev/3c3cb171a6ccc6dbda66bf513b0d33e131d6cfea

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 33

•

8 years ago

Backed out, of course, I accidentally qref'd into the try push and not the patch itself.

https://hg.mozilla.org/releases/mozilla-aurora/rev/9465609ef7a612b20d8d557deee38feaff32409b
https://hg.mozilla.org/releases/mozilla-aurora/rev/76a54862123c8d95d4245d777a779304f674df7b
https://hg.mozilla.org/releases/mozilla-aurora/rev/dce0cba4377facfc42f4bbdacfd5ca803803e047
https://hg.mozilla.org/releases/mozilla-aurora/rev/104bef115f31497dd411582328eef6ed375d0c72

New run to be safe: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4a5f1e95ea5

Ryan VanderMeulen [:RyanVM]

Comment 34

•

8 years ago

Looks like this was relanded by David.
https://hg.mozilla.org/releases/mozilla-aurora/rev/b577f69f24286bbadb9be2ba1345b2ff18f58338
https://hg.mozilla.org/releases/mozilla-aurora/rev/6a570bddbec9d6f941cca89c242c938508dfb0da
https://hg.mozilla.org/releases/mozilla-aurora/rev/d40958a0fe8511f927253b6d8b26766aa9870a86
https://hg.mozilla.org/releases/mozilla-aurora/rev/922334c70324e4295dcfcfafe6da4ae2e1154834

status-firefox47: affected → fixed

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

8 years ago