Closed Bug 1207665 Opened 5 years ago Closed 5 years ago

startup crash in mozilla::layers::CompositorD3D11::GetTextureFactoryIdentifier()

Categories

(Core :: Graphics, defect)

41 Branch
x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
firefox41 + wontfix
firefox42 + fixed
firefox43 + fixed
firefox44 + fixed
relnote-firefox --- 41+

People

(Reporter: philipp, Assigned: dvander)

References

Details

(Keywords: crash)

Crash Data

Attachments

(5 files, 1 obsolete file)

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-10f6c07f-7aa4-4983-953c-4176c2150922.
=============================================================
0 	xul.dll 	mozilla::layers::CompositorD3D11::GetTextureFactoryIdentifier() 	gfx/layers/d3d11/CompositorD3D11.cpp
1 	xul.dll 	mozilla::layers::CompositorParent::AllocPLayerTransactionParent(nsTArray<mozilla::layers::LayersBackend> const&, unsigned __int64 const&, mozilla::layers::TextureFactoryIdentifier*, bool*) 	gfx/layers/ipc/CompositorParent.cpp
2 	xul.dll 	mozilla::layers::PCompositorParent::OnMessageReceived(IPC::Message const&, IPC::Message*&) 	obj-firefox/ipc/ipdl/PCompositorParent.cpp
3 	xul.dll 	mozilla::ipc::MessageChannel::DispatchSyncMessage(IPC::Message const&, IPC::Message*&) 	ipc/glue/MessageChannel.cpp
4 	xul.dll 	mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message const&) 	ipc/glue/MessageChannel.cpp
5 	xul.dll 	mozilla::ipc::MessageChannel::OnMaybeDequeueOne() 	ipc/glue/MessageChannel.cpp

this crash seems to spike after the 41 release, at the moment it is the #4 crasher in 41.0 and on top of the crash score board due to crashing on startup - many user comments also reflect that this happened after a firefox update.

the gpu adapter mainly affected in early data seems to be  the "Intel Media Accelerator 3150":
1 	0xa001 	164 	98.80 %
2 	0x0106 	1 	0.60 %
3 	0x0116 	1 	0.60 %

Adapter driver version facet
1 	8.14.10.2230 	104 	62.65 %
2 	8.14.10.2117 	59 	35.54 %
3 	8.15.10.2253 	1 	0.60 %
4 	8.15.10.2567 	1 	0.60 %
5 	9.17.10.2867 	1 	0.60 %
> 1 	0xa001 	164 	98.80 %

I bet this fell through the cracks of bug 1137716?
Flags: needinfo?(jmuizelaar)
(Just a note in case this comes up in Fx41 post-mortem discussions)

41 Nightly had 14 crashes
41 Aurora had 36 crashes
41 Beta had 231 crashes
41 Release has 195 crashes

At these volumes we would not have seen it in pre-release topcrash reports. This is likely an instance of where our pre-release audience is not indicative of our release audience.
That 3150 seems to be a nettop Atom device.  I can see something like that not showing up on nightly.
For some reason this adapter did not crash with this signature in versions before 41. Are we doing something differently now that would send it down this codepath?
David (Anderson), would you expect to have this caught in the startup testing?  Can we check telemetry if these devices showed up before, or just started showing up now?

David (Major) - let me dig through some late fixed bugs, see if something we recently fixed could have introduced this.
Flags: needinfo?(dvander)
(In reply to Milan Sreckovic [:milan] from comment #5)
> David (Anderson), would you expect to have this caught in the startup
> testing?  Can we check telemetry if these devices showed up before, or just
> started showing up now?

No, there is no startup crash guard here. In fact it's a self-induced crash[1]. Instead of crashing we should probably just make the compositor fail, if possible, and fallback to software.

What do you mean by "if these devices showed up before"? Notably, using the hardware search on the dashboard, device 0x0106 is 2.53% of the sample population and device 0x0116 is 4.51%. Device 0xa001 with 98% of the crashes is 0.21% of the population. So that seems like a good candidate for blocking.

[1] http://hg.mozilla.org/releases/mozilla-release/annotate/78c82e5cd777/gfx/layers/d3d11/CompositorD3D11.cpp#l422
Flags: needinfo?(dvander)
The metadata for the report says: "Failed to get SharedHandle for sync texture. Result: 0x0" and has a bunch of other random d3d11 error messages.
David, since you already got emotionally involved, how about a patch? :)
Assignee: nobody → dvander
This should be backported as far as reasonable. But if someone knows how to update the downloadable blocklist, that would work for Firefox 41 and older.

A downloadable blocklist update will not work for Firefox 42+ because of a bug that I'll post a patch for in a moment.
Attachment #8665630 - Flags: review?(jmuizelaar)
Whoops. This one is my fault due to the gfxWindowsPlatform refactoring, and I'm glad we caught it before Firefox 42 went to release. D2D 1.1 is disabled for WARP, but D2D 1.0 was not. This patch just moves the check further up.
Attachment #8665634 - Flags: review?(jmuizelaar)
Attachment #8665634 - Flags: review?(jmuizelaar) → review+
Comment on attachment 8665630 [details] [diff] [review]
part 1, block device for d3d/d2d

Review of attachment 8665630 [details] [diff] [review]:
-----------------------------------------------------------------

Any reason to not block the entire family?
                "a001",
                "a002",
                "a011",
                "a012"
(In reply to Jeff Muizelaar [:jrmuizel] from comment #13)
> Comment on attachment 8665630 [details] [diff] [review]
> part 1, block device for d3d/d2d
> 
> Review of attachment 8665630 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Any reason to not block the entire family?
>                 "a001",
>                 "a002",
>                 "a011",
>                 "a012"

We aren't seeing any crashes on the other devices. 0xa011 appears ~0.85% of the time in Telemetry which is kinda high for one device. Should I stick the driver versions in there as well?
Flags: needinfo?(jmuizelaar)
This moves the GetSharedHandle call to CompositorD3D11::Initialize, instead of GetTextureFactoryIdentifier. It does not remove the MOZ_CRASH though.
Attachment #8665769 - Flags: review?(bas)
I'd like to turn the MOZ_CRASH here into a return false (with a telemetry ping), but it looks like Compositor::Initialize returning false has some weird side effects. Lots of gfx error log spam and it looks like we continue to use D2D1. Investigating how safe this is first.
Attached patch part 4, fail gracefully (obsolete) — Splinter Review
Filed bug 1208638 to make failing gracefully more reliable.
Attachment #8666203 - Flags: review?(bas)
Comment on attachment 8666203 [details] [diff] [review]
part 4, fail gracefully

Vladan, this histogram is really a proxy for having dropped the MOZ_CRASH. We think this value being true is severe enough that we want to monitor it on all channels. Is a boolean histogram the right thing to do?
Attachment #8666203 - Flags: review?(vladan.bugzilla)
Comment on attachment 8665630 [details] [diff] [review]
part 1, block device for d3d/d2d

Review of attachment 8665630 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, please block the entire family.
Attachment #8665630 - Flags: review?(jmuizelaar) → review+
David, is part 1 the patch that will fix this startup crash for 41? As you may know, updates are disabled for FF41 due to this bug. I am hoping to build a 41 dot release (like 41.0.1) in the next 12 hrs or so and would appreciate it if you nominated the right patch for uplift to mozilla-release branch.
Flags: needinfo?(dvander)
(In reply to Ritu Kothari (:ritu) from comment #20)
> David, is part 1 the patch that will fix this startup crash for 41? As you
> may know, updates are disabled for FF41 due to this bug. I am hoping to
> build a 41 dot release (like 41.0.1) in the next 12 hrs or so and would
> appreciate it if you nominated the right patch for uplift to mozilla-release
> branch.

No, I didn't know that.
Flags: needinfo?(dvander)
w/ Jeff's comment, carrying r+ forward.

Approval Request Comment
[Feature/regressing bug #]: None
[User impact if declined]: Startup crashes on some low-end Intel video cards
[Describe test coverage new/current, TreeHerder]:
[Risks and why]: This is just a driver blocklist entry, so no risk. The video cards in this blocklist are about 1% of the population.
[String/UUID change made/needed]:
Attachment #8666610 - Flags: review+
Attachment #8666610 - Flags: approval-mozilla-release?
Attachment #8666610 - Flags: approval-mozilla-beta?
Attachment #8666610 - Flags: approval-mozilla-aurora?
Comment on attachment 8666610 [details] [diff] [review]
minimal fix for aurora/beta/release

Taking it for all branches.
Ritu, I am approving it for you because I know you want it for 41.0.1. Having it sooner will help with treeherder and potential rebasing issues.
Attachment #8666610 - Flags: approval-mozilla-release?
Attachment #8666610 - Flags: approval-mozilla-release+
Attachment #8666610 - Flags: approval-mozilla-beta?
Attachment #8666610 - Flags: approval-mozilla-beta+
Attachment #8666610 - Flags: approval-mozilla-aurora?
Attachment #8666610 - Flags: approval-mozilla-aurora+
Comment on attachment 8666610 [details] [diff] [review]
minimal fix for aurora/beta/release

This should land on nightly too but inbound is closed right now.
Attachment #8666610 - Flags: checkin?
Attachment #8666610 - Flags: checkin? → checkin+
Comment on attachment 8666203 [details] [diff] [review]
part 4, fail gracefully

Review of attachment 8666203 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/CompositorD3D11.cpp
@@ +1366,5 @@
>    if (FAILED(hr) || !mSyncHandle) {
>      gfxCriticalError() << "Failed to get SharedHandle for sync texture. Result: "
>                         << hexa(hr);
> +    NS_DispatchToMainThread(NS_NewRunnableFunction([] () -> void {
> +      Accumulate(Telemetry::D3D11_SYNC_HANDLE_FAILURE, true);

you never accumulate a false value?

::: toolkit/components/telemetry/Histograms.json
@@ +9658,5 @@
>      "description": "Attempt to notify ServiceWorker of push notification resubscription."
> +  },
> +  "D3D11_SYNC_HANDLE_FAILURE": {
> +    "alert_emails": ["bschouten@mozilla.com"],
> +    "expires_in_version": "never",

I don't like opt-out histograms that never expire, can you set it say for version 50 and then we can revisit its expiration then?
Attachment #8666203 - Flags: review?(vladan.bugzilla) → review-
Clarification: if you don't need to accumulate false values, then you should use a count histogram
Attachment #8666203 - Attachment is obsolete: true
Attachment #8666203 - Flags: review?(bas)
Attachment #8666960 - Flags: review?(vladan.bugzilla)
Comment on attachment 8666960 [details] [diff] [review]
part 4, fail gracefully

Review of attachment 8666960 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +9763,5 @@
> +    "alert_emails": ["bschouten@mozilla.com","danderson@mozilla.com","msreckovic@mozilla.com","ashughes@mozilla.com"],
> +    "expires_in_version": "50",
> +    "releaseChannelCollection": "opt-out",
> +    "kind": "count",
> +    "description": "True if the D3D11 compositor failed to get a texture sync handle."

Update the description (it's not a boolean histogram anymore)
Attachment #8666960 - Flags: review?(vladan.bugzilla) → review+
Attachment #8665769 - Flags: review?(bas) → review+
Attachment #8666960 - Flags: review?(bas) → review+
Comment on attachment 8665630 [details] [diff] [review]
part 1, block device for d3d/d2d

Review of attachment 8665630 [details] [diff] [review]:
-----------------------------------------------------------------

Fwiw, theoretically there's no reason to block D3D11 here as far as I can see. Shared textures are only required for D2D and WebGL, so we should try unblocking D3D11 layers on inbound.
Comment on attachment 8665634 [details] [diff] [review]
part 2, don't use d2d 1.0 on WARP

Approval Request Comment
[Feature/regressing bug #]: None
[User impact if declined]: Graphics corruption/instability due to using an unsupported combination of acceleration features: D3D10 content with software compositing.
[Describe test coverage new/current, TreeHerder]: Nightly
[Risks and why]: None, we're just aborting acceleration earlier.
[String/UUID change made/needed]:

This patch should be taken for aurora/beta as well to eliminate another source of problems there.

(The rest of this bug is more speculative though so the other patches will be nightly only.)
Attachment #8665634 - Flags: approval-mozilla-beta?
Attachment #8665634 - Flags: approval-mozilla-aurora?
Comment on attachment 8665634 [details] [diff] [review]
part 2, don't use d2d 1.0 on WARP

OK, let's take in 42. Should be in beta 3.
Attachment #8665634 - Flags: approval-mozilla-beta?
Attachment #8665634 - Flags: approval-mozilla-beta+
Attachment #8665634 - Flags: approval-mozilla-aurora?
Attachment #8665634 - Flags: approval-mozilla-aurora+
Added to the release notes:
"Fix a startup crash with some Intel Media Accelerator 3150 graphic cards (1207665)"
This patch failed to apply on aurora. Unsure what that means.
Flags: needinfo?(dvander)
Comment on attachment 8665634 [details] [diff] [review]
part 2, don't use d2d 1.0 on WARP

Whoops, it's already on Aurora, sorry about that.
Flags: needinfo?(dvander)
Attachment #8665634 - Flags: approval-mozilla-beta+ → checkin+
Comment on attachment 8665634 [details] [diff] [review]
part 2, don't use d2d 1.0 on WARP

(Err.. removed the wrong flag.)
Attachment #8665634 - Flags: approval-mozilla-aurora+
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Wes, should this be marked fixed for all the branches where it landed?
Flags: needinfo?(wkocher)
I think the "leave-open" made our tools skip over marking the flags. Sorry about that.
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.