Closed Bug 1137716 Opened 7 years ago Closed 7 years ago

Startup crash on Optimus w/ Intel Ironlake Graphics mozilla::layers::CompositorD3D11::GetTextureFactoryIdentifier()

Categories

(Core :: Graphics, defect)

Platform: x86, Windows NT
Type: defect
Priority: Not set
Severity: critical

Tracking

RESOLVED FIXED
mozilla40
Tracking Status
firefox36 --- unaffected
firefox37 + fixed
firefox38 + fixed
firefox39 + fixed
firefox40 --- fixed

People

(Reporter: kairo, Assigned: jrmuizel)

References

(Depends on 1 open bug)

Details

(Keywords: crash, topcrash, Whiteboard: gfx-noted)

Attachments

(3 files, 1 obsolete file)

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-1d7d48e2-0107-466e-8e71-38a252150225.
=============================================================

Stack:
0 	xul.dll 	mozilla::layers::CompositorD3D11::GetTextureFactoryIdentifier() 	gfx/layers/d3d11/CompositorD3D11.cpp
1 	xul.dll 	mozilla::layers::CompositorParent::AllocPLayerTransactionParent(nsTArray<mozilla::layers::LayersBackend> const&, unsigned __int64 const&, mozilla::layers::TextureFactoryIdentifier*, bool*) 	gfx/layers/ipc/CompositorParent.cpp
2 	xul.dll 	mozilla::layers::PCompositorParent::OnMessageReceived(IPC::Message const&, IPC::Message*&) 	obj-firefox/ipc/ipdl/PCompositorParent.cpp
3 	xul.dll 	mozilla::ipc::MessageChannel::DispatchSyncMessage(IPC::Message const&) 	ipc/glue/MessageChannel.cpp
4 	xul.dll 	mozilla::ipc::MessageChannel::OnMaybeDequeueOne() 	ipc/glue/MessageChannel.cpp
5 	xul.dll 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc
6 	xul.dll 	`anonymous namespace'::ThreadFunc(void*) 	ipc/chromium/src/base/platform_thread_win.cc
7 	kernel32.dll 	BaseThreadInitThunk 	


This crash signature is #6 with 1.2% of all crashes in early 37.0b1 data. This is Win7-only and all crash addresses end in "caa1".
Note that in all crash reports I looked into, I found detoured.dll in the modules list. According to https://coderrr.wordpress.com/2008/08/27/how-to-get-rid-of-microsoft-detours-detoureddll/ it belongs to Microsoft Detours, and http://research.microsoft.com/en-us/projects/detours/ seems to be the product page for that tool. I wonder why this would be used by any large number of people, though.

Firefox 36 is not affected by this, but we have crashes from 37 Beta, 37 Dev Edition, and 38 Nightly.
Tracking topcrash for 37+. (Going to assume 39 is affected.)

Milan - Do you have anyone available to investigate?
Flags: needinfo?(milan)
Keywords: topcrash
GetSharedHandle() seems to return S_OK, but the handle is null, and we just call MOZ_CRASH.  Seemingly only Windows 7, and not there on 36 may suggest some connection with D2D1.1 (trying to, even if we fail?)
NI :nical because of the caller stack.
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(milan)
Flags: needinfo?(bas)
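For reference, the failure mode described in comment 3 (a success code paired with a null handle) can be guarded against with a check like the one below. This is only a minimal sketch: `HResult`, `SharedHandleResult` and `IsUsableSharedHandle` are hypothetical stand-ins, not the Gecko code or the real DXGI declarations, and this is not the fix that landed for this bug.

```cpp
#include <cassert>

// Hypothetical stand-ins for the Windows types (assumption, for illustration only).
using HResult = long;
using Handle = void*;
constexpr HResult kOk = 0;  // plays the role of S_OK

// Hypothetical result of a GetSharedHandle-style call.
struct SharedHandleResult {
  HResult hr;
  Handle handle;
};

// A success code alone is not enough: the driver may report success and
// still hand back a null handle, so both conditions are checked.
inline bool IsUsableSharedHandle(const SharedHandleResult& r) {
  return r.hr >= 0 && r.handle != nullptr;
}
```

A caller that gets `false` here could fall back to another compositor backend instead of aborting, assuming such a fallback path exists.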
Regarding detoured.dll, it would not surprise me at all if the graphics driver were hooking some APIs so that it could play dual-GPU tricks. I've seen such things before.

Rank 	App notes 	Count 	%
1 	has 	2768 	100.00 %
2 	gpus 	2768 	100.00 %
3 	gpu 	2768 	100.00 %
4 	dual 	2768 	100.00 %

 mozilla::layers::CompositorD3D11::GetTextureFactoryIdentifier()|EXCEPTION_BREAKPOINT (642 crashes)
    100% (641/642) vs.   2% (1353/60686) nvd3d9wrap.dll
    100% (641/642) vs.   2% (1354/60686) nvdxgiwrap.dll
    100% (641/642) vs.   2% (1445/60686) nvapi.dll
    100% (641/642) vs.   2% (1449/60686) nvumdshim.dll
    100% (641/642) vs.   3% (1962/60686) nvinit.dll
     97% (624/642) vs.   1% (627/60686) d3d8.dll
    100% (641/642) vs.   5% (2997/60686) nvwgf2um.dll
     97% (624/642) vs.   7% (4412/60686) d3d10.dll
     97% (624/642) vs.   7% (4412/60686) d3d10core.dll
    100% (641/642) vs.  11% (6405/60686) igd10umd32.dll
     97% (624/642) vs.  11% (6446/60686) d3d8thk.dll
     97% (624/642) vs.  14% (8258/60686) d3d9.dll
     64% (408/642) vs.   2% (934/60686) detoured.dll
     37% (239/642) vs.   1% (714/60686) _etoured.dll
(In reply to Milan Sreckovic [:milan] from comment #3)
> GetSharedHandle() seems to return S_OK, but the handle is null, and we just
> call MOZ_CRASH.  Seemingly only Windows 7, and not there on 36 may suggest
> some connection with D2D1.1 (trying to, even if we fail?)
> NI :nical because of the caller stack.

No, I strongly suspect not. At least some of these have D2D 1.1 running. I suspect this is related to the dual GPUs. These are all Optimus GPUs, and it seems to be a fairly narrow range of models. We may have to do something along the lines of blacklisting.
Flags: needinfo?(bas)
(In reply to Milan Sreckovic [:milan] from comment #3)
> NI :nical because of the caller stack.

Nothing comes to mind as far as the stack is concerned. S_OK with a null handle looks like a driver not doing what it should.
Flags: needinfo?(nical.bugzilla)
OK, if we're going to blacklist, Bas, can you figure out what should be blacklisted?
Assignee: nobody → bas
Whiteboard: gfx-noted
Are there any driver version correlations?
Flags: needinfo?(dmajor)
The Intel adapter is always device 0x0046 and the Intel driver is versions 8.15.10.2008 to 8.15.10.2622 inclusive.

The nVidia adapter varies, mostly 0x0a70 0x0df4 0x0df1 0x0df0. The nVidia DLLs have versions 8.17.12.5730 to 8.17.12.6901 inclusive.

The crashes are only on Win7 and Win7SP1.
Flags: needinfo?(dmajor)
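To make the version range in comment 9 concrete, here is a rough sketch of an inclusive range test on four-part driver versions. The 16-bits-per-field packing and the names `PackVersion`, `ParseVersion` and `InReportedIntelRange` are assumptions for illustration; the actual blocklist code in widget/GfxInfoBase.cpp may represent versions differently.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Pack a four-part version into one 64-bit value, 16 bits per part, so that
// plain integer comparison orders versions correctly (assumed layout).
constexpr uint64_t PackVersion(uint16_t a, uint16_t b, uint16_t c, uint16_t d) {
  return (uint64_t(a) << 48) | (uint64_t(b) << 32) | (uint64_t(c) << 16) | uint64_t(d);
}

// Parse a string such as "8.15.10.2008". Returns false on malformed input.
inline bool ParseVersion(const char* text, uint64_t* out) {
  unsigned a, b, c, d;
  char trailing;
  // Exactly four numeric fields and no trailing character are accepted.
  if (std::sscanf(text, "%u.%u.%u.%u%c", &a, &b, &c, &d, &trailing) != 4) return false;
  if (a > 0xFFFF || b > 0xFFFF || c > 0xFFFF || d > 0xFFFF) return false;
  *out = PackVersion(static_cast<uint16_t>(a), static_cast<uint16_t>(b),
                     static_cast<uint16_t>(c), static_cast<uint16_t>(d));
  return true;
}

// Bounds taken from the Intel driver range reported in comment 9.
constexpr uint64_t kIntelMin = PackVersion(8, 15, 10, 2008);
constexpr uint64_t kIntelMax = PackVersion(8, 15, 10, 2622);

// Inclusive range check, matching "to ... inclusive" in the report.
inline bool InReportedIntelRange(uint64_t v) {
  return v >= kIntelMin && v <= kIntelMax;
}
```

With this layout a single wrong digit in either bound silently changes which drivers match, so a few boundary cases are worth checking by hand.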
Bas - Can you blacklist based on the information in comment 9? If so, can you have a patch ready for Beta 7 gtb on Thu?
Flags: needinfo?(bas)
(In reply to Lawrence Mandel [:lmandel] (use needinfo) from comment #10)
> Bas - Can you blacklist based on the information in comment 9? If so, can
> you have a patch ready for Beta 7 gtb on Thu?

I don't know how blacklisting with Dual GPUs works.. I'm not sure if anyone does.. :( Jeff.. do you have any idea who we might ask?
Flags: needinfo?(bas) → needinfo?(jmuizelaar)
Flags: needinfo?(jmuizelaar)
Summary: crash in mozilla::layers::CompositorD3D11::GetTextureFactoryIdentifier() → Startup crash in mozilla::layers::CompositorD3D11::GetTextureFactoryIdentifier()
I'll try to get a patch together.
(In reply to David Major [:dmajor] (UTC+13) from comment #9)
> The Intel adapter is always device 0x0046 and the Intel driver is versions
> 8.15.10.2008 to 8.15.10.2622 inclusive.
> 
> The nVidia adapter varies, mostly 0x0a70 0x0df4 0x0df1 0x0df0. The nVidia
> DLLs have versions 8.17.12.5730 to 8.17.12.6901 inclusive.
> 
> The crashes are only on Win7 and Win7SP1.

David, can you get an exhaustive list of adapter id's?
Flags: needinfo?(dmajor)
So we don't really have infrastructure to handle dual gpu blacklisting...
Attachment #8580166 - Flags: review?(bas)
Attachment #8580166 - Attachment is obsolete: true
Attachment #8580166 - Flags: review?(bas)
Attachment #8580189 - Flags: review?(bas)
> David, can you get an exhaustive list of adapter id's?

0x0a70 0x0df1 0x0df4 0x0df0 0x0a7a 0x0a35 0x0dee 0x0a6c 0x0dd3 0x0a2d 0x0caf 0x0df2 0x0a2b 0x0a72 0x0a29 0x0df3
Flags: needinfo?(dmajor)
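As an illustration of how the device IDs in comment 17 could be combined with the Intel device ID from comment 9, the sketch below matches only the Intel 0x0046 plus listed NVIDIA pairing. The struct and function names are hypothetical, and this is not a description of what the attached patch does.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// NVIDIA device IDs as listed in comment 17.
constexpr std::array<uint16_t, 16> kReportedNvidiaIds = {
    0x0a70, 0x0df1, 0x0df4, 0x0df0, 0x0a7a, 0x0a35, 0x0dee, 0x0a6c,
    0x0dd3, 0x0a2d, 0x0caf, 0x0df2, 0x0a2b, 0x0a72, 0x0a29, 0x0df3};

// Intel device ID reported in comment 9.
constexpr uint16_t kReportedIntelId = 0x0046;

inline bool IsReportedNvidiaDevice(uint16_t id) {
  return std::find(kReportedNvidiaIds.begin(), kReportedNvidiaIds.end(), id) !=
         kReportedNvidiaIds.end();
}

// Hypothetical description of a dual-GPU machine (assumption, for illustration).
struct DualGpuInfo {
  uint16_t intelDeviceId;
  uint16_t nvidiaDeviceId;
};

// Match only when both adapters are among the reported ones.
inline bool MatchesReportedCombination(const DualGpuInfo& g) {
  return g.intelDeviceId == kReportedIntelId && IsReportedNvidiaDevice(g.nvidiaDeviceId);
}
```

Matching on the pair rather than on the NVIDIA ID alone would avoid blocking the same NVIDIA parts when they are paired with a different Intel adapter, at the cost of needing both adapters' IDs at blocklist time.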
Comment on attachment 8580189 [details] [diff] [review]
A version that builds

Review of attachment 8580189 [details] [diff] [review]:
-----------------------------------------------------------------

It's a shame we will also blacklist these NVidia devices as secondary GPUs now when the intel device is not 0x0046.. But for 37 let's do this, can you put a comment in to look at this?

::: widget/GfxInfoBase.cpp
@@ +630,3 @@
>  
>  #if defined(XP_WIN) || defined(ANDROID)
> +    uint64_t driverVersion;

What changed here?
Attachment #8580189 - Flags: review?(bas) → review+
Summary: Startup crash in mozilla::layers::CompositorD3D11::GetTextureFactoryIdentifier() → Startup crash on Optimus w/ Intel Ironlake Graphics mozilla::layers::CompositorD3D11::GetTextureFactoryIdentifier()
I tried to find a laptop that reproduced this but made a bad assumption about what kind of intel graphics it was happening on.
Backed both out in https://hg.mozilla.org/integration/mozilla-inbound/rev/a2e34f98c85a - whether or not it's going to eventually work on Windows, that broke tests on Mac and Android.
I agree that WINDOWS_7 is better, but what's weird is that WINDOWS7 seems to have built here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=47ac91f92b2d
Ah never mind, I accidentally made the fix in the try push: https://hg.mozilla.org/try/rev/47ac91f92b2d
Comment on attachment 8580189 [details] [diff] [review]
A version that builds

Approval Request Comment
[Feature/regressing bug #]: Unknown
[User impact if declined]: Startup crashes on people's machines that have particular hardware.
[Describe test coverage new/current, TreeHerder]: Very limited. Hasn't been in Nightly yet. We don't have the hardware to test it on.
[Risks and why]: This changes the blocklisting infrastructure so it definitely has some risk, especially this late in the cycle
Attachment #8580189 - Flags: approval-mozilla-beta?
Attachment #8580189 - Flags: approval-mozilla-aurora?
Comment on attachment 8580189 [details] [diff] [review]
A version that builds

I am going to take in aurora even if it didn't land in m-c to maximize testing for beta.
Attachment #8580189 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
This is a bad bug and a new issue in 37. However, this is too risky to land directly on Beta. We're going to wait until at least tomorrow to try and get some data. We may decide to land this later and test over the weekend with the risk of pushing the release if required.
https://hg.mozilla.org/mozilla-central/rev/027c4d441a02
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Assignee: bas → jmuizelaar
Comment on attachment 8580189 [details] [diff] [review]
A version that builds

This change hasn't produced obvious problems on Nightly or Aurora but won't be able to really be tested until we get it onto Beta. We'll take this in the 37 RC as this looks like a significant enough issue to block the release. Beta+
Attachment #8580189 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I see some Intel devices with DeviceID 0x0116 that hit this crash in the csv files. David can you confirm that you don't see any 0x0116 intel devices?
Flags: needinfo?(dmajor)
I do see them now. This may be a new development. I'm pretty sure it was more like 99% 0x0046 when I originally posted.

 Rank 	Adapter device id 	Count 	%
1 	0x0046 	1703 	91.95 %
2 	0x0116 	107 	5.78 %
3 	0x0106 	39 	2.11 %
4 	0x0126 	3 	0.16 %
Flags: needinfo?(dmajor)
This crash was supposed to be fixed, but it is the #1 crash in early 37.0 release data.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
So I noticed these crash reports of WARP- which is not really expected. Something weird might be going on there.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #36)
> So I noticed these crash reports of WARP- which is not really expected.
> Something weird might be going on there.

Might that be another case of or connected to bug 1149761?
The WARP- seems innocuous. The current code will report WARP- whenever we call InitD3D11Devices and don't succeed at WARP, even if we never tried it. I've filed bug 1150124 to improve this reporting.
It seems as though the D3D11 compositor is being used for reasons unknown.
So I just realized that our ScopedGfxFeatureReporter writes to the AppNotes using an event posted to the main thread. This means that during startup they will not necessarily contain all of the data that we would like to see. This likely explains why it seems like we're using the D3D11 compositor without reporting that in the AppNotes.

It's conceivable that the blocklisting code is just not working properly and is not blocking these laptops.
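The AppNotes timing issue described above can be pictured with a toy model. Everything here (`MainThreadQueue`, `AppNotes`, `ReportFeatureAsync`) is a hypothetical simplification and not the ScopedGfxFeatureReporter implementation; it only shows why a note posted as a main-thread event is not visible until that event has run.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Toy stand-in for a main-thread event queue (assumption, for illustration).
struct MainThreadQueue {
  std::queue<std::function<void()>> pending;

  void Post(std::function<void()> task) { pending.push(std::move(task)); }

  // Run every queued task in order, as the main thread eventually would.
  void Drain() {
    while (!pending.empty()) {
      auto task = std::move(pending.front());
      pending.pop();
      task();
    }
  }
};

// Toy stand-in for the notes attached to a crash report.
struct AppNotes {
  std::string text;
};

// The note is written only when the posted task runs, not when this returns.
inline void ReportFeatureAsync(MainThreadQueue& queue, AppNotes& notes, std::string note) {
  queue.Post([&notes, note = std::move(note)] { notes.text += note; });
}
```

If the process crashes between `ReportFeatureAsync` and `Drain`, the notes are still empty, which is consistent with a startup crash report that lacks the compositor annotation.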
Depends on: 1150324, 1150124
I typo'd the version number in the blacklisting patch. That explains why the blacklist didn't work.
Approval Request Comment
[Feature/regressing bug #]: 1137716
[User impact if declined]: Crashes on startup
[Describe test coverage new/current, TreeHerder]: None
[Risks and why]: Unintentional blacklisting
Attachment #8587560 - Flags: approval-mozilla-release?
Attachment #8587560 - Flags: approval-mozilla-beta?
Attachment #8587560 - Flags: approval-mozilla-aurora?
Comment on attachment 8587560 [details] [diff] [review]
Fix driver version typo

We're going to take this blacklist typo correction for a start-up crash in 37.0.1 Release+ Beta+ Aurora+
Attachment #8587560 - Flags: approval-mozilla-release?
Attachment #8587560 - Flags: approval-mozilla-release+
Attachment #8587560 - Flags: approval-mozilla-beta?
Attachment #8587560 - Flags: approval-mozilla-beta+
Attachment #8587560 - Flags: approval-mozilla-aurora?
Attachment #8587560 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/86d34f434aa5
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Target Milestone: mozilla39 → mozilla40
We do not have any dual GPU setups to test this, so I guess verification of the fix can only be done by analyzing Socorro data. Please let me know if you think there is a way to manually verify this.
There are still some hits on this in 37.0.1 but the volume is greatly reduced. I think it is a matter of additional device ID's. The stragglers are: 0x0dd2 0x0dd3 0x1050 0x1051 0x1054.
I had listed 0x0dd3 in comment 17 but I don't see it in the patch (it's about 2/3 of the remaining crashes). The other device IDs must be ones that were too low volume to notice on beta.
I have one of the machines that should reproduce this (Identical machine, identical drivers), but I'm not able to for some reason...

I'll push a patch that adds the additional device ids...
Approval Request Comment
[Feature/regressing bug #]: 37
[User impact if declined]: Startup crashes
[Risks and why]: Just more devices being blocked
Attachment #8588711 - Flags: approval-mozilla-release?
Attachment #8588711 - Flags: approval-mozilla-beta?
Attachment #8588711 - Flags: approval-mozilla-aurora?
Comment on attachment 8588711 [details] [diff] [review]
Block more devices

Should be in 38 beta 2 or 3.
Attachment #8588711 - Flags: approval-mozilla-beta?
Attachment #8588711 - Flags: approval-mozilla-beta+
Attachment #8588711 - Flags: approval-mozilla-aurora?
Attachment #8588711 - Flags: approval-mozilla-aurora+
For whoever decides the release approval on that patch: The relative volume of this signature is much lower now: 0.5% of 37.0.1 crashes, versus 6.6% of 37.0 crashes. But we'll need to weigh the low volume against the fact that it's a startup crash.
set back to affected to make sure sheriffs see it
I was able to reproduce this by forcing firefox to use the nvidia gpu on the machine in question.
I believe this was caused by our current blacklist breaking in some way.
Comment on attachment 8588711 [details] [diff] [review]
Block more devices

This will ride along in 37.0.2. Release+
Attachment #8588711 - Flags: approval-mozilla-release? → approval-mozilla-release+
Jeff, could you confirm that the fix works correctly on the machine that you've reproduced this? At least for Firefox 37.0.2: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/37.0.2-candidates/build1/.
Flags: needinfo?(jmuizelaar)
The original patch fixed this on the machine that I have. The latest patch only impacts more rare machines.
Flags: needinfo?(jmuizelaar)
Thanks Jeff! I guess that means we need more crash data to confirm.