Crash in nvwgf2um.dll | NDXGI::CDevice::DestroyDriverInstance

RESOLVED FIXED in Firefox 50

Status

--
critical
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: ashughes, Assigned: eflores)

Tracking

(Blocks: 1 bug, {crash, topcrash})

38 Branch
mozilla52
x86
Windows
crash, topcrash
Points:
---

Firefox Tracking Flags

(firefox49+ wontfix, firefox-esr45 affected, firefox50+ fixed, firefox51+ fixed, firefox52+ fixed)

Details

(Whiteboard: [gfx-noted], crash signature)

Attachments

(5 attachments, 1 obsolete attachment)

(Reporter)

Description

2 years ago
This bug was filed from the Socorro interface and is 
report bp-b3fb7219-817d-4549-ad7d-31fbd2160804.
=============================================================
Ø 0 	nvwgf2um.dll 	nvwgf2um.dll@0x45db4 	
Ø 1 	nvwgf2um.dll 	nvwgf2um.dll@0xa9dd7 	
Ø 2 	nvwgf2um.dll 	nvwgf2um.dll@0x5d3ac 	
Ø 3 	nvwgf2um.dll 	nvwgf2um.dll@0x506b5 	
Ø 4 	nvwgf2um.dll 	nvwgf2um.dll@0x4e791 	
Ø 5 	nvwgf2um.dll 	nvwgf2um.dll@0x2e6cf 	
6 	d3d11.dll 	NDXGI::CDevice::DestroyDriverInstance() 	
7 	d3d11.dll 	CContext::LUCBeginLayerDestruction() 	
8 	d3d11.dll 	CBridgeImpl<ILayeredUseCounted, ID3D11LayeredUseCounted, CLayeredObject<CContext> >::LUCBeginLayerDestruction() 	
9 	d3d11.dll 	NOutermost::CDeviceChild::LUCBeginLayerDestruction() 	
10 	d3d11.dll 	CUseCountedObject<NOutermost::CDeviceChild>::FinalRelease() 	
11 	d3d11.dll 	CUseCountedObject<NOutermost::CDeviceChild>::~CUseCountedObject<NOutermost::CDeviceChild>() 	
12 	d3d11.dll 	CUseCountedObject<NOutermost::CDeviceChild>::`scalar deleting destructor'(unsigned int) 	
13 	d3d11.dll 	CUseCountedObject<NOutermost::CDeviceChild>::UCDestroy() 	
14 	d3d11.dll 	CUseCountedObject<NOutermost::CDeviceChild>::UCReleaseUse() 	
15 	d3d11.dll 	CDevice::LLOBeginLayerDestruction() 	
16 	d3d11.dll 	CBridgeImpl<ILayeredLockOwner, ID3D11LayeredDevice, CLayeredObject<CDevice> >::LLOBeginLayerDestruction() 	
17 	d3d11.dll 	NDXGI::CDevice::LLOBeginLayerDestruction() 	
18 	d3d11.dll 	CBridgeImpl<ILayeredLockOwner, ID3D11LayeredDevice, CLayeredObject<NDXGI::CDevice> >::LLOBeginLayerDestruction() 	
19 	d3d11.dll 	NOutermost::CDevice::LLOBeginLayerDestruction() 	
20 	d3d11.dll 	TComObject<NOutermost::CDevice>::FinalRelease() 	
21 	d3d11.dll 	TComObject<NOutermost::CDevice>::~TComObject<NOutermost::CDevice>() 	
22 	d3d11.dll 	TComObject<NOutermost::CDevice>::`scalar deleting destructor'(unsigned int) 	
23 	d3d11.dll 	TComObject<NOutermost::CDevice>::Release() 	
24 	d3d11.dll 	CUseCountedObject<NOutermost::CDeviceChild>::Release() 	
25 	d3d11.dll 	CLayeredObjectWithCLS<CRenderTargetView>::CContainedObject::Release() 	
26 	d2d1.dll 	CHwSurfaceRenderTargetSharedData::~CHwSurfaceRenderTargetSharedData() 	
27 	d2d1.dll 	CD3DDeviceLevel1::~CD3DDeviceLevel1() 	
28 	d2d1.dll 	RefCountedObject<CD3DDeviceLevel1, LockingRequired, DeleteOnZeroReference>::`scalar deleting destructor'(unsigned int) 	
29 	d2d1.dll 	RefCountedObject<CD3DDeviceLevel1, LockingRequired, DeleteOnZeroReference>::Release() 	
30 	d2d1.dll 	CMemoryManager::~CMemoryManager() 	
31 	d2d1.dll 	D2DDevice::~D2DDevice() 	
32 	d2d1.dll 	RefCountedObject<D2DDevice, LockingRequired, DeleteOnZeroReference>::`vector deleting destructor'(unsigned int) 	
33 	d2d1.dll 	RefCountedObject<D2DDevice, LockingRequired, DeleteOnZeroReference>::Release() 	
34 	d2d1.dll 	D2DResource<ID2D1RenderTarget, IRenderTargetInternal, ID2D1DeviceContext>::~D2DResource<ID2D1RenderTarget, IRenderTargetInternal, ID2D1DeviceContext>() 	
35 	d2d1.dll 	D2DDeviceContextBase<ID2D1DeviceContext, ID2D1DeviceContext, null_type>::~D2DDeviceContextBase<ID2D1DeviceContext, ID2D1DeviceContext, null_type>() 	
36 	d2d1.dll 	RefCountedObject<D2DDeviceContext, LockingRequired, LockFactoryOnReferenceReachedZero>::`vector deleting destructor'(unsigned int) 	
37 	d2d1.dll 	RefCountedObject<D2DDeviceContext, LockingRequired, LockFactoryOnReferenceReachedZero>::Release() 	
38 	xul.dll 	mozilla::gfx::DrawTargetD2D1::~DrawTargetD2D1() 	gfx/2d/DrawTargetD2D1.cpp:80
39 	xul.dll 	mozilla::gfx::DrawTargetD2D1::`scalar deleting destructor'(unsigned int) 	
40 	xul.dll 	mozilla::detail::RefCounted<mozilla::layers::TextureSource, 1>::Release() 	obj-firefox/dist/include/mozilla/RefCounted.h:135
41 	xul.dll 	RefPtr<mozilla::gfx::DrawTarget>::assign_with_AddRef(mozilla::gfx::DrawTarget*) 	obj-firefox/dist/include/mozilla/RefPtr.h:55
42 	xul.dll 	gfxPlatform::~gfxPlatform() 	gfx/thebes/gfxPlatform.cpp:931
43 	xul.dll 	gfxWindowsPlatform::`scalar deleting destructor'(unsigned int) 	
44 	xul.dll 	gfxPlatform::Shutdown() 	gfx/thebes/gfxPlatform.cpp:868
45 	xul.dll 	LayoutModuleDtor 	layout/build/nsLayoutModule.cpp:1393
46 	xul.dll 	nsComponentManagerImpl::KnownModule::`scalar deleting destructor'(unsigned int) 	
47 	xul.dll 	nsTArray_Impl<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned int, unsigned int) 	obj-firefox/dist/include/nsTArray.h:1656
48 	xul.dll 	nsComponentManagerImpl::Shutdown() 	xpcom/components/nsComponentManager.cpp:910
49 	xul.dll 	mozilla::ShutdownXPCOM(nsIServiceManager*) 	xpcom/build/XPCOMInit.cpp:992
50 	xul.dll 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	toolkit/xre/nsAppRunner.cpp:1470
51 	xul.dll 	xul.dll@0x1e4c43b 	
52 	ntdll.dll 	RtlInterlockedPushEntrySList 	
53 	mozglue.dll 	arena_dalloc_small 	memory/mozjemalloc/jemalloc.c:4636
54 	mozglue.dll 	je_free 	memory/mozjemalloc/jemalloc.c:6479
55 	firefox.exe 	do_main 	browser/app/nsBrowserApp.cpp:242
56 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:127
57 	ucrtbase.dll 	_initterm 	
58 	firefox.exe 	_SEH_epilog4 	
=============================================================
More reports: https://crash-stats.mozilla.com/signature/?product=Firefox&signature=nvwgf2um.dll%20%7C%20NDXGI%3A%3ACDevice%3A%3ADestroyDriverInstance

It looks like these crashes go back at least to Firefox 38 at this point. It is currently #15 in Beta @ 0.57%.
Crash volume for signature 'nvwgf2um.dll | NDXGI::CDevice::DestroyDriverInstance':
 - nightly (version 51): 7 crashes from 2016-08-01.
 - aurora  (version 50): 20 crashes from 2016-08-01.
 - beta    (version 49): 1238 crashes from 2016-08-02.
 - release (version 48): 431 crashes from 2016-07-25.
 - esr     (version 45): 62 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       4       0       2
 - aurora        9       3       6
 - beta        433     452     187
 - release     138     112      82
 - esr          10       4       9

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly #191
 - aurora  #186      #1083
 - beta    #29       #685
 - release #161
 - esr     #933
status-firefox51: --- → affected
(Reporter)

Comment 3

2 years ago
[Tracking Requested - why for this release]: This is the #5 topcrash in Firefox 49 @ 1.1%
status-firefox48: affected → ---
status-firefox52: --- → affected
tracking-firefox49: --- → ?
Keywords: topcrash

Comment 4

2 years ago
a number of user comments indicate that this crash is occurring when they are trying to close the browser.
Tracking for 49+. Milan can you help find an owner to investigate? Thanks.
tracking-firefox49: ? → +
tracking-firefox50: --- → +
tracking-firefox51: --- → +
tracking-firefox52: --- → +
Flags: needinfo?(milan)
Some of the following may be duplicated between this and bug 1285333.  This started on June 22nd, which I believe means this: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d224fc999cb6&tochange=2e3390571fdb for the regression range.
Flags: needinfo?(milan)
Speculative uplift of bug 1296749 has been requested, and would at least deal with the crash in comment 0.
Comparison between 50.0b2 and 50.0b3 with the query from comment 6: https://crash-analysis.mozilla.com/rkaiser/datil/searchcompare/?common=product%3DFirefox%26proto_signature%3D%7EDeviceManagerD3D11%253A%253A%7EDeviceManagerD3D11&p1=version%3D50.0b2&p2=version%3D50.0b3.

There are some signature changes, but it looks like the bug is still there.
Assignee: nobody → edwin
Just noticed now: this crash started on a specific date (22/06), but not on a specific version or release channel. There are plenty of crashes from 47, but they only started a couple of weeks after 47 was released. There are a few examples of older versions (down to 34.0b9!) but again, these didn't start until June.

What *did* happen around that time was an update for Windows 7 SP1, KB3161608. Of these crashes, >97% are from Windows 7 and of those, ~92% are SP1.

I don't yet know why this spiked in 49, but our regression range just became much wider...
On the June 22nd side - since that's when we elevated some signatures from proto signature to signature, we could see this kind of a change just from that.  For example:
https://crash-stats.mozilla.com/report/index/68ff7cc4-40a2-4dfa-83a2-e550e2160614
is the same crash, but it shows up as nvwgf2um.dll@0x1bb23c, rather than nvwgf2um.dll | NDXGI::CDevice::DestroyDriverInstance
The September spike, when we changed trains, now that's real.
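To illustrate why the same crash can show up under two different signatures, here is a rough sketch. It is a simplified, hypothetical model and not Socorro's actual signature generation: the function names and the "collapse unsymbolized frames" rule are assumptions made only for illustration.

```python
import re

# Assumption: a frame with no symbol looks like "module@0xOFFSET".
UNSYMBOLIZED = re.compile(r"^(?P<module>[^@]+)@0x[0-9a-fA-F]+$")


def top_frame_signature(frames):
    """Use only the top frame, as in the older-looking report."""
    return frames[0]


def prefixed_signature(frames):
    """Collapse leading unsymbolized frames, then append the first symbolized frame."""
    parts = []
    for frame in frames:
        match = UNSYMBOLIZED.match(frame)
        if match is None:
            # First symbolized frame: drop the argument list and stop.
            parts.append(frame.split("(")[0])
            break
        module = match.group("module")
        if not parts or parts[-1] != module:
            parts.append(module)
    return " | ".join(parts)


# Shortened version of the stack from comment 0.
frames = [
    "nvwgf2um.dll@0x45db4",
    "nvwgf2um.dll@0xa9dd7",
    "NDXGI::CDevice::DestroyDriverInstance()",
]
print(top_frame_signature(frames))
print(prefixed_signature(frames))
```

Under a top-frame-only rule every driver build yields a different offset and therefore a different signature, so crashes only aggregate once the symbolized frame is included.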
Created attachment 8799357 [details]
Crashes vs. build ID where proto sig contains DestroyDriverInstance

Charts!
Created attachment 8799358 [details]
Crashes vs. time where proto sig contains DestroyDriverInstance

More charts!

The beta chart shows pretty clearly that this happened either before 49 hit beta, or something was enabled on 49 beta.

Aurora and Nightly don't seem to spike on 49, but they're pretty noisy. It's difficult to tell.
Created attachment 8799698 [details]
Release chart of crashes over time where proto sig contains "LLOBeginLayerDestruction"

Mo' data, mo' problems.

In looking for a more accurate regression range, I plotted all the crash reports whose proto sig contains "LLOBeginLayerDestruction". That appears to capture this crash across different device vendors.

That leads us back to June 22. I'm hesitant to treat it as actual signal this time, but it's probably worth trying to explain it away. Perhaps we're processing the proto sig field differently as well.
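For anyone wanting to redo this kind of plot, a minimal sketch of the counting step follows. The record layout (`date`, `proto_signature`) and the sample reports are hypothetical; real data would come from a crash-stats export.

```python
from collections import Counter
from datetime import date

# Hypothetical crash reports, for illustration only.
reports = [
    {"date": date(2016, 6, 21),
     "proto_signature": "nvwgf2um.dll@0x1bb23c | NDXGI::CDevice::DestroyDriverInstance"
                        " | NOutermost::CDevice::LLOBeginLayerDestruction"},
    {"date": date(2016, 6, 22),
     "proto_signature": "atidxx32.dll@0x1234 | CDevice::LLOBeginLayerDestruction"},
    {"date": date(2016, 6, 22),
     "proto_signature": "nvwgf2um.dll | NDXGI::CDevice::DestroyDriverInstance"
                        " | NOutermost::CDevice::LLOBeginLayerDestruction"},
    {"date": date(2016, 6, 22),
     "proto_signature": "mozilla::SomethingElse"},
]


def crashes_per_day(reports, needle):
    """Count reports per day whose proto signature contains `needle`."""
    counts = Counter(
        report["date"] for report in reports
        if needle in report["proto_signature"]
    )
    return sorted(counts.items())


for day, count in crashes_per_day(reports, "LLOBeginLayerDestruction"):
    print(day.isoformat(), count)
```

Matching on a substring of the proto signature rather than on the signature itself is what lets the count span device vendors.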
Previous chart broken down by top 60 signatures: https://plot.ly/~edwinfloresii/0/release/
Created attachment 8799708 [details]
Release crashes over time where proto sig contains LLOBeginLayerDestruction, broken down by signature.

Previous chart is a bit misleading with the smaller signatures filtered out. This one makes a lot more sense.

It shows the ATI crash peaking in 47 and dying down over time, and the nVidia crash being largely unrelated. This is slightly surprising.

Attaching as PNG for slowness reasons. The larger crashes can be seen in the interactive chart above.
I'm going to ask that those crashes before June 22 be reprocessed with the new signature generation. This whole process is getting pretty ridiculous.

Comment 18

2 years ago
hi, when looking just for crashes on the nightly channel, they seem to have become more regular starting with 50.0a1 build 20160713030216.

the nightly pushlog for 20160713030216 -1 day would be:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=aac8ff1024c553d9c92b85b8b6ba90f65de2ed08&tochange=04821a70c739a00d12e12df651c0989441e22728
the only gfx related patch there sticking out to me would be bug 1276467

and going back another 2 days:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=679118259e91f40d4a8f968f03ec4cff066cdb5b&tochange=aac8ff1024c553d9c92b85b8b6ba90f65de2ed08

Comment 19

2 years ago
bug 1284322 would fall into the 2nd pushlog window, and it seems that it's mainly those older nvidia driver versions that are involved in this crash signature.

this is the correlation of crashing driver versions on 50.0b so far:
1 	8.15.11.8660 	60 	11.65 %
2 	8.15.11.8627 	56 	10.87 %
3 	8.16.11.8691 	44 	8.54 %
4 	8.15.11.8647 	40 	7.77 %
5 	8.15.11.8652 	37 	7.18 %
6 	8.15.11.8642 	36 	6.99 %
7 	8.15.11.8688 	23 	4.47 %
8 	9.18.13.697 	17 	3.30 %
9 	8.15.11.8634 	14 	2.72 %
10 	8.15.11.8631 	12 	2.33 %

so maybe it would make sense to keep versions up to/including 8.16.11.8691 blocklisted from this perspective?
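As a rough check of what such a cutoff would cover, here is a small sketch. The comparison is a plain component-wise tuple comparison, which is an assumption and not necessarily how the real blocklist compares driver versions; `is_blocked` and `CUTOFF` are hypothetical names.

```python
def parse_version(text):
    """Turn '8.15.11.8660' into the tuple (8, 15, 11, 8660)."""
    return tuple(int(part) for part in text.split("."))


# Crash counts per driver version, copied from the 50.0b table above.
crashes_by_driver = {
    "8.15.11.8660": 60,
    "8.15.11.8627": 56,
    "8.16.11.8691": 44,
    "8.15.11.8647": 40,
    "8.15.11.8652": 37,
    "8.15.11.8642": 36,
    "8.15.11.8688": 23,
    "9.18.13.697": 17,
    "8.15.11.8634": 14,
    "8.15.11.8631": 12,
}

# Hypothetical cutoff: block this version and everything older.
CUTOFF = parse_version("8.16.11.8691")


def is_blocked(version_text, cutoff=CUTOFF):
    """True if the driver version is at or below the cutoff."""
    return parse_version(version_text) <= cutoff


blocked = sum(n for version, n in crashes_by_driver.items() if is_blocked(version))
listed = sum(crashes_by_driver.values())
print(f"{blocked} of {listed} listed crashes are at or below the cutoff")
```

Only the ten listed versions are counted, so this says nothing about the long tail of drivers outside the table.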
Blocks: 1284322

Comment 20

2 years ago
oops, i failed to notice that bug 1284322 was uplifted to 49 too, so we have a broader range of samples on release as well:

1 	8.16.11.8691 	774 	16.50 %
2 	8.15.11.8652 	746 	15.90 %
3 	8.15.11.8627 	669 	14.26 %
4 	8.15.11.8647 	483 	10.29 %
5 	8.15.11.8688 	337 	7.18 %
6 	8.15.11.8642 	288 	6.14 %
7 	8.15.11.8631 	198 	4.22 %
8 	8.15.11.8660 	152 	3.24 %
9 	8.15.11.8634 	137 	2.92 %
10 	8.15.11.8644 	119 	2.54 %
11 	8.15.11.8637 	113 	2.41 %
12 	8.15.11.8670 	39 	0.83 %
13 	8.15.11.8725 	39 	0.83 %
14 	8.17.12.8026 	39 	0.83 %
15 	8.15.11.8664 	34 	0.72 %
16 	8.15.11.8675 	26 	0.55 %
17 	8.17.12.7533 	26 	0.55 %
18 	8.16.11.8745 	24 	0.51 %
19 	8.17.12.6893 	24 	0.51 %
20 	8.15.11.8636 	23 	0.49 %
(In reply to [:philipp] from comment #18)
> hi, when looking just for crashes on the nightly channel, they seem to have
> become more regular starting with 50.0a1 build 20160713030216.

> and going back another 2 days:
> https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=679118259e91f40d4a8f968f03ec4cff066cdb5b&tochange=aac8ff1024c553d9c92b85b8b6ba90f65de2ed08

Interesting! Not sure how I didn't notice that. It's not particularly obvious from the build graph, but that seems to be because the graph is missing a lot of points compared to the reports list. Weird.

I think you're right with bug 1284322.
(In reply to Marco Castelluccio [:marco] from comment #22)
> With Aurora the trend doesn't seem to have changed much before/after the
> uplift in bug 1284322:
> https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&release_channel=aurora&signature=nvwgf2um.dll%20%7C%20NDXGI%3A%3ACDevice%3A%3ADestroyDriverInstance&date=%3E%3D2016-04-12T11%3A09%3A00.000Z&date=%3C2016-10-12T11%3A09%3A00.000Z#graph

Yeah, but Aurora just didn't change much at all. That's why this bug was such a pain.

I think it's likely to be a result of the driver version distribution of Aurora. Might be able to confirm this with ping data.
Comment on attachment 8800206 [details] [diff] [review]
1292311.patch

Review of attachment 8800206 [details] [diff] [review]:
-----------------------------------------------------------------

Can you include a rationale for this driver version?
Created attachment 8800296 [details] [diff] [review]
1292311.patch
Attachment #8800206 - Attachment is obsolete: true
Attachment #8800296 - Flags: review?(jmuizelaar)
Attachment #8800206 - Flags: review?(jmuizelaar)
Attachment #8800296 - Flags: review?(jmuizelaar) → review+

Comment 27

2 years ago
Pushed by eflores@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d1ed33f3fdd2
Blacklist nVidia drivers <= 187.45 for frequent shutdown crashes - r=jrmuizel
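For readers comparing "187.45" with the four-part versions in the tables above: a commonly cited convention is that the NVIDIA release number is formed from the last digit of the third component plus the zero-padded fourth component. The sketch below assumes that convention; it is not taken from the patch, and `nvidia_release_number` is a hypothetical helper.

```python
def nvidia_release_number(windows_version):
    """Map a Windows driver version such as '8.16.11.8745' to '187.45'.

    Assumes the convention that the release number is the last five
    digits of the version, with the fourth component padded to 4 digits.
    """
    parts = windows_version.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four components: {windows_version!r}")
    digits = parts[2][-1] + parts[3].zfill(4)
    return f"{int(digits[:3])}.{digits[3:]}"


for version in ("8.16.11.8745", "8.16.11.8691", "9.18.13.697"):
    print(version, "->", nvidia_release_number(version))
```

Under that assumption, 8.16.11.8691 from the correlation tables maps to 186.91, just below the 187.45 cutoff in the commit message.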
Comment on attachment 8800296 [details] [diff] [review]
1292311.patch

Approval Request Comment
[Feature/regressing bug #]: bug 1284322.
[User impact if declined]: crashes.
[Describe test coverage new/current, TreeHerder]: -
[Risks and why]: very low. just blacklisting some old hardware.
[String/UUID change made/needed]:
Attachment #8800296 - Flags: approval-mozilla-beta?
Attachment #8800296 - Flags: approval-mozilla-aurora?
Comment on attachment 8800296 [details] [diff] [review]
1292311.patch

Crash fix, Aurora51+, Beta50+

(I hope we can land this in time for inclusion in 50.0b7)
Attachment #8800296 - Flags: approval-mozilla-beta?
Attachment #8800296 - Flags: approval-mozilla-beta+
Attachment #8800296 - Flags: approval-mozilla-aurora?
Attachment #8800296 - Flags: approval-mozilla-aurora+

Comment 30

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/7f07acca81b9
status-firefox51: affected → fixed

Comment 31

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/c62114e522a3
status-firefox50: affected → fixed
Andre, note that we're disabling acceleration on a bunch of Nvidia cards due to an increase in crashes we've detected in 49.  Enabling acceleration in the first place was result of an effort to enable WebGL on more machines - this is now a step backwards.  I don't know exactly how the numbers will be affected, there are some indications in bug 1284322, but we should look for changes in the statistics.
Flags: needinfo?(avrignaud)
Edwin, what kinds of uptime distribution are we seeing on these crashes?  Given the WebGL angle, if these are startup crashes, the story is clear, but if it's after a long use...
Flags: needinfo?(edwin)

Comment 34

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/d1ed33f3fdd2
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox52: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Thanks Milan. Is this something we can eventually fix/work around? Or are they Nvidia driver issues, and the only fix is for the users to update the drivers (if working ones even exist)?
Flags: needinfo?(avrignaud)
We don't have a reproducible case for these crashes, so it's difficult for us to fix the underlying problem, even if it is on our side.
(In reply to Milan Sreckovic [:milan] from comment #33)
> Edwin, what kinds of uptime distribution are we seeing on these crashes? 
> Given the WebGL angle, if these are startup crashes, the story is clear, but
> if it's after a long use...

The distribution looks pretty uniform.

Most of the users that were un-blacklisted in bug 1284322 should still be off the blacklist. I ran some rough numbers [1], and this change adds about 0.7% of users on top of those already on the blacklist (from ~0.88% to ~1.56%) -- not a small number, but not unreasonable IMO.

[1] https://sql.telemetry.mozilla.org/queries/1425/source
Flags: needinfo?(edwin)
See Also: → bug 1310600

Comment 38

2 years ago
Hey, as noted above, this might indeed be a step backwards. It feels unfortunate, but I can appreciate it if we need to do this. Milan has been driving a big push over the past months to be much more aware of the blacklisting we do for graphics, because feedback from notable companies using WebGL commercially has revealed that WebGL adoption is one of the largest pain points affecting their migration to WebGL-based technologies.

Btw, this is an example of a blacklisting entry that is being introduced without understanding what causes the crash. We know that the crash occurs at exit, but exiting is not really the cause; presumably the WebGL stack operates the driver in some fashion that causes a delayed crash when closing. If we had a repro, we might be able to connect the crash at shutdown to a specific feature or API call that causes it, and possibly work around it. Debugging that type of issue can be practically impossible without a repro case, though, so I can appreciate it if we need to give up on that line of investigation and just strike off these driver versions.

Based on Edwin's comment above, this is likely a small population, but I'd like to be certain. When making this kind of blacklist, I'm hoping that we get the reasoning written down precisely, so that in the future we will remember our blacklist landscape easily. There have been some blacklists that have been introduced in the past and then forgotten (until Milan and Jeffs and others came back to them), so we want to keep a comfortably good track of the reasons that we blacklisted so later auditing is easy.

Reading the conversation trail, the new blacklist would cover all Windows versions (XP, Vista, 7, 8, 10?) that have NVidia driver 187.45 or older. Presumably there are no Windows 8 or 10 users with this driver version at all, though, so this would be specific to XP, Vista and 7?

1) It looks like this blacklist will not be a "hole" in the series of driver versions, but will in practice introduce a new minimum driver version requirement?
2) Why was this exact driver version chosen? (Was it the last driver version that had this crash signature?)
3) How large a percentage of overall users in the wild do we expect to lose for WebGL with this blacklist? (0.7% of all Firefox users?)
   - Especially I am surprised with a seeming conflict that this is the #5 highest top crasher, which suggests perhaps we might have a larger user base affected?
4) What was the driver release date for that version? (October 2009?)
5) What is the next driver version that we know to work, and its release date?
6) Did the WebGL context creation error message infrastructure that was talked about earlier go live, so that attempting to create a WebGL context on a now-blocked driver will produce an error message about this? Can we make the message point to this bug entry? ("WebGL on this system is disabled. See https://bugzilla.mozilla.org/show_bug.cgi?id=1292311") We'd like to offer developers a mechanism so that they know how to present the appropriate error dialogs to users, along the lines of "Try updating your graphics drivers".

In particular #5 and #6 would put my mind at ease with these types of blacklist items, since those are an effective way to help the WebGL adoption problem on the developer side. Developers love minimum hardware specifications, so having the exact specs nailed down, like "Minimum requirement: NVidia graphics cards: October 2009/driver 187.45 or newer" is the way that developers like to manage their responsibility with their user bases. Minimum specs are good for us as well, since occasionally we have received comments from devs saying that we have poor WebGL adoption, so being able to explicitly list out "WebGL works on these hardware/OS/driver combos" is effective, especially if we are able to note that we support e.g. NVidia drivers way back to 2009, which is not exactly a new one.

Great work nailing this down!
status-firefox49: affected → wontfix