Closed Bug 1292311 Opened 6 years ago Closed 6 years ago
Crash in nvwgf2um.dll | NDXGI::CDevice::DestroyDriverInstance
Release crashes over time where proto sig contains LLOBeginLayerDestruction, broken down by signature.
This bug was filed from the Socorro interface and is report bp-b3fb7219-817d-4549-ad7d-31fbd2160804.
=============================================================
Ø 0 nvwgf2um.dll nvwgf2um.dll@0x45db4
Ø 1 nvwgf2um.dll nvwgf2um.dll@0xa9dd7
Ø 2 nvwgf2um.dll nvwgf2um.dll@0x5d3ac
Ø 3 nvwgf2um.dll nvwgf2um.dll@0x506b5
Ø 4 nvwgf2um.dll nvwgf2um.dll@0x4e791
Ø 5 nvwgf2um.dll nvwgf2um.dll@0x2e6cf
6 d3d11.dll NDXGI::CDevice::DestroyDriverInstance()
7 d3d11.dll CContext::LUCBeginLayerDestruction()
8 d3d11.dll CBridgeImpl<ILayeredUseCounted, ID3D11LayeredUseCounted, CLayeredObject<CContext> >::LUCBeginLayerDestruction()
9 d3d11.dll NOutermost::CDeviceChild::LUCBeginLayerDestruction()
10 d3d11.dll CUseCountedObject<NOutermost::CDeviceChild>::FinalRelease()
11 d3d11.dll CUseCountedObject<NOutermost::CDeviceChild>::~CUseCountedObject<NOutermost::CDeviceChild>()
12 d3d11.dll CUseCountedObject<NOutermost::CDeviceChild>::`scalar deleting destructor'(unsigned int)
13 d3d11.dll CUseCountedObject<NOutermost::CDeviceChild>::UCDestroy()
14 d3d11.dll CUseCountedObject<NOutermost::CDeviceChild>::UCReleaseUse()
15 d3d11.dll CDevice::LLOBeginLayerDestruction()
16 d3d11.dll CBridgeImpl<ILayeredLockOwner, ID3D11LayeredDevice, CLayeredObject<CDevice> >::LLOBeginLayerDestruction()
17 d3d11.dll NDXGI::CDevice::LLOBeginLayerDestruction()
18 d3d11.dll CBridgeImpl<ILayeredLockOwner, ID3D11LayeredDevice, CLayeredObject<NDXGI::CDevice> >::LLOBeginLayerDestruction()
19 d3d11.dll NOutermost::CDevice::LLOBeginLayerDestruction()
20 d3d11.dll TComObject<NOutermost::CDevice>::FinalRelease()
21 d3d11.dll TComObject<NOutermost::CDevice>::~TComObject<NOutermost::CDevice>()
22 d3d11.dll TComObject<NOutermost::CDevice>::`scalar deleting destructor'(unsigned int)
23 d3d11.dll TComObject<NOutermost::CDevice>::Release()
24 d3d11.dll CUseCountedObject<NOutermost::CDeviceChild>::Release()
25 d3d11.dll CLayeredObjectWithCLS<CRenderTargetView>::CContainedObject::Release()
26 d2d1.dll CHwSurfaceRenderTargetSharedData::~CHwSurfaceRenderTargetSharedData()
27 d2d1.dll CD3DDeviceLevel1::~CD3DDeviceLevel1()
28 d2d1.dll RefCountedObject<CD3DDeviceLevel1, LockingRequired, DeleteOnZeroReference>::`scalar deleting destructor'(unsigned int)
29 d2d1.dll RefCountedObject<CD3DDeviceLevel1, LockingRequired, DeleteOnZeroReference>::Release()
30 d2d1.dll CMemoryManager::~CMemoryManager()
31 d2d1.dll D2DDevice::~D2DDevice()
32 d2d1.dll RefCountedObject<D2DDevice, LockingRequired, DeleteOnZeroReference>::`vector deleting destructor'(unsigned int)
33 d2d1.dll RefCountedObject<D2DDevice, LockingRequired, DeleteOnZeroReference>::Release()
34 d2d1.dll D2DResource<ID2D1RenderTarget, IRenderTargetInternal, ID2D1DeviceContext>::~D2DResource<ID2D1RenderTarget, IRenderTargetInternal, ID2D1DeviceContext>()
35 d2d1.dll D2DDeviceContextBase<ID2D1DeviceContext, ID2D1DeviceContext, null_type>::~D2DDeviceContextBase<ID2D1DeviceContext, ID2D1DeviceContext, null_type>()
36 d2d1.dll RefCountedObject<D2DDeviceContext, LockingRequired, LockFactoryOnReferenceReachedZero>::`vector deleting destructor'(unsigned int)
37 d2d1.dll RefCountedObject<D2DDeviceContext, LockingRequired, LockFactoryOnReferenceReachedZero>::Release()
38 xul.dll mozilla::gfx::DrawTargetD2D1::~DrawTargetD2D1() gfx/2d/DrawTargetD2D1.cpp:80
39 xul.dll mozilla::gfx::DrawTargetD2D1::`scalar deleting destructor'(unsigned int)
40 xul.dll mozilla::detail::RefCounted<mozilla::layers::TextureSource, 1>::Release() obj-firefox/dist/include/mozilla/RefCounted.h:135
41 xul.dll RefPtr<mozilla::gfx::DrawTarget>::assign_with_AddRef(mozilla::gfx::DrawTarget*) obj-firefox/dist/include/mozilla/RefPtr.h:55
42 xul.dll gfxPlatform::~gfxPlatform() gfx/thebes/gfxPlatform.cpp:931
43 xul.dll gfxWindowsPlatform::`scalar deleting destructor'(unsigned int)
44 xul.dll gfxPlatform::Shutdown() gfx/thebes/gfxPlatform.cpp:868
45 xul.dll LayoutModuleDtor layout/build/nsLayoutModule.cpp:1393
46 xul.dll nsComponentManagerImpl::KnownModule::`scalar deleting destructor'(unsigned int)
47 xul.dll nsTArray_Impl<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned int, unsigned int) obj-firefox/dist/include/nsTArray.h:1656
48 xul.dll nsComponentManagerImpl::Shutdown() xpcom/components/nsComponentManager.cpp:910
49 xul.dll mozilla::ShutdownXPCOM(nsIServiceManager*) xpcom/build/XPCOMInit.cpp:992
50 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup() toolkit/xre/nsAppRunner.cpp:1470
51 xul.dll xul.dll@0x1e4c43b
52 ntdll.dll RtlInterlockedPushEntrySList
53 mozglue.dll arena_dalloc_small memory/mozjemalloc/jemalloc.c:4636
54 mozglue.dll je_free memory/mozjemalloc/jemalloc.c:6479
55 firefox.exe do_main browser/app/nsBrowserApp.cpp:242
56 firefox.exe wmain toolkit/xre/nsWindowsWMain.cpp:127
57 ucrtbase.dll _initterm
58 firefox.exe _SEH_epilog4
=============================================================
More reports: https://crash-stats.mozilla.com/signature/?product=Firefox&signature=nvwgf2um.dll%20%7C%20NDXGI%3A%3ACDevice%3A%3ADestroyDriverInstance

It looks like these crashes go back at least to Firefox 38 at this point. It is currently #15 in Beta @ 0.57%.
There's a bunch of other signatures that look similar: https://crash-stats.mozilla.com/search/?signature=~nvwgf2um.dll%20%7C%20NDXGI%3A%3ACDevice%3A%3ADestroyDriverInstanc&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Crash volume for signature 'nvwgf2um.dll | NDXGI::CDevice::DestroyDriverInstance':
- nightly (version 51): 7 crashes from 2016-08-01.
- aurora (version 50): 20 crashes from 2016-08-01.
- beta (version 49): 1238 crashes from 2016-08-02.
- release (version 48): 431 crashes from 2016-07-25.
- esr (version 45): 62 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
W. N-1 W. N-2 W. N-3
- nightly 4 0 2
- aurora 9 3 6
- beta 433 452 187
- release 138 112 82
- esr 10 4 9

Affected platform: Windows

Crash rank on the last 7 days:
Browser Content Plugin
- nightly #191
- aurora #186 #1083
- beta #29 #685
- release #161
- esr #933
[Tracking Requested - why for this release]: This is the #5 topcrash in Firefox 49 @ 1.1%
A number of user comments indicate that this crash occurs when they are trying to close the browser.
Tracking for 49+. Milan can you help find an owner to investigate? Thanks.
Yes, it's a shutdown hang/crash, and isn't specific to Nvidia (for example, this is the same issue on Intel - https://crash-stats.mozilla.com/report/index/6b94f054-1d37-4478-ac44-3bf702160926) and we have a whole separate bug 1285333 for the AMD version. Chances are, all of these https://crash-stats.mozilla.com/search/?proto_signature=~DeviceManagerD3D11%3A%3A~DeviceManagerD3D11&product=Firefox&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature are the same crash.
See Also: → 1285333
Some of the following may be duplicated between this and bug 1285333. This started on June 22nd, which I believe means this: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d224fc999cb6&tochange=2e3390571fdb for the regression range.
Speculative uplift of bug 1296749 has been requested, and would at least deal with the crash in comment 0.
Comparison between 50.0b2 and 50.0b3 with the query from comment 6: https://crash-analysis.mozilla.com/rkaiser/datil/searchcompare/?common=product%3DFirefox%26proto_signature%3D%7EDeviceManagerD3D11%253A%253A%7EDeviceManagerD3D11&p1=version%3D50.0b2&p2=version%3D50.0b3. There are some signature changes, but looks like the bug is still there.
Assignee: nobody → edwin
Just noticed now: this crash started on a specific date (22/06), but not on a specific version or release channel. There are plenty of crashes from 47, but they only started a couple of weeks after 47 was released. There are a few examples of older versions (down to 34.0b9!) but again, these didn't start until June. What *did* happen around that time was an update for Windows 7 SP1, KB3161608. Of these crashes, >97% are from Windows 7 and of those, ~92% are SP1. I don't yet know why this spiked in 49, but our regression range just became much wider...
On the June 22nd side - since that's when we elevated some signatures from proto signature to signature, we could see this kind of a change just from that. For example: https://crash-stats.mozilla.com/report/index/68ff7cc4-40a2-4dfa-83a2-e550e2160614 is the same crash, but it shows up as nvwgf2um.dll@0x1bb23c, rather than nvwgf2um.dll | NDXGI::CDevice::DestroyDriverInstance.

The September spike, when we changed trains, now that's real.
More charts! The beta chart shows pretty clearly that this happened either before 49 hit beta, or something was enabled on 49 beta. Aurora and Nightly don't seem to spike on 49, but they're pretty noisy. It's difficult to tell.
Mo' data, mo' problems. In looking for a more accurate regression range, I plotted all the crash reports whose proto sig contains "LLOBeginLayerDestruction". That appears to capture this crash across different device vendors. That leads us back to June 22. I'm hesitant to treat it as actual signal this time, but it's probably worth trying to explain it away. Perhaps we're processing the proto sig field differently as well.
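For anyone wanting to redo this kind of plot, the grouping can be sketched roughly as below. This is a minimal Python sketch, not the script that produced the charts; the record layout (`date`, `signature`, `proto_signature` keys) and the sample rows are assumptions made up for illustration and are not the Socorro schema.

```python
from collections import Counter, defaultdict

# Hypothetical crash records; keys and values are illustrative assumptions.
SAMPLE_REPORTS = [
    {"date": "2016-06-21",
     "signature": "nvwgf2um.dll@0x1bb23c",
     "proto_signature": "nvwgf2um.dll@0x1bb23c | CDevice::LLOBeginLayerDestruction"},
    {"date": "2016-06-22",
     "signature": "nvwgf2um.dll | NDXGI::CDevice::DestroyDriverInstance",
     "proto_signature": "nvwgf2um.dll@0x45db4 | CDevice::LLOBeginLayerDestruction"},
    {"date": "2016-06-22",
     "signature": "nvwgf2um.dll | NDXGI::CDevice::DestroyDriverInstance",
     "proto_signature": "nvwgf2um.dll@0xa9dd7 | CDevice::LLOBeginLayerDestruction"},
    {"date": "2016-06-22",
     "signature": "OOM | small",
     "proto_signature": "mozalloc_abort | mozalloc_handle_oom"},
]


def daily_counts_by_signature(reports, needle="LLOBeginLayerDestruction"):
    """Count reports per day and per signature whose proto signature contains needle."""
    counts = defaultdict(Counter)
    for report in reports:
        if needle in report["proto_signature"]:
            counts[report["date"]][report["signature"]] += 1
    return counts


if __name__ == "__main__":
    result = daily_counts_by_signature(SAMPLE_REPORTS)
    for day in sorted(result):
        print(day, dict(result[day]))
```

Keying on the proto signature rather than the top-level signature is what lets the same crash be counted across vendors and across signature-generation changes.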
Previous chart broken down by top 60 signatures: https://plot.ly/~edwinfloresii/0/release/
Previous chart is a bit misleading with the smaller signatures filtered out. This one makes a lot more sense. It shows the ATI crash peaking in 47 and dying down over time, and the nVidia crash being largely unrelated. This is slightly surprising. Attaching as PNG for slowness reasons. The larger crashes can be seen in the interactive chart above.
I'm going to ask that those crashes before June 22 be reprocessed with the new signature generation. This whole process is getting pretty ridiculous.
hi, when looking just for crashes on the nightly channel, they seem to have become more regular starting with 50.0a1 build 20160713030216. the nightly pushlog for 20160713030216 -1 day would be: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=aac8ff1024c553d9c92b85b8b6ba90f65de2ed08&tochange=04821a70c739a00d12e12df651c0989441e22728 the only gfx related patch there sticking out to me would be bug 1276467 and going back another 2 days: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=679118259e91f40d4a8f968f03ec4cff066cdb5b&tochange=aac8ff1024c553d9c92b85b8b6ba90f65de2ed08
bug 1284322 would fall into the 2nd pushlog window, and it seems that it's mainly those older nvidia driver versions that are involved in this crash signature. this is the correlation of crashing driver versions on 50.0b so far (rank, driver version, crashes, share):
1 22.214.171.12460 60 11.65 %
2 126.96.36.19927 56 10.87 %
3 188.8.131.5291 44 8.54 %
4 184.108.40.20647 40 7.77 %
5 220.127.116.1152 37 7.18 %
6 18.104.22.16842 36 6.99 %
7 22.214.171.12488 23 4.47 %
8 126.96.36.1997 17 3.30 %
9 188.8.131.5234 14 2.72 %
10 184.108.40.20631 12 2.33 %
so maybe it would make sense to keep versions up to/including 220.127.116.1191 blocklisted from this perspective?
oops, i failed to notice that bug 1284322 was uplifted to 49 too, so we have a broader range of samples on release as well (rank, driver version, crashes, share):
1 18.104.22.16891 774 16.50 %
2 22.214.171.12452 746 15.90 %
3 126.96.36.19927 669 14.26 %
4 188.8.131.5247 483 10.29 %
5 184.108.40.20688 337 7.18 %
6 220.127.116.1142 288 6.14 %
7 18.104.22.16831 198 4.22 %
8 22.214.171.12460 152 3.24 %
9 126.96.36.19934 137 2.92 %
10 188.8.131.5244 119 2.54 %
11 184.108.40.20637 113 2.41 %
12 220.127.116.1170 39 0.83 %
13 18.104.22.16825 39 0.83 %
14 22.214.171.12426 39 0.83 %
15 126.96.36.19964 34 0.72 %
16 188.8.131.5275 26 0.55 %
17 184.108.40.20633 26 0.55 %
18 220.127.116.1145 24 0.51 %
19 18.104.22.16893 24 0.51 %
20 22.214.171.12436 23 0.49 %
(In reply to [:philipp] from comment #18)
> hi, when looking just for crashes on the nightly channel, they seem to have
> become more regular starting with 50.0a1 build 20160713030216.
> and going back another 2 days:
> https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=679118259e91f40d4a8f968f03ec4cff066cdb5b&tochange=aac8ff1024c553d9c92b85b8b6ba90f65de2ed08

Interesting! Not sure how I didn't notice that. It's not particularly obvious from the build graph, but that seems to be because the graph is missing a lot of points compared to the reports list. Weird. I think you're right with bug 1284322.
With Aurora the trend doesn't seem to have changed much before/after the uplift in bug 1284322: https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&release_channel=aurora&signature=nvwgf2um.dll%20%7C%20NDXGI%3A%3ACDevice%3A%3ADestroyDriverInstance&date=%3E%3D2016-04-12T11%3A09%3A00.000Z&date=%3C2016-10-12T11%3A09%3A00.000Z#graph
(In reply to Marco Castelluccio [:marco] from comment #22)
> With Aurora the trend doesn't seem to have changed much before/after the
> uplift in bug 1284322:
> https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&release_channel=aurora&signature=nvwgf2um.dll%20%7C%20NDXGI%3A%3ACDevice%3A%3ADestroyDriverInstance&date=%3E%3D2016-04-12T11%3A09%3A00.000Z&date=%3C2016-10-12T11%3A09%3A00.000Z#graph

Yeah, but Aurora just didn't change much at all. That's why this bug was such a pain. I think it's likely to be a result of the driver version distribution of Aurora. Might be able to confirm this with ping data.
Comment on attachment 8800206 [details] [diff] [review] 1292311.patch Review of attachment 8800206 [details] [diff] [review]: ----------------------------------------------------------------- Can you include a rationale for this driver version?
Attachment #8800296 - Flags: review?(jmuizelaar) → review+
Pushed by email@example.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/d1ed33f3fdd2 Blacklist nVidia drivers <= 187.45 for frequent shutdown crashes - r=jrmuizel
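A "drivers <= 187.45" rule boils down to an ordered comparison of dotted version numbers. The sketch below is illustrative Python, not the code in the landed patch. It assumes the common NVIDIA convention that the marketing number is formed from the last digit of the third field plus the fourth field of the Windows driver version (so 8.15.11.8745 would be 187.45); that mapping and the names `parse_version`, `marketing_version`, `BLOCK_UP_TO` and `is_blocked` are assumptions introduced here.

```python
def parse_version(text):
    """Split a dotted Windows driver version such as '8.15.11.8745' into integers."""
    return tuple(int(part) for part in text.split("."))


def marketing_version(windows_version):
    """Derive the marketing number under the assumed NVIDIA convention:
    last digit of the third field followed by the zero-padded fourth field."""
    _, _, third, fourth = parse_version(windows_version)
    digits = f"{third % 10}{fourth:04d}"
    return f"{int(digits[:3])}.{digits[3:]}"


# Hypothetical threshold: the Windows-style version assumed to match 187.45.
BLOCK_UP_TO = parse_version("8.15.11.8745")


def is_blocked(windows_version, limit=BLOCK_UP_TO):
    """Return True when the driver is at or below the blocked version."""
    return parse_version(windows_version) <= limit


if __name__ == "__main__":
    for candidate in ("8.15.11.8745", "8.15.11.8746", "8.17.12.5896"):
        print(candidate, marketing_version(candidate), is_blocked(candidate))
```

Comparing integer tuples avoids the usual pitfall of string comparison, where "8.9" would sort after "8.15".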
Comment on attachment 8800296 [details] [diff] [review] 1292311.patch

Approval Request Comment
[Feature/regressing bug #]: bug 1284322.
[User impact if declined]: crashes.
[Describe test coverage new/current, TreeHerder]: -
[Risks and why]: very low. just blacklisting some old hardware.
[String/UUID change made/needed]:
Comment on attachment 8800296 [details] [diff] [review] 1292311.patch Crash fix, Aurora51+, Beta50+ (I hope we can land this in time for inclusion in 50.0b7)
Andre, note that we're disabling acceleration on a bunch of Nvidia cards due to an increase in crashes we've detected in 49. Enabling acceleration in the first place was the result of an effort to enable WebGL on more machines - this is now a step backwards. I don't know exactly how the numbers will be affected; there are some indications in bug 1284322, but we should look for changes in the statistics.
Edwin, what kinds of uptime distribution are we seeing on these crashes? Given the WebGL angle, if these are startup crashes, the story is clear, but if it's after a long use...
Thanks Milan. Is this something we can eventually fix/work around? Or are they Nvidia driver issues, and the only fix is for the users to update the drivers (if working ones even exist)?
We don't have a reproducible case for these crashes, so it's difficult for us to fix the underlying problem, even if it is on our side.
(In reply to Milan Sreckovic [:milan] from comment #33)
> Edwin, what kinds of uptime distribution are we seeing on these crashes?
> Given the WebGL angle, if these are startup crashes, the story is clear, but
> if it's after a long use...

The distribution looks pretty uniform.

Most of the users that were un-blacklisted in bug 1284322 should still be off the blacklist. Ran some rough numbers [1], and this change adds about 0.7% of users on top of those already on the blacklist (from ~0.88% to ~1.56%) -- not a small number, but not unreasonable IMO.

[1] https://sql.telemetry.mozilla.org/queries/1425/source
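As a quick check on those rough numbers, the arithmetic can be written out as below. This is only a sketch using the two approximate percentages quoted in the comment; the function names are made up for illustration.

```python
def added_points(before_pct, after_pct):
    """Percentage points of users newly covered by the blacklist."""
    return after_pct - before_pct


def relative_growth(before_pct, after_pct):
    """How many times larger the blacklisted share becomes."""
    return after_pct / before_pct


if __name__ == "__main__":
    before, after = 0.88, 1.56  # approximate shares quoted in the comment
    print(round(added_points(before, after), 2))     # 0.68, i.e. about 0.7 points
    print(round(relative_growth(before, after), 2))  # 1.77
```

So the addition is small in absolute terms (about 0.7 percentage points) while still growing the blacklisted population by roughly three quarters.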
Hey, indeed, as commented here, this might be a step backwards. It feels unfortunate, but I can appreciate if we need to do this. There has been a big push that Milan's been driving in the past months to be much more aware of the blacklisting activities we do for graphics, because the feedback from notable companies utilizing WebGL commercially has revealed that WebGL adoption is one of the largest pain points affecting their migration to WebGL based technologies.

Btw, this is an example of a blacklisting entry that is being introduced without understanding what causes it. We know that a crash occurs at exit, but that is not really the cause; presumably the WebGL stack operates the driver in some fashion that causes a late delayed crash when closing. If we had a repro, we might be able to connect the crash-at-closing-down to a specific feature or API call that causes it, and possibly work around it. Debugging that type of issue can be practically impossible if there is no repro case though, so I can appreciate if we need to forfeit that line of investigation and just strike off these driver versions. Based on Edwin's comment above, this is likely a small population, but I'd like to be certain.

When making this kind of blacklist, I'm hoping that we get the reasoning written down precisely, so that in the future we will remember our blacklist landscape easily. There have been some blacklists that have been introduced in the past and then forgotten (until Milan and Jeffs and others came back to them), so we want to keep comfortably good track of the reasons that we blacklisted so later auditing is easy.

Reading the conversation trail, the new blacklist would cover all Windows OSes (Xp, 7, Vista, 8, 10?) that have NVidia driver 187.45 or older. Though presumably there do not exist any Windows 8 or 10 users with this driver version at all, so this would be Xp, 7 and Vista specific?

1) It looks like this blacklist will not be a "hole" in the series of driver versions, but will in practice introduce a new minimum driver version requirement?
2) Why was this exact driver version chosen? (Was it the last driver version that had this crash signature?)
3) How big a percentage of overall users in the wild do we expect to lose for WebGL with this blacklist? (0.7% of all Firefox users?) I am especially surprised by the seeming conflict with this being the #5 top crasher, which suggests that we might have a larger user base affected?
4) What was the driver release date for that version? (October 2009?)
5) What is the next driver version that we know to work, and its release date?
6) Did the WebGL context creation error message infrastructure that was talked about earlier go live, so that attempting to create a WebGL context on a now blocked driver will get an error message about this? Can we make the message point to this bug entry? ("WebGL on this system is disabled. See https://bugzilla.mozilla.org/show_bug.cgi?id=1292311")

We'd like to offer a machinery to developers so that they know how to present the appropriate error dialogs to users, along the lines of "Try updating your graphics drivers". In particular #5 and #6 would put my mind at ease with these types of blacklist items, since those are an effective way to help the WebGL adoption problem on the developer side. Developers love minimum hardware specifications, so having the exact specs nailed down, like "Minimum requirement: NVidia graphics cards: October 2009/driver 187.45 or newer", is the way that developers like to manage their responsibility with their user bases. Minimum specs are good for us as well, since occasionally we have received comments from devs saying that we have poor WebGL adoption, so being able to explicitly list out "WebGL works on these hardware/OS/driver combos" is effective, especially if we are able to note that we support e.g. NVidia drivers way back to 2009, which is not exactly a new one.

Great work nailing this down!
See Also: → 1375151