Crash in nvwgf2um.dll | BaseThreadInitThunk
Categories
(Core :: Graphics, defect, P3)
Tracking
()
Tracking | Status | |
---|---|---|
platform-rel | --- | - |
firefox48 | --- | wontfix |
firefox49 | --- | wontfix |
firefox-esr45 | --- | wontfix |
firefox50 | + | wontfix |
firefox51 | --- | wontfix |
firefox52 | --- | wontfix |
firefox-esr52 | --- | affected |
firefox53 | --- | wontfix |
firefox54 | --- | fix-optional |
firefox55 | --- | wontfix |
firefox56 | --- | wontfix |
firefox57 | --- | wontfix |
firefox58 | --- | fix-optional |
firefox59 | --- | ? |
People
(Reporter: u279076, Unassigned)
References
Details
(Keywords: crash, regression, Whiteboard: [gfx-noted][platform-rel-nVidia])
Crash Data
Attachments
(5 files, 7 obsolete files)
43.32 KB,
image/png
|
Details | |
1.80 KB,
patch
|
kechen
:
review+
gchang
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
2.34 KB,
patch
|
kechen
:
review+
gchang
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
2.28 KB,
patch
|
ritu
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
3.07 KB,
patch
|
ritu
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is report bp-c27d44ac-645d-48d4-8e4f-263892160804. ============================================================= Ø 0 nvwgf2um.dll nvwgf2um.dll@0x8078a3 Ø 1 nvwgf2um.dll nvwgf2um.dll@0x87d1af Ø 2 nvwgf2um.dll nvwgf2um.dll@0x8ad8cd Ø 3 nvwgf2um.dll nvwgf2um.dll@0x8ac8e9 Ø 4 nvwgf2um.dll nvwgf2um.dll@0x80fcd3 Ø 5 nvwgf2um.dll nvwgf2um.dll@0x801378 Ø 6 nvwgf2um.dll nvwgf2um.dll@0xa2dfb Ø 7 nvwgf2um.dll nvwgf2um.dll@0x92f64 Ø 8 nvwgf2um.dll nvwgf2um.dll@0x1ad494 Ø 9 nvwgf2um.dll nvwgf2um.dll@0x90d52e Ø 10 nvwgf2um.dll nvwgf2um.dll@0x90d656 11 kernel32.dll BaseThreadInitThunk 12 ntdll.dll __RtlUserThreadStart 13 ntdll.dll _RtlUserThreadStart ============================================================= More reports: https://crash-stats.mozilla.com/signature/?product=Firefox&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk There are no strong correlations to a particular device/driver.
Comment 1•7 years ago
|
||
~40% of crashes with this signature are occurring with 9.18.13.4192 vs ~5% of all crashes with AMD cards. I don't know if this info is useful.
Comment 2•7 years ago
|
||
Crash volume for signature 'nvwgf2um.dll | BaseThreadInitThunk': - nightly (version 51): 4 crashes from 2016-08-01. - aurora (version 50): 100 crashes from 2016-08-01. - beta (version 49): 728 crashes from 2016-08-02. - release (version 48): 1972 crashes from 2016-07-25. - esr (version 45): 235 crashes from 2016-05-02. Crash volume on the last weeks (Week N is from 08-22 to 08-28): W. N-1 W. N-2 W. N-3 - nightly 1 2 0 - aurora 36 29 14 - beta 270 221 99 - release 683 541 340 - esr 10 26 27 Affected platform: Windows Crash rank on the last 7 days: Browser Content Plugin - nightly #307 - aurora #74 #24 - beta #53 #184 #1320 - release #27 #45 #2322 - esr #512
This crash continues to rise in Release and is now at #17 @ 0.56% with 1367 crashes.
While this didn't originally show a strong correlation it seems like there is a strong correlation forming. Perhaps unsurprisingly given the signature, 98.38% of crashes in the last week were on NVidia devices. There is a broad swath of GPUs, 30% are on Tesla, 17% are on Kepler, 12% are on Maxwell, 12% are on Fermi. The most common device is the Geforce 210 @ 9.12%. There is also a broad swatch of drivers with 39% on 341.*, 15% on 368.*, 10% on 353.*, and 9% on 372.*. The most common driver is 341.92 at 26.86%. It's also worth noting that the spike seems to be driven at least in part by the Geforce 210 on driver 341.92 (see graphs in attached screenshot). The live version can be generated by pasting the signature into http://ashughes1.github.io/metrics-graphics-gfx/.
Comment 5•7 years ago
|
||
(In reply to Marco Castelluccio [:marco] from comment #1) > ~40% of crashes with this signature are occurring with 9.18.13.4192 vs ~5% > of all crashes with AMD cards. I don't know if this info is useful. Obviously here I meant NVIDIA :) Also interesting: (69.36% in signature vs 26.39% overall) platform_pretty_version = Windows 10 (31.33% in signature vs 01.16% overall) adapter_driver_version = 9.18.13.4192
(In reply to Anthony Hughes (:ashughes) [GFX][QA][Mentor] from comment #4) > ... Perhaps unsurprisingly given the signature, > 98.38% of crashes in the last week were on NVidia devices. Yes, these are Nvidia driver DLLs, I'm surprised it isn't 100% Nvidia.
(In reply to Milan Sreckovic [:milan] from comment #6) > Yes, these are Nvidia driver DLLs, I'm surprised it isn't 100% Nvidia. The other 1.62% are Intel which are likely dual GPU systems where we detect Intel as the primary GPU.
Comment 8•7 years ago
|
||
Crash volume for signature 'nvwgf2um.dll | BaseThreadInitThunk': - nightly (version 52): 33 crashes from 2016-09-19. - aurora (version 51): 11 crashes from 2016-09-19. - beta (version 50): 1057 crashes from 2016-09-20. - release (version 49): 2340 crashes from 2016-09-05. - esr (version 45): 310 crashes from 2016-06-01. Crash volume on the last weeks (Week N is from 10-03 to 10-09): W. N-1 W. N-2 - nightly 15 18 - aurora 11 0 - beta 842 215 - release 1879 460 - esr 5 12 Affected platform: Windows Crash rank on the last 7 days: Browser Content Plugin - nightly #72 - aurora #447 #132 #362 - beta #14 #20 - release #25 #21 #616 - esr #1422
Updated•7 years ago
|
Comment 9•7 years ago
|
||
[Tracking Requested - why for this release]: this crash signature is quadrupling in 50.0b builds compared to 49.0b (from ~40 to ~160 daily crashes).
I'm not sure how actionable this is. The stacks look like this: Frame Module Signature Source Ø 0 nvwgf2um.dll nvwgf2um.dll@0xaedb35 Ø 1 nvwgf2um.dll nvwgf2um.dll@0xaed4d8 Ø 2 nvwgf2um.dll nvwgf2um.dll@0xaecc99 Ø 3 nvwgf2um.dll nvwgf2um.dll@0xaea74c Ø 4 nvwgf2um.dll nvwgf2um.dll@0xae0994 Ø 5 nvwgf2um.dll nvwgf2um.dll@0x157cb1 Ø 6 nvwgf2um.dll nvwgf2um.dll@0x633f04 Ø 7 nvwgf2um.dll nvwgf2um.dll@0x7e53f1 Ø 8 nvwgf2um.dll nvwgf2um.dll@0x16b619 Ø 9 nvwgf2um.dll nvwgf2um.dll@0x16b3d1 Ø 10 nvwgf2um.dll nvwgf2um.dll@0x16354c Ø 11 nvwgf2um.dll nvwgf2um.dll@0x169818 Ø 12 nvwgf2um.dll nvwgf2um.dll@0x16973c Ø 13 nvwgf2um.dll nvwgf2um.dll@0x6db3c4 Ø 14 nvwgf2um.dll nvwgf2um.dll@0xbd919a Ø 15 nvwgf2um.dll nvwgf2um.dll@0xbd92c2 16 kernel32.dll BaseThreadInitThunk 17 ntdll.dll __RtlUserThreadStart 18 ntdll.dll _RtlUserThreadStart and now that we aggregate on the first function with symbols, we get BaseThreadInitThunk. Which is like saying, look, this thread crashed somewhere, and we don't know where. The top address is driver specific, so we had a number of these before - now that they're all together, all the driver versions show up under this bug. For example, the driver version 9.18.13.4192 would have crashed at nvwgf2um.dll@0x8078a3, and the driver version 21.21.13.7290 at nvwgf2um.dll@0xaedb35. Those were separate reports before, now they're together. As in, I'm not convinced the actual crashes quadrupled.
Comment 11•7 years ago
|
||
i was basing that on this graph on how the signature's volume developed on the beta channel: https://crash-stats.mozilla.com/signature/?release_channel=beta&_sort=-date&signature=nvwgf2um.dll%20|%20BaseThreadInitThunk&date=%3E2016-06-01#graphs
As in, the volume goes up when we release 50 into beta? Fair enough. Still not sure they're actionable.
Comment 13•7 years ago
|
||
I listed the address range for this crash. It seems like we access the member of class/struct which was alraedy NULL for half of crash volume. 1 0x14 989 27.79 % 2 0x48 686 19.28 % 3 0x0 272 7.64 %
Comment 14•7 years ago
|
||
I found two report with 0x0 crash address, and both of them are related to driver reset. https://crash-stats.mozilla.com/report/index/c3fed968-9592-4666-9201-fbb2c2161006#tab-metadata https://crash-stats.mozilla.com/report/index/5379fcf4-e64a-4a39-b4e2-210862161007#tab-metadata [12][GFX1-]: GFX: D3D11 lock mutex abandoned (t=1253.11) | [13][GFX1-]: [D3D11] TextureSourceD3D11:GetShaderResourceView CreateSRV failure 0x887a0005 (t=1253.11) [14][GFX1-]: D3D11 swap chain preset failed 0x887a0005 (t=1253.11) | [15][GFX1-]: [CompositorD3D11] device removed with error code: 0x887a0005, 0x887a0006, 0 (t=1253.16)
Comment 15•7 years ago
|
||
From the crash reports in comment 14, I saw we had called swapchain->present several times even we know device was removed and then firefox crashed at nv d3d library. I think we can do the same error handling in EndFrame() just like beginFrame()[1]. |[0][GFX1-]: D3D11 swap chain preset failed 0x887a0005 (t=265.555) |[1][GFX1-]: [CompositorD3D11] device removed with error code: 0x887a0005, 0x887a0005, 0 (t=265.555) |[2][GFX1-]: Detected rendering device reset on refresh: 7 (t=265.555) |[3][GFX1-]: Detected rendering device reset on refresh: 7 (t=265.555) |[4][GFX1-]: D3D11 swap chain preset failed 0x887a0005 (t=266.868) |[5][GFX1-]: Out of sync D3D11 devices in HandleError, 0 (t=266.868) |[6][GFX1-]: [CompositorD3D11] device removed with error code: 0x887a0005, 0x887a0020, 0 (t=266.868) |[7][GFX1-]: GFX: D3D11 abandoned sync (t=277.546) |[8][GFX1-]: GFX: D3D11 lock mutex abandoned (t=277.546) |[9][GFX1-]: GFX: D3D11 lock mutex abandoned (t=277.546) |[10][GFX1-]: D3D11 swap chain preset failed 0x887a0005 (t=287.4) |[11][GFX1-]: [CompositorD3D11] device removed with error code: 0x887a0005, 0x887a0007, 0 (t=287.4) [1]http://searchfox.org/mozilla-central/source/gfx/layers/d3d11/CompositorD3D11.cpp#967
Comment 16•7 years ago
|
||
Jerry, please work with kevin to debug this issue.
Comment 17•7 years ago
|
||
Hi Jerry, I've added an early return condition in EndFrame according to comment 15, could you help me to check if I miss anything here ? Thank you.
Comment 18•7 years ago
|
||
Comment on attachment 8800572 [details] [diff] [review] Skip CompositorD3D11::EndFrame when device remove happened. Review of attachment 8800572 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/d3d11/CompositorD3D11.cpp @@ +1069,5 @@ > > void > CompositorD3D11::EndFrame() > { > + if (!mDefaultRT || mDevice->GetDeviceRemovedReason() != S_OK) { I think we would better to log early return here when device got removed. And I think we should do more cleanup if we are going to early return in EndFrame().
Comment 19•7 years ago
|
||
Comment on attachment 8800572 [details] [diff] [review] Skip CompositorD3D11::EndFrame when device remove happened. Review of attachment 8800572 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/d3d11/CompositorD3D11.cpp @@ +1074,4 @@ > Compositor::EndFrame(); > return; > } > put the checking at this line, then we can know the early return is just for device-removed. and also add some gfxcriticalnote and clean the rendertarget. mCurrentRT = nullptr
Comment 20•7 years ago
|
||
The CompositorD3D11::BeginFrame() should also have driver-removed log.
Comment 21•7 years ago
|
||
Hi Jerry, I've modified the patch according to the comments, please help me to check it. Thank you.
Comment hidden (spam) |
Comment hidden (spam) |
Comment 24•7 years ago
|
||
I added some logs to the place where we fail to recreate new compositors. In current code, without really handling this situation, we just finish the device-removed handle process and wait for other codes to trigger the next device-removed handle process. I think this patch can help us to figure out if the device-removed crash in this signature is caused by the situation mentioned above. And I found that I misunderstood the code flow in comment 22. The code flow[1] is okay. Please ignore the comment. [1] https://dxr.mozilla.org/mozilla-central/rev/01ab78dd98805e150b0311cce2351d5b408f3001/gfx/thebes/DeviceManagerDx.cpp#318-326
Comment 25•7 years ago
|
||
Comment on attachment 8801968 [details] [diff] [review] Skip CompositorD3D11::EndFrame when device remove happened_v2 Review of attachment 8801968 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/d3d11/CompositorD3D11.cpp @@ +974,5 @@ > // Don't composite if we are minimised. Other than for the sake of efficency, > // this is important because resizing our buffers when mimised will fail and > // cause a crash when we're restored. > NS_ASSERTION(mHwnd, "Couldn't find an HWND when initialising?"); > + if (::IsIconic(mHwnd)) { I would prefer you add additional GetDeviceRemovedReason() here and call gfxCriticalNote. Or you can just dump handle and device status at the same time in one gfxCritialNote. @@ +1085,5 @@ > > + if (mDevice->GetDeviceRemovedReason() != S_OK) { > + gfxCriticalNote << "GFX: D3D11 skip EndFrame with device-removed."; > + Compositor::EndFrame(); > + mCurrentRT = nullptr; I think we don't need to take care of mCurrentRT because it will be changed by SetRenderTarget() in BeginFrame() call. And again, you can merge this with line 1081 and output gfxCriticalNote with mDefaultRT and device status.
Comment 26•7 years ago
|
||
Comment on attachment 8802427 [details] [diff] [review] Add logs to record the failure of compositor creation. Review of attachment 8802427 [details] [diff] [review]: ----------------------------------------------------------------- Could we dump failure reason in CompositorBridgeParent::NewCompositor[1]? Then it covers not only D3D11 but also D3D9 backend. [1]https://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorBridgeParent.cpp#1762
Comment 27•7 years ago
|
||
Hi Jerry, I've modified the patch according to comment 25. Can you help me to review it ? Thank you.
Comment 28•7 years ago
|
||
Hi Jerry, I've modified the patch according to comment 26. Can you help me to review it? Thank you.
Comment 29•7 years ago
|
||
NVIDIA is tracking this as NVIDIA bug # 1828302
Comment 30•7 years ago
|
||
Comment on attachment 8803206 [details] [diff] [review] Skip CompositorD3D11::EndFrame when device remove happened_v3 Review of attachment 8803206 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/d3d11/CompositorD3D11.cpp @@ +981,5 @@ > // read-unlock our textures to prevent them from accumulating. > ReadUnlockTextures(); > *aRenderBoundsOut = IntRect(); > + > + gfxCriticalNote << "GFX: D3D11 skip BeginFrame and the following actions \ I think this message should be out of "IsIconic()" if block. It's more possible to trigger window-minimizing. Then, we will have a lot of gfxCriticalNote message. @@ +1081,3 @@ > Compositor::EndFrame(); > + > + gfxCriticalNote << "GFX: D3D11 skip EndFrame without render target or with \ Make this message out of "mDefaultRT" if block.
Comment 31•7 years ago
|
||
(In reply to Mark Kilgard from comment #29) > NVIDIA is tracking this as NVIDIA bug # 1828302 Is it possible to show the bug link here?
Comment 32•7 years ago
|
||
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #30) > Comment on attachment 8803206 [details] [diff] [review] > Skip CompositorD3D11::EndFrame when device remove happened_v3 > > Review of attachment 8803206 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: gfx/layers/d3d11/CompositorD3D11.cpp > @@ +981,5 @@ > > // read-unlock our textures to prevent them from accumulating. > > ReadUnlockTextures(); > > *aRenderBoundsOut = IntRect(); > > + > > + gfxCriticalNote << "GFX: D3D11 skip BeginFrame and the following actions \ > > I think this message should be out of "IsIconic()" if block. It's more > possible to trigger window-minimizing. Then, we will have a lot of > gfxCriticalNote message. > > @@ +1081,3 @@ > > Compositor::EndFrame(); > > + > > + gfxCriticalNote << "GFX: D3D11 skip EndFrame without render target or with \ > > Make this message out of "mDefaultRT" if block. Peter, my opinion is different from your comment 25. Please check comment 30.
Updated•7 years ago
|
Updated•7 years ago
|
Comment 33•7 years ago
|
||
We also have similar AMD signatures ("atidxx32.dll | BaseThreadInitThunk" and "atidxx64.dll | BaseThreadInitThunk"), but their volume is smaller.
Comment 34•7 years ago
|
||
Comment on attachment 8803206 [details] [diff] [review] Skip CompositorD3D11::EndFrame when device remove happened_v3 Review of attachment 8803206 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/d3d11/CompositorD3D11.cpp @@ +981,5 @@ > // read-unlock our textures to prevent them from accumulating. > ReadUnlockTextures(); > *aRenderBoundsOut = IntRect(); > + > + gfxCriticalNote << "GFX: D3D11 skip BeginFrame and the following actions \ Yes, you are right. To keep this simple, I'm fine to move this log out of this loop. @@ +1081,3 @@ > Compositor::EndFrame(); > + > + gfxCriticalNote << "GFX: D3D11 skip EndFrame without render target or with \ After second thought, I also agree to move this out to keep the logic simple.
Updated•7 years ago
|
Comment 35•7 years ago
|
||
(79.84% in signature vs 05.45% overall) "DXVA2D3D11+" in app_notes = true (80.58% in signature vs 06.80% overall) "DXVA2D3D11?" in app_notes = true (80.97% in signature vs 23.83% overall) platform_pretty_version = Windows 10 At least a subset of the crashes looks related to DXVA2D3D11. Have we changed anything with regards to this in 48 or 49? The volume for both this signature and the signature from bug 1299520 increased in 48/49 (they became more frequent in 48 and probably even more frequent in 49). The crash in bug 1299520 was probably fixed by an update to the AMD driver, but one of the reporters is saying that, even after the crash fix, he's having smoothness issues.
Comment 36•7 years ago
|
||
IIRC, we've enabled DXVA2D3D11 in 48. Perhaps we need to increase the driver versions we blocklist, given this bug and bug 1299520.
Comment 37•7 years ago
|
||
I'm not sure where the regression in 50 comes from. Comparing the data for this signature between 49 release and 50 beta is interesting: Release (80.97% in signature vs 23.83% overall) platform_pretty_version = Windows 10 (71.15% in signature vs 17.70% overall) platform_version = 10.0.14393 (79.84% in signature vs 05.45% overall) "DXVA2D3D11+" in app_notes = true (77.64% in signature vs 49.81% overall) "D3D11 Layers+" in app_notes = true (97.63% in signature vs 48.63% overall) "D2D1.1+" in app_notes = true (07.23% in signature vs 19.65% overall) "DXVA2D3D9+" in app_notes = true Beta (61.96% in signature vs 40.22% overall) platform_version = 6.1.7601 Service Pack 1 (24.50% in signature vs 05.98% overall) "DXVA2D3D11+" in app_notes = true (88.04% in signature vs 49.04% overall) "D3D11 Layers+" in app_notes = true (73.34% in signature vs 36.12% overall) "D2D1.1+" in app_notes = true (30.47% in signature vs 26.97% overall) "DXVA2D3D9+" in app_notes = true (33.29% in signature vs 04.45% overall) GFX_ERROR "DXVA2D3D9 video decoding is disabled due to a previous crash." = true So looks like this crash is more frequent on Release with Windows 10 Anniversary Edition and DXVA2D3D11. It's more frequent on Beta with Windows 7 SP1 and both DXVA2D3D11 and DXVA2D3D9.
Updated•7 years ago
|
Comment 38•7 years ago
|
||
Comment on attachment 8801968 [details] [diff] [review] Skip CompositorD3D11::EndFrame when device remove happened_v2 Hi David, This patch try to show some informations and skip the composition call for driver-remove in "BeginFrame" and "EndFrame".
Comment 39•7 years ago
|
||
Comment on attachment 8803206 [details] [diff] [review] Skip CompositorD3D11::EndFrame when device remove happened_v3 Use version 2(attachment 8801968 [details] [diff] [review]) is enough.
Updated•7 years ago
|
Comment on attachment 8801968 [details] [diff] [review] Skip CompositorD3D11::EndFrame when device remove happened_v2 Review of attachment 8801968 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/d3d11/CompositorD3D11.cpp @@ +983,5 @@ > return; > } > > + if (mDevice->GetDeviceRemovedReason() != S_OK) { > + gfxCriticalNote << "GFX: D3D11 skip BeginFrame and the following actions \ nit: This reads a little better without "and the following actions".
Comment on attachment 8803207 [details] [diff] [review] Add logs to record the failure of compositor creation_v2 Review of attachment 8803207 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/ipc/CompositorBridgeParent.cpp @@ +2390,1 @@ > return Nothing(); This should never happen because the BasicCompositor is always a (hopefully infallible) fallback. It'd be better to MOZ_RELEASE_ASSERT here if anything.
Comment 42•7 years ago
|
||
Carry r+ from comment 40.
Comment 43•7 years ago
|
||
Carry r+ from comment 41.
Updated•7 years ago
|
Comment 44•7 years ago
|
||
(In reply to Marco Castelluccio [:marco] from comment #36) > IIRC, we've enabled DXVA2D3D11 in 48. Perhaps we need to increase the driver > versions we blocklist, given this bug and bug 1299520. Looks like Jerry&Kevin are working on this, so I'll step back into the shadows. Please ping me if I'm needed again.
Comment 45•7 years ago
|
||
Pushed by ryanvm@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/34fd4c7639ec Skip CompositorD3D11::EndFrame when device-removed happened and add some logs for tracking the behavior. r=dvander https://hg.mozilla.org/integration/mozilla-inbound/rev/0c0b78fc1a6d Add logs to record the failure of compositor creation. r=dvander
Comment 46•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/34fd4c7639ec https://hg.mozilla.org/mozilla-central/rev/0c0b78fc1a6d
Comment 48•7 years ago
|
||
Approval Request Comment [Feature/regressing bug #]: crash [User impact if declined]: May cause the crash. [Describe test coverage new/current, TreeHerder]: No test. [Risks and why]: I think the risk is low, there's already another early return path and we just add another early return condition here. [String/UUID change made/needed]: None
Comment 49•7 years ago
|
||
Approval Request Comment [Feature/regressing bug #]: Crash [User impact if declined]: Make us more difficult to trace the crash. [Describe test coverage new/current, TreeHerder]: No test [Risks and why]: Low risk, just add several logs for tracing this crash more easier. [String/UUID change made/needed]: No
Comment 50•7 years ago
|
||
Comment on attachment 8803673 [details] [diff] [review] Skip CompositorD3D11::EndFrame when device remove happened_v2. Approval Request Comment [Feature/regressing bug #]: crash [User impact if declined]: May cause the crash. [Describe test coverage new/current, TreeHerder]: No test. [Risks and why]: Low risk, there's already another early return path and we just add another early return condition here. [String/UUID change made/needed]: None
Comment 51•7 years ago
|
||
Comment on attachment 8803674 [details] [diff] [review] Add logs to record the failure of compositor creation_v2 Approval Request Comment [Feature/regressing bug #]: Crash [User impact if declined]: Make us more difficult to trace the crash. [Describe test coverage new/current, TreeHerder]: No test [Risks and why]: Low risk, just add several logs for tracing this crash more easier. [String/UUID change made/needed]: No
Comment on attachment 8803752 [details] [diff] [review] [beta-uplift] Add logs to record the failure of compositor creation. Patch looks simple enough and low risk, the crash volume of the first signature is pretty high on 50, let's uplift.
Hi Milan, this fix seemed like a low risk candidate for beta uplift. Please let me know if you think otherwise. Thanks!
I'm hitting conflicts applying this to beta.
Comment 55•7 years ago
|
||
I'm concerned this bug is due to an NVIDIA GPU reset. The suggested to Skip CompositorD3D11::EndFrame when device remove happened and falling back to an infallible compositor may be masking/hiding a more fundamental NVIDIA driver bug Firefox is tickling. I'm worried that what is getting reported as a "device removal" is not actually that but a GPU reset and the only way that unexpected not-supposed-to-happen fault can be reported is as a device removal, but no device is actually being removed. I'm still looking for an appropriate NVIDIA engineer to triage this issue appropriately. Having the additional logging is certainly a good idea.
Matt, this code got changed enough that I wouldn't mind an explicit confirmation as to what we want it to look like. For example, do we want to call the base class' EndFrame() before or after EnsureSize(), or does it matter?
Comment 57•7 years ago
|
||
Some users are reporting things like: > Crash happened (again) when updating Nvidia Graphics Drivers to their latest version using the GeForce Experience program. This has happened (and I've reported it) before in previous version of both firefox and nvidia drivers. > I was updating my graphics drivers during the crash > Crashed during installation of Nvidia graphics driver > Was installing a graphics driver, when the screen "refreshed" firefox crashed. > Nvidia graphics driver update just completed > Was browsing reddit while my nvidia graphics card updated. > Probably due to updating video drivers? > videocard or drive from videocard? And there are some reports with URLs that are NVIDIA driver download pages. This might explain a subset of the crashes, I find it hard to believe that all these people are updating the drivers so often.
Comment 58•7 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #56) > Matt, this code got changed enough that I wouldn't mind an explicit > confirmation as to what we want it to look like. For example, do we want to > call the base class' EndFrame() before or after EnsureSize(), or does it > matter? It shouldn't matter.
Comment 59•7 years ago
|
||
(In reply to Mark Kilgard from comment #55) > I'm concerned this bug is due to an NVIDIA GPU reset. > > The suggested to Skip CompositorD3D11::EndFrame when device remove happened > and falling back to an infallible compositor may be masking/hiding a more > fundamental NVIDIA driver bug Firefox is tickling. > > I'm worried that what is getting reported as a "device removal" is not > actually that but a GPU reset and the only way that unexpected > not-supposed-to-happen fault can be reported is as a device removal, but no > device is actually being removed. > > I'm still looking for an appropriate NVIDIA engineer to triage this issue > appropriately. > > Having the additional logging is certainly a good idea. This is almost certainly the case in some (possibly most) of these crash reports. It appears that D3D returns DXGI_ERROR_DEVICE_REMOVED from most API calls when we're in this state, and then we can query GetDeviceRemovedReason() for the 'actual' problem. This shows up in the metadata tab of crash reports as a message like this (from comment 14): [15][GFX1-]: [CompositorD3D11] device removed with error code: 0x887a0005, 0x887a0006 0x887a0005 is DXGI_ERROR_DEVICE_REMOVED (the initial failure) and 0x887a0006 is DXGI_ERROR_DEVICE_HUNG (the actual problem).
Comment 60•7 years ago
|
||
~10% of the crashes with this signature have the string "[CompositorD3D11] device removed with error code" in graphics_critical_error.
Comment 61•7 years ago
|
||
Approval Request Comment [Feature/regressing bug #]: crash [User impact if declined]: May cause the crash. [Describe test coverage new/current, TreeHerder]: No test. [Risks and why]: I think the risk is low, there's already another early return path and we just add another early return condition here. [String/UUID change made/needed]: None Hello ritu, the former patch hit conflict and I've modified the patch, please help me to approve this request again, thank you.
Comment 62•7 years ago
|
||
Approval Request Comment [Feature/regressing bug #]: Crash [User impact if declined]: Make us more difficult to trace the crash. [Describe test coverage new/current, TreeHerder]: No test [Risks and why]: Low risk, just add several logs for tracing this crash more easier. [String/UUID change made/needed]: No Hello ritu, the former patch hit conflict and I've modified the patch, please help me to approve this request again, thank you.
Comment on attachment 8804527 [details] [diff] [review] [beta-uplift] Add logs to record the failure of compositor creation. Sure! Beta50+
Comment 64•7 years ago
|
||
Comment on attachment 8803673 [details] [diff] [review] Skip CompositorD3D11::EndFrame when device remove happened_v2. Fix a crash. Take it in 51 aurora.
Updated•7 years ago
|
Comment 65•7 years ago
|
||
bugherderuplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/f43ffbca8d4d https://hg.mozilla.org/releases/mozilla-beta/rev/5ebcf218f678
Comment 66•7 years ago
|
||
bugherderuplift |
https://hg.mozilla.org/releases/mozilla-aurora/rev/256a14171223 https://hg.mozilla.org/releases/mozilla-aurora/rev/c26d402c9cf2
Comment 67•7 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #65) > https://hg.mozilla.org/releases/mozilla-beta/rev/5ebcf218f678 Had to back this rev out for browser_Troubleshoot.js failures. Kevin says the other patch is fine to leave in while he figures out why the test failed on Beta, so we'll hopefully still get the stability wins in time for the next Beta. https://hg.mozilla.org/releases/mozilla-beta/rev/90c412c9029128b2c60f53c7c013239b2b9b75d5 https://treeherder.mozilla.org/logviewer.html#?job_id=1879329&repo=mozilla-beta
Comment hidden (spam) |
Comment hidden (spam) |
Comment 70•7 years ago
|
||
Patch[1] is just for logging and patch[2] is the one which really changes the code flow. I will monitor the crash report in these days to see if we really fix the crash with patch [2]. [1] https://hg.mozilla.org/releases/mozilla-beta/rev/5ebcf218f678 [2] https://hg.mozilla.org/releases/mozilla-beta/rev/f43ffbca8d4d
Comment 71•7 years ago
|
||
https://crash-stats.mozilla.com/report/index/e5b99f76-2f65-4344-acfd-5680c2161031#tab-metadata In the FF52, Crash still happened even we skip beginFrame after device was removed. [12][GFX1-]: GFX: D3D11 skip BeginFrame with device-removed. (t=250652)
Comment 72•7 years ago
|
||
For those reports which crash at address 0x22[1] have the following characteristics in common: 1. Operating system : Windows 10 / OS Version : 10.0.14936 2. Driver version : 21.21.13.XXXX. 3. Destructing DrawTargetD2D1 in thread 0 when the process crash. 4. Reset driver for more than one times and in a little period. The destruction of DrawTargetD2D1 might be relevant to the crash. Skip it when device rest is detected is a solution we can try; however, the DrawTargetD2D1 is owned by TextureData, when the refcount reaches to 0 it will be automatically removed, therefore, it's difficult for us to skip this part. [1] https://crash-stats.mozilla.com/signature/?address=%3D0x22&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-10-27T03%3A26%3A00.000Z&date=%3C2016-11-03T03%3A26%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#reports
Comment 73•7 years ago
|
||
For those reports which crash at address 0x14, 0x48[1] almost happen after firefox 50, something must be fixed in firefox 51. Lot of the crashes have a thread executing D3D11DXVA2Manager::CopyToImage, this might be relevant. [1] https://crash-stats.mozilla.com/signature/?address=%3D0x14&address=%3D0x48&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-09-01T08%3A46%3A00.000Z&date=%3C2016-11-04T08%3A46%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-version&_sort=-date&page=1
Comment 74•7 years ago
|
||
Blake, could you have someone to comment Kevin's finding in comment 73?
Comment 75•7 years ago
|
||
Matt should be the best right person to comment this.
Comment 76•6 years ago
|
||
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #73) > For those reports which crash at address 0x14, 0x48[1] almost happen after > firefox 50, something must be fixed in firefox 51. Almost none happen after firefox 50? Or only happen after firefox 50? Is the same change noticeable on beta/aurora/nightly? Can we try get a list of changesets for when it changed. > > Lot of the crashes have a thread executing D3D11DXVA2Manager::CopyToImage, > this might be relevant. That function spends a lot of time waiting for video frames to finish decoding. If the user is currently playing video, then I'd expect it to be on the stack a lot of the time, so it could be unrelated.
Comment 77•6 years ago
|
||
(In reply to Matt Woodrow (:mattwoodrow) from comment #76) > (In reply to Kevin Chen[:kechen] (UTC + 8) from comment #73) > > For those reports which crash at address 0x14, 0x48[1] almost happen after > > firefox 50, something must be fixed in firefox 51. > > Almost none happen after firefox 50? Or only happen after firefox 50? > Sorry it should be before firefox 51. > Is the same change noticeable on beta/aurora/nightly? Can we try get a list > of changesets for when it changed. > Overall, the crash number started to rise around Aug 2nd[1] and it was Firefox 48 in release channel; however I think it too hard to get a list of changesets, because the crash reports with version like 48.0bx are not accessible right now. [1] https://crash-stats.mozilla.com/signature/?signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-05-24T06%3A40%3A30.000Z&date=%3C2016-11-24T06%3A40%3A30.000Z#graphs > > > > Lot of the crashes have a thread executing D3D11DXVA2Manager::CopyToImage, > > this might be relevant. > > That function spends a lot of time waiting for video frames to finish > decoding. If the user is currently playing video, then I'd expect it to be > on the stack a lot of the time, so it could be unrelated.
Comment 78•6 years ago
|
||
Currently, the crash rate is still high in release but not in other channels[1]. Like comment 35 said, some correlations show that a subset of crash is related to DXVA2D3D11 and there is a relatively high crash rate in some particular adapter_driver_version : (90.80% in signature vs 07.18% overall) "DXVA2D3D11+" in app_notes = true (30.06% in signature vs 01.46% overall) adapter_driver_version = 21.21.13.6909 However this correlation only happens in release channel and is only relate to 30% of the crash, I am not sure if it's profitable to put it in blacklist like bug 1299520. [1] https://crash-stats.mozilla.com/signature/?product=Firefox&version=50.0&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-10-01T02%3A04%3A00.000Z&date=%3C2016-11-22T02%3A04%3A14.000Z#aggregations
Do we have more information from these additional patches?
Comment 80•6 years ago
|
||
I was suspecting that the device reset process was not completed; however, there are only a few records show logs from attachment 8803674 [details] [diff] [review] which might represent that the error is not caused by the failure when initializing devices. In attachment 8803673 [details] [diff] [review], we skip swap chain by making an early return when device need to be reset but it does not reduce the crash rate. It might be not the root cause. Currently, I am trying to investigate the problem I mentioned in comment 78.
Comment 81•6 years ago
|
||
Hello Matt, according to the graph in this link[1], the number of crash in beta release channel increased around 9/22 and decreased around 10/25. The date of decreasing happened between 50.0b9[2](build id 20161020152750) and 50.0b10[3](build id 20161024172922), and the changeset between this two is [4] and [5]. We might fixed something during this period, I have went through the changesets but found nothing. Do you have any idea about this ? [1] https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=beta&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-09-01T02%3A09%3A00.000Z&date=%3C2016-12-14T02%3A09%3A29.000Z#graphs [2] https://crash-stats.mozilla.com/signature/?product=Firefox&version=50.0b9&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-09-01T02%3A09%3A00.000Z&date=%3C2016-12-14T02%3A09%3A29.000Z#graphs [3] https://crash-stats.mozilla.com/signature/?product=Firefox&version=50.0b10&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-09-01T02%3A09%3A00.000Z&date=%3C2016-12-14T02%3A09%3A29.000Z#graphs [4] https://hg.mozilla.org/releases/mozilla-beta/rev/2bb6dc758711c00d84246d74b57e5aa6cae4b447 [5] https://hg.mozilla.org/releases/mozilla-beta/rev/38cfded1705240c5d20baff4aef99bdd0a13bcec
Comment 82•6 years ago
|
||
hi kevin, for the spike in 50 beta see bug https://bugzilla.mozilla.org/show_bug.cgi?id=1310600#c14 - apparently it was the fix from bug 1308418 that made the crash volume recline again.
Updated•6 years ago
|
Or the spike moved to bug 1322554?
Updated•6 years ago
|
Comment 84•6 years ago
|
||
We are currently working with Nvidia to see if there are more clues we can get. And for this signature, the number of the most frequently happened crash pattern (crash address is 0x14 or 0x48) has decreased in beta channel. Maybe there are some changes fix the problem, I will keep tracking this.
Comment 85•6 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #83) > Or the spike moved to bug 1322554? Sorry for the late reply. I am not sure if this crash is related, but the crash pattern is not very alike to me.
Comment 86•6 years ago
|
||
According to the response from Nvidia engineer, the root cause of this crash might be the race condition of deleting video driver resources and this is only happened in Windows 10. The thing is that when the volume of allocated video resources go above the OS's limitation, the OS will ask driver to trim the resources and if we happen to destruct the same video driver resources in the same time, there might be a chance that we access a dangling pointer and cause the crash. This inference fits to the feedbacks from users, most of them was playing video games, which needs massive video resources, when this crash happened. Nvdia will update their active drivers to prevent this problem and for the drivers with older version, maybe we should send user a notice to update their driver or blacklist all old versions on Windows 10 ?
Updated•6 years ago
|
Comment 87•6 years ago
|
||
That sounds like a great outcome! Can we find out what driver version it's going to be fixed in, so that we can set our blacklist to be before that version. We don't currently have an UI for asking users to update their drivers, so blacklist is our best choice. Do we know which drivers/cards are affected? Is it the entirety of NVIDIA on Win10? That is a lot of users that won't get hardware accelerated video decoding. When we have the GPU process enabled (currently enabled on Nightly 53, but should ride the trains from here), maybe we can avoid blacklisting since the penalty of crashing is much lower (GPU process restarts silently, user won't notice).
Comment 88•6 years ago
|
||
According to Nvidia, they will fix this problem with active version 376.62, 378.42 and the later. Also, they said that this is a rare case, the leak of resources from app would rise the possibility of this crash and since the fact that the crash number has decreased in the beta 51, is it possible that we fixed some critical leaks between 50 and 51 and lower the crash number ? And maybe we should wait for couple days and see if this crash still happen with new version drivers before we blacklist the old drivers.
Comment 89•6 years ago
|
||
If this help, I get this crash only when playing Overwatch while Firefox is running in the background, playing YouTube. Started crashing and crash only when playing Overwatch. GTX650 i3-3rd gen, Win10x64 My crash https://crash-stats.mozilla.com/report/index/d83b248c-769d-4e8f-a1b9-d02992170115
Comment 90•6 years ago
|
||
(In reply to rhazor from comment #89) > If this help, I get this crash only when playing Overwatch while Firefox is > running in the background, playing YouTube. > > Started crashing and crash only when playing Overwatch. > > GTX650 i3-3rd gen, Win10x64 > > My crash > https://crash-stats.mozilla.com/report/index/d83b248c-769d-4e8f-a1b9- > d02992170115 Latest Firefox 50.1.0 and I have Win10 Anniversary update
I'm going to put the reduced crash rate in 51 to bug 1292032. It reduces the number of texture allocations dramatically by recycling textures. Matt - would you concur?
Comment 92•6 years ago
|
||
(In reply to Anthony Jones (:kentuckyfriedtakahe, :k17e) from comment #91) > I'm going to put the reduced crash rate in 51 to bug 1292032. It reduces the > number of texture allocations dramatically by recycling textures. > > Matt - would you concur? Good catch! Yeah, that seems very likely.
Updated•6 years ago
|
Comment 93•6 years ago
|
||
Since we've found a fix that might reduce the crash volume, maybe we shouldn't blacklist the drivers at this point, just wait and see if the crash number would decrease in next release. Also we can consider uplifting the patch in Bug 1292032 to firefox 50.
Comment 94•6 years ago
|
||
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #93) > Since we've found a fix that might reduce the crash volume, maybe we > shouldn't blacklist the drivers at this point, just wait and see if the > crash number would decrease in next release. You're probably right, given how much it has reduced on Beta: https://crash-stats.mozilla.com/signature/?release_channel=beta&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-07-17T10%3A24%3A54.000Z&date=%3C2017-01-17T10%3A24%3A54.000Z#graphs > Also we can consider uplifting the patch in Bug 1292032 to firefox 50. We're going to release 51 next week, so there's no need.
Comment 95•6 years ago
|
||
(In reply to rhazor from comment #89) > If this help, I get this crash only when playing Overwatch while Firefox is > running in the background, playing YouTube. > > Started crashing and crash only when playing Overwatch. > > GTX650 i3-3rd gen, Win10x64 > > My crash > https://crash-stats.mozilla.com/report/index/d83b248c-769d-4e8f-a1b9- > d02992170115 NVIDIA is going to release a driver update that should fix your crash. In the meantime, you could manually disable DXVA via a pref (I don't remember which one).
Comment 96•6 years ago
|
||
(In reply to Anthony Jones (:kentuckyfriedtakahe, :k17e) from comment #91) > I'm going to put the reduced crash rate in 51 to bug 1292032. It reduces the > number of texture allocations dramatically by recycling textures. > > Matt - would you concur? Sorry, I didn't actually check the bug. Is that the right number? That bug only renamed some functions, it shouldn't have had any functional changes.
(In reply to Matt Woodrow (:mattwoodrow) from comment #96) > That bug only renamed some functions, it shouldn't have had any functional > changes. Did the texture recycling also land in 51?
Comment 98•6 years ago
|
||
No, I think it was earlier than that. List of fixed graphics bug in 51: https://bugzilla.mozilla.org/buglist.cgi?f1=cf_status_firefox51&list_id=13411493&o1=equals&resolution=---&resolution=FIXED&resolution=INVALID&resolution=WONTFIX&resolution=DUPLICATE&resolution=WORKSFORME&resolution=INCOMPLETE&resolution=SUPPORT&resolution=EXPIRED&resolution=MOVED&o2=notequals&query_format=advanced&f2=cf_status_firefox50&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&v1=fixed&component=Graphics&component=Graphics%3A%20Layers&v2=fixed&product=Core Bug 1289640 seems like the most likely.
Comment 99•6 years ago
|
||
We still got a crash with latest Nvidia driver (21.21.13.7662), I will contact Nvidia to double check if this version contains the fix. [1] https://crash-stats.mozilla.com/report/index/3704d15b-06fa-43c7-9a62-eacda2170125#tab-metadata
Updated•6 years ago
|
Comment 100•6 years ago
|
||
Just found a few crashes with the build after 20170130065342. Kevin, do you know the reason?
Comment 101•6 years ago
|
||
The decrease of crash number might be caused by the release of firefox 51. I've found that the crash rate is significantly decreased in firefox 51 ,according to comment 98, Bug 1289640 might be the reason. Currently, we only have few crashes with crash address 0x14 and 0x48 in 51[1] which was the main crash address. We should keep monitoring it. The remaining crashes might caused by other reason. [1] https://crash-stats.mozilla.com/signature/?product=Firefox&version=51.0&version=51.0.1&address=0x14&address=0x48&address=0x1c&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2017-01-26T07%3A29%3A00.000Z&date=%3C2017-02-02T07%3A29%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#aggregations
Updated•6 years ago
|
Did the signature here just shift? Is this the same issue as bug 1322554 or something different? It is certainly very low volume for this dll now on non-release channels.
Comment 103•6 years ago
|
||
From my perspective, this bug is different from bug 1322554 which is a startup bug and mostly happened in Windows 7 platform. There are still some crashes (50 a day in release channel) in these days, I will keep monitoring it.
Updated•6 years ago
|
See conversation in bug 1388437 where a new version of cubeb in 55 spiked this.
Updated•6 years ago
|
Updated•6 years ago
|
Comment 105•5 years ago
|
||
https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Release_Cycle_Queries_and_Activities Move optional bugs to next release
Comment 108•4 years ago
|
||
NVIDIA closed bug 1828302 on 02/27/2017.
The customer close message is:
From a sample of the crash reports with driver 378.49, it appears we had a trim thread that just evict and remove a block from the residency list and freed it while another thread doing destroyResource (from Async server thread) trying to unlink the same block that is already unlinked and freed.
NVIDIA QA verified the bug is fixed in 377.11.
(In reply to Jerry Shih[:jerry] (UTC+8) (inactive) from comment #31)
(In reply to Mark Kilgard from comment #29)
NVIDIA is tracking this as NVIDIA bug # 1828302
Is it possible to show the bug link here?
Unfortunately, no. NVIDIA bugs are accessible through an NVOnline database that requires a login. Mozilla engineers would need individual accounts.
![]() |
||
Updated•2 years ago
|
Comment 109•6 months ago
|
||
Since the crash volume is low (less than 5 per week), the severity is downgraded to S3
. Feel free to change it back if you think the bug is still critical.
For more information, please visit auto_nag documentation.
Comment 110•6 months ago
|
||
The severity field for this bug is relatively low, S3. However, the bug has 6 See Also bugs.
:bhood, could you consider increasing the bug severity?
For more information, please visit auto_nag documentation.
Updated•6 months ago
|
Description
•