Open Bug 1292273 Opened 7 years ago Updated 6 months ago

Crash in nvwgf2um.dll | BaseThreadInitThunk

Categories

(Core :: Graphics, defect, P3)

x86
Windows
defect

Tracking

()

REOPENED
Tracking Status
platform-rel --- -
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 --- wontfix
firefox50 + wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- affected
firefox53 --- wontfix
firefox54 --- fix-optional
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- fix-optional
firefox59 --- ?

People

(Reporter: u279076, Unassigned)

References

Details

(Keywords: crash, regression, Whiteboard: [gfx-noted][platform-rel-nVidia])

Crash Data

Attachments

(5 files, 7 obsolete files)

43.32 KB, image/png
Details
1.80 KB, patch
kechen
: review+
Details | Diff | Splinter Review
2.34 KB, patch
kechen
: review+
Details | Diff | Splinter Review
2.28 KB, patch
Details | Diff | Splinter Review
3.07 KB, patch
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is 
report bp-c27d44ac-645d-48d4-8e4f-263892160804.
=============================================================
Ø 0 	nvwgf2um.dll 	nvwgf2um.dll@0x8078a3 	
Ø 1 	nvwgf2um.dll 	nvwgf2um.dll@0x87d1af 	
Ø 2 	nvwgf2um.dll 	nvwgf2um.dll@0x8ad8cd 	
Ø 3 	nvwgf2um.dll 	nvwgf2um.dll@0x8ac8e9 	
Ø 4 	nvwgf2um.dll 	nvwgf2um.dll@0x80fcd3 	
Ø 5 	nvwgf2um.dll 	nvwgf2um.dll@0x801378 	
Ø 6 	nvwgf2um.dll 	nvwgf2um.dll@0xa2dfb 	
Ø 7 	nvwgf2um.dll 	nvwgf2um.dll@0x92f64 	
Ø 8 	nvwgf2um.dll 	nvwgf2um.dll@0x1ad494 	
Ø 9 	nvwgf2um.dll 	nvwgf2um.dll@0x90d52e 	
Ø 10 	nvwgf2um.dll 	nvwgf2um.dll@0x90d656 	
11 	kernel32.dll 	BaseThreadInitThunk 	
12 	ntdll.dll 	__RtlUserThreadStart 	
13 	ntdll.dll 	_RtlUserThreadStart 	
=============================================================
More reports: https://crash-stats.mozilla.com/signature/?product=Firefox&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk

There are no strong correlations to a particular device/driver.
~40% of crashes with this signature are occurring with 9.18.13.4192 vs ~5% of all crashes with AMD cards. I don't know if this info is useful.
Crash volume for signature 'nvwgf2um.dll | BaseThreadInitThunk':
 - nightly (version 51): 4 crashes from 2016-08-01.
 - aurora  (version 50): 100 crashes from 2016-08-01.
 - beta    (version 49): 728 crashes from 2016-08-02.
 - release (version 48): 1972 crashes from 2016-07-25.
 - esr     (version 45): 235 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       1       2       0
 - aurora       36      29      14
 - beta        270     221      99
 - release     683     541     340
 - esr          10      26      27

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly           #307
 - aurora  #74       #24
 - beta    #53       #184      #1320
 - release #27       #45       #2322
 - esr     #512
This crash continues to rise in Release and is now at #17 @ 0.56% with 1367 crashes.
While this didn't originally show a strong correlation it seems like there is a strong correlation forming. Perhaps unsurprisingly given the signature, 98.38% of crashes in the last week were on NVidia devices. 

There is a broad swath of GPUs, 30% are on Tesla, 17% are on Kepler, 12% are on Maxwell, 12% are on Fermi. The most common device is the Geforce 210 @ 9.12%.

There is also a broad swatch of drivers with 39% on 341.*, 15% on 368.*, 10% on 353.*, and 9% on 372.*. The most common driver is 341.92 at 26.86%.

It's also worth noting that the spike seems to be driven at least in part by the Geforce 210 on driver 341.92 (see graphs in attached screenshot). The live version can be generated by pasting the signature into http://ashughes1.github.io/metrics-graphics-gfx/.
(In reply to Marco Castelluccio [:marco] from comment #1)
> ~40% of crashes with this signature are occurring with 9.18.13.4192 vs ~5%
> of all crashes with AMD cards. I don't know if this info is useful.

Obviously here I meant NVIDIA :)

Also interesting:
(69.36% in signature vs 26.39% overall) platform_pretty_version = Windows 10
(31.33% in signature vs 01.16% overall) adapter_driver_version = 9.18.13.4192
(In reply to Anthony Hughes (:ashughes) [GFX][QA][Mentor] from comment #4)
> ... Perhaps unsurprisingly given the signature,
> 98.38% of crashes in the last week were on NVidia devices. 

Yes, these are Nvidia driver DLLs, I'm surprised it isn't 100% Nvidia.
(In reply to Milan Sreckovic [:milan] from comment #6)
> Yes, these are Nvidia driver DLLs, I'm surprised it isn't 100% Nvidia.

The other 1.62% are Intel which are likely dual GPU systems where we detect Intel as the primary GPU.
Crash volume for signature 'nvwgf2um.dll | BaseThreadInitThunk':
 - nightly (version 52): 33 crashes from 2016-09-19.
 - aurora  (version 51): 11 crashes from 2016-09-19.
 - beta    (version 50): 1057 crashes from 2016-09-20.
 - release (version 49): 2340 crashes from 2016-09-05.
 - esr     (version 45): 310 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly      15      18
 - aurora       11       0
 - beta        842     215
 - release    1879     460
 - esr           5      12

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly           #72
 - aurora  #447      #132      #362
 - beta    #14       #20
 - release #25       #21       #616
 - esr     #1422
Priority: -- → P1
[Tracking Requested - why for this release]:
this crash signature is quadrupling in 50.0b builds compared to 49.0b (from ~40 to ~160 daily crashes).
I'm not sure how actionable this is.  The stacks look like this:

Frame 	Module 	Signature 	Source
Ø 0 	nvwgf2um.dll 	nvwgf2um.dll@0xaedb35 	
Ø 1 	nvwgf2um.dll 	nvwgf2um.dll@0xaed4d8 	
Ø 2 	nvwgf2um.dll 	nvwgf2um.dll@0xaecc99 	
Ø 3 	nvwgf2um.dll 	nvwgf2um.dll@0xaea74c 	
Ø 4 	nvwgf2um.dll 	nvwgf2um.dll@0xae0994 	
Ø 5 	nvwgf2um.dll 	nvwgf2um.dll@0x157cb1 	
Ø 6 	nvwgf2um.dll 	nvwgf2um.dll@0x633f04 	
Ø 7 	nvwgf2um.dll 	nvwgf2um.dll@0x7e53f1 	
Ø 8 	nvwgf2um.dll 	nvwgf2um.dll@0x16b619 	
Ø 9 	nvwgf2um.dll 	nvwgf2um.dll@0x16b3d1 	
Ø 10 	nvwgf2um.dll 	nvwgf2um.dll@0x16354c 	
Ø 11 	nvwgf2um.dll 	nvwgf2um.dll@0x169818 	
Ø 12 	nvwgf2um.dll 	nvwgf2um.dll@0x16973c 	
Ø 13 	nvwgf2um.dll 	nvwgf2um.dll@0x6db3c4 	
Ø 14 	nvwgf2um.dll 	nvwgf2um.dll@0xbd919a 	
Ø 15 	nvwgf2um.dll 	nvwgf2um.dll@0xbd92c2 	
16 	kernel32.dll 	BaseThreadInitThunk 	
17 	ntdll.dll 	__RtlUserThreadStart 	
18 	ntdll.dll 	_RtlUserThreadStart

and now that we aggregate on the first function with symbols, we get BaseThreadInitThunk.  Which is like saying, look, this thread crashed somewhere, and we don't know where.  The top address is driver specific, so we had a number of these before - now that they're all together, all the driver versions show up under this bug.  For example, the driver version 9.18.13.4192 would have crashed at nvwgf2um.dll@0x8078a3, and the driver version 21.21.13.7290 at nvwgf2um.dll@0xaedb35.  Those were separate reports before, now they're together.

As in, I'm not convinced the actual crashes quadrupled.
As in, the volume goes up when we release 50 into beta?  Fair enough.  Still not sure they're actionable.
I listed the address range for this crash. It seems like we access the member of class/struct which was alraedy NULL for half of crash volume.


1 	0x14 	989 	27.79 %
2 	0x48 	686 	19.28 %
3 	0x0 	272 	7.64 %
I found two report with 0x0 crash address, and both of them are related to driver reset.

https://crash-stats.mozilla.com/report/index/c3fed968-9592-4666-9201-fbb2c2161006#tab-metadata
https://crash-stats.mozilla.com/report/index/5379fcf4-e64a-4a39-b4e2-210862161007#tab-metadata

[12][GFX1-]: GFX: D3D11 lock mutex abandoned (t=1253.11) |
[13][GFX1-]: [D3D11] TextureSourceD3D11:GetShaderResourceView CreateSRV failure 0x887a0005 (t=1253.11) [14][GFX1-]: D3D11 swap chain preset failed 0x887a0005 (t=1253.11) |
[15][GFX1-]: [CompositorD3D11] device removed with error code: 0x887a0005, 0x887a0006, 0 (t=1253.16)
From the crash reports in comment 14, I saw we had called swapchain->present several times even we know device was removed and then firefox crashed at nv d3d library.

I think we can do the same error handling in EndFrame() just like beginFrame()[1].

|[0][GFX1-]: D3D11 swap chain preset failed 0x887a0005 (t=265.555)
|[1][GFX1-]: [CompositorD3D11] device removed with error code: 0x887a0005, 0x887a0005, 0 (t=265.555)
|[2][GFX1-]: Detected rendering device reset on refresh: 7 (t=265.555)
|[3][GFX1-]: Detected rendering device reset on refresh: 7 (t=265.555)
|[4][GFX1-]: D3D11 swap chain preset failed 0x887a0005 (t=266.868)
|[5][GFX1-]: Out of sync D3D11 devices in HandleError, 0 (t=266.868)
|[6][GFX1-]: [CompositorD3D11] device removed with error code: 0x887a0005, 0x887a0020, 0 (t=266.868)
|[7][GFX1-]: GFX: D3D11 abandoned sync (t=277.546)
|[8][GFX1-]: GFX: D3D11 lock mutex abandoned (t=277.546)
|[9][GFX1-]: GFX: D3D11 lock mutex abandoned (t=277.546)
|[10][GFX1-]: D3D11 swap chain preset failed 0x887a0005 (t=287.4)
|[11][GFX1-]: [CompositorD3D11] device removed with error code: 0x887a0005, 0x887a0007, 0 (t=287.4) 


[1]http://searchfox.org/mozilla-central/source/gfx/layers/d3d11/CompositorD3D11.cpp#967
Jerry, please work with kevin to debug this issue.
Assignee: nobody → kechen
Hi Jerry,

I've added an early return condition in EndFrame according to comment 15, could you help me to check if I miss anything here ?

Thank you.
Attachment #8800572 - Flags: feedback?(hshih)
Comment on attachment 8800572 [details] [diff] [review]
Skip CompositorD3D11::EndFrame when device remove happened.

Review of attachment 8800572 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/CompositorD3D11.cpp
@@ +1069,5 @@
>  
>  void
>  CompositorD3D11::EndFrame()
>  {
> +  if (!mDefaultRT || mDevice->GetDeviceRemovedReason() != S_OK) {

I think we would better to log early return here when device got removed. And I think we should do more cleanup if we are going to early return in EndFrame().
Comment on attachment 8800572 [details] [diff] [review]
Skip CompositorD3D11::EndFrame when device remove happened.

Review of attachment 8800572 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/CompositorD3D11.cpp
@@ +1074,4 @@
>      Compositor::EndFrame();
>      return;
>    }
>  

put the checking at this line, then we can know the early return is just for device-removed.
and also add some gfxcriticalnote and clean the rendertarget.
  mCurrentRT = nullptr
Attachment #8800572 - Flags: feedback?(hshih) → feedback+
The CompositorD3D11::BeginFrame() should also have driver-removed log.
Flags: needinfo?(kechen)
See Also: → 1310600
Hi Jerry, I've modified the patch according to the comments, please help me to check it. Thank you.
Flags: needinfo?(kechen)
Attachment #8801968 - Flags: review?(hshih)
I added some logs to the place where we fail to recreate new compositors.
In current code, without really handling this situation, we just finish the device-removed handle process and wait for other codes to trigger the next device-removed handle process.
I think this patch can help us to figure out if the device-removed crash in this signature is caused by the situation mentioned above.

And I found that I misunderstood the code flow in comment 22. The code flow[1] is okay. Please ignore the comment.


[1] https://dxr.mozilla.org/mozilla-central/rev/01ab78dd98805e150b0311cce2351d5b408f3001/gfx/thebes/DeviceManagerDx.cpp#318-326
Attachment #8800572 - Attachment is obsolete: true
Attachment #8802427 - Flags: feedback?(hshih)
Comment on attachment 8801968 [details] [diff] [review]
Skip CompositorD3D11::EndFrame when device remove happened_v2

Review of attachment 8801968 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/CompositorD3D11.cpp
@@ +974,5 @@
>    // Don't composite if we are minimised. Other than for the sake of efficency,
>    // this is important because resizing our buffers when mimised will fail and
>    // cause a crash when we're restored.
>    NS_ASSERTION(mHwnd, "Couldn't find an HWND when initialising?");
> +  if (::IsIconic(mHwnd)) {

I would prefer you add additional GetDeviceRemovedReason() here and call gfxCriticalNote. Or you can just dump handle and device status at the same time in one gfxCritialNote.

@@ +1085,5 @@
>  
> +  if (mDevice->GetDeviceRemovedReason() != S_OK) {
> +    gfxCriticalNote << "GFX: D3D11 skip EndFrame with device-removed.";
> +    Compositor::EndFrame();
> +    mCurrentRT = nullptr;

I think we don't need to take care of mCurrentRT because it will be changed by SetRenderTarget() in BeginFrame() call.

And again, you can merge this with line 1081 and output gfxCriticalNote with mDefaultRT and device status.
Comment on attachment 8802427 [details] [diff] [review]
Add logs to record the failure of compositor creation.

Review of attachment 8802427 [details] [diff] [review]:
-----------------------------------------------------------------

Could we dump failure reason in CompositorBridgeParent::NewCompositor[1]? Then it covers not only D3D11 but also D3D9 backend.


[1]https://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorBridgeParent.cpp#1762
Hi Jerry, I've modified the patch according to comment 25.
Can you help me to review it ?
Thank you.
Attachment #8801968 - Attachment is obsolete: true
Attachment #8801968 - Flags: review?(hshih)
Attachment #8803206 - Flags: review?(hshih)
Hi Jerry, I've modified the patch according to comment 26.
Can you help me to review it?
Thank you.
Attachment #8802427 - Attachment is obsolete: true
Attachment #8802427 - Flags: feedback?(hshih)
Attachment #8803207 - Flags: feedback?(hshih)
NVIDIA is tracking this as NVIDIA bug # 1828302
Comment on attachment 8803206 [details] [diff] [review]
Skip CompositorD3D11::EndFrame when device remove happened_v3

Review of attachment 8803206 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/CompositorD3D11.cpp
@@ +981,5 @@
>      // read-unlock our textures to prevent them from accumulating.
>      ReadUnlockTextures();
>      *aRenderBoundsOut = IntRect();
> +
> +    gfxCriticalNote << "GFX: D3D11 skip BeginFrame and the following actions \

I think this message should be out of "IsIconic()" if block. It's more possible to trigger window-minimizing. Then, we will have a lot of gfxCriticalNote message.

@@ +1081,3 @@
>      Compositor::EndFrame();
> +
> +    gfxCriticalNote << "GFX: D3D11 skip EndFrame without render target or with \

Make this message out of "mDefaultRT" if block.
Attachment #8803206 - Flags: review?(hshih) → review-
(In reply to Mark Kilgard from comment #29)
> NVIDIA is tracking this as NVIDIA bug # 1828302

Is it possible to show the bug link here?
Flags: needinfo?(mjk)
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #30)
> Comment on attachment 8803206 [details] [diff] [review]
> Skip CompositorD3D11::EndFrame when device remove happened_v3
> 
> Review of attachment 8803206 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/layers/d3d11/CompositorD3D11.cpp
> @@ +981,5 @@
> >      // read-unlock our textures to prevent them from accumulating.
> >      ReadUnlockTextures();
> >      *aRenderBoundsOut = IntRect();
> > +
> > +    gfxCriticalNote << "GFX: D3D11 skip BeginFrame and the following actions \
> 
> I think this message should be out of "IsIconic()" if block. It's more
> possible to trigger window-minimizing. Then, we will have a lot of
> gfxCriticalNote message.
> 
> @@ +1081,3 @@
> >      Compositor::EndFrame();
> > +
> > +    gfxCriticalNote << "GFX: D3D11 skip EndFrame without render target or with \
> 
> Make this message out of "mDefaultRT" if block.

Peter, my opinion is different from your comment 25. Please check comment 30.
Flags: needinfo?(howareyou322)
Attachment #8803207 - Flags: feedback?(hshih) → feedback+
platform-rel: --- → ?
Whiteboard: [gfx-noted] → [gfx-noted][platform-rel-nVidia]
We also have similar AMD signatures ("atidxx32.dll | BaseThreadInitThunk" and "atidxx64.dll | BaseThreadInitThunk"), but their volume is smaller.
Crash Signature: [@ nvwgf2um.dll | BaseThreadInitThunk] → [@ nvwgf2um.dll | BaseThreadInitThunk] [@ nvwgf2umx.dll | BaseThreadInitThunk]
Comment on attachment 8803206 [details] [diff] [review]
Skip CompositorD3D11::EndFrame when device remove happened_v3

Review of attachment 8803206 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/CompositorD3D11.cpp
@@ +981,5 @@
>      // read-unlock our textures to prevent them from accumulating.
>      ReadUnlockTextures();
>      *aRenderBoundsOut = IntRect();
> +
> +    gfxCriticalNote << "GFX: D3D11 skip BeginFrame and the following actions \

Yes, you are right. To keep this simple, I'm fine to move this log out of this loop.

@@ +1081,3 @@
>      Compositor::EndFrame();
> +
> +    gfxCriticalNote << "GFX: D3D11 skip EndFrame without render target or with \

After second thought, I also agree to move this out to keep the logic simple.
Flags: needinfo?(howareyou322)
(79.84% in signature vs 05.45% overall) "DXVA2D3D11+" in app_notes = true
(80.58% in signature vs 06.80% overall) "DXVA2D3D11?" in app_notes = true
(80.97% in signature vs 23.83% overall) platform_pretty_version = Windows 10

At least a subset of the crashes looks related to DXVA2D3D11.

Have we changed anything with regards to this in 48 or 49? The volume for both this signature and the signature from bug 1299520 increased in 48/49 (they became more frequent in 48 and probably even more frequent in 49).

The crash in bug 1299520 was probably fixed by an update to the AMD driver, but one of the reporters is saying that, even after the crash fix, he's having smoothness issues.
See Also: → 1299520
IIRC, we've enabled DXVA2D3D11 in 48. Perhaps we need to increase the driver versions we blocklist, given this bug and bug 1299520.
Flags: needinfo?(gsquelart)
I'm not sure where the regression in 50 comes from.

Comparing the data for this signature between 49 release and 50 beta is interesting:
Release
(80.97% in signature vs 23.83% overall) platform_pretty_version = Windows 10
(71.15% in signature vs 17.70% overall) platform_version = 10.0.14393
(79.84% in signature vs 05.45% overall) "DXVA2D3D11+" in app_notes = true
(77.64% in signature vs 49.81% overall) "D3D11 Layers+" in app_notes = true
(97.63% in signature vs 48.63% overall) "D2D1.1+" in app_notes = true
(07.23% in signature vs 19.65% overall) "DXVA2D3D9+" in app_notes = true

Beta
(61.96% in signature vs 40.22% overall) platform_version = 6.1.7601 Service Pack 1
(24.50% in signature vs 05.98% overall) "DXVA2D3D11+" in app_notes = true
(88.04% in signature vs 49.04% overall) "D3D11 Layers+" in app_notes = true
(73.34% in signature vs 36.12% overall) "D2D1.1+" in app_notes = true
(30.47% in signature vs 26.97% overall) "DXVA2D3D9+" in app_notes = true
(33.29% in signature vs 04.45% overall) GFX_ERROR "DXVA2D3D9 video decoding is disabled due to a previous crash." = true

So looks like this crash is more frequent on Release with Windows 10 Anniversary Edition and DXVA2D3D11. It's more frequent on Beta with Windows 7 SP1 and both DXVA2D3D11 and DXVA2D3D9.
Attachment #8803207 - Flags: review?(dvander)
Comment on attachment 8801968 [details] [diff] [review]
Skip CompositorD3D11::EndFrame when device remove happened_v2

Hi David,
This patch try to show some informations and skip the composition call for driver-remove in "BeginFrame" and "EndFrame".
Attachment #8801968 - Attachment is obsolete: false
Attachment #8801968 - Flags: review?(dvander)
Comment on attachment 8803206 [details] [diff] [review]
Skip CompositorD3D11::EndFrame when device remove happened_v3

Use version 2(attachment 8801968 [details] [diff] [review]) is enough.
Attachment #8803206 - Attachment is obsolete: true
platform-rel: ? → +
Comment on attachment 8801968 [details] [diff] [review]
Skip CompositorD3D11::EndFrame when device remove happened_v2

Review of attachment 8801968 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/CompositorD3D11.cpp
@@ +983,5 @@
>      return;
>    }
>  
> +  if (mDevice->GetDeviceRemovedReason() != S_OK) {
> +    gfxCriticalNote << "GFX: D3D11 skip BeginFrame and the following actions \

nit: This reads a little better without "and the following actions".
Attachment #8801968 - Flags: review?(dvander) → review+
Comment on attachment 8803207 [details] [diff] [review]
Add logs to record the failure of compositor creation_v2

Review of attachment 8803207 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorBridgeParent.cpp
@@ +2390,1 @@
>      return Nothing();

This should never happen because the BasicCompositor is always a (hopefully infallible) fallback. It'd be better to MOZ_RELEASE_ASSERT here if anything.
Attachment #8803207 - Flags: review?(dvander) → review+
(In reply to Marco Castelluccio [:marco] from comment #36)
> IIRC, we've enabled DXVA2D3D11 in 48. Perhaps we need to increase the driver
> versions we blocklist, given this bug and bug 1299520.
Looks like Jerry&Kevin are working on this, so I'll step back into the shadows. Please ping me if I'm needed again.
Flags: needinfo?(gsquelart)
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/34fd4c7639ec
Skip CompositorD3D11::EndFrame when device-removed happened and add some logs for tracking the behavior. r=dvander
https://hg.mozilla.org/integration/mozilla-inbound/rev/0c0b78fc1a6d
Add logs to record the failure of compositor creation. r=dvander
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/34fd4c7639ec
https://hg.mozilla.org/mozilla-central/rev/0c0b78fc1a6d
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Do you think we can/should uplift this to beta?
Flags: needinfo?(kechen)
Approval Request Comment
[Feature/regressing bug #]: crash
[User impact if declined]: May cause the crash.
[Describe test coverage new/current, TreeHerder]: No test.
[Risks and why]:
  I think the risk is low, there's already another early return path and we just add another early return condition here.
[String/UUID change made/needed]: None
Flags: needinfo?(kechen)
Attachment #8803751 - Flags: approval-mozilla-beta?
Approval Request Comment
[Feature/regressing bug #]: Crash
[User impact if declined]: Make us more difficult to trace the crash.
[Describe test coverage new/current, TreeHerder]: No test
[Risks and why]:
Low risk, just add several logs for tracing this crash more easier.
[String/UUID change made/needed]: No
Attachment #8803752 - Flags: approval-mozilla-beta?
Comment on attachment 8803673 [details] [diff] [review]
Skip CompositorD3D11::EndFrame when device remove happened_v2.

Approval Request Comment
[Feature/regressing bug #]: crash
[User impact if declined]: May cause the crash.
[Describe test coverage new/current, TreeHerder]: No test.
[Risks and why]:
  Low risk, there's already another early return path and we just add another early return condition here.
[String/UUID change made/needed]: None
Attachment #8803673 - Flags: approval-mozilla-aurora?
Comment on attachment 8803674 [details] [diff] [review]
Add logs to record the failure of compositor creation_v2

Approval Request Comment
[Feature/regressing bug #]: Crash
[User impact if declined]: Make us more difficult to trace the crash.
[Describe test coverage new/current, TreeHerder]: No test
[Risks and why]:
Low risk, just add several logs for tracing this crash more easier.
[String/UUID change made/needed]: No
Attachment #8803674 - Flags: approval-mozilla-aurora?
Comment on attachment 8803752 [details] [diff] [review]
[beta-uplift] Add logs to record the failure of compositor creation.

Patch looks simple enough and low risk, the crash volume of the first signature is pretty high on 50, let's uplift.
Attachment #8803752 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8803751 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Hi Milan, this fix seemed like a low risk candidate for beta uplift. Please let me know if you think otherwise. Thanks!
Flags: needinfo?(milan)
I'm hitting conflicts applying this to beta.
Flags: needinfo?(kechen)
I'm concerned this bug is due to an NVIDIA GPU reset.

The suggested to Skip CompositorD3D11::EndFrame when device remove happened and falling back to an infallible compositor may be masking/hiding a more fundamental NVIDIA driver bug Firefox is tickling.

I'm worried that what is getting reported as a "device removal" is not actually that but a GPU reset and the only way that unexpected not-supposed-to-happen fault can be reported is as a device removal, but no device is actually being removed.

I'm still looking for an appropriate NVIDIA engineer to triage this issue appropriately.

Having the additional logging is certainly a good idea.
Matt, this code got changed enough that I wouldn't mind an explicit confirmation as to what we want it to look like.  For example, do we want to call the base class' EndFrame() before or after EnsureSize(), or does it matter?
Flags: needinfo?(milan) → needinfo?(matt.woodrow)
Some users are reporting things like:
> Crash happened (again) when updating Nvidia Graphics Drivers to their latest version using the GeForce Experience program. This has happened (and I've reported it) before in previous version of both firefox and nvidia drivers.

> I was updating my graphics drivers during the crash

> Crashed during installation of Nvidia graphics driver

> Was installing a graphics driver, when the screen "refreshed" firefox crashed.

> Nvidia graphics driver update just completed

> Was browsing reddit while my nvidia graphics card updated.

> Probably due to updating video drivers?

> videocard or drive from videocard?

And there are some reports with URLs that are NVIDIA driver download pages.

This might explain a subset of the crashes, I find it hard to believe that all these people are updating the drivers so often.
(In reply to Milan Sreckovic [:milan] from comment #56)
> Matt, this code got changed enough that I wouldn't mind an explicit
> confirmation as to what we want it to look like.  For example, do we want to
> call the base class' EndFrame() before or after EnsureSize(), or does it
> matter?

It shouldn't matter.
Flags: needinfo?(matt.woodrow)
(In reply to Mark Kilgard from comment #55)
> I'm concerned this bug is due to an NVIDIA GPU reset.
> 
> The suggested to Skip CompositorD3D11::EndFrame when device remove happened
> and falling back to an infallible compositor may be masking/hiding a more
> fundamental NVIDIA driver bug Firefox is tickling.
> 
> I'm worried that what is getting reported as a "device removal" is not
> actually that but a GPU reset and the only way that unexpected
> not-supposed-to-happen fault can be reported is as a device removal, but no
> device is actually being removed.
> 
> I'm still looking for an appropriate NVIDIA engineer to triage this issue
> appropriately.
> 
> Having the additional logging is certainly a good idea.

This is almost certainly the case in some (possibly most) of these crash reports.

It appears that D3D returns DXGI_ERROR_DEVICE_REMOVED from most API calls when we're in this state, and then we can query GetDeviceRemovedReason() for the 'actual' problem.

This shows up in the metadata tab of crash reports as a message like this (from comment 14):

[15][GFX1-]: [CompositorD3D11] device removed with error code: 0x887a0005, 0x887a0006

0x887a0005 is DXGI_ERROR_DEVICE_REMOVED (the initial failure) and 0x887a0006 is DXGI_ERROR_DEVICE_HUNG (the actual problem).
~10% of the crashes with this signature have the string "[CompositorD3D11] device removed with error code" in graphics_critical_error.
Approval Request Comment
[Feature/regressing bug #]: crash
[User impact if declined]: May cause the crash.
[Describe test coverage new/current, TreeHerder]: No test.
[Risks and why]:
  I think the risk is low, there's already another early return path and we just add another early return condition here.
[String/UUID change made/needed]: None

Hello ritu, the former patch hit conflict and I've modified the patch, please help me to approve this request again, thank you.
Attachment #8803751 - Attachment is obsolete: true
Attachment #8803752 - Attachment is obsolete: true
Flags: needinfo?(kechen) → needinfo?(rkothari)
Attachment #8804525 - Flags: approval-mozilla-beta?
Approval Request Comment
[Feature/regressing bug #]: Crash
[User impact if declined]: Make us more difficult to trace the crash.
[Describe test coverage new/current, TreeHerder]: No test
[Risks and why]:
Low risk, just add several logs for tracing this crash more easier.
[String/UUID change made/needed]: No

Hello ritu, the former patch hit conflict and I've modified the patch, please help me to approve this request again, thank you.
Attachment #8804527 - Flags: approval-mozilla-beta?
Comment on attachment 8804527 [details] [diff] [review]
[beta-uplift] Add logs to record the failure of compositor creation.

Sure! Beta50+
Flags: needinfo?(rkothari)
Attachment #8804527 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8804525 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 8803673 [details] [diff] [review]
Skip CompositorD3D11::EndFrame when device remove happened_v2.

Fix a crash. Take it in 51 aurora.
Attachment #8803673 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8803674 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Ryan VanderMeulen [:RyanVM] from comment #65)
> https://hg.mozilla.org/releases/mozilla-beta/rev/5ebcf218f678

Had to back this rev out for browser_Troubleshoot.js failures. Kevin says the other patch is fine to leave in while he figures out why the test failed on Beta, so we'll hopefully still get the stability wins in time for the next Beta.
https://hg.mozilla.org/releases/mozilla-beta/rev/90c412c9029128b2c60f53c7c013239b2b9b75d5

https://treeherder.mozilla.org/logviewer.html#?job_id=1879329&repo=mozilla-beta
Flags: needinfo?(kechen)
Patch[1] is just for logging and patch[2] is the one which really changes the code flow. I will monitor the crash report in these days to see if we really fix the crash with patch [2].

[1] https://hg.mozilla.org/releases/mozilla-beta/rev/5ebcf218f678
[2] https://hg.mozilla.org/releases/mozilla-beta/rev/f43ffbca8d4d
Flags: needinfo?(kechen)
https://crash-stats.mozilla.com/report/index/e5b99f76-2f65-4344-acfd-5680c2161031#tab-metadata

In the FF52, Crash still happened even we skip beginFrame after device was removed.
[12][GFX1-]: GFX: D3D11 skip BeginFrame with device-removed. (t=250652)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
For those reports which crash at address 0x22[1] have the following characteristics in common:

1. Operating system : Windows 10 / OS Version : 10.0.14936
2. Driver version : 21.21.13.XXXX.
3. Destructing DrawTargetD2D1 in thread 0 when the process crash.
4. Reset driver for more than one times and in a little period.

The destruction of DrawTargetD2D1 might be relevant to the crash. Skip it when device rest is detected is a solution we can try; however, the DrawTargetD2D1 is owned by TextureData, when the refcount reaches to 0 it will be automatically removed, therefore, it's difficult for us to skip this part.


[1] https://crash-stats.mozilla.com/signature/?address=%3D0x22&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-10-27T03%3A26%3A00.000Z&date=%3C2016-11-03T03%3A26%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#reports
Blake, could you have someone to comment Kevin's finding in comment 73?
Flags: needinfo?(bwu)
Matt should be the best right person to comment this.
Flags: needinfo?(bwu) → needinfo?(matt.woodrow)
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #73)
> For those reports which crash at address 0x14, 0x48[1] almost happen after
> firefox 50, something must be fixed in firefox 51.

Almost none happen after firefox 50? Or only happen after firefox 50?

Is the same change noticeable on beta/aurora/nightly? Can we try get a list of changesets for when it changed.

> 
> Lot of the crashes have a thread executing D3D11DXVA2Manager::CopyToImage,
> this might be relevant.

That function spends a lot of time waiting for video frames to finish decoding. If the user is currently playing video, then I'd expect it to be on the stack a lot of the time, so it could be unrelated.
Flags: needinfo?(matt.woodrow)
(In reply to Matt Woodrow (:mattwoodrow) from comment #76)
> (In reply to Kevin Chen[:kechen] (UTC + 8) from comment #73)
> > For those reports which crash at address 0x14, 0x48[1] almost happen after
> > firefox 50, something must be fixed in firefox 51.
> 
> Almost none happen after firefox 50? Or only happen after firefox 50?
> 
Sorry it should be before firefox 51.

> Is the same change noticeable on beta/aurora/nightly? Can we try get a list
> of changesets for when it changed.
> 
Overall, the crash number started to rise around Aug 2nd[1] and it was Firefox 48 in release channel; however I think it too hard to get a list of changesets, because the crash reports with version like 48.0bx are not accessible right now.

[1] https://crash-stats.mozilla.com/signature/?signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-05-24T06%3A40%3A30.000Z&date=%3C2016-11-24T06%3A40%3A30.000Z#graphs

> > 
> > Lot of the crashes have a thread executing D3D11DXVA2Manager::CopyToImage,
> > this might be relevant.
> 
> That function spends a lot of time waiting for video frames to finish
> decoding. If the user is currently playing video, then I'd expect it to be
> on the stack a lot of the time, so it could be unrelated.
Currently, the crash rate is still high in release but not in other channels[1].

Like comment 35 said, some correlations show that a subset of crash is related to DXVA2D3D11 and there is a relatively high crash rate in some particular adapter_driver_version :

(90.80% in signature vs 07.18% overall) "DXVA2D3D11+" in app_notes = true
(30.06% in signature vs 01.46% overall) adapter_driver_version = 21.21.13.6909

However this correlation only happens in release channel and is only relate to 30% of the crash, I am not sure if it's profitable to put it in blacklist like bug 1299520.


[1] https://crash-stats.mozilla.com/signature/?product=Firefox&version=50.0&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-10-01T02%3A04%3A00.000Z&date=%3C2016-11-22T02%3A04%3A14.000Z#aggregations
Flags: needinfo?(gsquelart)
See Also: → 1320054
Do we have more information from these additional patches?
platform-rel: + → -
Flags: needinfo?(kechen)
Keywords: regression
I was suspecting that the device reset process was not completed; however, there are only a few records show logs from attachment 8803674 [details] [diff] [review] which might represent that the error is not caused by the failure when initializing devices.

In attachment 8803673 [details] [diff] [review], we skip swap chain by making an early return when device need to be reset but it does not reduce the crash rate. It might be not the root cause.

Currently, I am trying to investigate the problem I mentioned in comment 78.
Flags: needinfo?(kechen)
hi kevin, for the spike in 50 beta see bug https://bugzilla.mozilla.org/show_bug.cgi?id=1310600#c14 - apparently it was the fix from bug 1308418 that made the crash volume recline again.
Flags: needinfo?(matt.woodrow)
Or the spike moved to bug 1322554?
See Also: → 1322554
We are currently working with Nvidia to see if there are more clues we can get.

And for this signature, the number of the most frequently happened crash pattern (crash address is 0x14 or 0x48) has decreased in beta channel. Maybe there are some changes fix the problem, I will keep tracking this.
(In reply to Milan Sreckovic [:milan] from comment #83)
> Or the spike moved to bug 1322554?

Sorry for the late reply.
I am not sure if this crash is related, but the crash pattern is not very alike to me.
According to the response from Nvidia engineer, the root cause of this crash might be the race condition of deleting video driver resources and this is only happened in Windows 10.

The thing is that when the volume of allocated video resources go above the OS's limitation, the OS will ask driver to trim the resources and if we happen to destruct the same video driver resources in the same time, there might be a chance that we access a dangling pointer and cause the crash.

This inference fits to the feedbacks from users, most of them was playing video games, which needs massive video resources, when this crash happened.

Nvdia will update their active drivers to prevent this problem and for the drivers with older version, maybe we should send user a notice to update their driver or blacklist all old versions on Windows 10 ?
Flags: needinfo?(gsquelart) → needinfo?(matt.woodrow)
Flags: needinfo?(gsquelart)
Flags: needinfo?(ajones)
That sounds like a great outcome!

Can we find out what driver version it's going to be fixed in, so that we can set our blacklist to be before that version.

We don't currently have an UI for asking users to update their drivers, so blacklist is our best choice.

Do we know which drivers/cards are affected? Is it the entirety of NVIDIA on Win10? That is a lot of users that won't get hardware accelerated video decoding.

When we have the GPU process enabled (currently enabled on Nightly 53, but should ride the trains from here), maybe we can avoid blacklisting since the penalty of crashing is much lower (GPU process restarts silently, user won't notice).
Flags: needinfo?(matt.woodrow)
See Also: → 1328082
According to Nvidia, they will fix this problem with active version 376.62, 378.42 and the later.

Also, they said that this is a rare case, the leak of resources from app would rise the possibility of this crash and since the fact that the crash number has decreased in the beta 51, is it possible that we fixed some critical leaks between 50 and 51 and lower the crash number ?

And maybe we should wait for couple days and see if this crash still happen with new version drivers before we blacklist the old drivers.
If this help, I get this crash only when playing Overwatch while Firefox is running in the background, playing YouTube.

Started crashing and crash only when playing Overwatch.

GTX650 i3-3rd gen, Win10x64

My crash https://crash-stats.mozilla.com/report/index/d83b248c-769d-4e8f-a1b9-d02992170115
(In reply to rhazor from comment #89)
> If this help, I get this crash only when playing Overwatch while Firefox is
> running in the background, playing YouTube.
> 
> Started crashing and crash only when playing Overwatch.
> 
> GTX650 i3-3rd gen, Win10x64
> 
> My crash
> https://crash-stats.mozilla.com/report/index/d83b248c-769d-4e8f-a1b9-
> d02992170115

Latest Firefox 50.1.0 and I have Win10 Anniversary update
I'm going to put the reduced crash rate in 51 to bug 1292032. It reduces the number of texture allocations dramatically by recycling textures.

Matt - would you concur?
Flags: needinfo?(ajones) → needinfo?(matt.woodrow)
(In reply to Anthony Jones (:kentuckyfriedtakahe, :k17e) from comment #91)
> I'm going to put the reduced crash rate in 51 to bug 1292032. It reduces the
> number of texture allocations dramatically by recycling textures.
> 
> Matt - would you concur?

Good catch! Yeah, that seems very likely.
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(gsquelart)
Since we've found a fix that might reduce the crash volume, maybe we shouldn't blacklist the drivers at this point, just wait and see if the crash number would decrease in next release.
Also we can consider uplifting the patch in Bug 1292032 to firefox 50.
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #93)
> Since we've found a fix that might reduce the crash volume, maybe we
> shouldn't blacklist the drivers at this point, just wait and see if the
> crash number would decrease in next release.

You're probably right, given how much it has reduced on Beta:
https://crash-stats.mozilla.com/signature/?release_channel=beta&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2016-07-17T10%3A24%3A54.000Z&date=%3C2017-01-17T10%3A24%3A54.000Z#graphs

> Also we can consider uplifting the patch in Bug 1292032 to firefox 50.

We're going to release 51 next week, so there's no need.
(In reply to rhazor from comment #89)
> If this help, I get this crash only when playing Overwatch while Firefox is
> running in the background, playing YouTube.
> 
> Started crashing and crash only when playing Overwatch.
> 
> GTX650 i3-3rd gen, Win10x64
> 
> My crash
> https://crash-stats.mozilla.com/report/index/d83b248c-769d-4e8f-a1b9-
> d02992170115

NVIDIA is going to release a driver update that should fix your crash.
In the meantime, you could manually disable DXVA via a pref (I don't remember which one).
(In reply to Anthony Jones (:kentuckyfriedtakahe, :k17e) from comment #91)
> I'm going to put the reduced crash rate in 51 to bug 1292032. It reduces the
> number of texture allocations dramatically by recycling textures.
> 
> Matt - would you concur?

Sorry, I didn't actually check the bug. Is that the right number?

That bug only renamed some functions, it shouldn't have had any functional changes.
Flags: needinfo?(ajones)
(In reply to Matt Woodrow (:mattwoodrow) from comment #96)
> That bug only renamed some functions, it shouldn't have had any functional
> changes.

Did the texture recycling also land in 51?
Flags: needinfo?(ajones) → needinfo?(matt.woodrow)
We still got a crash with latest Nvidia driver (21.21.13.7662), I will contact Nvidia to double check if this version contains the fix.

[1] https://crash-stats.mozilla.com/report/index/3704d15b-06fa-43c7-9a62-eacda2170125#tab-metadata
Just found a few crashes with the build after 20170130065342. Kevin, do you know the reason?
Flags: needinfo?(kechen)
The decrease of crash number might be caused by the release of firefox 51.
I've found that the crash rate is significantly decreased in firefox 51 ,according to comment 98, Bug 1289640 might be the reason.
Currently, we only have few crashes with crash address 0x14 and 0x48 in 51[1] which was the main crash address. We should keep monitoring it.
The remaining crashes might caused by other reason.

[1] https://crash-stats.mozilla.com/signature/?product=Firefox&version=51.0&version=51.0.1&address=0x14&address=0x48&address=0x1c&signature=nvwgf2um.dll%20%7C%20BaseThreadInitThunk&date=%3E%3D2017-01-26T07%3A29%3A00.000Z&date=%3C2017-02-02T07%3A29%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#aggregations
Flags: needinfo?(kechen)
Did the signature here just shift? Is this the same issue as bug 1322554 or something different?  It is certainly very low volume for this dll now on non-release channels.
From my perspective, this bug is different from bug 1322554 which is a startup bug and mostly happened in Windows 7 platform.
There are still some crashes (50 a day in release channel) in these days, I will keep monitoring it.
See Also: → 1388437
See conversation in bug 1388437 where a new version of cubeb in 55 spiked this.
I'm not actively working on this.
Assignee: kev155266 → nobody
Moving to p3 because no activity for at least 24 weeks.
Priority: P1 → P3

NVIDIA closed bug 1828302 on 02/27/2017.

The customer close message is:

From a sample of the crash reports with driver 378.49, it appears we had a trim thread that just evict and remove a block from the residency list and freed it while another thread doing destroyResource (from Async server thread) trying to unlink the same block that is already unlinked and freed.

NVIDIA QA verified the bug is fixed in 377.11.

(In reply to Jerry Shih[:jerry] (UTC+8) (inactive) from comment #31)

(In reply to Mark Kilgard from comment #29)

NVIDIA is tracking this as NVIDIA bug # 1828302

Is it possible to show the bug link here?

Unfortunately, no. NVIDIA bugs are accessible through an NVOnline database that requires a login. Mozilla engineers would need individual accounts.

Flags: needinfo?(mjk)
Crash Signature: [@ nvwgf2um.dll | BaseThreadInitThunk] [@ nvwgf2umx.dll | BaseThreadInitThunk] → [@ nvwgf2um.dll | BaseThreadInitThunk] [@ nvwgf2umx.dll | BaseThreadInitThunk] [@ <unknown in nvwgf2umx_cfg.dll> | BaseThreadInitThunk]
QA Whiteboard: qa-not-actionable

Since the crash volume is low (less than 5 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit auto_nag documentation.

Severity: critical → S3

The severity field for this bug is relatively low, S3. However, the bug has 6 See Also bugs.
:bhood, could you consider increasing the bug severity?

For more information, please visit auto_nag documentation.

Flags: needinfo?(bhood)
Flags: needinfo?(bhood)
You need to log in before you can comment on or make changes to this bug.