Closed Bug 1306168 Opened 8 years ago Closed 4 years ago

Crash in mozilla::layers::CompositorD3D11::BeginFrame

Categories

(Core :: Graphics: Layers, defect, P3)

Unspecified
Windows 10
defect

Tracking

RESOLVED WORKSFORME
Tracking Status
firefox49 --- wontfix
firefox-esr45 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox53 --- affected
firefox54 --- affected

People

(Reporter: ting, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, topcrash-win, Whiteboard: [gfx-noted])

Crash Data

This bug was filed from the Socorro interface and is 
report bp-23aeed7e-5196-4cc2-a2e1-2d6f22160927.
=============================================================

#49 of 0925 Nightly on Windows, 7 crashes from 7 installations. From the graph [1], bug 1133623 seemed to have fixed it, but it came back at the beginning of August. Low volume, though.

[1] https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=Nightly&_sort=-date&signature=mozilla%3A%3Alayers%3A%3ACompositorD3D11%3A%3ABeginFrame&date=%3E2016-06-07#graphs
Seems to have spiked only recently in Aurora, when it became 51. That gives our regression range a lower bound of 2016-08-01, the last Nightly 50.

Looking at the build graph[1], Nightly 51 started reporting this a couple of days later in the 2016-08-04 build.

[1] https://crash-stats.mozilla.com/signature/?product=Firefox&date=%3E2016-06-07&signature=mozilla%3A%3Alayers%3A%3ACompositorD3D11%3A%3ABeginFrame#graph
If we're slightly optimistic and assume that the first spike is exactly when this started happening, we get the range at [1].

In particular, a couple of the patches from bug 1289640 talking about threadsafe texture upload seem a bit suspicious. Doesn't add all that much information though, sadly.

[1] http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=6608e5864780589b25d5421c3d3673ab30c4c318&tochange=1576e7bc1bec7232e9e4ba78cce62526b1a6380b
I wonder if it's possible that we see a device loss, which makes these mutex acquisitions fail, but we manage to reset the device before they time out? So:

1. Device is lost.
2. (Thread 1) Start attempting to acquire the mutex. The device is lost so this can't succeed.
3. (Thread 2) We reset the device.
4. (Thread 1) We time out trying to acquire the mutex; crash.
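The four steps above can be sketched as a small single-threaded model, which makes it easier to see why the timeout would be reported against a device that looks healthy. This is only an illustration of the hypothesis: `Timeline`, `Outcome`, and `Classify` are hypothetical names, not Gecko code, and the model assumes that a wait started while the device is lost can never succeed.

```cpp
#include <cassert>

// Hypothetical model of the suspected race. All times are milliseconds on
// one shared clock. None of these names exist in Gecko.
enum class Outcome {
  Acquired,           // the wait did not overlap a device loss
  TimeoutDuringLoss,  // gave up while the device still reports lost
  TimeoutAfterReset   // gave up after the reset: device looks healthy
};

struct Timeline {
  int deviceLostAt;   // step 1: the device is lost
  int acquireStart;   // step 2: thread 1 starts waiting on the mutex
  int deviceResetAt;  // step 3: thread 2 resets the device
  int timeoutMs;      // how long thread 1 waits before giving up
};

Outcome Classify(const Timeline& t) {
  assert(t.timeoutMs > 0);
  assert(t.deviceLostAt <= t.deviceResetAt);

  // Assumption: a wait that starts while the device is lost never succeeds,
  // because the mutex belongs to the lost device.
  const bool startedWhileLost =
      t.acquireStart >= t.deviceLostAt && t.acquireStart < t.deviceResetAt;
  if (!startedWhileLost) {
    return Outcome::Acquired;
  }

  // Step 4: the device status is only sampled when the wait gives up.
  const int giveUpAt = t.acquireStart + t.timeoutMs;
  return giveUpAt >= t.deviceResetAt ? Outcome::TimeoutAfterReset
                                     : Outcome::TimeoutDuringLoss;
}
```

In this model, `Classify({0, 10, 500, 1000})` lands in `TimeoutAfterReset`: the wait began during the loss, but by the time it gave up the device had already been reset, so the timeout no longer looks like part of a device loss.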
Blocks: 1297204
Crash volume for signature 'mozilla::layers::CompositorD3D11::BeginFrame':
 - nightly (version 52): 53 crashes from 2016-09-19.
 - aurora  (version 51): 53 crashes from 2016-09-19.
 - beta    (version 50): 10 crashes from 2016-09-20.
 - release (version 49): 273 crashes from 2016-09-05.
 - esr     (version 45): 47 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly      34      19
 - aurora       42      11
 - beta          9       1
 - release     218      55
 - esr           4       1

Affected platform: Windows

Crash rank on the last 7 days:
           Browser     Content   Plugin
 - nightly #22
 - aurora  #24
 - beta    #1421
 - release #232
 - esr     #1647
Priority: -- → P3
Whiteboard: [gfx-noted]
Flags: needinfo?(milan)
This is #28 topcrash in Nightly over the past 7 days.
Note that the majority of these should go away once we're in beta or release: in nightly and aurora we force a crash when particular errors happen, whereas in beta and release we keep running. In some of the cases where we continue, we end up crashing in a driver, or timing out and MOZ_CRASH-ing, but the numbers seem to be low.
Most of the work for this is now in bug 1160157.
Crash volume for signature 'mozilla::layers::CompositorD3D11::BeginFrame':
 - nightly (version 53): 247 crashes from 2016-11-14.
 - aurora  (version 52): 232 crashes from 2016-11-14.
 - beta    (version 51): 1405 crashes from 2016-11-14.
 - release (version 50): 709 crashes from 2016-11-01.
 - esr     (version 45): 92 crashes from 2016-07-06.

Crash volume on the last weeks (Week N is from 01-02 to 01-08):
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly      34      33      60      51      38      20       0
 - aurora       32      39      41      47      51      16       0
 - beta        198     201     199     224     271     162      83
 - release      98     112     124     113     105     104      28
 - esr           3       6       3       3       6       4       9

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly #166
 - aurora  #24
 - beta    #45
 - release #497
 - esr     #1545
bp-76f6fe15-c9d8-4b3c-963c-63fc12170118 with nightly 53.0a1 20170111030235.

At the same time I got OOM | small (bp-15b14274-ad24-4bd7-85b9-487cb2170118) and F802033140_______________________________________ (bp-603e29e6-538d-46d5-9e70-4c9db2170118).
This is #34 on Beta 51. In most cases we are moz-crashing after a timeout, like you said:
(99.30% in signature vs 00.17% overall) moz_crash_reason = MOZ_CRASH(GFX: D3D11 normal status timeout)
Unfortunately, the volume of this crash increased once 51 went to the release audience: it's the #6 browser crash, causing 1.22% of all browser crashes in Firefox 51.0.1.
A device reset happens, we don't deal with it properly, and we eventually crash in the timeout. There seem to be disproportionately many Nvidia cards in these crashes.
The good news is that all (with a couple of interesting exceptions) of the 53 & 54 crashes are in the GPU process. The bad news is that 51 & 52 don't have the GPU process, so the browser goes down with this crash.
The "last" of the device reset crashes seem to come from something similar to what's described in bug 1333329, except that we MOZ_CRASH because of the timeout in CompositorD3D11::BeginFrame (e.g., https://crash-stats.mozilla.com/report/index/82e1407a-9506-41f9-8d61-8dbd62170129)

Should we remove this MOZ_CRASH? We do it because the timeout path doesn't consider itself part of a device reset, but looking at the log, there clearly were resets in the past; we just "forgot" about them by the time we get here. Or maybe it's something else.
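One way to stop "forgetting" earlier resets would be to keep a short history of them and consult it when the timeout fires. The sketch below is only an illustration of that idea: `ResetHistory` and its methods are hypothetical, not an existing Gecko API, and the length of the window is an assumption that would need tuning against real crash logs.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical helper, not Gecko code: remembers when device resets
// happened so that a later mutex timeout can be attributed to them.
class ResetHistory {
 public:
  explicit ResetHistory(int64_t windowMs) : mWindowMs(windowMs) {
    assert(windowMs > 0);
  }

  // Record a device reset observed at time nowMs (milliseconds).
  void NoteReset(int64_t nowMs) { mResets.push_back(nowMs); }

  // True if some recorded reset is recent enough to explain a timeout
  // observed at nowMs. Entries older than the window are dropped.
  bool TimeoutLooksLikeReset(int64_t nowMs) {
    while (!mResets.empty() && nowMs - mResets.front() > mWindowMs) {
      mResets.pop_front();
    }
    return !mResets.empty();
  }

 private:
  int64_t mWindowMs;
  std::deque<int64_t> mResets;
};
```

With a 60-second window, a reset noted at 1000 ms would still explain a timeout at 31000 ms, but not one at 62000 ms; only the second case would then be treated as an unexplained timeout.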

On a side note, we're going to uplift a patch to beta that could reduce the number of device resets on Nvidia in the first place.  If that shows results, we could do a dot release on 51.
Flags: needinfo?(milan) → needinfo?(dvander)
Crashing here, if in the GPU process, actually seems fine to me. Having the compositor block for 30+ seconds each frame is a little worrying. It might be worth disabling D3D11 at that point.
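One possible reading of this suggestion, written out as a decision function. Everything here is a hypothetical sketch rather than what the compositor actually does: `Action`, `kFrameTimeoutMs`, and `OnBeginFrameWait` are made-up names, and the 30000 ms threshold is taken only from the "30+ seconds" figure above.

```cpp
#include <cassert>

// Hypothetical policy sketch, not Gecko code.
enum class Action {
  KeepGoing,        // the wait was short enough, continue compositing
  CrashAndRestart,  // acceptable in the GPU process, which can be relaunched
  DisableD3D11      // stop using D3D11 instead of blocking every frame
};

// Assumption: threshold taken from the "30+ seconds" mentioned above.
constexpr int kFrameTimeoutMs = 30000;

Action OnBeginFrameWait(int waitedMs, bool inGpuProcess) {
  assert(waitedMs >= 0);
  if (waitedMs < kFrameTimeoutMs) {
    return Action::KeepGoing;
  }
  // Past the threshold: crashing only takes down the GPU process when we
  // are in it; otherwise fall back rather than take the browser down.
  return inGpuProcess ? Action::CrashAndRestart : Action::DisableD3D11;
}
```

The point of splitting on `inGpuProcess` is that the same timeout has very different costs: in the GPU process it costs a process restart, while in the browser process it costs the whole session.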
Flags: needinfo?(dvander)
Crash volume for signature 'mozilla::layers::CompositorD3D11::BeginFrame':
 - nightly (version 54): 27 crashes from 2017-01-23.
 - aurora  (version 53): 20 crashes from 2017-01-23.
 - beta    (version 52): 168 crashes from 2017-01-23.
 - release (version 51): 1906 crashes from 2017-01-16.
 - esr     (version 45): 104 crashes from 2016-08-03.

Crash volume on the last weeks (Week N is from 01-30 to 02-05):
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly      16
 - aurora        9
 - beta        107
 - release     968       0
 - esr           9       4       5       3       3       6       3

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly #785
 - aurora  #562
 - beta    #41
 - release #6
 - esr     #1470
This is the #2 topcrash for Windows nightly of 20170309030216,
reported 729 times.
(In reply to Julian Seward [:jseward] from comment #15)
> This is the #2 topcrash for Windows nightly of 20170309030216,
> reported 729 times.

Probably bug 1345814.
Adding a note that this is the #1 GPU Process crash @ 55.38% in Nightly 55 (#2 @ 13.34% overall) with 5098 of 5352 reports coming from the GPU Process (95.2%). In Beta this is only #109 @ 0.04% and in Aurora this is only #54 @ 0.09%.
Keywords: topcrash-win
Mass wontfix for bugs affecting firefox 52.

This is currently far from being a topcrash, and isn't showing up for any current versions:
https://crash-stats.mozilla.org/signature/?signature=mozilla%3A%3Alayers%3A%3ACompositorD3D11%3A%3ABeginFrame

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME