crash in UMDevice::SetShaderResources

Status: REOPENED
defect, P3, critical
Opened 3 years ago, updated 2 years ago

People: (Reporter: marco, Assigned: bas.schouten)
Tracking: ({crash, topcrash})
unspecified, All, Windows
Points: ---
Firefox Tracking Flags: (platform-rel +, firefox-esr45 affected, firefox50 wontfix, firefox51 wontfix, firefox52+ wontfix, firefox53 affected, firefox54 affected)
Details: (Whiteboard: [gfx-noted], crash signature)

This bug was filed from the Socorro interface and is report bp-1b31be8b-eac4-404e-b664-f994f2160426.
=============================================================

This isn't a new crash (first seen on 2015-10-16), but it is a top crasher (#48) on Firefox 47.0a2.

The function is in d3d10warp.dll, so this is related to WARP.

Most crashes are on Windows 10, a few on Windows 8.1.
As I understand it WARP has been disabled in FF48. We could consider uplifting it.
Flags: needinfo?(bas)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #1)
> As I understand it WARP has been disabled in FF48. We could consider
> uplifting it.

Not a bad idea. The majority of crashes are on Win10, where I believe our percentage of users on WARP is fairly low, which suggests this is happening for a fair number of the Win10 users who are on WARP.

I suspect this may have something to do with feature levels and such, but it's hard to be sure.
Flags: needinfo?(bas)
This is pretty heavily skewed towards users with NVIDIA GPUs:
1       NVIDIA          68.70 %
2       AMD             16.40 %
3       Intel           14.90 %

Top-5 Drivers correlate to 364.72:
1       10.18.13.6472   78.33 %
2       10.18.13.6451    5.30 %
3       10.18.13.5382    3.69 %
4        9.18.13.4195    1.93 %
5        9.18.13.4192    1.77 %
Z       10.18.13.6510    0.10 % [latest driver]
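
The driver numbers in the list above are Windows file versions, while 364.72 is an NVIDIA release number. The release number appears to be encoded in the last five digits of the file version. A minimal Python sketch of that conversion, assuming this convention holds for every NVIDIA driver in these reports (the function name is hypothetical):

```python
def nvidia_release(file_version: str) -> str:
    """Convert a Windows driver file version such as '10.18.13.6472'
    to the NVIDIA release number it appears to encode ('364.72').

    Assumption: the release number is the last five digits of the
    version with the dots removed, read as XXX.YY.
    """
    digits = file_version.replace(".", "")
    if len(digits) < 5 or not digits.isdigit():
        raise ValueError(f"unexpected driver version: {file_version!r}")
    last5 = digits[-5:]
    return f"{last5[:3]}.{last5[3:]}"


# Driver versions taken from the list above.
for version in ("10.18.13.6472", "10.18.13.6451", "10.18.13.5382", "9.18.13.4195"):
    print(version, "->", nvidia_release(version))
```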

Top-20 Chipsets correlate to Maxwell:
1       Maxwell         52.65 % [GM204, GM200, GM206, GM107]
2       Kepler          26.06 % [GK104, GK106, GK110]
3       Fermi            1.61 % [GF106]
4       Tesla            0.80 % [GT218]
(May not be surprising given this is mostly a Windows 10 crash and Maxwell/Kepler are more modern chipsets).

Also note that d3d10warp.dll shows up at the top of the module correlations (~99%).
Whiteboard: [gfx-noted]
This is still mostly a Windows 10 crash (99.6% in the last 5 days), but the other findings in comment 3 no longer apply.
It's still skewed towards NVIDIA, but more in line with the overall crash data (for the last 5 days, ~33.5% for this signature vs ~20.6% in general).
Pushed by rgiles@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8f1c1957ce80
Update source for mp4parse v0.4.0. r=kinetik
Which was actually for bug 1267887.
https://hg.mozilla.org/mozilla-central/rev/8f1c1957ce80
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
See comment 6.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Crash volume for signature 'UMDevice::SetShaderResources':
 - nightly (version 50): 113 crashes from 2016-06-06.
 - aurora  (version 49): 186 crashes from 2016-06-07.
 - beta    (version 48): 743 crashes from 2016-06-06.
 - release (version 47): 7970 crashes from 2016-05-31.
 - esr     (version 45): 19 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly         11         37         39          7          4          6          6
 - aurora          29         59         40         11         19         16          8
 - beta           117        182        129         95         84         67         39
 - release       1212       1887       1334        947        926        854        429
 - esr              5          5          2          1          2          0          0

Affected platform: Windows
How bad is this? (I assume no traction possible...)
Flags: needinfo?(anthony.s.hughes)
(In reply to David Bolter [:davidb] from comment #10)
> How bad is this? (I assume no traction possible...)

I can't say for certain how bad this is for those users who are crashing. My best guess based on the data is that it's not critical. At this point it is not a top crash and less than 2% of these crashes are happening within the first minute of startup time. That said, it seems to be happening to more users as time goes on but not at a high frequency.

It is primarily happening to those users with discrete GPUs running on Windows 10 with not-current driver versions. Video and image heavy websites show up frequently in the URLs (Facebook, Youtube, Netflix, 9gag, and some adult-oriented websites).

As a note, 21 of the users reporting these crashes have provided email addresses. It may be worth reaching out to these people.

What follows is the data I was able to gather about this crash. I hope it helps.

Reports vs Installations:
=========================
-1 week: 1908 crashes / 1933 installations [0.987 effective crash rate]
-2 week: 1218 crashes / 1216 installations [1.002 effective crash rate]
-3 week: 1320 crashes / 1316 installations [1.003 effective crash rate]
-4 week: 1331 crashes / 1325 installations [1.005 effective crash rate]
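
The bracketed "effective crash rate" figures appear to be crashes divided by installations. A short Python sketch that reproduces them from the numbers above, assuming that definition (the function name is hypothetical):

```python
# (crashes, installations) per week, copied from the report above.
weeks = {
    "-1 week": (1908, 1933),
    "-2 week": (1218, 1216),
    "-3 week": (1320, 1316),
    "-4 week": (1331, 1325),
}


def effective_crash_rate(crashes: int, installations: int) -> float:
    """Crashes per affected installation (assumed definition)."""
    return crashes / installations


for label, (crashes, installations) in weeks.items():
    rate = effective_crash_rate(crashes, installations)
    print(f"{label}: {rate:.3f}")
```

A rate close to 1.0 means most affected installations reported the crash about once in that week.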


Breakdown by Version:
=====================
In 51 this doesn't show up in the top-300.
In 50 this doesn't show up in the top-300.
In 49 this is #97 @ 0.16% rising 78 positions over the last week.
In 48 this is #33 @ 0.34% rising 39 positions over the last week.
In 47 this is #35 @ 0.32% rising 35 positions over the last week.

Percent Change over last 6 months:
==================================
Feb 18 - Mar 17:  882 [baseline]
Mar 18 - Apr 17: 1891 [+114.40%]
Apr 18 - May 17: 4410 [+400.00%]
May 18 - Jun 17: 4597 [+421.20%]
Jun 18 - Jul 17: 6298 [+614.06%]
Jul 18 - Aug 17: 6420 [+627.89%]
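
The percentages above are each period's change relative to the Feb 18 - Mar 17 baseline, not month over month. A small Python sketch of that arithmetic (the function name is hypothetical):

```python
# Crash counts per period, copied from the list above.
monthly = [
    ("Feb 18 - Mar 17", 882),
    ("Mar 18 - Apr 17", 1891),
    ("Apr 18 - May 17", 4410),
    ("May 18 - Jun 17", 4597),
    ("Jun 18 - Jul 17", 6298),
    ("Jul 18 - Aug 17", 6420),
]


def percent_change(value: int, baseline: int) -> float:
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


baseline = monthly[0][1]
for period, count in monthly[1:]:
    print(f"{period}: {count} [{percent_change(count, baseline):+.2f}%]")
```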

Breakdown by Platform:
======================
99.8% are on Windows 10
 0.2% are on Windows 8.1

Breakdown by GPU Vendor:
========================
40% are on AMD
40% are on NVIDIA
20% are on Intel

Breakdown by Driver:
====================
AMD
> 48.97% are on the 16.* branch driver with 58% of those users on 16.150.2211.0
> 42.46% are on the 15.* branch driver with 62% of those users on 15.300.1025.1001
>  6.54% are on the 20.* branch driver with 98% of those users on 20.19.0.32832

NVIDIA
> 66.39% are on the 368.* branch driver with 75% of those users on 10.18.13.6881
> 11.76% are on the 341.* branch driver with 60% of those users on 9.18.13.4195
>  8.91% are on the 353.* branch driver with 46% of those users on 10.18.13.5382

Intel
> 34.81% are on the 20.19.15.* branch driver with 18% of those users on 20.19.15.4444
> 32.74% are on the 10.18.15.* branch driver with 31% of those users on 10.18.15.4248
> 17.91% are on the 10.18.10.* branch driver with 54% of those users on 10.18.10.4252

Breakdown by Chipset:
=====================
AMD
> 46.65% are on CIK with 35% of those users on Kaveri
> 40.63% are on Southern Islands with 39% of those users on Pitcairn
>  4.61% are on Cayman with 100% of those users on Aruba

NVIDIA
> 20.78% are on Maxwell-GM204 chipsets with 80% of those users on Geforce GTX 970 cards
> 13.30% are on Kepler-GK104 chipsets with 58% of those users on Geforce GTX 760/770 cards
>  6.32% are on Maxwell-GM107 chipsets with 84% of those users on Geforce GTX 750 cards 

Intel
> 33.24% are on Haswell with 41% of those users on Intel HD 4400
> 25.32% are on Skylake with 63% of those users on Intel HD 520
>  9.97% are on Ivybridge with 64% of those users on Intel HD 4000
Flags: needinfo?(anthony.s.hughes)
(97.28% in signature vs 01.34% overall) GFX_ERROR "VendorIDMismatch V "="True"
(99.91% in signature vs 25.75% overall) platform_pretty_version="Windows 10"
(86.22% in signature vs 17.37% overall) platform_version="10.0.14393"
(100.0% in signature vs 34.39% overall) reason="EXCEPTION_ACCESS_VIOLATION_READ"
(64.35% in signature vs 11.35% overall) address="0x0"
(99.56% in signature vs 50.46% overall) "D2D1.1+" in app_notes="True"
(99.56% in signature vs 50.47% overall) "DWrite+" in app_notes="True"
(99.56% in signature vs 50.47% overall) "DWrite?" in app_notes="True"
(100.0% in signature vs 51.96% overall) "D3D11 Layers+" in app_notes="True"
(69.62% in signature vs 22.04% overall) adapter_vendor_id="0x10de"
(54.52% in signature vs 07.10% overall) "DXVA2D3D11?" in app_notes="True"
(100.0% in signature vs 53.45% overall) "D3D11 Layers?" in app_notes="True"
(07.64% in signature vs 52.27% overall) adapter_vendor_id="0x8086"
(96.14% in signature vs 53.29% overall) os_arch="amd64"
(42.32% in signature vs 01.54% overall) adapter_driver_version="21.21.13.7290"
(74.80% in signature vs 36.13% overall) bios_manufacturer="American Megatrends Inc."
(01.67% in signature vs 39.45% overall) "D2D1.1-" in app_notes="True"
(42.05% in signature vs 05.69% overall) "DXVA2D3D11+" in app_notes="True"
(03.86% in signature vs 37.93% overall) os_arch="x86"
(34.86% in signature vs 01.59% overall) "DXVA2D3D11-" in app_notes="True"
(51.80% in signature vs 19.50% overall) Addon "Adblock Plus"="True"
(100.0% in signature vs 77.06% overall) shutdown_progress="None"
(21.16% in signature vs 05.46% overall) url="www.youtube.com"
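
Each line above compares how often a property appears in reports with this signature against how often it appears in all reports. A minimal Python sketch that parses such lines and ranks them by the size of that gap; the ranking by absolute difference is an assumption made for illustration, and the ordering used by crash-stats may differ:

```python
import re

# Matches "(NN.NN% in signature vs NN.NN% overall) <property>".
LINE_RE = re.compile(
    r"^\((?P<sig>\d+(?:\.\d+)?)% in signature vs "
    r"(?P<overall>\d+(?:\.\d+)?)% overall\) (?P<what>.+)$"
)


def parse_correlation(line: str):
    """Return (percent in signature, percent overall, property text)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a correlation line: {line!r}")
    return float(match["sig"]), float(match["overall"]), match["what"]


# A few lines copied from the list above.
sample = [
    '(97.28% in signature vs 01.34% overall) GFX_ERROR "VendorIDMismatch V "="True"',
    '(69.62% in signature vs 22.04% overall) adapter_vendor_id="0x10de"',
    '(07.64% in signature vs 52.27% overall) adapter_vendor_id="0x8086"',
    '(21.16% in signature vs 05.46% overall) url="www.youtube.com"',
]

rows = [parse_correlation(line) for line in sample]
# Largest gap first, whether the property is over- or under-represented.
rows.sort(key=lambda row: abs(row[0] - row[1]), reverse=True)
for sig, overall, what in rows:
    print(f"{sig - overall:+7.2f} points  {what}")
```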

What does the "VendorIDMismatch" mean?
Looks like the first vendor ID is always one of NVIDIA, Intel or AMD and
the second always "0x1414".

E.g.:
"VendorIDMismatch V 0x10de 0x1414"
"VendorIDMismatch V 0x8086 0x1414"
"VendorIDMismatch V 0x1002 0x1414"

0x1414 is the vendor ID of the "Microsoft Basic Render Driver".
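
For triage, the two IDs in such a message can be decoded into vendor names. A minimal Python sketch, assuming the message always has the shape shown in the examples above; the lookup table only covers the four PCI vendor IDs seen in this bug, and the function name is hypothetical:

```python
# PCI vendor IDs that appear in this bug.
PCI_VENDORS = {
    0x10DE: "NVIDIA",
    0x8086: "Intel",
    0x1002: "AMD",
    0x1414: "Microsoft",
}


def parse_vendor_mismatch(message: str):
    """Split 'VendorIDMismatch V <id> <id>' into two vendor names.

    Unknown IDs are returned as hex strings instead of names.
    """
    parts = message.split()
    if len(parts) != 4 or parts[0] != "VendorIDMismatch":
        raise ValueError(f"unexpected message: {message!r}")
    first, second = (int(part, 16) for part in parts[2:])
    return (
        PCI_VENDORS.get(first, hex(first)),
        PCI_VENDORS.get(second, hex(second)),
    )


for note in (
    "VendorIDMismatch V 0x10de 0x1414",
    "VendorIDMismatch V 0x8086 0x1414",
    "VendorIDMismatch V 0x1002 0x1414",
):
    print(note, "->", parse_vendor_mismatch(note))
```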
Bas, According to comment 2 you seem to have a good idea of the way forward.
Assignee: nobody → bas
Also see bug 1285792 about the [@ UMDevice::ResourceCopyRegion] signature, which looks related, as it had spikes at the same points in time as this bug. The reporter there said the crash happened during an (NVIDIA) driver update.
See Also: → 1285792
Bas, this is showing up as a top crash across all branches. Can you please help move this forward?
Flags: needinfo?(bas)
Target Milestone: mozilla50 → ---
(In reply to Nicolas Silva [:nical] from comment #14)
> Bas, According to comment 2 you seem to have a good idea of the way forward.

The GPU process should be fixing this. What's a little odd is that I believe we've long disabled WARP, so I wonder why we're still getting crashes in it. I'll ask around.
Flags: needinfo?(bas)
Crash volume for signature 'UMDevice::SetShaderResources':
 - nightly (version 54): 37 crashes from 2017-01-23.
 - aurora  (version 53): 8 crashes from 2017-01-23.
 - beta    (version 52): 216 crashes from 2017-01-23.
 - release (version 51): 0 crashes from 2017-01-16.
 - esr     (version 45): 184 crashes from 2016-08-03.

Crash volume on the last weeks (Week N is from 01-30 to 02-05):
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly      32
 - aurora        7
 - beta        160
 - release       0       0
 - esr          27       9      14       7       4       8      14

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly #73
 - aurora  #104
 - beta    #34
 - release
 - esr     #989
[Tracking Requested - why for this release]:
this crash signature is regressing in volume on the beta channel after we went into the 52 cycle:
https://crash-stats.mozilla.com/signature/?release_channel=beta&signature=UMDevice%3A%3ASetShaderResources&date=>%3D2016-11-16T09%3A22%3A47.000Z#graphs

any clues on what may be the cause of this and if it's fixable?
Flags: needinfo?(bas)
looking at the nightly channel this seems to have regressed in 52.0a1 build 20161029062601.

this would be the pushlog of 52.0a1 build 20161029062601 -1 day: 
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=944cb0fd05526894fcd90fbe7d1e625ee53cd73d&tochange=1b170b39ed6bdbde366233ab84594bdaaa960a5a
and back a further 2 days:
https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2016-10-26&tochange=944cb0fd05526894fcd90fbe7d1e625ee53cd73d
(In reply to [:philipp] from comment #20)
> looking at the nightly channel this seems to have regressed in 52.0a1 build
> 20161029062601.
> 
> this would be the pushlog of 52.0a1 build 20161029062601 -1 day: 
> https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=944cb0fd05526894fcd90fbe7d1e625ee53cd73d&tochange=1b170b39ed6bdbde366233ab84594bdaaa960a5a

Bug 1313770 in this range touched a lot of layers code. Not sure if that was generic IPC code or more specific to the GPU process, though. There's also bug 1297822, but I would assume that's only for the GPU process and wouldn't affect 52.

> and back a further 2 days:
> https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2016-10-26&tochange=944cb0fd05526894fcd90fbe7d1e625ee53cd73d

There's a Skia update in here (bug 1299435), but I would assume that'd be irrelevant here. A couple other GPU process fixes in there and some Gonk-removal code cleanup patches, but none of that looks particularly likely either.
Flags: needinfo?(dvander)
Unfortunately none of that looks related to this driver crash.
Flags: needinfo?(dvander)
This crash related to gfx driver reset has rather high volume on beta52. :(
We have seen ourselves hit driver bugs, mostly race conditions, when it comes to video.  The range in comment 20 does contain some video work (e.g., bug 1295352, despite being marked as "tests"), so it's possible it's tickling the driver bugs more.
This signature has been spiking again in the last few days. It's now 2% of all browser crashes on 52.0.1.
Too late for firefox 52, mass-wontfix.
Those are very interesting spikes; Anthony, can you check that the graph is showing us the correct numbers?

On a side note, the great majority of these involve a device reset, and perhaps we come back before the card is ready - we get the "vendor mismatch" message, with the adapter vendor 0x1414 (Microsoft) instead of the actual one.  Perhaps that leads us to trouble.

Needinfo Peter in case we have the bandwidth to look at this in the device reset context.
Flags: needinfo?(howareyou322)
Flags: needinfo?(bas)
Flags: needinfo?(anthony.s.hughes)
The values in the graph are accurate; however, the spikes do not correlate to our own release dates. I suspect this could be related to an OS, driver, or third-party update based on correlations.

* 99.77% are on Windows 10
* 82.47% are on NVIDIA hardware
* 55% are on 378.92 driver, only 0.8% are on the latest 381.65 driver

The spikes correlate quite closely to the NVIDIA 378.78 driver update (released March 9th) and the 378.92 driver update (released March 20th).

Other correlations can be seen here:
https://crash-stats.mozilla.com/signature/?signature=UMDevice%3A%3ASetShaderResources#correlations
Flags: needinfo?(anthony.s.hughes)
Makes sense.

To save others time if they're looking at a crash like https://crash-stats.mozilla.com/report/index/e4b0731e-2b12-4705-8ce5-422d12170409 and wondering why the crash guard ran hours into the session: it was the first run, so the crash guard ran once; we store this in the preference, but those preferences only get updated on restart.  So, when a device reset happened and we restarted the compositor, we ran the crash guard code again.

We also did this somewhat too soon - the hardware wasn't ready for it, which is why we end up with the mismatched vendor message.

This may be something we want to raise with Nvidia, given comment 28.
platform-rel: --- → ?
Flags: needinfo?(howareyou322)
platform-rel: ? → +
Currently #8 overall top browser crash in FF 53.
This signature spiked on 2017-05-17 for release 53 with 1224 crashes (when the average was around 150 crashes/day in the days before).
Keywords: topcrash
The crash seems to be spiking whenever a graphics driver update is pushed out through Windows Update...
(In reply to [:philipp] from comment #32)

I tend to agree with you here.
I updated from 17.4.4 to 17.5.2 30 mins ago. I have not restarted my pc yet and got the crash right now.
https://crash-stats.mozilla.com/report/index/a14896e0-89e0-4784-93ef-aead80170522