Closed Bug 1359416 Opened 7 years ago Closed 7 years ago

Increase in Intel graphics driver crashes with gfx.direct3d11.allow-intel-mutex on

Categories

(Core :: Graphics, defect)

54 Branch
All
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 + disabled
firefox55 + fixed
firefox56 + fixed

People

(Reporter: philipp, Assigned: bas.schouten)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, topcrash-win)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-ac8e08c0-4db2-4d81-a6c1-7b7dd0170425.
=============================================================

intel graphics driver crashes on windows are sharply increasing on the beta channel after 54.0b1 got rolled out there. while they where accounting to 1.5% of crashes in the browser process during the 53.0b cycle, crashes with signatures starting with igd* are now 15% of browser crashes in 54.0b1.

http://bit.ly/2q0vGay shows the signatures with the most movement between the two beta cycles (i've added all signatures which share increased to this bug).

this is happening on all supported windows versions and seems to originate from a couple of particular gpus:

Adapter device id facet
1 	0x0116 	794 	33.32 %
2 	0x0166 	422 	17.71 %
3 	0x0152 	318 	13.34 %
4 	0x0102 	284 	11.92 %
5 	0x0106 	271 	11.37 %
6 	0x0126 	208 	8.73 %
7 	0x0156 	61 	2.56 %
8 	0x0162 	17 	0.71 %
9 	0x0122 	5 	0.21 %
10 	0x0112 	3 	0.13 %

adding a ni for milan to make sure that the proper people are looking into this :-)
Flags: needinfo?(milan)
a good junk of the increase seems to come from reports with IMFYCbCrImage::GetD3D11TextureData in the stack, so maybe that's related to bug 1300699?
See Also: → 1300699
Bug 1300699 was hopefully just unit test code, but there are enough changes in there that I'd like that confirmed - Kevin, Jerry, did bug 1300699 patches contain any changes to the code that runs outside of the unit tests, even if those changes "should not make any difference"?
Flags: needinfo?(milan)
Flags: needinfo?(kechen)
Flags: needinfo?(hshih)
[Tracking Requested - why for this release]:
After reviewing the patches in Bug 1300699 again, there are two minor changes outside of the unit test.

1. The timing of executing GetSize()[1] :
  Before the patch, this method was involved when creating DXGIYCbCrTextureData[2] while the patch changed the function call to   the beginning of the function. But, from my understanding, there was no chance to change the mSize between these two timing.
2. An additional call to GetContent/CompositorDevice in[3].

[1] https://hg.mozilla.org/mozilla-central/file/3f8ed553e230/gfx/layers/IMFYCbCrImage.cpp#l335
[2] https://hg.mozilla.org/mozilla-central/file/17853e696113/gfx/layers/IMFYCbCrImage.cpp#l331
[3] https://hg.mozilla.org/mozilla-central/file/3f8ed553e230/gfx/layers/IMFYCbCrImage.cpp#l227
Flags: needinfo?(kechen)
Track 54+ as the volume of crash in 54 is high.
Perhaps the increasing in 54.0b1 is related to Bug 1335971?
It seems like we still have the problem in Bug 1292923.
Hello Bas, any idea about this?
Flags: needinfo?(bas)
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #8)
> Perhaps the increasing in 54.0b1 is related to Bug 1335971?

yeah, that might be in line with coincidental crash stats data. the spike wasn't very visible during the 54 nightly cycle (maybe people there generally use newer hardware), except perhaps for the [@ igd10umd64.dll] signature which popped up somewhat more often starting with build 20170205030206 - that would fit well with when bug 1335971 landed: https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&process_type=browser&signature=igd10umd64.dll&date=%3E%3D2017-01-01T00%3A00%3A00.000Z#reports (sort by buildid).

could we attempt a backout of bug 1335971 from beta, as the crash volume continues to be really high there (16% of all browser crashes from 54.0b2 have a signature starting with igd*)?
The functionality from bug 1335971 is behind a preference gfx.direct3d11.allow-intel-mutex, which means we can turn this off using a system add-on.  Probably less intrusive than backing out the code - Bas, I'm assuming disabling this would be the same as backing the code out.

Ryan, I can't recall where to start with the system add-on approach, and I'm assuming that's preferable from rel coordination point of view?
Flags: needinfo?(ryanvm)
We could ship a hotfix addon theoretically. But we also ship two beta builds a week, so wouldn't it be easier to just flip the pref on Beta and have it go out that way?
Flags: needinfo?(ryanvm)
Summary: Crash in igd10umd32.dll | CResource<T>::CLS::FinalConstruct → Increase in Intel graphics driver crashes in 54.0b cycle
Changing the pref on beta would be great.
Assignee: nobody → bas
Having looked at this a little more I have some doubts whether bug 1335971 is responsible. But let's flip the pref anyway and see.
Flags: needinfo?(bas)
Although.. maybe for the ones with IMFYCbCrImage::GetD3D11TextureData in it it is. So if those are where the increase is coming from it could work.
Comment on attachment 8863147 [details]
Bug 1359416: Disable uploading on client side on Intel devices on Beta.

https://reviewboard.mozilla.org/r/134972/#review138056
Attachment #8863147 - Flags: review?(milan) → review+
Keywords: checkin-needed
Hi Bas, this was mentioned as a top crasher on Beta. If this fix helps, could you please nominate this patch for uplift to Beta55? Thanks!
Flags: needinfo?(bas)
...54 beta 5 :-)
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again.

hg error in cmd: hg rebase -s a2479bb25aca -d 83fe6517d5be: rebasing 393202:a2479bb25aca "Bug 1359416: Disable uploading on client side on Intel devices on Beta. r=milan" (tip)
merging gfx/thebes/gfxPrefs.h
warning: conflicts while merging gfx/thebes/gfxPrefs.h! (edit, then use 'hg resolve --mark')
unresolved conflicts (see hg resolve, then hg rebase --continue)
In retrospect, comment 19 was silly because we only want this on Beta. Which means it needs approval first and shouldn't get autolanded once it has it.
Keywords: checkin-needed
Flags: needinfo?(hshih)
Comment on attachment 8863147 [details]
Bug 1359416: Disable uploading on client side on Intel devices on Beta.

Approval Request Comment
[Feature/Bug causing the regression]: 1335971
[User impact if declined]: Crashes
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: No
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No
[Why is the change risky/not risky?]: Reverting to old behavior.
[String changes made/needed]: None
Flags: needinfo?(bas)
Attachment #8863147 - Flags: approval-mozilla-beta?
Comment on attachment 8863147 [details]
Bug 1359416: Disable uploading on client side on Intel devices on Beta.

Revert to old behavior and see how it goes in Beta54. Beta54+. Should be in 54 beta 5.
Attachment #8863147 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Bas Schouten (:bas.schouten) from comment #21)
> [Is this code covered by automated tests?]: Yes
> [Has the fix been verified in Nightly?]: No
> [Needs manual test from QE? If yes, steps to reproduce]: No

Setting qe-verify- based on Bas' assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
toggling gfx.direct3d11.allow-intel-mutex to false in 54.0b5 worked very well - the intel driver crashes are back to a level we were used to see in prior versions.
Crash Signature: igd10umd32.dll | NDXGI::CDevice::SetSharedResourceCreationArgs ] [@ igd10umd32.dll | CContext::ID3D11DeviceContext2_UpdateSubresource_<T> ] [@ igd10umd32.dll | d3d11.dll@0xc0651 ] → igd10umd32.dll | NDXGI::CDevice::SetSharedResourceCreationArgs ] [@ igd10umd32.dll | CContext::ID3D11DeviceContext2_UpdateSubresource_<T> ] [@ igd10umd32.dll | d3d11.dll@0xc0651 ] [@ memset | TextStageManager::MapTextureTransferSurface ]
(In reply to [:philipp] from comment #25)
> toggling gfx.direct3d11.allow-intel-mutex to false in 54.0b5 worked very
> well - the intel driver crashes are back to a level we were used to see in
> prior versions.

Any chance you could verify the patch in bug 1348320 has also been effective at reducing our crash rates for 55?
Flags: needinfo?(madperson)
curiously there's a minuscule volume of these crashes on nightly in the first place - i could only see a hand full of them on 55.0a1 vs many thousands on 54.0b (not sure why that is, maybe the affected intel hd 2500/3000/4000 devices aren't as present there).
so in order to verify the fix i think it would be either necessary to uplift the patch and flip back the preference for a trial run again or wait until the next cycle and see how things turn out then...
Flags: needinfo?(madperson)
[Tracking Requested - why for this release]: Topcrash in early Beta 55 builds
Based on comment 27 it sounds like bug 1348320 was not as effective as it was hoped.  Should we land this patch here in 55 too?
Flags: needinfo?(bas)
See Also: → 1373937
(In reply to Julien Cristau [:jcristau] from comment #29)
> Based on comment 27 it sounds like bug 1348320 was not as effective as it
> was hoped.  Should we land this patch here in 55 too?

It would be a pretty big perf setback. We should probably figure out what's causing beta to be disproportionately affected by this and blacklist specifically the set of hardware causing this.
Flags: needinfo?(bas)
so far these are some correlations from the 55.0b rollout:

Platform pretty version facet
1 	Windows 10 	284 	90.45 %
2 	Windows 7 	21 	6.69 %
3 	Windows 8.1 	6 	1.91 %
4 	Windows 8 	3 	0.96 %

Adapter device id facet
1 	0x0116 	180 	57.32 %
2 	0x0126 	75 	23.89 %
3 	0x0102 	24 	7.64 %
4 	0x0106 	18 	5.73 %
5 	0x0166 	12 	3.82 %
6 	0x0152 	4 	1.27 %
7 	0x0112 	1 	0.32 %

Adapter driver version facet
1 	9.17.10.4459 	248 	78.98 %
2 	9.17.10.4229 	36 	11.46 %
3 	8.15.10.2418 	18 	5.73 %
4 	9.17.10.2770 	2 	0.64 %
5 	9.17.10.2828 	2 	0.64 %
...

this is the crash stats query i've used for this: http://bit.ly/2stomHP
Summary: Increase in Intel graphics driver crashes in 54.0b cycle → Increase in Intel graphics driver crashes with gfx.direct3d11.allow-intel-mutex on
This is a top crash in beta, I think we should land the pref flip now, rather than keeping this crash around for weeks when we know we have a workaround.
milan, could you make the the pref flip happen again? 
the issue is hanging around on 55.0b for a while now already & while the volume of this bug seems less severe than it has been in the 54.0b cycle, with ~200 daily crashes on beta it's still one of the top issues and crash rates in 55 are generally higher than we'd like to see...
Flags: needinfo?(milan)
If we block gen6/sandybridge, we'll take care of #1, #2, #3, #4, #7 in the list in comment 32.  That's about 11% of all the users; I don't know what percentage of Windows 10 users that is.
Bas, what do you think about blocking those users?
Flags: needinfo?(milan) → needinfo?(bas)
(In reply to Milan Sreckovic [:milan] from comment #35)
> If we block gen6/sandybridge, we'll take care of #1, #2, #3, #4, #7 in the
> list in comment 32.  That's about 11% of all the users; I don't know what
> percentage of Windows 10 users that is.
> Bas, what do you think about blocking those users?

Taking the perf improvement from 11% of the user is definitely better than taking it from all of them!
Flags: needinfo?(bas)
Any chance we can get either the pref flip or blocking a subset of chips done this week?
Flags: needinfo?(bas)
Jeff, can you send me the list of device IDs that constitute Tesla devices?
Flags: needinfo?(bas) → needinfo?(jmuizelaar)
The list is here in the Tesla section: https://github.com/jrmuizel/gpu-db/blob/master/nvidia.json
Flags: needinfo?(jmuizelaar)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #39)
> The list is here in the Tesla section:
> https://github.com/jrmuizel/gpu-db/blob/master/nvidia.json

I posted this on the wrong bug, but thanks!
Comment on attachment 8887673 [details]
Bug 1359416: Allow blocklisting of D3D11 keyed mutex and block it for Intel Gen 6.

https://reviewboard.mozilla.org/r/158560/#review164116
Attachment #8887673 - Flags: review?(bas) → review+
Pushed by msreckovic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4d83b498dc00
Allow blocklisting of D3D11 keyed mutex and block it for Intel Gen 6. r=bas
I had to back this out for various test failures with assertions like https://treeherder.mozilla.org/logviewer.html#?job_id=116542086&repo=autoland
Flags: needinfo?(bas)
Backout by kwierso@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/ea24b188a4b0
Backed out changeset 4d83b498dc00 for assertions in ServoStyleSet a=backout CLOSED TREE
The preferences now get first accessed off the main thread, and had no chance to be initialized.  I'll take a look.
Flags: needinfo?(bas) → needinfo?(milan)
Flags: needinfo?(milan)
See Also: → 1357307
(In reply to Milan Sreckovic [:milan] from comment #50)
> Comment on attachment 8887673 [details]
> Bug 1359416: Allow blocklisting of D3D11 keyed mutex and block it for Intel
> Gen 6.
> 
> Review request updated; see interdiff:
> https://reviewboard.mozilla.org/r/158560/diff/3-4/

Changed to David as Bas already reviewed the other part, and this deals with the whole gfxVars story.
Comment on attachment 8887673 [details]
Bug 1359416: Allow blocklisting of D3D11 keyed mutex and block it for Intel Gen 6.

https://reviewboard.mozilla.org/r/158560/#review166368
Attachment #8887673 - Flags: review?(dvander) → review+
Pushed by msreckovic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/38778100700e
Allow blocklisting of D3D11 keyed mutex and block it for Intel Gen 6. r=bas,dvander
https://hg.mozilla.org/mozilla-central/rev/38778100700e
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
thanks for the patch, could you request uplift to beta if you deem fit to do so?
Flags: needinfo?(milan)
Comment on attachment 8887673 [details]
Bug 1359416: Allow blocklisting of D3D11 keyed mutex and block it for Intel Gen 6.

Approval Request Comment
[Feature/Bug causing the regression]:
[User impact if declined]: Users that should be blocked get the feature.
[Is this code covered by automated tests?]:
[Has the fix been verified in Nightly?]: It is probably too early to see the trends already.
[Needs manual test from QE? If yes, steps to reproduce]: 
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: Going back to old functionality, low risk.
[Why is the change risky/not risky?]:
[String changes made/needed]:
Flags: needinfo?(milan)
Attachment #8887673 - Flags: approval-mozilla-beta?
In addition to the above comments, another reason to take the patch into beta - this gives us the ability to use the downloadable blocklist to turn this functionality off remotely, if any additional troublesome configurations present themselves.
Comment on attachment 8887673 [details]
Bug 1359416: Allow blocklisting of D3D11 keyed mutex and block it for Intel Gen 6.

add blocklist for a d3d11 feature on intel gen 6 to try and work around crashes.  should be in 55.0b13
Attachment #8887673 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
See Also: → 1397040
You need to log in before you can comment on or make changes to this bug.