Closed Bug 1403353 Opened 7 years ago Closed 7 years ago

Increase in crashes with Intel's igdusc64.dll module during Firefox 57

Categories

(Core :: Graphics, defect, P1)

57 Branch
x86_64
Windows 7
defect

Tracking

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- wontfix
firefox55 --- wontfix
firefox56 + fixed
firefox57 + fixed
firefox58 --- fixed

People

(Reporter: philipp, Assigned: dvander)

Details

(Keywords: crash, regression, Whiteboard: [gfx-noted])

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-af5e7d3d-35b2-4905-99be-920d60170926.
=============================================================
[Tracking Requested - why for this release]:
there is an increase in crashes with signatures containing intel's igdusc64.dll module during the 57.a1 cycle and continuing into 57 beta. so far those reports account for a bit over 3% of browser crashes in early data from 57.0b.

https://crash-stats.mozilla.com/search/?signature=~igdusc64.dll&product=Firefox&version=57.0b&process_type=browser&date=%3E%3D2017-06-01T&date=%3C2017-09-26T21%3A29%3A11.000Z&_sort=-date&_facets=signature&_facets=version&_facets=user_comments&_facets=adapter_vendor_id&_facets=build_id&_facets=install_time&_facets=platform_pretty_version&_facets=useragent_locale&_facets=process_type&_facets=adapter_device_id&_facets=adapter_driver_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

Adapter device id facet
Rank 	Device ID 	Count 	%
1 	0x1616 	89 	61.38 %
2 	0x0a16 	27 	18.62 %
3 	0x041e 	23 	15.86 %
4 	0x0416 	3 	2.07 %
Flags: needinfo?(milan)
Summary: Crash in igdusc64.dll | HeapFree | igdusc64.dll | HeapFree | igdusc64.dll | igd10iumd64.dll | stdext::_Hash<T>::insert → Increase in crashes with Intel's igdusc64.dll module during Firefox 57
David, maybe some devices aren't ready for AL?
Flags: needinfo?(milan) → needinfo?(dvander)
Priority: -- → P1
Whiteboard: [gfx-noted]
These are 64-bit, right?  Is the increase matching us moving more people to 64-bit automatically?
Flags: needinfo?(madperson)
the module is 64-bit only and i can't spot a 32-bit counterpart of the dll in crash stats. however i don't think that the spike in 57 is directly tied to the win64 migration. in 56.0b12 we had ~60 of these crashes, now on 57.0b3 there are already more than 1000.

i also have to correct the percentage from my comment #0 - by now the issue is responsible for around 12% of browser crashes in 57.0b3.
Flags: needinfo?(madperson)
I'm worried that we're only seeing Windows 7 crashes because it doesn't support the GPU process, and we don't get GPU process crash reports on beta/release. Is there any way to see whether these crashes are also occurring in Telemetry, for Windows 10?
Flags: needinfo?(madperson)
hi marco, do you know if telemetry can offer an insight here? (i don't have access to telemetry data...)
Flags: needinfo?(madperson) → needinfo?(mcastelluccio)
(In reply to Milan Sreckovic [:milan] from comment #1)
> David, maybe some devices aren't ready for AL?

Good call. The Intel driver is crashing trying to compile the "TexturedVertex" shader in Advanced Layers, which is not very common and therefore initialized lazily. (It gets used with masks and 3d transforms.) I don't know whether that's just the first bad shader to get initialized or what, but the adapters in comment #0 are ~14% of our population and kicking them back to the D3D11 compositor would be a real pain given that we want to delete it.

Milan, looking at the inventory page [1], there might be a machine or two in Toronto with 0x0416 or 0x0412 adapters. Do you know if those exist and if so, what OS they run?

[1] https://wiki.mozilla.org/QA/Platform/Graphics/Inventory
Assignee: nobody → dvander
Status: NEW → ASSIGNED
Flags: needinfo?(dvander) → needinfo?(milan)
I have a desktop machine with 0x0412 but it doesn't seem to have Windows on it. I haven't been able to find other machines that have Haswell GPUs yet.
I found a hard drive in that machine that has win10 on it.
(In reply to David Anderson [:dvander] from comment #4)
> I'm worried that we're only seeing Windows 7 crashes because it doesn't
> support the GPU process, and we don't get GPU process crash reports on
> beta/release. Is there any way to see whether these crashes are also
> occurring in Telemetry, for Windows 10?

Why don't we get GPU process crash reports on beta/release?

We can't look for the specific signatures, but we can see whether there's an increase in crashes with Intel graphics cards, if that's interesting.
Flags: needinfo?(mcastelluccio)
(In reply to Marco Castelluccio [:marco] from comment #9)
> (In reply to David Anderson [:dvander] from comment #4)
> > I'm worried that we're only seeing Windows 7 crashes because it doesn't
> > support the GPU process, and we don't get GPU process crash reports on
> > beta/release. Is there any way to see whether these crashes are also
> > occurring in Telemetry, for Windows 10?
> 
> Why don't we get GPU process crash reports on beta/release?
> 
> We can't look for the specific signatures, but we can see whether there's an
> increase in crashes with Intel graphics cards, if that's interesting.

Because there is no way to ask the user to submit reports. They have to happen to visit about:crashes. There's no UI for when the GPU process crashes (by design), but privacy limitations prevent us from submitting anything automatically.
Working theory. If you facet these crashes on "cpu info" we see:
Rank 	CPU info 	Count 	%
1 	family 6 model 61 stepping 4 | 4 	88 	61.54 %
2 	family 6 model 60 stepping 3 | 4 	26 	18.18 %
3 	family 6 model 69 stepping 1 | 4 	26 	18.18 %
4 	family 6 model 55 stepping 8 | 2 	1 	0.70 %
5 	family 6 model 60 stepping 3 | 8 	1 	0.70 %
6 	family 6 model 61 stepping 4 | 3 	1 	0.70 %

Which is a pretty narrow population of CPUs. Furthermore, the crashing instruction is always:
   000007FEF6E0FED5  vpunpckldq  xmm1,xmm1,xmm0

That's an AVX instruction. The CPUs above support AVX, but according to the Intel docs, the operating system is responsible for enabling AVX by setting bits 2:1 in the XCR0 control register. Windows does this only on Windows 7 SP1 and higher. If the bits are not set, executing an AVX instruction raises a #UD (invalid opcode) exception.

If we facet on platform version:
Rank 	Platform version 	Count 	%
1 	6.1.7600 	138 	96.50 %
2 	6.1.7601 Service Pack 1 	4 	2.80 %
3 	10.0.15063 	1 	0.70 %

The 5 crash reports associated with Windows 7 SP1 and Windows 10 look totally unrelated.

So, my conclusion is that this is an Intel driver bug specific to x64 on Windows 7 without SP1. It does not correctly check for AVX support, probably assuming that the CPUID check is enough.
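
The check that the driver appears to skip can be sketched as follows. This is a minimal illustration in Python, not the driver's or Firefox's actual code: the function name and the idea of passing raw register values as integers are assumptions made so the logic can be shown without platform-specific intrinsics. The bit positions are the documented ones: CPUID leaf 1 ECX bit 28 (AVX) and bit 27 (OSXSAVE), and XCR0 bits 1 (SSE state) and 2 (AVX state).

```python
# Documented bit positions (Intel SDM).
CPUID_ECX_OSXSAVE = 1 << 27  # OS has enabled XSAVE, so XCR0 is readable
CPUID_ECX_AVX = 1 << 28      # CPU implements AVX
XCR0_SSE_STATE = 1 << 1      # OS saves/restores XMM state
XCR0_AVX_STATE = 1 << 2      # OS saves/restores upper YMM state


def avx_usable(cpuid_leaf1_ecx: int, xcr0: int) -> bool:
    """Return True only if both the CPU and the OS support AVX.

    Hypothetical helper: takes raw register values as plain integers.
    """
    if not cpuid_leaf1_ecx & CPUID_ECX_AVX:
        return False  # CPU has no AVX at all
    if not cpuid_leaf1_ecx & CPUID_ECX_OSXSAVE:
        return False  # OS never enabled XSAVE, so AVX state is not managed
    needed = XCR0_SSE_STATE | XCR0_AVX_STATE
    return (xcr0 & needed) == needed  # both bits 2:1 must be set
```

A caller that stops after the first test (the CPUID AVX bit) would report AVX as available on the machines described above, which matches the suspected driver behavior.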
Flags: needinfo?(milan)
The next question is why this spiked in 57. I don't really have any theories. It's possible Advanced Layers exacerbated it, but I also suspect this problem has existed forever and we just didn't have as many x64 users.

Crashes on the old D3D11 compositor that are clearly this bug:
https://crash-stats.mozilla.com/report/index/21b60865-347b-4b0a-9bc2-320ec0170518
https://crash-stats.mozilla.com/report/index/c65e23a1-c78a-4006-96a5-851d50170518
https://crash-stats.mozilla.com/report/index/e56b5bce-c650-40e4-b0e6-6f9950170613

etc... some of these go back to Firefox 42. Crash reports on Windows 7 SP1 look very different.

Without having STR and without knowing what causes the driver to go down this path... I'm proposing a blanket D3D11 compositor ban if the following things are all true:
 1. The architecture is AMD64.
 2. CPUID reports AVX support.
 3. XCR0 and CR4 report that the kernel does not support AVX.
 4. The adapter is an Intel device (any model/driver).

We could potentially narrow it down further, but this is a good place to start. I'll do a quick Telemetry analysis to see how many users that would affect.
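
The four conditions above combine into a single predicate. The sketch below is only an illustration of that logic, not the landed patch: the `SystemInfo` record and its field names are hypothetical, and the only external fact used is Intel's PCI vendor ID, 0x8086.

```python
from dataclasses import dataclass

INTEL_VENDOR_ID = 0x8086  # PCI vendor ID for Intel


@dataclass
class SystemInfo:
    # Hypothetical record; field names are illustrative only.
    arch: str               # e.g. "x86_64" or "x86"
    cpu_has_avx: bool       # CPUID reports AVX support
    os_enabled_avx: bool    # XCR0/CR4 show the kernel enabled AVX
    adapter_vendor_id: int  # PCI vendor ID of the graphics adapter


def should_block_d3d11(info: SystemInfo) -> bool:
    """Block the D3D11 compositor only when all four conditions hold."""
    return (
        info.arch == "x86_64"
        and info.cpu_has_avx
        and not info.os_enabled_avx
        and info.adapter_vendor_id == INTEL_VENDOR_ID
    )
```

Requiring all four conditions keeps the ban narrow: a system with full AVX support, a 32-bit build, or a non-Intel adapter is left on D3D11.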
That seems reasonable. We should also reach out to Intel and ask them what's going on.
This would affect, roughly guessing, an upper bound of 0.7% of Windows users:

   6% of Windows users are on Windows 7 pre-SP1.
  70% of those users have an Intel driver.
  53% of those users are currently getting a D3D11 compositor.
  44% of those users have 64-bit Windows.
  74% of those users have an AVX-capable processor but no AVX OS support.
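
For reference, the upper bound is just the product of the five shares listed above. The short check below assumes, as the estimate does, that each percentage applies to the population left by the previous line; the variable names are illustrative.

```python
import math

# Shares from the list above, each applied to the previous subset.
shares = [
    0.06,  # Windows 7 pre-SP1
    0.70,  # Intel driver
    0.53,  # currently getting a D3D11 compositor
    0.44,  # 64-bit Windows
    0.74,  # AVX-capable CPU without OS AVX support
]

upper_bound = math.prod(shares)
print(f"{upper_bound:.2%}")  # prints 0.72%
```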
Attached patch: patch
Attachment #8913028 - Flags: review?(jmuizelaar)
Bas mentioned he would try to get in touch with Intel, setting ni? just as a reminder.
Flags: needinfo?(bas)
Attachment #8913028 - Flags: review?(jmuizelaar) → review+
Intel has been contacted.
Flags: needinfo?(bas)
This reminds me of bug 1225094, where the bug was due to the Microsoft boot loader in dual boot scenarios (with different versions of Windows).
See Also: → 1225094
(In reply to Marco Castelluccio [:marco] from comment #19)
> This reminds me of bug 1225094, where the bug was due to the Microsoft boot
> loader in dual boot scenarios (with different versions of Windows).

Note that if this is the same situation, the solution from comment 12 will not fix the problem.
I guess we will know when we land this.
Comment on attachment 8913213 [details]
Bug 1403353: Enable OMTP by default on windows only.

I put this on the wrong bug.
Attachment #8913213 - Attachment is obsolete: true
Attachment #8913213 - Flags: review?(dvander)
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/79580c3ab338
Block D3D11 when using Intel drivers on Windows 7 systems with partial AVX support. (bug 1403353, r=jrmuizel)
OS: Windows → Windows 7
https://hg.mozilla.org/mozilla-central/rev/79580c3ab338
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
We plan to automatically migrate eligible Windows users running 32-bit Firefox to 64-bit Firefox 56. To avoid causing Haswell users to crash by migrating them to 64-bit, we would like to uplift this fix to a 56.0.1 dot release after the fix has baked on Nightly for a few days.

If we can't uplift this fix to 56.0.1, then we can exclude Windows 7 users without SP1 from 64-bit migration to avoid this crash. We can migrate those users later when this fix rides to the Release channel (in Firefox 57 or 58).
[Tracking Requested - why for this release]: See comment 24.
could you please request uplift for beta if you deem fit to do so? the last recorded crash on nightly was on build 20170927100120...
Flags: needinfo?(dvander)
Comment on attachment 8913028 [details] [diff] [review]
patch

Approval Request Comment
[Feature/Bug causing the regression]: Multiple factors; D3D11 changes, x64 migration
[User impact if declined]: Sporadic crashes on older versions of Windows 7
[Is this code covered by automated tests?]: N/A
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: N/A
[Is the change risky?]: No
[Why is the change risky/not risky?]: It's just a blocklist entry for broken drivers, and we've made it surgical to affect as few users as possible.
[String changes made/needed]:
Flags: needinfo?(dvander)
Attachment #8913028 - Flags: approval-mozilla-beta?
Comment on attachment 8913028 [details] [diff] [review]
patch

Approval Request Comment
We're currently holding back 64-bit auto-upgrades away from Win 7 Pre-SP1 users because of this problem.  We got a request to make this available for 56.0.1 so that we can upgrade everybody.
The patch has stopped crashes on nightly since it landed and looks stable.
Attachment #8913028 - Flags: approval-mozilla-release?
Comment on attachment 8913028 [details] [diff] [review]
patch

promising data from nightly that this fix works as expected, beta57+
Attachment #8913028 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
So far there is only 1 crash in the last week for 56 across all these signatures. That seems unusual. Are you sure we need this on 56 release?
Flags: needinfo?(dvander)
That is unusual. There are crashes for, say, 56 beta but not release. Anyway, I'm not sure we need it there; if it's not a problem then there's no need to uplift.
Flags: needinfo?(dvander)
To get a sense of the relative scale of how many Release channel users might be affected by this Haswell crash if we migrate 100% of eligible users, I searched for the number of crash reports (any crash signature from the last six months) from eligible Beta and Release users, i.e. currently running 32-bit Firefox on Windows 7 pre-SP1 with more than 2GB RAM.

Over the last six months, there were 369,180 Beta and 388,450 Release crash reports from these particular users. That suggests that we have roughly the same number of Beta and Release users currently running 32-bit Firefox on Windows 7 pre-SP1 with more than 2GB RAM. So we should expect roughly the same number of Haswell crashes from migrated Release users as we saw from migrated Beta users. We saw 7,168 crashes from Beta 56 and 57 over the last seven days, so we will probably see just 72 crashes in the first week after migrating 1% of eligible Release users (if we don't uplift this fix or exclude Win7 pre-SP1 from migration).
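
The arithmetic behind that estimate, written out as a rough check. The variable names are illustrative, and the calculation carries the same assumption as the paragraph above: that the Release population behaves like the Beta population.

```python
# Crash report counts over six months from eligible users (from above).
beta_reports = 369_180
release_reports = 388_450

# Release has about 5% more reports than Beta, so the two
# populations are treated as roughly equal in size.
population_ratio = release_reports / beta_reports

# Haswell crashes seen from Beta 56 and 57 over seven days.
beta_haswell_crashes = 7_168
migrated_fraction = 0.01  # migrating 1% of eligible Release users

expected_release_crashes = beta_haswell_crashes * migrated_fraction
print(round(expected_release_crashes))  # prints 72
```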
OK. So it's not a ton of users, but on the other hand, this won't affect future dot releases (it would be incorporated into them) and it also won't affect our watershed. 56.0 will remain the watershed for the migration. So there isn't any big reason *not* to uplift. If we decide to increase the % of users who migrate during the 56 cycle, then it will be totally worth it.
Comment on attachment 8913028 [details] [diff] [review]
patch

Taking this for the planned dot release for 64-bit migration rollout.
Attachment #8913028 - Flags: approval-mozilla-release? → approval-mozilla-release+