Closed Bug 1702019 Opened 3 years ago Closed 3 years ago

Firefox 87.0 topcrash in [@ JS_WrapValue] with Intel GeminiLake (UHD Graphics 600/605)

Categories

(Core :: JavaScript Engine, defect, P1)

Unspecified
All
defect

Tracking

RESOLVED WORKSFORME
Tracking Status
firefox-esr78 --- wontfix
firefox87 + wontfix
firefox88 --- unaffected
firefox89 --- unaffected
firefox106 + affected

People

(Reporter: aryx, Unassigned)

References

Details

(Keywords: crash, topcrash)

Crash Data

[Tracking Requested - why for this release]: Top crash

This topcrash (#6 for Firefox 87.0) is new at this frequency (1.1k crashes so far). All except one are on Windows 10, and >99% are on Intel GeminiLake (UHD Graphics 600/605):

family 6 model 122 stepping 1: 1094 crashes (97.07 %)
family 6 model 122 stepping 8: 31 crashes (2.75 %)
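
As a rough illustration of how a breakdown like the one above can be computed, the sketch below parses CPU signature strings in the "family F model M stepping S" form quoted here and reports the Gemini Lake share. The function names, the sample list, and the two non-Gemini-Lake entries are hypothetical; the total of 1127 reports is inferred from the quoted counts and percentages and is not stated in the bug.

```python
import re
from collections import Counter

# Family/model pair taken from the CPU signature quoted in this bug.
GEMINI_LAKE = (6, 122)

CPU_RE = re.compile(r"family (\d+) model (\d+) stepping (\d+)")


def parse_cpu_info(text):
    """Return (family, model, stepping), or None if the string does not match."""
    match = CPU_RE.search(text)
    return tuple(int(g) for g in match.groups()) if match else None


def gemini_lake_share(cpu_infos):
    """Percentage of reports whose family/model matches Gemini Lake."""
    parsed = [parse_cpu_info(s) for s in cpu_infos]
    hits = sum(1 for p in parsed if p is not None and p[:2] == GEMINI_LAKE)
    return 100.0 * hits / len(parsed) if parsed else 0.0


def stepping_counts(cpu_infos):
    """Count Gemini Lake reports per stepping."""
    parsed = (parse_cpu_info(s) for s in cpu_infos)
    return Counter(p[2] for p in parsed if p is not None and p[:2] == GEMINI_LAKE)


# Sample built from the counts quoted above; the last entry is a
# hypothetical stand-in for the few reports from other CPUs.
reports = (
    ["family 6 model 122 stepping 1"] * 1094
    + ["family 6 model 122 stepping 8"] * 31
    + ["family 6 model 158 stepping 10"] * 2
)

print(f"Gemini Lake share: {gemini_lake_share(reports):.2f} %")
print(dict(stepping_counts(reports)))
```

With this sample the share comes out above 99%, consistent with the figure in the report.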

Only one non-87.0 crash (for 89.0a1) - might a dot release fix this?

Crash report: https://crash-stats.mozilla.org/report/index/e7d40e1a-8254-4ee2-943b-3b1f00210330

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll JS_WrapValue js/src/jsapi.cpp:656
1 xul.dll trunc 
2 xul.dll static XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1142
3 xul.dll XPC_WN_CallMethod js/xpconnect/src/XPCWrappedNativeJSOps.cpp:925
4 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:520
5 xul.dll Interpret js/src/vm/Interpreter.cpp:3243
6 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:552
7 xul.dll js::jit::DoCallFallback js/src/jit/BaselineIC.cpp:1841
8  @0x1dab9873ebe 
9 xul.dll trunc 
Flags: needinfo?(tcampbell)

78.9.0's crash rate is also elevated (78.6.0 was also affected).

https://bugs.chromium.org/p/chromium/issues/detail?id=1157639#c14
Chrome reports a recent spike in the last two weeks which matches us.

This really looks like Intel shipped some microcode update and lost the stability fix.

Flags: needinfo?(tcampbell)

This link suggests Microsoft started rolling out KB4589212 on March 10th. In the KB4589212 description, they list Gemini Lake (the stepping 1 ID), with a footnote that says:

1 Rolled back to microcode updates related to Spectre Variant 3a (CVE-2018-3640: "Rogue System Register Read (RSRE)"), Spectre Variant 4 (CVE-2018-3639: "Speculative Store Bypass (SSB)"), and L1TF (CVE-2018-3620, CVE-2018-3646: "L1 Terminal Fault")
Severity: -- → S3
Priority: -- → P3

Changing the priority to P1 as the bug is tracked by a release manager for the current release.
See What Do You Triage for more information.

Priority: P3 → P1

This crash is only happening in Release on CPUs that have hardware bugs. In the past, when we've tried to work around this, we haven't been able to have any direct impact. Without any better ideas and with the merge in a few days, I think our best option is to cross our fingers and hope that the 88.0 build does not generate code that hits the same pattern.

Went away in 88+ as expected.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME

This crash is back on Gemini Lake with Firefox 91.0.1.

See Also: → 1746270

My brother is hitting this on his laptop on current nightly (bp-61c854a8-cd3b-447e-9768-da1800221021, bp-4013e77f-f13a-411a-8746-87bbb0221021). It seems to happen on gmail and with a rather slow internet connection (one of the crashes has the background hang monitor on the stack).

Is there something that would be useful to investigate here?

Flags: needinfo?(gsvelto)

The crashes here are all coming from machines using Gemini Lake processors, which suggests a CPU bug, especially given that the stacks are all different. I don't think there's much we can do. If this is triggered by a particular code sequence then the next version of nightly should make the problem disappear from his laptop (at least until we end up with the same code sequence again).

Flags: needinfo?(gsvelto)

Gabriele, could you help us diagnose this in 106.0.4 and confirm that we are hitting the same CPU bug in 106.0?
Assuming so, the only workaround is a 106.0 dot release?

Flags: needinfo?(gsvelto)

Given the current spike I had another look. I can confirm that this is indeed a CPU bug. All the crashes are coming from machines with Gemini Lake CPUs, also known as Goldmont Plus. The crashes manifest themselves as an ACCESS_VIOLATION_WRITE exception, which requires a memory access to be triggered, but the crashing instruction is mov r15, rcx, which does not access memory and couldn't possibly cause that exception.
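
To make the register-only argument concrete, here is a minimal sketch that decodes the ModRM byte of a mov r15, rcx instruction. The raw bytes of the crashing instruction are not quoted in this bug, so the encoding used (49 89 CF, one of the two standard encodings) is an assumption, and the helper names are hypothetical. In x86-64, a ModRM mod field of 0b11 means both operands are registers, so the instruction performs no load or store.

```python
REG64 = [
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
]


def decode_modrm(rex, modrm):
    """Split a ModRM byte into (mod, reg, rm), extended by REX.R and REX.B."""
    mod = (modrm >> 6) & 0b11
    reg = ((modrm >> 3) & 0b111) | (((rex >> 2) & 1) << 3)  # REX.R is bit 2
    rm = (modrm & 0b111) | ((rex & 1) << 3)                 # REX.B is bit 0
    return mod, reg, rm


def touches_memory(mod):
    """For MOV r/m64, r64 (opcode 0x89), only mod == 0b11 is register-direct."""
    return mod != 0b11


# Assumed encoding of `mov r15, rcx`: REX.W+B prefix, opcode 0x89, ModRM 0xCF.
rex, opcode, modrm = 0x49, 0x89, 0xCF
mod, reg, rm = decode_modrm(rex, modrm)

print(f"mov {REG64[rm]}, {REG64[reg]}  memory access: {touches_memory(mod)}")
```

A fault reported on such an instruction therefore cannot come from the instruction's own operands, which is what points to the hardware rather than the generated code.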

I skimmed over the errata document for these CPUs but couldn't find a specific issue that could cause this, yet the number of issues in this core is quite large so I might have missed something.

Flags: needinfo?(gsvelto)