Closed Bug 1524257 Opened 1 year ago Closed 1 month ago

Crash in js::frontend::NameOpEmitter::emitAssignment on Intel CPU family 6 model 122 stepping 1

Categories

(Core :: JavaScript Engine, defect, P1)

65 Branch
Unspecified
Windows 10
defect

Tracking

RESOLVED FIXED
mozilla77
Tracking Status
firefox-esr68 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox74 --- wontfix
firefox75 --- wontfix
firefox76 --- wontfix
firefox77 --- fixed

People

(Reporter: marcia, Assigned: jorendorff)

Details

(Keywords: crash, regression, regressionwindow-wanted)

Attachments

(1 file)

This bug is for crash report bp-902fcf13-a14d-48ef-952d-76e460190130.

Seen while looking at 65 crash stats: https://bit.ly/2HRB10S. This content crash appeared in 65rc2 and carried into release. A few crashes in 66 nightly and 64 in the last month.

Some comments mention crashing when they open Firefox.

Top 10 frames of crashing thread:

0 xul.dll js::frontend::NameOpEmitter::emitAssignment js/src/frontend/NameOpEmitter.cpp:241
1  @0xde761fc6ef 
2 xul.dll js::frontend::BytecodeEmitter::emitCatch js/src/frontend/BytecodeEmitter.cpp:4461
3 xul.dll js::frontend::BytecodeEmitter::emitTree js/src/frontend/BytecodeEmitter.cpp:8809
4 xul.dll js::frontend::BytecodeEmitter::emitLexicalScope js/src/frontend/BytecodeEmitter.cpp:4710
5 xul.dll js::frontend::BytecodeEmitter::emitTree js/src/frontend/BytecodeEmitter.cpp:9074
6 xul.dll js::frontend::BytecodeEmitter::emitTry js/src/frontend/BytecodeEmitter.cpp:4523
7 xul.dll js::frontend::BytecodeEmitter::emitTree js/src/frontend/BytecodeEmitter.cpp:8803
8 xul.dll js::frontend::BytecodeEmitter::emitTree js/src/frontend/BytecodeEmitter.cpp:8857
9 xul.dll js::frontend::BytecodeEmitter::emitLexicalScope js/src/frontend/BytecodeEmitter.cpp:4710

Bug 1308744 has some similarities in the stack, but instead of emitAssignment it is emitFunction and emitTree.

So far the crashes are hitting only installations on a particular CPU model (Intel family 6 model 122), so chances are that this is a build-specific issue that might go away again in another version.

No crashes in 65.0.1 yet, but it is still early.

The crash pattern appears to be returning in 65.0.2 :-/

As philipp notes, the recent crashes in 65.0.2 are 100% correlated to CPU Info = family 6 model 122 stepping 1:

(100.0% in signature vs 00.89% overall) CPU Info = family 6 model 122 stepping 1

Only 2 crashes in 66 release, but 6000+ in 65.0.2. I wonder if this is a compiler issue - or maybe something was fixed between 65.0.2 and 66.0.

QA Whiteboard: [qa-regression-triage]

Very low volume post-55, no leads. -> P5.

Priority: -- → P5
See Also: → 1553380

This crash shows up in 67 release, but hasn't shown up in either 67.0.1 or 67.0.2. There have only been a few crashes in 68 beta so far.

Summary: Crash in js::frontend::NameOpEmitter::emitAssignment → Crash in js::frontend::NameOpEmitter::emitAssignment on Intel CPU family 6 model 122 stepping 1
Assignee: nobody → jorendorff
Priority: P5 → P1

I will add __aligned__(32) to this function and set it as release-tracking.

Of course this is no solution. It's a CPU bug, the same bug as bug 1553380 and bug 1578722 and Chromium bug 968683. I don't see how we could really fix this except by upstreaming a patch to LLVM codegen.

Mike, any ideas?

Interestingly enough, Chromium recently reverted their workaround citing falling crash numbers: https://bugs.chromium.org/p/chromium/issues/detail?id=968683#c94

That gives me hope that this class of crashes will become less prevalent over time rather than an endless whack-a-mole, so in the meantime a one-off workaround for emitAssignment sounds reasonable.

I would like to track this down in Intel documentation, but it's a little tricky.

The first question is how to map "CPU Info: family 6 model 122 stepping 1" (from our crash reports, maybe from here) to a specific CPUID value. My best guess is that this means:

  • Extended Family (bits 27:20) = 00000000
  • Extended Model (bits 19:16) = 0111
  • reserved bits 15:14 = 00
  • Processor Type (bits 13:12) = 00
  • Family Code (bits 11:8) = 0110
  • Model Number (bits 7:4) = 1010 (so that with the extended model we get 0b0111_1010 which is 122)
  • Stepping ID (bits 3:0) = 0001

This gives a CPUID of 0b0000_0000_0111_0000_0110_1010_0001 or 0x706a1.
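The bit packing above can be checked with a short sketch. The function names `encode_signature` and `decode_signature` are made up for illustration, and the code assumes the usual CPUID leaf 1 EAX layout in which the extended model bits are only combined with the model number for family 6 and family 15. It is not taken from the crash reporter.

```python
from typing import Tuple


def encode_signature(family: int, model: int, stepping: int) -> int:
    """Pack a display family/model/stepping into a CPUID leaf 1 EAX value.

    Sketch only: handles display families below 15 and processor type 0.
    """
    if not 0 <= family < 15:
        raise ValueError("this sketch only handles display families below 15")
    if not 0 <= model <= 0xFF or not 0 <= stepping <= 0xF:
        raise ValueError("model or stepping out of range")
    ext_model = model >> 4
    if ext_model and family != 6:
        raise ValueError("extended model bits are only used here for family 6")
    return (ext_model << 16) | (family << 8) | ((model & 0xF) << 4) | stepping


def decode_signature(eax: int) -> Tuple[int, int, int]:
    """Unpack a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    model_id = (eax >> 4) & 0xF
    family_id = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended fields only contribute for particular base family codes.
    family = family_id + ext_family if family_id == 0xF else family_id
    model = (ext_model << 4) | model_id if family_id in (0x6, 0xF) else model_id
    return family, model, stepping


if __name__ == "__main__":
    print(hex(encode_signature(6, 122, 1)))
    print(decode_signature(0x706A1))
```

Under these assumptions, `encode_signature(6, 122, 1)` yields 0x706a1 and `decode_signature(0x706A1)` returns `(6, 122, 1)`, matching the hand calculation.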

Then there is the problem of how to map a CPUID to a particular Intel "spec update" document i.e. CPU errata sheet.

Intel's spec updates are published in PDFs. I don't know of any way to find a particular PDF by CPUID.

The documents are named using the marketing name of the processor (for example, "Second Generation Intel® Xeon® Scalable Processors Specification Update").

However, within the document, there's a section that purportedly explains what CPUID values are covered by the document. That particular document does not appear to be the one for CPUID 0x706a1.

(In reply to :dmajor from comment #10)

> Interestingly enough, Chromium recently reverted their workaround citing falling crash numbers: https://bugs.chromium.org/p/chromium/issues/detail?id=968683#c94

Yes, clearly something perturbed compiled code addresses across everything compiled for Windows, causing the bug to happen in Firefox instead. ;-) :-\

IIRC this class of crashes first came up around the time of a microcode update (*) so it might not necessarily be a silicon bug. Not sure if that means it wouldn't appear in the errata sheets. And declining volume could be explained by a fix rolling out.

(*) I'm aware that we have microcode annotations in the crash reports and they're all over, but the correlation is too strong to ignore. I'm still half-convinced that some update rolled out that isn't reflected in the registry key that our annotations inspect.

> This gives a CPUID of 0b0000_0000_0111_0000_0110_1010_0001 or 0x706a1.

FWIW WinDbg reports these as 6,10,1 which, assuming the extended model bits are implied, agrees with your value.

jcristau found the Intel® Pentium® Silver and Celeron® Processors Spec Update which contains this item:

035 : Unexpected #PF, #GP, #UD, Or Other Unpredictable System Behavior May Occur

Problem: Under complex microarchitectural conditions, incorrect instruction bytes may be used for code with linear addresses bits 5:4 = 10b.

Implication: When this erratum occurs, unpredictable system behavior may occur. This unpredictable behavior often results in an unexpected #PF, #GP or #UD exception which causes an application to unexpectedly close.

Workaround: It is possible for BIOS to contain a workaround for this erratum.

There is no fix and no recommended workaround for software vendors.

No fix from Intel's perspective, but if "It is possible for BIOS to contain a workaround for this erratum" that would appear as a fix from our end.

In the week since 68.6.1esr and 74.0.1 shipped, crashes with this signature
have spiked. They occur only on family 6 model 122 stepping 1 CPUs. This patch
ports a workaround that landed in V8 to address what looks like the same CPU
bug.

In short, the crash happens only in functions that start on addresses that end
with 10, 50, 90, or d0. Aligning the function to a 32-byte boundary rules out
such addresses. See https://crbug.com/968683 for more information.
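The alignment argument can be sanity-checked with a little address arithmetic. This is a sketch only: `is_suspect_start` and `align_up` are hypothetical helper names, and the set of suspect low bytes is copied from the description above rather than from any Intel document.

```python
# Low address bytes named in the commit message as the only ones that crash.
SUSPECT_LOW_BYTES = (0x10, 0x50, 0x90, 0xD0)


def is_suspect_start(addr: int) -> bool:
    """True if a function starting at addr ends in one of the suspect bytes."""
    return (addr & 0xFF) in SUSPECT_LOW_BYTES


def align_up(addr: int, alignment: int = 32) -> int:
    """Round addr up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (addr + alignment - 1) & ~(alignment - 1)


if __name__ == "__main__":
    # Every suspect low byte is 16 bytes past a 32-byte boundary.
    print([hex(b % 32) for b in SUSPECT_LOW_BYTES])
    # No 32-byte-aligned start address in this sample range is suspect.
    print(any(is_suspect_start(a) for a in range(0, 0x10000, 32)))
```

Each suspect byte leaves a remainder of 16 when divided by 32, so a function forced onto a 32-byte boundary can never start at one of them. Plain 16-byte alignment would not give that guarantee.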

Pushed by jorendorff@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cb00f09b615c
Work around apparent Intel CPU bug. r=tcampbell.
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77

Should we uplift this to Beta/ESR just to play it safe?

Flags: needinfo?(jorendorff)

No, because the spike went down before we even landed the patch. I think there must have been some kind of microcode update that made this a lot less likely to hit.

I could be convinced otherwise, but my tendency is to do nothing further here.

Flags: needinfo?(jorendorff)