Closed Bug 1832582 Opened 1 year ago Closed 11 months ago

Crash in [@ js::jit::AnalyzeBytecodeForIon]

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

x86
Windows 7
defect

Tracking


RESOLVED FIXED
115 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox113 --- unaffected
firefox114 + fixed
firefox115 --- fixed

People

(Reporter: pascalc, Assigned: jandem)

Details

(Keywords: crash, topcrash, topcrash-startup)

Crash Data

Attachments

(4 files)

Crash report: https://crash-stats.mozilla.org/report/index/5127780f-32a4-4fa5-abed-4fdbb0230511

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0  xul.dll  js::jit::AnalyzeBytecodeForIon  js/src/jit/BytecodeAnalysis.cpp:253
1  xul.dll  js::jit::JitScript::ensureHasCachedIonData  js/src/jit/JitScript.cpp:465
2  xul.dll  WarpScriptOracle::createScriptSnapshot  js/src/jit/WarpOracle.cpp:305
3  xul.dll  js::jit::WarpOracle::createSnapshot  js/src/jit/WarpOracle.cpp:154
4  xul.dll  js::jit::CreateWarpSnapshot  js/src/jit/Ion.cpp:1586
4  xul.dll  js::jit::IonCompile  js/src/jit/Ion.cpp:1656
4  xul.dll  js::jit::Compile  js/src/jit/Ion.cpp:1861
5  xul.dll  BaselineCanEnterAtEntry  js/src/jit/Ion.cpp:1993
5  xul.dll  IonCompileScriptForBaseline  js/src/jit/Ion.cpp:2117
5  xul.dll  js::jit::IonCompileScriptForBaselineAtEntry  js/src/jit/Ion.cpp:2144

This crash has been spiking since 114.0b2.

This is weird.

In the last week, there have been 39 crashes, but only 5 installs. The installs all seem to be on different machines (different CPU info, etc). I looked at one crash from each install, and they all crashed in the same way:

5ed527e0:       8b 51 08                mov    0x8(%ecx),%edx
5ed527e3:       8d 4a 21                lea    0x21(%edx),%ecx
5ed527e6:       8b 72 04                mov    0x4(%edx),%esi
5ed527e9:       85 f6                   test   %esi,%esi
5ed527eb:       74 37                   je     0x5ed52824
5ed527ed:       01 ce                   add    %ecx,%esi
5ed527ef:       31 d2                   xor    %edx,%edx
5ed527f1:       eb 1b                   jmp    0x5ed5280e
5ed527f3:       b0 01                   mov    $0x1,%al
^^^^^^^^^       ^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^
5ed527f5:       66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%eax,%eax,1)
5ed527fc:       00 00 00 00
5ed52800:       0f b6 3c fd 64 2a c3    movzbl 0x62c32a64(,%edi,8),%edi
5ed52807:       62
5ed52808:       01 f9                   add    %edi,%ecx

The mov is a legal instruction, but in every case we're actually executing one byte earlier (in this case 0x5ed527f2), so we start decoding with the last byte of the previous instruction. This gives 1b b0 01 66 66 2e sbb esi,DWORD PTR [eax+0x2e666601], which (combined with the value of 0x175bb101 in eax) reproduces the crash address, 0x45c21702.
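As a sanity check on that decoding, the faulting address falls straight out of the arithmetic. This is a quick sketch; the register value 0x175bb101 comes from the crash report quoted above:

```python
# The misaligned decode yields: sbb esi, DWORD PTR [eax + 0x2e666601].
# With eax = 0x175bb101 (from the minidump), the effective address of
# that load should equal the reported crash address.
eax = 0x175BB101           # value of %eax at crash time (from the report)
displacement = 0x2E666601  # displacement from the misdecoded sbb instruction
effective_address = (eax + displacement) & 0xFFFFFFFF  # 32-bit wraparound
print(hex(effective_address))  # 0x45c21702, the reported crash address
```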

The disassembly after the crashing instruction also looks like nonsense. It doesn't look like it's meant to be actual code.

Which raises the question: how did we get here? Either we were misaligned while executing the previous instructions (and interpreted the jmp as something else), or we jumped directly to the crashing instruction. I can't find a misalignment that removes the jump while preserving the sbb interpretation, so I think we must have jumped directly here. But that's a weird thing to do once, much less in five independent installs.

This crash is in compiled C++ code, not jit code. It's been a year since we touched this code last, so I don't think the spike is related to code we landed. Have we updated the build compiler lately? Is it possible that this is a bug in clang?

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on beta (startup)

For more information, please visit BugBot documentation.

The bug is marked as tracked for firefox114 (beta). However, the bug still isn't assigned.

:sdetar, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(sdetar)

(In reply to Iain Ireland [:iain] from comment #1)

Have we updated the build compiler lately? Is it possible that this is a bug in clang?

Some recent-ish Clang-related changes: bug 1823591, bug 1829262, bug 1829771. It's weird because the stacks look reasonable and I'd expect a miscompilation in this function to affect more installs.

Crashes are Windows 7 and 8.1, both parent and content processes, I don't see anything obvious in the crash reports.

Seeing a lot of GenuineIntel family 6 model 54 stepping 1 in the crash reports.

Cedarview, that's a new one as far as cursed Atom chips go.

Looking at the list of errata for Atom D2000/N2000 here, it has:

BQ16 An Indirect Jump Instruction May Execute to An Incorrect Target

Problem: An indirect jump instruction may execute incorrectly to the target from a previous instance of the jump instead of the correct target. This may occur only when the indirect jump is part of a short loop.
Implication: Due to this erratum, an indirect jump may cause unpredictable system behavior.

I haven't looked at the assembly code but this function does have a very short loop with probably an indirect jump, so it might be related.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #5)

Seeing a lot of GenuineIntel family 6 model 54 stepping 1 in the crash reports.

Oh, wow, yeah. I could have sworn that I saw a variety of CPUs when I looked at it yesterday, but 62/64 crashes in the last week were that particular chip.

So one hypothesis is that we updated clang and it is now emitting code that triggers this hardware bug more often. I think the timing lines up: the clang patches landed in 114, and all the failures are in 114.0b2 and 114.0b3.

Making this an S2 bug, since it is a top crash.

Assignee: nobody → jdemooij
Severity: -- → S2
Flags: needinfo?(sdetar)
Priority: -- → P1

I think at this point, the function only cares about the uses-environment-chain flag. We might be able to rewrite the code to be based on a table lookup instead of a switch / indirect jump. I can try that and then we can see if it affects the crash volume here.

Flags: needinfo?(jdemooij)

Now that the analysis is only used for the uses-environment flag, we can
simplify it a bit. This is also faster because we can return immediately
when we know the answer.

Depends on D178650

This eliminates the indirect jump in ScriptUsesEnvironmentChain, which will
hopefully work around the Atom CPU issue reported in the bug report.

Depends on D178651

This shouldn't affect behavior: for JSOp::Arguments this replaces a similar check
in WarpOracle, and generators have their own environment object, which causes us
to return true from ScriptUsesEnvironmentChain.

Depends on D178652

I posted some patches to rewrite how this function works. These changes should be safe to backport.

Flags: needinfo?(jdemooij)
Status: NEW → ASSIGNED
Pushed by jdemooij@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6a2f51d48ac8
part 1 - Remove unused IonBytecodeInfo::modifiesArguments flag. r=iain
https://hg.mozilla.org/integration/autoland/rev/47afe93b46c8
part 2 - Change AnalyzeBytecodeForIon to ScriptUsesEnvironmentChain. r=iain
https://hg.mozilla.org/integration/autoland/rev/ed8285b6542c
part 3 - Add JOF_USES_ENV and use it in ScriptUsesEnvironmentChain. r=iain
https://hg.mozilla.org/integration/autoland/rev/64b2487b96b9
part 4 - Add JOF_USES_ENV to JSOp::Arguments and JSOp::Generator. r=iain

Should we include those patches in our upcoming RC2 tomorrow? Could we get an uplift request if this is safe to backport? Thanks!

Flags: needinfo?(jdemooij)

Comment on attachment 9335210 [details]
Bug 1832582 part 1 - Remove unused IonBytecodeInfo::modifiesArguments flag. r?iain!

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes with certain Intel Atom CPUs. This patch stack likely works around this CPU issue.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Changes are mostly mechanical and code is covered by many tests.
  • String changes made/needed:
  • Is Android affected?: No
Flags: needinfo?(jdemooij)
Attachment #9335210 - Flags: approval-mozilla-beta?
Attachment #9335212 - Flags: approval-mozilla-beta?
Attachment #9335213 - Flags: approval-mozilla-beta?
Attachment #9335214 - Flags: approval-mozilla-beta?

Comment on attachment 9335210 [details]
Bug 1832582 part 1 - Remove unused IonBytecodeInfo::modifiesArguments flag. r?iain!

Fx114 is now in RC, moving this uplift request to release.

Attachment #9335210 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Attachment #9335212 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Attachment #9335213 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Attachment #9335214 - Flags: approval-mozilla-beta? → approval-mozilla-release?

Comment on attachment 9335210 [details]
Bug 1832582 part 1 - Remove unused IonBytecodeInfo::modifiesArguments flag. r?iain!

New 114 crash with some volume, low risk backport, approved for our RC3 as a ride-along, thanks.

Attachment #9335210 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #9335212 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #9335213 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #9335214 - Flags: approval-mozilla-release? → approval-mozilla-release+