Open Bug 1882796 Opened 11 months ago Updated 11 months ago

Crash in [@ js::jit::MBasicBlock::isMarked]

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

Other
Windows 11
defect

Tracking

()

Tracking Status
firefox125 --- affected

People

(Reporter: release-mgmt-account-bot, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, stalled)

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/fe015da6-26b7-4b0b-ac5d-2b3bb0240223

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0  xul.dll  js::jit::MBasicBlock::isMarked const  js/src/jit/MIRGraph.h:471
0  xul.dll  js::jit::PruneUnusedBranches  js/src/jit/IonAnalysis.cpp:393
0  xul.dll  js::jit::OptimizeMIR  js/src/jit/Ion.cpp:999
1  xul.dll  js::jit::CompileBackEnd  js/src/jit/Ion.cpp:1605
2  xul.dll  js::jit::IonCompileTask::runTask  js/src/jit/IonCompileTask.cpp:52
2  xul.dll  js::jit::IonCompileTask::runHelperThreadTask  js/src/jit/IonCompileTask.cpp:30
3  xul.dll  js::GlobalHelperThreadState::runTaskLocked  js/src/vm/HelperThreads.cpp:1728
3  xul.dll  js::GlobalHelperThreadState::runOneTask  js/src/vm/HelperThreads.cpp:1689
4  xul.dll  JS::RunHelperThreadTask  js/src/vm/HelperThreads.cpp:1684
4  xul.dll  HelperThreadTaskHandler::Run  js/xpconnect/src/XPCJSContext.cpp:1118

By querying Nightly crashes reported within the last 2 months, here are some insights about the signature:

  • First crash report: 2023-12-21
  • Process type: Content
  • Is startup crash: No
  • Has user comments: No
  • Is null crash: Yes - all crashes happened on null or near null memory address

The Bugbug bot thinks this bug should belong to the 'Core::JavaScript Engine: JIT' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: General → JavaScript Engine: JIT

I grabbed the last week of crashes. 6/10 crashes in isMarked were in PruneUnusedBranches. All of those claim to crash on this line. This is a little interesting, because there are several calls to isMarked, but not that interesting, because it's the first time we traverse the graph, so it would run into any data corruption first. 5/6 are null or near-null. The sixth crashes with a non-canonical address, with an obvious bitflip in the high bits of a pointer.

Looking at the crashing disassembly, there are a few different patterns, but this one shows up four times:

cmp    %rdx,%rax
jne    0x7ffb1f23098b
mov    0x80(%rcx),%rbx
cmpb   $0x0,0xc4(%rbx) <- failing instruction
jne    0x7ffb95f24ead
movb   $0x1,0xc4(%rbx)
cmpb   $0x0,0x11(%rbx)

In each case, rdx and rax are some pointer value, and they're always equal (as expected, since otherwise we would take the branch). I'm pretty sure that 0xc4 is the offset of the isMarked boolean inside MBasicBlock: it doesn't quite match up with my debug build, but it's in the right ballpark, and there are some debug-only members. I'm pretty sure this code and the following code disassembles to

      if (succ->isMarked()) {
        continue;
      }
      JitSpew(JitSpew_Prune, "Reaches block %u", succ->id());
      if (!markReachable(succ)) {
        return false;
      }

where markReachable is a lambda defined above as

  auto markReachable = [&](MBasicBlock* block) -> bool {
    block->mark();
    numMarked++;
    if (block->alwaysBails()) {
      needsTrim = true;
    }
    return worklist.append(block);
  };

So that mostly confirms that the crash report isn't lying to us. We're crashing because succ is null.

The problem is that I'm a little unclear about how we're getting our hands on succ. In the source, we call MBasicBlock* succ = block->getSuccessor(i), which should be calling a virtual function on the control instruction for the block, and I can't find anything that looks like a virtual call. Staring at the code for a while, I think I've convinced myself that this is clang doing speculative devirtualization. Attempted partial decompilation:

                                           // r12 is block
7ffb95f24dfd:   cmpb   $0x0,0x11(%r12)        if block->alwaysbails
7ffb95f24e03:   jne    0x7ffb95f24dd5                continue


7ffb95f24e05:   mov    0x30(%r12),%rax       rax = block->instructions_->rBegin
7ffb95f24e0a:   lea    -0x48(%rax),%rcx      (vtable nonsense happens here?)
7ffb95f24e0e:   test   %rax,%rax             ...
7ffb95f24e11:   cmove  %rax,%rcx             ...
7ffb95f24e15:   mov    (%rcx),%rax           ...
7ffb95f24e18:   mov    0xf0(%rax),%rax       ... 
7ffb95f24e1f:   cmp    %r13,%rax             compare to a baked-in constant we loaded above
7ffb95f24e22:   je     0x7ffb95f24e34           if it matches, skip next check
7ffb95f24e24:   lea    -0x12ef0b(%rip),%rdx  otherwise, load a different constant
7ffb95f24e2b:   cmp    %rdx,%rax             if it doesn't match
7ffb95f24e2e:   jne    0x7ffb95f24fc9           jump to some code far away (failed devirt?)
7ffb95f24e34:   mov    0x30(%r12),%rax        r12 = block, rax = block->instructions_->rBegin
7ffb95f24e39:   lea    -0x48(%rax),%rcx       (more vtable nonsense?)
7ffb95f24e3d:   test   %rax,%rax              ...
7ffb95f24e40:   cmove  %rax,%rcx              ...
7ffb95f24e44:   mov    (%rcx),%rax            ...
7ffb95f24e47:   mov    0xf8(%rax),%rax        ...
7ffb95f24e4e:   lea    -0x65ff5(%rip),%rdx    load a constant
7ffb95f24e55:   cmp    %rdx,%rax              compare whatever we got
7ffb95f24e58:   jne    0x7ffb95f24f3d           jump somewhere else

7ffb95f24e5e:   mov    0x80(%rcx),%rbx        load the successor from a known offset (maybe this is MGoto?)
7ffb95f24e65:   cmpb   $0x0,0xc4(%rbx)        succ->isMarked() <- this is the failing instruction

After all this staring, nothing is jumping out at me as a problem. It's tempting to think that maybe this is a bug in clang's devirtualization, but that seems pretty unlikely. Glancing at the other crash that doesn't match this pattern, it certainly seems possible that it is one of the other branches of the devirtualization. It crashes on these instructions:

mov    (%rcx,%rdi,8),%rbx
cmpb   $0x0,0xc4(%rbx)

Those are what I would expect to see in MTest::getSuccessor. So these crashes are consistent with the hypothesis that sometimes we just have null pointers where our successors should be. It's hard to imagine what sort of bug in our code would lead to that happening at such a low frequency. Maybe we're missing some OOM handling somewhere? Otherwise it's maybe just a downstream consequence of weird bitflips in WarpBuilder.

The crash rate here is quite low. I'm going to mark this as stalled unless we can somehow reproduce this crash in a fuzzer.

Severity: -- → S4
Keywords: stalled
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.