Crash in js::jit::BaselineScript::nativeCodeForPC

RESOLVED FIXED in Firefox 69

Status

()

defect
P3
normal
RESOLVED FIXED
2 years ago
8 days ago

People

(Reporter: 6lobe, Assigned: djvj)

Tracking

({nightly-community})

55 Branch
mozilla69
x86_64
Linux
Points:
---

Firefox Tracking Flags

(firefox60 fix-optional, firefox69 fixed)

Details

(crash signature)

Attachments

(3 attachments)

This bug was filed from the Socorro interface and is 
report bp-ea445ca1-3f30-45b3-9ddd-be58b0170527.
=============================================================

I can consistently reproduce this crash on two different computers.

Go to https://www.reittiopas.fi/ and just use the site for a while and you'll get this crash sooner or later.


I also got the following crash signature once:

js::jit::BailoutIonToBaseline
https://crash-stats.mozilla.com/report/index/b974e836-cbf5-4268-ab10-75c130170527
Has STR: --- → yes
It doesn't crash for me on Win 7.
This should be actionable.  How urgent is this, and how high of a crasher is this?  It seems pretty low-frequency.  If we can get a repro for this it should be very actionable.
Flags: needinfo?(kvijayan)
(In reply to Kannan Vijayan [:djvj] from comment #3)
> This should be actionable.

Can you detail what would be the conditions needed to reach this assertion, and how fixing this issue should be approached?
Flags: needinfo?(kvijayan)
(In reply to Paul Bone [:pbone] from comment #5)
> Could be related to running the gecko-profiler? 

Maybe! I just ran into this crash, just after capturing a profile with perf-html.io.  My crash report is bp-d16a027c-d908-49e8-a55f-1b5b10180913 .

The crash happened just after the profile was displayed -- possibly while resolving symbols.
The crash is almost definitely related to jit-coach, and native stack walking.  These crashes are basically impossible to reproduce and happen randomly.

The proper resolution is to implement bug 1426134 and get rid of the ad-hoc, extremely fragile stack-walking that explicitly keeps itself aware of the stack structure of every type of frame.  This is just extremely prone to bitrot, and crashes are very hard to diagnose.
Flags: needinfo?(kvijayan)

FYI, this crashes within a few pages of browsing typically for me with the profiler on in a Debug build, on MOZ_ASSERT(native).
Is there anything else that would help? attaching info from gdb. This is easy to make happen, if you can live with how slow a Debug build is.

Flags: needinfo?(kvijayan)

So the thing that gets us here is that there is a some PC being passed in that is not mapped to a baseline native address. The PC being looked up is generated here:

https://searchfox.org/mozilla-central/source/js/src/jit/JSJitFrameIter.cpp#633

The comment above that line:
// Certain exception handling cases such as debug OSR or resuming a generator
// with .throw() will use BaselineFrame::setOverridePc() to indicate the
// effective |pc|. We translate the effective-pc into a Baseline code
// address.

I suspect that some new situatoins have come up where we are setting overridePC here to something that is not being properly mapped by PCMap in the bytecode. My suspicions lean in the direction of generators, potentially - since those are the main new feature to be provided support in the jits after the stack-walking functionality landed.

Flags: needinfo?(kvijayan)

In Debug opt builds it still crashes, just less often, and in non-debug Opt builds (though on a null deref). The nice thing about trapping it in a Debug non-opt build is (a) it happens much faster, and b) it's easier to examine vars in the stack.

So what's the next step here?

Flags: needinfo?(sdetar)

Kannan, do you have thoughts on next steps?

Flags: needinfo?(sdetar) → needinfo?(kvijayan)

Can you catch it in rr? If you can, then we can figure out who set the bad address by setting a watchpoint on overrideOffset_ in the baseline frame and running in reverse. That should give us a pretty good idea who's to blame.

We should land bug 1511370 or, maybe better, fix bug 1437246. Then we will get these crashes and assertion failures in the JS shell.

:jesup - I've been trying to repro this on the original site mentioned, as well as others. Do you have any particular trigger actions to recommend?

Flags: needinfo?(kvijayan)

I did some digging into this today with jesup's help. Found some tempting clues. I couldn't replicate it locally but jesup caught one in gdb and dumped some info for me. The nativeCodeForPC() override address that's failing falls within the following script bytecode:

(gdb) x/54bx script->code()
0x7fe7c158b730:	0x3e	0x88	0x00	0x03	0x00	0x00	0xec	0x0c
0x7fe7c158b738:	0xeb	0x89	0x00	0x03	0x00	0x00	0x51	0x14
0x7fe7c158b740:	0x07	0x1a	0x00	0x00	0x00	0xe6	0x05	0x00
0x7fe7c158b748:	0x00	0x00	0x88	0x00	0x02	0x00	0x00	0x0c
0x7fe7c158b750:	0xb8	0x00	0x00	0x00	0x00	0x0a	0x3a	0x00
0x7fe7c158b758:	0x00	0x05	0xe6	0x08	0x00	0x00	0x00	0x88
0x7fe7c158b760:	0x01	0x05	0x00	0x00	0x70	0x99

The override pointer is: 0x7fe7c158b765 (the 0x99 on the last line).

The table below is a bytecode disassembly of the above:

BYTECODE:
0 - @0 - 0x3e - ZERO
1 - @1 - 0x88 - GETALIASEDVAR
2 - @6 - 0xec - TONUMERIC
3 - @7 - 0x0c - DUP
4 - @8 - 0xeb - DEC
5 - @9 - 0x89 - SETALIASEDVAR
6 - @14 - 0x51 - POP
7 - @15 - 0x14 - LT
8 - @16 - 0x07 - IFEQ
9 - @21 - 0xe6 - JSOP_JUMPTARGET
10 - @26 - 0x88 - GETALIASEDVAR
11 - @31 - 0x0c - DUP
12 - @32 - 0xb8 - CALLPROP
13 - @37 - 0x0a - SWAP
14 - @38 - 0x3a - CALL
15 - @41 - 0x05 - RETURN
16 - @42 - 0xe6 - JSOP_JUMPTARGET
17 - @47 - 0x88 - GETALIASEDVAR
18 - @52 - 0x70 - THROW
19 - @53 - 0x99 - RETRVAL

The method pseucode would be something along the lines of:

function () {
  if (A-- != 0) {
    B.method();
  }
  throw D;
}

Specifically, the override PC points to offset 53: the RETRVAL opcode immediately after the THROW.

Looking at the source code, the following logic seemed relevant:

https://searchfox.org/mozilla-central/source/js/src/jit/BaselineJIT.cpp#905

  // Code was not generated for this PC because BaselineCompiler believes it
  // is unreachable.
  return nullptr;

Looking at BaselineCompiler - we will not emit any mapping entries for opcodes which follow other opcodes which are not Fallthrough (of which JSOP_THROW is one).

Jesup dumped the pcmapping table for the baselineScript:

(gdb) x/38bx ((char *) script->baseline) + script->baseline->pcMappingOffset_
0x7fe7be334fd0:	0x00	0x8d	0x1c	0x81	0x46	0x81	0x01	0x0a
0x7fe7be334fd8:	0x92	0x06	0x81	0x20	0x81	0x51	0x02	0x00
0x7fe7be334fe0:	0x81	0x22	0x80	0x10	0x80	0x1c	0x81	0x2e
0x7fe7be334fe8:	0x92	0x06	0x81	0x20	0x92	0x08	0x81	0x2e
0x7fe7be334ff0:	0x80	0x0a	0x80	0x1c	0x81	0x50

And the dumps for the indexTable that indexes into that:

(gdb) x/6xw ((char *) script->baseline) + script->baseline->pcMappingIndexOffset_
0x7fe7be334fc0:	0x00000000	0x00000807	0x00000000	

I tried to parse this manually by eye several times, but my reading seems to indicate that there are more than 19 instructions being mapped? Posting temporary status for now before I work more on this.

Spoke with Ted about this today and he pointed out this bit of code: https://searchfox.org/mozilla-central/source/js/src/jit/VMFunctions.cpp#930

DebugEpilogue sets the PC to the last bytecode instruction. It's called from:

OnLeaveBaselineFrame: https://searchfox.org/mozilla-central/source/js/src/jit/JitFrames.cpp#254
called from HandleExceptionBaseline: https://searchfox.org/mozilla-central/source/js/src/jit/JitFrames.cpp#513
called from HandleException: https://searchfox.org/mozilla-central/source/js/src/jit/JitFrames.cpp#689
which is called in the exceptionTail trampoline code for baseline.

This seems like an obvious mechanism whereby we would skip generating pcmappings for the last bytecode instruction, and then have that be set as the overridePc anyway, and fail to map if the profiler manages to hit between the time the overridePC is set, and the time that the frame is popped and the proper resumePC on the caller frame set.

I have a question I can't answer easily: why are we setting the overridePC to the last instruction here?

It's temporarily being set on the frame, immediately before the frame gets popped for a return (so the assumption is that the overridePc is never observed). However, the profiler is breaking this assumption and observing the overridePc.

If the overridePc is never expected to be observed during this time-period, then can't we just set it to the original pc passed in? Or the first bytecode? Is there a specific reason why lastPC is used here?

nbp, jandem - either of you have an idea about the reasoning behind that 'lastPC' choice?

Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(jdemooij)

This is an educated stab in the dark. Imagine a man in a monocle and tophat, stabbing with a fancy walking stick of some sort.

Jesup: Since you're getting these crashes reliably and repeatedly - would you mind running with this patch applied?

Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=846756468695993d1f15be26a944a98c462bf6f8

Assignee: nobody → kvijayan

(In reply to Kannan Vijayan [:djvj] from comment #17)

I have a question I can't answer easily: why are we setting the overridePC to the last instruction here?

The overridePC is set for the exception handling trampoline (when we |return false| from DebugEpilogue). After unwinding environments we want to prevent the exception handler from getting confused by the now-unwound environment chain not matching the pc. Setting it to the first instruction, script->code(), would likely work.

However does the profiler even use the override pc? I thought it only cared about return addresses and the lastCallFrame stuff.

Flags: needinfo?(jdemooij)

It saves the PC and uses it later (when dumping profiles) to generate frame display-name info (e.g. filename:lineno info for the script):

https://searchfox.org/mozilla-central/source/js/src/jit/JSJitFrameIter.cpp#635

The relevant stack trace (from jesup):

#0  0x00007fe860f84538 in js::jit::BaselineScript::nativeCodeForPC(JSScript*, unsigned char*, js::jit::PCMappingSlotInfo*)
    (this=0x7fe7be334ee0, script=0x25d31274e5b0, pc=0x7fe7c158b765 "\231\230\200", slotInfo=0x7fe834cee010)
    at ../../../../js/src/jit/BaselineJIT.h:446
        native = 0x0
#1  0x00007fe8613360fe in js::jit::JSJitProfilingFrameIterator::fixBaselineReturnAddress() (this=0x7fe834cee2b0)
    at /home/jesup/src/mozilla/inbound/js/src/jit/JSJitFrameIter.cpp:639
        slotInfo = {
          slotInfo_ = 0 '\000'
        }
        script = 0x25d31274e5b0
        override = 0x7fe7c158b765 "\231\230\200"
        bl = 0x7fff4e59f3b0
#2  0x00007fe861335d89 in js::jit::JSJitProfilingFrameIterator::moveToNextFrame(js::jit::CommonFrameLayout*)
    (this=0x7fe834cee2b0, frame=0x7fff4e59f350) at /home/jesup/src/mozilla/inbound/js/src/jit/JSJitFrameIter.cpp:731
        stubFrame = 0x7fff4e59f390
        prevType = js::jit::FrameType::BaselineStub
#3  0x00007fe86133615d in js::jit::JSJitProfilingFrameIterator::operator++() (this=0x7fe834cee2b0)
    at /home/jesup/src/mozilla/inbound/js/src/jit/JSJitFrameIter.cpp:649
        frame = 0x7fff4e59f350
#4  0x00007fe8609827ac in JS::ProfilingFrameIterator::operator++() (this=0x7fe834cee288)
    at /home/jesup/src/mozilla/inbound/js/src/vm/Stack.cpp:1807

(In reply to Kannan Vijayan [:djvj] from comment #17)

nbp, jandem - either of you have an idea about the reasoning behind that 'lastPC' choice?

I do not. I was not even aware of this overridePC mechanism.

Flags: needinfo?(nicolas.b.pierron)
Pushed by kvijayan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1f4aa15478f1
Set overridePc to firstPc instead of lastPc when unwinding frames. r=jandem
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
QA Whiteboard: [qa-69b-p2]
You need to log in before you can comment on or make changes to this bug.