Closed Bug 1446307 Opened 2 years ago Closed Last year

ToggledCallSize() does not properly account for pools on ARM64


(Core :: JavaScript Engine: JIT, defect, P1)




Tracking Status
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- fixed


(Reporter: lth, Assigned: sstangl)


(Blocks 1 open bug)



(2 files, 1 obsolete file)

SoftIron Overdrive 1000, OpenSUSE
mozilla-inbound 408372:91a98a3ddd28 (m-c to m-i merge changeset)
Configure with --enable-debug --enable-optimize
Run without options


  Exit code: -4
  FAIL - debug/Script-getOffsetsCoverage-02.js

  Exit code: -4
  FAIL - debug/Script-getOffsetsCoverage-03.js

  Exit code: -4
  FAIL - debug/Script-getOffsetsCoverage-04.js

  Exit code: -4
  FAIL - debug/Script-getOffsetsCoverage-bug1233178.js

When running with --no-baseline there are no errors, ergo, JS baseline compiler failure.

Normally this passes on the simulator when running on x64 linux, so probably a hardware-only failure, but needs deeper investigation.
But, the tests also do not fail when run with --baseline-eager --no-threads.
No longer blocks: Fennec-ARM64
Reproduces reliably, sounds actionable -> P1
Flags: needinfo?(sstangl)
Priority: -- → P1
<jorendorff> [various stupid questions]
<lth> i'm imagining a missing icache flush when patching something
I don't have any ARM64 hardware at the moment, so if it doesn't reproduce in the simulator, I can't debug it.
To make things even more interesting, this is *only* for --enable-debug --enable-optimize builds, not for --disable-optimize --enable-debug or --enable-optimize --disable-debug.
For the time being I run tests on hardware manually, roughly weekly.

This morning a few more tests are failing and there are some new failure modes.  Again this is an --enable-debug --enable-optimize build.  My guess is that the underlying problem is the same however.

Exit code: -4
FAIL - debug/Script-getOffsetsCoverage-02.js

Exit code: -4
FAIL - debug/Script-getOffsetsCoverage-03.js

Exit code: -4
FAIL - debug/Script-getOffsetsCoverage-04.js

Exit code: -4
FAIL - debug/Script-getOffsetsCoverage-bug1233178.js

Exit code: -11
FAIL - debug/bug1240803.js

Exit code: -11
FAIL - debug/bug1246605.js

Assertion failure: thingKind == getAllocKind(), at /home/lhansen/m-i/js/src/gc/GC.cpp:563
Exit code: -11
FAIL - debug/bug-1248162.js

Exit code: -11
FAIL - debug/bug1252453.js
Depends on: 1451382
Attached file debugger.js
Simplified test case.  Run this without flags (on hardware) to get an illegal instruction crash.

If run with --no-baseline there's no crash (unsurprising).

If run with --baseline-eager there's no crash (more interesting).
So there's a bug here with the constant pool header, but fixing that does not seem to make a difference.  The bug is: On ARM64, sizeof(Instruction) is 1, so when WritePoolHeader() divides the pool size by sizeof(Instruction) to get the number of instructions to store in the pool header, it gets the wrong result.  We must divide by 4 instead.

There appears to be some code somewhere that does not properly skip a constant pool that is emitted in the middle of a toggled call sequence.  The code looks like this:

   0x20b8b8dfbd8:	add	x28, x28, #0x8
   0x20b8b8dfbdc:	mov	sp, x28                ; SyncStackPtr
   0x20b8b8dfbe0:	ldr	x17, 0x20b8b8dfc4c     ; First instruction of toggled call
   0x20b8b8dfbe4:	b	0x20b8b8dfc54          ; Branch across pool
=> 0x20b8b8dfbe8:	.inst	0xffff001b             ; Pool header (corrected)
   ...                                                 ; constants
   0x20b8b8dfc54:	nop                            ; Second instruction of toggled call
   0x20b8b8dfc58:	mov	x16, #0x28b8
   0x20b8b8dfc5c:	movk	x16, #0xb6e0, lsl #16

The program crashing at the point of => indicates that somebody knows that address and has generated code that uses it, but has not taken into account that there may be a pool there.

We could hack around this probably by flushing the pool before the patchable call but this seems wrong.  (On the other hand, this is a poster child for suppressing constant pools if I ever saw one.)

Given that we don't crash with --baseline-eager this could have something to do with bailouts.  What I know for sure is that we don't do any toggling of the calls for the debugger before the crash, because neither ToggleCall nor toggleDebugTraps is called.

The toggled calls are brittle due to a combination of circumstances:

- the offset that is returned for the toggled sequence is the offset to the SyncStackPtr,
  and there's a comment explaining why this has to be so

- code that skips pools for toggled calls therefore needs to be aware that the sequence
  can have other code in it than the toggled call instructions themselves or the pool

- mostly this seems to be done right, /in the places i've found/, but comments suggest
  vaguely that that the right code was added later, because the comments are not quite
Oh yeah, the TC attached also crashes for --enable-debug --disable-optimize.
jit::EnterBaselineAtBranch() calls BaselineScript::nativeCodeForPC() which correctly returns the address of the SyncStackPtr's "mov" instruction.  It then increments this value to skip past the handler:

    data.jitcode = baseline->nativeCodeForPC(fp->script(), pc);
    if (fp->isDebuggee()) {
        data.jitcode += MacroAssembler::ToggledCallSize(data.jitcode);

However the size of the handler returned by ToggledCallSize() on ARM64 does not account for a possible constant pool in the middle of the handler.  It does handle a pool before the initial LDR, but not before the NOP.  It can also handle a pool at the beginning of the sequence, but then not a SyncStackPtr after the pool; or a pool after the SyncStackPtr, which is handled properly. 

Superficially it looks like ARM could have the same problem - pools appear not to be handled at all, and toggledCall() does not have any guard that the instructions are not split up.
This fixes the problem (comment 10) on ARM64.

Can you take a look at the corresponding ARM code and judge whether it needs to be fixed there too?  I expect debugging has not been exercised heavily on ARM...

I haven't scanned the rest of the arm64 code to see if there might be similar issues elsewhere, but I'll try to find time to do so.
Assignee: nobody → lhansen
Flags: needinfo?(sstangl)
Attachment #8965649 - Flags: review?(sstangl)
Duplicate of this bug: 1451382
Summary: jit-test failures on ARM64 hardware, JS baseline JIT implicated → ToggledCallSize() does not properly account for pools on ARM64
Comment on attachment 8965649 [details] [diff] [review]

Review of attachment 8965649 [details] [diff] [review]:

::: js/src/jit/arm64/MacroAssembler-arm64.h
@@ +1968,2 @@
> +        const Instruction* cur = (const Instruction*)code;

This should use the InstructionIterator rather than open-coding its logic.

It would also be prudent to MOZ_ASSERT() that the instructions are indeed LDR/ADR and NOP/BLR.

@@ +1973,5 @@
> +        cur = cur->skipPool();
> +        cur = cur->NextInstruction(); // LDR/ADR
> +        cur = cur->skipPool();
> +        cur = cur->NextInstruction(); // NOP/BLR
> +        return (uint8_t*)cur - code;

On ARM, the only remaining use of ToggledCallSize is to skip past the call -- so really, it should be some function like NextInstructionAfterToggledCall(), the size being an implementation detail that's leaking.
Attachment #8965649 - Flags: review?(sstangl) → review+
Those are both good ideas but I appear to be swamped with other things.  Since this is a JS-side issue, reassigning for rewrite + landing.
Assignee: lhansen → sstangl
Sean, is this still an issue?
Flags: needinfo?(sstangl)
I rebased the patch. ARM64 doesn't have an InstructionIterator yet, and ToggledCallSize is cross-platform and so I didn't want to fix that up either.

Can land as-is.
Attachment #8965649 - Attachment is obsolete: true
Flags: needinfo?(sstangl)
Attachment #9022277 - Flags: review+
Keywords: checkin-needed
Pushed by
Compute toggled call size properly; fix constant pool header. r=sstangl
Keywords: checkin-needed
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
You need to log in before you can comment on or make changes to this bug.