Closed Bug 1426134 Opened 7 years ago Closed 2 years ago

preserve frame pointer in all JIT code

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

defect

Tracking

()

RESOLVED FIXED
102 Branch
Tracking Status
firefox59 --- wontfix
firefox102 --- fixed

People

(Reporter: luke, Assigned: anba)

References

Details

Attachments

(7 files)

If we could preserve the frame pointer in all JIT code and use a standard prologue/epilogue, I think there would be a lot of benefits:

 - With preserved fp, it's pretty easy to unwind JIT frames because you don't have to know the sp->fp offset at every pc.  Thus, it would be easy to:
    . register unwind info with Windows (RtlAddFunctionTable et al) so that various forms of native unwinding Just Worked
    . register DWARF info so gdb and other tools Just Worked

 - Having 'fp' simplifies codegen and stack iteration and can reduce metadata size of bailouts, etc

 - Maybe we could remove, or at least simplify, the JIT profiling mode.

FWIW, wasm/asm.js used to not preserve fp and go through a bunch of contortions to maintain fp during profiling mode.  Later, we needed the ability to walk the stack at any pc (to handle GC during async interrupt) so we switched wasm to always preserve fp so we could always find the current instance object (via fp).  This was a major simplification and removed profiling mode altogether.

The concern of course is register-starved x86; 6 vs 7 GPRs is a big difference.  However, 64-bit is now the majority of FF users:
  https://hardware.metrics.mozilla.com/#detail-browser-share-32-64
and would likely be the browser of anyone doing serious benchmarking.
> The concern of course is register-starved x86; 6 vs 7 GPRs is a big difference.

FWIW, this didn't stop us from preserving frame pointers for CPP in bug 1322735.
We should check how much this actually affects Ion perf on 32-bit and 64-bit. Both JS shell benchmarks like SunSpider/Kraken/Octane (especially Kraken and Octane-crypto might be affected) and Talos stuff like Speedometer/Dromaeo.

We've had regalloc issues that only show up with profiling enabled (due to FP being reserved only in that case). That's not great and I'm in favor of simplifying the current situation.

If the hit on 32-bit x86 is <= 5% or so I'd be willing to take it. If it's more we should be more careful.
Priority: -- → P3
(In reply to Luke Wagner [:luke] from comment #0)

>     . register DWARF info so gdb and other tools Just Worked

This change would make it far simpler to enable the existing unwinder by default.
https://dxr.mozilla.org/mozilla-central/rev/1624b88874765bf57e9feba176d30149c748d9d2/js/src/gdb/mozilla/unwind.py#1

Currently it is off because it's difficult to unwind when stopped at an arbitrary
spot in JIT code.

DWARF emission wouldn't even have to happen for this to be useful.
Blocks: 1434402
Another random benefit of being able to unwind at an arbitrary point in JIT code that I saw today is that, when we call RedirectJitCodeToInterruptCheck(), we'd be able to walk the stack and only patch/redirect the JIT code for functions on the stack instead of all the JIT code in the JitRuntime.
I attempted to measure the performance impact of preserving %rbp by adding it to the NonAllocatableRegs mask. The register allocator does correctly respect that mask. Simple change to make, but building all the variants takes quite a while!

Those builds produced the following results on Octane:

32-bit, with ebp.
> Richards: 33650
> DeltaBlue: 50137
> Crypto: 25974
> RayTrace: 116992
> EarleyBoyer: 34948
> RegExp: 4555
> Splay: 18503
> SplayLatency: 4345
> NavierStokes: 30361
> PdfJS: 15532
> Mandreel: 22724
> MandreelLatency: 24802
> Gameboy: 63029
> CodeLoad: 18604
> Box2D: 47968
> zlib: 57163
> Typescript: 14681
> ----
> Score (version 9): 25515

32-bit, patched without ebp.
> Richards: 33911
> DeltaBlue: 50474
> Crypto: 26092
> RayTrace: 116326
> EarleyBoyer: 34089
> Segmentation fault (core dumped)

64-bit, with rbp.
> Richards: 29541
> DeltaBlue: 48722
> Crypto: 28312
> RayTrace: 110776
> EarleyBoyer: 28612
> RegExp: 4376
> Splay: 20014
> SplayLatency: 23000
> NavierStokes: 29680
> PdfJS: 17694
> Mandreel: 22790
> MandreelLatency: 29674
> Gameboy: 69180
> CodeLoad: 22275
> Box2D: 48063
> zlib: 52394
> Typescript: 14715
> ----
> Score (version 9): 28461

64-bit, patched without rbp.
> Richards: 29343
> DeltaBlue: 52108
> Crypto: 29160
> RayTrace: 108038
> EarleyBoyer: 29225
> RegExp: 4359
> Splay: 18784
> SplayLatency: 21817
> NavierStokes: 30243
> PdfJS: 18677
> Mandreel: 22856
> MandreelLatency: 29153
> Gameboy: 66618
> CodeLoad: 22526
> Box2D: 48111
> zlib: 71744
> Typescript: 13220
> ----
> Score (version 9): 28808

The crash on x86 is from MacroAssembler::generateBailoutTail() assuming that ebp/rbp is in the AllocatableMask.
I have x86 mostly working at this point. The goal was to just get a patch to a point where we can evaluate performance to see whether this route is viable.

Octane runs with the exception of PdfJS -- there are many CacheIR functions that assert that more registers are available than x86 would provide after this patchset. Also, although the RegExp code runs, it currently clobbers %ebp.

Results from Octane runs without/with ebp as an allocatable register are below (minus PdfJS):

> [sstangl@mazu octane]$ ~/dev/gecko-dev/js/src/opt32/js/src/js run.js 
> Richards: 29964
> DeltaBlue: 47975
> Crypto: 25462
> RayTrace: 112700
> EarleyBoyer: 31753
> RegExp: 4541
> Splay: 18499
> SplayLatency: 4506
> NavierStokes: 29532
> Mandreel: 20058
> MandreelLatency: 22973
> Gameboy: 39159
> CodeLoad: 19003
> Box2D: 44460
> zlib: 45256
> Typescript: 15206
> ----
> Score (version 9): 24359

> [sstangl@mazu octane]$ ~/dev/gecko-dev/js/src/opt32-master/js/src/js run.js 
> Richards: 33191
> DeltaBlue: 50792
> Crypto: 26462
> RayTrace: 114550
> EarleyBoyer: 34773
> RegExp: 4658
> Splay: 18384
> SplayLatency: 4308
> NavierStokes: 29917
> Mandreel: 23148
> MandreelLatency: 24437
> Gameboy: 51345
> CodeLoad: 19245
> Box2D: 46722
> zlib: 50268
> Typescript: 16109
> ----
> Score (version 9): 25919

Jan, what do you think?
Flags: needinfo?(jdemooij)
(In reply to Sean Stangl [:sstangl] from comment #6)
> Octane runs with the exception of PdfJS -- there are many CacheIR functions
> that assert that more registers are available than x86 would provide after
> this patchset.

Is this in Baseline or Ion IC code? It's hard enough to make IC regalloc work on x86 so taking away ebp could be a problem... Is it okay to clobber ebp temporarily *before* we make JS/C++ calls? The CacheIR code should ensure registers are restored to their original values whenever we make such calls (because that's necessary for GC marking etc). I guess that would break unwinding if we want to make that fp-based.

> Results from Octane runs without/with ebp as an allocatable register are
> below (minus PdfJS):
> 
> Jan, what do you think?

So on 32-bit a 6% Octane regression, 10% for the asm.js zlib test, right? What do you see on Kraken (many tight loops)? On 64-bit zlib (asm.js) improves from 52394 to 71744 points - that's a really big difference, is it reproducible?

It's a noticeable perf regression on 32-bit so I'm not thrilled, but OTOH most of our power users moved to 64-bit anyway.. It also depends on how invasive the changes are: if it really complicates manual regalloc that could be a more serious problem than the perf hit.
Flags: needinfo?(jdemooij)
That's weird, assuming the asm.js is being compiled by OdinMonkey here (no --no-asmjs), it should already have FP reserved, so I wouldn't expect an Octane-zlib regression.
Blocks: 1530552
Depends on: 1755056
No longer blocks: 1530552

Handle the case when temp2_ couldn't be reserved.

CheckNotBackReferenceImpl():

  • Use backtrack_stack_pointer_ as an additional temp register by saving it
    on the stack.

initFrameAndRegs():

  • Use backtrack_stack_pointer_ as an additional temp register. We don't have
    to restore the register, because backtrack_stack_pointer_ hasn't yet been
    initialised.
  • Use temp0_ for inputStartMinusOneReg. InputOutputData is stored in
    temp0_, but we don't read from it after this point anymore.

successHandler():

  • Use backtrack_stack_pointer_ as an additional temp register. We don't have
    to restore the register, because backtrack_stack_pointer_ won't be read
    after running the success handler.
Assignee: nobody → andrebargull
Status: NEW → ASSIGNED

We can't reuse the trick we've been using for maybeTemp5 where we're reusing
lastIndex as an additional temp register. (Also see bug 1480819.)

Instead just save regexp and lastIndex on the stack and then use these
register for temp4 resp. temp5.

This change also means RegExpMatcherRaw will now (again) always be called
with a valid lastIndex argument.

Depends on D146704

Use AutoScratchRegisterMaybeOutputType instead of AutoScratchRegister, so
we need one less register.

Depends on D146705

LGetPropSuperCache needs seven register, so it can't be used on x86.

Depends on D146706

Changes:

  • Replace regs.take(BaselineFrameReg) with regs.takeUnchecked(...), because
    BaselineFrameReg is the frame pointer only on some platforms, which means it's
    no longer in the register set returned from GeneralRegisterSet::All().
  • Remove TakeJitRegisters because the frame pointer is now always preserved.
  • Update GenerateDirectCallFromJit as instructed in the code comments.

Drive-by change:

  • Remove RegisterID::S0 and RegisterID::S1, because they are some old
    JSC-specific definitions. Instead add RegisterID::fp.

Depends on D146707

This code is no longer needed.

Depends on D146708

Blocks: 1530552
Blocks: 1770365
Blocks: 1770366
Attachment #9277125 - Attachment description: Bug 1426134 - Part 3: Change IonCacheIRCompiler to work with one less temp register. r=jandem! → WIP: Bug 1426134 - Part 3: Change IonCacheIRCompiler to work with one less temp register. r=jandem!
Attachment #9277125 - Attachment description: WIP: Bug 1426134 - Part 3: Change IonCacheIRCompiler to work with one less temp register. r=jandem! → Bug 1426134 - Part 3: Change IonCacheIRCompiler to work with one less temp register. r=jandem!
Pushed by andre.bargull@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/3a85ea2898b3
Part 1: Change SMRegExpMacroAssembler to work with one less temp register. r=iain
https://hg.mozilla.org/integration/autoland/rev/bbf75553a6dc
Part 2: Change JitRealm::generateRegExpMatcherStub to work with one less register on x86. r=jandem
https://hg.mozilla.org/integration/autoland/rev/3a054262db65
Part 3: Change IonCacheIRCompiler to work with one less temp register. r=jandem
https://hg.mozilla.org/integration/autoland/rev/42a0409c631b
Part 4: Disable GetElemSuper IC inlining on x86. r=jandem
https://hg.mozilla.org/integration/autoland/rev/d1b71b47f02a
Part 5: Remove frame pointer from allocatable register set. r=jandem
https://hg.mozilla.org/integration/autoland/rev/f7930df19aac
Part 6: Remove no longer used framepointer-temp from LIonToWasmCallBase. r=jandem
Blocks: 1770922
Blocks: 1771085
No longer blocks: sm-jits
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: