Closed Bug 1773434 Opened 3 years ago Closed 3 years ago

Use frame pointer relative addressing in Ion

Categories

(Core :: JavaScript Engine: JIT, task, P2)

task

Tracking

()

RESOLVED FIXED
104 Branch
Tracking Status
firefox104 --- fixed

People

(Reporter: jandem, Assigned: jandem)

References

(Regressed 1 open bug)

Details

Attachments

(10 files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

After bug 1770366 we can use [rbp+N] instead of [rsp+N] to access arguments and stack slots. (Wasm is already doing this for arguments but not stack slots.)

On x86/x64, this saves one byte per instruction and has other benefits. C/C++ compilers also prefer the frame pointer over the stack pointer if they're not omitting frame pointers.

Using the frame pointer is also less error-prone because it's not affected by stack pushes.

On x86/x64, this works great: using [RBP+-N] is one byte shorter than [RSP+N], and I see a small improvement on some benchmarks.

The problem is ARM64: to access locals, we'll need a negative offset from the frame pointer, but the range for load/store immediates on ARM is [-256, 4095]. This means that switching everything to use FP as base register would require a scratch register and an extra instruction in more cases.

We could keep using SP-relative addressing on ARM, or at least for larger offsets, but then there's a class of bugs that will only show up on ARM (when you get masm.framePushed() wrong for example) and I had hoped we could stop using SP-relative addressing completely.

I'm not sure what the best option is, but for x86/x64 there are clear benefits.

Given that one ISA pushes heavily towards frame-relative, and the other pushes towards stack-relative, the optimal solution perf-wise is presumably to support both approaches. Not sure whether that's justifiable from a maintainability perspective.

How bad would it be to continue supporting stack-relative locals on x86 behind a flag, so that we can fuzz both ways?

How bad would it be to continue supporting stack-relative locals on x86
behind a flag, so that we can fuzz both ways?

The increment in overall complexity and in verification burden concerns me,
especially considering that there are unresolved issues with the arm64 16-byte
SP-alignment requirement. I spent several weeks last year working on that
without much success, and it looks like we will need to revisit it soon, see
bug 1774449.

[..] I see a small improvement on some benchmarks [..]

Can you show some numbers? Do we know why there is an improvement? It sounds
like it could be either due to reduced icache misses, or the removal of
stack-sync uops on some Intels.

I've reworked the patches to easily support both modes based on a JitOption setting. On x86/x64 we default to FP-relative and on other platforms we stay with SP-relative.

Here are some numbers for embenchen in the shell on x64:

JS_SHELL1=obj-shell-opt/dist/bin/js-sp JS_SHELL2=obj-shell-opt/dist/bin/js-fp python3 wasm_bench.py --mode=ion+ion --range -n 7
# mode=ion+ion, runs=7, problem size=default
# Lower score is better
box2d           736     743     0.991   [725, 740]      [736, 746]
bullet          890     906     0.982   [878, 910]      [892, 911]
conditionals    293     250     1.172   [290, 295]      [248, 258]
copy            536     509     1.053   [509, 540]      [504, 518]
corrections     936     921     1.016   [919, 940]      [914, 945]
fannkuch        2154    2084    1.034   [2133, 2219]    [2080, 2093]
fasta           668     662     1.009   [661, 668]      [655, 666]
fib             641     649     0.988   [640, 662]      [637, 650]
fib_indirect1   797     804     0.991   [784, 800]      [803, 823]
fib_indirect2   840     841     0.999   [837, 843]      [840, 841]
fib_indirect3   805     836     0.963   [804, 806]      [834, 837]
ifs             173     182     0.951   [154, 192]      [161, 198]
binarytrees     3030    3031    1.0     [2994, 3061]    [3004, 3051]
memops          1025    908     1.129   [998, 1041]     [907, 910]
primes          971     844     1.15    [968, 977]      [843, 859]
raybench        1036    1045    0.991   [1022, 1044]    [1043, 1054]
rust-fannkuch   2130    2087    1.021   [2130, 2133]    [2084, 2088]
skinning        637     644     0.989   [633, 647]      [636, 655]
zlib            1008    1001    1.007   [983, 1011]     [996, 1007]

JS_SHELL1=obj-shell-opt/dist/bin/js-sp JS_SHELL2=obj-shell-opt/dist/bin/js-fp python3 wasm_bench.py --mode=ion+ion --range -n 7
# mode=ion+ion, runs=7, problem size=default
# Lower score is better
box2d           727     730     0.996   [721, 751]      [721, 741]
bullet          876     879     0.997   [874, 910]      [874, 887]
conditionals    293     253     1.158   [291, 311]      [251, 256]
copy            539     502     1.074   [533, 542]      [496, 516]
corrections     935     937     0.998   [932, 939]      [925, 947]
fannkuch        2175    2110    1.031   [2121, 2257]    [2087, 2130]
fasta           671     670     1.001   [665, 681]      [667, 676]
fib             649     653     0.994   [642, 657]      [639, 654]
fib_indirect1   787     803     0.98    [785, 792]      [803, 804]
fib_indirect2   837     840     0.996   [837, 843]      [840, 842]
fib_indirect3   805     846     0.952   [804, 809]      [836, 859]
ifs             157     170     0.924   [156, 159]      [161, 182]
binarytrees     3006    3036    0.99    [2998, 3015]    [3009, 3058]
memops          1023    931     1.099   [912, 1033]     [910, 932]
primes          977     841     1.162   [968, 993]      [839, 843]
raybench        1035    1057    0.979   [1033, 1040]    [1044, 1073]
rust-fannkuch   2131    2082    1.024   [2129, 2149]    [2082, 2097]
skinning        640     643     0.995   [628, 647]      [619, 649]
zlib            987     982     1.005   [972, 999]      [978, 1000]

The >= 10% improvement on conditionals, memops, and primes is very consistent.

Chromium switched ARM64 from fp-relative to sp-relative: https://codereview.chromium.org/1376173003/

This code used to use ARM-specific data structures with limits on the offset range,
but we now match the x86-shared version and the assertion is unnecessary.
The StackPointer-case already uses larger offsets too.

Add a separate UnusedStackBytesForCall function so that we can change
StackOffsetOfPassedArg in a later patch.

Depends on D150944

The idea is to have the code for LAllocation => Address in one function, so
it's easier to change.

Depends on D150947

Wasm was already doing this and we can now do this to JS too.

Always use SP for MIonToWasmCall because in GenerateDirectCallFromJit we clobber
FP and have code to adjust the stack offset.

Depends on D150949

Now that we have a frame pointer, we can use it to access local slots in Ion, instead of
using the stack pointer. This results in more compact and faster code on x86 and x64.

On other platforms we still use the stack pointer, but we allow changing this in the
shell so that fuzzers can test both cases on each platform.

Depends on D150950

This ensures we eliminate some branching at compile-time. There are only two instantiations,
for Default and SP.

Depends on D150952

Pushed by jdemooij@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4b218eed4568 part 1 - Remove bogus assertion on ARM32. r=iain https://hg.mozilla.org/integration/autoland/rev/868fd0d9d3ad part 2 - Tidy up unused-stack computation for calls. r=iain https://hg.mozilla.org/integration/autoland/rev/ef8b19ce75a3 part 3 - Replace StackOffsetOfPassedArg with AddressOfPassedArg. r=iain https://hg.mozilla.org/integration/autoland/rev/86e26be2815d part 4 - Remove unused ToStackOffset and ToFramePointerOffset overloads. r=iain https://hg.mozilla.org/integration/autoland/rev/e73abfde161c part 5 - Inline some functions into ToAddress. r=iain https://hg.mozilla.org/integration/autoland/rev/fa4753dfcd8a part 6 - Assert framePushed matches frameSize in ToAddress and AddressOfPassedArg. r=iain https://hg.mozilla.org/integration/autoland/rev/8154b8fbb975 part 7 - Always use frame pointer for incoming arguments. r=iain https://hg.mozilla.org/integration/autoland/rev/49b19abe314d part 8 - Use frame pointer to access local slots and passed-argument slots on x86/x64. r=iain https://hg.mozilla.org/integration/autoland/rev/e289b0cf4176 part 9 - Add offsetOfArgsFromFP_ to simplify arguments code a bit. r=iain https://hg.mozilla.org/integration/autoland/rev/f17954dda80a part 10 - Templatize ToAddress. r=iain
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: