Use frame pointer relative addressing in Ion
Categories
(Core :: JavaScript Engine: JIT, task, P2)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox104 | --- | fixed |
People
(Reporter: jandem, Assigned: jandem)
References
(Regressed 1 open bug)
Details
Attachments
(10 files)
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review |
After bug 1770366 we can use [rbp+N] instead of [rsp+N] to access arguments and stack slots. (Wasm is already doing this for arguments but not stack slots.)
On x86/x64, this saves one byte per instruction and has other benefits. C/C++ compilers also prefer the frame pointer over the stack pointer if they're not omitting frame pointers.
Using the frame pointer is also less error-prone because it's not affected by stack pushes.
| Assignee | ||
Comment 1•3 years ago
|
||
On x86/x64, this works great: using [RBP+-N] is one byte shorter than [RSP+N], and I see a small improvement on some benchmarks.
The problem is ARM64: to access locals, we'll need a negative offset from the frame pointer, but the range for load/store immediates on ARM is [-256, 4095]. This means that switching everything to use FP as base register would require a scratch register and an extra instruction in more cases.
We could keep using SP-relative addressing on ARM, or at least for larger offsets, but then there's a class of bugs that will only show up on ARM (when you get masm.framePushed() wrong for example) and I had hoped we could stop using SP-relative addressing completely.
I'm not sure what the best option is, but for x86/x64 there are clear benefits.
Comment 2•3 years ago
|
||
Given that one ISA pushes heavily towards frame-relative, and the other pushes towards stack-relative, the optimal solution perf-wise is presumably to support both approaches. Not sure whether that's justifiable from a maintainability perspective.
How bad would it be to continue supporting stack-relative locals on x86 behind a flag, so that we can fuzz both ways?
Comment 3•3 years ago
|
||
How bad would it be to continue supporting stack-relative locals on x86
behind a flag, so that we can fuzz both ways?
The increment in overall complexity and in verification burden concerns me,
especially considering that there are unresolved issues with the arm64 16-byte
SP-alignment requirement. I spent several weeks last year working on that
without much success, and it looks like we will need to revisit it soon, see
bug 1774449.
[..] I see a small improvement on some benchmarks [..]
Can you show some numbers? Do we know why there is an improvement? It sounds
like it could be either due to reduced icache misses, or the removal of
stack-sync uops on some Intels.
| Assignee | ||
Comment 4•3 years ago
|
||
I've reworked the patches to easily support both modes based on a JitOption setting. On x86/x64 we default to FP-relative and on other platforms we stay with SP-relative.
Here are some numbers for embenchen in the shell on x64:
JS_SHELL1=obj-shell-opt/dist/bin/js-sp JS_SHELL2=obj-shell-opt/dist/bin/js-fp python3 wasm_bench.py --mode=ion+ion --range -n 7
# mode=ion+ion, runs=7, problem size=default
# Lower score is better
box2d 736 743 0.991 [725, 740] [736, 746]
bullet 890 906 0.982 [878, 910] [892, 911]
conditionals 293 250 1.172 [290, 295] [248, 258]
copy 536 509 1.053 [509, 540] [504, 518]
corrections 936 921 1.016 [919, 940] [914, 945]
fannkuch 2154 2084 1.034 [2133, 2219] [2080, 2093]
fasta 668 662 1.009 [661, 668] [655, 666]
fib 641 649 0.988 [640, 662] [637, 650]
fib_indirect1 797 804 0.991 [784, 800] [803, 823]
fib_indirect2 840 841 0.999 [837, 843] [840, 841]
fib_indirect3 805 836 0.963 [804, 806] [834, 837]
ifs 173 182 0.951 [154, 192] [161, 198]
binarytrees 3030 3031 1.0 [2994, 3061] [3004, 3051]
memops 1025 908 1.129 [998, 1041] [907, 910]
primes 971 844 1.15 [968, 977] [843, 859]
raybench 1036 1045 0.991 [1022, 1044] [1043, 1054]
rust-fannkuch 2130 2087 1.021 [2130, 2133] [2084, 2088]
skinning 637 644 0.989 [633, 647] [636, 655]
zlib 1008 1001 1.007 [983, 1011] [996, 1007]
JS_SHELL1=obj-shell-opt/dist/bin/js-sp JS_SHELL2=obj-shell-opt/dist/bin/js-fp python3 wasm_bench.py --mode=ion+ion --range -n 7
# mode=ion+ion, runs=7, problem size=default
# Lower score is better
box2d 727 730 0.996 [721, 751] [721, 741]
bullet 876 879 0.997 [874, 910] [874, 887]
conditionals 293 253 1.158 [291, 311] [251, 256]
copy 539 502 1.074 [533, 542] [496, 516]
corrections 935 937 0.998 [932, 939] [925, 947]
fannkuch 2175 2110 1.031 [2121, 2257] [2087, 2130]
fasta 671 670 1.001 [665, 681] [667, 676]
fib 649 653 0.994 [642, 657] [639, 654]
fib_indirect1 787 803 0.98 [785, 792] [803, 804]
fib_indirect2 837 840 0.996 [837, 843] [840, 842]
fib_indirect3 805 846 0.952 [804, 809] [836, 859]
ifs 157 170 0.924 [156, 159] [161, 182]
binarytrees 3006 3036 0.99 [2998, 3015] [3009, 3058]
memops 1023 931 1.099 [912, 1033] [910, 932]
primes 977 841 1.162 [968, 993] [839, 843]
raybench 1035 1057 0.979 [1033, 1040] [1044, 1073]
rust-fannkuch 2131 2082 1.024 [2129, 2149] [2082, 2097]
skinning 640 643 0.995 [628, 647] [619, 649]
zlib 987 982 1.005 [972, 999] [978, 1000]
The >= 10% improvement on conditionals, memops, and primes is very consistent.
Comment 5•3 years ago
|
||
Chromium switched ARM64 from fp-relative to sp-relative: https://codereview.chromium.org/1376173003/
| Assignee | ||
Comment 6•3 years ago
|
||
This code used to use ARM-specific data structures with limits on the offset range,
but we now match the x86-shared version and the assertion is unnecessary.
The StackPointer-case already uses larger offsets too.
| Assignee | ||
Comment 7•3 years ago
|
||
Add a separate UnusedStackBytesForCall function so that we can change
StackOffsetOfPassedArg in a later patch.
Depends on D150944
| Assignee | ||
Comment 8•3 years ago
|
||
Depends on D150945
| Assignee | ||
Comment 9•3 years ago
|
||
Depends on D150946
| Assignee | ||
Comment 10•3 years ago
|
||
The idea is to have the code for LAllocation => Address in one function, so
it's easier to change.
Depends on D150947
| Assignee | ||
Comment 11•3 years ago
|
||
Depends on D150948
| Assignee | ||
Comment 12•3 years ago
|
||
Wasm was already doing this and we can now do this to JS too.
Always use SP for MIonToWasmCall because in GenerateDirectCallFromJit we clobber
FP and have code to adjust the stack offset.
Depends on D150949
| Assignee | ||
Comment 13•3 years ago
|
||
Now that we have a frame pointer, we can use it to access local slots in Ion, instead of
using the stack pointer. This results in more compact and faster code on x86 and x64.
On other platforms we still use the stack pointer, but we allow changing this in the
shell so that fuzzers can test both cases on each platform.
Depends on D150950
| Assignee | ||
Comment 14•3 years ago
|
||
Depends on D150951
| Assignee | ||
Comment 15•3 years ago
|
||
This ensures we eliminate some branching at compile-time. There are only two instantiations,
for Default and SP.
Depends on D150952
Comment 16•3 years ago
|
||
Comment 17•3 years ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/4b218eed4568
https://hg.mozilla.org/mozilla-central/rev/868fd0d9d3ad
https://hg.mozilla.org/mozilla-central/rev/ef8b19ce75a3
https://hg.mozilla.org/mozilla-central/rev/86e26be2815d
https://hg.mozilla.org/mozilla-central/rev/e73abfde161c
https://hg.mozilla.org/mozilla-central/rev/fa4753dfcd8a
https://hg.mozilla.org/mozilla-central/rev/8154b8fbb975
https://hg.mozilla.org/mozilla-central/rev/49b19abe314d
https://hg.mozilla.org/mozilla-central/rev/e289b0cf4176
https://hg.mozilla.org/mozilla-central/rev/f17954dda80a
Description
•