ARM64 function prologue is very large
Categories
(Core :: JavaScript: WebAssembly, defect, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox90 | --- | fixed |
People
(Reporter: lth, Assigned: lth)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
The prologue is 20 instructions / 80 bytes, this is the prologue for a simple wasm function:
0x13c98ee6f000 d10043ff sub sp, sp, #0x10 (16)
0x13c98ee6f004 f90007fe str x30, [sp, #8]
0x13c98ee6f008 f90003fd str x29, [sp]
0x13c98ee6f00c 910003fd mov x29, sp
0x13c98ee6f010 d503201f nop
0x13c98ee6f014 d503201f nop
0x13c98ee6f018 d503201f nop
0x13c98ee6f01c d503201f nop
0x13c98ee6f020 71018d5f cmp w10, #0x63 (99)
0x13c98ee6f024 54000160 b.eq #+0x2c (addr 0x13c98ee6f050)
0x13c98ee6f028 d4a00000 unimplemented (Exception)
0x13c98ee6f02c 14000003 b #+0xc (addr 0x13c98ee6f038)
0x13c98ee6f030 ffff0001 unallocated (Unallocated)
0x13c98ee6f034 14000007 b #+0x1c (addr 0x13c98ee6f050)
0x13c98ee6f038 d503201f nop
0x13c98ee6f03c d503201f nop
0x13c98ee6f040 d10043ff sub sp, sp, #0x10 (16)
0x13c98ee6f044 f90007fe str x30, [sp, #8]
0x13c98ee6f048 f90003fd str x29, [sp]
0x13c98ee6f04c 910003fd mov x29, sp
I'm mostly worried about code size here, as the size of the prologue will completely dominate the size of most small functions. It's not urgent to fix this but I'd like to understand why it is this large:
- why are 6 out of the 20 instructions NOP?
- what is the garbage at ...f030?
- why does the instruction at ...f02c branch to ...f038 instead of to ...f040?
(Edited for correctness)
Assignee | ||
Comment 1•5 years ago
|
||
This is the x64 prologue, for comparison:
00000000 55 push %rbp
00000001 48 8b ec mov %rsp, %rbp
00000004 66 0f 1f 44 00 00 nopw %ax, (%rax,%rax,1)
0000000A 66 0f 1f 44 00 00 nopw %ax, (%rax,%rax,1)
00000010 41 83 fa 63 cmp $0x63, %r10d
00000014 0f 84 0a 00 00 00 jz 0x0000000000000024
0000001A 0f 0b ud2
0000001C 0f 1f 40 00 nopl %eax, (%rax)
00000020 55 push %rbp
00000021 48 8b ec mov %rsp, %rbp
This is also large (36 bytes) and has a large NOP payload (16 bytes). But after the signature check it does not have the same branch nest as ARM64 does.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 2•4 years ago
|
||
This is because in wasm::GenerateFunctionPrologue:
GenerateCallablePrologue(masm, &dummy);
EnsureOffset(masm, offsets->begin, WasmCheckedTailEntryOffset);
along with, on ARM64,
static constexpr uint32_t WasmCheckedTailEntryOffset = 32u;
while on other platforms the value of WasmCheckedTailEntryOffset is 16 (which also seems high as explained above). There's a comment in Assembler-arm64.h that says (for this and the WasmCheckedCallEntryOffset) that
// The offsets are dynamically asserted during
// code generation in the prologue/epilogue.
but that is a lie for WasmCheckedTailEntryOffset, it is never checked anywhere.
Setting the WasmCheckedTailEntryOffset to 16 does not break any tests anywhere. Dmitry's current patch queue (bug 1639153) does not seem to depend on the value 32, or indeed the value 16 for the other platforms, all we want is CodeAlignment so that pointers into the code at that point are valid in other contexts.
Assignee | ||
Comment 3•4 years ago
•
|
||
As for the rest of the code, first we have a jump across a pool followed by a constant pool:
0x13c98ee6f02c 14000003 b #+0xc (addr 0x13c98ee6f038) // jump across pool
0x13c98ee6f030 ffff0001 unallocated (Unallocated) // pool header
0x13c98ee6f034 14000007 b #+0x1c (addr 0x13c98ee6f050) // branch veneer
This is a mess, because the pool is really "empty" except for the branch veneer for the signature check's branch, in case that branch turns out to be long. But we know statically that it won't be long, so if we could declare that somehow then the pool would be empty and we'd get nothing here.
Finally there's CodeAlignment:
0x13c98ee6f038 d503201f nop
0x13c98ee6f03c d503201f nop
Getting rid of the pool will thus remove four words: the pool + one of the nops since only one nop will be needed for CodeAlignment.
(If we later change traps so that we branch out-of-line to traps then some of this may come back; on the other hand, there are other ways to skin that cat.)
Assignee | ||
Comment 4•4 years ago
•
|
||
I guess the only reason the pool is emitted is because we force a flush when we emit the prologue; otherwise we'd get to the branch target and the pending branch would be retired.
I believe that the flushing is not necessary for correctness per se; the effect we want is that the pool does not end up in the middle of the second (UncheckedEntry) prologue. The code that inserted the flush is fairly old, see patch on bug 1350550. It is not well motivated in that bug, only the flush before the first (Checked) prologue is clearly motivated.
The comment in GenerateFunctionPrologue:
// The checked entries might have generated a small constant pool in case of
// immediate comparison.
masm.flushBuffer();
is also incorrect since as we've seen the branch will always cause the pool to be created. So this would have been a regression code-size wise, but we don't test that, which is bad.
Disabling the flush by commenting it out and running jit-tests results in no failures (ARM64), and the prologue code is now tight. At the same time, we want something stronger here to make sure the pool is not flushed in the middle of the prologue and/or to assert if that happens. GenerateCallablePrologue has an AutoForbidPoolsAndNops for ARM64 (and ARM), which is probably good enough, but if it is it requires some heavy commenting...
Assignee | ||
Comment 5•4 years ago
•
|
||
After some discussions with Dmitry, the conclusion is that the checked tail offset should be the smallest value that has legal branch target alignment and accomodates the largest callable prologue on the platform. Effectively, this means the tail offset is the size of the largest callable prologue: 3 on x86, 4 on x64, 12 on arm (should be 8 - subject of bug 1705361), 16 on arm64 and mips64. I will document this and offer a patch that fixes this problem in isolation across all platforms.
Assignee | ||
Comment 6•4 years ago
|
||
It turns out that WasmCheckedTailEntryOffset -- which defines an entry
point for a jump from a trampoline that needs to perform a signature
check but has already allocated a frame header, and which may require
some nop-fill in the function prologue so as to place the entry point
at a constant offset -- only needs to be large enough to accomodate
every possible callable prologue (architecture-dependent). The value
can therefore be made smaller than it currently is on x86, x64, arm,
and arm64, leading to more compact prologues on those platforms,
conserving code size.
In addition, the code in GenerateFunctionPrologue and associated
functions that uses this constant can be tidied up a little bit and
commented better.
Assignee | ||
Comment 7•4 years ago
|
||
Also the ARM64 prologue should be able to use STP to store the registers because they are stored to adjacent cells in memory, and the epilogue should be able to use LDP. Since the frame is 16 bytes it may be possible to use increment/decrement with writeback too. Need to investigate this further. Also see bug 1705361 regarding the same optimization on ARM.
Assignee | ||
Comment 8•4 years ago
|
||
When the flushing of the buffer before the prologue was inserted (bug
1350550), a flush was also added before the secondary unchecked call
entry prologue. But this flush causes code bloat on ARM and ARM64 (as
explained in a comment in the code) and it is not necessary because
the pool will never need to be spilled until after the prologue code
has been emitted completely: we know what's in the pool and we know no
references into the pool will go out of range until later.
So remove the flush and add an assertion that checks this assumption,
resulting in tighter code in the prologue.
Depends on D112173
Comment 10•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/73bc646d7cdd
https://hg.mozilla.org/mozilla-central/rev/c6c9c13c12e1
Description
•