Closed Bug 1678239 Opened 5 years ago Closed 4 years ago

ARM64 function prologue is very large

Categories

(Core :: JavaScript: WebAssembly, defect, P3)

ARM64
All
defect

Tracking

()

RESOLVED FIXED
90 Branch
Tracking Status
firefox90 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

The prologue is 20 instructions / 80 bytes, this is the prologue for a simple wasm function:

0x13c98ee6f000  d10043ff  sub     sp, sp, #0x10 (16)
0x13c98ee6f004  f90007fe  str     x30, [sp, #8]
0x13c98ee6f008  f90003fd  str     x29, [sp]
0x13c98ee6f00c  910003fd  mov     x29, sp
0x13c98ee6f010  d503201f  nop
0x13c98ee6f014  d503201f  nop
0x13c98ee6f018  d503201f  nop
0x13c98ee6f01c  d503201f  nop
0x13c98ee6f020  71018d5f  cmp     w10, #0x63 (99)
0x13c98ee6f024  54000160  b.eq    #+0x2c (addr 0x13c98ee6f050)
0x13c98ee6f028  d4a00000  unimplemented (Exception)
0x13c98ee6f02c  14000003  b       #+0xc (addr 0x13c98ee6f038)
0x13c98ee6f030  ffff0001  unallocated (Unallocated)
0x13c98ee6f034  14000007  b       #+0x1c (addr 0x13c98ee6f050)
0x13c98ee6f038  d503201f  nop
0x13c98ee6f03c  d503201f  nop
0x13c98ee6f040  d10043ff  sub     sp, sp, #0x10 (16)
0x13c98ee6f044  f90007fe  str     x30, [sp, #8]
0x13c98ee6f048  f90003fd  str     x29, [sp]
0x13c98ee6f04c  910003fd  mov     x29, sp

I'm mostly worried about code size here, as the size of the prologue will completely dominate the size of most small functions. It's not urgent to fix this but I'd like to understand why it is this large:

  • why are 6 out of the 20 instructions NOP?
  • what is the garbage at ...f030?
  • why does the instruction at ...f02c branch to ...f038 instead of to ...f040?

(Edited for correctness)

This is the x64 prologue, for comparison:

00000000  55                        push %rbp
00000001  48 8b ec                  mov %rsp, %rbp
00000004  66 0f 1f 44 00 00         nopw %ax, (%rax,%rax,1)
0000000A  66 0f 1f 44 00 00         nopw %ax, (%rax,%rax,1)
00000010  41 83 fa 63               cmp $0x63, %r10d
00000014  0f 84 0a 00 00 00         jz 0x0000000000000024
0000001A  0f 0b                     ud2
0000001C  0f 1f 40 00               nopl %eax, (%rax)
00000020  55                        push %rbp
00000021  48 8b ec                  mov %rsp, %rbp

This is also large (36 bytes) and has a large NOP payload (16 bytes). But after the signature check it does not have the same branch nest as ARM64 does.

Assignee: nobody → lhansen
Status: NEW → ASSIGNED

This is because in wasm::GenerateFunctionPrologue:

  GenerateCallablePrologue(masm, &dummy);
  EnsureOffset(masm, offsets->begin, WasmCheckedTailEntryOffset);

along with, on ARM64,

static constexpr uint32_t WasmCheckedTailEntryOffset = 32u;

while on other platforms the value of WasmCheckedTailEntryOffset is 16 (which also seems high as explained above). There's a comment in Assembler-arm64.h that says (for this and the WasmCheckedCallEntryOffset) that

// The offsets are dynamically asserted during
// code generation in the prologue/epilogue.

but that is a lie for WasmCheckedTailEntryOffset, it is never checked anywhere.

Setting the WasmCheckedTailEntryOffset to 16 does not break any tests anywhere. Dmitry's current patch queue (bug 1639153) does not seem to depend on the value 32, or indeed the value 16 for the other platforms, all we want is CodeAlignment so that pointers into the code at that point are valid in other contexts.

As for the rest of the code, first we have a jump across a pool followed by a constant pool:

0x13c98ee6f02c  14000003  b       #+0xc (addr 0x13c98ee6f038)  // jump across pool
0x13c98ee6f030  ffff0001  unallocated (Unallocated)            // pool header
0x13c98ee6f034  14000007  b       #+0x1c (addr 0x13c98ee6f050) // branch veneer

This is a mess, because the pool is really "empty" except for the branch veneer for the signature check's branch, in case that branch turns out to be long. But we know statically that it won't be long, so if we could declare that somehow then the pool would be empty and we'd get nothing here.

Finally there's CodeAlignment:

0x13c98ee6f038  d503201f  nop
0x13c98ee6f03c  d503201f  nop

Getting rid of the pool will thus remove four words: the pool + one of the nops since only one nop will be needed for CodeAlignment.

(If we later change traps so that we branch out-of-line to traps then some of this may come back; on the other hand, there are other ways to skin that cat.)

I guess the only reason the pool is emitted is because we force a flush when we emit the prologue; otherwise we'd get to the branch target and the pending branch would be retired.

I believe that the flushing is not necessary for correctness per se; the effect we want is that the pool does not end up in the middle of the second (UncheckedEntry) prologue. The code that inserted the flush is fairly old, see patch on bug 1350550. It is not well motivated in that bug, only the flush before the first (Checked) prologue is clearly motivated.

The comment in GenerateFunctionPrologue:

  // The checked entries might have generated a small constant pool in case of
  // immediate comparison.
  masm.flushBuffer();

is also incorrect since as we've seen the branch will always cause the pool to be created. So this would have been a regression code-size wise, but we don't test that, which is bad.

Disabling the flush by commenting it out and running jit-tests results in no failures (ARM64), and the prologue code is now tight. At the same time, we want something stronger here to make sure the pool is not flushed in the middle of the prologue and/or to assert if that happens. GenerateCallablePrologue has an AutoForbidPoolsAndNops for ARM64 (and ARM), which is probably good enough, but if it is it requires some heavy commenting...

After some discussions with Dmitry, the conclusion is that the checked tail offset should be the smallest value that has legal branch target alignment and accomodates the largest callable prologue on the platform. Effectively, this means the tail offset is the size of the largest callable prologue: 3 on x86, 4 on x64, 12 on arm (should be 8 - subject of bug 1705361), 16 on arm64 and mips64. I will document this and offer a patch that fixes this problem in isolation across all platforms.

Blocks: 1705361

It turns out that WasmCheckedTailEntryOffset -- which defines an entry
point for a jump from a trampoline that needs to perform a signature
check but has already allocated a frame header, and which may require
some nop-fill in the function prologue so as to place the entry point
at a constant offset -- only needs to be large enough to accomodate
every possible callable prologue (architecture-dependent). The value
can therefore be made smaller than it currently is on x86, x64, arm,
and arm64, leading to more compact prologues on those platforms,
conserving code size.

In addition, the code in GenerateFunctionPrologue and associated
functions that uses this constant can be tidied up a little bit and
commented better.

Also the ARM64 prologue should be able to use STP to store the registers because they are stored to adjacent cells in memory, and the epilogue should be able to use LDP. Since the frame is 16 bytes it may be possible to use increment/decrement with writeback too. Need to investigate this further. Also see bug 1705361 regarding the same optimization on ARM.

Depends on: 1705438

When the flushing of the buffer before the prologue was inserted (bug
1350550), a flush was also added before the secondary unchecked call
entry prologue. But this flush causes code bloat on ARM and ARM64 (as
explained in a comment in the code) and it is not necessary because
the pool will never need to be spilled until after the prologue code
has been emitted completely: we know what's in the pool and we know no
references into the pool will go out of range until later.

So remove the flush and add an assertion that checks this assumption,
resulting in tighter code in the prologue.

Depends on D112173

Blocks: 1705495
Pushed by lhansen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/73bc646d7cdd Shrink WasmCheckedTailEntryOffset. r=jseward https://hg.mozilla.org/integration/autoland/rev/c6c9c13c12e1 Do not flush the masm buffer in the prologue. r=jseward
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch
See Also: → 1756792
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: