Open Bug 1712896 Opened 4 years ago Updated 4 years ago

ARM64 signature check + code alignment constraints can generate nopfill

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

ARM64
All

People

(Reporter: lth, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

For large immediate signatures, the checking code can turn into this:

0x1729efa0e000  d10043ff  sub     sp, sp, #0x10 (16)
0x1729efa0e004  f90007fe  str     x30, [sp, #8]
0x1729efa0e008  f90003fd  str     x29, [sp]
0x1729efa0e00c  910003fd  mov     x29, sp
0x1729efa0e010  529254f0  mov     w16, #0x92a7
0x1729efa0e014  72a00490  movk    w16, #0x24, lsl #16
0x1729efa0e018  6b10015f  cmp     w10, w16
0x1729efa0e01c  54000120  b.eq    #+0x24 (addr 0x1729efa0e040)
0x1729efa0e020  d4a00000  unimplemented (Exception)
0x1729efa0e024  d503201f  nop
0x1729efa0e028  d503201f  nop
0x1729efa0e02c  d503201f  nop

where the first four instructions are the fixed prologue, followed by the signature check. The wasm code here is:

(module
  (func (param i64) (param i64) (param i64) (param i64) (param i64) (result i64)
    (i64.and (local.get 1) (i64.const 64))))

The nopfill is a result of requiring 16-byte alignment for the unchecked entry. It is possible this could be reduced to 8 bytes. It is also possible that fixing the prologue to use stp will change the calculus for this (bug 1705495). Either way we should try to pay attention to bloat here. It would be better for code size to load the constant pc-relative.
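The cost of the padding follows directly from the alignment requirement. A minimal sketch of the arithmetic (Python; the helper name is mine, not anything in the tree):

```python
def nopfill_count(entry_offset, alignment=16, insn_size=4):
    """Number of nops needed to round the unchecked entry up to `alignment`.

    `entry_offset` is the byte offset at which the signature-check code
    ends, i.e. where the unchecked entry would start without padding.
    """
    pad = (-entry_offset) % alignment
    return pad // insn_size

# In the disassembly above the checked path ends at offset 0x24, so
# 16-byte alignment costs three nops; 8-byte alignment would cost one.
```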

When the signature is no longer representable as a constant, the pointer that represents it is stored in the tls, and we load it relative to the tls. In situations where there are many different signatures, the offset into the tls may overflow the load's immediate field, and we may again need several instructions (at least a couple) to compute the offset, with nopfill resulting. Try this:

(module
  (func (param i64) (param i64) (param i64) (param i64) (param i64) (param i64) (param i64) (param i64) (result i64)
    (i64.and (local.get 1) (i64.const 64))))

In truth, this is probably an issue on x86 too; it's just more obvious on ARM64, since large constants really bloat ARM64 code, and ARM64 code size is more relevant because of mobile.

Looking at TypeIdDesc::immediate, it's clear that while the signature representation is relatively compact, we can do better for code using C/C++-style types, avoiding immediate loads before the signature check. There are several approaches. For the first two, assume there is a single tag bit to distinguish "compact" (0) from "other" (1). ("Other" then becomes the representation we use now, but with one extra tag bit.)

One compact representation admits only i32/i64/f32/f64 for the argument and return types; signals the presence of a return type with a single bit; adds the return type at the end of the array of types so that it doesn't need any bits if the type is absent; limits the length field to 3 bits. Thus we have a shared overhead of five bits + 2 bits per type * (up to 7 argument types and one return type) = maximum 21 bits but more typically 3-4 types, so 11-13 bits, which will sometimes fit in an immediate in the compare and otherwise be a single move immediate to set it up.
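As a sanity check on that bit budget, here is a sketch of the packing in Python (the field order and type codes are my assumptions for illustration, not a spec):

```python
# Compact signature packing, low bits first: tag(1) | has_result(1) |
# length(3) | 2 bits per argument type, with the result type appended
# after the arguments so it costs no bits when absent.
TYPE_CODE = {'i32': 0, 'i64': 1, 'f32': 2, 'f64': 3}

def encode_compact(params, result=None):
    if len(params) > 7 or any(t not in TYPE_CODE for t in params):
        return None  # not representable compactly
    if result is not None and result not in TYPE_CODE:
        return None
    bits = 0                            # tag bit 0 = compact
    bits |= (result is not None) << 1   # return-type-presence bit
    bits |= len(params) << 2            # 3-bit length field
    pos = 5                             # shared overhead: 5 bits
    for t in params:
        bits |= TYPE_CODE[t] << pos
        pos += 2
    if result is not None:
        bits |= TYPE_CODE[result] << pos
    return bits
```

The worst case (seven arguments plus a result) uses 5 + 16 = 21 bits, matching the arithmetic above; the (i64 ×5) → i64 signature from the first module fits in 16 bits.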

Another compact representation has a dictionary of common signatures (computed from a corpus) that maps each signature to an integer; at compile time, the normal typedesc immediate is used as a key to look up a compact code in this dictionary. The tag bit ensures there is no confusion between these compressed signatures and normal ones. We can then arbitrarily limit compact signatures to something that fits in the compare instruction. Some signatures will not be assigned a value and will end up represented with the normal immediate. This approach is attractive anyway because it is not limited to a specific C/C++ subset of types; it applies to all types and does not discriminate.
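A sketch of the dictionary scheme, assuming tag bit 0 means "compact dictionary index" and tag bit 1 means "full immediate as today" (names and widths are illustrative):

```python
def compress_signature(immediate, dictionary):
    """Map a full typedesc immediate to a tagged value at compile time.

    `dictionary` maps common full immediates (from a corpus) to small
    integers; anything it misses keeps the normal representation.
    """
    idx = dictionary.get(immediate)
    if idx is not None and idx < (1 << 11):
        return (idx << 1) | 0    # tag 0: compact, fits a small cmp immediate
    return (immediate << 1) | 1  # tag 1: the representation used today
```

The `1 << 11` bound is one plausible choice: with the tag bit, the compact value then fits the 12-bit unsigned immediate field of an ARM64 cmp.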

A third idea is that signatures that are already encodable in a small immediate under the existing system should be left alone, and we could bias the system in favor of signatures that are common but not encodable that way; this effectively increases the reach of the dictionary approach, for example.
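That bias could be applied when the dictionary is built: skip signatures that already encode cheaply and spend the dictionary slots on the rest. A sketch under the same assumptions (`fits_small_immediate` is a hypothetical predicate):

```python
def build_dictionary(corpus_counts, fits_small_immediate, capacity=2048):
    """Assign compact indices to the most frequent signatures that do NOT
    already fit a small immediate under the existing encoding.

    `corpus_counts` maps full signature immediates to occurrence counts.
    """
    candidates = sorted(
        (sig for sig in corpus_counts if not fits_small_immediate(sig)),
        key=lambda sig: corpus_counts[sig],
        reverse=True,
    )
    return {sig: i for i, sig in enumerate(candidates[:capacity])}
```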

Before undertaking this work, we should try to get data (via a corpus analysis or just a search of the type space) on whether we stand to gain much in practice.
