Open Bug 1742930 Opened 1 year ago Updated 11 months ago

[meta] Optimize call_indirect

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

People

(Reporter: lth, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: meta)

The call_indirect instruction, in its full glory, loads a function from a table (incurring a bounds check on the table), checks that the entry is not null (incurring a null check), assumes that the call may be cross-module (incurring a context switch), and assumes that the callee is unknown (incurring the cost of an indirect call and a signature check). All of these checks can be optimized to some extent, to greater or lesser benefit. This is the top metabug to track all these optimizations, and related ones. It will block bug reports of how our indirect calls are slower than those of asm.js or those of other browsers.
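The sequence of checks above can be sketched in C. This is a conceptual model only; the struct layout and all names here are illustrative, not SpiderMonkey's actual representation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Instance Instance;  /* opaque; stands in for the "tls" context */

typedef struct {
    void     *code;     /* entry point; NULL for an uninitialized slot */
    Instance *instance; /* owning instance, compared for the context switch */
    uint32_t  sig;      /* signature id checked against the call site */
} FuncEntry;

typedef struct {
    uint32_t   length;
    FuncEntry *entries;
} Table;

/* Returns the code pointer to call, or NULL to signal a trap.
   Sets *needs_context_switch when the callee lives in another instance. */
static void *resolve_call_indirect(Table *t, uint32_t index,
                                   uint32_t expected_sig, Instance *caller,
                                   int *needs_context_switch) {
    if (index >= t->length)              /* table bounds check */
        return NULL;
    FuncEntry *e = &t->entries[index];
    if (e->code == NULL)                 /* null check */
        return NULL;
    if (e->sig != expected_sig)          /* signature check */
        return NULL;
    *needs_context_switch = (e->instance != caller);  /* cross-module? */
    return e->code;                      /* target of the indirect call */
}
```

Each of the four checks in this sketch corresponds to one of the costs listed above, which is what makes the instruction expensive in its fully general form.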

Of particular interest here is bug 1340235 comment 15 et seq, which breaks down some of the work.

Once bug 1639153 lands (imminent, we hope), the most important work will be removing the null check and cleaning up the bounds check to let the normal BCE machinery go to work. It will probably also be important not to make a purely static determination about whether a table is exposed to other instances, but to allow the table to be downgraded from unexposed to exposed at runtime; this will allow more tables to remain in the unexposed state, and calls not to go through the indirect thunks.
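The unexposed-to-exposed transition might look like the following flag sketch, where the names and the mechanism are purely illustrative assumptions, not an existing implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* A table starts unexposed: no other instance can reach it, so calls
   through it can skip the context switch and the indirect thunks.  It is
   downgraded to exposed only when it is actually shared at runtime. */
typedef struct {
    bool exposed;
} TableState;

/* Called when the table becomes reachable from another instance
   (e.g. it is exported).  A real implementation would also have to
   repatch or re-register entries so that cross-instance calls go
   through thunks from this point on. */
static void expose_table(TableState *t) {
    t->exposed = true;
}

static bool calls_need_context_switch(const TableState *t) {
    return t->exposed;
}
```

The point of deferring the transition is exactly what the comment says: most tables are never shared, so most of them stay permanently in the cheap unexposed state.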

There are also representational issues. Since each entry in a table-of-funcref is 16 bytes, a simple LEA can't be used to load the indirect callee. Getting this down to a one-word representation (which we want for fast call_ref performance anyway) will allow for a faster indirect calling sequence.

No longer blocks: 1340235
Blocks: 1711412
No longer depends on: 1711412
Depends on: 1742936
Depends on: 1743581
Depends on: 1743586

(In reply to Lars T Hansen [:lth] from comment #0)

> There are also representational issues. Since each entry in a table-of-funcref is 16 bytes, a simple LEA can't be used to load the indirect callee. Getting this down to a one-word representation (which we want for fast call_ref performance anyway) will allow for a faster indirect calling sequence.

An LEA (or by extension, any scaled address) can be used on x64 by packing the table differently: all the code pointers together at the beginning, and all the tls pointers together at the end. Now code pointers are 8 bytes apart. This does not have to wait for getting function table entries down to a single word.
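The repacking idea is a structure-of-arrays layout. A minimal sketch, with illustrative names (the real table representation is not shown here): instead of interleaving {code, tls} pairs at a 16-byte stride, store all code pointers first and all tls pointers after them, so each array has an 8-byte stride and a single scaled-index address reaches the code pointer.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t length;
    void   **slots;  /* [0..length): code ptrs, [length..2*length): tls ptrs */
} PackedTable;

static inline void *packed_code(PackedTable *t, uint32_t i) {
    return t->slots[i];              /* one scaled load: base + i*8 */
}

static inline void *packed_tls(PackedTable *t, uint32_t i) {
    return t->slots[t->length + i];  /* base + (length + i)*8 */
}
```

With this layout `packed_code` maps directly onto an x64 scaled addressing mode (`mov (%base,%index,8), %reg`), which is the point of the comment above: the 8-byte stride is available without first shrinking entries to a single word.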

Depends on: 1743756
Depends on: 1743760
Depends on: 1744513
Depends on: 1744711
Depends on: 1744943
Depends on: 1745170
Depends on: 1745939
Depends on: 1748794
No longer blocks: 1711412
No longer blocks: 1340235
Depends on: 1340235
No longer blocks: 1340106
Depends on: 1340106
Blocks: 1340235
No longer depends on: 1340235
Depends on: 1754377
Blocks: wasm-lang
No longer depends on: 1742832
No longer depends on: 1743581
No longer depends on: 1744513
No longer depends on: 1743756

This is what call_indirect to a variable index (which is the common case) looks like now:

;; table bounds check, known table length (the common case)
00000034  41 83 fc 02               cmp $0x02, %r12d
00000038  0f 82 02 00 00 00         jb 0x0000000000000040
0000003E  0f 0b                     ud2
;; set up signature
00000040  41 ba 03 00 00 00         mov $0x03, %r10d
;; load table pointer and compute address of callee's table entry
00000046  49 8b 46 78               movq 0x78(%r14), %rax
0000004A  41 c1 e4 04               shl $0x04, %r12d
0000004E  49 03 c4                  add %r12, %rax
;; load callee tls
00000051  48 8b 58 08               movq 0x08(%rax), %rbx
;; compare to caller tls and branch if the same (the common case)
00000055  4c 3b f3                  cmp %rbx, %r14
00000058  0f 84 40 00 00 00         jz 0x000000000000009E
;; slow path: context switch.  note null check is by NPE
0000005E  4c 89 74 24 08            movq %r14, 0x08(%rsp)
00000063  4c 8b f3                  mov %rbx, %r14
00000066  4c 89 34 24               movq %r14, (%rsp)
0000006A  4d 8b 3e                  movq (%r14), %r15
0000006D  4d 8b 66 20               movq 0x20(%r14), %r12
00000071  49 8b 5e 18               movq 0x18(%r14), %rbx
00000075  49 89 9c 24 a0 00 00 00   movq %rbx, 0xA0(%r12)
0000007D  48 8b 00                  movq (%rax), %rax
00000080  ff d0                     call %rax
00000082  4c 8b 74 24 08            movq 0x08(%rsp), %r14
00000087  4d 8b 3e                  movq (%r14), %r15
0000008A  4d 8b 56 20               movq 0x20(%r14), %r10
0000008E  4d 8b 66 18               movq 0x18(%r14), %r12
00000092  4d 89 a2 a0 00 00 00      movq %r12, 0xA0(%r10)
00000099  e9 05 00 00 00            jmp 0x00000000000000A3
;; fast path: no context switch.  note no null check required
0000009E  48 8b 00                  movq (%rax), %rax
000000A1  ff d0                     call %rax

There are a couple of things we could do here. The slow case should go out-of-line so that the fast case can just fall through (bug TBD), since forward conditional branches are statically predicted as not-taken. The bounds check should jump out-of-line to the trap (bug 1680243) for the same reason. We could hoist the table pointer load up before the signature setup so that it has more time to execute. And we could hoist the code pointer load up to just after the tls pointer load; this might benefit the fast path especially.

Clearly the shift could be done before the load of the table pointer, and on x64 the table pointer could then be added from memory into the shifted value, if the latter is in the right place. This might reduce code size a little, though in practice code size doesn't seem to be an issue: there aren't that many call_indirect sites (there should tend to be many more callees than callers).

The code pointer load can be folded into the call, if we don't hoist it. This too would reduce code size but might affect the CPU's ability to schedule things.

Priority: P2 → P3
Depends on: 1680243
Depends on: 1755625
No longer depends on: 1680243