[meta] Optimize call_indirect
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
People
(Reporter: lth, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: meta)
The call_indirect instruction, in its full glory, loads a function from a table (incurring a bounds check on the table) and, if it is not null (incurring a null check), assumes that the call may be cross-module (incurring a context switch) and that the callee is unknown (incurring the cost of an indirect call and a signature check). All these checks can be optimized to some extent, to greater or lesser benefit. This is the top metabug to track all these optimizations and related ones. It will block bug reports about how our indirect calls are slower than those of asm.js or those of other browsers.
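For orientation, here is a minimal C++ sketch of that sequence of checks; the names and types (FuncTableEntry, TlsData, prepareCallIndirect) are illustrative only and do not correspond to the actual SpiderMonkey code:

// Illustrative sketch only; names and types are hypothetical, not SpiderMonkey's.
#include <cstdint>
#include <cstdlib>

struct TlsData {};                  // stand-in for per-instance data

struct FuncTableEntry {
  void*    code;                    // callee entry point
  TlsData* tls;                     // callee instance; null if the entry is unset
};

[[noreturn]] static void trap(const char*) { std::abort(); }
static void switchContext(TlsData* /*from*/, TlsData* /*to*/) { /* swap realm, heap registers, ... */ }

// Returns the code pointer to call indirectly. The signature check itself is
// performed in the callee's prologue against a signature id set up by the caller.
static void* prepareCallIndirect(FuncTableEntry* table, uint32_t length,
                                 uint32_t index, TlsData* callerTls) {
  if (index >= length) trap("table index out of bounds");    // table bounds check
  FuncTableEntry& e = table[index];
  if (!e.tls) trap("uninitialized table entry");             // null check
  if (e.tls != callerTls) switchContext(callerTls, e.tls);   // cross-module context switch
  return e.code;                                             // unknown callee: indirect call
}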
Of particular interest here is bug 1340235 comment 15 et seq, which breaks down some of the work.
Once bug 1639153 lands (imminent, we hope), the most important work will be removing the null check and cleaning up the bounds check to let the normal BCE machinery go to work. It will probably also be important not to make a purely static determination about whether a table is exposed to other instances, but to allow the table to be downgraded from unexposed to exposed at runtime; this will allow more tables to remain in the unexposed state, and calls not to go through the indirect thunks.
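A minimal sketch of the dynamic downgrade idea, with hypothetical names (Table, shareWithOtherInstance) that are illustrative only, not a description of the actual implementation:

// Hypothetical sketch of runtime exposure tracking; not the actual SpiderMonkey code.
struct Table {
  bool exposed = false;   // a table starts out unexposed to other instances
  // ... entries, length, etc.
};

// Called the first time the table becomes reachable from another instance
// (for example, when it is exported and then imported elsewhere).
void shareWithOtherInstance(Table& table) {
  if (!table.exposed) {
    table.exposed = true;
    // At this point existing entries would be switched to the indirect-thunk
    // path; until then, calls through the table can use the cheap same-instance path.
  }
}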
There are also representational issues. Since a table-of-funcref is a 16-byte representation, a simple LEA can't be used to load the indirect callee. Getting this down to a one-word representation (which we want for fast call_ref performance anyway) will allow for a faster indirect calling sequence.
Comment 1•3 years ago (Reporter)
(In reply to Lars T Hansen [:lth] from comment #0)
> There are also representational issues. Since a table-of-funcref is a 16-byte representation, a simple LEA can't be used to load the indirect callee. Getting this down to a one-word representation (which we want for fast call_ref performance anyway) will allow for a faster indirect calling sequence.
An LEA (or by extension, any scaled address) can be used on x64 by packing the table differently: all the code pointers together at the beginning, and all the tls pointers together at the end. Now code pointers are 8 bytes apart. This does not have to wait for getting function table entries down to a single word.
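A minimal sketch of that packing, with a hypothetical name (PackedFuncTable) that is illustrative only: with the code pointers stored contiguously, the callee code pointer sits at base + index*8, which fits x86 scaled addressing (scales of 1, 2, 4, and 8), whereas the current 16-byte stride does not.

// Hypothetical split packing; names and layout are illustrative only.
#include <cstdint>

struct TlsData;                       // opaque per-instance data

struct PackedFuncTable {
  void**   base;                      // base[0..length)        : code pointers
                                      // base[length..2*length) : tls pointers
  uint32_t length;

  void* code(uint32_t i) const {      // compiles to one scaled load: mov (base,i,8)
    return base[i];
  }
  TlsData* tls(uint32_t i) const {
    return static_cast<TlsData*>(base[length + i]);
  }
};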
Comment 2•3 years ago (Reporter)
This is what call_indirect to a variable index (which is the common case) looks like now:
;; table bounds check, known table length (the common case)
00000034 41 83 fc 02 cmp $0x02, %r12d
00000038 0f 82 02 00 00 00 jb 0x0000000000000040
0000003E 0f 0b ud2
;; set up signature
00000040 41 ba 03 00 00 00 mov $0x03, %r10d
;; load table pointer and compute index of callee
00000046 49 8b 46 78 movq 0x78(%r14), %rax
0000004A 41 c1 e4 04 shl $0x04, %r12d
0000004E 49 03 c4 add %r12, %rax
;; load callee tls
00000051 48 8b 58 08 movq 0x08(%rax), %rbx
;; compare to caller tls and branch if the same (the common case)
00000055 4c 3b f3 cmp %rbx, %r14
00000058 0f 84 40 00 00 00 jz 0x000000000000009E
;; slow path: context switch. note null check is by NPE
0000005E 4c 89 74 24 08 movq %r14, 0x08(%rsp)
00000063 4c 8b f3 mov %rbx, %r14
00000066 4c 89 34 24 movq %r14, (%rsp)
0000006A 4d 8b 3e movq (%r14), %r15
0000006D 4d 8b 66 20 movq 0x20(%r14), %r12
00000071 49 8b 5e 18 movq 0x18(%r14), %rbx
00000075 49 89 9c 24 a0 00 00 00 movq %rbx, 0xA0(%r12)
0000007D 48 8b 00 movq (%rax), %rax
00000080 ff d0 call %rax
00000082 4c 8b 74 24 08 movq 0x08(%rsp), %r14
00000087 4d 8b 3e movq (%r14), %r15
0000008A 4d 8b 56 20 movq 0x20(%r14), %r10
0000008E 4d 8b 66 18 movq 0x18(%r14), %r12
00000092 4d 89 a2 a0 00 00 00 movq %r12, 0xA0(%r10)
00000099 e9 05 00 00 00 jmp 0x00000000000000A3
;; fast path: no context switch. note no null check required
0000009E 48 8b 00 movq (%rax), %rax
000000A1 ff d0 call %rax
There are several things we could do here. The slow case should be moved out-of-line so that the fast case can just fall through (bug TBD), since forward conditional branches are statically predicted as not-taken. The bounds check should jump out-of-line to the trap (bug 1680243) for the same reason. We could hoist the table pointer load up before the signature setup so that it has more time to execute. And we could hoist the code pointer load up to just after the tls pointer load; this might especially benefit the fast path.
Clearly the shift could be done before the load of the table pointer, and on x64 the table pointer could then be added from memory into the shifted value, provided the latter ends up in the right register. This might reduce code size a little, though in practice code size doesn't seem to be an issue: there aren't that many call_indirect sites (there should tend to be many more callees than callers).
The code pointer load can be folded into the call, if we don't hoist it. This too would reduce code size, but might affect the CPU's ability to start the load early and schedule around it.