Cranelift: Support jump tables
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox69 | --- | fixed |
People
(Reporter: bbouvier, Assigned: lth)
References
Details
Attachments
(1 file, 2 obsolete files)
Currently, Cranelift will emit ifs for jump tables in Spidermonkey. This is not ideal for performance or code size. However, Cranelift already knows how to emit jump tables; the code doesn't know how to pass readonly data from Cranelift to Spidermonkey yet.
| Reporter | ||
Updated•6 years ago
|
| Assignee | ||
Updated•6 years ago
|
| Assignee | ||
Comment 1•6 years ago
|
||
Test case that crashes with illegal instruction:
var bin = wasmTextToBinary(
`(module
(func (export "f") (param $n i32) (result i32)
(block $A
(block $B
(block $C
(block $D
(block $E
(block $F (br_table $A $B $C $D $E $F (local.get $n))
(return (i32.const 12)))
(return (i32.const 10)))
(return (i32.const 8)))
(return (i32.const 6)))
(return (i32.const 4)))
(return (i32.const 2)))
(return (i32.const 0))))`);
var mod = new WebAssembly.Module(bin);
var ins = new WebAssembly.Instance(mod);
for ( var i=0; i < 10; i++ )
print(ins.exports.f(i));
Here's the disassembly from lldb:
0x1d09cf064020: 41 56 pushq %r14
0x1d09cf064022: 55 pushq %rbp
0x1d09cf064023: 48 8b ec movq %rsp, %rbp
0x1d09cf064026: 48 83 ec 08 subq $0x8, %rsp
0x1d09cf06402a: 83 ff 05 cmpl $0x5, %edi
0x1d09cf06402d: 73 10 jae 0x1d09cf06403f ;; skip to default if out of range
0x1d09cf06402f: 48 8d 05 33 00 00 00 leaq 0x33(%rip), %rax ;; table addr 0x...69 in rax
0x1d09cf064036: 48 63 0c b8 movslq (%rax,%rdi,4), %rcx ;; table element to rcx, sign extended
0x1d09cf06403a: 48 01 c8 addq %rcx, %rax ;; add table element to table address
0x1d09cf06403d: ff e0 jmpq *%rax ;; and jump to it
0x1d09cf06403f: b8 0a 00 00 00 movl $0xa, %eax ;; default case
0x1d09cf064044: eb 21 jmp 0x1d09cf064067
0x1d09cf064046: b8 08 00 00 00 movl $0x8, %eax
0x1d09cf06404b: eb 1a jmp 0x1d09cf064067
0x1d09cf06404d: b8 06 00 00 00 movl $0x6, %eax
0x1d09cf064052: eb 13 jmp 0x1d09cf064067
0x1d09cf064054: b8 04 00 00 00 movl $0x4, %eax
0x1d09cf064059: eb 0c jmp 0x1d09cf064067
0x1d09cf06405b: b8 02 00 00 00 movl $0x2, %eax
0x1d09cf064060: eb 05 jmp 0x1d09cf064067
0x1d09cf064062: b8 00 00 00 00 movl $0x0, %eax
0x1d09cf064067: 89 c0 movl %eax, %eax ;; return sequence starts here
0x1d09cf064069: 0xfffffff9 ;; oops, table
0x1d09cf06406d: 0xfffffff2
0x1d09cf064071: 0xffffffeb
0x1d09cf064075: 0xffffffe4
0x1d09cf064079: 0xffffffdd
0x1d09cf06407d: 4c 8b 74 24 10 movq 0x10(%rsp), %r14 ;; return sequence continues here
0x1d09cf064082: 4d 8b 3e movq (%r14), %r15
0x1d09cf064085: 48 83 c4 08 addq $0x8, %rsp
0x1d09cf064089: 5d popq %rbp
0x1d09cf06408a: 41 5e popq %r14
0x1d09cf06408c: c3 retq
The problem is that the jump table is not at the end of the function where it should be, but at the end of the function body before the epilogue. If it hadn't been for that, this would have been fine.
Presumably this is because the pasteup of the code is not aware of the epilogue, which is added after the fact.
(What's below may not be important)
When emit_inst() for the indirect jump encodes the load from the table, it gets the offset wrong in jt_disp4(). Backtrace:
frame #0: js`cranelift_codegen::isa::x86::binemit::jt_disp4::h0ce918d698c8dcf4(jt=(__0 = 0),...) at binemit.rs:341:4
frame #1: js`cranelift_codegen::isa::x86::binemit::emit_inst::h5f31c66d95a9a4c9(...) at binemit-x86.rs:3622:16
frame #2: js`core::ops::function::Fn::call::hd4e75944fb12c8a7(...) at function.rs:69:4
frame #3: js`cranelift_codegen::binemit::emit_function::hfa4c505b876bb8b4(...) at mod.rs:126:12
frame #4: js`_$LT$cranelift_codegen..isa..x86..Isa$u20$as$u20$cranelift_codegen..isa..TargetIsa$GT$\
::emit_function_to_memory::h615630a7623257c6(...) at mod.rs:132:8
| Reporter | ||
Comment 2•6 years ago
|
||
That's correct. Some ways to fix this:
- stop using the MemoryCodeSink in Spidermonkey and have our own CodeSink impl, which will emit code and keep the readonly data (jump table entries) separately; then the C++ code can request the metadata and put it after the Spidermonkey epilogue.
- split the CodeSink trait into a CodeSink and ReadonlyDataSink; table entries would get emitted by the readonly data sink instead of the code sink.
| Assignee | ||
Comment 3•6 years ago
|
||
relax_branches() in cranelift-codegen/src/binemit/relax-branches.rs computes the final offset of the jump tables and needs to be aware of the prologue, or we need to insert branches around the tables if there are any. Technically relaxation is also a little dodgy because it assumes the code starts at zero, which it does not (because of the prologue that's added later), but I don't see a reason why that's a big deal right now.
Arguably the branch tables should also be laid out on an aligned address so somebody should add some nops when necessary, but it may not matter very much.
| Assignee | ||
Comment 4•6 years ago
|
||
I see -- binemit::emit_function emits the code followed by the rodata. But it doesn't have any hooks for prologue or epilogue. (Clearly if it did have such hooks they would have to be coordinated with relaxation, and maybe with other things.)
CodeSink::begin_rodata() could perform the necessary fill for alignment, but only if it could take into account the sizes of prologue (now) and epilogue (when the epilogue comes before the jump tables).
The prologue and epilogue are added in C++ code, by GenerateCraneliftCode() in WasmCraneliftCompile.cpp, after cranlift is done with everything and we have a blob of raw machine code and rodata. This extra code includes stack frame setup and teardown.
Benjamin's solutions in comment 2 amount to moving more logic up into the C++ layer, letting it coordinate the emission of tables after the epilogue has been handled; it would have to offset the table entries additionally but this would not necessarily be bad, and there would be the option of creating padding for alignment too. It's also a little nasty since the tables will have been offset for the cranelift code size, so they're in an inbetween state.
The other solution appears to be to expose hooks that cranelift can call to make space for / generate raw bytes for prologue and epilogue; this way we do more in cranelift. I don't know how much work this is. One nice thing about it is that cranelift-generated code is always finished, it has everything needed.
| Assignee | ||
Updated•6 years ago
|
| Assignee | ||
Comment 5•6 years ago
|
||
(In reply to Benjamin Bouvier [:bbouvier] from comment #2)
That's correct. Some ways to fix this:
- stop using the MemoryCodeSink in Spidermonkey and have our own CodeSink impl, which will emit code and keep the readonly data (jump table entries) separately; then the C++ code can request the metadata and put it after the Spidermonkey epilogue.
- split the CodeSink trait into a CodeSink and ReadonlyDataSink; table entries would get emitted by the readonly data sink instead of the code sink.
Having looked at this some, the first option looks more appealing - it is a more local fix. The CodeSink needs a couple more methods and we need to duplicate the implementation as well as emit_to_memory() but that's not much of a problem, or doesn't seem to be one so far.
| Assignee | ||
Comment 6•6 years ago
|
||
Having pondered it further, the problem is really this instruction in the disas above: leaq 0x33(%rip), %rax. When this is emitted it must know the offset from the instruction to the jump table. This offset is computed during relaxation. So I think what we really want here is a way for relaxation to know about the gap between code and its rodata, and perhaps know about the sizes of prologues and epilogues. This does not have to be super hard, it could be an additional parameter to compile() and would mirror the CodeSink's begin_rodata method (for example).
| Reporter | ||
Comment 7•6 years ago
|
||
Having pondered it further, the problem is really this instruction in the disas above: leaq 0x33(%rip), %rax.
Good point; it could just use the largest branch to be cautious, that we'd patch later, but that's more fragile than the right thing you're doing.
| Assignee | ||
Comment 8•6 years ago
|
||
Turns out that knowing the sizes of prologues and epilogues when we need to know those sizes (at relaxation time) is not so easy. The prologues and epilogues are generated into the masm before and after we copy in the CL code. They can vary a fair bit. They carry metadata (for traps at least). I'm not convinced that they can be generated early (so that we can know their sizes) and then copied around at will when we need them later, during pasteup.
Really, the cleanest and most forward-looking solution here is some kind of relocation so that we can treat the rodata as movable. We need to emit a relocation at the branch instruction so that we can later patch in the correct pc-relative offset to the table, and then each table must carry some kind of relocation information so that it can be fixed up relative to the location in memory where it is placed. The overhead of this should be manageable. However it does require that we distinguish between code and rodata in the CodeSink so that we can handle them separately.
| Assignee | ||
Comment 9•6 years ago
|
||
I think I have found a simple way to do this. The MemoryCodeSink emits simple relocs for the jump table references; we track the offset within the final CL output where the code ends and the rodata begins; then during pasteup on the C++ side we split the CL output apart, patch up the references in the code part, and fixup the offsets in the rodata part. This requires only a small amount of code on both sides of the fence.
| Reporter | ||
Comment 10•6 years ago
|
||
patch up the references in the code part
Can we do this? It seems that we can end up in a situation where:
- a branch emitted in the code is just below a branch range limit
- we insert the custom epilogue
- now the range between branch and jump table is slightly off the branch range
(Unless we conservatively assume the biggest branch range, as suggested before)
| Assignee | ||
Comment 11•6 years ago
|
||
The addend for computing the branch table base must be four bytes, but it already is that (see disassembly above). And that should be the only constraint in practical terms - the branch table has four-byte entries.
| Assignee | ||
Comment 12•6 years ago
|
||
Spidermonkey side.
| Assignee | ||
Comment 13•6 years ago
|
||
Cranelift side.
| Assignee | ||
Comment 14•6 years ago
|
||
This patch uses the machinery submitted in
https://github.com/CraneStation/cranelift/pull/774 to split the jump
table (and any other read-only data) apart from the compiled code, so
as to make room for the epilogue and proper alignment. The only
difficult part about this is to update the instructions referencing
the tables, and the table entries, when the tables are moved.
| Reporter | ||
Comment 15•6 years ago
|
||
Nice. Did you get some performance numbers that showed improvements in our set of tracked benchmarks? Did any issue related to https://github.com/CraneStation/cranelift/issues/732 show up in our test suite?
| Assignee | ||
Comment 16•6 years ago
|
||
Re performance, I think the mac laptop that I use for benchmarking is a little noisy, but it looks like ~5% win on binarytrees and otherwise nothing interesting for the tracked benchmarks. (Of the untracked ones there was a consistent > 20% gain on primes; the most likely explanation for this is that the implementation of sqrt, which is fairly hot in this program, uses a switch.)
No issues re that cranelift ticket.
| Assignee | ||
Updated•6 years ago
|
| Assignee | ||
Updated•6 years ago
|
Updated•6 years ago
|
Comment 17•6 years ago
|
||
Comment 18•6 years ago
|
||
| bugherder | ||
Description
•