Infer float/int personality for generic wasm simd operations
Categories: Core :: JavaScript: WebAssembly, enhancement, P5
People: Reporter: lth, Unassigned
References: Depends on 1 open bug, Blocks 1 open bug
Attachments: 2 files
On the Intel architecture there are separate functional units for float operations and packed-integer operations in the FPU, with a penalty (I believe one cycle) for transferring data from one unit to the other. Thus we'll see the best performance if operations that could happen in either unit are done in the unit where the data will be needed next. For example, v128.load can be lowered into instructions that load into the integer unit or the float unit; when we shuffle 32-bit data, we can shuffle using pshufd or shufps. At the moment we use integer instructions for everything "generic", just for uniformity. But we can do better.
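A hypothetical C++ sketch of that choice (none of these names are SpiderMonkey's; it only illustrates the tradeoff): the same logical 32-bit lane shuffle has an encoding in each execution domain, and picking the one that matches the consumer avoids the bypass delay.

#include <cstdio>

enum class Domain { Int, Float };

// Choose the x86 shuffle encoding by the domain of the instruction that will
// consume the result. Note that shufps acts as a one-input shuffle like
// pshufd when the same register is passed as both sources.
const char* shuffle32x4(Domain consumer) {
  return consumer == Domain::Int ? "pshufd"   // integer-unit shuffle
                                 : "shufps";  // float-unit shuffle
}

int main() {
  std::printf("result feeds paddd -> %s\n", shuffle32x4(Domain::Int));
  std::printf("result feeds addps -> %s\n", shuffle32x4(Domain::Float));
  return 0;
}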
Specifically, it should be possible for us to propagate information in at least the LIR graph so that code generation can pick the best instruction. This is not so hard, since almost everything is in a SIMD register anyway; we don't have to worry about int vs float registers.
(Going to block this on the aligned-data one so as to have the best possible basis on which to judge whether this is worth the bother.)
Comment 1 (Reporter) • 4 years ago
Some notes from elsewhere:
It may be possible to back-propagate information from type-specific uses to generic definitions. For example, a load is biased in favor of an integer-unit load, but by looking at the uses of that load we can determine whether it ought to be a float-unit load. We need not do this before lowering, we just need to do it before code generation; so if the lowering of one node can set attributes on its input nodes, and the codegen for the input node can use those attributes, then we should be OK.
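As a sketch of that idea, with entirely made-up types (Ion's real LIR classes look nothing like this): after lowering, walk the generic definitions, and if every use of one is float-biased, mark the definition float-biased so that code generation picks float-unit encodings for it.

#include <vector>

enum class Bias { None, Int, Float };

struct Node {
  Bias bias = Bias::None;    // None means "generic", e.g. v128.load
  std::vector<Node*> uses;   // consumers of this node's value
};

// Back-propagate bias from type-specific uses to generic definitions.
void backPropagateBias(std::vector<Node*>& defs) {
  for (Node* def : defs) {
    if (def->bias != Bias::None || def->uses.empty()) {
      continue;  // already type-specific, or nothing to learn from
    }
    bool allFloat = true;
    for (Node* use : def->uses) {
      if (use->bias != Bias::Float) {
        allFloat = false;
        break;
      }
    }
    if (allFloat) {
      def->bias = Bias::Float;  // codegen can now emit movaps/movups etc.
    }
  }
}

int main() {
  Node addps;  addps.bias = Bias::Float;  // float-unit consumer
  Node load;   load.uses = {&addps};      // generic v128.load feeding it
  std::vector<Node*> defs = {&load};
  backPropagateBias(defs);
  // load.bias is now Bias::Float.
  return 0;
}

A production pass would presumably iterate to a fixpoint so that bias also flows through bias-transparent ops like v128.xor (see the grouping in comment 2 below).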
We should choose the most plausible instruction bias where it's trivial to do so: some operations are confused about whether to use float- or int-biased instructions. For example, swizzleInt8x16 uses vmovapd, but the data almost certainly has an integer interpretation there; negInt* uses vmovaps, ditto. There could also be cases where we choose an integer-biased op but should use a float op.
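A trivial version of that "most plausible bias" fix, again with invented names rather than Ion's actual codegen interface, would just key the aligned register-to-register move encoding off the operation's natural data interpretation:

#include <cstdio>

enum class MoveBias { Int, Float };

// Pick the aligned reg-to-reg move whose execution domain matches the data.
const char* alignedMove(MoveBias b) {
  return b == MoveBias::Int ? "vmovdqa"   // integer-domain move
                            : "vmovaps";  // float-domain move
}

int main() {
  // swizzleInt8x16 and negInt* handle integer data, so the integer-domain
  // vmovdqa is the more plausible choice than vmovapd/vmovaps.
  std::printf("%s\n", alignedMove(MoveBias::Int));
  return 0;
}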
Comment 2 • 3 years ago
Looks like there are several groups of instructions:
- ones that load a value into a register, such as v128.const, load32_zero, load64_splat, etc.;
- ones that pass the "bias" through, such as v128.xor, i8x16.shuffle, v128.bitselect, etc.;
- and the rest of the SIMD instructions, which determine the integer or float unit that will be used with a particular operand.
It is all platform-specific too, so the solution will perhaps have to be limited to x86/x64 codegen. (A sketch of the grouping follows.)
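Here is that grouping as code, dispatching on opcode names as strings purely for illustration (a real pass would key off Ion's MIR/LIR opcodes, not strings):

#include <string>

enum class BiasKind {
  Source,       // materializes a value: v128.const, load32_zero, load64_splat
  Transparent,  // forwards its operands' bias: v128.xor, i8x16.shuffle, ...
  Sink,         // pins an execution unit: f32x4.add (float), i32x4.add (int)
};

BiasKind classify(const std::string& op) {
  if (op == "v128.const" || op == "load32_zero" || op == "load64_splat") {
    return BiasKind::Source;
  }
  if (op == "v128.xor" || op == "i8x16.shuffle" || op == "v128.bitselect") {
    return BiasKind::Transparent;
  }
  return BiasKind::Sink;  // everything else dictates int or float itself
}

int main() {
  (void)classify("v128.const");  // Source
  (void)classify("v128.xor");    // Transparent
  (void)classify("f32x4.add");   // Sink
  return 0;
}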
Comment 3 (Reporter) • 3 years ago
Also, I worry about whether this really matters for a compiler at Ion's level of sophistication. For a really good compiler that unrolls loops and generates highly pipelined computations (like the inner loop of a matrix multiply) this may make a noticeable difference. For our compiler, which currently has some real issues with spurious moves and not-very-well-optimized instruction selection, there may be enough going on that these latencies are hidden by the out-of-order handling of other instructions in the neighborhood.
An interesting experiment might be to look at a couple of inner loops, say from matrix multiply and mandelbrot, try to hack in a few changes in Ion that make int/float instruction selection good for those loops, and then measure. When I tried a similar experiment for bug 1641570 (align SIMD data) it was hard to detect any improvement in practice. That doesn't disqualify it, but it means the fruit is probably hanging fairly high.
Comment 4 • 3 years ago
(after quick'n'dirty hacking)
what was:
00000030 41 83 7e 40 00 cmpl $0x00, 0x40(%r14)
00000035 0f 85 2e 00 00 00 jnz 0x0000000000000069
0000003B 66 0f 6f 0d 7d 00 00 00 movdqax 0x00000000000000C0, %xmm1
00000043 c4 c1 78 10 57 10 vmovupsx 0x10(%r15), %xmm2
00000049 66 0f 62 ca punpckldq %xmm2, %xmm1
0000004D c5 f9 ef c9 vpxor %xmm1, %xmm0, %xmm1
00000051 c5 f9 6f d3 vmovdqa %xmm3, %xmm2
00000055 0f 58 d1 addps %xmm1, %xmm2
00000058 c5 f9 6f da vmovdqa %xmm2, %xmm3
0000005C 83 e8 01 sub $0x01, %eax
0000005F 85 c0 test %eax, %eax
00000061 75 cd jnz 0x0000000000000030
became:
00000030 41 83 7e 40 00 cmpl $0x00, 0x40(%r14)
00000035 0f 85 2c 00 00 00 jnz 0x0000000000000067
0000003B 0f 28 0d 7e 00 00 00 movapsx 0x00000000000000C0, %xmm1
00000042 c4 c1 78 10 57 10 vmovupsx 0x10(%r15), %xmm2
00000048 0f 14 ca unpcklps %xmm2, %xmm1
0000004B c5 f8 57 c9 vxorps %xmm1, %xmm0, %xmm1
0000004F c5 f8 28 d3 vmovaps %xmm3, %xmm2
00000053 0f 58 d1 addps %xmm1, %xmm2
00000056 c5 f9 6f da vmovdqa %xmm2, %xmm3
0000005A 83 e8 01 sub $0x01, %eax
0000005D 85 c0 test %eax, %eax
0000005F 75 cf jnz 0x0000000000000030
Notice the changes around vpxor/vxorps, vmovdqa/vmovaps, and punpckldq/unpcklps. These replacements give about a 9% performance win, so it looks like this improvement is worth pursuing.
Comment 5 • 3 years ago
Comment 6 • 3 years ago
While microbenchmarks show promising gains, I also tried to measure "real world" compiler-generated code such as mandelbrot and intgemm. It seems compiler-generated code bases do not gain from this approach, because the operations that would benefit are not used that often in hot code: literal SIMD128 constants, register-allocation moves, bitwise ops. Keeping the WIP patch for further analysis. P5?
Comment 7 (Reporter) • 3 years ago
I agree; we'll put this back on the shelf until register allocation and instruction selection are better.