Open Bug 1641589 Opened 5 years ago Updated 3 years ago

Infer float/int personality for generic wasm simd operations

Categories

(Core :: JavaScript: WebAssembly, enhancement, P5)

x86_64
All
enhancement


People

(Reporter: lth, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(2 files)

On the Intel architecture there are separate functional units for float operations and packed-int operations in the FPU, with a penalty (I believe one cycle) for transferring data from one unit to the other. Thus we'll see the best performance if operations that could happen in either unit are done in the unit where the data will be needed next. For example, v128.load can be lowered into instructions that load into the integer unit or the float unit; when we shuffle 32-bit data, we can shuffle using a pshufd or a shufps. At the moment we use integer instructions for everything "generic", just for uniformity. But we can do better.

Specifically, it should be possible for us to propagate information in at least the LIR graph so that code generation can pick the best instruction. This is not so hard since almost everything is using a SIMD register anyway; we don't have to worry about int vs float registers.

(Going to block this on the aligned-data one so as to have the best possible basis on which to judge whether this is worth the bother.)

Some notes from elsewhere:

It may be possible to back-propagate information from type-specific uses to generic definitions. For example, a load is biased in favor of an integer-unit load, but by looking at the uses of that load we can determine whether it ought to be a float-unit load. We need not do this before lowering, just before code generation; so if the lowering of one node can set attributes on its input nodes, and the codegen for the input node can use those attributes, then we should be OK.

We should choose the most plausible instruction bias where it's trivial to do so: some operations are confused about whether to use float- or int-biased instructions. For example, swizzleInt8x16 uses vmovapd, but the data almost certainly has an integer interpretation here; negInt* similarly uses vmovaps. There could also be cases where we choose an integer-biased op but should use a float op.

Looks like there are several groups of instructions:

  • ones that load a value into the register, such as v128.const, load32_zero, load64_splat, etc.
  • ones that pass the "bias" through: v128.xor, i8x16.shuffle, v128.bitselect, etc.
  • and the rest of the SIMD instructions, which determine the integer or float unit to be used with a particular operand.

It is all platform-specific too, so the solution will perhaps be limited to x86/x64 codegen.

Also, I worry about whether this really matters for a compiler at Ion's level of sophistication. For a really good compiler which unrolls loops and generates highly pipelined computations (like for the inner loop of a matrix multiply) this may make a noticeable difference. For our compiler, which currently has some real issues with spurious moves and not-very-well optimized instruction selection, there may be enough things going on that these latencies are hidden in the OOO handling of other instructions in the neighborhood.

An interesting experiment might be to look at a couple of inner loops, such as matrix multiply and mandelbrot, say, and try to hack in a few changes in Ion that make int/float instruction selection good for those loops, and then measure. When I tried a similar experiment for bug 1641570 (align SIMD data) it was hard to detect any improvement in practice. That doesn't disqualify it, but it means the fruit is probably hanging fairly high.

(after quick'n'dirty hacking)

what was:

00000030  41 83 7e 40 00            cmpl $0x00, 0x40(%r14)
00000035  0f 85 2e 00 00 00         jnz 0x0000000000000069
0000003B  66 0f 6f 0d 7d 00 00 00   movdqax 0x00000000000000C0, %xmm1
00000043  c4 c1 78 10 57 10         vmovupsx 0x10(%r15), %xmm2
00000049  66 0f 62 ca               punpckldq %xmm2, %xmm1
0000004D  c5 f9 ef c9               vpxor %xmm1, %xmm0, %xmm1
00000051  c5 f9 6f d3               vmovdqa %xmm3, %xmm2
00000055  0f 58 d1                  addps %xmm1, %xmm2
00000058  c5 f9 6f da               vmovdqa %xmm2, %xmm3
0000005C  83 e8 01                  sub $0x01, %eax
0000005F  85 c0                     test %eax, %eax
00000061  75 cd                     jnz 0x0000000000000030

became:

00000030  41 83 7e 40 00            cmpl $0x00, 0x40(%r14)
00000035  0f 85 2c 00 00 00         jnz 0x0000000000000067
0000003B  0f 28 0d 7e 00 00 00      movapsx 0x00000000000000C0, %xmm1
00000042  c4 c1 78 10 57 10         vmovupsx 0x10(%r15), %xmm2
00000048  0f 14 ca                  unpcklps %xmm2, %xmm1
0000004B  c5 f8 57 c9               vxorps %xmm1, %xmm0, %xmm1
0000004F  c5 f8 28 d3               vmovaps %xmm3, %xmm2
00000053  0f 58 d1                  addps %xmm1, %xmm2
00000056  c5 f9 6f da               vmovdqa %xmm2, %xmm3
0000005A  83 e8 01                  sub $0x01, %eax
0000005D  85 c0                     test %eax, %eax
0000005F  75 cf                     jnz 0x0000000000000030

Notice the changes around vpxor/vxorps, vmovdqa/vmovaps, and punpckldq/unpcklps. These replacements give about a 9% performance win, so it looks like this improvement is worth pursuing.

Assignee: nobody → ydelendik

While microbenchmarks show promising gains, I tried to measure "real world" compiler-generated code such as mandelbrot and intgemm. It seems compiler-generated code bases do not gain from this approach because the operations that would benefit are not used that often in hot code: literal SIMD128 constants, register allocation moves, bitwise ops. Keeping the WIP patch around for further analysis. P5?

Assignee: ydelendik → nobody

I agree; let's put this back on the shelf until register allocation and instruction selection are better.

Priority: P3 → P5
