Closed Bug 1060789 Opened 5 years ago Closed 5 years ago

Odin SIMD: Implement splat


(Core :: JavaScript Engine: JIT, defect)

Not set





(Reporter: dougc, Assigned: dougc)




(3 files, 3 obsolete files)

No description provided.
Blocks: 1023404
Attached patch Implement splat. (obsolete) — Splinter Review
This patch optimizes a simd x4 ctor with all plane arguments the same to use the shufps instruction.

A challenge using this instruction for splat is that the source and destination need to be in the same register. For a float32x4 the patch uses defineReuseInput() for this, and I hope this will work as intended give the input is a float32 and the output a float32x4?

It was not clear if a new MIR class should have been created, or SimdValueX4 reused with all arguments the same?

The patch does not yet define splat, just optimizes the ctor.

Anyway this was adequate to confirm that this optimization gives a good performance improvement for the Flappy Bird demo. The SIMD version is now 75% fast that the non-SIMD version.
Attachment #8481797 - Flags: feedback?(sunfish)
Comment on attachment 8481797 [details] [diff] [review]
Implement splat.

Review of attachment 8481797 [details] [diff] [review]:

This looks reasonable to me, off the top of my head. A dedicated class for Splat rather than making it a special-case of MSimdValueX4 sounds good, as I think splats will be common enough.

::: js/src/jit/Lowering.cpp
@@ +3700,5 @@
> +    switch (ins->type()) {
> +      case MIRType_Int32x4:
> +        return define(lir, ins);
> +      case MIRType_Float32x4:
> +        return defineReuseInput(lir, ins, 0);

The defineReuseInput here looks right for x86/x64 using shufps, so the only problem here is that this is target-independent code here. I think splats will have to be lowered in */Lowering-*.cpp.
Attachment #8481797 - Flags: feedback?(sunfish) → feedback+
Thank you for the quick feedback.

* Moved the Lowering into the backends.

* Added a couple of tests.

Shall implement the Odin 'slat' support in a follow up patch. Optimizing the constructor is enough for the demos, and we need to land this asap.
Assignee: nobody → dtc-moz
Attachment #8481797 - Attachment is obsolete: true
Attachment #8481923 - Flags: review?(sunfish)
Please leave open to add the Odin 'splat' support.
Keywords: leave-open
Comment on attachment 8481923 [details] [diff] [review]
Add 'splat' backend support and optimize constructors to use this.

Review of attachment 8481923 [details] [diff] [review]:

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +2165,5 @@
> +    switch (mir->type()) {
> +      case MIRType_Int32x4: {
> +        Register r = ToRegister(ins->getOperand(0));
> +        masm.movd(r, output);
> +        masm.shufps(0, output, output);

int32x4 splat should use pshufd instead of shufps to avoid a domain crossing penalty. pshufd(0, output, output).

@@ +2170,5 @@
> +        break;
> +      }
> +      case MIRType_Float32x4: {
> +        FloatRegister r = ToFloatRegister(ins->getOperand(0));
> +        masm.shufps(0, r, output);

You can assert r == output here, to sanity-check that the defineReusedInput did its job.
Attachment #8481923 - Flags: review?(sunfish) → review+
Thank you for the quick review, and your help. This patch integrates the feedback. Carrying forward the r+.
Attachment #8481923 - Attachment is obsolete: true
Attachment #8481933 - Flags: review+
Just splitting the patch into two parts: implementation of splat vs use it in Odin
Attachment #8481933 - Attachment is obsolete: true
Attachment #8482263 - Flags: review+
Adds proper support for int32x4.splat and float32x4 in Odin.

- int32x4.splat(x) === int32x4(x, x, x, x);
- same rules as the tuple constructors: int32x4.splat accepts intish and float32x4.splat accepts signed, unsigned, double?, floatish.
Attachment #8482610 - Flags: review?(luke)
We might have a problem here, see the unaligned movaps storing to the stack:

   0x00007f2629a030a8:	cvtsd2ss %xmm0,%xmm1
   0x00007f2629a030ac:	shufps $0x0,%xmm1,%xmm1
   0x00007f2629a030b0:	movss  0x4(%rsp),%xmm0
   0x00007f2629a030b6:	shufps $0x0,%xmm0,%xmm0
   0x00007f2629a030ba:	movaps %xmm0,%xmm2
   0x00007f2629a030bd:	movaps %xmm0,0x4(%rsp)    <<<<<<<<<<

I might have missed some important point wrt register targeting etc.
Comment on attachment 8482610 [details] [diff] [review]

Review of attachment 8482610 [details] [diff] [review]:

::: js/src/asmjs/AsmJSValidate.cpp
@@ +4720,5 @@
> +          return false;
> +        break;
> +    }
> +
> +    *type = global->simdOperationType();

Perhaps add a JS_ASSERT(global->simdOperationOp() == Splat) (with \n before and after)?

Alternatively, I wonder if we could generalize CheckBinarySimd+CheckUnarySimd into an n-ary CheckSimdCall.
Attachment #8482610 - Flags: review?(luke) → review+
Here's some of the close debug dump. Note the blendvps instruction uses the xmm0 register as an input and this can be seen to cause a spill below that tickles this problem.

[RegAlloc] Allocating v8[1] [20,33) [priority 13] [weight 0]
[RegAlloc]   Hint xmm0, used by group allocation
[RegAlloc]   xmm0 collides with fixed use [23,24)
[RegAlloc]   Spilling interval
[RegAlloc]     Allocating spill location stack:12   <<< stack allocated here
[RegAlloc] Allocating v1[1] [4,15) [priority 11] [weight 0]
[RegAlloc]   Hint xmm0, used by group allocation
[RegAlloc]   xmm0 collides with fixed use [11,12)
[RegAlloc]   Spilling interval
[RegAlloc]     Reusing group spill location stack:12
[RegAlloc] Allocating v15[1] [34,192) v15:r?@34 v15:r?@34 v15:r?@137 v15:r?@143 [priority 158] [weight 25]
[RegAlloc]   Hint xmm0, used by group allocation
[RegAlloc]   xmm0 collides with v68[0] [147,151) [weight 1000]
[RegAlloc]   Spilling interval
[RegAlloc]     Reusing group spill location stack:12

  v1[0] req(xmm0) has(xmm0) [3,4) / v1[1] req(xmm0?) has(stack:12) [4,15) / v1[2] req(xmm0?) has(xmm0) [14,15) v1:r?@14
  v8[0] req(r,xmm0?) has(xmm0) [19,20) / v8[1] req(xmm0?) has(stack:12) [20,33) / v8[2] req(xmm0?) has(xmm0) [32,33) v8:r?@32
  v15[0] req(r,xmm0?) has(xmm0) [33,34) / v15[1] req(xmm0?) has(stack:12) [34,192) v15:r?@34 v15:r?@34 v15:r?@137 v15:r?@143

[28,29 DoubleToFloat32] [def v13<f>:xmm1] [use v12:r xmm0]
[30,31 SimdSplatX4] [def v14<(null)>:xmm1] [use v13:r xmm1]
[MoveGroup] [stack:12 -> xmm0]
[32,33 SimdSplatX4] [def v15<(null)>:xmm0] [use v8:r xmm0]
[MoveGroup] [xmm0 -> xmm2] [xmm0 -> stack:12]  <<< unaligned access
[34,35 SimdBinaryArithFx4] [def v16<(null)>:xmm2] [use v15:r xmm2] [use v15:r? stack:12]

[Codegen] instruction DoubleToFloat32
[Codegen] cvtsd2ss   %xmm0, %xmm1
[Codegen] instruction SimdSplatX4
[Codegen] shufps     0x0, %xmm1, %xmm1
[Codegen] instruction MoveGroup
[Codegen] movss      0x4(%rsp), %xmm0
[Codegen] instruction SimdSplatX4
[Codegen] shufps     0x0, %xmm0, %xmm0
[Codegen] instruction MoveGroup
[Codegen] movaps     %xmm0, %xmm2
[Codegen] movaps     %xmm0, 0x4(%rsp)    <<< unaligned access
[Codegen] instruction SimdBinaryArithFx4:Mul
[Codegen] mulps      0x4(%rsp), %xmm2
[Codegen] instruction Float32x4
[Codegen] movaps     ?(%rip), %xmm3
[Codegen] instruction MoveGroup
[Codegen] movl       0x8(%rsp), %eax
Depends on: 1062067
Closed: 5 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
You need to log in before you can comment on or make changes to this bug.