Closed Bug 1051860 Opened 5 years ago Closed 5 years ago

SIMD x86-x64: Optimize MSimdValueX4

Categories

(Core :: JavaScript Engine: JIT, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla35

People

(Reporter: bbouvier, Assigned: bbouvier)

References

Details

Attachments

(2 files, 3 obsolete files)

Blocks: 1023404
This unpcklps trick is really magic.
Attachment #8486702 - Flags: review?(sunfish)
Assignee: nobody → benj
Status: NEW → ASSIGNED
And pinsrd was actually already implemented for unboxing double values in
x86... Nice!
Attachment #8486705 - Flags: review?(sunfish)
So it appears this first patch was really not regalloc friendly, in particular it clobbered two inputs without any shame. We need two temporaries (for r0 and r1), so by reusing r0's register for the output, we have one; then we need only one copy. This patch splits lowering for int32x4 and float32x4 (note that there is no LOpcodes-x86-shared.h and so the opcodes have to be repeated in each of the x64/LOpcodes-x64.h and x86 files).
Attachment #8486702 - Attachment is obsolete: true
Attachment #8486702 - Flags: review?(sunfish)
Attachment #8487061 - Flags: review?(sunfish)
Attachment #8486705 - Attachment is obsolete: true
Attachment #8486705 - Flags: review?(sunfish)
Attachment #8487062 - Flags: review?(sunfish)
forgot to qref
Attachment #8487062 - Attachment is obsolete: true
Attachment #8487062 - Flags: review?(sunfish)
Attachment #8487063 - Flags: review?(sunfish)
Comment on attachment 8487061 [details] [diff] [review]
1. Optimize Float32x4 Codegen With This One Weird Trick

Review of attachment 8487061 [details] [diff] [review]:
-----------------------------------------------------------------

One Weird Trick that CPUs, um, actually like.

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +2150,5 @@
> +    FloatRegister r3 = ToFloatRegister(ins->getOperand(3));
> +
> +    masm.unpcklps(r3, r1);
> +    masm.unpcklps(r2, r0);
> +    masm.unpcklps(r1, r0);
Attachment #8487061 - Flags: review?(sunfish) → review+
Comment on attachment 8487063 [details] [diff] [review]
2. Optimize int32x4 codegen when SSE4.1 is available

Review of attachment 8487063 [details] [diff] [review]:
-----------------------------------------------------------------

Groovy. We could optimize the non-SSE4.1 case more, but not *that* much more, and it'd require temporary registers, so it's not worth worrying about for now.

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +2127,5 @@
> +    FloatRegister output = ToFloatRegister(ins->output());
> +    if (AssemblerX86Shared::HasSSE41()) {
> +        for (size_t i = 0; i < 4; ++i) {
> +            Register r = ToRegister(ins->getOperand(i));
> +            masm.pinsrd(i, r, output);

You want movd for the first element, and pinsrd for the rest. movd clears the entire register, clearing dependencies on earlier instructions.
Attachment #8487063 - Flags: review?(sunfish) → review+
Thanks!

Out of curiosity, how could we optimize the non SSE4.1 case more? Also, it looks your review comments of the first patch may have disappeared somehow (or there is just an empty comment that didn't get deleted). Please let me know if you had any remarks about the patch, I'll address them in a follow-up patch if needed.

remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/179193fbcccd
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/6cc38353cfa4
https://hg.mozilla.org/mozilla-central/rev/179193fbcccd
https://hg.mozilla.org/mozilla-central/rev/6cc38353cfa4
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
(In reply to Benjamin Bouvier [:bbouvier] from comment #8)
> Thanks!
> 
> Out of curiosity, how could we optimize the non SSE4.1 case more?

movd all the integers into xmm registers and then do the shuffles like in the float32 case to pack them together. Doing a shuffle after the first 2 movd can keep it to 2 temporaries:

  movd        %(W), %xmm0
  movd        %(Y), %xmm1
  punpckldq   %xmm0, %xmm1
  movd        %(Z), %xmm2
  movd        %(X), %xmm0
  punpckldq   %xmm2, %xmm0
  punpckldq   %xmm1, %xmm0

> Also, it
> looks your review comments of the first patch may have disappeared somehow
> (or there is just an empty comment that didn't get deleted). Please let me
> know if you had any remarks about the patch, I'll address them in a
> follow-up patch if needed.

I had posted a unicode emoji character celebrating the triple-unpcklps sequence. It looks like bugzilla swallowed it. The patch is fine :-).
Depends on: 1067133
Depends on: 1100123
You need to log in before you can comment on or make changes to this bug.