Closed Bug 1051860 Opened 10 years ago Closed 10 years ago

SIMD x86-x64: Optimize MSimdValueX4

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla35

People

(Reporter: bbouvier, Assigned: bbouvier)

References

Details

Attachments

(2 files, 3 obsolete files)

Optimize SimdValueX4 codegen for float32x4 with unpcklps 10 years ago Benjamin Bouvier [:bbouvier] (inactive) 7.11 KB, patch		Details \| Diff \| Splinter Review
Optimize SimdValueX4 codegen for int32x4 if SSE4 is available 10 years ago Benjamin Bouvier [:bbouvier] (inactive) 7.49 KB, patch		Details \| Diff \| Splinter Review
1. Optimize Float32x4 Codegen With This One Weird Trick 10 years ago Benjamin Bouvier [:bbouvier] (inactive) 14.74 KB, patch	sunfish : review+	Details \| Diff \| Splinter Review
2. Optimize int32x4 codegen when SSE4.1 is available 10 years ago Benjamin Bouvier [:bbouvier] (inactive) 7.37 KB, patch		Details \| Diff \| Splinter Review
2. Optimize int32x4 codegen when SSE4.1 is available 10 years ago Benjamin Bouvier [:bbouvier] (inactive) 7.37 KB, patch	sunfish : review+	Details \| Diff \| Splinter Review

Benjamin Bouvier [:bbouvier] (inactive)

Assignee

Description

•

10 years ago

See https://bugzilla.mozilla.org/show_bug.cgi?id=1025475#c16 and https://bugzilla.mozilla.org/show_bug.cgi?id=1025475#c19 for more details.

Benjamin Bouvier [:bbouvier] (inactive)

Assignee

Updated

•

10 years ago

Blocks: 1023404

Benjamin Bouvier [:bbouvier] (inactive)

Assignee

Comment 1

•

10 years ago

Attached patch Optimize SimdValueX4 codegen for float32x4 with unpcklps (obsolete) — Details — Splinter Review

This unpcklps trick is really magic.

Attachment #8486702 - Flags: review?(sunfish)

Benjamin Bouvier [:bbouvier] (inactive)

Assignee

Updated

•

10 years ago

Assignee: nobody → benj

Status: NEW → ASSIGNED

Benjamin Bouvier [:bbouvier] (inactive)

Assignee

Comment 2

•

10 years ago

Attached patch Optimize SimdValueX4 codegen for int32x4 if SSE4 is available (obsolete) — Details — Splinter Review

And pinsrd was actually already implemented for unboxing double values in
x86... Nice!

Attachment #8486705 - Flags: review?(sunfish)

Benjamin Bouvier [:bbouvier] (inactive)

Assignee

Comment 3

•

10 years ago

Attached patch 1. Optimize Float32x4 Codegen With This One Weird Trick — Details — Splinter Review

So it appears this first patch was really not regalloc friendly, in particular it clobbered two inputs without any shame. We need two temporaries (for r0 and r1), so by reusing r0's register for the output, we have one; then we need only one copy. This patch splits lowering for int32x4 and float32x4 (note that there is no LOpcodes-x86-shared.h and so the opcodes have to be repeated in each of the x64/LOpcodes-x64.h and x86 files).

Attachment #8486702 - Attachment is obsolete: true

Attachment #8486702 - Flags: review?(sunfish)

Attachment #8487061 - Flags: review?(sunfish)

Benjamin Bouvier [:bbouvier] (inactive)

Assignee

Comment 4

•

10 years ago

Attached patch 2. Optimize int32x4 codegen when SSE4.1 is available (obsolete) — Details — Splinter Review

Attachment #8486705 - Attachment is obsolete: true

Attachment #8486705 - Flags: review?(sunfish)

Attachment #8487062 - Flags: review?(sunfish)

Benjamin Bouvier [:bbouvier] (inactive)

Assignee

Comment 5

•

10 years ago

Attached patch 2. Optimize int32x4 codegen when SSE4.1 is available — Details — Splinter Review

forgot to qref

Attachment #8487062 - Attachment is obsolete: true

Attachment #8487062 - Flags: review?(sunfish)

Attachment #8487063 - Flags: review?(sunfish)

Dan Gohman [:sunfish]

Comment 6

•

10 years ago

Comment on attachment 8487061 [details] [diff] [review]
1. Optimize Float32x4 Codegen With This One Weird Trick

Review of attachment 8487061 [details] [diff] [review]:
-----------------------------------------------------------------

One Weird Trick that CPUs, um, actually like.

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +2150,5 @@
> +    FloatRegister r3 = ToFloatRegister(ins->getOperand(3));
> +
> +    masm.unpcklps(r3, r1);
> +    masm.unpcklps(r2, r0);
> +    masm.unpcklps(r1, r0);

Attachment #8487061 - Flags: review?(sunfish) → review+

Dan Gohman [:sunfish]

Comment 7

•

10 years ago

Comment on attachment 8487063 [details] [diff] [review]
2. Optimize int32x4 codegen when SSE4.1 is available

Review of attachment 8487063 [details] [diff] [review]:
-----------------------------------------------------------------

Groovy. We could optimize the non-SSE4.1 case more, but not *that* much more, and it'd require temporary registers, so it's not worth worrying about for now.

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +2127,5 @@
> +    FloatRegister output = ToFloatRegister(ins->output());
> +    if (AssemblerX86Shared::HasSSE41()) {
> +        for (size_t i = 0; i < 4; ++i) {
> +            Register r = ToRegister(ins->getOperand(i));
> +            masm.pinsrd(i, r, output);

You want movd for the first element, and pinsrd for the rest. movd clears the entire register, clearing dependencies on earlier instructions.

Attachment #8487063 - Flags: review?(sunfish) → review+

Benjamin Bouvier [:bbouvier] (inactive)

Assignee

Comment 8

•

10 years ago

Thanks!

Out of curiosity, how could we optimize the non SSE4.1 case more? Also, it looks your review comments of the first patch may have disappeared somehow (or there is just an empty comment that didn't get deleted). Please let me know if you had any remarks about the patch, I'll address them in a follow-up patch if needed.

remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/179193fbcccd
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/6cc38353cfa4

Carsten Book [:Tomcat]

Comment 9

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/179193fbcccd
https://hg.mozilla.org/mozilla-central/rev/6cc38353cfa4

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla35

Dan Gohman [:sunfish]

Comment 10

•

10 years ago

(In reply to Benjamin Bouvier [:bbouvier] from comment #8)
> Thanks!
> 
> Out of curiosity, how could we optimize the non SSE4.1 case more?

movd all the integers into xmm registers and then do the shuffles like in the float32 case to pack them together. Doing a shuffle after the first 2 movd can keep it to 2 temporaries:

  movd        %(W), %xmm0
  movd        %(Y), %xmm1
  punpckldq   %xmm0, %xmm1
  movd        %(Z), %xmm2
  movd        %(X), %xmm0
  punpckldq   %xmm2, %xmm0
  punpckldq   %xmm1, %xmm0

> Also, it
> looks your review comments of the first patch may have disappeared somehow
> (or there is just an empty comment that didn't get deleted). Please let me
> know if you had any remarks about the patch, I'll address them in a
> follow-up patch if needed.

I had posted a unicode emoji character celebrating the triple-unpcklps sequence. It looks like bugzilla swallowed it. The patch is fine :-).

Steve Singer (:stevensn)

Updated

•

10 years ago

Depends on: 1067133

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Updated

•

10 years ago

Depends on: 1100123

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

SIMD x86-x64: Optimize MSimdValueX4

Categories

(Core :: JavaScript Engine: JIT, defect)

Tracking

()

People

(Reporter: bbouvier, Assigned: bbouvier)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files, 3 obsolete files)

Description

Updated

Comment 1

Updated

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Updated

Updated

Attachment

General

Description

File Name

Content Type