Closed
Bug 1051860
Opened 9 years ago
Closed 9 years ago
SIMD x86-x64: Optimize MSimdValueX4
Categories
(Core :: JavaScript Engine: JIT, defect)
Tracking
()
RESOLVED
FIXED
mozilla35
People
(Reporter: bbouvier, Assigned: bbouvier)
References
Details
Attachments
(2 files, 3 obsolete files)
14.74 KB,
patch
|
sunfish
:
review+
|
Details | Diff | Splinter Review |
7.37 KB,
patch
|
sunfish
:
review+
|
Details | Diff | Splinter Review |
See https://bugzilla.mozilla.org/show_bug.cgi?id=1025475#c16 and https://bugzilla.mozilla.org/show_bug.cgi?id=1025475#c19 for more details.
Assignee | ||
Comment 1•9 years ago
|
||
This unpcklps trick is really magic.
Attachment #8486702 -
Flags: review?(sunfish)
Assignee | ||
Updated•9 years ago
|
Assignee: nobody → benj
Status: NEW → ASSIGNED
Assignee | ||
Comment 2•9 years ago
|
||
And pinsrd was actually already implemented for unboxing double values in x86... Nice!
Attachment #8486705 -
Flags: review?(sunfish)
Assignee | ||
Comment 3•9 years ago
|
||
So it appears this first patch was really not regalloc friendly, in particular it clobbered two inputs without any shame. We need two temporaries (for r0 and r1), so by reusing r0's register for the output, we have one; then we need only one copy. This patch splits lowering for int32x4 and float32x4 (note that there is no LOpcodes-x86-shared.h and so the opcodes have to be repeated in each of the x64/LOpcodes-x64.h and x86 files).
Attachment #8486702 -
Attachment is obsolete: true
Attachment #8486702 -
Flags: review?(sunfish)
Attachment #8487061 -
Flags: review?(sunfish)
Assignee | ||
Comment 4•9 years ago
|
||
Attachment #8486705 -
Attachment is obsolete: true
Attachment #8486705 -
Flags: review?(sunfish)
Attachment #8487062 -
Flags: review?(sunfish)
Assignee | ||
Comment 5•9 years ago
|
||
forgot to qref
Attachment #8487062 -
Attachment is obsolete: true
Attachment #8487062 -
Flags: review?(sunfish)
Attachment #8487063 -
Flags: review?(sunfish)
Comment 6•9 years ago
|
||
Comment on attachment 8487061 [details] [diff] [review] 1. Optimize Float32x4 Codegen With This One Weird Trick Review of attachment 8487061 [details] [diff] [review]: ----------------------------------------------------------------- One Weird Trick that CPUs, um, actually like. ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp @@ +2150,5 @@ > + FloatRegister r3 = ToFloatRegister(ins->getOperand(3)); > + > + masm.unpcklps(r3, r1); > + masm.unpcklps(r2, r0); > + masm.unpcklps(r1, r0);
Attachment #8487061 -
Flags: review?(sunfish) → review+
Comment 7•9 years ago
|
||
Comment on attachment 8487063 [details] [diff] [review] 2. Optimize int32x4 codegen when SSE4.1 is available Review of attachment 8487063 [details] [diff] [review]: ----------------------------------------------------------------- Groovy. We could optimize the non-SSE4.1 case more, but not *that* much more, and it'd require temporary registers, so it's not worth worrying about for now. ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp @@ +2127,5 @@ > + FloatRegister output = ToFloatRegister(ins->output()); > + if (AssemblerX86Shared::HasSSE41()) { > + for (size_t i = 0; i < 4; ++i) { > + Register r = ToRegister(ins->getOperand(i)); > + masm.pinsrd(i, r, output); You want movd for the first element, and pinsrd for the rest. movd clears the entire register, clearing dependencies on earlier instructions.
Attachment #8487063 -
Flags: review?(sunfish) → review+
Assignee | ||
Comment 8•9 years ago
|
||
Thanks! Out of curiosity, how could we optimize the non SSE4.1 case more? Also, it looks your review comments of the first patch may have disappeared somehow (or there is just an empty comment that didn't get deleted). Please let me know if you had any remarks about the patch, I'll address them in a follow-up patch if needed. remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/179193fbcccd remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/6cc38353cfa4
Comment 9•9 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/179193fbcccd https://hg.mozilla.org/mozilla-central/rev/6cc38353cfa4
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Comment 10•9 years ago
|
||
(In reply to Benjamin Bouvier [:bbouvier] from comment #8) > Thanks! > > Out of curiosity, how could we optimize the non SSE4.1 case more? movd all the integers into xmm registers and then do the shuffles like in the float32 case to pack them together. Doing a shuffle after the first 2 movd can keep it to 2 temporaries: movd %(W), %xmm0 movd %(Y), %xmm1 punpckldq %xmm0, %xmm1 movd %(Z), %xmm2 movd %(X), %xmm0 punpckldq %xmm2, %xmm0 punpckldq %xmm1, %xmm0 > Also, it > looks your review comments of the first patch may have disappeared somehow > (or there is just an empty comment that didn't get deleted). Please let me > know if you had any remarks about the patch, I'll address them in a > follow-up patch if needed. I had posted a unicode emoji character celebrating the triple-unpcklps sequence. It looks like bugzilla swallowed it. The patch is fine :-).
You need to log in
before you can comment on or make changes to this bug.
Description
•