Closed
Bug 1115766
Opened 10 years ago
Closed 9 years ago
Misc codegen optimizations
Categories
(Core :: JavaScript Engine: JIT, enhancement)
Tracking
()
RESOLVED
FIXED
mozilla37
People
(Reporter: sunfish, Assigned: sunfish)
Details
Attachments
(3 files)
17.12 KB,
patch
|
bbouvier
:
review+
|
Details | Diff | Splinter Review |
2.32 KB,
patch
|
bbouvier
:
review+
|
Details | Diff | Splinter Review |
1.28 KB,
patch
|
bbouvier
:
review+
|
Details | Diff | Splinter Review |
While investigating https://github.com/chadaustin/Web-Benchmarks/tree/master/skinning , I made the following low-level optimizations. This first patch eliminates the defineReuseInput constraint for LSimdShuffle on AVX, eliminating unnecessary copy instructions.
Attachment #8541781 -
Flags: review?(benj)
Assignee | ||
Comment 1•10 years ago
|
||
This patch eliminates more copies around vshufps when AVX is available.
Attachment #8541782 -
Flags: review?(benj)
Assignee | ||
Comment 2•10 years ago
|
||
This patch eliminates the extra use of the lhs operand in LMulI when it isn't needed. Since the extra use is not AtStart, it causes extra register conflicts.
Attachment #8541783 -
Flags: review?(benj)
Comment 3•9 years ago
|
||
Comment on attachment 8541781 [details] [diff] [review] shuffle-constraints.patch Review of attachment 8541781 [details] [diff] [review]: ----------------------------------------------------------------- Nice. r=me if you agree with all fixes, otherwise please re-flag as r? ::: js/src/jit/shared/BaseAssembler-x86-shared.h @@ +3165,5 @@ > twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, src1, src0, dst); > } > + void vunpcklps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst) > + { > + twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, offset, base, src0, dst); (not in this patch, but wondering: it seems that twoByteOpVex{,_disp32} have the same code, could we delete one of these, or just have one call the other? See http://dxr.mozilla.org/mozilla-central/source/js/src/jit/shared/BaseAssembler-x86-shared.h#4728,4747 ) @@ +3178,5 @@ > twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, src1, src0, dst); > } > + void vunpckhps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst) > + { > + twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, offset, base, src0, dst); It would be nice that either 'offset' becomes an 'int', or twoByteOpSimd expects an int32_t offset @@ +3843,5 @@ > threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, src1, src0, dst); > } > + void vinsertps_imr(uint32_t mask, int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst) > + { > + threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, offset, base, src0, dst); (can't find this function in dxr) ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp @@ +2450,5 @@ > if (ins->lanesMatch(2, 3, 6, 7)) { > + FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg); > + FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg; > + masm.vmovhlps(lhs, rhsCopy, output); > + if (!AssemblerX86Shared::HasAVX()) Considering that reusedInputAlignedFloat32x4 executes a different path if !HasAVX(), i'd really prefer having of all this explicit in grouped branches: (the synthetic form is a mess, although it reduces the number of lines) if (HasAVX()) { FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg); masm.vmovhlps(lhs, rhsCopy, out); } else { masm.loadAlignedFloat32x4(rhs, ScratchSimdReg); masm.vmovhlps(lhs, ScratchSimdReg, ScratchSimdReg); masm.moveFloat32x4(ScratchSimdReg, out); } or even FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg); if (HasAVX()) { masm.vmovhlps(lhs, rhsCopy, out); } else { masm.vmovhlps(lhs, rhsCopy, ScratchSimdReg); // or masm.vmovhlps(lhs, ScratchSimdReg, ScratchSimdReg); masm.moveFloat32x4(ScratchSimdReg, out); } @@ +2462,5 @@ > + rhsCopy = FloatRegister::FromCode(rhs.fpu()); > + } else { > + masm.loadAlignedFloat32x4(rhs, ScratchSimdReg); > + rhsCopy = ScratchSimdReg; > + } Might be worth either explaining why we're not using masm.reusedInputAlignedFloat32x4 (the reason being that rhs isn't overwritten) or having an helper function next to reusedInputAlignedFloat32x4 for this case where rhs isn't overwritten @@ +2478,5 @@ > + FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg); > + FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg; > + masm.vunpcklps(lhs, rhsCopy, output); > + if (!AssemblerX86Shared::HasAVX()) > + masm.moveFloat32x4(ScratchSimdReg, out); Same comment as for mask (2, 3, 6, 7). @@ +2493,5 @@ > + FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg); > + FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg; > + masm.vunpckhps(lhs, rhsCopy, output); > + if (!AssemblerX86Shared::HasAVX()) > + masm.moveFloat32x4(ScratchSimdReg, out); ditto
Attachment #8541781 -
Flags: review?(benj) → review+
Comment 4•9 years ago
|
||
Comment on attachment 8541782 [details] [diff] [review] vex-shufps-nocopy.patch Review of attachment 8541782 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/shared/MacroAssembler-x86-shared.h @@ +1016,3 @@ > } > void shuffleMix(uint32_t mask, const Operand &src, FloatRegister dest) { > // Note this uses vshufps, which is a cross-domain penaly on CPU where it pre-existing: penalty*
Attachment #8541782 -
Flags: review?(benj) → review+
Comment 5•9 years ago
|
||
Comment on attachment 8541783 [details] [diff] [review] lmuli-omit-use.patch Review of attachment 8541783 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/shared/Lowering-x86-shared.cpp @@ +156,5 @@ > LIRGeneratorX86Shared::lowerMulI(MMul *mul, MDefinition *lhs, MDefinition *rhs) > { > + // Note: If we need a negative zero check, lhs is used twice. > + LAllocation lhsCopy = mul->canBeNegativeZero() ? use(lhs) : LAllocation(); > + LMulI *lir = new(alloc()) LMulI(useRegisterAtStart(lhs), useOrConstant(rhs), lhsCopy); Couldn't LMulI get a third parameter which is a |const LDefinition &temp| instead, use |setTemp| in the ctor body, and this lhsCopy would be a tempCopy() instead? Or is this not compatible with our register allocators so far?
Attachment #8541783 -
Flags: review?(benj) → review+
Assignee | ||
Comment 6•9 years ago
|
||
> @@ +156,5 @@ > > LIRGeneratorX86Shared::lowerMulI(MMul *mul, MDefinition *lhs, MDefinition *rhs) > > { > > + // Note: If we need a negative zero check, lhs is used twice. > > + LAllocation lhsCopy = mul->canBeNegativeZero() ? use(lhs) : LAllocation(); > > + LMulI *lir = new(alloc()) LMulI(useRegisterAtStart(lhs), useOrConstant(rhs), lhsCopy); > > Couldn't LMulI get a third parameter which is a |const LDefinition &temp| > instead, use |setTemp| in the ctor body, and this lhsCopy would be a > tempCopy() instead? Or is this not compatible with our register allocators > so far? Temps are for registers which the instruction can write to. In this case, we need an extra copy of the input, but we don't need to write to it. I think what it's doing is right. Checked in the two simpler patches here; I'll address your comments on the shuffle-constraints.patch separately. https://hg.mozilla.org/integration/mozilla-inbound/rev/6fb6a530e257 https://hg.mozilla.org/integration/mozilla-inbound/rev/e6d8db11ed44
Whiteboard: [leave open]
Assignee | ||
Comment 7•9 years ago
|
||
(In reply to Benjamin Bouvier [:bbouvier] from comment #3) > Comment on attachment 8541781 [details] [diff] [review] > shuffle-constraints.patch > > Review of attachment 8541781 [details] [diff] [review]: > ----------------------------------------------------------------- > > Nice. r=me if you agree with all fixes, otherwise please re-flag as r? > > ::: js/src/jit/shared/BaseAssembler-x86-shared.h > @@ +3165,5 @@ > > twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, src1, src0, dst); > > } > > + void vunpcklps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst) > > + { > > + twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, offset, base, src0, dst); > > (not in this patch, but wondering: it seems that twoByteOpVex{,_disp32} have > the same code, could we delete one of these, or just have one call the > other? See > http://dxr.mozilla.org/mozilla-central/source/js/src/jit/shared/ > BaseAssembler-x86-shared.h#4728,4747 ) As of 09a2d1f803a4, twoByteOpVex calls memoryModRM rather than memoryModRM_disp32, so they're not exactly the same code anymore. There is still a lot of redundancy in that file though. I'm wondering if we should push the Assembler's Operand class into the BaseAssembler interfaces so that we don't have separate copies of instructions for each way of accessing memory, and perhaps we could make disp32-ness a flag in Operand too, so that we don't have to have duplication between regular and disp32 functions. > @@ +3178,5 @@ > > twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, src1, src0, dst); > > } > > + void vunpckhps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst) > > + { > > + twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, offset, base, src0, dst); > > It would be nice that either 'offset' becomes an 'int', or twoByteOpSimd > expects an int32_t offset As of 0ea34b180725, twoByteOpSimd expects an int32_t offset. > @@ +3843,5 @@ > > threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, src1, src0, dst); > > } > > + void vinsertps_imr(uint32_t mask, int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst) > > + { > > + threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, offset, base, src0, dst); > > (can't find this function in dxr) It exists as of 736d53322a1d. > ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp > @@ +2450,5 @@ > > if (ins->lanesMatch(2, 3, 6, 7)) { > > + FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg); > > + FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg; > > + masm.vmovhlps(lhs, rhsCopy, output); > > + if (!AssemblerX86Shared::HasAVX()) > > Considering that reusedInputAlignedFloat32x4 executes a different path if > !HasAVX(), i'd really prefer having of all this explicit in grouped > branches: (the synthetic form is a mess, although it reduces the number of > lines) > > if (HasAVX()) { > FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, > ScratchSimdReg); > masm.vmovhlps(lhs, rhsCopy, out); > } else { > masm.loadAlignedFloat32x4(rhs, ScratchSimdReg); > masm.vmovhlps(lhs, ScratchSimdReg, ScratchSimdReg); > masm.moveFloat32x4(ScratchSimdReg, out); > } Sounds good to me. I agree it's more readable this way. I've updated the patch in the three places where this comes up. > @@ +2462,5 @@ > > + rhsCopy = FloatRegister::FromCode(rhs.fpu()); > > + } else { > > + masm.loadAlignedFloat32x4(rhs, ScratchSimdReg); > > + rhsCopy = ScratchSimdReg; > > + } > > Might be worth either explaining why we're not using > masm.reusedInputAlignedFloat32x4 (the reason being that rhs isn't > overwritten) or having an helper function next to > reusedInputAlignedFloat32x4 for this case where rhs isn't overwritten I added a comment. I decided against a helper function because the vmovlhps case is a fairly unique situation, since the instruction doesn't have a memory-operand form. https://hg.mozilla.org/integration/mozilla-inbound/rev/f46c2220f91b
Whiteboard: [leave open]
https://hg.mozilla.org/mozilla-central/rev/6fb6a530e257 https://hg.mozilla.org/mozilla-central/rev/e6d8db11ed44 https://hg.mozilla.org/mozilla-central/rev/f46c2220f91b
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla37
You need to log in
before you can comment on or make changes to this bug.
Description
•