Closed Bug 1115766 Opened 5 years ago Closed 5 years ago

Misc codegen optimizations

Categories: Core :: JavaScript Engine: JIT, enhancement
Platform: x86_64 / All
Severity: enhancement
Priority: Not set
Status: RESOLVED FIXED
Target Milestone: mozilla37
People: Reporter: sunfish; Assigned: sunfish
Attachments: (3 files)
While investigating https://github.com/chadaustin/Web-Benchmarks/tree/master/skinning , I made the following low-level optimizations.

This first patch eliminates the defineReuseInput constraint for LSimdShuffle on AVX, eliminating unnecessary copy instructions.
Attachment #8541781 - Flags: review?(benj)
This patch eliminates more copies around vshufps when AVX is available.
Attachment #8541782 - Flags: review?(benj)
This patch eliminates the extra use of the lhs operand in LMulI when it isn't needed. Since the extra use is not AtStart, it causes extra register conflicts.
Attachment #8541783 - Flags: review?(benj)
Comment on attachment 8541781 [details] [diff] [review]
shuffle-constraints.patch

Review of attachment 8541781 [details] [diff] [review]:
-----------------------------------------------------------------

Nice. r=me if you agree with all fixes, otherwise please re-flag as r?

::: js/src/jit/shared/BaseAssembler-x86-shared.h
@@ +3165,5 @@
>          twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, src1, src0, dst);
>      }
> +    void vunpcklps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> +    {
> +        twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, offset, base, src0, dst);

(not in this patch, but wondering: it seems that twoByteOpVex{,_disp32} have the same code, could we delete one of these, or just have one call the other? See http://dxr.mozilla.org/mozilla-central/source/js/src/jit/shared/BaseAssembler-x86-shared.h#4728,4747 )

@@ +3178,5 @@
>          twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, src1, src0, dst);
>      }
> +    void vunpckhps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> +    {
> +        twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, offset, base, src0, dst);

It would be nice if either 'offset' became an 'int', or twoByteOpSimd expected an int32_t offset.

@@ +3843,5 @@
>          threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, src1, src0, dst);
>      }
> +    void vinsertps_imr(uint32_t mask, int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> +    {
> +        threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, offset, base, src0, dst);

(can't find this function in dxr)

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +2450,5 @@
>      if (ins->lanesMatch(2, 3, 6, 7)) {
> +        FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
> +        FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg;
> +        masm.vmovhlps(lhs, rhsCopy, output);
> +        if (!AssemblerX86Shared::HasAVX())

Considering that reusedInputAlignedFloat32x4 executes a different path if !HasAVX(), I'd really prefer having all of this explicit in grouped branches (the condensed form is a mess, even though it reduces the number of lines):

if (HasAVX()) {
  FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
  masm.vmovhlps(lhs, rhsCopy, out);
} else {
  masm.loadAlignedFloat32x4(rhs, ScratchSimdReg);
  masm.vmovhlps(lhs, ScratchSimdReg, ScratchSimdReg);
  masm.moveFloat32x4(ScratchSimdReg, out);
}


or even 


FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
if (HasAVX()) {
  masm.vmovhlps(lhs, rhsCopy, out);
} else {
  masm.vmovhlps(lhs, rhsCopy, ScratchSimdReg); // or masm.vmovhlps(lhs, ScratchSimdReg, ScratchSimdReg);
  masm.moveFloat32x4(ScratchSimdReg, out);
}

@@ +2462,5 @@
> +            rhsCopy = FloatRegister::FromCode(rhs.fpu());
> +        } else {
> +            masm.loadAlignedFloat32x4(rhs, ScratchSimdReg);
> +            rhsCopy = ScratchSimdReg;
> +        }

Might be worth either explaining why we're not using masm.reusedInputAlignedFloat32x4 (the reason being that rhs isn't overwritten), or having a helper function next to reusedInputAlignedFloat32x4 for this case where rhs isn't overwritten.

@@ +2478,5 @@
> +        FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
> +        FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg;
> +        masm.vunpcklps(lhs, rhsCopy, output);
> +        if (!AssemblerX86Shared::HasAVX())
> +             masm.moveFloat32x4(ScratchSimdReg, out);

Same comment as for mask (2, 3, 6, 7).

@@ +2493,5 @@
> +        FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
> +        FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg;
> +        masm.vunpckhps(lhs, rhsCopy, output);
> +        if (!AssemblerX86Shared::HasAVX())
> +             masm.moveFloat32x4(ScratchSimdReg, out);

ditto
Attachment #8541781 - Flags: review?(benj) → review+
Comment on attachment 8541782 [details] [diff] [review]
vex-shufps-nocopy.patch

Review of attachment 8541782 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/MacroAssembler-x86-shared.h
@@ +1016,3 @@
>      }
>      void shuffleMix(uint32_t mask, const Operand &src, FloatRegister dest) {
>          // Note this uses vshufps, which is a cross-domain penaly on CPU where it

pre-existing: penalty*
Attachment #8541782 - Flags: review?(benj) → review+
Comment on attachment 8541783 [details] [diff] [review]
lmuli-omit-use.patch

Review of attachment 8541783 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/Lowering-x86-shared.cpp
@@ +156,5 @@
>  LIRGeneratorX86Shared::lowerMulI(MMul *mul, MDefinition *lhs, MDefinition *rhs)
>  {
> +    // Note: If we need a negative zero check, lhs is used twice.
> +    LAllocation lhsCopy = mul->canBeNegativeZero() ? use(lhs) : LAllocation();
> +    LMulI *lir = new(alloc()) LMulI(useRegisterAtStart(lhs), useOrConstant(rhs), lhsCopy);

Couldn't LMulI get a third parameter which is a |const LDefinition &temp| instead, use |setTemp| in the ctor body, and have this lhsCopy be a tempCopy() instead? Or is this not compatible with our register allocators?
Attachment #8541783 - Flags: review?(benj) → review+
> @@ +156,5 @@
> >  LIRGeneratorX86Shared::lowerMulI(MMul *mul, MDefinition *lhs, MDefinition *rhs)
> >  {
> > +    // Note: If we need a negative zero check, lhs is used twice.
> > +    LAllocation lhsCopy = mul->canBeNegativeZero() ? use(lhs) : LAllocation();
> > +    LMulI *lir = new(alloc()) LMulI(useRegisterAtStart(lhs), useOrConstant(rhs), lhsCopy);
> 
> Couldn't LMulI get a third parameter which is a |const LDefinition &temp|
> instead, use |setTemp| in the ctor body, and this lhsCopy would be a
> tempCopy() instead? Or is this not compatible with our register allocators
> so far?

Temps are for registers which the instruction can write to. In this case, we need an extra copy of the input, but we don't need to write to it. I think what it's doing is right.

Checked in the two simpler patches here; I'll address your comments on the shuffle-constraints.patch separately.

https://hg.mozilla.org/integration/mozilla-inbound/rev/6fb6a530e257
https://hg.mozilla.org/integration/mozilla-inbound/rev/e6d8db11ed44
Whiteboard: [leave open]
(In reply to Benjamin Bouvier [:bbouvier] from comment #3)
> Comment on attachment 8541781 [details] [diff] [review]
> shuffle-constraints.patch
> 
> Review of attachment 8541781 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Nice. r=me if you agree with all fixes, otherwise please re-flag as r?
> 
> ::: js/src/jit/shared/BaseAssembler-x86-shared.h
> @@ +3165,5 @@
> >          twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, src1, src0, dst);
> >      }
> > +    void vunpcklps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> > +    {
> > +        twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, offset, base, src0, dst);
> 
> (not in this patch, but wondering: it seems that twoByteOpVex{,_disp32} have
> the same code, could we delete one of these, or just have one call the
> other? See
> http://dxr.mozilla.org/mozilla-central/source/js/src/jit/shared/
> BaseAssembler-x86-shared.h#4728,4747 )

As of 09a2d1f803a4, twoByteOpVex calls memoryModRM rather than memoryModRM_disp32, so they're not exactly the same code anymore.

There is still a lot of redundancy in that file, though. I'm wondering if we should push the Assembler's Operand class into the BaseAssembler interfaces so that we don't have separate copies of instructions for each way of accessing memory; perhaps we could also make disp32-ness a flag in Operand, so that we avoid duplicating regular and disp32 functions.

> @@ +3178,5 @@
> >          twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, src1, src0, dst);
> >      }
> > +    void vunpckhps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> > +    {
> > +        twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, offset, base, src0, dst);
> 
> It would be nice if either 'offset' became an 'int', or twoByteOpSimd
> expected an int32_t offset.

As of 0ea34b180725, twoByteOpSimd expects an int32_t offset.

> @@ +3843,5 @@
> >          threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, src1, src0, dst);
> >      }
> > +    void vinsertps_imr(uint32_t mask, int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> > +    {
> > +        threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, offset, base, src0, dst);
> 
> (can't find this function in dxr)

It exists as of 736d53322a1d.

> ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
> @@ +2450,5 @@
> >      if (ins->lanesMatch(2, 3, 6, 7)) {
> > +        FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
> > +        FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg;
> > +        masm.vmovhlps(lhs, rhsCopy, output);
> > +        if (!AssemblerX86Shared::HasAVX())
> 
> Considering that reusedInputAlignedFloat32x4 executes a different path if
> !HasAVX(), I'd really prefer having all of this explicit in grouped
> branches (the condensed form is a mess, even though it reduces the number
> of lines):
> 
> if (HasAVX()) {
>   FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs,
> ScratchSimdReg);
>   masm.vmovhlps(lhs, rhsCopy, out);
> } else {
>   masm.loadAlignedFloat32x4(rhs, ScratchSimdReg);
>   masm.vmovhlps(lhs, ScratchSimdReg, ScratchSimdReg);
>   masm.moveFloat32x4(ScratchSimdReg, out);
> }

Sounds good to me. I agree it's more readable this way. I've updated the patch in the three places where this comes up.

> @@ +2462,5 @@
> > +            rhsCopy = FloatRegister::FromCode(rhs.fpu());
> > +        } else {
> > +            masm.loadAlignedFloat32x4(rhs, ScratchSimdReg);
> > +            rhsCopy = ScratchSimdReg;
> > +        }
> 
> Might be worth either explaining why we're not using
> masm.reusedInputAlignedFloat32x4 (the reason being that rhs isn't
> overwritten), or having a helper function next to
> reusedInputAlignedFloat32x4 for this case where rhs isn't overwritten.

I added a comment. I decided against a helper function because the vmovlhps case is a fairly unusual one, since the instruction doesn't have a memory-operand form.

https://hg.mozilla.org/integration/mozilla-inbound/rev/f46c2220f91b
Whiteboard: [leave open]