Closed Bug 1115766 Opened 5 years ago Closed 5 years ago

Misc codegen optimizations

Categories: Core :: JavaScript Engine: JIT, enhancement
Platform: x86_64 / All
Severity: enhancement
Priority: Not set
Status: RESOLVED FIXED
Target Milestone: mozilla37
People: Reporter: sunfish; Assigned: sunfish
Attachments: (3 files)
While investigating https://github.com/chadaustin/Web-Benchmarks/tree/master/skinning , I made the following low-level optimizations.

This first patch eliminates the defineReuseInput constraint for LSimdShuffle on AVX, eliminating unnecessary copy instructions.
Attachment #8541781 - Flags: review?(benj)
This patch eliminates more copies around vshufps when AVX is available.
Attachment #8541782 - Flags: review?(benj)
This patch eliminates the extra use of the lhs operand in LMulI when it isn't needed. Since the extra use is not AtStart, it causes extra register conflicts.
Attachment #8541783 - Flags: review?(benj)
Comment on attachment 8541781 [details] [diff] [review]
shuffle-constraints.patch

Review of attachment 8541781 [details] [diff] [review]:
-----------------------------------------------------------------

Nice. r=me if you agree with all fixes, otherwise please re-flag as r?

::: js/src/jit/shared/BaseAssembler-x86-shared.h
@@ +3165,5 @@
>          twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, src1, src0, dst);
>      }
> +    void vunpcklps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> +    {
> +        twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, offset, base, src0, dst);

(not in this patch, but wondering: it seems that twoByteOpVex{,_disp32} have the same code, could we delete one of these, or just have one call the other? See http://dxr.mozilla.org/mozilla-central/source/js/src/jit/shared/BaseAssembler-x86-shared.h#4728,4747 )

@@ +3178,5 @@
>          twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, src1, src0, dst);
>      }
> +    void vunpckhps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> +    {
> +        twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, offset, base, src0, dst);

It would be nice if either 'offset' became an 'int', or twoByteOpSimd expected an int32_t offset.

@@ +3843,5 @@
>          threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, src1, src0, dst);
>      }
> +    void vinsertps_imr(uint32_t mask, int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> +    {
> +        threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, offset, base, src0, dst);

(can't find this function in dxr)

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +2450,5 @@
>      if (ins->lanesMatch(2, 3, 6, 7)) {
> +        FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
> +        FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg;
> +        masm.vmovhlps(lhs, rhsCopy, output);
> +        if (!AssemblerX86Shared::HasAVX())

Considering that reusedInputAlignedFloat32x4 executes a different path if !HasAVX(), I'd really prefer having all of this explicit in grouped branches (the condensed form is a mess, even though it reduces the number of lines):

if (HasAVX()) {
  FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
  masm.vmovhlps(lhs, rhsCopy, out);
} else {
  masm.loadAlignedFloat32x4(rhs, ScratchSimdReg);
  masm.vmovhlps(lhs, ScratchSimdReg, ScratchSimdReg);
  masm.moveFloat32x4(ScratchSimdReg, out);
}


or even 


FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
if (HasAVX()) {
  masm.vmovhlps(lhs, rhsCopy, out);
} else {
  masm.vmovhlps(lhs, rhsCopy, ScratchSimdReg); // or masm.vmovhlps(lhs, ScratchSimdReg, ScratchSimdReg);
  masm.moveFloat32x4(ScratchSimdReg, out);
}

@@ +2462,5 @@
> +            rhsCopy = FloatRegister::FromCode(rhs.fpu());
> +        } else {
> +            masm.loadAlignedFloat32x4(rhs, ScratchSimdReg);
> +            rhsCopy = ScratchSimdReg;
> +        }

Might be worth either explaining why we're not using masm.reusedInputAlignedFloat32x4 (the reason being that rhs isn't overwritten), or having a helper function next to reusedInputAlignedFloat32x4 for this case where rhs isn't overwritten.

@@ +2478,5 @@
> +        FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
> +        FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg;
> +        masm.vunpcklps(lhs, rhsCopy, output);
> +        if (!AssemblerX86Shared::HasAVX())
> +             masm.moveFloat32x4(ScratchSimdReg, out);

Same comment as for mask (2, 3, 6, 7).

@@ +2493,5 @@
> +        FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
> +        FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg;
> +        masm.vunpckhps(lhs, rhsCopy, output);
> +        if (!AssemblerX86Shared::HasAVX())
> +             masm.moveFloat32x4(ScratchSimdReg, out);

ditto
Attachment #8541781 - Flags: review?(benj) → review+
Comment on attachment 8541782 [details] [diff] [review]
vex-shufps-nocopy.patch

Review of attachment 8541782 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/MacroAssembler-x86-shared.h
@@ +1016,3 @@
>      }
>      void shuffleMix(uint32_t mask, const Operand &src, FloatRegister dest) {
>          // Note this uses vshufps, which is a cross-domain penaly on CPU where it

pre-existing: penalty*
Attachment #8541782 - Flags: review?(benj) → review+
Comment on attachment 8541783 [details] [diff] [review]
lmuli-omit-use.patch

Review of attachment 8541783 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/Lowering-x86-shared.cpp
@@ +156,5 @@
>  LIRGeneratorX86Shared::lowerMulI(MMul *mul, MDefinition *lhs, MDefinition *rhs)
>  {
> +    // Note: If we need a negative zero check, lhs is used twice.
> +    LAllocation lhsCopy = mul->canBeNegativeZero() ? use(lhs) : LAllocation();
> +    LMulI *lir = new(alloc()) LMulI(useRegisterAtStart(lhs), useOrConstant(rhs), lhsCopy);

Couldn't LMulI get a third parameter which is a |const LDefinition &temp| instead, use |setTemp| in the ctor body, and have this lhsCopy be a tempCopy() instead? Or is this not compatible with our register allocators?
Attachment #8541783 - Flags: review?(benj) → review+
> @@ +156,5 @@
> >  LIRGeneratorX86Shared::lowerMulI(MMul *mul, MDefinition *lhs, MDefinition *rhs)
> >  {
> > +    // Note: If we need a negative zero check, lhs is used twice.
> > +    LAllocation lhsCopy = mul->canBeNegativeZero() ? use(lhs) : LAllocation();
> > +    LMulI *lir = new(alloc()) LMulI(useRegisterAtStart(lhs), useOrConstant(rhs), lhsCopy);
> 
> Couldn't LMulI get a third parameter which is a |const LDefinition &temp|
> instead, use |setTemp| in the ctor body, and this lhsCopy would be a
> tempCopy() instead? Or is this not compatible with our register allocators
> so far?

Temps are for registers which the instruction can write to. In this case, we need an extra copy of the input, but we don't need to write to it. I think what it's doing is right.

Checked in the two simpler patches here; I'll address your comments on the shuffle-constraints.patch separately.

https://hg.mozilla.org/integration/mozilla-inbound/rev/6fb6a530e257
https://hg.mozilla.org/integration/mozilla-inbound/rev/e6d8db11ed44
Whiteboard: [leave open]
(In reply to Benjamin Bouvier [:bbouvier] from comment #3)
> Comment on attachment 8541781 [details] [diff] [review]
> shuffle-constraints.patch
> 
> Review of attachment 8541781 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Nice. r=me if you agree with all fixes, otherwise please re-flag as r?
> 
> ::: js/src/jit/shared/BaseAssembler-x86-shared.h
> @@ +3165,5 @@
> >          twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, src1, src0, dst);
> >      }
> > +    void vunpcklps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> > +    {
> > +        twoByteOpSimd("vunpcklps", VEX_PS, OP2_UNPCKLPS_VsdWsd, offset, base, src0, dst);
> 
> (not in this patch, but wondering: it seems that twoByteOpVex{,_disp32} have
> the same code, could we delete one of these, or just have one call the
> other? See
> http://dxr.mozilla.org/mozilla-central/source/js/src/jit/shared/
> BaseAssembler-x86-shared.h#4728,4747 )

As of 09a2d1f803a4, twoByteOpVex calls memoryModRM rather than memoryModRM_disp32, so they're not exactly the same code anymore.

There is still a lot of redundancy in that file, though. I'm wondering if we should push the Assembler's Operand class into the BaseAssembler interfaces so that we don't have separate copies of instructions for each way of accessing memory; perhaps we could also make disp32-ness a flag in Operand, so that we avoid duplicating regular and disp32 functions.

> @@ +3178,5 @@
> >          twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, src1, src0, dst);
> >      }
> > +    void vunpckhps_mr(int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> > +    {
> > +        twoByteOpSimd("vunpckhps", VEX_PS, OP2_UNPCKHPS_VsdWsd, offset, base, src0, dst);
> 
> It would be nice if either 'offset' became an 'int', or twoByteOpSimd
> expected an int32_t offset.

As of 0ea34b180725, twoByteOpSimd expects an int32_t offset.

> @@ +3843,5 @@
> >          threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, src1, src0, dst);
> >      }
> > +    void vinsertps_imr(uint32_t mask, int32_t offset, RegisterID base, XMMRegisterID src0, XMMRegisterID dst)
> > +    {
> > +        threeByteOpImmSimd("vinsertps", VEX_PD, OP3_INSERTPS_VpsUps, ESCAPE_INSERTPS, mask, offset, base, src0, dst);
> 
> (can't find this function in dxr)

It exists as of 736d53322a1d.

> ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
> @@ +2450,5 @@
> >      if (ins->lanesMatch(2, 3, 6, 7)) {
> > +        FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs, ScratchSimdReg);
> > +        FloatRegister output = AssemblerX86Shared::HasAVX() ? out : ScratchSimdReg;
> > +        masm.vmovhlps(lhs, rhsCopy, output);
> > +        if (!AssemblerX86Shared::HasAVX())
> 
> Considering that reusedInputAlignedFloat32x4 executes a different path if
> !HasAVX(), I'd really prefer having all of this explicit in grouped
> branches (the condensed form is a mess, even though it reduces the number
> of lines):
> 
> if (HasAVX()) {
>   FloatRegister rhsCopy = masm.reusedInputAlignedFloat32x4(rhs,
> ScratchSimdReg);
>   masm.vmovhlps(lhs, rhsCopy, out);
> } else {
>   masm.loadAlignedFloat32x4(rhs, ScratchSimdReg);
>   masm.vmovhlps(lhs, ScratchSimdReg, ScratchSimdReg);
>   masm.moveFloat32x4(ScratchSimdReg, out);
> }

Sounds good to me. I agree it's more readable this way. I've updated the patch in the three places where this comes up.

> @@ +2462,5 @@
> > +            rhsCopy = FloatRegister::FromCode(rhs.fpu());
> > +        } else {
> > +            masm.loadAlignedFloat32x4(rhs, ScratchSimdReg);
> > +            rhsCopy = ScratchSimdReg;
> > +        }
> 
> Might be worth either explaining why we're not using
> masm.reusedInputAlignedFloat32x4 (the reason being that rhs isn't
> overwritten), or having a helper function next to
> reusedInputAlignedFloat32x4 for this case where rhs isn't overwritten.

I added a comment. I decided against a helper function because the vmovlhps case is a fairly unusual one, since the instruction doesn't have a memory-operand form.

https://hg.mozilla.org/integration/mozilla-inbound/rev/f46c2220f91b
Whiteboard: [leave open]