Bug 1065339 (Closed) — Opened 5 years ago, Closed 5 years ago

IonMonkey: support the x86/x64 VEX instruction encoding.

Categories

(Core :: JavaScript Engine: JIT, defect)

Hardware: x86_64 Linux

Tracking

Status: RESOLVED FIXED
Target milestone: mozilla37

People

(Reporter: dougc, Assigned: dougc)

References

Details

Attachments

(7 files, 24 obsolete files)

94.96 KB, patch (dougc: review+, dougc: checkin+)
191.41 KB, patch
50.06 KB, patch
91.58 KB, patch (jandem: review+)
10.45 KB, patch (jandem: review+)
8.60 KB, patch (jandem: review+)
6.42 KB, patch (jandem: review+)
More recent x86/x64 processors, with the AVX feature, support a new encoding for many SSE instructions that is often more compact and also can encode more operands. Being able to encode more operands should help avoid unnecessary copying and improve register usage. These instructions can also help avoid penalties that occur when high bits of an SSE register need to be preserved when executing instructions - the VEX encoded instructions can often zero the high bits or copy them from an additional argument.
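To illustrate the difference (a hedged sketch in AT&T syntax, matching the assembly spew used below; the register choices are arbitrary):

    # Legacy SSE encoding: two operands, destructive - the destination is also an input.
    addps  %xmm1, %xmm0          # xmm0 = xmm0 + xmm1
    # VEX encoding: three operands, non-destructive - both inputs survive.
    vaddps %xmm2, %xmm1, %xmm0   # xmm0 = xmm1 + xmm2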
Assignee: nobody → dtc-moz
The patch makes some use of the VEX encoding, mainly for relatively trivial code generation paths, such as simple maths operations. There are likely many more uses for the VEX encoding in more complex code generation paths, but this patch should cover a lot.
Worked on this to try to solve some performance issues with the SIMD code. Unfortunately it did not resolve the issue. The generated code does look a lot better, but the out-of-order processors appear to be able to hide a lot of the instructions that this helps eliminate. Is this something that you might want to land any time soon?
Flags: needinfo?(jdemooij)
(In reply to Douglas Crosher [:dougc] from comment #3)
> Worked on this to try to solve some performance issues with the SIMD code.
> Unfortunately it did not resolve the issue. The generated code does look a
> lot better, but the out-of-order processors appear to be able to hide a lot
> of the instructions that this helps eliminate. Is this something that you
> might want to land any time soon?

Part 1 is a big patch but most of it seems to be just adding more instructions to the assembler. This looks really interesting, although it'd help if we had a testcase where it showed wins...

I don't know a lot about SIMD though. sunfish/bbouvier, do you guys think this will help at some point? Do compilers like LLVM/GCC emit these instructions as well, with the right compiler flags? Does it help them?
Flags: needinfo?(sunfish)
Flags: needinfo?(jdemooij)
Flags: needinfo?(benj)
Comment on attachment 8487130 [details] [diff] [review]
1. Assembler support for the x86/x64 VEX instruction encoding.

Review of attachment 8487130 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/Assembler-x86-shared.cpp
@@ +196,5 @@
>          maxSSEVersion = Min(maxSSEVersion, maxEnabledSSEVersion);
> +
> +    static const int AVXBit = 1 << 28;
> +    if (getenv("DISABLE_AVX"))
> +        avxEnabled = false;

We should add a proper flag to do this, like the --no-sse4 flag, instead of using an environment variable.

::: js/src/jit/shared/BaseAssembler-x86-shared.h
@@ +496,5 @@
>      }
>  
>      void push_i32(int imm)
>      {
> +        spew("push       %s$0x%x", PRETTY_PRINT_OFFSET(imm));

In general, please split out general formatting fixes into a separate patch. The era of semantic diffing is not yet upon us ;-).

@@ +682,5 @@
> +    void vpaddd_rr(XMMRegisterID src1, XMMRegisterID src2, XMMRegisterID dst)
> +    {
> +        spew("vpaddd     %s, %s, %s", nameFPReg(src1), nameFPReg(src2), nameFPReg(dst));
> +        m_formatter.twoByteOpVex(PRE_SSE_66, OP2_PADDD_VdqWdq, (RegisterID)dst, src1, (RegisterID)src2);
> +    }

Random idea: Having separate functions for the non-VEX and VEX versions of every instruction at this level means that we'll have a lot of duplication in our wrapper layers, and in the Codegen code itself. What if this BaseAssembler API just used a single function for both versions that would take the extra operand, and then do MOZ_ASSERT_IF(hasVEX, src2 == dst) or so? I haven't thought about this deeply yet though. What do you think?
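For concreteness, a rough sketch of what such a unified entry point might look like (hypothetical shape, not the actual patch; useVEX() and the prefix call are assumed names, and the aliasing constraint binds when the legacy two-operand encoding is emitted):

    void vpaddd_rr(XMMRegisterID src1, XMMRegisterID src2, XMMRegisterID dst)
    {
        if (!useVEX()) {
            // The legacy two-operand form computes dst = dst + src2, so it
            // can only express the case where src1 aliases the destination.
            MOZ_ASSERT(src1 == dst);
            m_formatter.prefix(PRE_SSE_66);
            m_formatter.twoByteOp(OP2_PADDD_VdqWdq, (RegisterID)dst, (RegisterID)src2);
            return;
        }
        // Three-operand VEX form: dst = src1 + src2.
        m_formatter.twoByteOpVex(PRE_SSE_66, OP2_PADDD_VdqWdq,
                                 (RegisterID)dst, src1, (RegisterID)src2);
    }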

@@ +4196,5 @@
>          m_formatter.twoByteOp(OP2_SQRTSD_VsdWsd, (RegisterID)dst, (RegisterID)src);
>      }
> +    void vsqrtsd_rr(XMMRegisterID src1, XMMRegisterID src2, XMMRegisterID dst)
> +    {
> +        spew("vsqrtsd    %s, %s, %s", nameFPReg(src1), nameFPReg(src2), nameFPReg(dst));

I believe vsqrtsd only has two register operands, as it only has one input.

@@ +4237,5 @@
>      }
> +    void vroundss_rr(XMMRegisterID src1, XMMRegisterID src2, XMMRegisterID dst, RoundingMode mode)
> +    {
> +        spew("vroundss   %s, %s, %s, %d",
> +             nameFPReg(src1), nameFPReg(src2), nameFPReg(dst), (int)mode);

GCC and LLVM both print vroundss and friends with the immediate listed first (in the AT&T-style syntax at least, which is what the rest of the assembly spew is using). I'd prefer to match their syntax, when possible.

@@ +5351,5 @@
>              putModRm(mode, reg, hasSib);
>              m_buffer.putByteUnchecked((scale << 6) | ((index & 7) << 3) | (base & 7));
>          }
>  
> +    public:

This looks a little suspicious at first glance. Is vblendvps the only thing that needs this? If so, what makes it special? If not, what's the right API here?
Comment on attachment 8487131 [details] [diff] [review]
2. Make use of the x86/x64 VEX instruction encoding.

Review of attachment 8487131 [details] [diff] [review]:
-----------------------------------------------------------------

Cool, this is good stuff. Here are some comments based on a quick readthrough.

::: js/src/jit/LIR-Common.h
@@ +308,5 @@
>  };
>  
>  // SIMD selection of lanes from two int32x4 or float32x4 arguments based on a
>  // int32x4 argument.
> +class LSimdSelect : public LInstructionHelper<1, 3, 1>

Is the temp here for the and+andn+or sequence? On AVX, if we use vblendvps, we can avoid this.
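For reference, a hedged sketch of the single-instruction alternative (AT&T syntax; registers arbitrary):

    # Per-lane select on the sign bit of the mask in xmm0:
    # each dst lane = (mask lane sign bit set) ? xmm2 lane : xmm1 lane
    vblendvps %xmm0, %xmm2, %xmm1, %xmm3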

::: js/src/jit/shared/Lowering-x86-shared.cpp
@@ +101,5 @@
>  
>  bool
>  LIRGeneratorX86Shared::lowerForFPU(LInstructionHelper<1, 2, 0> *ins, MDefinition *mir, MDefinition *lhs, MDefinition *rhs)
>  {
> +    if (Assembler::HasAVX()) {

Testing for "AVX" is a little confusing here, because we're still using "SSE" instructions here, but with the VEX encoding. Would it make sense to say "HasVEX" for this kind of use?

@@ +327,5 @@
> +            lins->setOperand(1, use(ins->getOperand(1)));
> +            lins->setOperand(2, useRegister(ins->getOperand(2)));
> +            lins->setTemp(0, tempFloat32());
> +            return define(lins, ins);
> +        }

When we have AVX, can we go straight to vblendvps (since all chips with AVX have SSE4.1)?

@@ +357,5 @@
>          return define(lir, ins);
> +      }
> +      case MIRType_Float32x4: {
> +        if (Assembler::HasAVX()) {
> +            LAllocation x = useRegister(ins->getOperand(0));

Elsewhere in this patch LSimdSplatX4 is Codegen'd with vshufps(0, r, r, output). That's safe for useRegisterAtStart because it reads both of the inputs before writing the output.

@@ +378,5 @@
> +
> +    if (ins->compareType() == MSimdBinaryComp::CompareInt32x4) {
> +        LSimdBinaryCompIx4 *add = new(alloc()) LSimdBinaryCompIx4();
> +        if (ins->operation() == MSimdBinaryComp::lessThan) {
> +            // Only have a greaterThen instruction so swap the arguments.

I think this comment is backward; we have cmpltps but not cmpgtps.
(In reply to Jan de Mooij [:jandem] from comment #4)
> Part 1 is a big patch but most of it seems to be just adding more
> instructions to the assembler. This looks really interesting, although it'd
> help if we had a testcase where it showed wins...
> 
> I don't know a lot about SIMD though. sunfish/bbouvier, do you guys think
> this will help at some point? Do compilers like LLVM/GCC emit these
> instructions as well, with the right compiler flags? Does it help them?

Yes, LLVM and GCC both emit VEX-encoded SSE instructions when they believe the target machine supports them. I don't know the performance numbers offhand.
Flags: needinfo?(sunfish)
(In reply to Dan Gohman [:sunfish] from comment #5)
> Comment on attachment 8487130 [details] [diff] [review]
> 1. Assembler support for the x86/x64 VEX instruction encoding.
> 
> Review of attachment 8487130 [details] [diff] [review]:
> -----------------------------------------------------------------

Thank you for the feedback.

> ::: js/src/jit/shared/Assembler-x86-shared.cpp
> @@ +196,5 @@
> >          maxSSEVersion = Min(maxSSEVersion, maxEnabledSSEVersion);
> > +
> > +    static const int AVXBit = 1 << 28;
> > +    if (getenv("DISABLE_AVX"))
> > +        avxEnabled = false;
> 
> We should add a proper flag to do this, like the --no-sse4 flag, instead of
> using an environment variable.

There is a proper flag too, '--no-avx', but I like to test browsers too. We have such environment variable controls on the ARM backend. Should I add a pref?

> ::: js/src/jit/shared/BaseAssembler-x86-shared.h
> @@ +496,5 @@
> >      }
> >  
> >      void push_i32(int imm)
> >      {
> > +        spew("push       %s$0x%x", PRETTY_PRINT_OFFSET(imm));
> 
> In general, please split out general formatting fixes into a separate patch.
> The era of semantic diffing is not yet upon us ;-).

Ok, shall do.

> @@ +682,5 @@
> > +    void vpaddd_rr(XMMRegisterID src1, XMMRegisterID src2, XMMRegisterID dst)
> > +    {
> > +        spew("vpaddd     %s, %s, %s", nameFPReg(src1), nameFPReg(src2), nameFPReg(dst));
> > +        m_formatter.twoByteOpVex(PRE_SSE_66, OP2_PADDD_VdqWdq, (RegisterID)dst, src1, (RegisterID)src2);
> > +    }
> 
> Random idea: Having separate functions for the non-VEX and VEX versions of
> every instruction at this level means that we'll have a lot of duplication
> in our wrapper layers, and in the Codegen code itself. What if this
> BaseAssembler API just used a single function for both versions that would
> take the extra operand, and then do MOZ_ASSERT_IF(hasVEX, src2 == dst) or
> so? I haven't thought about this deeply yet though. What do you think?

My first attempt was to have the code in BaseAssembler dynamically change the emitted instructions based on the AVX feature. Optional arguments were added to the instructions and if not supplied then they defaulted to values that were close to the no-VEX semantics. It really did not save much code. It made the VEX instructions harder to read with the extra operands at the end - typically the extra operand is a src1 and is much easier to read being placed at the start of the arguments. The semantics were not always the same and it did not seem appropriate for the BaseAssembler layer to be changing semantics based on features. Developers might want to depend on the semantics and it seems cleaner to do so by using differently named methods. The BaseAssembler does not make any other feature based decisions - they are all done at a higher layer. It also changed all the existing methods which might increase the risks of taking the patch.

Could you accept having separate new methods for now?

> @@ +4196,5 @@
> >          m_formatter.twoByteOp(OP2_SQRTSD_VsdWsd, (RegisterID)dst, (RegisterID)src);
> >      }
> > +    void vsqrtsd_rr(XMMRegisterID src1, XMMRegisterID src2, XMMRegisterID dst)
> > +    {
> > +        spew("vsqrtsd    %s, %s, %s", nameFPReg(src1), nameFPReg(src2), nameFPReg(dst));
> 
> I believe vsqrtsd only has two register operands, as it only has one input.

This one has three operands. It takes the sqrt of src2 and copies the high bits from src1.
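In other words, a sketch of the manual's semantics (AT&T operand order, vsqrtsd %src2, %src1, %dst):

    # dst[63:0]   = sqrt(src2[63:0])
    # dst[127:64] = src1[127:64]
    vsqrtsd %xmm2, %xmm1, %xmm0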

> @@ +4237,5 @@
> >      }
> > +    void vroundss_rr(XMMRegisterID src1, XMMRegisterID src2, XMMRegisterID dst, RoundingMode mode)
> > +    {
> > +        spew("vroundss   %s, %s, %s, %d",
> > +             nameFPReg(src1), nameFPReg(src2), nameFPReg(dst), (int)mode);
> 
> GCC and LLVM both print vroundss and friends with the immediate listed first
> (in the AT&T-style syntax at least, which is what the rest of the assembly
> spew is using). I'd prefer to match their syntax, when possible.

Ok, shall do. I was just following the formatting for roundss which has the immediate at the end - should this be updated too?

I see we do not follow the Intel conventions, and are closer to the gdb disassembly conventions, but is there a reference you want us to follow?
 
> @@ +5351,5 @@
> >              putModRm(mode, reg, hasSib);
> >              m_buffer.putByteUnchecked((scale << 6) | ((index & 7) << 3) | (base & 7));
> >          }
> >  
> > +    public:
> 
> This looks a little suspicious at first glance. Is vblendvps the only thing
> that needs this? If so, what makes it special? If not, what's the right API
> here?

Yes, just for vblendvps. It went from two operands with an implicit use of xmm0 to four operands. Was loath to add a formatter just for this but shall do if that's the right thing to do?
Flags: needinfo?(sunfish)
(In reply to Dan Gohman [:sunfish] from comment #6)
> Comment on attachment 8487131 [details] [diff] [review]
> 2. Make use of the x86/x64 VEX instruction encoding.
> 
> Review of attachment 8487131 [details] [diff] [review]:
> -----------------------------------------------------------------

Thank you again.

> ::: js/src/jit/LIR-Common.h
> @@ +308,5 @@
> >  };
> >  
> >  // SIMD selection of lanes from two int32x4 or float32x4 arguments based on a
> >  // int32x4 argument.
> > +class LSimdSelect : public LInstructionHelper<1, 3, 1>
> 
> Is the temp here for the and+andn+or sequence? On AVX, if we use vblendvps,
> we can avoid this.

Yes, for and+andn+or. Could have avoided the temp by ending one of the arguments at the start but the trade off would have been that an unnecessary copy might be needed if the argument is used again. With VEX there is an opportunity to improve register allocation.

The semantics of vblendvps are different - it only checks the high bits. It has been tested and did not resolve performance issues. We might need to limit the use of vblendvps to cases in which we can prove that the semantic difference does not change the behavior of the code, for example when following a comparison that sets all bits. So we might still need and+andn+or as a fallback. I'll have to leave this for others to work out the spec details, but there is a patch in bug 1061637 that can at least be used to explore the performance, and the assembler support is being added here.

> ::: js/src/jit/shared/Lowering-x86-shared.cpp
> @@ +101,5 @@
> >  
> >  bool
> >  LIRGeneratorX86Shared::lowerForFPU(LInstructionHelper<1, 2, 0> *ins, MDefinition *mir, MDefinition *lhs, MDefinition *rhs)
> >  {
> > +    if (Assembler::HasAVX()) {
> 
> Testing for "AVX" is a little confusing here, because we're still using
> "SSE" instructions here, but with the VEX encoding. Would it make sense to
> say "HasVEX" for this kind of use?

Intel use the AVX feature to denote the availability of VEX encoded instructions. There are more available with the AVX2 feature, but it's not called VEX2. HasAVX() looks ok to me, but just let me know if you want it renamed?
 
> @@ +327,5 @@
> > +            lins->setOperand(1, use(ins->getOperand(1)));
> > +            lins->setOperand(2, useRegister(ins->getOperand(2)));
> > +            lins->setTemp(0, tempFloat32());
> > +            return define(lins, ins);
> > +        }
> 
> When we have AVX, can we go straight to vblendvps (since all chips with AVX
> have SSE4.1)?

See above about resolving the semantic difference of blendvps.
 
> @@ +357,5 @@
> >          return define(lir, ins);
> > +      }
> > +      case MIRType_Float32x4: {
> > +        if (Assembler::HasAVX()) {
> > +            LAllocation x = useRegister(ins->getOperand(0));
> 
> Elsewhere in this patch LSimdSplatX4 is Codegen'd with vshufps(0, r, r,
> output). That's safe for useRegisterAtStart because it reads both of the
> inputs before writing the output.

It's not a safety issue, rather an optimization. There is no need to end the life of the input at the start of the operation, it can still be live at the end when using the VEX encoding. This could save an unnecessary copy if the argument is used again.
 
> @@ +378,5 @@
> > +
> > +    if (ins->compareType() == MSimdBinaryComp::CompareInt32x4) {
> > +        LSimdBinaryCompIx4 *add = new(alloc()) LSimdBinaryCompIx4();
> > +        if (ins->operation() == MSimdBinaryComp::lessThan) {
> > +            // Only have a greaterThen instruction so swap the arguments.
> 
> I think this comment is backward; we have cmpltps but not cmpgtps.

Looks ok. This is for packed signed integers. There's cmpgtpd but not cmpltpd.
Flags: needinfo?(benj)
(In reply to Douglas Crosher [:dougc] from comment #9)
> (In reply to Dan Gohman [:sunfish] from comment #6)
> > Comment on attachment 8487131 [details] [diff] [review]
> > 2. Make use of the x86/x64 VEX instruction encoding.
> > 
> > Review of attachment 8487131 [details] [diff] [review]:
> > -----------------------------------------------------------------
> 
> Thank you again.
> 
> > ::: js/src/jit/LIR-Common.h
> > @@ +308,5 @@
> > >  };
> > >  
> > >  // SIMD selection of lanes from two int32x4 or float32x4 arguments based on a
> > >  // int32x4 argument.
> > > +class LSimdSelect : public LInstructionHelper<1, 3, 1>
> > 
> > Is the temp here for the and+andn+or sequence? On AVX, if we use vblendvps,
> > we can avoid this.
> 
> Yes, for and+andn+or. Could have avoided the temp by ending one of the
> arguments at the start but the trade off would have been that an unnecessary
> copy might be needed if the argument is used again. With VEX there is an
> opportunity to improve register allocation.
> 
> The semantics of vblendvps are different - it only checks the high bits. It
> has been tested and did not resolve performance issues. We might need to
> limit the use of vblendvps to cases in which we can prove that the semantic
> difference does not change the behavior of the code, for example when
> following a comparison that sets all bits. So we might still need
> and+andn+or as a fallback. I'll have to leave this for others to work out
> the spec details, but there is a patch in bug 1061637 that can at least be
> used to explore the performance, and the assembler support is being added
> here.

Ah, I'm getting a little ahead of myself here. This is actually an issue in our current spec for the select operation; it doesn't document what happens if a lane of the input mask is not all-ones or all-zeros.

I'm considering proposing that we fix the spec by saying that select only looks at the sign bits. Platforms which need to do something like and+andn+or will then need to do an extra comparison, or a vector signed-right-shift, to create all-zeros and all-ones values, but they can avoid it in the case where they can prove that the input is a comparison or similar. What do you think?
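For example, a hedged sketch of the extra canonicalization step (AT&T syntax; mask in %xmm0, inputs in %xmm1/%xmm2):

    psrad  $31, %xmm0      # replicate each lane's sign bit -> all-ones or all-zeros
    andps  %xmm0, %xmm1    # keep lanes of xmm1 where the mask is all-ones
    andnps %xmm2, %xmm0    # keep lanes of xmm2 where the mask is all-zeros
    orps   %xmm0, %xmm1    # combine; result in xmm1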

> > ::: js/src/jit/shared/Lowering-x86-shared.cpp
> > @@ +101,5 @@
> > >  
> > >  bool
> > >  LIRGeneratorX86Shared::lowerForFPU(LInstructionHelper<1, 2, 0> *ins, MDefinition *mir, MDefinition *lhs, MDefinition *rhs)
> > >  {
> > > +    if (Assembler::HasAVX()) {
> > 
> > Testing for "AVX" is a little confusing here, because we're still using
> > "SSE" instructions here, but with the VEX encoding. Would it make sense to
> > say "HasVEX" for this kind of use?
> 
> Intel use the AVX feature to denote the availability of VEX encoded
> instructions. There are more available with the AVX2 feature, but it's not
> called VEX2. HasAVX() looks ok to me, but just let me know if you want it
> renamed?

Ok, HasAVX is fine. I tend to associate "AVX" with 256-bit vectors and so on, and "VEX" with 3-operand encodings, which I think of as orthogonal, but you're right that it's all considered one ISA feature, so I'll just have to get used to it :-).

> > @@ +327,5 @@
> > > +            lins->setOperand(1, use(ins->getOperand(1)));
> > > +            lins->setOperand(2, useRegister(ins->getOperand(2)));
> > > +            lins->setTemp(0, tempFloat32());
> > > +            return define(lins, ins);
> > > +        }
> > 
> > When we have AVX, can we go straight to vblendvps (since all chips with AVX
> > have SSE4.1)?
> 
> See above about resolving the semantic difference of blendvps.

Ok.

> > @@ +357,5 @@
> > >          return define(lir, ins);
> > > +      }
> > > +      case MIRType_Float32x4: {
> > > +        if (Assembler::HasAVX()) {
> > > +            LAllocation x = useRegister(ins->getOperand(0));
> > 
> > Elsewhere in this patch LSimdSplatX4 is Codegen'd with vshufps(0, r, r,
> > output). That's safe for useRegisterAtStart because it reads both of the
> > inputs before writing the output.
> 
> It's not a safety issue, rather an optimization. There is no need to end the
> life of the input at the start of the operation, it can still be live at the
> end when using the VEX encoding. This could save an unnecessary copy if the
> argument is used again.

The AtStart bit just specifies where this one instruction needs the value to be live. If other instructions use the same value, the lifetime is automatically extended regardless. Using AtStart is actually the more optimizable form, since it says that the input doesn't overlap with the output, making it possible for the input to use the same register as one of the outputs.
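A small illustration of the distinction at the lowering level (the LSimdSplatX4 constructor shape here is assumed, not taken from the patch):

    // With AtStart, the allocator knows the input is only needed when the
    // instruction begins, so the input may share a physical register with
    // the output - vshufps reads all inputs before writing.
    LSimdSplatX4 *lir = new(alloc()) LSimdSplatX4(useRegisterAtStart(ins->getOperand(0)));
    return define(lir, ins);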

> > @@ +378,5 @@
> > > +
> > > +    if (ins->compareType() == MSimdBinaryComp::CompareInt32x4) {
> > > +        LSimdBinaryCompIx4 *add = new(alloc()) LSimdBinaryCompIx4();
> > > +        if (ins->operation() == MSimdBinaryComp::lessThan) {
> > > +            // Only have a greaterThen instruction so swap the arguments.
> > 
> > I think this comment is backward; we have cmpltps but not cmpgtps.
> 
> Looks ok. This is for packed signed integers. There's cmpgtpd but not
> cmpltpd.

Oh, you mean there's pcmpgtd but not pcmpltd. Bizarre. The comment is fine then, though it might be nice to add a mention that integer comparisons are different than floating-point comparisons in this detail.
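For the record, a sketch of the operand swap (registers arbitrary; destructive SSE form):

    # want: mask = (xmm1 < xmm2), per signed int32 lane
    movdqa  %xmm2, %xmm0
    pcmpgtd %xmm1, %xmm0   # xmm0 = (xmm2 > xmm1) == (xmm1 < xmm2)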
Flags: needinfo?(sunfish)
(In reply to Douglas Crosher [:dougc] from comment #8)
> (In reply to Dan Gohman [:sunfish] from comment #5)
> > Comment on attachment 8487130 [details] [diff] [review]
> > 1. Assembler support for the x86/x64 VEX instruction encoding.
> > 
> > Review of attachment 8487130 [details] [diff] [review]:
> > -----------------------------------------------------------------
> 
> Thank you for the feedback.
> 
> > ::: js/src/jit/shared/Assembler-x86-shared.cpp
> > @@ +196,5 @@
> > >          maxSSEVersion = Min(maxSSEVersion, maxEnabledSSEVersion);
> > > +
> > > +    static const int AVXBit = 1 << 28;
> > > +    if (getenv("DISABLE_AVX"))
> > > +        avxEnabled = false;
> > 
> > We should add a proper flag to do this, like the --no-sse4 flag, instead of
> > using an environment variable.
> 
> There is a proper flag too, '--no-avx', but I like to test browsers too. We
> have such environment variable controls on the ARM backend. Should I add a
> pref?

Yes. I think this feature is interesting enough to motivate adding a pref. Thanks!

> > @@ +682,5 @@
> > > +    void vpaddd_rr(XMMRegisterID src1, XMMRegisterID src2, XMMRegisterID dst)
> > > +    {
> > > +        spew("vpaddd     %s, %s, %s", nameFPReg(src1), nameFPReg(src2), nameFPReg(dst));
> > > +        m_formatter.twoByteOpVex(PRE_SSE_66, OP2_PADDD_VdqWdq, (RegisterID)dst, src1, (RegisterID)src2);
> > > +    }
> > 
> > Random idea: Having separate functions for the non-VEX and VEX versions of
> > every instruction at this level means that we'll have a lot of duplication
> > in our wrapper layers, and in the Codegen code itself. What if this
> > BaseAssembler API just used a single function for both versions that would
> > take the extra operand, and then do MOZ_ASSERT_IF(hasVEX, src2 == dst) or
> > so? I haven't thought about this deeply yet though. What do you think?
> 
> My first attempt was to have the code in BaseAssembler dynamically change
> the emitted instructions based on the AVX feature. Optional arguments were
> added to the instructions and if not supplied then they defaulted to values
> that were close to the no-VEX semantics. It really did not save much code.
> It made the VEX instructions harder to read with the extra operands at the
> end - typically the extra operand is a src1 and is much easier to read being
> placed at the start of the arguments. The semantics were not always the same
> and it did not seem appropriate for the BaseAssembler layer to be changing
> semantics based on features. Developers might want to depend on the
> semantics and it seems cleaner to do so by using differently named methods.
> The BaseAssembler does not make any other feature based decisions - they are
> all done at a higher layer. It also changed all the existing methods which
> might increase the risks of taking the patch.
> 
> Could you accept having separate new methods for now?

I could, but I'd like to explore this idea a little further first. Bubbling this concern all the way up to CodeGenerator-* files is unfortunate, as the CodeGenerator-* files have a greater need for clarity IMHO; they tend to get changed more often, and they are more often visited by people who otherwise have no need to think about instruction encoding issues.

I think that if this approach makes sense, the extra changes to existing code can be justified.

Using asserts to check that src0 == dst in non-AVX mode would avoid the argument ordering problem you mention with the default argument approach.

What are the situations where the semantics of non-AVX differ from AVX, aside from the unused high parts of the registers?

> > @@ +4196,5 @@
> > >          m_formatter.twoByteOp(OP2_SQRTSD_VsdWsd, (RegisterID)dst, (RegisterID)src);
> > >      }
> > > +    void vsqrtsd_rr(XMMRegisterID src1, XMMRegisterID src2, XMMRegisterID dst)
> > > +    {
> > > +        spew("vsqrtsd    %s, %s, %s", nameFPReg(src1), nameFPReg(src2), nameFPReg(dst));
> > 
> > I believe vsqrtsd only has two register operands, as it only has one input.
> 
> This one has three operands. It takes the sqrt of src2 and copies the high
> bits from src1.

Ah, you're right. Bizarre again ;-). Please add a comment mentioning this.

> > @@ +4237,5 @@
> > >      }
> > > +    void vroundss_rr(XMMRegisterID src1, XMMRegisterID src2, XMMRegisterID dst, RoundingMode mode)
> > > +    {
> > > +        spew("vroundss   %s, %s, %s, %d",
> > > +             nameFPReg(src1), nameFPReg(src2), nameFPReg(dst), (int)mode);
> > 
> > GCC and LLVM both print vroundss and friends with the immediate listed first
> > (in the AT&T-style syntax at least, which is what the rest of the assembly
> > spew is using). I'd prefer to match their syntax, when possible.
> 
> Ok, shall do. I was just following the formatting for roundss which has the
> immediate at the end - should this be updated too?

Yes, the immediate for roundss should be first.

> I see we do not follow the Intel conventions, and are closer to the gdb
> disassembly conventions, but is there a reference you want us to follow?

There's no official reference for the format, but gdb, gas, gcc, as well as llvm etc. all use the same format by default on x86, so you can use whichever of those tools you prefer.
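For example, the two conventions for the same instruction (rounding-mode immediate of 1):

    # AT&T (gas/gdb/llvm default): immediate first, destination last
    vroundss $0x1, %xmm2, %xmm1, %xmm0
    # Intel: destination first, immediate last
    # vroundss xmm0, xmm1, xmm2, 1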

> > @@ +5351,5 @@
> > >              putModRm(mode, reg, hasSib);
> > >              m_buffer.putByteUnchecked((scale << 6) | ((index & 7) << 3) | (base & 7));
> > >          }
> > >  
> > > +    public:
> > 
> > This looks a little suspicious at first glance. Is vblendvps the only thing
> > that needs this? If so, what makes it special? If not, what's the right API
> > here?
> 
> Yes, just for vblendvps. It went from two operands with an implicit use of
> xmm0 to four operands. Was loath to add a formatter just for this but shall
> do if that's the right thing to do?

Yeah, I think with the way the code is currently organized, a wrapper for this makes sense. That said, I wouldn't be opposed if someone wanted to reorganize this code :-).
wip, rebased, and addresses some of the issues raised in feedback.
Attachment #8487130 - Attachment is obsolete: true
wip.
Attachment #8487131 - Attachment is obsolete: true
Rebase.
Attachment #8489375 - Attachment is obsolete: true
Rebased. Added support for newly added instructions: min, max, insert operations. Some fixes.
Attachment #8489383 - Attachment is obsolete: true
Rebase. Use VEX instructions for min, max, and insert operations, plus a few others.
Attachment #8489384 - Attachment is obsolete: true
Rebase.
Attachment #8493074 - Attachment is obsolete: true
Rebase, adding VEX versions of cvttps2dq and cvtdq2ps.
Attachment #8493076 - Attachment is obsolete: true
Rebase to use VEX instructions for SimdConvertFrom.
Attachment #8493077 - Attachment is obsolete: true
Rebase. Some spew formatting fixes.
Attachment #8495876 - Attachment is obsolete: true
Rebase. Add VEX support for some SIMD shift instructions. Some fixes.
Attachment #8495877 - Attachment is obsolete: true
Rebase. Make use of VEX for the SIMD shift operations.
Attachment #8495878 - Attachment is obsolete: true
Rebase.
Attachment #8498844 - Attachment is obsolete: true
Rebase. Add assembler support for the VEX versions of rcpps, rsqrtps, and sqrtps, and use these when VEX is available. Optimize the SIMD unary operations when VEX is available to use-at-start the input.
Attachment #8498845 - Attachment is obsolete: true
Rebase. Optimize the SIMD unary operations when VEX is available to use-at-start the input.
Attachment #8498846 - Attachment is obsolete: true
Comment on attachment 8503587 [details] [diff] [review]
1. Remove unnecessary newlines in argument lists.

Review of attachment 8503587 [details] [diff] [review]:
-----------------------------------------------------------------

I'm cool with that.
Attachment #8503587 - Flags: feedback+
Rebase. If you're happy with these changes then it would help to land this one.
Attachment #8503587 - Attachment is obsolete: true
Attachment #8509399 - Flags: review?(sunfish)
Rebase, mainly adding the assembler support for bug 1021716 and adding VEX encoding for the new instructions.
Attachment #8503588 - Attachment is obsolete: true
Rebase, mainly making use of VEX instructions to optimize the SIMD shuffle changes from bug 1021716. There appears to be more room for improvement here.
Attachment #8503589 - Attachment is obsolete: true
Comment on attachment 8509399 [details] [diff] [review]
1. Remove unnecessary newlines in argument lists.

Review of attachment 8509399 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #8509399 - Flags: review?(sunfish) → review+
Rebase. Carry forward r+. Try run:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=a7f3eb96a200
Attachment #8509399 - Attachment is obsolete: true
Attachment #8510789 - Flags: review+
Rebase. Carry forward r+.
Attachment #8510789 - Attachment is obsolete: true
Attachment #8510812 - Flags: review+
Rebase, mainly around the atomic ops.
Attachment #8509400 - Attachment is obsolete: true
Rebase.
Attachment #8509402 - Attachment is obsolete: true
Requesting checkin of patch 1. Try run in comment 32. This patch is just source code formatting changes.
Attachment #8510812 - Flags: checkin+
Comment on attachment 8510813 [details] [diff] [review]
2. Support the x86/x64 VEX instruction encoding.

Review of attachment 8510813 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/BaseAssembler-x86-shared.h
@@ +847,5 @@
>          m_formatter.twoByteOp(OP2_SUBPS_VpsWps, (RegisterID)dst, base, offset);
>      }
> +    void vsubps_mr(XMMRegisterID src1, int offset, RegisterID base, XMMRegisterID dst)
> +    {
> +        spew("vsubps     %s, %s0x%x(%s), %s",

Putting the memory operand in the middle deviates from both AT&T style (what GCC and LLVM do by default) and Intel style. This affects the spew string, but it is also reflected in the order of the operands to functions like vsubps_mr. Given that the rest of the file is in AT&T style, we need to use that style consistently, even if it is a little counter-intuitive for things like vsubps.
Attached patch vex.patch (obsolete) — Splinter Review
I was really curious what might be done under the approach of having the assembler expose one interface which can do both VEX and "legacy" SSE encoding. The attached patch is an initial attempt to see what this could look like. The actual VEX encoding code is derived from patch 2 here. It's also based on top of the patches in bug 1096684 and bug 1096707.

This patch keeps code duplication fairly low, and it also permitted a cute code-size optimization of using SSE encodings when three-input encoding isn't actually needed. I didn't use any default arguments, and possibly there are other opportunities for refactoring. What do you think?
Attachment #8520328 - Flags: feedback?(dtc-moz)
Comment on attachment 8520328 [details] [diff] [review]
vex.patch

Review of attachment 8520328 [details] [diff] [review]:
-----------------------------------------------------------------

If it works for the code generation we need then this seems fine.

fwiw it seemed more appropriate to me to just enumerate all the different instructions that Intel defines at the Assembler layer as there are semantic differences, and the macro-assembler layer seemed like the more appropriate layer to refactor common usage. How can we select one encoding or the other, for the semantic difference, if the only interface makes the decision for us?

I'd like to see this move forward, and if you have a plan then you are welcome to take this bug; perhaps I can help review and scrutinize patches.

::: js/src/jit/shared/Assembler-x86-shared.cpp
@@ +195,5 @@
>      if (maxEnabledSSEVersion != UnknownSSE)
>          maxSSEVersion = Min(maxSSEVersion, maxEnabledSSEVersion);
> +
> +    static const int AVXBit = 1 << 28;
> +    avxPresent = (flagsECX & AVXBit) && avxEnabled;

fwiw I did look into having a preference that could control the use of AVX encodings. The preference would need to be captured in an Ion compilation context to be consistent and there is already some support in Ion for this. However it appeared that the assembler was being used in other contexts too so it was looking more suited to a separate bug that might also allow control of the use of the other features. Using an environment variable seems ok for a start.

::: js/src/jit/shared/BaseAssembler-x86-shared.h
@@ +4155,5 @@
> +        }
> +
> +        // If src0 is the same as the output register, we might as well use
> +        // the SSE encoding, since it is smaller. However, this is only
> +        // beneficial as long as we're not using ymm registers anywhere.

It was not clear that this is the case. I got the impression that the VEX encoding was an attempt to make instruction encodings more compact. Note the patches were not using optimal VEX encodings and could have been improved in some cases. Also the semantics differ and it appears these might even affect performance.
Attachment #8520328 - Flags: feedback?(dtc-moz) → feedback+
Attached patch vex.patch (obsolete) — Splinter Review
Here's an updated patch implementing VEX encodings.

> It was not clear that this is the case. I got the impression that the VEX
> encoding was an attempt to make instruction encodings more compact. Note the
> patches were not using optimal VEX encodings and could have been improved in
> some cases. Also the semantics differ and it appears these might even
> affect performance.

I also looked at encodings from GAS and llvm-mc, and AVX is consistently 1 byte longer per instruction there too. The reduction in code size comes from needing fewer copy instructions.

According to Intel's optimization guide, and Agner, there's no performance difference between the encodings. However, if we do ever discover otherwise, it's pretty easy to change the code in the useLegacySSEEncoding function.
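A minimal sketch of the shape such a check could take (assumed, not the actual patch; the predicate name follows the function mentioned above):

    // Prefer the one-byte-shorter legacy SSE encoding whenever the
    // three-operand form isn't actually needed, i.e. when the first
    // source already aliases the destination.
    bool useLegacySSEEncoding(XMMRegisterID src0, XMMRegisterID dst)
    {
        if (src0 == dst)
            return true;
        // A distinct destination requires the three-operand VEX form.
        MOZ_ASSERT(HasAVX());
        return false;
    }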
Attachment #8520328 - Attachment is obsolete: true
Attached patch vex-vblendps-vblendvps.patch (obsolete) — Splinter Review
This patch implements vblendps and vblendvps on top of the vex.patch patch, as an example of how we can handle special cases where the AVX and SSE versions are different.
Depends on: 1096707
(In reply to Douglas Crosher [:dougc] from comment #41)
> If it works for the code generation we need then this seems fine.
> 
> fwiw it seemed more appropriate to me to just enumerate all the different
> instructions that Intel defines at the Assembler layer as there are semantic
> differences, and the macro-assembler layer seemed like the more appropriate
> layer to refactor common usage. How can we select one encoding or the other,
> for the semantic difference, if the only interface makes the decision for us?
> 
> I'd like to see this move forward, and if you have a plan then you are
> welcome to take this bug; perhaps I can help review and scrutinize patches.

Ok, I updated the patch and added another patch which implements vblendps and vblendvps, as an example of how we can handle this. I think this approach is working, and I still like the way it reduces redundancy in the Assembler layer. What do you think?

> ::: js/src/jit/shared/Assembler-x86-shared.cpp
> @@ +195,5 @@
> >      if (maxEnabledSSEVersion != UnknownSSE)
> >          maxSSEVersion = Min(maxSSEVersion, maxEnabledSSEVersion);
> > +
> > +    static const int AVXBit = 1 << 28;
> > +    avxPresent = (flagsECX & AVXBit) && avxEnabled;
> 
> fwiw I did look into having a preference that could control the use of AVX
> encodings. The preference would need to be captured in an Ion compilation
> context to be consistent and there is already some support in Ion for this.
> However it appeared that the assembler was being used in other contexts too
> so it was looking more suited to a separate bug that might also allow
> control of the use of the other features. Using an environment variable
> seems ok for a start.

Ok, I filed bug 1097265 to discuss this.
(In reply to Dan Gohman [:sunfish] from comment #44)
> (In reply to Douglas Crosher [:dougc] from comment #41)
> > If it works for the code generation we need then this seems fine.
> > 
> > fwiw it seemed more appropriate to me to just enumerate all the different
> > instructions that Intel defines at the Assembler layer as there are semantic
> > differences, and the macro-assembler layer seemed like the more appropriate
> > layer to refactor common usage. How can we select one encoding or the other,
> > for the semantic difference, if the only interface makes the decision for us?
> > 
> > I'd like to see this move forward, and if you have a plan then you are
> > welcome to take this bug; perhaps I can help review and scrutinize patches.
> 
> Ok, I updated the patch and added another patch which implements vblendps
> and vblendvps, as an example of how we can handle this. I think this
> approach is working, and I still like the way it reduces redundancy in the
> Assembler layer. What do you think?

Same comments as before. The semantics are changing depending on the arguments in a way that does not match the instruction manual, and how can we choose one encoding over the other when there is a semantic difference? It's not a big issue so long as it supports all the needed code generation, just bike shedding, so you decide. Only way to know is to finish it and get some experience using it.

Looks like v8 is adding VEX support (https://codereview.chromium.org/764863002/) and I saw another patch exploiting FMA.
> Same comments as before. The semantics are changing depending on the
> arguments in a way that does not match the instruction manual, and how can
> we choose one encoding over the other for the semantic difference. It's not
> a bit issue so long at it supports all the needed code generation, just bike
> shedding, so you decide. Only way to know is to finish it and get some
> experience using it.

The semantic differences between blendps and blendvps, and vblendps and vblendvps, seem to be (a) blendps and blendvps impose stricter constraints on their inputs, and (b) blendps and blendvps don't zero the high half of the output ymm register. Is there something further? If not, I'm proposing that we can handle the former with assertions and the latter by ignoring the high halves of ymm registers (for now).

I'm also assuming that when we start caring about full ymm registers, we'll need to avoid all non-AVX encodings anyway, because mixing non-AVX encodings and uses of full ymm registers is a performance hazard. Based on these assumptions, the CodeGenerator won't ever find itself wanting to choose between VEX and non-VEX encodings for semantic reasons.

An explicitly non-AVX API would be more convenient for the CodeGenerator in cases where it has to emit extra code to conform to the extra constraints, but less convenient when the constraints are handled by Lowering, and looking at both patch sets here, I don't think the benefits are worth the extra redundancy at the Assembler layer, which already has a lot of redundancy for dispatching on Operand kinds.
Attached patch vex.patchSplinter Review
Rebased and updated patch.

I also added some convenience methods, reusedInputFloat32x4 and similar, to make it easy for the CodeGenerator to avoid some copies when AVX is available.

I believe I'm ready to submit this for review; I haven't found any insurmountable problems with the approach of having the assembler automatically determine which encoding to use, and this approach avoids significant code duplication.
Attachment #8523163 - Attachment is obsolete: true
Attachment #8532154 - Flags: review?(jdemooij)
Rebased and updated.
Attachment #8523165 - Attachment is obsolete: true
Attachment #8532155 - Flags: review?(jdemooij)
One extra followup patch for using vblendvps for minNum and maxNum.
Attachment #8532156 - Flags: review?(jdemooij)
One more little patch to rename moveAlignedInt32x4 to moveInt32x4 since it's a register-register move and alignment isn't relevant. The other patches depend on this.
Attachment #8532157 - Flags: review?(jdemooij)
Comment on attachment 8532154 [details] [diff] [review]
vex.patch

Review of attachment 8532154 [details] [diff] [review]:
-----------------------------------------------------------------

It's great the assembler can figure this out and the rest of the code doesn't have to worry about it.

Pretty exciting that we'll use a new encoding now.

::: js/src/jit/shared/Lowering-x86-shared.cpp
@@ +111,5 @@
> +    if (!Assembler::HasAVX() ||
> +        !(IsFloatingPointType(mir->type()) ||
> +          (IsSimdType(mir->type()) &&
> +           IsFloatingPointType(SimdTypeToScalarType(mir->type())))))
> +    { 

Nit: some trailing whitespace here and we can remove the empty line below. Also the condition is pretty hard to read, it might be nice to add a static helper function like this:

if (!Assembler::HasAVX())
    return ...;
if (...)
    return ...;
if (...)
    return ...;
return ...;
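For instance, a sketch of such a helper (hypothetical name; the predicate mirrors the quoted condition):

    static bool
    CanUseThreeOperandVEX(MDefinition *mir)
    {
        if (!Assembler::HasAVX())
            return false;
        if (IsFloatingPointType(mir->type()))
            return true;
        return IsSimdType(mir->type()) &&
               IsFloatingPointType(SimdTypeToScalarType(mir->type()));
    }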
Attachment #8532154 - Flags: review?(jdemooij) → review+
Attachment #8532155 - Flags: review?(jdemooij) → review+
Comment on attachment 8532156 [details] [diff] [review]
vex-vblendvps-minnum.patch

Review of attachment 8532156 [details] [diff] [review]:
-----------------------------------------------------------------

One instruction instead of 3 or 4? Nice.
Attachment #8532156 - Flags: review?(jdemooij) → review+
Attachment #8532157 - Flags: review?(jdemooij) → review+
Keywords: leave-open
For posterity, on AWFY x64 this was a 12.7% win on Kraken gaussian blur with the backtracking allocator and 5.7% with LSRA. Also a few % on darkroom. On x86 gaussian-blur is 6.5% faster with BT.

BT seems to benefit more than LSRA so the gap on Kraken is a bit smaller now. Pretty nice.
Depends on: 1110570
Depends on: 1118235