Add experimental support for SIMD codegen

RESOLVED WORKSFORME

Status

()

Core
JavaScript Engine: JIT
--
enhancement
RESOLVED WORKSFORME
5 years ago
3 months ago

People

(Reporter: sunfish, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [js:t])

Attachments

(6 attachments, 12 obsolete attachments)

11.97 KB, patch
Details | Diff | Splinter Review
4.35 KB, patch
nbp
: review+
nbp
: checkin+
Details | Diff | Splinter Review
12.90 KB, patch
Details | Diff | Splinter Review
2.41 KB, patch
nbp
: review+
RyanVM
: checkin+
Details | Diff | Splinter Review
20.86 KB, patch
Details | Diff | Splinter Review
77.83 KB, patch
Details | Diff | Splinter Review
(Reporter)

Description

5 years ago
Created attachment 8344342 [details] [diff] [review]
simd-regs.patch

The attached patch adds experimental support for register allocation and spilling of 128-bit SIMD registers. Some remaining tasks include:
 - change PushRegsInMask and PopRegsInMask to save and restore 128-bit registers
 - several FIXMEs
 - decide if/when/how to pick between movapd/movaps/movdqa on x86/x64
 - probably other stuff

It would also be nice to refactor the enum situation a bit. Between LAllocation::Kind, LDefinition::Type, MoveOperand::Kind, MoveOperand::AddressKind, Move::Kind, and MIRType, there's a fair amount of redundancy/conversion/confusion.

Comment 1

5 years ago
We need to distinguish float32x4 and int32x4 in the codegen as when bailout happens, we need to box the SIMD128 register or SIMD128Slot into a typed object and float32x4 and init32x4 are different typed objects.

Comment 2

5 years ago
I met assert after picking up this code with the other SIMD patches at the https://github.com/nikomatsakis/iontrail/tree/hfeng1-bug943766-unboxed-float32x4. The test case is from test/emca_6/TypedObject/simd/sum.js in the above branch.

The LIR is: [MoveGroup] [=xmm0 -> stack:v4] [=rax -> stack:i6], and =xmm0 is considered as a double register, and we assert at JS_ASSERT(from->isDouble() == to->isDouble()); from jit/CodeGenerator.cpp:1184.

The reason is that the "assign" function gives a LAllocation(best) to a liveInterval, the LAllocation(best) could only distinguish double register and general registers in the current implementation, when SIMD128 registers comes in, we need to refactor this function a little bit. I suggest:

bool assign(AnyRegister best);

And in the assign, set the correct LAllocation according to the best and current->virtualRegister()->type() which was mentioned by Dan from the email discussion.

Comment 3

5 years ago
Reading more codes reveals that assign also handles stack slots. Maybe we only need to add a new function.

bool assign(AnyRegister best) {
  LAllocation allocation(best, current->virtualRegister()->type());
  assign(allocation);
}

I assume the spilling slots for SIMD128 is already handled well :)

Comment 4

5 years ago
Created attachment 8345129 [details]
0001-Introduce-registerKind-and-LSIMD128Reg-for-SIMD128.patch

Introduce registerKind and LSIMD128Reg for SIMD API. I have committed it at https://github.com/nikomatsakis/iontrail/commit/42b24a7b6d78c14e241ce046937a331ce2899e3c.

Comment 5

5 years ago
Created attachment 8345142 [details]
0002-Introduce-ToSIMD128Register.patch

Now the assertion is at jit/shared/MoveEmitter-x86-shared.cpp:139.
"
          case Move::SIMD128:                                                   
            MOZ_ASSUME_UNREACHABLE("FIXME");
"

I will implement this.
(Reporter)

Comment 6

5 years ago
Comment on attachment 8345129 [details]
0001-Introduce-registerKind-and-LSIMD128Reg-for-SIMD128.patch

Review of attachment 8345129 [details]:
-----------------------------------------------------------------

Overall this patch looks good! Thanks for working on this. I have a few comments below.

As you mentioned earlier, we'll need to split these SIMD128 things into float32x4 and int32x4, but we can easily do that later.

I wish there were some way we could avoid having to pass a kind to LAllocation's constructor in so many places, but for now I think it's fine.

::: js/src/jit/LIR.h
@@ +360,5 @@
> +
> +    FloatRegister reg() const {
> +        return FloatRegister::FromCode(data());
> +    }
> +};

Instead of creating an LSIMD128Reg class, which will need to be split for float32x4 and int32x4, would it make sense to just use LFloatReg for all allocations of FloatRegister, including SIMD128 allocations? You could replace LFloatReg's hard-coded FPU by adding a kind argument to its constructor and passing that to LAllocation's constructor.

::: js/src/jit/Lowering.cpp
@@ +1530,4 @@
>                                 tempFixed(CallTempReg2),
>                                 tempFixed(CallTempReg3),
>                                 tempFixed(CallTempReg4));
> +    if (!defineFixed(lir, ins, LAllocation(AnyRegister(CallTempReg5), LAllocation::GPR)))

Can this be LGeneralReg(CallTempReg5) instead of calling LAllocation's constructor directly?

@@ +1553,4 @@
>                                       tempFixed(CallTempReg1),
>                                       tempFixed(CallTempReg2),
>                                       tempFixed(CallTempReg3));
> +    if (!defineFixed(lir, ins, LAllocation(AnyRegister(CallTempReg5), LAllocation::GPR)))

Same here.

@@ +1609,4 @@
>  LIRGenerator::visitOsrEntry(MOsrEntry *entry)
>  {
>      LOsrEntry *lir = new LOsrEntry;
> +    return defineFixed(lir, entry, LAllocation(AnyRegister(OsrFrameReg), LAllocation::GPR));

And here.

::: js/src/jit/shared/Lowering-x86-shared.cpp
@@ +171,4 @@
>      LDivI *lir = new LDivI(useFixed(div->lhs(), eax), useRegister(div->rhs()), tempFixed(edx));
>      if (div->fallible() && !assignSnapshot(lir, Bailout_BaselineInfo))
>          return false;
> +    return defineFixed(lir, div, LAllocation(AnyRegister(eax), LAllocation::GPR));

Can this be LGeneralReg(eax) instead of calling LAllocation's constructor directly?

@@ +209,4 @@
>                             tempFixed(eax));
>      if (mod->fallible() && !assignSnapshot(lir, Bailout_BaselineInfo))
>          return false;
> +    return defineFixed(lir, mod, LAllocation(AnyRegister(edx), LAllocation::GPR));

Same here.

@@ +241,4 @@
>                                       tempFixed(edx));
>      if (div->fallible() && !assignSnapshot(lir, Bailout_BaselineInfo))
>          return false;
> +    return defineFixed(lir, div, LAllocation(AnyRegister(eax), LAllocation::GPR));

And here.

@@ +263,4 @@
>                                       tempFixed(eax));
>      if (mod->fallible() && !assignSnapshot(lir, Bailout_BaselineInfo))
>          return false;
> +    return defineFixed(lir, mod, LAllocation(AnyRegister(edx), LAllocation::GPR));

And here.

Comment 7

5 years ago
Thanks for the comment, I will use LGeneralReg as you have suggested. I have thought unifying a LRegister with a kind so that we do not need to distinguish general, float and simd128 registers. This could be done later.

Now I pretend the SIMD128 is float32x4, I will try to compile an int32x4 example to see where we have to use the float32x4/int32x4 information. In V8, we only use those information for deoptimization table, i.e., recording the raw XMM register and stack slot with a kind information. For IonMonkey, it seems the corresponding concept is snapshot, if we could get the float32x4/int32x4 info when we need to construct a snapshot entry (from mir or others), we do not need to waste a kind info for them.
(Reporter)

Comment 8

5 years ago
I think having a separate kind for float32x4 and int32x4 would be ok, if you need it. One nice aspect of having separate types is that it would facilitate the use of movaps for float32x4 and movdqa for int32x4.

Updated

5 years ago
Depends on: 948853

Comment 9

5 years ago
All the comments are addressed and created an entry (https://bugzilla.mozilla.org/show_bug.cgi?id=948853) for the refactoring.

Comment 10

5 years ago
This patch does not compile for x86.
(Reporter)

Comment 11

5 years ago
Many of the changes here are now subsumed by the patches in bug 950703. As we discussed elsewhere, I now think it makes more sense to put type information in LDefinition::Type rather than LAllocation::Kind. I apologize for the churn and extra work here; I myself am learning my way around this code.

One those changes are in, the patches for this bug will be able to be substantially simplified.
Depends on: 950703

Comment 12

5 years ago
No problem, thanks for working on this. I will rebase our SIMD branch when all your patches are committed.
(Reporter)

Comment 13

5 years ago
Comment on attachment 8344342 [details] [diff] [review]
simd-regs.patch

The changes in this patch are all either incorporated in other patches, or no longer relevant.
Attachment #8344342 - Attachment is obsolete: true
(Reporter)

Comment 14

5 years ago
The main known tasks here are now:

 - add new SIMD types to LDefinition::Type, MoveOp::Type, and SnapshotReader::SlotMode, and code to handle them
 - change LinearScan's freeAllocation and allocateSlotFor to handle the new registers
 - change PushRegsInMask and PopRegsInMask to save and restore 128-bit registers
 - change MachineState to use 128-bit values for FloatRegisters

The list above included deciding if/when/how to pick between movaps/movaps/movdqa on x86/x64; the "how" is now clear. The MoveOp::Type will tell the move emitter which move to emit.
(In reply to Dan Gohman [:sunfish] from comment #14)
> The main known tasks here are now:
>
>  - add new SIMD types to LDefinition::Type, MoveOp::Type, and
> SnapshotReader::SlotMode, and code to handle them
>  - change PushRegsInMask and PopRegsInMask to save and restore 128-bit
> registers
>  - change MachineState to use 128-bit values for FloatRegisters

Ask me for review on these 3 topics.

Updated

5 years ago
Attachment #8345129 - Attachment is obsolete: true
Attachment #8345129 - Attachment is patch: false

Updated

5 years ago
Attachment #8345142 - Attachment is obsolete: true
Attachment #8345142 - Attachment is patch: false

Comment 16

5 years ago
Created attachment 8355444 [details] [diff] [review]
0001-Introduce-fpreg_value_t.patch

For X86 and X64 ports, ARM port is missing.
1) change MachineState to use 128-bit values for FloatRegisters
2) change PushRegsInMask and PopRegsInMask to save and restore 128-bit registers

Comment 17

5 years ago
Created attachment 8355445 [details] [diff] [review]
0002-Allocate-SIMD128-registers.patch

Comment 18

5 years ago
Created attachment 8355446 [details] [diff] [review]
0003-Refactor-SIMD-implementation-to-expose-the-Create-AP.patch

The Create function will be used by the snapshot to re-construct a Value from float32x4/int32x4 register/stack.

Comment 19

5 years ago
Created attachment 8355447 [details] [diff] [review]
0004-Support-SIMD128-in-Snapshot.patch

Comment 20

5 years ago
Dan and nbp, please have a look.

This series of patch are extracted from https://github.com/nikomatsakis/gecko-dev/tree/simd128, in that branch, I could run some SIMD benchmarks with 3~5 speedup.

The remaining issues:
  1. A couple of regressions (16 new fails) when run the ./test/jstest.py. It will be great if you could point out what I have missed through reviewing the codes.
  2. ARM port is broken. It will be great if you could point out what is the ARM instruction of 128-bit float register load/store and I will include them in the patch set.

Thanks
-Haitao

Updated

4 years ago
Attachment #8355444 - Flags: review?(nicolas.b.pierron)

Updated

4 years ago
Attachment #8355445 - Flags: review?(nicolas.b.pierron)

Updated

4 years ago
Attachment #8355446 - Flags: review?(nicolas.b.pierron)

Updated

4 years ago
Attachment #8355447 - Flags: review?(nicolas.b.pierron)
(Reporter)

Comment 21

4 years ago
Comment on attachment 8355444 [details] [diff] [review]
0001-Introduce-fpreg_value_t.patch

Review of attachment 8355444 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/IonFrames.cpp
@@ +1334,4 @@
>  {
>      switch (slot.mode()) {
>        case SnapshotReader::DOUBLE_REG:
> +        return DoubleValue(machine_.read(slot.floatReg()).double_value);

I'd like to suggest adding a readDouble member function to MachineState which returns the double value. Then this code could just to machine_.readDouble(slot.floatReg()), which I think is cleaner.

@@ +1341,3 @@
>          // The register contains the encoding of a float32. We just read
>          // the bits without making any conversion.
> +        return Float32Value(machine_.read(slot.floatReg()).float_value);

Same here; a readFloat32 on MachineState would make this easier to read.

::: js/src/jit/x86/Trampoline-x86.cpp
@@ +683,5 @@
>  
>        case Type_Double:
> +        if (cx->runtime()->jitSupportsFloatingPoint) {
> +            masm.loadDouble(Address(esp, 0), ReturnFloatReg);
> +            masm.freeStack(sizeof(double));

Please add a comment here that mentions that this code doesn't use Pop since Pop now pops a whole SIMD register.
Comment on attachment 8355444 [details] [diff] [review]
0001-Introduce-fpreg_value_t.patch

Review of attachment 8355444 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/IonMacroAssembler.cpp
@@ +242,4 @@
>  void
>  MacroAssembler::PushRegsInMask(RegisterSet set)
>  {
> +    int32_t diffF = set.fpus().size() * sizeof(fpreg_value_t);

This change sounds unacceptable to me!

When we are not using SIMD this modification still adds a lot of overhead. Both in terms of memory when considering the space reserved on the stack, and in terms of speed, knowing the we spill/load more things.  This does not matter for bailouts, but this function is also used for some calls (see saveLive).

I would prefer that we add an extra set, for vector elements of the same size.  Even if both sets would be sets of FloatRegisters.
Attachment #8355444 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8355447 [details] [diff] [review]
0004-Support-SIMD128-in-Snapshot.patch

Review of attachment 8355447 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/IonFrames.cpp
@@ +1360,5 @@
>          return Float32Value(ReadFrameFloat32Slot(fp_, slot.stackSlot()));
>  
> +      case SnapshotReader::FLOAT32X4_REG:
> +      {
> +        JSContext *cx = GetIonContext()->cx;

GetIonContext is deprecated.

::: js/src/jit/Snapshots.cpp
@@ +56,5 @@
>  //                 0-27: Constant integer; Int32Value(n)
> +//                   24: Variable int32x4; Int32x4 register code
> +//                   25: Variable int32x4; Stack index
> +//                   26: Variable float32x4; Float32x4 register code
> +//                   27: Variable float32x4; Stack index

The constant integer comment above is not updated, and we should not waste the immediate constant representation like that.
Use a vector header, where you specify the number, and type of the content, and merge it with float.
Attachment #8355447 - Flags: review?(nicolas.b.pierron)

Comment 24

4 years ago
nbp,
Thanks for the review and recommendation. I will read saveLive and others to figure out how exactly we know whether a FPU register is used for double/float or 128-bit float32x4 and int32x4 value in the register set.

Marty,
nbp recommended you as the Ion ARM expert. Do we already have ARM loadSIMD128 or storeSIMD128 instructions in the tree? Do you have interested to implement SIMD API for ARM port? :) x86/x64 port prototype is nearly done at https://github.com/nikomatsakis/gecko-dev/tree/simd128.
Comment on attachment 8355445 [details] [diff] [review]
0002-Allocate-SIMD128-registers.patch

Review of attachment 8355445 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/IonTypes.h
@@ +74,5 @@
>      MIRType_Double,
>      MIRType_Float32,
>      MIRType_String,
> +    MIRType_Float32x4,
> +    MIRType_Int32x4,

Make a structure out of the MIRType, to represent vector types, and these 2 are just special cases of vector types.
We should represent the MIRType as Vector + Int32 *4 and Vector + Float32 *4.

If we do not do so, then we will have issues for vectorizing Adds, and for apply types in a near future, as this will increase a lot the complexity of isNumberType(…) and others, which by the way should be modified to support vector types too.

::: js/src/jit/LIR.h
@@ +435,5 @@
>          SLOTS,      // Slots/elements pointer that may be moved by minor GCs (GPR).
>          FLOAT32,    // 32-bit floating-point value (FPU).
>          DOUBLE,     // 64-bit floating-point value (FPU).
> +        FLOAT32X4,  // 128-bit floating-point value (FPU).
> +        INT32X4,    // 128-bit floating-point value (FPU).

We should represent that as VectorInt32 and VectorFloat32, and move the 4 to a dedicated field of the LDefinition.

::: js/src/jit/StackSlotAllocator.h
@@ +31,2 @@
>  
> +    uint32_t allocateSIMDSlot() {

nit: Rename this function to include the width of the SIMD data.
Attachment #8355445 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8355446 [details] [diff] [review]
0003-Refactor-SIMD-implementation-to-expose-the-Create-AP.patch

Review of attachment 8355446 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/SIMD.h
@@ +61,5 @@
> +    }
> +    static Elem toType(Elem a) {
> +        return ToInt32(a);
> +    }
> +    static void toType2(JSContext *cx, JS::Handle<JS::Value> v, Elem *out) {

Any reason to have a "2" at the end of this function name?
Attachment #8355446 - Flags: review?(nicolas.b.pierron) → review+
Comment on attachment 8355446 [details] [diff] [review]
0003-Refactor-SIMD-implementation-to-expose-the-Create-AP.patch

Review of attachment 8355446 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/SIMD.cpp
@@ +367,5 @@
>  }
>  
>  template<typename V>
> +JSObject *
> +js::Create(JSContext *cx, typename V::Elem *data)

nit: Add template specialization for Int32x4 and Float32x4 at the end of this function to prevent link time errors.

Comment 28

4 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #26)
> Comment on attachment 8355446 [details] [diff] [review]
> 0003-Refactor-SIMD-implementation-to-expose-the-Create-AP.patch
> 
> Review of attachment 8355446 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/builtin/SIMD.h
> @@ +61,5 @@
> > +    }
> > +    static Elem toType(Elem a) {
> > +        return ToInt32(a);
> > +    }
> > +    static void toType2(JSContext *cx, JS::Handle<JS::Value> v, Elem *out) {
> 
> Any reason to have a "2" at the end of this function name?

We could ask Ivan and Niko on this. I refactored them as we need to use the Create function to translate a Float32x4 or Int32x4 value into a TypedObject.
(In reply to haitao from comment #24)
...
> Do we already have ARM
> loadSIMD128 or storeSIMD128 instructions in the tree? Do you have interested
> to implement SIMD API for ARM port? :) x86/x64 port prototype is nearly done
> at https://github.com/nikomatsakis/gecko-dev/tree/simd128.

There does not appear to be support for these SIMD operations, but it would not he hard to add support.  The requested instructions are probably VLD4 and VST4.  Note that the ARMv7 SIMD float hardware only supports 'ARM Standard Floating Point Arithmetic' which flushes denormalized numbers to zero and might this make it unusable?

Is it practical for the denormal number handling to be implementation dependent for the SIMD operations?

Comment 30

4 years ago
dougc, thanks for the answer. The denormal number handling should be part of the spec. Add John for commenting on it.

Comment 31

4 years ago
Created attachment 8358230 [details] [diff] [review]
0003-Refactor-SIMD-implementation-to-expose-the-Create-AP.patch

nits addressed. Could you help me to land it first?
Attachment #8355446 - Attachment is obsolete: true

Updated

4 years ago
Attachment #8358230 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8358230 [details] [diff] [review]
0003-Refactor-SIMD-implementation-to-expose-the-Create-AP.patch

Review of attachment 8358230 [details] [diff] [review]:
-----------------------------------------------------------------

Apply nits and r=me.

::: js/src/builtin/SIMD.cpp
@@ +372,1 @@
>  Create(JSContext *cx, typename V::Elem *data)

nit: remove the namespace and prefix this function name by js:: instead.

@@ +384,5 @@
>  }
>  
> +template JSObject *Create<Float32x4>(JSContext *cx, Float32x4::Elem *data);
> +template JSObject *Create<Int32x4>(JSContext *cx, Int32x4::Elem *data);
> +}

nit: same here.

::: js/src/builtin/SIMD.h
@@ +75,1 @@
>  }  /* namespace js */

nit: Add a new line before the end of the namespace.
Attachment #8358230 - Flags: review?(nicolas.b.pierron) → review+

Comment 33

4 years ago
Created attachment 8359630 [details] [diff] [review]
0003-Refactor-SIMD-implementation-to-expose-the-Create-AP.patch

nits addressed.
Attachment #8358230 - Attachment is obsolete: true
Attachment #8359630 - Flags: review?(nicolas.b.pierron)

Comment 34

4 years ago
Created attachment 8359631 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch

Introduce loadSIMD128, storeSIMD128 and moveSIMD128 macro assembler instructions for x64 and x86. This could be landed first and we need implement the those three macro instructions for ARM before allocating 128-bit register.
Attachment #8355444 - Attachment is obsolete: true
Attachment #8359631 - Flags: review?(nicolas.b.pierron)

Comment 35

4 years ago
Created attachment 8359673 [details] [diff] [review]
Introduce-MIRType_Float32x4-MIRType_Int32x4-and-MIRT.patch

Address the comments to "Make a structure out of the MIRType, to represent vector types, and these 2 are just special cases of vector types. We should represent the MIRType as Vector + Int32 *4 and Vector + Float32 *4", We could refactor the MIRType like LDefinition in the future.
Attachment #8359673 - Flags: review?(nicolas.b.pierron)
Attachment #8359630 - Flags: review?(nicolas.b.pierron) → review+

Updated

4 years ago
Attachment #8359630 - Flags: checkin?(nicolas.b.pierron)

Updated

4 years ago
Attachment #8355445 - Flags: review?(jdemooij)

Comment 36

4 years ago
I asked jandem to review the register allocation so that I could address sunfish, nbp and his comments together to accelerate this progress.

For the issues I mentioned, 1) the regression issue is solved. 2) for the ARM port, I submitted a patch to introduce loadSIMD128 and storeSIMD128 for x86 and x64 port and asked nbp for review, if this is OK, we need to add them for ARM (Thanks dougc for pointing out the instruction, I think we could look at the Dart implementation to know the encoding), and then register allocator could handle x86/x64 and ARM together.

I am preparing the patch to address nbp's comments on pushing/poping the right size for float registers when saveLive (8 for double and 16 for SIMD), and will address sunfish's comment in the same patch set.
Comment on attachment 8359630 [details] [diff] [review]
0003-Refactor-SIMD-implementation-to-expose-the-Create-AP.patch

https://hg.mozilla.org/integration/mozilla-inbound/rev/defc74986fcc
Attachment #8359630 - Flags: checkin?(nicolas.b.pierron) → checkin+
Whiteboard: [leave-open]
Comment on attachment 8359631 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch

Review of attachment 8359631 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/MacroAssembler-x86-shared.h
@@ +465,4 @@
>          movaps(src, dest);
>      }
>  
> +    void loadSIMD128(const Address &src, FloatRegister dest) {

As we might want to ensure that we have the correct allocation for both spilling and loading, I think it might be better if we have a SIMDRegister instead of a FloatRegister even if both x86 and ARM[1] are implementing this on top of FPU register.

In addition, later, I think that we might want to check there is no overlap between FloatRegisters and SIMDRegisters while checking the LIRGraph.

[1] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473c/CJAIJHFC.html
Attachment #8359631 - Flags: review?(nicolas.b.pierron) → feedback+
Comment on attachment 8359673 [details] [diff] [review]
Introduce-MIRType_Float32x4-MIRType_Int32x4-and-MIRT.patch

Review of attachment 8359673 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/IonTypes.h
@@ +89,5 @@
> +    MIRType_Elements,      // An elements vector
> +    MIRType_Pointer,       // An opaque pointer that receives no special treatment
> +    MIRType_Shape,         // A Shape pointer.
> +    MIRType_ForkJoinSlice, // js::ForkJoinSlice*
> +    MIRType_Float32x4 = MIRType_Float32 | ((1 << ELEMENT_TYPE_BITS) * 2),

Add
  MIRType_Last = MIRType_ForkJoinSlice

And a static assert, which ensure that ELEMENT_TYPE_BITS is large enough to prevent any overlapping.

@@ +91,5 @@
> +    MIRType_Shape,         // A Shape pointer.
> +    MIRType_ForkJoinSlice, // js::ForkJoinSlice*
> +    MIRType_Float32x4 = MIRType_Float32 | ((1 << ELEMENT_TYPE_BITS) * 2),
> +    MIRType_Int32x4   = MIRType_Int32   | ((1 << ELEMENT_TYPE_BITS) * 2),
> +    MIRType_Doublex2  = MIRType_Double  | ((1 << ELEMENT_TYPE_BITS) * 1)

Use VECTOR_SCALE_SHIFT instead of ELEMENT_TYPE_BITS.
Write it:
 … | (2 << VECTOR_SIZE_SHIFT)
 … | (1 << VECTOR_SIZE_SHIFT)

Note that if x4 correspond to 2, and x2 correspond to 1, (and implictly x1 correspond to 0) then you should rename VECTOR_SIZE to VECTOR_SCALE.

@@ +97,5 @@
>  
>  static inline MIRType
> +ElementType(MIRType type)
> +{
> +    return static_cast<MIRType>((type >> ELEMENT_TYPE_SHIFT) && ELEMENT_TYPE_MASK);

&& Mask --> & Mask

And identically below.

@@ +103,5 @@
> +
> +static inline uint32_t
> +VectorSize(MIRType type)
> +{
> +    return 2 << ((type >> VECTOR_SIZE_SHIFT) && VECTOR_SIZE_MASK);

Replace 2 by 1, such as:
  1 << 0 == 1
  1 << 1 == 2
  1 << 2 == 4
Attachment #8359673 - Flags: review?(nicolas.b.pierron)

Comment 41

4 years ago
Created attachment 8362529 [details] [diff] [review]
distinguish-saveLive.patch
Attachment #8362529 - Flags: feedback?(nicolas.b.pierron)
Comment on attachment 8362529 [details] [diff] [review]
distinguish-saveLive.patch

Review of attachment 8362529 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/IonFrames.cpp
@@ +359,1 @@
>          machine.setRegisterLocation(*iter, --floatSpill);

What does nonDouble means?  Does it means SIMD128, Float32 ?

::: js/src/jit/IonMacroAssembler.cpp
@@ +294,5 @@
>      JS_ASSERT(diffF == 0);
> +
> +#ifdef JS_CPU_ARM
> +    adjustFrame(diffD);
> +    diffD += transferMultipleByRuns(doubleSet, IsStore, StackPointer, DB);

For ARM, I will suggest that we merge the sets of register as ARM registers are overlapping. Otherwise we might spill in 2 instructions what we could have done in one.

The best would be to ask Marty.

::: js/src/jit/Registers.h
@@ +78,4 @@
>      }
>  };
>  
> +typedef FloatRegister DoubleRegister;

This does not prevent against implicit type coercion.  This is a dangerous interface.

::: js/src/jit/shared/CodeGenerator-shared-inl.h
@@ +161,4 @@
>  {
>      JS_ASSERT(!ins->isCall());
>      LSafepoint *safepoint = ins->safepoint();
> +    masm.PopRegsInMaskIgnore(safepoint->liveRegs(), ignore, safepoint->doubleRegs());

the ignore flag should remain the last one, to see the difference compared to other function prototypes.
Attachment #8362529 - Flags: feedback?(nicolas.b.pierron)

Comment 43

4 years ago
Created attachment 8362742 [details] [diff] [review]
Introduce-MIRType_Float32x4-MIRType_Int32x4-and-MIRT.patch

Thanks for the review and sorry for the typos.
Attachment #8359673 - Attachment is obsolete: true
Attachment #8362742 - Flags: review?(nicolas.b.pierron)
(In reply to Nicolas B. Pierron [:nbp] from comment #42)
> Comment on attachment 8362529 [details] [diff] [review]
> distinguish-saveLive.patch
> 
> Review of attachment 8362529 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/jit/IonFrames.cpp
> @@ +359,1 @@
> >          machine.setRegisterLocation(*iter, --floatSpill);
> 
> What does nonDouble means?  Does it means SIMD128, Float32 ?
> 
> ::: js/src/jit/IonMacroAssembler.cpp
> @@ +294,5 @@
> >      JS_ASSERT(diffF == 0);
> > +
> > +#ifdef JS_CPU_ARM
> > +    adjustFrame(diffD);
> > +    diffD += transferMultipleByRuns(doubleSet, IsStore, StackPointer, DB);
> 
> For ARM, I will suggest that we merge the sets of register as ARM registers
> are overlapping. Otherwise we might spill in 2 instructions what we could
> have done in one.

Good point.

For example, the 128 bit SIMD register Q0 shares state with the 64 bit double float registers D0 and D1, and with the float32 registers S0 to S3 - in a similar manner to JS typed array views of an array buffer.

Some new infrastructure might be needed as I do not think Ion has support to manage this overlap.

There was a similar issue when adding the float32 support but it was worked around by only using the even float32 registers S0, S2, S4, .. S30 which allowed them to be accounted for as double float registers in many cases.
Depends on: 962263
Comment on attachment 8355445 [details] [diff] [review]
0002-Allocate-SIMD128-registers.patch

Review of attachment 8355445 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay. Looks good, just some minor comments, if you can post an updated patch I'll review it faster.

::: js/src/jit/LinearScan.cpp
@@ +822,5 @@
>      else if (IsNunbox(reg))
>          freed = &finishedNunboxSlots_;
>  #endif
> +    else if (reg->type() == LDefinition::FLOAT32X4 ||
> +             reg->type() == LDefinition::INT32X4)

Nit: this fits on one line.

@@ +914,5 @@
>              else if (mine->type() == LDefinition::BOX)
>                  finishedDoubleSlots_.append(interval);
>  #endif
> +            else if (mine->type() == LDefinition::FLOAT32X4 ||
> +                     mine->type() == LDefinition::INT32X4)

Same here.

::: js/src/jit/MoveResolver.h
@@ +119,5 @@
>          INT32,
>          FLOAT32,
> +        DOUBLE,
> +        FLOAT32X4,
> +        INT32X4

Do we really need separate values for these? IIUC a single MoveOp::SIMD128 should work.
Attachment #8355445 - Flags: review?(jdemooij)
(Reporter)

Comment 46

4 years ago
(In reply to Jan de Mooij [:jandem] from comment #45)
> ::: js/src/jit/MoveResolver.h
> @@ +119,5 @@
> >          INT32,
> >          FLOAT32,
> > +        DOUBLE,
> > +        FLOAT32X4,
> > +        INT32X4
> 
> Do we really need separate values for these? IIUC a single MoveOp::SIMD128
> should work.

Using separate values for these enables an optimization. On Nehalem processors for example, it's faster to use movaps for FLOAT32X4 data and movdqa for INT32X4 data, even though they are functionally equivalent [0].

[0] See "Data bypass delays on Nehalem" in http://www.agner.org/optimize/microarchitecture.pdf
(In reply to Dan Gohman [:sunfish] from comment #46)
> Using separate values for these enables an optimization. On Nehalem
> processors for example, it's faster to use movaps for FLOAT32X4 data and
> movdqa for INT32X4 data, even though they are functionally equivalent [0].

Ah interesting, this is probably fine then. haitao, it would be nice if we could change the move resolver to use movdqa and movaps as Dan described, with a comment explaining why we emit different instructions.

Comment 48

4 years ago
jandem, thanks for the review. I will address you and sunfish's comments in the move resolver. I am preparing a patch to introduce VectorRegister discussed with nbp offline, will update the patches together.
Attachment #8362742 - Flags: review?(nicolas.b.pierron) → review+

Updated

4 years ago
Attachment #8362742 - Flags: checkin?(nicolas.b.pierron)

Comment 49

4 years ago
Created attachment 8366593 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
Attachment #8359631 - Attachment is obsolete: true
Attachment #8366593 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8366593 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch

Review of attachment 8366593 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/Registers.h
@@ +72,1 @@
>          return code_ == other.code_;

This is not true on ARM, as this does not account for the overlapping of s3 and d1, nor d1 and q0.

::: js/src/jit/arm/MacroAssembler-arm.h
@@ +1268,5 @@
>      void loadDouble(const Address &addr, const FloatRegister &dest);
>      void loadDouble(const BaseIndex &src, const FloatRegister &dest);
>  
> +    void loadSIMD128(const Address &src, SIMD128Register dest, MoveOp::Type type) {
> +        //TODO: Implement this.

We write ":TODO:" in the JS engine, then when there is a TODO there is a bug number to a bug which concerns is to remove it.

::: js/src/jit/shared/MacroAssembler-x86-shared.h
@@ +477,5 @@
> +          default:
> +            MOZ_ASSUME_UNREACHABLE("unexpected MoveOp type");
> +        }
> +    }
> +    void loadSIMD128(const Operand &src, SIMD128Register dest, MoveOp::Type type) {

if you overload a template without any parameter, I think that the C++ still requires that you define an empty list of parameters, such as:

  template <>
  void loadSIMD128(const Operand &src, …
Attachment #8366593 - Flags: review?(nicolas.b.pierron) → feedback+
Comment on attachment 8362742 [details] [diff] [review]
Introduce-MIRType_Float32x4-MIRType_Int32x4-and-MIRT.patch

https://hg.mozilla.org/integration/mozilla-inbound/rev/2d1fd4b1fdaa
Attachment #8362742 - Flags: checkin?(nicolas.b.pierron) → checkin+

Comment 53

4 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #50)
> Comment on attachment 8366593 [details] [diff] [review]
> 0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
> 
> Review of attachment 8366593 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/jit/Registers.h
> @@ +72,1 @@
> >          return code_ == other.code_;
> 
> This is not true on ARM, as this does not account for the overlapping of s3
> and d1, nor d1 and q0.
>
I am introducing the overlapping relationship into AnyRegister and RegisterSet. When SIMD128Register is introduced, we should not compare FloatRegister with SIMD128Register directly, instead, we should return a overlapping SIMD128Register from FloatRegister and then compare, and vice versa. Basically we need a RegisterKind (GRP, FPU_64BITS and FPU_128BITS) in the AnyRegister to replace bool isFloat_, and replace isFloat() with kind() in all the places.
 
> ::: js/src/jit/arm/MacroAssembler-arm.h
> @@ +1268,5 @@
> >      void loadDouble(const Address &addr, const FloatRegister &dest);
> >      void loadDouble(const BaseIndex &src, const FloatRegister &dest);
> >  
> > +    void loadSIMD128(const Address &src, SIMD128Register dest, MoveOp::Type type) {
> > +        //TODO: Implement this.
> 
> We write ":TODO:" in the JS engine, then when there is a TODO there is a bug
> number to a bug which concerns is to remove it.
Done.
> 
> ::: js/src/jit/shared/MacroAssembler-x86-shared.h
> @@ +477,5 @@
> > +          default:
> > +            MOZ_ASSUME_UNREACHABLE("unexpected MoveOp type");
> > +        }
> > +    }
> > +    void loadSIMD128(const Operand &src, SIMD128Register dest, MoveOp::Type type) {
> 
> if you overload a template without any parameter, I think that the C++ still
> requires that you define an empty list of parameters, such as:
> 
>   template <>
>   void loadSIMD128(const Operand &src, …
We do not need that as for function templates, C++ finds the exact match before instantiating templates.

Comment 54

4 years ago
Created attachment 8367233 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
Attachment #8366593 - Attachment is obsolete: true
Attachment #8367233 - Flags: review?(nicolas.b.pierron)
I have to write the following down, as it sounds like a good plan, and this is related to what we want to achieve here.

One thing that I noticed, is that VFPRegister are good to express Float32 and Double registers, but NeonRegisters are the only one able to express Quad registers, but they can also address Double registers too (and still overlap singles).  I do not think this would be a good thing to distinguish VFP & Neon in the LIR / MIR.  Having them separated by size/type sounds like a better approach, such as:

  Float32Register
  DoubleRegister
  SIMD128Register

 or

  F32Register
  F64Register
  F32x4Register
  U32x4Register

Also, on ARM, we might want to convert them to either a VFPRegister or a NeonRegister, with something such as the VFPRegister constructor.  Such as we can give them as argument of the MacroAssembler-arm functions and they would be converted to the right type for the Assembler-arm.  To avoid duplicating functions, such as ma_vsub and ma_vsub_f32, maybe we should write them as:

  void ma_vsub(F32Register src1, F32Register src2, F32Register dst)
  void ma_vsub(F64Register src1, F64Register src2, F64Register dst)

 or

  template <FRegister>
  void ma_vsub(FRegister src1, FRegister src2, FRegister dst)

  template <> void MacroAssemblerARM::ma_vsub<F32Register>(F32Register src1, F32Register src2, F32Register dst);
  template <> void MacroAssemblerARM::ma_vsub<F64Register>(F64Register src1, F64Register src2, F64Register dst);

 or

  // Assert that all 3 registers have the same type.
  void ma_vsub(F32OrF64Register src1, F32OrF64Register src2, F32OrF64Register dst)

The reason why we want multiple Register kind is that we want multiple things:

 - CodeGen should be logical, an instruction which manipulate float32 (fp32) / double (fp64) / SIMD128 (fp128, fp32x4?) should be explicit about it and not have a generic ToFloatRegister().

 - The register allocator should allocate the right size/alignment for spilling these registers.

 - The MacroAssembler should have clear types to understand how registers are manipulated.

      AddScalar(Float32Register, Float32x4Register)

   sounds better than

      AddScalar(FloatRegister, FloatRegister)

 - The registers should be enumerable in a RegisterSet to be used by the register allocator and encoded in Safepoints.  One thing we can do is keep a generic FloatRegisterSet, from which we can take any kind of FloatingPointRegister.

 - We want to be able to quickly check that 2 FloatingPointRegister do not alias, as we want to assert for it in the LIR verification pass, as well as in the CodeGen (as sometimes we might want to have overlapping).

(In reply to haitao from comment #53)
> (In reply to Nicolas B. Pierron [:nbp] from comment #50)
> > Comment on attachment 8366593 [details] [diff] [review]
> > 0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
> > 
> > Review of attachment 8366593 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > ::: js/src/jit/Registers.h
> > @@ +72,1 @@
> > >          return code_ == other.code_;
> > 
> > This is not true on ARM, as this does not account for the overlapping of s3
> > and d1, nor d1 and q0.
> >
> I am introducing the overlapping relationship into AnyRegister and
> RegisterSet. When SIMD128Register is introduced, we should not compare
> FloatRegister with SIMD128Register directly, instead, we should return a
> overlapping SIMD128Register from FloatRegister and then compare, and vice
> versa. Basically we need a RegisterKind (GRP, FPU_64BITS and FPU_128BITS) in
> the AnyRegister to replace bool isFloat_, and replace isFloat() with kind()
> in all the places.

I agree we want to be able to check if AnyRegister alias another AnyRegister, but we also want to check among specialized registers without creating an AnyRegister.  We might want to assert in the CodeGen or in the MacroAssembler that we have different registers (inputs do not alias) or that we have identical registers (output alias the input), with potentially different kind of operands, such as the Intel instruction which duplicates a Double (Double register) to the upper part of an xmm register (Double x2 register).
Comment on attachment 8367233 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch

Review of attachment 8367233 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm/Architecture-arm.h
@@ +230,5 @@
> +        q12,
> +        q13,
> +        q14,
> +        q15,
> +        q16,

There is no more than 16 Quad register, from q0 to q15.

@@ +256,5 @@
> +    }
> +
> +    static const Code Invalid = invalid_freg;
> +
> +    static const uint32_t Total = 8;

Any reason to limit the number of quad register to 8?
Attachment #8367233 - Flags: review?(nicolas.b.pierron)

Comment 57

4 years ago
nbp, Thanks for explaining the plan. I will be on vacation for the Chinese New Year holiday for a week and continue upstreaming patches according to this plan.  

(In reply to Nicolas B. Pierron [:nbp] from comment #55)
> I have to write the following down, as it sounds like a good plan, and this
> is related to what we want to achieve here.
> 
> One thing that I noticed, is that VFPRegister are good to express Float32
> and Double registers, but NeonRegisters are the only one able to express
> Quad registers, but they can also address Double registers too (and still
> overlap singles).  I do not think this would be a good thing to distinguish
> VFP & Neon in the LIR / MIR.  Having them separated by size/type sounds like
> a better approach, such as:
> 
>   Float32Register
>   DoubleRegister
>   SIMD128Register
> 
>  or
> 
>   F32Register
>   F64Register
>   F32x4Register
>   U32x4Register
> 
> Also, on ARM, we might want to convert them to either a VFPRegister or a
> NeonRegister, with something such as the VFPRegister constructor.  Such as
> we can give them as argument of the MacroAssembler-arm functions and they
> would be converted to the right type for the Assembler-arm.  To avoid
> duplicating functions, such as ma_vsub and ma_vsub_f32, maybe we should
> write them as:
> 
>   void ma_vsub(F32Register src1, F32Register src2, F32Register dst)
>   void ma_vsub(F64Register src1, F64Register src2, F64Register dst)
> 
>  or
> 
>   template <FRegister>
>   void ma_vsub(FRegister src1, FRegister src2, FRegister dst)
> 
>   template <> void MacroAssemblerARM::ma_vsub<F32Register>(F32Register src1,
> F32Register src2, F32Register dst);
>   template <> void MacroAssemblerARM::ma_vsub<F64Register>(F64Register src1,
> F64Register src2, F64Register dst);
> 
>  or
> 
>   // Assert that all 3 registers have the same type.
>   void ma_vsub(F32OrF64Register src1, F32OrF64Register src2,
> F32OrF64Register dst)
> 
> The reason why we want multiple Register kind is that we want multiple
> things:
> 
>  - CodeGen should be logical, an instruction which manipulate float32 (fp32)
> / double (fp64) / SIMD128 (fp128, fp32x4?) should be explicit about it and
> not have a generic ToFloatRegister().
> 
>  - The register allocator should allocate the right size/alignment for
> spilling these registers.
> 
>  - The MacroAssembler should have clear types to understand how registers
> are manipulated.
> 
>       AddScalar(Float32Register, Float32x4Register)
> 
>    sounds better than
> 
>       AddScalar(FloatRegister, FloatRegister)
> 
>  - The registers should be enumerable in a RegisterSet to be used by the
> register allocator and encoded in Safepoints.  One thing we can do is keep a
> generic FloatRegisterSet, from which we can take any kind of
> FloatingPointRegister.
> 
>  - We want to be able to quickly check that 2 FloatingPointRegister do not
> alias, as we want to assert for it in the LIR verification pass, as well as
> in the CodeGen (as sometimes we might want to have overlapping).
> 
> (In reply to haitao from comment #53)
> > (In reply to Nicolas B. Pierron [:nbp] from comment #50)
> > > Comment on attachment 8366593 [details] [diff] [review]
> > > 0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
> > > 
> > > Review of attachment 8366593 [details] [diff] [review]:
> > > -----------------------------------------------------------------
> > > 
> > > ::: js/src/jit/Registers.h
> > > @@ +72,1 @@
> > > >          return code_ == other.code_;
> > > 
> > > This is not true on ARM, as this does not account for the overlapping of s3
> > > and d1, nor d1 and q0.
> > >
> > I am introducing the overlapping relationship into AnyRegister and
> > RegisterSet. When SIMD128Register is introduced, we should not compare
> > FloatRegister with SIMD128Register directly, instead, we should return a
> > overlapping SIMD128Register from FloatRegister and then compare, and vice
> > versa. Basically we need a RegisterKind (GRP, FPU_64BITS and FPU_128BITS) in
> > the AnyRegister to replace bool isFloat_, and replace isFloat() with kind()
> > in all the places.
> 
> I agree we want to be able to check if AnyRegister alias another
> AnyRegister, but we also want to check among specialized registers without
> creating an AnyRegister.  We might want to assert in the CodeGen or in the
> MacroAssembler that we have different registers (inputs do not alias) or
> that we have identical registers (output alias the input), with potentially
> different kind of operands, such as the Intel instruction which duplicates a
> Double (Double register) to the upper part of an xmm register (Double x2
> register).

Comment 58

4 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #56)
> Comment on attachment 8367233 [details] [diff] [review]
> 0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
> 
> Review of attachment 8367233 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/jit/arm/Architecture-arm.h
> @@ +230,5 @@
> > +        q12,
> > +        q13,
> > +        q14,
> > +        q15,
> > +        q16,
> 
> There is no more than 16 Quad register, from q0 to q15.
I copied this from assembler/assembler/ARMv7Assembler.h.
> 
> @@ +256,5 @@
> > +    }
> > +
> > +    static const Code Invalid = invalid_freg;
> > +
> > +    static const uint32_t Total = 8;
> 
> Any reason to limit the number of quad register to 8?
There are 16 total FloatRegisters (FP_64BITS), and quad registers overlap with float registers by 2, so the total number should be 8. This also simplifies the bailout handling as we only need to save/restore SIMD128Registers::Total * sizeof(simd128_t) on the stack for FP states. simd128_t contains a union of 4 floats, 4 int32s and 2 doubles.

Comment 59

4 years ago
Created attachment 8392019 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
Attachment #8367233 - Attachment is obsolete: true
Attachment #8392019 - Flags: review?(nicolas.b.pierron)

Comment 60

4 years ago
Created attachment 8392058 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
Attachment #8392019 - Attachment is obsolete: true
Attachment #8392019 - Flags: review?(nicolas.b.pierron)
Attachment #8392058 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8392058 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch

Review of attachment 8392058 [details] [diff] [review]:
-----------------------------------------------------------------

I am adding Marty as a reviewer as he is currently working on getting the Float/Double differentiation.

I would also prefer if we could get the Float & Double right before implementing SIMD registers, sorry for the delay.
Attachment #8392058 - Flags: review?(mrosenberg)
Comment on attachment 8392058 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch

Review of attachment 8392058 [details] [diff] [review]:
-----------------------------------------------------------------

Many things seems to be lacking compared to what is in the WIP of Marty (see Bug 957504).

Do you have the full set of modification yet?  This patch is incomplete, and this does not reassure me compare to all problem that Marty had faced in his WIP.  Do you have usage examples in MacroAssembler / CodeGen functions?

::: js/src/jit/shared/MacroAssembler-x86-shared.h
@@ +498,5 @@
>  
> +    template <class T>
> +    void loadSIMD128(const T &src, SIMD128Register dest, MoveOp::Type type) {
> +        switch (type) {
> +          case MoveOp::FLOAT32X4:

Why do we have to carry around the MoveOp::Type, I thought the idea of the SIMD128Register was to avoid overloading the FloatRegister with different meaning.  Don't we want to do the same here?

Is there any other instruction where this kind of manipulation would be needed?

The MoveOp::Type is an additional abstraction on top of the MacroAssembler, so using this type in the macro assembler without using the MoveResolver is a sign of something which is leaking out-side its intended scope.

::: js/src/jit/x64/Architecture-x64.h
@@ +186,5 @@
> +    switch (kind1) {
> +      case GPR:
> +        return false;
> +      case FPU_64BITS:
> +        return kind2 == FPU_128BITS && code1 == code2;

do not use a switch case for that.

::: js/src/jit/x86/Architecture-x86.h
@@ +164,5 @@
> +    switch (kind1) {
> +      case GPR:
> +        return false;
> +      case FPU_64BITS:
> +        return kind2 == FPU_128BITS && code1 == code2;

same here.
Attachment #8392058 - Flags: review?(nicolas.b.pierron)

Comment 63

4 years ago
Created attachment 8392847 [details] [diff] [review]
0002-Allocate-SIMD128-Registers.patch
Attachment #8355445 - Attachment is obsolete: true
Attachment #8392847 - Flags: feedback?(nicolas.b.pierron)

Comment 64

4 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #62)
> Comment on attachment 8392058 [details] [diff] [review]
> 0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
> 
> Review of attachment 8392058 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Many things seems to be lacking compared to what is in the WIP of Marty (see
> Bug 957504).

I attached 0002-Allocate-SIMD128-Registers.patch to get your feedback on allocating SIMD128 registers.  We need an invariant when calling PushRegsInMask and PopRegsInMask with a RegisterSet, I am thinking we could assert that the FloatRegisters and SIMD128Registers does not intersect, this is true for liveRegs, for pushing and popping volatile/full registers, we could set the FloatRegisters empty and SIMD128 to volatile/full as SIMD128 registers contains FloatRegisters. We also need to think more on how to handle findBestBlockedRegister for ARM in the register allocator.

> 
> Do you have the full set of modification yet?  This patch is incomplete, and
> this does not reassure me compare to all problem that Marty had faced in his
> WIP.  Do you have usage examples in MacroAssembler / CodeGen functions?

The full set of modification is in the simd128 branch at https://github.com/nikomatsakis/gecko-dev/tree/simd128, in that branch, the register allocation is only done for x86 and x64. Here I am trying to solve the register overlapping problems (from ARM) in the platform-independent files. I will rebase the simd128 branch with the register allocation algorithm here to see if it works after getting your feedback.

> 
> ::: js/src/jit/shared/MacroAssembler-x86-shared.h
> @@ +498,5 @@
> >  
> > +    template <class T>
> > +    void loadSIMD128(const T &src, SIMD128Register dest, MoveOp::Type type) {
> > +        switch (type) {
> > +          case MoveOp::FLOAT32X4:
> 
> Why do we have to carry around the MoveOp::Type, I thought the idea of the
> SIMD128Register was to avoid overloading the FloatRegister with different
> meaning.  Don't we want to do the same here?

This is to address sunfish and jandem's comments on using movups for float32x4 and movdqu for int32x4. As the SIMD128Register will be used for float32x4, int32x4 and float64x2, we need an extra flag to distinguish them when they are shuffled from stack slots to SIMD128 registers. Previously I used movups for both float32x4 and int32x4 and we did not need MoveOp::Type.

> 
> Is there any other instruction where this kind of manipulation would be
> needed?
> 
> The MoveOp::Type is an additional abstraction on top of the MacroAssembler,
> so using this type in the macro assembler without using the MoveResolver is
> a sign of something which is leaking out-side its intended scope.

The MoveResolver will emit the loadSIMD128 or storeSIMD128 macro instructions.

> 
> ::: js/src/jit/x64/Architecture-x64.h
> @@ +186,5 @@
> > +    switch (kind1) {
> > +      case GPR:
> > +        return false;
> > +      case FPU_64BITS:
> > +        return kind2 == FPU_128BITS && code1 == code2;
> 
> do not use a switch case for that.

Do you mean using if statements? We might also introduce FPU_32BITS in the future.

> 
> ::: js/src/jit/x86/Architecture-x86.h
> @@ +164,5 @@
> > +    switch (kind1) {
> > +      case GPR:
> > +        return false;
> > +      case FPU_64BITS:
> > +        return kind2 == FPU_128BITS && code1 == code2;
> 
> same here.
(In reply to Nicolas B. Pierron [:nbp] from comment #62)
> Comment on attachment 8392058 [details] [diff] [review]
> 0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
> 
> Review of attachment 8392058 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Many things seems to be lacking compared to what is in the WIP of Marty (see
> Bug 957504).
> 
> Do you have the full set of modification yet?  This patch is incomplete, and
> this does not reassure me compare to all problem that Marty had faced in his
> WIP.  Do you have usage examples in MacroAssembler / CodeGen functions?
> 
> ::: js/src/jit/shared/MacroAssembler-x86-shared.h
> @@ +498,5 @@
> >  
> > +    template <class T>
> > +    void loadSIMD128(const T &src, SIMD128Register dest, MoveOp::Type type) {
> > +        switch (type) {
> > +          case MoveOp::FLOAT32X4:
> 
> Why do we have to carry around the MoveOp::Type, I thought the idea of the
> SIMD128Register was to avoid overloading the FloatRegister with different
> meaning.  Don't we want to do the same here?
> 
> Is there any other instruction where this kind of manipulation would be
> needed?
> 
> The MoveOp::Type is an additional abstraction on top of the MacroAssembler,
> so using this type in the macro assembler without using the MoveResolver is
> a sign of something which is leaking out-side its intended scope.

My point is simple, I do not want these checks to happen at runtime when these can be decided at compile time.
Keywords: feature, leave-open
Whiteboard: [leave-open] → [js:p1]
Keywords: feature
Whiteboard: [js:p1] → [js:t]
> Also, on ARM, we might want to convert them to either a VFPRegister or a
> NeonRegister, with something such as the VFPRegister constructor.  Such as
> we can give them as argument of the MacroAssembler-arm functions and they
> would be converted to the right type for the Assembler-arm.  To avoid
> duplicating functions, such as ma_vsub and ma_vsub_f32, maybe we should
> write them as:
> 
>   void ma_vsub(F32Register src1, F32Register src2, F32Register dst)
>   void ma_vsub(F64Register src1, F64Register src2, F64Register dst)
> 
As we've dicussed on irc, I feel like it is not really sane to have two copies of every single function just to have two different types representing float32 and float64 registers.
>  or
> 
>   template <FRegister>
>   void ma_vsub(FRegister src1, FRegister src2, FRegister dst)
> 
>   template <> void MacroAssemblerARM::ma_vsub<F32Register>(F32Register src1,
> F32Register src2, F32Register dst);
>   template <> void MacroAssemblerARM::ma_vsub<F64Register>(F64Register src1,
> F64Register src2, F64Register dst);
> 
template are fine, but oh god that is ugly, also we end up with two copies of everything in memory, which feels kind of bad.

>  or
> 
>   // Assert that all 3 registers have the same type.
>   void ma_vsub(F32OrF64Register src1, F32OrF64Register src2,
> F32OrF64Register dst)
> 
> The reason why we want multiple Register kind is that we want multiple
> things:
> 
>  - CodeGen should be logical, an instruction which manipulate float32 (fp32)
> / double (fp64) / SIMD128 (fp128, fp32x4?) should be explicit about it and
> not have a generic ToFloatRegister().
I'm tempted to let the architecture and the ARM ARM guide the division of types here.  All VFP instructions can operate on either Float32
or Float64 data, and in fact, it is a single bit in every vfp instruction that differentiates the two cases.  Because of this, I'm pretty ok with letting there be a single ToFloatRegister that handles both Float32 and Double.  HOWEVER.  Neon is an entirely different beast, has a completely different instruction set, and gives a non-trivial penalty for switching to/from it.  This would indicate that NeonReg should be different fromVFPReg.  But, having separate F32Reg and F64Reg is nice,  I think a solution would be to make both F32Reg and F64Reg inherit from VFPReg, and only allow an up-cast when the VFPReg knows it is the correct type. That way, any instruction that *does* have some dependence on Float32 vs. Float64 can accept only the correct type, which would force everyone up the callstack to use the correct F32Reg, back up to using a ToFloat32Register() from the LAllocation.

> 
>  - The register allocator should allocate the right size/alignment for
> spilling these registers.
> 
>  - The MacroAssembler should have clear types to understand how registers
> are manipulated.
> 
>       AddScalar(Float32Register, Float32x4Register)
> 
>    sounds better than
> 
>       AddScalar(FloatRegister, FloatRegister)
> 
>  - The registers should be enumerable in a RegisterSet to be used by the
> register allocator and encoded in Safepoints.  One thing we can do is keep a
> generic FloatRegisterSet, from which we can take any kind of
> FloatingPointRegister.
> 
>  - We want to be able to quickly check that 2 FloatingPointRegister do not
> alias, as we want to assert for it in the LIR verification pass, as well as
> in the CodeGen (as sometimes we might want to have overlapping).
> 

> (In reply to haitao from comment #53)
> > (In reply to Nicolas B. Pierron [:nbp] from comment #50)
> > > Comment on attachment 8366593 [details] [diff] [review]
> > > 0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
> > > 
> > > Review of attachment 8366593 [details] [diff] [review]:
> > > -----------------------------------------------------------------
> > > 
> > > ::: js/src/jit/Registers.h
> > > @@ +72,1 @@
> > > >          return code_ == other.code_;
> > > 
> > > This is not true on ARM, as this does not account for the overlapping of s3
> > > and d1, nor d1 and q0.
> > >
> > I am introducing the overlapping relationship into AnyRegister and
> > RegisterSet. When SIMD128Register is introduced, we should not compare
> > FloatRegister with SIMD128Register directly, instead, we should return a
> > overlapping SIMD128Register from FloatRegister and then compare, and vice
> > versa. Basically we need a RegisterKind (GRP, FPU_64BITS and FPU_128BITS) in
> > the AnyRegister to replace bool isFloat_, and replace isFloat() with kind()
> > in all the places.
> 
> I agree we want to be able to check if AnyRegister alias another
> AnyRegister, but we also want to check among specialized registers without
> creating an AnyRegister.  We might want to assert in the CodeGen or in the
> MacroAssembler that we have different registers (inputs do not alias) or
> that we have identical registers (output alias the input), with potentially
> different kind of operands, such as the Intel instruction which duplicates a
> Double (Double register) to the upper part of an xmm register (Double x2
> register).
The architecture independent code has no right sticking its nose where it doesn't belong, e.g. the aliasing semantics of the floating point and vector registers.
Comment on attachment 8392058 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch

Review of attachment 8392058 [details] [diff] [review]:
-----------------------------------------------------------------

Rather than looking at how dart implements the arm encodings, you should probably grab a copy of the ARM ARM, which is freely available (you need to create an account on arm.com).  You'll also need to be careful, because *none* of the current vfp instructions can or should be used with the NEON registers.

::: js/src/jit/arm/Architecture-arm.cpp
@@ +297,5 @@
> +            return Code(i);
> +    }
> +
> +    return Invalid;
> +}

... What is using this code? it seems like a horrible idea.

::: js/src/jit/arm/Architecture-arm.h
@@ +45,5 @@
>  
> +enum RegisterKind {
> +    GPR,
> +    FPU_64BITS,
> +    FPU_128BITS

there should probably be a FPU_32BITS in there somewhere or other.

Comment 68

4 years ago
(In reply to Marty Rosenberg [:mjrosenb] from comment #67)
> Comment on attachment 8392058 [details] [diff] [review]
> 0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch
> 
> Review of attachment 8392058 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Rather than looking at how dart implements the arm encodings, you should
> probably grab a copy of the ARM ARM, which is freely available (you need to
> create an account on arm.com).  You'll also need to be careful, because
> *none* of the current vfp instructions can or should be used with the NEON
> registers.
> 
> ::: js/src/jit/arm/Architecture-arm.cpp
> @@ +297,5 @@
> > +            return Code(i);
> > +    }
> > +
> > +    return Invalid;
> > +}
> 
> ... What is using this code? it seems like a horrible idea.

I see there is such an interface for FloatRegisters, so I added it for SIMD in case it will be used.

> 
> ::: js/src/jit/arm/Architecture-arm.h
> @@ +45,5 @@
> >  
> > +enum RegisterKind {
> > +    GPR,
> > +    FPU_64BITS,
> > +    FPU_128BITS
> 
> there should probably be a FPU_32BITS in there somewhere or other.

Yes, we should have FPU_32BITS, 64BITS and 128BITS.
Comment on attachment 8392847 [details] [diff] [review]
0002-Allocate-SIMD128-Registers.patch

Review of attachment 8392847 [details] [diff] [review]:
-----------------------------------------------------------------

I think we should wait on the Float32 patches (Bug 991153) before going forward with this patch.
For SIMD on Asm.js, with should not need to add such amount of code, as mostly the reg-alloc part would matter, and we do not have to handle safe-points.

Feel free to ping bbouvier on IRC, on what you could do to help him moving forward.
Ask Marty and me for feedback when Bug 991153 lands.

Sorry for the delay.

::: js/src/jit/x64/Assembler-x64.h
@@ +274,5 @@
>          movsd(src, Address(StackPointer, 0));
>      }
> +    void push(const SIMD128Register &src) {
> +        subq(Imm32(sizeof(simd128_t)), StackPointer);
> +        movups(src, Address(StackPointer, 0));

movups or movdqu ?

This push is used by the spilling mechanism, and this is sad, but it seems that the final allocation type (not the move op) will need to be transferred to this point.

I do not think this matters here, as this function is unused, but PushRegsInMask will have the same issue.
Attachment #8392847 - Flags: feedback?(nicolas.b.pierron)
Depends on: 1074102
Comment on attachment 8392058 [details] [diff] [review]
0001-Introduce-loadSIMD128-and-storeSIMD128-macro-instruc.patch

(remove old review request)
Attachment #8392058 - Flags: review?(marty.rosenberg)
Marking as works-for-me, as this is a solved issue, and there is more than one bug to add supports of SIMD in our CodeGen.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → WORKSFORME
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.