Closed Bug 1113338 Opened 10 years ago Closed 9 years ago

SIMD: implement partial loads / stores in Odin

Categories

(Core :: JavaScript Engine: JIT, enhancement)

Tracking

RESOLVED FIXED
mozilla38
Tracking Status
firefox38 --- fixed

People

(Reporter: sunfish, Assigned: bbouvier)

References

Details

Attachments

(6 files, 6 obsolete files)

17.38 KB, patch (luke: review+)
42.95 KB, patch (luke: review+)
43.21 KB, patch (luke: review+)
16.32 KB, patch (luke: review+)
116.40 KB, patch (luke: review+)
4.12 KB, patch
When compiling Bullet (https://github.com/kripken/emscripten/issues/2793), it turns out that we're getting code which does 3-element loads and stores. I've now implemented this in Emscripten, by having it emit loadXYZ and storeXYZ, so we'll want support for that in Odin as well.

Emscripten can now also emit loadXY, loadX, storeXY, and storeX, so we should add those too.
I'll take it, thanks for opening the bug.
Assignee: nobody → benj
Status: NEW → ASSIGNED
Summary: SIMD in Odin: loadXYZ, storeXYZ, etc. → SIMD: implement partial loads / stores in Odin
Attached patch Odin changes WIP (obsolete) — Splinter Review
(getting back to ye old bugzilla for r? and f?, waiting for mozreview to get better)

Ideally, we'd want partial loads / stores to fit into the existing MAsmJSHeapAccess hierarchy. This introduces an enum in MAsmJSHeapAccess which specifies the width of the load/store, which we can then reuse in code generation to switch on the width. This is simpler than creating a new MIR node. Luke, what do you think? (code generation is actually in another WIP patch that I'd like to enhance before submitting)
Attachment #8541246 - Flags: feedback?(luke)
Comment on attachment 8541246 [details] [diff] [review]
Odin changes WIP

Review of attachment 8541246 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/MIR.h
@@ +12185,5 @@
> +    enum Width {
> +        WidthFull = 0,
> +        Width32   = 1,
> +        Width64   = 2,
> +        Width96   = 3

Yes, I think it makes sense to add an extra parameter to the existing load/store MIR nodes.  Instead of giving each width a symbolic name, how about just an 'unsigned numElems' with the precondition numElems <= ScalarTypeToLength(viewType) (where we've generalized SimdTypeToLength to ScalarTypeToLength so that it returns 1 for non-SIMD types).

Also, do you have any patches pending to rename "viewType" to just "type"?  I vaguely recall that from a previous review.
Attachment #8541246 - Flags: feedback?(luke) → feedback+
FWIW, this will also be useful for Emscripten in implementing the <xmmintrin.h> intrinsics _mm_loadl_pi, _mm_load_ss, and _mm_storel_pi. We may also use it for _mm_loadh_pi and _mm_storeh_pi, since we don't currently have plans for loadZW or storeZW functions.
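For reference, here is a minimal standalone sketch (plain <xmmintrin.h> intrinsics, not code from this bug; the function names are just illustrative) of the 1- and 2-element accesses that these intrinsics and the loadX/loadXY/storeXY functions correspond to:

    #include <xmmintrin.h>

    // Sketch only: 1- and 2-element SIMD accesses expressed with standard
    // SSE intrinsics, mirroring what loadX / loadXY / storeXY provide.
    __m128 load_x(const float *p) {
        return _mm_load_ss(p);                                    // one float, upper lanes zeroed
    }
    __m128 load_xy(const float *p) {
        return _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)p);  // two floats in the low half
    }
    void store_xy(float *p, __m128 v) {
        _mm_storel_pi((__m64 *)p, v);                             // store the low two floats
    }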
SIMD tests for load/store are already ~250 lines, and they're about to get much longer with partial loads and stores.
Attachment #8541246 - Attachment is obsolete: true
Attachment #8554691 - Flags: review?(luke)
( https://treeherder.mozilla.org/#/jobs?repo=try&revision=f1f053155ef0 )

This copies the logic of OnDetachedHeap, except that we may need to align the stack before calling into C++ (an OOB load/store can happen at any $sp).

The onOutOfBoundsLabel wasn't put in the OutOfLine... struct, as that OOL struct is also used by MLoadTypedArrayElementStatic on x86, which doesn't have such a label right now.  This means some code is duplicated (around the cmp), but there's duplication there anyway.  Any better ideas would be appreciated :)
Attachment #8554696 - Flags: review?(luke)
As requested and discussed on IRC :)
Attachment #8554697 - Flags: review?(luke)
Preparation for handling partial loads/stores: specify how many elements we're loading at once.
Attachment #8554700 - Flags: review?(luke)
WIP patch. This handles loads on x64 only.
SIMD.loadX just uses movss;
SIMD.loadXY just uses movsd;

For SIMD.loadXYZ, the current strategy is the following (Dan, any thoughts?):
- load Z as a single element (movss)
- load XY together as a 64-bit pair (movsd)
- put ZW on top of XY (the W lane is 0, because movss(memory, register) zeroes the upper lanes) using movlhps
If loadXYZ is out of bounds, then either the Z load or the XY load will be an out-of-bounds access, and since we now throw on OOB, one of the two will throw.  Me like this.
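For what it's worth, here is a standalone sketch of that combine strategy using the equivalent SSE intrinsics rather than the actual Odin code generator (the function name is hypothetical; the instruction sequence is the one described above):

    #include <emmintrin.h>

    // loadXYZ lowering sketch: movss for Z, movsd for XY, then movlhps to
    // place [Z, 0] above [X, Y], yielding [X, Y, Z, 0].
    __m128 load_xyz(const float *p) {
        __m128 z  = _mm_load_ss(p + 2);                             // [Z, 0, 0, 0]
        __m128 xy = _mm_castpd_ps(_mm_load_sd((const double *)p));  // [X, Y, 0, 0]
        return _mm_movelh_ps(xy, z);                                // [X, Y, Z, 0]
    }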
Attachment #8554707 - Flags: feedback?(sunfish)
Attachment #8554707 - Flags: feedback?(luke)
Attachment #8554691 - Flags: review?(luke) → review+
Comment on attachment 8554696 [details] [diff] [review]
2-throw-on-oob.patch

Review of attachment 8554696 [details] [diff] [review]:
-----------------------------------------------------------------

Nice.

::: js/src/asmjs/AsmJSSignalHandlers.cpp
@@ +894,5 @@
>          return false;
>  
>      // We now know that this is an out-of-bounds access made by an asm.js
> +    // load/store that we should handle.
> +    // SIMD out-of-bounds loads and stores just need to throw.

Can you put a \n separating the two comments (here and in the other two)?

::: js/src/asmjs/AsmJSValidate.cpp
@@ +1810,5 @@
> +    void setHasSimdHeapAccesses() {
> +        module_->setUsesOutOfBoundsExit();
> +    }
> +    bool hasSimdHeapAccesses() const {
> +        return module_->hasUsesOutOfBoundsExit();

Since the stub is tiny, I wouldn't bother with the bool in AsmJSModule or these functions and just unconditionally generate it.

@@ +8836,5 @@
> +    MacroAssembler &masm = m.masm();
> +    masm.bind(&m.onOutOfBoundsLabel());
> +
> +    // sp can be anything at this point, so ensure it is aligned when calling
> +    // into C++.

Perhaps also add: we unconditionally jump to throw so don't worry about restoring sp.

@@ +8844,5 @@
> +    masm.assertStackAlignment(ABIStackAlignment);
> +    masm.call(AsmJSImmPtr(AsmJSImm_OnOutOfBounds));
> +    masm.jump(throwLabel);
> +
> +    return m.finishGeneratingInlineStub(&m.onOutOfBoundsLabel()) && !masm.oom();

Can you add a test case to the end of jit-tests/asm.js/testProfiling.js (in the same style as the // Detachment exit test above) for taking the on-out-of-bounds exit?

::: js/src/jit-test/tests/asm.js/testSIMD-load-store.js
@@ +21,5 @@
>      assertFunc(real.z, expected[2]);
>      assertFunc(real.w, expected[3]);
>  }
>  
> +function assertThrowsRangeError(func) {

Can you use assertThrowsInstanceOf(..., RangeError) instead?  (load(libdir + "asserts.js"))

::: js/src/jit/x64/CodeGenerator-x64.cpp
@@ +278,2 @@
>          CodeOffsetLabel cmp = masm.cmp32WithPatch(ToRegister(ptr), Imm32(0));
> +        if (Scalar::isSimdType(vt)) {

What about "if (mir->outOfBoundsLabel())"?  That way, if we, e.g., switch Atomics to throw, you won't need to update this code here.  Ditto for x86.

@@ +332,5 @@
>      if (mir->needsBoundsCheck()) {
>          CodeOffsetLabel cmp = masm.cmp32WithPatch(ToRegister(ptr), Imm32(0));
> +        if (Scalar::isSimdType(vt))
> +            // Out of bounds SIMD heap accesses throw a RangeError.
> +            masm.j(Assembler::AboveOrEqual, mir->outOfBoundsLabel());

I know it's a single statement, but it just looks a bit off having two lines w/o curlies.  Perhaps replace with "// Throws RangeError" that goes on the same line as the masm.j. Ditto for x86.
Attachment #8554696 - Flags: review?(luke) → review+
Comment on attachment 8554697 [details] [diff] [review]
3-rename-viewtype-to-accesstype.patch

Review of attachment 8554697 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8554697 - Flags: review?(luke) → review+
Comment on attachment 8554707 [details] [diff] [review]
5-implement-partial-load-stores.wip.patch

Review of attachment 8554707 [details] [diff] [review]:
-----------------------------------------------------------------

Cool. I like this approach too. Since the XY and Z parts each get their own AsmJSHeapAccess entry, it should be easy for me to make this play nice with the patches in bug 986981.

The other major way to do this is to just do a movups and then fix things up. This has the potential to be more efficient in some cases, but is probably not worth the extra complexity it'd require right now.
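For comparison, a sketch of that movups-then-fix-up alternative in plain SSE intrinsics (illustrative only; it is only valid when reading the extra fourth element is known to be in bounds, which is exactly the complication alluded to):

    #include <emmintrin.h>

    // "Load 128 bits then zero" sketch: one unaligned full load, then mask
    // out the W lane to get [X, Y, Z, 0].
    __m128 load_xyz_full(const float *p) {
        __m128 v    = _mm_loadu_ps(p);                                 // [X, Y, Z, W]
        __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));  // keep X, Y, Z
        return _mm_and_ps(v, mask);
    }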

::: js/src/jit/shared/BaseAssembler-x86-shared.h
@@ +2358,5 @@
> +    {
> +        spew("movq       %p, %s", addr, nameFPReg(dst));
> +        m_formatter.prefix(PRE_SSE_66);
> +        m_formatter.twoByteOp64(OP2_MOVQ_VdEd, addr, (RegisterID) dst);
> +    }

We'll eventually want to AVX-ize these, but movd/movq are Weird(TM), and also they don't care about 3-operand encoding, so it's fine to do them later.

::: js/src/jit/x64/CodeGenerator-x64.cpp
@@ +256,5 @@
>  
>  void
> +CodeGeneratorX64::emitSimdLoad(Scalar::Type type, MAsmJSLoadHeap *mir, const Operand &srcAddr,
> +                               const LDefinition *out, const LAllocation *ptr,
> +                               uint32_t maybeCmpOffset, bool *saveAccess)

Ok, semi-radical suggestion, feel free to decline if it's too crazy:

Change the mir parameter of emitSimdLoad to be numElems, since that's all we need it for, and move the appending of the AsmJSHeapAccess to the end of emitSimdLoad.  Then, for the 3-elem cases, just make recursive calls to emitSimdLoad to do the XY and Z loads, and do 'return' instead of 'break' so that they don't go on to create their own extra AsmJSHeapAccess.  This probably requires an offset argument too.  But in exchange, it obviates the saveAccess argument :-).  And it eliminates a little bit of code duplication.
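A toy model of that recursive shape (not SpiderMonkey code; the types and bookkeeping are made up purely to show the structure, where only the leaf calls record a heap access):

    #include <cstdio>
    #include <vector>

    struct HeapAccess { unsigned offset; unsigned numElems; };
    static std::vector<HeapAccess> accesses;

    void emitSimdLoad(unsigned numElems, unsigned offset) {
        if (numElems == 3) {
            emitSimdLoad(1, offset + 2 * sizeof(float));  // Z part
            emitSimdLoad(2, offset);                      // XY part
            return;  // 'return' instead of 'break': no extra access entry
        }
        // ... emit the actual movss/movsd here ...
        accesses.push_back({offset, numElems});
    }

    int main() {
        emitSimdLoad(3, 0);
        for (const HeapAccess &a : accesses)
            std::printf("access at offset %u, %u elems\n", a.offset, a.numElems);
    }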
Attachment #8554707 - Flags: feedback?(sunfish) → feedback+
Comment on attachment 8554700 [details] [diff] [review]
4-generalize-asmjs-load-store.patch

Review of attachment 8554700 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/asmjs/AsmJSValidate.cpp
@@ +2764,5 @@
>          curBlock_->setSlot(info().localSlot(local.slot), def);
>      }
>  
> +    MDefinition *loadHeap(Scalar::Type accessType, MDefinition *ptr, NeedsBoundsCheck chk,
> +                          unsigned numElems = 0)

Perhaps time to bust out a new (load|store)SimdHeap?

@@ +5629,5 @@
>              return f.fail(indexExpr, "constant index out of range");
>  
> +        static_assert(sizeof(int32_t) == sizeof(float), "all elements have the same size");
> +        size_t dataSize = sizeof(int32_t) * numElems;
> +        if (!f.m().tryRequireHeapLengthToBeAtLeast(indexLit + dataSize)) {

To be conservative, maybe we should keep the Simd128DataSize.  That way, if we chose to later, we could use the load-128-bits-then-zero scheme.

::: js/src/jit/MIR.h
@@ +12058,5 @@
>  {
>      Scalar::Type accessType_;
>      bool needsBoundsCheck_;
>      Label *outOfBoundsLabel_;
> +    unsigned numElems_;

How about numSimdElems (here and in param/memfun names)?
Attachment #8554700 - Flags: review?(luke) → review+
Comment on attachment 8554707 [details] [diff] [review]
5-implement-partial-load-stores.wip.patch

Review of attachment 8554707 [details] [diff] [review]:
-----------------------------------------------------------------

Looking good.

::: js/src/jit/x64/CodeGenerator-x64.cpp
@@ +256,5 @@
>  
>  void
> +CodeGeneratorX64::emitSimdLoad(Scalar::Type type, MAsmJSLoadHeap *mir, const Operand &srcAddr,
> +                               const LDefinition *out, const LAllocation *ptr,
> +                               uint32_t maybeCmpOffset, bool *saveAccess)

Ooh, like the sound of Dan's semi-radical suggestion.

@@ +334,5 @@
> +
> +            // Load Z
> +            before = masm.size();
> +            masm.vmovd(shiftedOffset, ScratchSimdReg);
> +            masm.append(AsmJSHeapAccess(before, masm.size(), Scalar::Int32x4, AnyRegister(ScratchSimdReg), maybeCmpOffset));

Since they now use the OnOutOfBounds handler, the SIMD AsmJSHeapAccesses no longer need the output register, so perhaps it's worth also having separate AsmJSHeapAccess ctor overloads; the Simd one would take numSimdElems and no out reg.
Attachment #8554707 - Flags: feedback?(luke) → feedback+
I've fixed all the remarks. The semi-radical change seems to work great, thanks Dan!

Just one question to Luke:

(In reply to Luke Wagner [:luke] from comment #10)
> @@ +8844,5 @@
> > +    masm.assertStackAlignment(ABIStackAlignment);
> > +    masm.call(AsmJSImmPtr(AsmJSImm_OnOutOfBounds));
> > +    masm.jump(throwLabel);
> > +
> > +    return m.finishGeneratingInlineStub(&m.onOutOfBoundsLabel()) && !masm.oom();
> 
> Can you add a test case to the end of jit-tests/asm.js/testProfiling.js (in
> the same style as the // Detachment exit test above) for taking the
> on-out-of-bounds exit?

Yes I can; however, the code path won't be taken, as this test is only carried out on ARM, and ARM doesn't have SIMD so far.  So I can't even verify what the expected output should be, but from reading the other tests, I think it should be:

    assertStackContainsSeq(stacks, ">,f,>,inline stub,f,>");

which corresponds to: ">" (entry), then "f,>" (we're in f()), then "inline stub,f,>" (we're in the OOB stub which is going to call into C++), then nothing, we're out of asm.js.  Is that correct? I think there should be one more "inline stub,f,>", for the throw stub, but as it is the same as the previous one, it won't be shown, as per the C++ impl. of SingleStepProfilingCallback.
Flags: needinfo?(luke)
Meh, the semi-radical change has only one drawback for loadXYZ: on x86 / x64-without-signals, the first AsmJSHeapAccess's bounds check patch (at static linking) will be overwritten by the second; besides, the second won't be correct (as it will check for 2 elements, while we should check for 3 elements).  The solution is to add a third AsmJSHeapAccess after the two recursive calls, which will be there only for patching the bounds check.  Then make the two emitSimdLoad recursive calls have no bounds checks, and add a way to say that an AsmJSHeapAccess doesn't need its offset to be patched (for x86).  But that's kinda gross...
Attached patch 5-wip-x64.patch (obsolete) — Splinter Review
Works on x64, with and without JS_NO_SIGNALS=1 (please contemplate the double negation).  The last step will be x86, and then I'll flag for review.  Feedback is not requested on this one, but would be appreciated.
Attachment #8554707 - Attachment is obsolete: true
Blocks: 986981
Sorry for the delay in responding.  It seems like, with throw-on-oob semantics, we'd only need a single branch when JS_NO_SIGNALS is set on x64: the initial bounds check in visitAsmJSLoadHeap should suffice as long as you decrement the heap length appropriately.  Incidentally, Dan has a nice strategy for doing this in his patch in bug 986981 (put the delta in the initial Imm32() written to the cmp so you can pull it back out when patching the heap length); perhaps you could extract that part and land it in this bug (since Dan said bug 986981 blocks on this).  For x86, you need an AsmJSHeapAccess for the second load (to patch base/limit), but at least that one could be NoLengthCheck.
Flags: needinfo?(luke)
Alternatively, we can land bug 986981 first, if that makes things easier. There's no reason either bug here fundamentally blocks the other, but whichever goes second will need to make some adjustments.
Attached patch 5-folded.patch (obsolete) — Splinter Review
Submitting the patch, first for self-review purposes, and also to keep a saved version around before a rebase that's expected to be quite big.
Attachment #8557044 - Attachment is obsolete: true
Attached patch 5-folded.patch (obsolete) — Splinter Review
Let's start some reviews (I do not expect this one to be r+ directly, but I need some advice here).

So I was hitting a silent compiler issue: the newly added AsmJSHeapAccess constructor (for SIMD loads/stores, which takes 4 arguments + 1 optional) was being selected in place of the existing one with 3 arguments + 1 optional (for non-SIMD stores). Solved by making the cmp parameter of the SIMD variant mandatory.

When we're doing loadXYZ (and storeXYZ as well), we may have up to 3 AsmJSHeapAccesses:
  - one for the load of size 1, as the offset needs to be patched (no cmpOffset)
  - one for the load of size 2, for the same reason (no cmpOffset)
  - up to one for the cmp offset patch (on x64-without-signals and x86).
While the cmpOffset argument could be given to one of the other two loads, I haven't done that, because the load of size 1 (resp. 2) would patch the cmpOffset to check that we can load at least 1 (resp. 2) elements, while we want to check that we can load 3 elements.

Three solutions I see:
- Directly inline the code of the load of size 2 (i.e. don't recursively call emitSimdLoad(2, ...)), so that we have control over the AsmJSHeapAccess we're creating.  This would reduce the usefulness of the recursive call.
- Add a parameter to emitSimdLoad, unsigned numElementsBoundsCheck (or something similar), that would be used in the AsmJSHeapAccess ctor.
- Just leave the code as is.  Sounds acceptable if load/storeXYZ aren't very frequent.
What do you think?
Attachment #8557859 - Attachment is obsolete: true
Attachment #8557917 - Flags: review?(sunfish)
Attachment #8557917 - Flags: review?(luke)
Nice description of the problem.  Ideally, I think, we'd have one common join path that always creates an AsmJSHeapAccess for patching offset/cmp, comparing against the total loaded size, and only in the special case of loadXYZ would we have a second load with a second AsmJSHeapAccess for patching the offset with no cmp.  The recursive idea was nice when it sounded like it could be a clean cut, but it now sounds like it actually makes things more complicated, so even if the code is longer, perhaps it is better to just solve it non-recursively.  If you see obvious code duplication, perhaps you can just hoist it out into little helper functions.  Sorry for the churn; I appreciate your attention to trying to keep things simple.
Comment on attachment 8557917 [details] [diff] [review]
5-folded.patch

Review of attachment 8557917 [details] [diff] [review]:
-----------------------------------------------------------------

I think loadXYZ/storeXYZ aren't so common as to make this a significant concern. I actually kind of like having three AsmJSHeapAccess entries, as each one has its unique purpose. However, I'm ok with Luke's opinion here too.
Attachment #8557917 - Flags: review?(sunfish) → feedback+
Attached patch 5-folded.patch (obsolete) — Splinter Review
Diff:
- make the emitSimd{Load,Store} functions not recursive, unless it really simplifies things (constant pointers on x86).
- simplify a few bounds checks (and comment in detail on why we can do it).
Attachment #8557917 - Attachment is obsolete: true
Attachment #8557917 - Flags: review?(luke)
Attachment #8558598 - Flags: review?(sunfish)
Attachment #8558598 - Flags: review?(luke)
Comment on attachment 8558598 [details] [diff] [review]
5-folded.patch

Review of attachment 8558598 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/BaseAssembler-x86-shared.h
@@ +2633,5 @@
> +    {
> +        twoByteOpSimd_disp32("vmovd", VEX_PD, OP2_MOVD_VdEd, offset, base, invalid_xmm, dst);
> +    }
> +
> +    void vmovd_mr(const void* address, XMMRegisterID dst)

Style nit: put the space before the * rather than after.

@@ +4489,5 @@
> +            m_buffer.putByteUnchecked(opcode);
> +            memoryModRM(offset, base, index, scale, reg);
> +        }
> +
> +        void twoByteOp64(TwoByteOpcodeID opcode, const void* address, int reg)

Style nit ditto.

::: js/src/jit/x64/CodeGenerator-x64.cpp
@@ +324,5 @@
> +        // Load XY
> +        before = after;
> +        loadSimd(type, 2, srcAddr, out);
> +        after = masm.size();
> +        masm.append(AsmJSHeapAccess(before, after, type, 3, maybeCmpOffset));

Using size 3 here allows this heap access' bounds check to cover the Z element as well? That's subtle enough for a comment :-).

Also, could you do the XY load before the Z load? That would look a little more intuitive, and it seems better to do the access with the bounds check before the other access, even if it doesn't actually matter.

@@ +469,5 @@
> +        // Store XY
> +        before = after;
> +        storeSimd(type, 2, in, dstAddr);
> +        after = masm.size();
> +        masm.append(AsmJSHeapAccess(before, after, type, 3, maybeCmpOffset));

As above, a comment confirming the purpose of the 3 here would be good. And putting the XY store before the Z store seems desirable, unless I'm missing a reason it can't work that way.

::: js/src/jit/x86/CodeGenerator-x86.cpp
@@ +374,5 @@
> +        switch (numElems) {
> +          // In memory-to-register mode, movd zeroes out the high lanes.
> +          case 1: masm.vmovdWithPatch(srcAddr, out); break;
> +          // See comment above, which also applies to movsd.
> +          // Cross-domain penalty here, as movq isn't encodable on x86.

memory-to-xmm movq is actually encodable on 32-bit x86. We can implement that in a subsequent patch, but please update the comment accordingly.

@@ +413,5 @@
> +            Value shiftedCst = Int32Value(ptr->toConstant()->toInt32() + 2 * sizeof(float));
> +            LAllocation shiftedPtr(&shiftedCst);
> +            emitSimdLoad(type, 1, &shiftedPtr, ScratchSimdReg);
> +            // Load XY
> +            emitSimdLoad(type, 2, ptr, out);

As above, can the XY go first?

@@ +442,5 @@
> +
> +        // For n in [-2 sizeof(float), 0[, ScratchReg has a different sign
> +        // than n, so we could load garbage values in the ZW lanes, but it
> +        // isn't observable, as loading XY will throw anyways.
> +        masm.add32(Imm32(2 * sizeof(float)), ptrReg);

Instead of an add instruction (and a sub below), can we just put this offset in the offset field of the Address below? That will also make it easier to do the XY first :-) (as above). Note that this is 32-bit x86, which doesn't have the same problem with folding constant offsets into addresses that x64 does.

@@ +636,5 @@
> +            Value shiftedCst = Int32Value(ptr->toConstant()->toInt32() + 2 * sizeof(float));
> +            LAllocation shiftedPtr(&shiftedCst);
> +            masm.vmovhlps(in, ScratchSimdReg, ScratchSimdReg);
> +            emitSimdStore(type, 1, ScratchSimdReg, &shiftedPtr);
> +            // Load XY

Copy+pasto: This is a Store

@@ +637,5 @@
> +            LAllocation shiftedPtr(&shiftedCst);
> +            masm.vmovhlps(in, ScratchSimdReg, ScratchSimdReg);
> +            emitSimdStore(type, 1, ScratchSimdReg, &shiftedPtr);
> +            // Load XY
> +            emitSimdStore(type, 2, in, ptr);

Is it possible to do the store of XY first, and the store to Z second? That's admittedly for cosmetic reasons, though if it isn't possible, a comment here explaining why it isn't would be good.

@@ +661,5 @@
> +    uint32_t before = masm.size();
> +    if (numElems == 3) {
> +        MOZ_ASSERT(type == Scalar::Int32x4 || type == Scalar::Float32x4);
> +
> +        masm.add32(Imm32(2 * sizeof(float)), ptrReg);

Can we move this offset into the Address (as above)?

@@ +669,5 @@
> +        // However, if needsBoundsCheck, the explicit bounds check will throw
> +        // before the first store; if !needsBoundsCheck, we proved that
> +        // (ptr + Simd128DataSize) was in bounds, so we're fine.
> +
> +        // Load Z (W is zeroed)

copy+pasto

@@ +677,5 @@
> +        storeSimd(type, 1, ScratchSimdReg, addr);
> +        uint32_t after = masm.size();
> +        masm.append(AsmJSHeapAccess(before, after, type, 1, AsmJSHeapAccess::NoLengthCheck));
> +
> +        // Load XY

another

@@ +682,5 @@
> +        masm.sub32(Imm32(2 * sizeof(float)), ptrReg);
> +        before = masm.size();
> +        storeSimd(type, 2, in, addr);
> +        after = masm.size();
> +        masm.append(AsmJSHeapAccess(before, after, type, 3, maybeCmpOffset));

As above, could the XY part go first, before the Z part?
Attachment #8558598 - Flags: review?(sunfish) → review+
Comment on attachment 8558598 [details] [diff] [review]
5-folded.patch

Review of attachment 8558598 [details] [diff] [review]:
-----------------------------------------------------------------

Great job, the new paths read nicely.

::: js/src/jit/shared/Assembler-shared.h
@@ +804,5 @@
> +    {
> +        MOZ_ASSERT(!Scalar::isSimdType(type));
> +    }
> +    // SIMD loads / stores
> +    AsmJSHeapAccess(uint32_t offset, uint32_t after, Scalar::Type type, unsigned numSimdElems, uint32_t cmp)

Could you instead avoid the overload ambiguity (allowing you to put back the default arg to cmp) by reordering numSimdElems and type?

::: js/src/jit/x64/CodeGenerator-x64.cpp
@@ +383,5 @@
> +      case Scalar::Float32x4:
> +      case Scalar::Int32x4:
> +                              return emitSimdLoad(vt, mir->numSimdElems(), srcAddr,
> +                                                  ToFloatRegister(out), ptr, maybeCmpOffset,
> +                                                  mir->outOfBoundsLabel());

It would be a little nicer to avoid having an early return for emitSimdLoad here, b/c it means we skip all the stuff below and have to think about whether any of it is needed for SIMD and duplicate it if so.  (I think not calling memoryBarrier is fine, but I think we want the verifyHeapAccessDisassembly call.)  One way to do this is to make x64 symmetric with x86 and split off to a separate path early.  There'd be a bit of code duplication, but not too much.  Plus, before long I want to remove the JS_NO_SIGNALS support, which would remove the needsBoundsCheck branch anyway.

::: js/src/jit/x86/CodeGenerator-x86.cpp
@@ +432,5 @@
> +    uint32_t maybeCmpOffset = AsmJSHeapAccess::NoLengthCheck;
> +    if (needsBoundsCheck) {
> +        CodeOffsetLabel cmp = masm.cmp32WithPatch(ptrReg, Imm32(0));
> +        masm.j(Assembler::AboveOrEqual, oobLabel); // Throws RangeError
> +        maybeCmpOffset = cmp.offset();

You could save a line with (here and in the store path):
  maybeCmpOffset = masm.cmp32WithPatch(ptrReg, Imm32(0)).offset();

@@ +442,5 @@
> +
> +        // For n in [-2 sizeof(float), 0[, ScratchReg has a different sign
> +        // than n, so we could load garbage values in the ZW lanes, but it
> +        // isn't observable, as loading XY will throw anyways.
> +        masm.add32(Imm32(2 * sizeof(float)), ptrReg);

Are you allowed to update ptrReg?  I thought that this would require a different LAlloc policy (which would in turn deoptimize regalloc b/c now the reg can be clobbered).

I'm probably missing something, but can you just put the 2*sizeof(float) offset into the Z Address?

Also, I don't understand the comment: you're only loading 1 elem that is known in-bounds (we already branched above), and I'm guessing it zeroes the high lanes, so I don't see where the garbage comes from.

@@ +535,5 @@
>        case Scalar::Float32:      masm.vmovssWithPatch(ToFloatRegister(value), dstAddr); break;
>        case Scalar::Float64:      masm.vmovsdWithPatch(ToFloatRegister(value), dstAddr); break;
> +      case Scalar::Float32x4:
> +      case Scalar::Int32x4:
> +                                 MOZ_CRASH("SIMD stores should be handled in emitSimdStore");

Can you put the MOZ_CRASH on the same line as the case as you did with CodeGeneratorX86::load?
Attached patch 5-folded.patchSplinter Review
Thanks for the reviews! I've fixed all the remarks, except two:
1) the disassembler doesn't directly work and would need to be extended.  As this bug has lasted quite a while, I'd like that to be done in a follow-up (same for movq on x86).
2) on x64, we can't store XY before storing Z, as the XY store could succeed while Z is OOB (for instance, if the array is Float32Array(16) and we're trying to store at 14: 14 and 15 (xy) are in bounds, z isn't), and in the signal handlers we wouldn't have any way to undo the XY store.
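For illustration, the store order described in point 2 looks like this with plain SSE intrinsics (a sketch, not the Odin code generator; the point being that Z, the highest-addressed element, is stored first, so a partially out-of-bounds storeXYZ faults before anything has been written):

    #include <emmintrin.h>

    // storeXYZ sketch: movhlps brings Z down to lane 0, store Z first with
    // movss, then store XY as a 64-bit movsd.
    void store_xyz(float *p, __m128 v) {
        __m128 z = _mm_movehl_ps(v, v);               // [Z, W, Z, W]
        _mm_store_ss(p + 2, z);                       // store Z
        _mm_store_sd((double *)p, _mm_castps_pd(v));  // store X and Y
    }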
Attachment #8558598 - Attachment is obsolete: true
Attachment #8558598 - Flags: review?(luke)
Attachment #8558721 - Flags: review?(luke)
Comment on attachment 8558721 [details] [diff] [review]
5-folded.patch

Review of attachment 8558721 [details] [diff] [review]:
-----------------------------------------------------------------

Great!

::: js/src/jit/x64/CodeGenerator-x64.cpp
@@ +330,5 @@
> +            masm.add32(ToRegister(ptr), ScratchReg);
> +            // For n in [-2 sizeof(float), 0[, ScratchReg has a different sign
> +            // than n, so we could load garbage values in the ZW lanes, but it
> +            // isn't observable, as loading XY will throw anyways.
> +            shiftedOffset = Operand(HeapReg, ScratchReg, TimesOne);

I think Operand provides you a static displacement which would avoid the mov/add32.  Also, I think you can drop the mysterious comment.  Ditto for storeSimd.

@@ +509,5 @@
> +        storeSimd(type, 2, in, dstAddr);
> +        after = masm.size();
> +        // We're noting a load of 3 elements, so that the bounds check checks
> +        // for 3 elements.
> +        masm.append(AsmJSHeapAccess(before, after, 3, type, maybeCmpOffset));

How about, for consistency with the other 3 cases, putting the bounds check on the load of Z instead?  I know it's just a delta, but it's a bit strange for the delta to "reach across" another store.

::: js/src/jit/x86/CodeGenerator-x86.cpp
@@ +454,5 @@
> +        Address shiftedAddr(ptrReg, 2 * sizeof(float));
> +        before = after;
> +        loadSimd(type, 1, shiftedAddr, ScratchSimdReg);
> +        after = masm.size();
> +        masm.append(AsmJSHeapAccess(before, after, 1, type, AsmJSHeapAccess::NoLengthCheck));

I think you can drop the explicit NoLengthCheck here, below and in x64 now.
Attachment #8558721 - Flags: review?(luke) → review+
Thanks for the reviews; I've addressed all your suggestions.

I'm sending this to try first, to make sure it behaves well.

The last try run showed an issue on win64, where the segfault handler checks that heapAccess->isLoad() == RecordInformation[0].  However, for SIMD load/stores, we don't know if the heap access is a load or a store, so I just moved this check below the Scalar::isSimdType check and body in HandleFault.  If that's not good, I can change it in another way.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e787c654b77e
(In reply to Benjamin Bouvier [:bbouvier] from comment #29)
> However, for SIMD
> load/stores, we don't know if the heap access is a load or a store, so I
> just moved this check below the Scalar::isSimdType check and body in
> HandleFault.  If that's not good, I can change it in another way.

Makes sense.  How about adding an assert in isLoad() that !isSimd?
Attached patch addendum.patchSplinter Review
FWIW, this is an addendum to the last patch: I've hit an assertion on win32 x86, where the stack alignment was interfering with the fake LAllocation created in emitSimdLoad/emitSimdStore in the ptr->constant() case of load/storeXYZ.