Closed Bug 1313336 Opened 3 years ago Closed 2 years ago

Implement stubbed out 64-bit and 32-bit operations on ARM64

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

ARM64
Unspecified
defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: lth, Assigned: lth)

References

Details

Attachments

(7 files, 9 obsolete files)

8.90 KB, patch
lth
: review+
Details | Diff | Splinter Review
2.43 KB, patch
lth
: review+
Details | Diff | Splinter Review
11.46 KB, patch
lth
: review+
Details | Diff | Splinter Review
2.87 KB, patch
lth
: review+
Details | Diff | Splinter Review
9.12 KB, patch
lth
: review+
Details | Diff | Splinter Review
13.00 KB, patch
lth
: review+
Details | Diff | Splinter Review
10.97 KB, patch
lth
: review+
Details | Diff | Splinter Review
See bug 1313043.  The second patch on that bug is a bunch of stubs for the ARM64 back-end where it did not have common MacroAssembler functionality.  Almost all of these are 64-bit functions, in addition there is popcnt32.
Probably dependent on Cretonne.
Priority: P3 → P5
Depends on: 1431402
I have implemented most of these (for wasm, anyway), patch coming once I'm passing all tests.
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Priority: P5 → P3
Hardware: Other → ARM64
Duplicate of this bug: 1216486
Implement stubbed-out masm integer instructions.
Implement stubbed-out masm floating-point instructions.
Implement stubbed-out masm truncate-float-to-int instructions and their ool emitters.
Implement stubbed-out masm register-to-register moves.
Implement wasmLoad and wasmStore masm instructions.
Implement stubbed-out masm atomic instructions.

We can probably optimize these a little more by using the new LDRA/STRA + one barrier, but I haven't investigated that yet, it's not urgent to do so.
Attached patch bug1313336-arm64-masm-misc.patch (obsolete) — Splinter Review
Implement miscellaneous stubbed-out masm instructions, notably a lot of patchable control flow.  Annotate remaining stubs that are not used by wasm or at all.
Comment on attachment 8949661 [details] [diff] [review]
bug1313336-arm64-masm-integer.patch

Review of attachment 8949661 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm64/MacroAssembler-arm64-inl.h
@@ +547,5 @@
> +        Udiv(scratch, ARMRegister(srcDest, 32), ARMRegister(rhs, 32));
> +    else
> +        Sdiv(scratch, ARMRegister(srcDest, 32), ARMRegister(rhs, 32));
> +    Mul(scratch, scratch, ARMRegister(rhs, 32));
> +    Sub(ARMRegister(srcDest, 32), ARMRegister(srcDest, 32), scratch);

It would be good to add a comment to this function noting that it doesn't have to deal with x % 0 and INT_MIN % -1

@@ +783,5 @@
>  {
> +    vixl::UseScratchRegisterScope temps(this);
> +    const ARMRegister scratch = temps.AcquireW();
> +    Mov(scratch, 32);
> +    Sub(scratch, scratch, ARMRegister(count, 32));

You can compute `-count` in a single instruction, and ror only looks at the low bits.

@@ +808,5 @@
> +    vixl::UseScratchRegisterScope temps(this);
> +    const ARMRegister scratch = temps.AcquireX();
> +    Mov(scratch, 64);
> +    Sub(scratch, scratch, ARMRegister(count, 64));
> +    Ror(ARMRegister(dest.reg, 64), ARMRegister(input.reg, 64), scratch);

Same: ror(-count)

@@ +889,5 @@
> +    Add(dest, dest, Operand(dest, vixl::LSR, 4));
> +    And(dest, dest, 0x0F0F0F0F);
> +    Add(dest, dest, Operand(dest, vixl::LSL, 8));
> +    Add(dest, dest, Operand(dest, vixl::LSL, 16));
> +    Lsr(dest, dest, 24);

Are you aware of the CNT instruction? It's a bytewise popcount across a SIMD register. You can follow with ADDV to implement a full popcount.

This requires a temporary SIMD register, of course.
Attachment #8949661 - Flags: review+
Comment on attachment 8949664 [details] [diff] [review]
bug1313336-arm64-masm-truncate.patch

Review of attachment 8949664 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm64/MacroAssembler-arm64.cpp
@@ +897,5 @@
> +    Fcvtzu(output, input);
> +    Cmp(output, 0);
> +    B(oolEntry, Assembler::Equal);
> +    Cmp(output, -1);
> +    B(oolEntry, Assembler::Equal);

You can use CMP+CCMP to chain the two compares and save a branch here.

@@ +923,5 @@
> +    B(oolEntry, Assembler::Equal);
> +    Cmp(output, INT32_MAX);
> +    B(oolEntry, Assembler::Equal);
> +    Cmp(output, INT32_MIN);
> +    B(oolEntry, Assembler::Equal);

Same: CMP+CCMP+CCMP.

Also, I think INT32_MAX and INT32_MIN both have to be materialized in a scratch register, so it might be better to check unsigned(x-INT32_MAX) < 2.

@@ +1067,5 @@
> +    bind(&notNaN);
> +
> +    Label isOverflow;
> +    if (isUnsigned) {
> +        loadConstantFloat32(float(double(UINT64_MAX) + 1), ScratchFloat32Reg);

This expression looks a little sketchy: double(UINT64_MAX) rounds to 0x1.0p64, and then adding 1 does nothing. So you end up with 0x1.0p64 as you wanted, but it's not really clear that was intended.

Are we allowed to use hex float literals yet?

Alternatively:

const float c = -float(INT64_MIN);
use +c, -c, 2*c for your constants. No rounding anywhere.

@@ +1102,5 @@
> +    } else {
> +        loadConstantDouble(double(INT64_MAX) + 1, ScratchDoubleReg);
> +        branchDouble(Assembler::DoubleGreaterThanOrEqual, input, ScratchDoubleReg, &isOverflow);
> +        // INT64_MIN is the closest value to the true limit value INT64_MIN-1 in
> +        // the double representation (53-bit mantissa).

This comment applies to the positive case too.
Attachment #8949664 - Flags: review?(jolesen)
Comment on attachment 8949666 [details] [diff] [review]
bug1313336-arm64-masm-wasm-load-store.patch

Review of attachment 8949666 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm64/MacroAssembler-arm64.cpp
@@ +248,5 @@
> +    if (sixtyfour == Register64::Invalid())
> +        return ARMRegister(any.gpr(), 32);
> +    else
> +        return ARMRegister(sixtyfour.reg, size);
> +}

I don't find this code entirely self-documenting ;-)

@@ +281,5 @@
> +      case Scalar::Int8:
> +      {
> +        ARMRegister dest(SelectGPReg(outany, out64));
> +        Ldrb(dest, srcAddr);
> +        Sxtb(dest, dest);

See LDRSB

@@ +306,5 @@
> +      {
> +        ARMRegister dest32(SelectGPReg(outany, out64, 32));
> +        Ldr(dest32, srcAddr);
> +        if (out64 != Register64::Invalid())
> +            Sxtw(ARMRegister(out64.reg, 64), dest32);

And there's a LDRSW for a sign-extending 32->64 load.

@@ +375,5 @@
> +      case Scalar::Uint32:
> +        Str(SelectGPReg(valany, val64), dstAddr);
> +        break;
> +      case Scalar::Int64:
> +        Str(SelectGPReg(valany, val64), dstAddr);

This is identical to the 32-bit code. Is that really right?
Attachment #8949666 - Flags: review?(jolesen)
Comment on attachment 8949667 [details] [diff] [review]
bug1313336-arm64-masm-atomic.patch

Review of attachment 8949667 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm64/MacroAssembler-arm64-inl.h
@@ +1874,5 @@
> +        Dmb(vixl::InnerShareable, vixl::BarrierWrites);
> +    else if (barrier == MembarLoadLoad)
> +        Dmb(vixl::InnerShareable, vixl::BarrierReads);
> +    else if (barrier)
> +        Dmb(vixl::InnerShareable, vixl::BarrierAll);

I'll have to trust you understand the details here ;-)

::: js/src/jit/arm64/MacroAssembler-arm64.cpp
@@ +1498,5 @@
> +    masm.Cmp(R(output, targetWidth), R(scratch, targetWidth));
> +    masm.B(&done, MacroAssembler::NotEqual);
> +    StoreExclusive(masm, type, scratch, newval, ptr);
> +    masm.Cmp(W(scratch), 1);
> +    masm.B(&again, MacroAssembler::Equal);

CBNZ should be ok here since STXR returns 0/1 only.
Attachment #8949667 - Flags: review+
(In reply to Jakob Stoklund Olesen [:jolesen] from comment #11)
> Comment on attachment 8949661 [details] [diff] [review]
> bug1313336-arm64-masm-integer.patch
> 
> Review of attachment 8949661 [details] [diff] [review]:
> -----------------------------------------------------------------
> @@ +783,5 @@
> >  {
> > +    vixl::UseScratchRegisterScope temps(this);
> > +    const ARMRegister scratch = temps.AcquireW();
> > +    Mov(scratch, 32);
> > +    Sub(scratch, scratch, ARMRegister(count, 32));
> 
> You can compute `-count` in a single instruction, and ror only looks at the
> low bits.

Oh, nice.

> @@ +889,5 @@
> > +    Add(dest, dest, Operand(dest, vixl::LSR, 4));
> > +    And(dest, dest, 0x0F0F0F0F);
> > +    Add(dest, dest, Operand(dest, vixl::LSL, 8));
> > +    Add(dest, dest, Operand(dest, vixl::LSL, 16));
> > +    Lsr(dest, dest, 24);
> 
> Are you aware of the CNT instruction? It's a bytewise popcount across a SIMD
> register. You can follow with ADDV to implement a full popcount.
> 
> This requires a temporary SIMD register, of course.

Yes, I am aware of CNT.  There seemed to be some dispute on the net (but when is there not?) about whether it was actually a win or not to use it so I decided to just keep it simple and port working code.  I'll note the issue on my worksheet and later I'll file a bug to investigate more fully.
(In reply to Jakob Stoklund Olesen [:jolesen] from comment #12)
> Comment on attachment 8949664 [details] [diff] [review]
> bug1313336-arm64-masm-truncate.patch
> 
> Review of attachment 8949664 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/jit/arm64/MacroAssembler-arm64.cpp
> @@ +897,5 @@
> > +    Fcvtzu(output, input);
> > +    Cmp(output, 0);
> > +    B(oolEntry, Assembler::Equal);
> > +    Cmp(output, -1);
> > +    B(oolEntry, Assembler::Equal);
> 
> You can use CMP+CCMP to chain the two compares and save a branch here.

Excellent.

> @@ +923,5 @@
> > +    B(oolEntry, Assembler::Equal);
> > +    Cmp(output, INT32_MAX);
> > +    B(oolEntry, Assembler::Equal);
> > +    Cmp(output, INT32_MIN);
> > +    B(oolEntry, Assembler::Equal);
> 
> Same: CMP+CCMP+CCMP.
> 
> Also, I think INT32_MAX and INT32_MIN both have to be materialized in a
> scratch register, so it might be better to check unsigned(x-INT32_MAX) < 2.

We can materialize INT32_MIN in one MOVN and INT32_MAX in two MOVZ for a total of eight instructions for the CMP+CCMP+CCMP sequence.  I think what you're suggesting is something like this:

    Fcvtzs(output, input);
    Sub(scratch, output, INT32_MAX);
    Cmp(scratch, 1);
    Ccmp(output, 0, vixl::ZFlag, Assembler::Above);
    B(oolEntry, Assembler::Equal);

which should be only seven.  (The scratch register does not impact regalloc so apart from microarchitectural effects it is free.)

> @@ +1067,5 @@
> > +    bind(&notNaN);
> > +
> > +    Label isOverflow;
> > +    if (isUnsigned) {
> > +        loadConstantFloat32(float(double(UINT64_MAX) + 1), ScratchFloat32Reg);
> 
> This expression looks a little sketchy: double(UINT64_MAX) rounds to
> 0x1.0p64, and then adding 1 does nothing. So you end up with 0x1.0p64 as you
> wanted, but it's not really clear that was intended.

Nor was it, I botched that.

> Are we allowed to use hex float literals yet?
> 
> Alternatively:
> 
> const float c = -float(INT64_MIN);
> use +c, -c, 2*c for your constants. No rounding anywhere.

Righto.

> 
> @@ +1102,5 @@
> > +    } else {
> > +        loadConstantDouble(double(INT64_MAX) + 1, ScratchDoubleReg);
> > +        branchDouble(Assembler::DoubleGreaterThanOrEqual, input, ScratchDoubleReg, &isOverflow);
> > +        // INT64_MIN is the closest value to the true limit value INT64_MIN-1 in
> > +        // the double representation (53-bit mantissa).
> 
> This comment applies to the positive case too.

Well, in the positive case we were supposed to have INT64_MAX+1 which /is/ exactly representable, but I'm going to go over this code again.
(In reply to Jakob Stoklund Olesen [:jolesen] from comment #13)
> Comment on attachment 8949666 [details] [diff] [review]
> bug1313336-arm64-masm-wasm-load-store.patch
> 
> Review of attachment 8949666 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/jit/arm64/MacroAssembler-arm64.cpp
> @@ +248,5 @@
> > +    if (sixtyfour == Register64::Invalid())
> > +        return ARMRegister(any.gpr(), 32);
> > +    else
> > +        return ARMRegister(sixtyfour.reg, size);
> > +}
> 
> I don't find this code entirely self-documenting ;-)

Partly because it predates the cleanup of AnyRegister that I made as a result of writing this code... I'll add some comments and an assertion that is now possible.

> @@ +281,5 @@
> > +      case Scalar::Int8:
> > +      {
> > +        ARMRegister dest(SelectGPReg(outany, out64));
> > +        Ldrb(dest, srcAddr);
> > +        Sxtb(dest, dest);
> 
> See LDRSB

Ah, bad oversight on my part to miss these.

> @@ +375,5 @@
> > +      case Scalar::Uint32:
> > +        Str(SelectGPReg(valany, val64), dstAddr);
> > +        break;
> > +      case Scalar::Int64:
> > +        Str(SelectGPReg(valany, val64), dstAddr);
> 
> This is identical to the 32-bit code. Is that really right?

They are not identical, because SelectGPReg() will return a 32-bit register in the first case and a 64-bit register in the second case, given how this function is used.
(In reply to Lars T Hansen [:lth] from comment #16)
> We can materialize INT32_MIN in one MOVN and INT32_MAX in two MOVZ for a
> total of eight instructions for the CMP+CCMP+CCMP sequence.  I think what
> you're suggesting is something like this:
> 
>     Fcvtzs(output, input);
>     Sub(scratch, output, INT32_MAX);
>     Cmp(scratch, 1);
>     Ccmp(output, 0, vixl::ZFlag, Assembler::Above);
>     B(oolEntry, Assembler::Equal);
> 
> which should be only seven.  (The scratch register does not impact regalloc
> so apart from microarchitectural effects it is free.)

That is what I had in mind. But then I looked at Sub() which calls MacroAssembler::AddSubMacro():

    UseScratchRegisterScope temps(this);
    Register temp = temps.AcquireSameSizeAs(rn);

    // VIXL can acquire temp registers. Assert that the caller is aware.
    VIXL_ASSERT(!temp.Is(rd) && !temp.Is(rn));

So it might be risky to allocate scratch registers at two different levels of the call graph? I think Sean spent some time cleaning this stuff up in the ARM32 backend.

I'm not sure how to best handle this. Maybe it's better to use your original sequence (plus chained compares) since it doesn't allocate scratch registers?

> > @@ +1102,5 @@
> > > +    } else {
> > > +        loadConstantDouble(double(INT64_MAX) + 1, ScratchDoubleReg);
> > > +        branchDouble(Assembler::DoubleGreaterThanOrEqual, input, ScratchDoubleReg, &isOverflow);
> > > +        // INT64_MIN is the closest value to the true limit value INT64_MIN-1 in
> > > +        // the double representation (53-bit mantissa).
> > 
> > This comment applies to the positive case too.
> 
> Well, in the positive case we were supposed to have INT64_MAX+1 which /is/
> exactly representable, but I'm going to go over this code again.

Oh, right, I missed that.
(In reply to Lars T Hansen [:lth] from comment #15)
> > Are you aware of the CNT instruction? It's a bytewise popcount across a SIMD
> > register. You can follow with ADDV to implement a full popcount.
> > 
> > This requires a temporary SIMD register, of course.
> 
> Yes, I am aware of CNT.  There seemed to be some dispute on the net (but
> when is there not?) about whether it was actually a win or not to use it so
> I decided to just keep it simple and port working code.  I'll note the issue
> on my worksheet and later I'll file a bug to investigate more fully.

Makes sense. There is typically a latency comparable to a load when copying between the register banks, and ARM's designs tend to have pretty long latencies on SIMD instructions too.

Cortex-A72 is 3 cycles for the CNT and 6 for the ADDV.8b, and the two cross-bank VMOVs are 5 cycles each. That adds up to 19 cycles of latency. I think it comes down to a throughput vs latency tradeoff which is probably too hard to reason about for the baseline compiler.
This is the truncation patch, updated as requested (comment 6, comment 12, comment 18).  I have decided to forego the clever subtraction optimization here, and just use chained compares.
Attachment #8949664 - Attachment is obsolete: true
Attachment #8949664 - Flags: review?(jolesen)
Attachment #8950332 - Flags: review?(jolesen)
This is the load/store patch updated, comment 8, comment 13, comment 17.  LMK if the register selection function is still inscrutable and I'll beef up the commentary further.
Attachment #8949666 - Attachment is obsolete: true
Attachment #8949666 - Flags: review?(jolesen)
Attachment #8950335 - Flags: review?(jolesen)
Attachment #8950332 - Flags: review?(jolesen) → review+
Comment on attachment 8950335 [details] [diff] [review]
bug1313336-arm64-masm-wasm-load-store-v2.patch

Review of attachment 8950335 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm64/MacroAssembler-arm64.cpp
@@ +242,5 @@
>      Brk((code++) & 0xffff);
>  }
>  
> +// Either `any` is valid or `sixtyfour` is valid.  Return a 32-bit ARMRegister
> +// in the first case and an ARMRegister of the desired size in the latter case.

That works. Thanks!
Attachment #8950335 - Flags: review?(jolesen) → review+
Blocks: 1430743
Rebased, etc, ready to land.  Carrying r+.
Attachment #8949661 - Attachment is obsolete: true
Attachment #8954014 - Flags: review+
Rebased, etc, ready to land.  Carrying r+.
Attachment #8949663 - Attachment is obsolete: true
Attachment #8954016 - Flags: review+
Rebased, etc, ready to land.  Carrying r+.
Attachment #8950332 - Attachment is obsolete: true
Attachment #8954017 - Flags: review+
Rebased, etc, ready to land.  Carrying r+.
Attachment #8949665 - Attachment is obsolete: true
Attachment #8954018 - Flags: review+
Rebased, etc, ready to land.  Carrying r+.
Attachment #8950335 - Attachment is obsolete: true
Attachment #8954020 - Flags: review+
Rebased, etc, ready to land.  Carrying r+.
Attachment #8949667 - Attachment is obsolete: true
Attachment #8954021 - Flags: review+
Rebased etc, ready to land.  Carrying r+.
Attachment #8949668 - Attachment is obsolete: true
Attachment #8954022 - Flags: review+
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a37627e154f3
ARM64 integer masm instructions. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/48452a36fc14
ARM64 floating point masm instructions. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/4ba21b2798ca
ARM64 truncate-floating-point-to-int masm instructions. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/9e7eddae76df
ARM64 register-register move masm instructions. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/065a605a7909
ARM64 wasmLoad and wasmStore masm instructions. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/5b036f401b18
ARM64 atomic masm operations. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/885e8ce07968
ARM64 miscellaneous masm instructions. r=jolesen
You need to log in before you can comment on or make changes to this bug.