Wasm baseline: Generate less branchy code for cmp64Set() on ARM, x86_32

Status: ASSIGNED
Priority: P5
Severity: normal
Reported: 2 years ago
Last modified: 7 months ago

People

(Reporter: lth, Assigned: dannas, Mentored)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Attachments

(3 attachments, 6 obsolete attachments)

(Reporter)

Description

2 years ago
The current code is pretty branchy, and it is not yet obvious that we'll reduce the incidence of cmp64Set by boolean optimization for control (bug 1286816) since it'll lead to fairly hairy code in the compiler on 32-bit systems.  So it would be good to generate better code for cmp64Set.

Probably this comes down to breaking x86 and ARM apart and having custom code for both.  ARM is more important; it should be easy to generate good code on ARM.
(Reporter)

Updated

2 years ago
See Also: → bug 1316848
(Assignee)

Comment 1

2 years ago
I'd like to take a stab at this one. Cmp64Set is defined here http://searchfox.org/mozilla-central/source/js/src/wasm/WasmBaselineCompile.cpp#2987

Looks like cmp64Set() is used for reducing repeated code when emitting compare operations. http://searchfox.org/mozilla-central/source/js/src/wasm/WasmBaselineCompile.cpp#6247

First step: Figure out how to do the equivalent of the x64 case for ARM.
(Assignee)

Comment 2

2 years ago
WIP. Untested. 

The ARM cmp instruction sets the APSR register. According to the docs it's equivalent to a subs, but without writing back the result. I wonder if there's any instruction that can be used for comparing 64-bit values on a 32-bit ARM platform. That would simplify things a bit. So far I haven't found any.
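For reference, the decomposition that a 64-bit compare needs on a 32-bit platform can be sketched in C++ (hypothetical helper names, not the MacroAssembler API; the baseline compiler emits machine code rather than calling functions like these):

```cpp
#include <cstdint>

// A 64-bit value split across two 32-bit registers, as on arm32.
struct Register64Parts { uint32_t low, high; };

// Equality needs both halves to match; no single arm32 instruction
// compares 64 bits at once.
bool cmp64Eq(Register64Parts lhs, Register64Parts rhs) {
    return lhs.high == rhs.high && lhs.low == rhs.low;
}

// Signed less-than: the high halves decide as signed values; on a
// tie the low halves must be compared as *unsigned* values.
bool cmp64LessThan(Register64Parts lhs, Register64Parts rhs) {
    if (lhs.high != rhs.high)
        return (int32_t)lhs.high < (int32_t)rhs.high;
    return lhs.low < rhs.low;  // unsigned compare of low halves
}
```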
Assignee: nobody → dannas
Status: NEW → ASSIGNED
(Assignee)

Comment 3

2 years ago
Attachment #8823828 - Attachment is obsolete: true
(Reporter)

Updated

2 years ago
Priority: P3 → P5
(Reporter)

Comment 4

a year ago
MIPS is cleaning up its cmp64Set code (bug 1437039), very similar to the ARM code here.  Once the MIPS patch has landed we should just take this opportunity to define cmp64Set cross-platform and add good platform implementations.  I expect the arm32 code will look similar to the patch already here (which looks very similar to the MIPS patch).  For arm64 and x64 it is just the same as cmpPtrSet, or whatever that's called.  So we're left with doing the best we can for x86.  For the baseline compiler we currently only need a register-register comparison; most of the hair is in a register-immediate comparison.
(Reporter)

Updated

a year ago
Mentor: hv1989 → lhansen
(Assignee)

Comment 5

a year ago
I've added the testcases from Bug 1437039 to my patch.

I've tried to "shoe-horn" in instructions with condition codes, but it fails for <= and >= comparisons.

I have one big case statement that tries to handle eight different comparison operators in one go. I might be better off splitting that into eight different cases instead.
Attachment #8826574 - Attachment is obsolete: true
(Assignee)

Comment 6

a year ago
This patch passes wasm/integer.js.

I'll do a try push tomorrow - I had been inactive too long so my mercurial access to the try server has been disabled. I've filed a request to restore it: https://bugzilla.mozilla.org/show_bug.cgi?id=1438667

I'll run the embenchen benchmark tomorrow as well. I'm curious about where the tipping point lies between trading control flow dependence (with branches) for data dependence (with conditional execution instructions) on ARM. Too bad I don't have an ARM board/computer that I can use for experimenting with performance counters.

Once this part is done, I'll continue with the x86 part.
Attachment #8951071 - Attachment is obsolete: true
(Assignee)

Comment 7

a year ago
I cross-compiled for arm32 with the command 

    CROSS_COMPILE=1 ../configure --enable-optimize \
        --disable-debug \
        --without-intl-api \
        --disable-profiling \
        --target=arm-linux-gnueabihf

I copied the js binary to a rpi3 board and ran the attached benchmark js-file. Timing variation was <1%. The speedup was 1.5x to 1.7x compared to m-c master.

:jolesen mentioned on IRC that the rPi3 has a Cortex-A53 core with no out-of-order execution, in contrast to the Cortex-A57 and Cortex-A72. I've learned recently that predicated instructions cause a data dependency that may be a disadvantage on OoO processors. I wonder where the trade-off lies between the cost of mispredicted branches and the cost of data dependencies for predicated instructions. Maybe this is not a valid concern here. As lth puts it: "anything you do should be better than the rat's nest of jumps that's currently emitted".

For my reference, two links about predicated instructions: http://yarchive.net/comp/linux/cmov.html and https://twitter.com/FioraAeterna/status/567724845866029058

I'll try running embenchen tomorrow.
(Assignee)

Comment 10

a year ago
Embenchen on rPi3 shows no changes in runtime. I haven't investigated to what extent the benchmarks use 64-bit compare operators.
(Assignee)

Comment 11

a year ago
See previously attached microbenchmark and embenchen results for perf numbers.

I'll do the x86 cmp64Set patch in a follow-up.
Attachment #8951435 - Attachment is obsolete: true
Attachment #8952021 - Flags: review?(lhansen)
(Reporter)

Comment 12

a year ago
Comment on attachment 8952021 [details] [diff] [review]
bug1316822_v4_add_cmp64Set_for_arm32.patch

Review of attachment 8952021 [details] [diff] [review]:
-----------------------------------------------------------------

Generally looks good (and thank you for resurrecting this work!) but let's do another iteration to see if you can use the shorter and simpler code sequences I suggest below.  The trick is to realize that CMP can be executed conditionally.

::: js/src/jit/MacroAssembler.h
@@ +1017,5 @@
>      inline void cmp32Set(Condition cond, T1 lhs, T2 rhs, Register dest)
>          DEFINED_ON(x86_shared, arm, arm64, mips32, mips64);
>  
>      template <typename T1, typename T2>
> +    inline void cmp64Set(Condition cond, T1 lhs, T2 rhs, Register dest)

Let's remove the template here and make both T1 and T2 Register64.  We don't really want templates in MacroAssembler.h (the existing templates notwithstanding) because they make it difficult to reason about which signatures must actually be implemented.  A case in point: your implementation for ARM only handles the Register64/Register64 case; it does not handle T1 or T2 being Immediate, Address, or BaseIndex, yet those cases are implied by having the template in the first place.

::: js/src/jit/arm/MacroAssembler-arm-inl.h
@@ +1093,5 @@
> +        ma_cmp(lhs.low, rhs.low);
> +        ma_mov(Imm32(1), dest);
> +        ma_mov(Imm32(0), dest, Assembler::InvertCondition(cond));
> +        ma_cmp(lhs.high, rhs.high);
> +        ma_mov(Imm32(0), dest, Assembler::InvertCondition(cond));

You don't need to use InvertCondition here because you know the condition is Equal.  Also, can't you do this in four instructions?

cmp     lhs.low, rhs.low
cmp,eq  lhs.high, rhs.high
mov,eq  dest, 1
mov,ne  dest, 0

Same for NotEqual.  And I think this could be used to simplify the other cases too, you should only have to use InvertCondition for the not-equal case, not UnsignedCondition or ConditionWithoutEqual.

With the code you suggest you also ought to assert that dest is different from either of the "high" registers, or you'd have ill-defined code.  With the shorter code I suggest, that constraint can be removed.
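The reason the four-instruction sequence works is that a predicated instruction whose condition is false is skipped and leaves the flags untouched, so Z can only survive both compares if both halves were equal. A C++ model of the suggested sequence (modeling just the Z flag; this is an illustration, not compiler code):

```cpp
#include <cstdint>

// Model: each line mirrors one instruction of the suggested sequence.
// A predicated instruction with a false condition is a no-op that
// leaves the flags unchanged.
uint32_t cmp64SetEqModel(uint32_t lhsLow, uint32_t lhsHigh,
                         uint32_t rhsLow, uint32_t rhsHigh) {
    bool z = (lhsLow == rhsLow);        // cmp     lhs.low, rhs.low
    if (z) z = (lhsHigh == rhsHigh);    // cmp,eq  lhs.high, rhs.high
    uint32_t dest = 0;                  // mov     dest, 0
    if (z) dest = 1;                    // mov,eq  dest, 1
    return dest;
}
```

Because dest is written only after both compares, the sequence also behaves correctly when dest aliases one of the high registers, which is why the register constraint mentioned above can be dropped.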
Attachment #8952021 - Flags: review?(lhansen)
(Reporter)

Comment 13

a year ago
I wrote:

> cmp     lhs.low, rhs.low
> cmp,eq  lhs.high, rhs.high
> mov,eq  dest, 1
> mov,ne  dest, 0

which is equivalent to 

> cmp     lhs.low, rhs.low
> cmp,eq  lhs.high, rhs.high
> mov,ne  dest, 0
> mov,eq  dest, 1

which is equivalent to

> cmp     lhs.low, rhs.low
> cmp,eq  lhs.high, rhs.high
> mov     dest, 0
> mov,eq  dest, 1

where the last two instructions are exactly those emitted by emitSet(), in case this simplifies anything (MacroAssembler-arm.h ca line 1210).
(Assignee)

Comment 14

a year ago
Wow. Your suggestion *really* simplified the patch. I didn't realize that cmps could be predicated as well. I feel silly for not coming up with that myself, but I'm glad you showed me. Predicated comparisons won't help on x86 though, since I only have setCC and cmov at my disposal. I'll try to come up with something cleaner than the previous patch though.

The UnsignedCondition() call in the previous patch was due to a misconception I had about the in-memory representation of integers in two's-complement form. Using signed comparison of lhs.low and rhs.low will preserve the order of lhs and rhs. When I realized that I could use the original condition for comparing the lower part as well, the case reduced to almost the same code as the eq/ne one.

I've replaced the template parameters of cmp64Set() with Register64 types.

I haven't done a try push for these changes but they pass the wasm part of the jit-test suite locally (are there any more tests that touch wasm than those?)

(bug1316822)dannas@dannas--dell-xp-15:arm-32_build_OPT.OBJ        
$ ../jit-test/jit_test.py dist/bin/js wasm
[694|  0|  0|  0] 100% ==============================================>|  33.4s
PASSED ALL      

Thank you for a swift reply.
Attachment #8952021 - Attachment is obsolete: true
Attachment #8952224 - Flags: review?(lhansen)
(Assignee)

Comment 15

a year ago
Doh. I missed that the eq/ne case can be merged with the other case. I'll fix that tonight.
(Reporter)

Comment 16

a year ago
Comment on attachment 8952224 [details] [diff] [review]
bug1316822_v5_add_cmp64Set_for_arm32.patch

Review of attachment 8952224 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm/MacroAssembler-arm-inl.h
@@ +1102,5 @@
> +      case Assembler::GreaterThanOrEqual:
> +      case Assembler::Above:
> +      case Assembler::AboveOrEqual:
> +        ma_cmp(lhs.high, rhs.high);
> +        ma_cmp(lhs.low, rhs.low, Assembler::Equal);

This is not quite right.  Assume we're operating on four-bit signed numbers in two-bit groups and we have a=5, b=7 and we're testing a < b.  That is, 01 01 `LessThan` 01 11, the answer should be true.  The upper bits are equal so we compare the lower bits.  These are interpreted as signed by LessThan, so we find that 01 is *not* less than 11 since the former is 1 and the latter is -1.  So the comparison ends up being false which is wrong.  This is why the x86 code is careful to convert the condition to Unsigned when comparing the lower bits by themselves.

It's possible that you want to break this switch into unsigned comparisons and signed comparisons since the unsigned comparisons will basically be equivalent to the equality tests and can maybe be merged with them, and that signed comparisons will be a little hairier and want to stand alone, but I don't know for sure.
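The counterexample above can be written out at full width in C++ (illustration only): with equal high halves, comparing the low halves as signed inverts the answer whenever exactly one of them has its top bit set.

```cpp
#include <cstdint>

// Signed 64-bit a < b, done in 32-bit halves: buggy version that
// compares the low halves as signed values.
bool lt64SignedLowCompare(int64_t a, int64_t b) {
    int32_t ah = (int32_t)(a >> 32), bh = (int32_t)(b >> 32);
    if (ah != bh) return ah < bh;
    // BUG: signed low compare (wraps on all mainstream compilers)
    return (int32_t)(uint32_t)a < (int32_t)(uint32_t)b;
}

// Correct version: on a high-half tie, compare the low halves unsigned.
bool lt64UnsignedLowCompare(int64_t a, int64_t b) {
    int32_t ah = (int32_t)(a >> 32), bh = (int32_t)(b >> 32);
    if (ah != bh) return ah < bh;
    return (uint32_t)a < (uint32_t)b;  // unsigned low compare
}
```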
Attachment #8952224 - Flags: review?(lhansen)
(Assignee)

Comment 17

a year ago
WIP. I'm now using cmp+sbcs for doing the signed comparisons. It looked cleaner to me than the three cmp instructions I had before (back in v3 or so). For sbcs, I needed to add simulator support.

Remains to be done
* Figure out why the borrowFrom() function appears to work when I test for carry == true. In my mind it should be the other way around. The sub instruction sets carry=false when a borrow is needed. So sbcs should set borrow to true when rhs > lhs-carry.
* Add more test cases to wasm/integer.js
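For what it's worth, ARM's carry flag after a subtraction is the inverse of borrow: C = 1 means no borrow occurred, which is the likely source of the borrowFrom() confusion. A C++ model of the cmp + sbcs idiom for signed 64-bit less-than (an illustration of the flag arithmetic, assuming the usual wrapping int64-to-int32 conversion; not the actual simulator code):

```cpp
#include <cstdint>

// Model "cmp lhs.low, rhs.low; sbcs tmp, lhs.high, rhs.high" and
// then test the LT condition (N != V).
bool cmp64SignedLtModel(int64_t lhs, int64_t rhs) {
    uint32_t ll = (uint32_t)lhs, rl = (uint32_t)rhs;
    int32_t lh = (int32_t)(lhs >> 32), rh = (int32_t)(rhs >> 32);

    // cmp: on ARM, carry after a subtraction is the inverse of borrow.
    bool carry = ll >= rl;  // C = 1 means no borrow was needed

    // sbcs: subtract with carry, i.e. lh - rh - (1 - C).
    int64_t wide = (int64_t)lh - rh - (carry ? 0 : 1);
    int32_t res = (int32_t)wide;  // wraps on mainstream compilers

    bool n = res < 0;              // N: sign bit of the 32-bit result
    bool v = wide != (int64_t)res; // V: signed overflow occurred
    return n != v;                 // LT condition
}
```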
Attachment #8952224 - Attachment is obsolete: true
(Reporter)

Updated

7 months ago
Component: JavaScript Engine: JIT → Javascript: Web Assembly