Wasm baseline: Optimize boolean evaluation for control

RESOLVED FIXED in Firefox 53

Status

Type: defect
Priority: P1
Severity: normal
Resolution: RESOLVED FIXED
Reported: 3 years ago
Last modified: 2 years ago

People

(Reporter: lth, Assigned: lth)

Tracking

(Blocks 1 bug)

Version: unspecified
Target Milestone: mozilla53
Firefox Tracking Flags: firefox53 fixed


Attachments

(6 attachments, 9 obsolete attachments)

- 4.42 KB, patch; h4writer: review+
- 14.72 KB, patch; h4writer: feedback+
- 5.73 KB, patch; h4writer: review+
- 2.44 KB, patch; h4writer: review+
- 16.61 KB, patch; h4writer: review+
- 24.82 KB, patch; h4writer: review+
(Assignee)

Description

3 years ago
Two entangled problems:

(1) Currently the baseline compiler always evaluates the boolean expression into a register, then tests that register.  This is silly (except on MIPS, maybe).  It does not actually take a lot to cover most interesting cases of boolean evaluation for control.

(2) Currently the baseline compiler will usually evaluate a conditional branch to a label as a conditional branch across code that pops the execution stack + jumps.  This is silly if there is no cleanup code.
(Assignee)

Comment 1

3 years ago
Refactor and clean up comparison handling to make later changes simpler.
(Assignee)

Comment 2

3 years ago
Posted patch bug1286816-token-queue.patch (obsolete) — Splinter Review
Add a simple token queue so that we can do one-token lookahead.  This should be rolled into the WasmBinaryIterator, where it would be more efficient.
(Assignee)

Comment 3

3 years ago
WIP boolean evaluation for control: if, at a compare node, we see a conditional branch as the next token then defer the evaluation so that when the branch is handled it can absorb the compare too.
Priority: -- → P2
(Assignee)

Updated

3 years ago
Priority: P2 → P3
(Assignee)

Updated

3 years ago
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
(Assignee)

Updated

3 years ago
Depends on: 1316614
(Assignee)

Comment 4

3 years ago
Comment on attachment 8770929 [details] [diff] [review]
bug1286816-token-queue.patch

Moved to bug 1316614.
Attachment #8770929 - Attachment is obsolete: true
(Assignee)

Comment 5

3 years ago
Attachment #8770927 - Attachment is obsolete: true
(Assignee)

Comment 6

3 years ago
Attachment #8770930 - Attachment is obsolete: true
(Assignee)

Updated

3 years ago
Blocks: wasm-baseline-perf
No longer blocks: wasm
(Assignee)

Updated

3 years ago
See Also: → 1316804
(Assignee)

Updated

3 years ago
Priority: P3 → P1
(Assignee)

Comment 7

2 years ago
Clean up how we emit code for comparisons by making the code generation table-driven instead of switch-driven.  These tables will be used heavily in the patch that implements the optimization (the next patch on this bug).
Attachment #8809446 - Attachment is obsolete: true
Attachment #8812124 - Flags: review?(hv1989)
(Assignee)

Comment 8

2 years ago
Implement optimizations of boolean expressions used for control.  This is good for performance and good for code density.

The idea here is that when we see certain boolean operators (currently comparisons and Eqz) we will look ahead to the next opcode, and if that opcode is a conditional branch we know how to handle (BrIf, If, and Select) then we will stash away the information about the boolean operator and not generate code yet.

The code generators for conditional branches then use generalized machinery that makes use of the stashed information to avoid generating a boolean result value, instead folding the boolean operator directly into the branch.

On AngryBots, 84% of boolean operators (measuring statically) are optimized by this optimization.

Code generation is then divided into two steps: a setup step (corresponding to popping operands) and a branch step (corresponding to testing and branching), because both If and Select need to perform additional work between the two steps.  Information that pertains to both steps is stored in a BranchState structure.  You will see that the logic in BrIf, If, and Select remains unchanged as a result of this division of labor.

Additionally, this patch has two tweaks:

- in int32 comparisons where the lhs is a constant, we do not move that
  constant to a register but use it directly

- if a BrIf has no stack to pop the branch is direct to the target rather
  than an inverted branch around code that pops the stack and branches to
  the target (Select and If did not have this issue)
Attachment #8809447 - Attachment is obsolete: true
Attachment #8812125 - Flags: review?(hv1989)
(Assignee)

Comment 9

2 years ago
Test cases that test that things continue to work in situations where the optimization is supposed to kick in.  (See the next patch for some ideas about how to test that the optimization does kick in.)
Attachment #8812128 - Flags: review?(hv1989)
(Assignee)

Comment 10

2 years ago
This is a framework - not ready for landing - that is used to measure whether optimizations do kick in as they are supposed to, and which can be used to profile the efficacy of the optimization.

(It's not ready for landing primarily because test cases start failing if the baseline compiler is not enabled, so some kind of guarding is needed for that.)

We can sort-of safeguard that optimizations work with benchmarks, but it seemed like here we have an opportunity to check it directly, so I explored it a bit.  For other optimizations in the baseline compiler we may be able to do similar things, but really not all that often.

What I'd like to know is whether you think something like this is useful, or if it's a distraction.  And any other thoughts you may have.
Attachment #8812129 - Flags: feedback?(hv1989)
(Assignee)

Comment 12

2 years ago
Needs an Arm64 stub for the MacroAssembler interface branch64(Cond,Register64,Register64,Label*).
Comment on attachment 8812124 [details] [diff] [review]
bug1286816-refactor-comparisons-v3.patch

Review of attachment 8812124 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not totally convinced a table-driven approach is better?

Adding a function that can translate the CmpOp to Assembler::Condition seems better. (E.g. something like JSOpToCondition).
That way the compiler will warn us if we add new cases that aren't considered in those switches, instead of relying on a comment in the code, and we don't need NumCmpIntBinop etc.
Also, we have InvertCondition we can use for the "notTaken" case. That way we don't have to keep "notTaken" in the table.

::: js/src/wasm/WasmBaselineCompile.cpp
@@ +503,5 @@
> +        CMP_GE,
> +        CMP_ULT,                // I32, I64
> +        CMP_ULE,
> +        CMP_UGT,
> +        CMP_UGE

Might make sense to add the comments inline in the code, instead of after. Since at first glance I thought CMP_EQ and CMP_ULT were special.

        // Binary
        // I32, I64, F32, F64
        CMP_EQ,
        CMP_NE,
        CMP_LT,
        CMP_LE,
        CMP_GT,
        CMP_GE,
        // I32, I64
        CMP_ULT,
        CMP_ULE,
        CMP_UGT,
        CMP_UGE

@@ +3785,5 @@
> +    { Assembler::Below,              Assembler::AboveOrEqual },       // ULT
> +    { Assembler::BelowOrEqual,       Assembler::Above },              // ULE
> +    { Assembler::Above,              Assembler::BelowOrEqual },       // UGT
> +    { Assembler::AboveOrEqual,       Assembler::Below }               // UGE
> +};

Assembler::InvertCondition exists for the second row ;). Please use that.

@@ +3795,5 @@
> +    { Assembler::DoubleNotEqualOrUnordered, Assembler::DoubleEqual },
> +    { Assembler::DoubleLessThan,            Assembler::DoubleGreaterThanOrEqualOrUnordered },
> +    { Assembler::DoubleLessThanOrEqual,     Assembler::DoubleGreaterThanOrUnordered },
> +    { Assembler::DoubleGreaterThan,         Assembler::DoubleLessThanOrEqualOrUnordered },
> +    { Assembler::DoubleGreaterThanOrEqual,  Assembler::DoubleLessThanOrUnordered }

Don't think it would be bad to add InvertCondition for Assembler::DoubleCondition.
Attachment #8812124 - Flags: review?(hv1989)
Comment on attachment 8812128 [details] [diff] [review]
bug1286816-test-cases.patch

Review of attachment 8812128 [details] [diff] [review]:
-----------------------------------------------------------------

tests++
Attachment #8812128 - Flags: review?(hv1989) → review+
Attachment #8812129 - Flags: feedback?(hv1989) → feedback+
Comment on attachment 8812125 [details] [diff] [review]
bug1286816-boolean-evaluation-v3.patch

Review of attachment 8812125 [details] [diff] [review]:
-----------------------------------------------------------------

Clearing the review flag for now while waiting for the updated patch. The design already looks nice.
Attachment #8812125 - Flags: review?(hv1989)
(Assignee)

Comment 16

2 years ago
Add Assembler::InvertCondition(DoubleCondition) for x86, x64, and ARM - MIPS already has this.  (I forgot about None but will check that.)
Attachment #8814000 - Flags: review?(hv1989)
(Assignee)

Comment 17

2 years ago
Add a branch64(reg64,reg64) stub for ARM64.
Attachment #8814001 - Flags: review?(hv1989)
(Assignee)

Comment 18

2 years ago
Refactor comparisons, using standard Condition codes rather than the CMP_ values.
Attachment #8812124 - Attachment is obsolete: true
Attachment #8814003 - Flags: review?(hv1989)
Comment on attachment 8814000 [details] [diff] [review]
bug1286816-invertcondition-double.patch

Review of attachment 8814000 [details] [diff] [review]:
-----------------------------------------------------------------

Looks nice! Thanks!
Attachment #8814000 - Flags: review?(hv1989) → review+
(Assignee)

Comment 20

2 years ago
Implement boolean optimization for control using InvertCondition rather than the old table.  (The consequence of having type-specific condition codes shows up in that there are type-specific "latent" variables for storing the condition codes and an extra template to handle the type specificity.  Otherwise the changes are superficial at this level.)
Attachment #8812125 - Attachment is obsolete: true
Attachment #8814005 - Flags: review?(hv1989)
Comment on attachment 8814001 [details] [diff] [review]
bug1286816-arm64-branch64.patch

Review of attachment 8814001 [details] [diff] [review]:
-----------------------------------------------------------------

Nice :D
Attachment #8814001 - Flags: review?(hv1989) → review+
Comment on attachment 8814003 [details] [diff] [review]
bug1286816-refactor-comparisons-v4.patch

Review of attachment 8814003 [details] [diff] [review]:
-----------------------------------------------------------------

+1
Attachment #8814003 - Flags: review?(hv1989) → review+
(Assignee)

Comment 23

2 years ago
Re the optimization instrumentation, that needs more work before it can land (it needs to only run when the code is baseline-compiled) and I may spin that off as a new bug, where I'll also try to capture the IRC discussion.
Comment on attachment 8814005 [details] [diff] [review]
bug1286816-boolean-evaluation-v4.patch

Review of attachment 8814005 [details] [diff] [review]:
-----------------------------------------------------------------

I mostly went through the logical part. Some feedback.
I'll need another round, but with these comments fixed I expect only nits after that.

::: js/src/wasm/WasmBaselineCompile.cpp
@@ +3752,5 @@
> +    void jumpConditionalWithJoinReg(BranchState* b, Cond cond, Lhs lhs, Rhs rhs)
> +    {
> +        AnyReg r = popJoinRegUnlessVoid(b->resultType);
> +
> +        if (b->framePushed != BranchState::NoPop && willPopStackBeforeBranch(b->framePushed)) {

Shouldn't this be || instead of &&?

@@ +5006,5 @@
> +BaseCompiler::emitBranchSetup(BranchState* b)
> +{
> +    switch (latentOp_) {
> +      case LatentOp::None: {
> +        b->i32.lhs = popI32NotJoinReg(b->resultType);

For every LatentOp can we set the corresponding "Imm" rhs already?
That way "emitBranchPerform" would be universal for every latentOp_ ...

I.e. set b->i32.rhsImm = true and b->i32.imm = 0 here.

@@ +5046,5 @@
> +      }
> +      case LatentOp::Eqz: {
> +        switch (latentType_) {
> +          case ValType::I32: {
> +            b->i32.lhs = popI32NotJoinReg(b->resultType);

I.e. set b->i32.rhsImm = true and b->i32.imm = 0 here.

@@ +5050,5 @@
> +            b->i32.lhs = popI32NotJoinReg(b->resultType);
> +            break;
> +          }
> +          case ValType::I64: {
> +            b->i64.lhs = popI64NotJoinReg(b->resultType);

I.e. set b->i64.rhsImm = true and b->i64.imm = 0 here.

@@ +5272,5 @@
>      if (!pushControl(&endLabel, &elseLabel))
>          return false;
>  
> +    if (!deadCode_)
> +        emitBranchPerform(&b);

There is a bug here:
emitBranchPerform resets "latentOp_", which is not reset when "deadCode_" is true. As a result, a later if/select/... could assume there is a latent op. Can you add a testcase for this?
Attachment #8814005 - Flags: review?(hv1989)
(In reply to Lars T Hansen [:lth] from comment #23)
> Re the optimization instrumentation, that needs more work before it can land
> (it needs to only run when the code is baseline-compiled) and I may spin
> that off as a new bug, where I'll also try to capture the IRC discussion.

Makes sense
(Assignee)

Comment 26

2 years ago
(In reply to Hannes Verschore [:h4writer] from comment #24)
> Comment on attachment 8814005 [details] [diff] [review]
> bug1286816-boolean-evaluation-v4.patch
> 
> ::: js/src/wasm/WasmBaselineCompile.cpp
> @@ +3752,5 @@
> > +    void jumpConditionalWithJoinReg(BranchState* b, Cond cond, Lhs lhs, Rhs rhs)
> > +    {
> > +        AnyReg r = popJoinRegUnlessVoid(b->resultType);
> > +
> > +        if (b->framePushed != BranchState::NoPop && willPopStackBeforeBranch(b->framePushed)) {
> 
> Shouldn't this be || instead of &&?

No, && is right: if framePushed == NoPop then it doesn't matter what the other conjunct is; framePushed has to be different from NoPop AND there has to be something to pop for the first arm to be used, which performs the popping.

> @@ +5006,5 @@
> > +BaseCompiler::emitBranchSetup(BranchState* b)
> > +{
> > +    switch (latentOp_) {
> > +      case LatentOp::None: {
> > +        b->i32.lhs = popI32NotJoinReg(b->resultType);
> 
> For every LatentOp can we set the corresponding "Imm" rhs already?
> That way "emitBranchPerform" would be universal for every latentOp_ ...
> 
> I.e. set b->i32.rhsImm = true and b->i32.imm = 0 here.

Fixed.  Nice cleanup.

> @@ +5272,5 @@
> >      if (!pushControl(&endLabel, &elseLabel))
> >          return false;
> >  
> > +    if (!deadCode_)
> > +        emitBranchPerform(&b);
> 
> There is a bug here:
> emitBranchPerform resets the "latentOp_", which is not reset when
> "deadCode_" is true. As result the if/select/... could assume there is. Can
> you add a testcase for this?

Nice find.  I have cleaned this up by (a) introducing and documenting a method resetLatentOp that must be called if emitBranchSetup / emitBranchPerform are not called because deadCode_ is true, and (b) adding assertions in sniffLatentWhatever that check that the latentOp_ field is None.  Part (b) is not as good as a test case; I'm still working on that, but it will catch some future mistakes.
Attachment #8814005 - Attachment is obsolete: true
Attachment #8814406 - Flags: review?(hv1989)
(Assignee)

Comment 27

2 years ago
FWIW there's a bug here somewhere: the emscripten box2d benchmark no longer terminates with this optimization (x64 release build, Mac).
(Assignee)

Comment 28

2 years ago
(In reply to Lars T Hansen [:lth] from comment #27)
> FWIW there's a bug here somewhere, the emscripten box2d benchmark no longer
> terminates with this optimization (x64 release build, Mac).

Easy fix in emitBranchSetup: it must set the type when synthesizing an operation in the "None" case.
(Assignee)

Comment 29

2 years ago
Comment on attachment 8814406 [details] [diff] [review]
bug1286816-boolean-evaluation-v5.patch

New patch forthcoming.
Attachment #8814406 - Flags: review?(hv1989)
(Assignee)

Comment 30

2 years ago
Ran the embenchen benchmarks (in asm_vs_wasm), results are generally very good, with minor improvements across the suite.  The times are measured at the shell (/usr/bin/time), which will tend to give an edge to the baseline compiler because the runtime ratio is diluted by startup overhead and faster compilation time:

Without the optimization

                Baldr  Rabaldr  Ratio

box2d           1.06    1.78    .595
bullet          1.32    2.19    .602
conditionals    0.32    1.01    .316
copy            0.62    3.05    .203
corrections     1.00    1.57    .636
fannkuch        2.26    5.06    .446
fasta           0.78    2.17    .359
ifs             0.23    0.38    .605
linpack         .8222   1.0493  .783
lua_binarytrees 4.03    6.35    .634
lua_scimark     4.145   7.352   .563
memops          0.92    3.50    .262
primes          1.17    1.44    .812
skinning        0.84    2.21    .380
zlib            1.20    2.08    .576

With the optimization

                Baldr  Rabaldr  Ratio

box2d           1.09    1.72    .633 +
bullet          1.27    2.15    .590 -
conditionals    0.35    1.06    .330 +
copy            0.62    2.96    .209 +
corrections     1.02    1.51    .675 +
fannkuch        2.22    4.63    .479 +
fasta           0.78    2.14    .364 +
ifs             0.23    0.34    .676 +
linpack         .8219   1.0293  .798 +
lua_binarytrees 4.01    6.06    .661 +
lua_scimark     4.110   7.331   .560 -
memops          0.92    3.51    .262
primes          1.16    1.18    .983 +
skinning        0.84    2.20    .381 +
zlib            1.20    2.00    .600 +

For bullet and scimark: though they are slower here, a second run produced ratios better than without the optimization (.627 and .580 respectively).  There's probably plenty of noise, but the trend is good.
(Assignee)

Comment 31

2 years ago
This is the same patch as in comment #26, except that two bugs that crept into the rewritten emitBranchSetup / emitBranchPerform logic have been fixed (basically, failing to properly initialize variables along all paths).
Attachment #8814406 - Attachment is obsolete: true
Attachment #8814823 - Flags: review?(hv1989)
Attachment #8814823 - Flags: review?(hv1989) → review+
(Assignee)

Comment 33

2 years ago
Optimization instrumentation forked off as bug 1320645.
Depends on: 1324989