Closed Bug 1640669 Opened 1 year ago Closed 1 year ago

Boolean evaluation for control for all_true, any_true, bitmask

Categories

(Core :: Javascript: WebAssembly, enhancement, P3)

x86_64
All
enhancement

Tracking

()

RESOLVED FIXED
81 Branch
Tracking Status
firefox81 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

The T.all_true and T.any_true predicates generate boolean values that are (probably) almost always used for control flow, but our renditions of these predicates generate an explicit boolean value that must be tested subsequently, this is suboptimal. It looks like we can instead depend on Ion to clean this up.

(Control flow results in an MTest node, and we want to add some smarts to the MTest lowering code, similar to what's there for some other cases. The predicates apparently need isEmittedAtUses, which is probably OK - the allTrue renditions will then boil down to three instructions, which we can afford to repeat in the worst case; anyTrue is just a single instruction anyway, with a single input.)

In the inner loop of the SIMD version of the mandelbrot program here the only really ugly bit is the code generated for any_true, for exactly the reason outlined in comment 0. The code there is additionally a negated test, (i32.eqz (i32x4.any_true ...)) and we need to handle that and its variations too.

No longer blocks: 1637332

Also the bitmask family of instructions benefits from this, though in that case I'm guessing there's more of a balance between uses for control and uses for value.

Summary: Boolean evaluation for control for all_true and any_true → Boolean evaluation for control for all_true, any_true, bitmask

Also, we want to constant-fold these when we can, if it's not too much work. I've seen emscripten generate some strange code where it tests any_true of a constant value, probably as a result of several levels of hoisting and rewriting. Here's the start of the output of the mandelbrot program at -O3:

  (func (;2;) (type 2) (param i32 i32) (result i32)
    (local i32 i32 f32 v128 v128 v128 v128 v128 v128 v128 v128 v128)
    block  ;; label = @1
      i32.const -1
      i32x4.splat
      local.tee 11
      i32x4.any_true
      if  ;; label = @2
        ...

Note, though, that to effectively constant fold that condition we have to constant fold the tee, that is, a tee of a constant should propagate the constant. Not completely obvious to me what happens in WasmIonCompile right now.

Edit: tee seems to do the right thing, the MDefinition flows through the tee.

Edit: the constant folding should be a work item of its own, separate from this bug.

Depends on: 1644424

(In reply to Lars T Hansen [:lth] from comment #2)

Also the bitmask family of instructions benefits from this, though in that case I'm guessing there's more of a balance between uses for control and uses for value.

Specifically, instead of computing the mask and then comparing it to zero we can PTEST the input value with a vector that has all the sign bits set and just branch on the flag. For 8-bit and 32-bit on x86 i'm not sure this is going to beat {PMOVMSKB, MOVMSKPS} + TEST but it might (and we save a register and stay in the FPU). For 16-bit the PTEST will almost certainly beat the four-instruction sequence we have for computing the mask + another TEST.

This removes the generation of a boolean result for all_true and
any_true operations that will subsequently just be tested to perform
control flow. It uses the standard Ion framework for this.

The test case tests that the optimization is triggered (in this simple
setting), and that it produces correct code. (The optimization also
triggers on a couple of test cases in the imported spec test suite.)

Code generation for i16x8.bitmask is particularly poor, so optimize for control.
In the case of i8x16.bitmask and i32x4.bitmask, it's a single SIMD instruction
followed by a test, and it's unlikely that something depending on a constant
load will do better, so don't optimize these.

A subsequent patch will move the code from Lowering.cpp into platform code, now
that it is clearly platform-dependent.

Depends on D85180

Factor a predicate that determines whether we should optimize a reduce+branch operation.

Depends on D85260

Attachment #9166604 - Attachment description: Bug 1640669 - Part 1: Optimize all_true and any_true for control. r?jseward → Bug 1640669 - Part 1: Optimize all_true and any_true for control. r=jseward
Attachment #9166782 - Attachment description: Bug 1640669 - Part 2: Optimize bitmask for control. r?jseward → Bug 1640669 - Part 2: Optimize bitmask for control. r=jseward
Attachment #9166783 - Attachment description: Bug 1640669 - Part 3: Hide platform-specific bits better. r?jseward → Bug 1640669 - Part 3: Hide platform-specific bits better. r=jseward
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9d343deee76d
Part 1: Optimize all_true and any_true for control.  r=jseward
https://hg.mozilla.org/integration/autoland/rev/ce7b1451c100
Part 2: Optimize bitmask for control. r=jseward
https://hg.mozilla.org/integration/autoland/rev/684e8b6a4109
Part 3: Hide platform-specific bits better. r=jseward
You need to log in before you can comment on or make changes to this bug.