Closed Bug 999257 Opened 8 years ago Closed 7 years ago

Optimize instructions with tied inputs more thoroughly

Categories

(Core :: JavaScript Engine: JIT, enhancement)

x86
All
enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla33

People

(Reporter: sunfish, Assigned: sunfish, Mentored)

Details

(Whiteboard: [lang=c++])

Attachments

(1 file)

The following asm.js testcase:

a = new ArrayBuffer(4096);
var i32 = new Int32Array(a);
i32[0] = 3;
i32[1] = 5;
i32[1000] = 2;

print((function(stdlib, n, heap) {
    "use asm"
    var i32heap = new stdlib.Int32Array(heap);
    function f() {
        var i = 0;
        var sum = 0;
        do {
            sum = (i32heap[i>>2]|0) + sum |0;
            i = i + 4 |0;
        } while ((i|0) < 4096);
        return sum|0;
    }
    return f;
})(this, null, a)());

for example gets this x86-64 code:

  movl       0(%r15,%rdx,1), %edx
  addl       %ecx, %edx            # the add
  ...
  cmpl       $0x1000, %eax
  jge        ((30))
  movl       %edx, %ecx            # the copy
  jmp        ((37))

Similar code is emitted on 32-bit x86 as well.

It would be better to commute the operands of the add marked "the add", so that the phi doesn't require a copy on the loop backedge, since that increases register pressure and forces codegen to emit two branches for the loop instead of one.

This also appears to be related to the surprising regression seen on asmjs-ubench/memops.js on 32-bit x86. When I hack the compiler to commute the operands, it eliminates the copy, merges the branches, and the presumed microarchitectural quirk appears to be avoided.

Currently, codegen just uses a simple heuristic that looks at hasOneDefUse(). Ideally, the register allocator should be able to commute the operands of an LAdd (or similar) on demand, so that it can make a decision with the benefit of full liveness information.
Whiteboard: [mentor=sunfish][lang=c++]
Mentor: sunfish
Whiteboard: [mentor=sunfish][lang=c++] → [lang=c++]
This patch adds a simple heuristic for reduction-style phis. It's a fairly narrow pattern, but it's enough to optimize memops.js and the hand-written testcase in this bug.

I created ShouldReorderCommutative to split the heuristic logic out from the mechanism, and to make it easier to add more conditions.

I also added new methods to MPhi, getLoopPredecessorOperand and getLoopBackedgeOperand, and in addition to using them for this coalescing optimization, I also simplified a few things in RangeAnalysis with them.
Assignee: nobody → sunfish
Attachment #8447366 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8447366 [details] [diff] [review]
phi-loop-operands.patch

Review of attachment 8447366 [details] [diff] [review]:
-----------------------------------------------------------------

This patch sounds good.
Attachment #8447366 - Flags: review?(nicolas.b.pierron) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/35b7ab4dafd7

On AWFY, 32-bit x86 asmjs-ubench/memops.js got a 14% speedup in asm.js mode as a result of this patch. Non-asm.js mode also sped up by about 10% in both 32-bit x86 and x64.
Nice!  x86 memops matches native now.
https://hg.mozilla.org/mozilla-central/rev/35b7ab4dafd7
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
You need to log in before you can comment on or make changes to this bug.