Last Comment Bug 649693 - JM+TI: make kraken-ai-astar fast
: JM+TI: make kraken-ai-astar fast
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: x86 Mac OS X
: -- normal (vote)
: ---
Assigned To: general
:
Mentors:
Depends on: 651827
Blocks: 619423 641047
  Show dependency treegraph
 
Reported: 2011-04-13 10:40 PDT by Brian Hackett (:bhackett)
Modified: 2011-05-18 08:07 PDT (History)
10 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
WIP array length LICM (72.05 KB, patch)
2011-04-15 13:51 PDT, Brian Hackett (:bhackett)
no flags Details | Diff | Splinter Review
array length LICM (76.68 KB, patch)
2011-04-16 07:03 PDT, Brian Hackett (:bhackett)
no flags Details | Diff | Splinter Review

Description Brian Hackett (:bhackett) 2011-04-13 10:40:18 PDT
With most of the machinery in place for doing advanced compiler optimizations in JM2, we can start focusing on individual benchmarks and tune these optimizations to actually do something.

ai-astar is a nice place to start, for a few reasons:

- We're still considerably slower than V8 on it.
- Many of our advanced optimizations could be applied to it, but none are yet robust enough to actually apply to it.
- Our ai-astar performance feeds into JSObject evisceration, so knowing how fast we can possibly be on it is good.
- Performance on ai-astar is entirely determined by this loop:

    for(var i=0;i<this.length;i++) {
        if(this[i].pos == obj.pos) { return this[i]; }
    }

Current times on this benchmark:

js -m -n: 455
js -j: 625
js -m: 727
d8: 240
Comment 1 Brian Hackett (:bhackett) 2011-04-14 14:50:24 PDT
A couple easy fixes.

Don't add 'undefined' to the type of an array when reading holes out of it.  This is an old behavior designed to avoid recompilations back before we had -a.  Now that we have -a, the interpreter happily marked the array elements as undefined even for opcodes immediately followed by a test 'if (x[a])'.  This was causing the element type of the 'this' array here to be maybe-void and not definitely-object.  Along with the sixth GPR (about a 20ms win on this benchmark), this drops our time to 377ms.

http://hg.mozilla.org/projects/jaegermonkey/rev/eee58bb8f367

Don't use an IC for x.length on a known dense array x.  Unfortunately this doesn't yet apply to this benchmark (more below), it's just a good thing to do.

http://hg.mozilla.org/projects/jaegermonkey/rev/dca50d9a5047

Two more involved changes that will help big on this benchmark:

- The 'this' array here can shrink in some other ai-astar code (it's a worklist) which causes us to deoptimize all dense array accesses on it (including 'this[i]').  The array shrinkage stuff was added for bounds checks hoisting --- if we know the array can't shrink, we only need to do bounds checks when initially entering the loop and not after every stub call.  This is avoiding the problem that we don't have a mechanism to trigger recompilation while regenerating loop invariants after stub calls.  Need to add this mechanism and rip out the array shrinkage stuff.  This would also allow hoisting the bounds check on 'this[i]' (we can recheck after a stub that this.length == this.initializedLength) and making the slots of 'this' loop invariant.

- The element types of the 'this' array are polluted with the original objects populating the graph (which don't have a 'pos' property at all).  This is because the original graph's arrays are updated in place when creating graph nodes ('g1[i][j] = new GraphNode(...)'), and we propagate that imprecision through to the worklist arrays which this loop runs on.  If we recognize this polymorphism and stick type checks in the 'neighbors' function to ensure only GraphNodes are added to the 'neighbors' result array, we would retain precision in findGraphNode (and most of the rest of the astar search algorithm).  This strategy calls back to earlier versions of the inference algorithm, and we will need to do this for maximal precision when code is polymorphic (adding possible monitoring to property/element reads and scripted calls should be sufficient).
Comment 2 Brian Hackett (:bhackett) 2011-04-14 22:05:49 PDT
Make the first involved change above.  After rejoining from a call and regenerating the loop invariants, we can make tests that trigger recompilation if they fail (how this works is kind of nifty --- on failure, we make an OOL call that first replaces its own return address with that of the original call, so the recompilation looks like it came from that original call).

This allows us to avoid ICs for this.length and this[i], cutting our time to 300ms.  We are still not doing any hoisting, LICM, or optimization of the *.pos accesses.

http://hg.mozilla.org/projects/jaegermonkey/rev/cb06710a8eb7
Comment 3 Brian Hackett (:bhackett) 2011-04-15 13:51:19 PDT
Created attachment 526366 [details] [diff] [review]
WIP array length LICM

WIP (needs cleanup, testing) patch for doing LICM on x.length computations and hoisting bounds checks / LICM'ing array slots in loops which do x.length.  This cuts our astar time to 257ms.  Looking at the remaining disassembly:

2: x/i $pc  0x782348:	cmpl   $0x0,0x102742c
2: x/i $pc  0x78234f:	jne    0x7824af
2: x/i $pc  0x782355:	mov    (%edi,%esi,8),%ebx
2: x/i $pc  0x782358:	mov    %ebx,%edx
2: x/i $pc  0x78235a:	mov    0xc(%edx),%ecx
2: x/i $pc  0x78235d:	cmp    $0x2c4,%ecx
2: x/i $pc  0x782363:	jne    0x782527
2: x/i $pc  0x782369:	lea    0x24(%edx),%edx
2: x/i $pc  0x78236c:	mov    0x20(%edx),%ecx
2: x/i $pc  0x782372:	mov    0x1c(%edx),%edx
2: x/i $pc  0x782378:	mov    %eax,%ebx
2: x/i $pc  0x78237a:	mov    0xc(%ebx),%ecx
2: x/i $pc  0x78237d:	cmp    $0x2c4,%ecx
2: x/i $pc  0x782383:	jne    0x782574
2: x/i $pc  0x782389:	lea    0x24(%ebx),%ebx
2: x/i $pc  0x78238c:	mov    0x20(%ebx),%ecx
2: x/i $pc  0x782392:	mov    0x1c(%ebx),%ebx
2: x/i $pc  0x782398:	cmp    %ebx,%edx
2: x/i $pc  0x78239a:	jne    0x7823bc
2: x/i $pc  0x7823bc:	sub    $0xffffffff,%esi
2: x/i $pc  0x7823bf:	jo     0x78266f
2: x/i $pc  0x7823c5:	mov    %esi,0x38(%ebp)
2: x/i $pc  0x7823c8:	mov    0x68(%ebp),%ebx
2: x/i $pc  0x7823cb:	cmp    %ebx,%esi
2: x/i $pc  0x7823cd:	jl     0x782347

It's the IC'ed .pos accesses that really dominate the loop now.  There's improvement to be had here, even while keeping the ICs (in most real world cases we probably won't be able to remove the property ICs).  We pointlessly load the type tags each time (we know statically the result type of the access, even if we don't know its slot), which also keeps us from loop-carrying a register for this.length.

I went ahead and made an alternate version of astar where we can recover the exact types of this[i] and obj here and do direct slot accesses (per comment 1, we can get here with some more analysis work).  This cuts our time to 138ms (70% faster than V8), and is *still* not optimal (does not LICM the obj.pos access).
Comment 4 Brian Hackett (:bhackett) 2011-04-16 07:03:15 PDT
Created attachment 526487 [details] [diff] [review]
array length LICM

http://hg.mozilla.org/projects/jaegermonkey/rev/244446b156b7
Comment 5 Jan de Mooij [:jandem] 2011-04-16 07:36:45 PDT
According to awfy this also improved V8-richards nicely, Δ590 (11.8% better), a number of individual Kraken tests also got faster.
Comment 6 Brian Hackett (:bhackett) 2011-04-16 07:44:55 PDT
Yeah, these are not astar-specific improvements, but looking at single benchmarks helps with focus.  Also, the above cset dipped us below V8 on Kraken (may not last, as they've dinged their Kraken perf by 400-500ms in the last week).

function foo(n) {
  var a = Array(n);
  for (var i = 0; i < n; i++)
    a[i] = i;
  for (var i = 0; i < 1000000; i++) {
    var b = 0;
    for (var j = 0; j < a.length; j++)
      b = a[j];
  }
}
foo(1000);

js -m -n  1107
js -m     7578
js -j     3983
d8        1457

Putting the polymorphism handling on hold until we have more examples and I have a better sense of how to do it right.
Comment 7 Brian Hackett (:bhackett) 2011-05-18 07:49:59 PDT
With bug 656920 landing I now get 138ms for JM+TI on astar, 75% faster than V8.  This is the disassembly of the loop in comment 0:

3: x/i $pc  0x7bd0cc:	cmpl   $0x0,0x79642c
3: x/i $pc  0x7bd0d3:	jne    0x7bd1f9
3: x/i $pc  0x7bd0d9:	mov    (%edi,%esi,8),%ebx
3: x/i $pc  0x7bd0dc:	mov    0x40(%ebx),%ebx    // #1
3: x/i $pc  0x7bd0df:	cmp    %edx,%ebx
3: x/i $pc  0x7bd0e1:	jne    0x7bd103
3: x/i $pc  0x7bd103:	sub    $0xffffffff,%esi
3: x/i $pc  0x7bd106:	mov    %esi,0x38(%ebp)    // #2
3: x/i $pc  0x7bd109:	mov    0xb0(%ebp),%edx    // #3
3: x/i $pc  0x7bd10f:	cmp    %ecx,%esi
3: x/i $pc  0x7bd111:	jl     0x7bd0cb

As fast as this code is, it could be faster.

#1 (load this[i].pos) should be combined with the branch in the following instruction.  This avoids an extra instruction (important to do for code that is no longer choked with memory ops) and in some cases an extra register (we manage to reuse the object register here).

#2 (sync i for backedge) is pointless, and required to maintain the JM+TI invariant that live entries are always synced at loop heads (makes on-demand allocation of loop registers *much* simpler).

#3 (reload hoisted obj.pos) is redundant, and happens because the branching in the loop body combined with loop register allocation causes us to forget the hoisted value is still in a register at the backedge.

#3 would be a little gross to fix but will probably happen (it also hurts the most, negating the effect of hoisting obj.pos), #1 and #2 would be hard to fix.  This is hitting the limit of codegen quality we can reasonably do in JM2.  Fortunately, the best code for this loop should fall out naturally in IonMonkey, from its better design decisions (namely, maximal munch on a lowered IR gives #1, and a precomputed regalloc gives #2 and #3).

Note You need to log in before you can comment on or make changes to this bug.