Closed Bug 653962 Opened 9 years ago Closed 7 years ago

JM+TI: make v8-deltablue fast

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set

RESOLVED INVALID

People

(Reporter: bhackett, Unassigned)

References

(Blocks 1 open bug)

Attachments

(2 files, 3 obsolete files)

V8 is currently more than four times as fast as JM+TI at deltablue.  Whatever's broken needs to get fixed (hopefully related mainly to the virtual call structure of this benchmark).
Starting with a simplified version of the 'execute' function, which takes up more time than any other script in deltablue.  This iterates over a list of constraints and executes each one.  The actual 'execute' call is a dispatch which can call one of several simple functions with a few property side effects.  Even without this 'execute' call, though, we are still 3 times slower than V8, which seems to be because we don't do any LICM (loop-invariant code motion) within the main loop.

Plan.prototype.execute = function () {
  for (var i = 0; i < this.size(); i++) {
    var c = this.constraintAt(i);
  }
}
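
To make the missing optimization concrete, here is a hand-hoisted version of the loop. It is only a source-level sketch of what LICM would achieve, assuming a hypothetical minimal Plan backed by a plain array `v` (the benchmark's real Plan is more involved). `executeNaive` and `executeHoisted` are names invented here, and both return the last constraint so the two can be compared.

```javascript
// Hypothetical minimal Plan; the field name `v` is an assumption.
function Plan() {
  this.v = [];
}
Plan.prototype.size = function () { return this.v.length; };
Plan.prototype.constraintAt = function (i) { return this.v[i]; };

// Shape of the original loop: this.v is reloaded (via size() and
// constraintAt()) on every iteration.
Plan.prototype.executeNaive = function () {
  var last;
  for (var i = 0; i < this.size(); i++) {
    last = this.constraintAt(i);
  }
  return last;
};

// What inlining plus LICM amounts to: the loop-invariant loads of
// this.v and its length are done once, before the loop.
Plan.prototype.executeHoisted = function () {
  var v = this.v;
  var n = v.length;
  var last;
  for (var i = 0; i < n; i++) {
    last = v[i];
  }
  return last;
};
```

The hoisted form is only equivalent because nothing in the loop body changes `this.v` or its length.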

Patch to do LICM on property accesses.  This won't affect deltablue until we do LICM within the inlined calls (next step).  This also fixes a bug in the LICM design, where if a loop invariant temporary changes within the call and has copies on the stack, those copies will use the new value of the temporary instead of the old one.  The fix is to trigger recompilation if a copied temporary changes across a call.

http://hg.mozilla.org/projects/jaegermonkey/rev/acafcbe50b01
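
Hoisting is only sound while the hoisted value really is invariant. The following source-level sketch uses a hypothetical `Budget` type (not from the benchmark or the patch) to show how a stale hoisted copy changes behavior when the call inside the loop modifies the property. The patch handles the engine-level analogue by triggering recompilation; this sketch only illustrates the hazard.

```javascript
// Hypothetical type whose method mutates the value the loop tests.
function Budget(limit) {
  this.limit = limit;
}
Budget.prototype.spend = function () { this.limit--; };

// Reloads b.limit on every iteration, so it sees each decrement.
function countReloading(b) {
  var n = 0;
  for (var i = 0; i < b.limit; i++) {
    b.spend();
    n++;
  }
  return n;
}

// Uses a copy taken before the loop; the copy goes stale as soon as
// spend() runs, so the loop runs more iterations than the version above.
function countHoisted(b) {
  var limit = b.limit;
  var n = 0;
  for (var i = 0; i < limit; i++) {
    b.spend();
    n++;
  }
  return n;
}
```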
Attached patch: WIP LICM within inline calls (obsolete)
WIP overhauling handling of frame state during inlining.  Trying to allocate registers with the existing scheme of isolating all the inlined frames is not really workable for LICM, as there is no not-disgusting way to allocate registers for loop temporaries from within an inlined call.  This is barely a WIP, but the idea is to remove the isolation between frames and allow the FrameState to manage all the entries as a single unit.  Active frames are still around, but just demarcate which portion of the entries buffer belongs to each frame's args, locals, etc.
Attached patch: WIP LICM 2 (obsolete)
Wrap per-script SSA analyses with a simple analysis for following data flow across inline call boundaries, so we can do SSA stuff on a whole compilation unit, including inlined functions, without having to redo the SSA analyses themselves.  Existing loop-based analyses like bounds check hoisting and integer overflow elimination work transparently in the presence of inlining (though the patch still barely works).

// Call this 100000 times with a 1000 element Plan.
Plan.prototype.execute = function () {
  for (var i = 0; i < this.size(); i++) {
    var c = this.constraintAt(i);
  }
}

js -m -n (old): 760
js -m -n (new): 113
d8: 160
js -m: 3900
js -j: 1770
Attachment #529598 - Attachment is obsolete: true
Finished patch.  Will push after finishing bug 650163, which is in progress on the same tree.  This improves the above testcase but still doesn't do anything for deltablue yet because LICM gets disabled on the important loops.
Attachment #529931 - Attachment is obsolete: true
Followup bugfix: for inline calls with multiple exit paths, we could overwrite the inline call's return value when syncing for the expected state after the call.

http://hg.mozilla.org/projects/jaegermonkey/rev/5aadf6bc110b
Inlined natives for Array.{push,pop}.  Improves deltablue score by about 30%.

http://hg.mozilla.org/projects/jaegermonkey/rev/fd1abc43d698

function foo() {
  var a = [];
  for (var i = 0; i < 10000; i++) {
    var x = 0;
    for (var j = 0; j < 10000; j++)
      a.push(j);
    for (var j = 0; j < 10000; j++)
      x += a.pop();
  }
}

js -m -n (old): 9307
js -m -n (new): 497
d8: 1250
js -m: 6864
js -j: 7366

-m -n (old) is slower than -m because of the write barriers for types needed when Array.push is called as a native (pretty much all side effects in the VM need to go through these barriers).
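
For trying the same push/pop pattern at other sizes, a parameterized variant can help. This is a sketch, not the harness used for the numbers above: `pushPopSum` and `timeIt` are hypothetical names, and `Date.now()` gives only millisecond resolution.

```javascript
// Same structure as foo(), but with configurable loop counts and a
// returned sum so the work cannot be trivially discarded.
function pushPopSum(outer, inner) {
  var a = [];
  var total = 0;
  for (var i = 0; i < outer; i++) {
    for (var j = 0; j < inner; j++)
      a.push(j);
    for (var k = 0; k < inner; k++)
      total += a.pop();
  }
  return total;
}

// Hypothetical timing helper: runs fn once and reports elapsed ms.
function timeIt(fn) {
  var start = Date.now();
  var result = fn();
  return { ms: Date.now() - start, result: result };
}

var run = timeIt(function () { return pushPopSum(100, 10000); });
```

Scaling `outer` up toward 10000 approaches the amount of work in the original test.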
Attached file: WIP new property overhaul (obsolete)
Mostly done WIP overhauling how we figure out the definite properties and initial shapes for objects allocated with 'new'.  This was previously a fairly narrow scan of the bytecode which didn't allow branching or all that many operations to happen.  Now that we have SSA, it's easier to make things more robust, allow branching (as long as the writes to this.f themselves happen in unconditional code) and even follow allocation code across script boundaries.  For now this specializes on initialization that uses Function.call (the pattern in deltablue, shown below), but could be fairly straightforwardly extended to other constructor patterns and class inheritance shoehorning.

function Constraint(strength) {
  this.strength = strength;
}

function UnaryConstraint(v, strength) {
  UnaryConstraint.superConstructor.call(this, strength);
  this.myOutput = v;
  this.satisfied = false;
  this.addConstraint();
}

UnaryConstraint.inheritsFrom(Constraint);

UnaryConstraint.superConstructor is Constraint, so after creating a UnaryConstraint we want to know it has strength/myOutput/satisfied in particular slots.
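
The snippet above depends on helpers defined elsewhere in the benchmark. A self-contained sketch of the same pattern follows, assuming an `inheritsFrom` helper written in the spirit of the benchmark's (installed on Function.prototype here as an assumption) and a no-op stand-in for `addConstraint`. It shows that every instance ends up with the same own properties in the same order, which is the regularity the definite-property analysis is trying to establish.

```javascript
// Assumption: a helper modeled on the benchmark's inheritsFrom.
Function.prototype.inheritsFrom = function (superConstructor) {
  function Inheriter() {}
  Inheriter.prototype = superConstructor.prototype;
  this.prototype = new Inheriter();
  this.superConstructor = superConstructor;
};

function Constraint(strength) {
  this.strength = strength;
}
// Hypothetical no-op stand-in; the real method does graph bookkeeping.
Constraint.prototype.addConstraint = function () {};

function UnaryConstraint(v, strength) {
  UnaryConstraint.superConstructor.call(this, strength);
  this.myOutput = v;
  this.satisfied = false;
  this.addConstraint();
}
UnaryConstraint.inheritsFrom(Constraint);

// Own properties appear in initialization order on every instance.
var c = new UnaryConstraint('v', 'required');
var ownProperties = Object.keys(c);
```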

The main remaining issue: if we clear definite property information from the object due to, say, UnaryConstraint.superConstructor producing undefined, Constraint.call getting overwritten, or Object.prototype getting a setter for 'myOutput', we need to walk the stack to look for objects whose initialization is in progress and make sure they end up with the right shape (we initialize these objects with their final shape at the start of the constructor).

We don't do this walking currently (bug in the current design), and it would be hard and not robust to prove none of the above things could occur during the middle of initialization.  Fortunately, by remembering where each property gets initialized we can deduce for a given stack frame how much of it has been initialized and what its current shape *should* be.
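
The deduction described above can be sketched as a lookup over recorded initialization points. Everything here is hypothetical: the table layout, the offsets, and the name `initializedPrefix` are invented for illustration, and the sketch flattens the cross-script case (where 'strength' is written inside Constraint) into a single list.

```javascript
// Hypothetical table: each definite property paired with a made-up
// bytecode offset at which its initializing write has completed.
var unaryConstraintInits = [
  { name: 'strength', offset: 10 },
  { name: 'myOutput', offset: 20 },
  { name: 'satisfied', offset: 30 }
];

// Given where a frame currently is, return the properties that have
// already been written; the object's shape should cover exactly these.
function initializedPrefix(inits, pc) {
  var names = [];
  for (var i = 0; i < inits.length && inits[i].offset <= pc; i++) {
    names.push(inits[i].name);
  }
  return names;
}
```

A frame suspended between the second and third writes would be rolled back to a shape holding only the first two properties.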
Improve type information and compilation at polymorphic call sites --- we can track correlations between the type of the 'this' value and callee and maintain accurate types for the called function's 'this', and avoid a PIC at the callsite by dispatching on the 'this' value's type.

http://hg.mozilla.org/projects/jaegermonkey/rev/0b58cbabd2cc
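
As a rough source-level analogy for dispatching on the type of 'this' rather than going through a PIC, consider the sketch below. The constructor names echo the benchmark, but the method bodies and the `callExecuteDispatched` helper are hypothetical, and the real change operates on type information inside the compiler rather than on prototype comparisons.

```javascript
// Stand-in types; bodies are placeholders, not benchmark code.
function StayConstraint() {}
StayConstraint.prototype.execute = function () { return 'stay'; };

function EditConstraint() {}
EditConstraint.prototype.execute = function () { return 'edit'; };

// Generic call: the callee is found by a property lookup at the site.
function callExecuteGeneric(c) {
  return c.execute();
}

// Dispatched call: test the receiver's type first, then call the
// callee known for that type directly, with a generic fallback.
function callExecuteDispatched(c) {
  var proto = Object.getPrototypeOf(c);
  if (proto === StayConstraint.prototype)
    return StayConstraint.prototype.execute.call(c);
  if (proto === EditConstraint.prototype)
    return EditConstraint.prototype.execute.call(c);
  return c.execute();
}
```

Because each branch names its callee statically, the callee's 'this' type is known exactly in that branch.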

This improves v8-deltablue by about 10%, to about twice what it was when I started optimizing deltablue and 70% more than stock js -m -j -p.  It's still only half of v8's score, but analysis information is really good and I'm going to stop working on it for now.  We are really held back by a lack of generational GC --- if I modify the benchmark to keep the OrderedCollections alive, we're about 15% faster than v8 (though doing this unfairly penalizes a generational GC).  Codegen around branches inside inline calls also sucks, we have to sync a huge amount of stack (fixable, but will be far easier to do right in IonMonkey).
Duplicate of this bug: 650075
JM is gone and GGC is tracked in lots of places, already.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID