Closed Bug 650496 Opened 9 years ago Closed 7 years ago

JM+TI: make v8-crypto fast

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED INVALID

People

(Reporter: bhackett1024, Unassigned)

References

(Blocks 1 open bug)

Details

v8-crypto basically tests our performance on this script's loop:

function am3(i,x,w,j,c,n) {
  var this_array = this.array;
  var w_array    = w.array;

  var xl = x&0x3fff, xh = x>>14;
  while(--n >= 0) {
    var l = this_array[i]&0x3fff;
    var h = this_array[i++]>>14;
    var m = xh*l+h*xl;
    l = xl*l+((m&0x3fff)<<14)+w_array[j]+c;
    c = (l>>28)+(m>>14)+xh*h;
    w_array[j++] = l&0xfffffff;
  }
  return c;
}

Intense register pressure in the loop, and some analysis opportunity.  We should make the slots of this_array and w_array loop invariant, and hoist the bounds checks on their accesses.  We don't currently do this as we don't understand the linear relationships between i/j and n.  The two this_array accesses could be CSE'ed, but I don't think that will save much once the cost of the array accesses is cut down to one dereference (also, for now CSE is not in the cards for JM+TI).  Current scores:

js -m -n     9428
js -m        5074
js -m -j -p  5611
d8           13930

Note: v8's tuning on this loop is not very robust.  If I change 'this_array' to this.array and 'w_array' to w.array, I get the scores below.  v8 can presumably LICM these accesses (something we don't do yet), but its score still drops in half.

js -m -n     7410
js -m        4420
js -m -j -p  5113
d8           6935
Depends on: 646923
From turning things off (unsoundly) to generate better code, doing these will get us to about 12400 on crypto (each is worth about 1000 points):

- Hoisting bounds checks and LICM'ing slot pointers.
- Determine that this_array and w_array are packed, with a hoisted dynamic check.  These arrays are sometimes initialized from the top down (bnpLShiftTo, bnpDLShiftTo), so they end up getting marked as possibly-not-packed and we do hole checks when accessing them.  It would be good if we could make the packed array stuff more robust against this initialization pattern.  Could do this if dense arrays had a holeCount for the number of holes beneath the initializedLength (share with obj->map, maybe?  stopgap until dense arrays are their own subclass of an eviscerated JSObject).
- Avoid add/multiply overflow checks and (particularly) multiply negative zero checks.  Rather than try to prove these can't happen (looks hard, especially for negative zero), it should be straightforward to prove the result will only be used in bitops, which are not affected by add/mul overflows in computing the operands.

After this, the remaining 1500 point difference comes down to register allocation and codegen.  It will help to have a seventh register available (bug 650560), and doing arithmetic with memory operands will reduce both register pressure and instruction counts.
(In reply to comment #1)
> It would be good if we could make the packed
> array stuff more robust against this initialization pattern.  Could do this if
> dense arrays had a holeCount for the number of holes beneath the
> initializedLength (share with obj->map, maybe?  stopgap until dense arrays are
> their own subclass of an eviscerated JSObject).

Bug on file?

/be
Great work, btw!

/be
I was just going to fix this stuff and attach the patches here, like I did for the astar bug (bug 649693).  Is it better practice to split into separate bugs?
Handle the linear relationships between i/j/n, hoist bounds checks and slot pointers for the array accesses in am3.  Improves v8-crypto score to 10440.

http://hg.mozilla.org/projects/jaegermonkey/rev/f01b61fd6f49
Dense arrays used to have a count of non-hole values < length, but that was removed in a righteous renovation that njn did.  Don't know if we can cheaply maintain it, since it requires a set to check for potentially overwriting a hole in order to update the accounting.
Yeah, that makes me wary of that change, it has potential to regress other stuff (there's also the increase in complexity).  The test is only required when updating arrays that aren't packed, though.  In most cases we will statically know the array is packed, and in other cases (such as the write in am3 here) we can test dynamically for a packed array (holeCount == 0) and hoist that check out of the loop.
(In reply to comment #6)
> Dense arrays used to have a count of non-hole values < length, but that was
> removed in a righteous renovation that njn did.  Don't know if we can cheaply
> maintain it, since it requires a set to check for potentially overwriting a
> hole in order to update the accounting.

gal did most of that work.  That code was awful -- complex, slow and I had to backport a bugfix to the 1.9.1 and 1.9.2 branches at one point.  Let's not bring it back! :)
Re: overflow checks, I got numbers again and on current tip removing the negative zero tests adds 600 points, removing the mul and add overflow checks adds another 900.  Doing this is trickier than I thought at first, for a couple reasons.

- Multiplying into a double and then truncating at the bitop *can* produce a different value than multiplying with truncation, due to floating point rounding errors.  I'm guessing/hoping that if the result of the multiply will fit in the 48 bit mantissa without rounding, then truncating the double will produce the same value as truncating the original mul.  With some simple interval analysis (wants SSA, bug 650715) we can determine three of the multiplies won't overflow an int32, and the fourth (xh*h) will fit in 36 bits.  We can't prove that negative zero isn't produced by these, and still need to determine that a negative zero will never be observed.

- If an add's result feeds into a bitop, overflows *can* be observed if subsequent additions before the bitop involve strings.  If x, y, and z are integers, we can ignore overflows in '(x + y + z) | 0', but if z is a string ('e20', say), then ignoring overflow in 'x + y' will change the result of the bitop.  This is tricky to deal with in type inference because we generally assume that known types can change at any time.  We can guard that any arithmetic on the add/mul result will be on integers, but we need extra checks to ensure that between the add/mul and its final bitop use, the types cannot change so that a string operand suddenly appears (basically, no scripted calls are possible and no fetching of unknown object properties).

Doing these is, however, worth it not just for v8-crypto but for other crypto tests.  kraken-stanford-crypto-pbkdf2 has this sort of stuff (see bug 642412) where adds overflow and are then converted back to ints with 'x | 0'.  JM+TI is way slower than everything else on this test, and I suspect that is in large part because we force each add which has ever overflowed to always produce a double, just so we can convert it back to int.  We should instead just ignore overflows here.  The safe_add function used in sunspider also fits this pattern.
If it's any use, Nanojit has a simple interval analysis in nanojit/LIR.cpp.  Look for class Interval.  It's designed to handle safe_add, it makes a big difference to trace JIT results for crypto-md5 and crypto-sha1 in Sunspider.
Will Nanojit's interval analysis work with code that isn't in the loop (that's all it analyzes, right?).  For am3, the multiplies are on 4 local variables xh, xl, h, l.  The latter two are set in the head of the loop, but the former two are set outside the loop and are loop invariant.  Trying to write analyses to capture this stuff without an SSA form (or Nanojit's linear SSA) is kind of a pita, but with SSA things are very simple --- the operands of the muls have a single associated write, which is a bitop taking a constant that must produce a value with N significant digits.
(In reply to comment #11)
> Will Nanojit's interval analysis work with code that isn't in the loop (that's
> all it analyzes, right?).

Correct, so it won't handle that -- the variables set outside the loop will end up as loads in the trace, so we'll have no info on them.  That's by far the biggest weakness with it.

Getting an interval analysis right is tricky, though.  You can do quite precise but complicated abstract operations, esp. for bitwise-and and bitwise-or.  In the Nanojit analysis I kept things simple, but it might still be a useful comparison point nonetheless.
Depends on: 652520
JM is gone
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.