Closed Bug 642412 Opened 9 years ago Closed 6 years ago

TI: investigate Kraken crypto performance


(Core :: JavaScript Engine, defect)

Not set





(Reporter: jandem, Assigned: bhackett)


(Blocks 1 open bug)



(1 file)

TI+JM is slower than trunk JM on all Kraken crypto tests. TI and JM should do very well on crypto tests, so I'd like to investigate what's happening here. 

TI is also slower on SS crypto-aes, it's probably related.
Assignee: general → jandemooij
I started analyzing pbkdf2. According to Shark we spend at least 8% in stubs::SetElem. Most of the time is spent inside sjcl.hash.sha256.prototype.C. This function has many setelem's, the most important one is inside the loop, something like this:

b = d[a & 15] = (int + int) | 0;

Type inference output:

#35:00285:  zero
  type 0: int
#35:00286:  bitor
  type 0: int
#35:00287:  setelem
  type 0: int

I think this is bug 621937 (JM not generating IC's for monitored setelem)
Depends on: 621937
This is a beautified, single-file version of stanford-crypto-pbkdf2.
Brian, the type information for function C (line 344) is really bad. I'm not sure why but the type of all locals, arg and this is all "unknown". Better type information may get rid of the monitored setelem and is needed to make this benchmark fast.
First step, assume that arithmetic involving objects produces integers (rather than unknown values), monitor double/string/XML results dynamically.
Several fixes to inference precision.  These mostly fall into the category of not overapproximating to capture the behavior of an unlikely event (calling a non-function, indexing an object with a non-integer), but waiting for it to happen dynamically.

There is still some imprecision because of a few accesses to Math which are JSOP_NAME rather than GNAME.  That can be fixed similarly to how we do monitored SETNAME/SETPROP, will file separately.  For now adding '| 0' to the calls to Math.round and Math.floor should fix precision.  I get these times:

JM+TI (old): 691
JM+TI (new): 438
JM: 474
JM+TM: 345
V8: 229

Looking at shark, JM seems to be getting killed by the slow paths it takes for the '(...) | 0' operations in C's inner loop.  We try to branchTruncateToInt32 on these, but on average about 1.5 times per iteration this truncation fails (the double does not fit in an int32) and we take I think a very expensive slow path.  We could instead call an infallible native like the tracer does, but to get to V8's performance I think we should instead generate inline code which is good enough to capture the large finite doubles being truncated here.
Assignee: jandemooij → bhackett1024
Depends on: 643037
The patch in bug 643037 fixes the truncate-double paths and brings the JM+TI time down to 361 (still with the modification to the Math calls).  Other things that should help:

1. Know that 'a' is an integer within the loop in the C function.  It is initially an array but is then reused as an integer.  Could use SSA-style techniques to add some flow sensitivity to let the inference distinguish its uses, or (probably better) make a Crankshaft-style speculation it is an integer within the loop body and hoist a check to the loop header.  This is worth about 20ms (easy to test by hand).

2. Hoist array bounds checks and LICM the slots pointer computation on the 'd' array.

3. Regalloc tuning on the loop, get another GPR (heavy register pressure here).

4. Typed natives or inline code for various Array natives.  Shark thinks we spend a decent amount of time in these.
Get precise type information for the results of JSOP_NAME/JSOP_CALLNAME (358ms on the unmodified test).

This exposed an existing bug in the recompilation mechanism, where we would convert integer slots to doubles even if the recompiled code knew they were integers.  The FrameState can be either more or less precise than the inferred types, and we only require args/locals inferred as int|double to be doubles at join points (the revised algorithm only uses the FrameState, not the inferred types, to figure out what needs repatching, and is a good deal cleaner).
We're still slower than V8 on these tests but the info in this bug is unlikely to help much.
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.