Last Comment Bug 575529 - TM: use interval analysis to remove more overflow checks
: TM: use interval analysis to remove more overflow checks
Status: RESOLVED FIXED
fixed-in-nanojit, fixed-in-tracemonke...
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: x86 Mac OS X
: -- normal (vote)
: ---
Assigned To: Nicholas Nethercote [:njn] (on vacation until July 11)
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-06-29 00:16 PDT by Nicholas Nethercote [:njn] (on vacation until July 11)
Modified: 2010-09-23 19:56 PDT (History)
13 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch (9.32 KB, patch)
2010-06-29 00:16 PDT, Nicholas Nethercote [:njn] (on vacation until July 11)
no flags Details | Diff | Review
patch, v2 (13.57 KB, patch)
2010-06-29 20:06 PDT, Nicholas Nethercote [:njn] (on vacation until July 11)
no flags Details | Diff | Review
TM patch (against TM 52847:857ab0bfeb22) (4.23 KB, patch)
2010-09-06 21:08 PDT, Nicholas Nethercote [:njn] (on vacation until July 11)
gal: review+
Details | Diff | Review
NJ patch (against TM 52847:857ab0bfeb22) (12.04 KB, patch)
2010-09-06 21:16 PDT, Nicholas Nethercote [:njn] (on vacation until July 11)
edwsmith: review+
jseward: review+
Details | Diff | Review

Description Nicholas Nethercote [:njn] (on vacation until July 11) 2010-06-29 00:16:00 PDT
Created attachment 454789 [details] [diff] [review]
patch

TM currently has a function IsOverflowSafe() which is used to remove some overflow checks for demoted (integer) arithmetic operations.  It's pretty weak.

The attached patch does a basic interval analysis that does a better job.  It does a great job for crypto-md5, roughly halving the size of the final LIR, and reducing the number of static overflow checks from 416 to 26.  This is because of this function:

function safe_add(x, y)
{
  var lsw = (x & 0xFFFF) + (y & 0xFFFF);
  var msw = (x >> 16) + (y >> 16) + (lsw >> 16);
  return (msw << 16) | (lsw & 0xFFFF);
}

The new analysis can tell that no overflow checks are needed for any of the additions.  This speeds up crypto-md5 by 1.25--1.50x (it varies across platforms).  It also speeds crypto-sha1 by 1.05--1.10x.

I'd appreciate a careful review, particularly of the interval arithmetic.  I've checked it pretty carefully but more eyes are welcome.  For a while I tried to do something smarter, esp. with LIR_andi, but it got complicated and I started getting nervous.

One thing worth mentioning is that the analysis redoes some work.  Eg. if you have an expression (a+b)+(c+d) it'll analyze a, b, c, d, a+b, c+d, (a+b)+(c+d).  I could try to memoize prior results but it doesn't seem important.

Another thing is that the checking for -0 after LIR_muli has changed.  It may be too optimistic.  For example, if you have x*2 where you don't know anything about the interval of x, I assume a check isn't needed.  But it's unclear to me if x could be -0.  We seem to guard against -0 occurring after div, mul and mod, but not after d2i which makes me nervous.

(One notable thing about -0 checks is that they trigger quite often, even when they don't need to.  So they're even more undesirable than overflow checks.)

Beyond this patch, I'd love it if we could get more out of this analysis by important possible intervals for eg. values on the stack.  If it could be done, we could possibly remove a lot more overflow checks.  I'm not sure how hard it would be, though.
Comment 1 Andreas Gal :gal 2010-06-29 00:30:09 PDT
Ed, I think you guys would like this, too. We should move it into NJ.
Comment 2 Edwin Smith 2010-06-29 04:33:46 PDT
+1
Comment 3 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-06-29 18:43:09 PDT
Hmm, TM has some extra guards to avoid div-by-zero and mod-by-zero.  The interval analysis could improve those cases as well.
Comment 4 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-06-29 20:06:25 PDT
Created attachment 455059 [details] [diff] [review]
patch, v2

I moved all the generic interval analysis stuff into NJ;  I haven't split the NJ and TM patches yet though, doesn't seem worth it yet.  I also cleaned up interval::of() a bit.

I'm more confident now that the needsNegZeroCheck-related code is correct.  But I'm still worried that we could be losing -0 in several of the cases in d2i() (which would be a pre-existing bug).
Comment 5 Andreas Gal :gal 2010-06-29 21:32:19 PDT
How much do you recurs repeatedly over the same LIR patterns? Would it be worth it storing the result of of() in a hash table?
Comment 6 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-06-29 21:54:20 PDT
(In reply to comment #5)
> How much do you recurs repeatedly over the same LIR patterns? Would it be worth
> it storing the result of of() in a hash table?

It depends on the program, in particular how long its expression chains are.  It appeared to be negligible for crypto-md5, which seems like a case with pretty long chains.  So I suspect a hash table wouldn't be worth the effort, and might even slow it down.
Comment 7 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-06-30 19:54:27 PDT
(In reply to comment #6)
>
> It appeared to be negligible for crypto-md5, which seems like a case with
> pretty long chains.

To quantify that:  0.28M instructions out of 42.3M.  In comparison, VisitFrameSlots<js::CountSlotsVisitor> accounts for 1.7M.
Comment 8 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-07-07 17:20:44 PDT
Comment on attachment 455059 [details] [diff] [review]
patch, v2

Julian was worried about the propagation of overflows within the analysis.  To cut a long story short, I think that, given the way that TM uses the analysis, it's safe.  But the safety argument is subtle.  Furthermore, the interval analysis code is in NJ and so I can't only consider how TM uses it.

So adding an explicit "has overflowed" value might help here.  It might also speed the analysis up on big expression chains, if I can stop working backwards as soon as an overflow value is hit.

I'll work up a new patch.
Comment 9 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-07-08 23:49:42 PDT
Note to self:  I could strengthen the analysis considerably by looking at guards.  Eg. if there's a "exit if i > 100" guard then I know that after the guard that i <= 100.

I wonder if it's worth doing anything symbolically, too.
Comment 10 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-08-03 18:09:21 PDT
In bug 580468 comment 20 Travis Fisher suggested: 

"Regarding interval analysis and real-world testcases, it might have a good
impact on some of the "emulators are slow" complaints, like bug 509986 or bug
514765.  I would bet the hot core of an 8 bit emulator has a lot of integer ALU
operations that can be statically shown to have 8 or 16 bit results."

Though tracing the main loop of an emulator is probably a bad idea in general, JM should do much better on such coode, which would render the idea moot.
Comment 11 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-09-06 18:50:39 PDT
(In reply to comment #6)
> (In reply to comment #5)
> > How much do you recurs repeatedly over the same LIR patterns? Would it be worth
> > it storing the result of of() in a hash table?
> 
> It depends on the program, in particular how long its expression chains are. 
> It appeared to be negligible for crypto-md5, which seems like a case with
> pretty long chains.

That turns out to be untrue.  I was handling 'or' pessimistically, which breaks up the expression chains.  If I don't do that the expressions chains get very large and intertwined and I get an appalling blow-up.

A function like this is the pathological case:

function f(x, y) {
    var a = x + y;
    var b = a + a;
    var c = b + b;
    var d = c + c;
    var e = d + d;
    var f = e + e;
    var g = f + f;
    var h = g + g;
    return h;
}


> So I suspect a hash table wouldn't be worth the effort,
> and might even slow it down.

A hash table would clearly help with crypto-md5 and other extreme examples, but it would definitely slow the average case down -- CSE is the most expensive part of the NJ pipeline and it does a hash table lookup for many instructions, and this optimization is less generally applicable than CSE.

If we could store the bounds in the instructions that would make things easy and fast, but that doesn't seem reasonable.

Maybe having a counter that limits how far back the analysis goes would be a decent compromise.  It's true for safe_add() that going back 2 or 3 operands is enough and it's probably true for other cases where this optimization is effective.
Comment 12 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-09-06 21:08:26 PDT
Created attachment 472524 [details] [diff] [review]
TM patch (against TM 52847:857ab0bfeb22)

Just the TM patch.  Fairly straightforward, most of the tricky stuff is in the NJ patch.
Comment 13 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-09-06 21:16:28 PDT
Created attachment 472526 [details] [diff] [review]
NJ patch (against TM 52847:857ab0bfeb22)

Nanojit patch.  

- I added the notion of an 'overflowed' value that propagates through add/sub/mul like 'NaN'.  

- I simplified a number of the cases, eg. LIR_andi, LIR_rshi, LIR_rshui, so they're easier to understand but still cover the common cases.

- I added a recursion limit on Interval::of() to avoid analysis blow-ups.

Cachegrind results:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|    95.193    95.226 (------) |    25.159    25.160 (------) | 3d-cube
|    57.410    57.322 (1.002x) |    23.099    23.012 (1.004x) | 3d-morph
|   116.739   116.792 (------) |    44.065    44.066 (------) | 3d-raytrace
|    73.043    73.048 (------) |    15.220    15.220 (------) | access-binary-
|   133.642   133.646 (------) |    82.806    82.806 (------) | access-fannkuc
|    37.526    37.541 (------) |    16.700    16.701 (------) | access-nbody
|    61.691    61.695 (------) |    28.901    28.901 (------) | access-nsieve
|    16.349    16.350 (------) |     3.249     3.249 (------) | bitops-3bit-bi
|    45.911    45.914 (------) |    32.701    32.701 (------) | bitops-bits-in
|    24.779    24.782 (------) |    12.019    12.019 (------) | bitops-bitwise
|    63.619    63.623 (------) |    38.301    38.301 (------) | bitops-nsieve-
|    49.008    49.010 (------) |    19.438    19.438 (------) | controlflow-re
|    47.619    41.988 (1.134x) |     5.964     4.406 (1.353x) | crypto-md5
|    30.330    29.145 (1.041x) |     6.809     6.220 (1.095x) | crypto-sha1
|    97.604    97.642 (------) |    17.295    17.295 (------) | date-format-to
|    83.587    83.603 (------) |     9.767     9.768 (------) | date-format-xp
|    54.274    54.278 (------) |    31.292    31.292 (------) | math-cordic
|    29.549    29.482 (1.002x) |     6.232     6.169 (1.010x) | math-partial-s
|    31.019    31.033 (------) |    13.319    13.319 (------) | math-spectral-
|    58.528    58.530 (------) |    34.592    34.592 (------) | regexp-dna
|    39.746    39.752 (------) |     9.408     9.408 (------) | string-base64
|   118.457   118.207 (1.002x) |    25.728    25.504 (1.009x) | string-fasta
|   139.147   139.133 (------) |    17.703    17.703 (------) | string-tagclou
|   180.727   180.666 (------) |    21.897    21.897 (------) | string-unpack-
|    58.414    58.415 (------) |    11.706    11.706 (------) | string-validat
-------
|  1743.921  1736.835 (1.004x) |   553.382   550.862 (1.005x) | all


crypto-md5 and crypto-sha1 benefit because no addxovi's occur in safe_add(), the other minor improvements are because some -0 checks are removed from some LIR_muli operations.

Timing shows about a 2--3ms win on crypto-md5, other improvements are in the noise.

It's worth noting that the crypto-md5 improvement is mostly in compile-time -- the number of guards drops greatly, which greatly reduces the number of stack stores that are necessary, which causes lots of code to become dead.  Although the trace JIT gives us bad numbers on crypto-md5 because of the high compile-time costs, the code it generates is actually very good.  I tried greatly increasing the amount of text processed by the benchmark greatly and got these figures:

  js -m:                  5.3s
  js -j, before patch:    2.0s
  js -j, after patch:     0.8s

I bet we'd thrash JSC and V8 on an input of this size.
Comment 14 Andreas Gal :gal 2010-09-08 17:00:42 PDT
Comment on attachment 472524 [details] [diff] [review]
TM patch (against TM 52847:857ab0bfeb22)

ChecksRequired instead of checksRequired seems to be the standard. I am not a big fan of reference parameters. Pointers are more explicit. Just a nit. Up to you.
Comment 15 Edwin Smith 2010-09-10 12:08:30 PDT
Comment on attachment 472526 [details] [diff] [review]
NJ patch (against TM 52847:857ab0bfeb22)


any risk of header file collisions with MIN and MAX #defined in LIR.h? (I got one hit with eclipse's "find references" search, on macosx, in /Developer/SDKs/MacOSX10.4u.sdk/usr/include/sys/param.h, but it appears to be benign.

Should Interval values be passed/returned by reference instead of by value?  Not sure how bad the struct copy is on 32bit machines. sizeof(interval) = 17 or 20 depending on sizeof(bool).

Would anything go wrong with LIR_rshui(x, 32)?  IIRC, LIR follows x86 semantics, the low 5 bits of the shift value are used, the others are ignored.  (gah, this might not be documented).  the interval analyzer for case LIR_rshui is doing if (... y > 0), but maybe should be (... (y&31) > 0).   (same question for rshi.  

note, lshi(x, c) is the same as muli(x, (1<<c)), missed opportunity?  not important for TM code?  (just asking)

if there's a feasible way to exhaustively test the boundary conditions, it would be worth it; bugs will be hard to find in this code when they dont lead to crashes.  R+ with the rshui logic double-checked and/or tested.
Comment 16 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-09-14 09:24:47 PDT
(In reply to comment #15)
> 
> any risk of header file collisions with MIN and MAX #defined in LIR.h? (I got

I've changed them to NJ_MIN and NJ_MAX and moved them into nanojit.h.

> Should Interval values be passed/returned by reference instead of by value? 
> Not sure how bad the struct copy is on 32bit machines. sizeof(interval) = 17 or
> 20 depending on sizeof(bool).

My cachegrind analysis was that the cost of the analysis was negligible.  I originally tried passing by reference but it was a total pain because you frequently need to return a new interval from a function so it required dynamic memory allocations.

> Would anything go wrong with LIR_rshui(x, 32)?  IIRC, LIR follows x86
> semantics, the low 5 bits of the shift value are used, the others are ignored. 
> (gah, this might not be documented).  the interval analyzer for case LIR_rshui
> is doing if (... y > 0), but maybe should be (... (y&31) > 0).   (same question
> for rshi.  

I've masked out the high 27 bits.

> note, lshi(x, c) is the same as muli(x, (1<<c)), missed opportunity?  not
> important for TM code?  (just asking)

Yeah, it's not important so far, so I figured not doing it gave me one less chance to get it wrong.

> if there's a feasible way to exhaustively test the boundary conditions, it
> would be worth it; bugs will be hard to find in this code when they dont lead
> to crashes.  R+ with the rshui logic double-checked and/or tested.

Exhaustive testing is tricky;  you'd have to restrict 8 bits values or less to make it plausible.  I'll give it some more thought.
Comment 17 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-09-14 10:28:59 PDT
> Exhaustive testing is tricky;  you'd have to restrict 8 bits values or less to
> make it plausible.  I'll give it some more thought.

I just tried doing that, I found writing the testing code correctly more difficult than just looking at the interval analysis and checking them mentally.  The only cases to worry about are addi, subi, andi, muli, rshi, rshui, modi and cmovi, and I've restricted the cases where it does something clever so much that I think they're pretty easy to mentally check.  But we'll see what Julian thinks.
Comment 18 Julian Seward [:jseward] 2010-09-16 16:37:13 PDT
Comment on attachment 472526 [details] [diff] [review]
NJ patch (against TM 52847:857ab0bfeb22)

r+ with the following comments considered (not necessarily actioned):

Interval::of: check lim > 0 and give up once, at the start of the fn,
rather than doing it individually.  This would make it clearer it's
a standard limited-depth traversal backwards.

Interval::of, case LIR_rshui
you have   if (ins->oprnd2()->isImmI() && lim > 0) and then
           if ... (!x.hasOverflowed && (x.lo >= 0 || y > 0))
producing
   Interval(0, x.hi >> y)
what if y > 31  ?   Then x.hi >> y is undefined a la C++ standard.  Can that
ever happen?  Should the guards explicitly exclude it?


Interval::of, case LIR_rshi
ditto, does y need to be range-checked?
do the 1s in (1 << (31-y)) need to explicitly be made int64_ts 
before the shift?

Interval::of, case LIR_rshi
could you easily do better with
   Interval( MAX( -(1<<(31-y)), lo),  MIN((1<<(31-y)-1), hi)
as it stands it can produce a larger result interval than the left
operand, even though it is strictly moving the left operand interval
closer to zero.  But since this adds to the verification burden,
don't bother if it isn't necessary, I suppose.


Interval::of, case LIR_modi
   Seems to give semantics for the case y == 0.  Wouldn't it be safer
to say I don't know in that case?


bool isSane():
would this be clearer?  I think it's correct as it stands, but not
sure.

  return (hasOverflowed && lo == UNTRUSTWORTHY && hi == UNTRUSTWORTHY) ||
         (!hasOverflowed && lo <= hi && I32_MIN <= lo && hi <= I32_MAX);


How will you know if the analysis is punting on case which might be
useful to handle, eg, some op applied to non-overflowed operands,
that you actually could and should handle?  can you add some debug
printing to show those cases?


I think you can get away without a standalone exhaustive test program.
But if it gets any more aggressive than this, then it might become
necessary.
Comment 19 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-09-16 16:53:51 PDT
(In reply to comment #18)
> 
> Interval::of: check lim > 0 and give up once, at the start of the fn,
> rather than doing it individually.  This would make it clearer it's
> a standard limited-depth traversal backwards.

Good idea, I'll do it.

> Interval::of, case LIR_rshui
> you have   if (ins->oprnd2()->isImmI() && lim > 0) and then
>            if ... (!x.hasOverflowed && (x.lo >= 0 || y > 0))
> producing
>    Interval(0, x.hi >> y)
> what if y > 31  ?   Then x.hi >> y is undefined a la C++ standard.  Can that
> ever happen?  Should the guards explicitly exclude it?
> 
> Interval::of, case LIR_rshi
> ditto, does y need to be range-checked?
> do the 1s in (1 << (31-y)) need to explicitly be made int64_ts 
> before the shift?

Semantics of LIR are that only the bottom 5 bits of the RHS operand of shifts are considered.  I'll update it account for this.

> Interval::of, case LIR_rshi
> could you easily do better with
>    Interval( MAX( -(1<<(31-y)), lo),  MIN((1<<(31-y)-1), hi)
> as it stands it can produce a larger result interval than the left
> operand, even though it is strictly moving the left operand interval
> closer to zero.  But since this adds to the verification burden,
> don't bother if it isn't necessary, I suppose.

Yeah, I'll keep it simpler.

> Interval::of, case LIR_modi
>    Seems to give semantics for the case y == 0.  Wouldn't it be safer
> to say I don't know in that case?

It'll crash if that happens, but I'll add a y!=0 test.

> bool isSane():
> would this be clearer?  I think it's correct as it stands, but not
> sure.
> 
>   return (hasOverflowed && lo == UNTRUSTWORTHY && hi == UNTRUSTWORTHY) ||
>          (!hasOverflowed && lo <= hi && I32_MIN <= lo && hi <= I32_MAX);

Yes, that's stricter and thus better, I'll change.
 
> How will you know if the analysis is punting on case which might be
> useful to handle, eg, some op applied to non-overflowed operands,
> that you actually could and should handle?  can you add some debug
> printing to show those cases?

I won't, but the output would be huge.  I'm happy with the trade-off between simplicity and effectiveness as it stands.
Comment 20 Nicholas Nethercote [:njn] (on vacation until July 11) 2010-09-21 00:09:08 PDT
http://hg.mozilla.org/projects/nanojit-central/rev/8f99b0ecb758

http://hg.mozilla.org/tracemonkey/rev/e22d91ec5e97
http://hg.mozilla.org/tracemonkey/rev/fa09cd0f1fed

This was about a 3ms win on AWFY for sunspider.
Comment 21 Robert Sayre 2010-09-21 08:48:18 PDT
We're hitting a nanojit assertion the tinderbox:

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1285078519.1285079617.21105.gz&fulltext=1#err1

stack included.

Note You need to log in before you can comment on or make changes to this bug.