Closed Bug 575529 Opened 14 years ago Closed 14 years ago

TM: use interval analysis to remove more overflow checks

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

Details

(Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey,fixed-in-tamarin)

Attachments

(2 files, 2 obsolete files)

Attached patch patch (obsolete) — Splinter Review
TM currently has a function IsOverflowSafe() which is used to remove some overflow checks for demoted (integer) arithmetic operations.  It's pretty weak.

The attached patch does a basic interval analysis that does a better job.  It does a great job for crypto-md5, roughly halving the size of the final LIR, and reducing the number of static overflow checks from 416 to 26.  This is because of this function:

function safe_add(x, y)
{
  var lsw = (x & 0xFFFF) + (y & 0xFFFF);
  var msw = (x >> 16) + (y >> 16) + (lsw >> 16);
  return (msw << 16) | (lsw & 0xFFFF);
}

The new analysis can tell that no overflow checks are needed for any of the additions.  This speeds up crypto-md5 by 1.25--1.50x (it varies across platforms).  It also speeds crypto-sha1 by 1.05--1.10x.

I'd appreciate a careful review, particularly of the interval arithmetic.  I've checked it pretty carefully but more eyes are welcome.  For a while I tried to do something smarter, esp. with LIR_andi, but it got complicated and I started getting nervous.

One thing worth mentioning is that the analysis redoes some work.  Eg. if you have an expression (a+b)+(c+d) it'll analyze a, b, c, d, a+b, c+d, (a+b)+(c+d).  I could try to memoize prior results but it doesn't seem important.

Another thing is that the checking for -0 after LIR_muli has changed.  It may be too optimistic.  For example, if you have x*2 where you don't know anything about the interval of x, I assume a check isn't needed.  But it's unclear to me if x could be -0.  We seem to guard against -0 occurring after div, mul and mod, but not after d2i which makes me nervous.

(One notable thing about -0 checks is that they trigger quite often, even when they don't need to.  So they're even more undesirable than overflow checks.)

Beyond this patch, I'd love it if we could get more out of this analysis by important possible intervals for eg. values on the stack.  If it could be done, we could possibly remove a lot more overflow checks.  I'm not sure how hard it would be, though.
Attachment #454789 - Flags: review?(gal)
Attachment #454789 - Flags: review?(jseward)
Ed, I think you guys would like this, too. We should move it into NJ.
+1
Hmm, TM has some extra guards to avoid div-by-zero and mod-by-zero.  The interval analysis could improve those cases as well.
Attached patch patch, v2 (obsolete) — Splinter Review
I moved all the generic interval analysis stuff into NJ;  I haven't split the NJ and TM patches yet though, doesn't seem worth it yet.  I also cleaned up interval::of() a bit.

I'm more confident now that the needsNegZeroCheck-related code is correct.  But I'm still worried that we could be losing -0 in several of the cases in d2i() (which would be a pre-existing bug).
Attachment #454789 - Attachment is obsolete: true
Attachment #455059 - Flags: review?(gal)
Attachment #454789 - Flags: review?(jseward)
Attachment #454789 - Flags: review?(gal)
Attachment #455059 - Flags: review?(jseward)
How much do you recurs repeatedly over the same LIR patterns? Would it be worth it storing the result of of() in a hash table?
(In reply to comment #5)
> How much do you recurs repeatedly over the same LIR patterns? Would it be worth
> it storing the result of of() in a hash table?

It depends on the program, in particular how long its expression chains are.  It appeared to be negligible for crypto-md5, which seems like a case with pretty long chains.  So I suspect a hash table wouldn't be worth the effort, and might even slow it down.
(In reply to comment #6)
>
> It appeared to be negligible for crypto-md5, which seems like a case with
> pretty long chains.

To quantify that:  0.28M instructions out of 42.3M.  In comparison, VisitFrameSlots<js::CountSlotsVisitor> accounts for 1.7M.
Comment on attachment 455059 [details] [diff] [review]
patch, v2

Julian was worried about the propagation of overflows within the analysis.  To cut a long story short, I think that, given the way that TM uses the analysis, it's safe.  But the safety argument is subtle.  Furthermore, the interval analysis code is in NJ and so I can't only consider how TM uses it.

So adding an explicit "has overflowed" value might help here.  It might also speed the analysis up on big expression chains, if I can stop working backwards as soon as an overflow value is hit.

I'll work up a new patch.
Attachment #455059 - Flags: review?(jseward)
Attachment #455059 - Flags: review?(gal)
Note to self:  I could strengthen the analysis considerably by looking at guards.  Eg. if there's a "exit if i > 100" guard then I know that after the guard that i <= 100.

I wonder if it's worth doing anything symbolically, too.
In bug 580468 comment 20 Travis Fisher suggested: 

"Regarding interval analysis and real-world testcases, it might have a good
impact on some of the "emulators are slow" complaints, like bug 509986 or bug
514765.  I would bet the hot core of an 8 bit emulator has a lot of integer ALU
operations that can be statically shown to have 8 or 16 bit results."

Though tracing the main loop of an emulator is probably a bad idea in general, JM should do much better on such coode, which would render the idea moot.
(In reply to comment #6)
> (In reply to comment #5)
> > How much do you recurs repeatedly over the same LIR patterns? Would it be worth
> > it storing the result of of() in a hash table?
> 
> It depends on the program, in particular how long its expression chains are. 
> It appeared to be negligible for crypto-md5, which seems like a case with
> pretty long chains.

That turns out to be untrue.  I was handling 'or' pessimistically, which breaks up the expression chains.  If I don't do that the expressions chains get very large and intertwined and I get an appalling blow-up.

A function like this is the pathological case:

function f(x, y) {
    var a = x + y;
    var b = a + a;
    var c = b + b;
    var d = c + c;
    var e = d + d;
    var f = e + e;
    var g = f + f;
    var h = g + g;
    return h;
}


> So I suspect a hash table wouldn't be worth the effort,
> and might even slow it down.

A hash table would clearly help with crypto-md5 and other extreme examples, but it would definitely slow the average case down -- CSE is the most expensive part of the NJ pipeline and it does a hash table lookup for many instructions, and this optimization is less generally applicable than CSE.

If we could store the bounds in the instructions that would make things easy and fast, but that doesn't seem reasonable.

Maybe having a counter that limits how far back the analysis goes would be a decent compromise.  It's true for safe_add() that going back 2 or 3 operands is enough and it's probably true for other cases where this optimization is effective.
Just the TM patch.  Fairly straightforward, most of the tricky stuff is in the NJ patch.
Attachment #455059 - Attachment is obsolete: true
Attachment #472524 - Flags: review?(gal)
Nanojit patch.  

- I added the notion of an 'overflowed' value that propagates through add/sub/mul like 'NaN'.  

- I simplified a number of the cases, eg. LIR_andi, LIR_rshi, LIR_rshui, so they're easier to understand but still cover the common cases.

- I added a recursion limit on Interval::of() to avoid analysis blow-ups.

Cachegrind results:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|    95.193    95.226 (------) |    25.159    25.160 (------) | 3d-cube
|    57.410    57.322 (1.002x) |    23.099    23.012 (1.004x) | 3d-morph
|   116.739   116.792 (------) |    44.065    44.066 (------) | 3d-raytrace
|    73.043    73.048 (------) |    15.220    15.220 (------) | access-binary-
|   133.642   133.646 (------) |    82.806    82.806 (------) | access-fannkuc
|    37.526    37.541 (------) |    16.700    16.701 (------) | access-nbody
|    61.691    61.695 (------) |    28.901    28.901 (------) | access-nsieve
|    16.349    16.350 (------) |     3.249     3.249 (------) | bitops-3bit-bi
|    45.911    45.914 (------) |    32.701    32.701 (------) | bitops-bits-in
|    24.779    24.782 (------) |    12.019    12.019 (------) | bitops-bitwise
|    63.619    63.623 (------) |    38.301    38.301 (------) | bitops-nsieve-
|    49.008    49.010 (------) |    19.438    19.438 (------) | controlflow-re
|    47.619    41.988 (1.134x) |     5.964     4.406 (1.353x) | crypto-md5
|    30.330    29.145 (1.041x) |     6.809     6.220 (1.095x) | crypto-sha1
|    97.604    97.642 (------) |    17.295    17.295 (------) | date-format-to
|    83.587    83.603 (------) |     9.767     9.768 (------) | date-format-xp
|    54.274    54.278 (------) |    31.292    31.292 (------) | math-cordic
|    29.549    29.482 (1.002x) |     6.232     6.169 (1.010x) | math-partial-s
|    31.019    31.033 (------) |    13.319    13.319 (------) | math-spectral-
|    58.528    58.530 (------) |    34.592    34.592 (------) | regexp-dna
|    39.746    39.752 (------) |     9.408     9.408 (------) | string-base64
|   118.457   118.207 (1.002x) |    25.728    25.504 (1.009x) | string-fasta
|   139.147   139.133 (------) |    17.703    17.703 (------) | string-tagclou
|   180.727   180.666 (------) |    21.897    21.897 (------) | string-unpack-
|    58.414    58.415 (------) |    11.706    11.706 (------) | string-validat
-------
|  1743.921  1736.835 (1.004x) |   553.382   550.862 (1.005x) | all


crypto-md5 and crypto-sha1 benefit because no addxovi's occur in safe_add(), the other minor improvements are because some -0 checks are removed from some LIR_muli operations.

Timing shows about a 2--3ms win on crypto-md5, other improvements are in the noise.

It's worth noting that the crypto-md5 improvement is mostly in compile-time -- the number of guards drops greatly, which greatly reduces the number of stack stores that are necessary, which causes lots of code to become dead.  Although the trace JIT gives us bad numbers on crypto-md5 because of the high compile-time costs, the code it generates is actually very good.  I tried greatly increasing the amount of text processed by the benchmark greatly and got these figures:

  js -m:                  5.3s
  js -j, before patch:    2.0s
  js -j, after patch:     0.8s

I bet we'd thrash JSC and V8 on an input of this size.
Attachment #472526 - Flags: review?(edwsmith)
Attachment #472526 - Flags: review?(jseward)
Comment on attachment 472524 [details] [diff] [review]
TM patch (against TM 52847:857ab0bfeb22)

ChecksRequired instead of checksRequired seems to be the standard. I am not a big fan of reference parameters. Pointers are more explicit. Just a nit. Up to you.
Attachment #472524 - Flags: review?(gal) → review+
Comment on attachment 472526 [details] [diff] [review]
NJ patch (against TM 52847:857ab0bfeb22)


any risk of header file collisions with MIN and MAX #defined in LIR.h? (I got one hit with eclipse's "find references" search, on macosx, in /Developer/SDKs/MacOSX10.4u.sdk/usr/include/sys/param.h, but it appears to be benign.

Should Interval values be passed/returned by reference instead of by value?  Not sure how bad the struct copy is on 32bit machines. sizeof(interval) = 17 or 20 depending on sizeof(bool).

Would anything go wrong with LIR_rshui(x, 32)?  IIRC, LIR follows x86 semantics, the low 5 bits of the shift value are used, the others are ignored.  (gah, this might not be documented).  the interval analyzer for case LIR_rshui is doing if (... y > 0), but maybe should be (... (y&31) > 0).   (same question for rshi.  

note, lshi(x, c) is the same as muli(x, (1<<c)), missed opportunity?  not important for TM code?  (just asking)

if there's a feasible way to exhaustively test the boundary conditions, it would be worth it; bugs will be hard to find in this code when they dont lead to crashes.  R+ with the rshui logic double-checked and/or tested.
Attachment #472526 - Flags: review?(edwsmith) → review+
(In reply to comment #15)
> 
> any risk of header file collisions with MIN and MAX #defined in LIR.h? (I got

I've changed them to NJ_MIN and NJ_MAX and moved them into nanojit.h.

> Should Interval values be passed/returned by reference instead of by value? 
> Not sure how bad the struct copy is on 32bit machines. sizeof(interval) = 17 or
> 20 depending on sizeof(bool).

My cachegrind analysis was that the cost of the analysis was negligible.  I originally tried passing by reference but it was a total pain because you frequently need to return a new interval from a function so it required dynamic memory allocations.

> Would anything go wrong with LIR_rshui(x, 32)?  IIRC, LIR follows x86
> semantics, the low 5 bits of the shift value are used, the others are ignored. 
> (gah, this might not be documented).  the interval analyzer for case LIR_rshui
> is doing if (... y > 0), but maybe should be (... (y&31) > 0).   (same question
> for rshi.  

I've masked out the high 27 bits.

> note, lshi(x, c) is the same as muli(x, (1<<c)), missed opportunity?  not
> important for TM code?  (just asking)

Yeah, it's not important so far, so I figured not doing it gave me one less chance to get it wrong.

> if there's a feasible way to exhaustively test the boundary conditions, it
> would be worth it; bugs will be hard to find in this code when they dont lead
> to crashes.  R+ with the rshui logic double-checked and/or tested.

Exhaustive testing is tricky;  you'd have to restrict 8 bits values or less to make it plausible.  I'll give it some more thought.
> Exhaustive testing is tricky;  you'd have to restrict 8 bits values or less to
> make it plausible.  I'll give it some more thought.

I just tried doing that, I found writing the testing code correctly more difficult than just looking at the interval analysis and checking them mentally.  The only cases to worry about are addi, subi, andi, muli, rshi, rshui, modi and cmovi, and I've restricted the cases where it does something clever so much that I think they're pretty easy to mentally check.  But we'll see what Julian thinks.
Comment on attachment 472526 [details] [diff] [review]
NJ patch (against TM 52847:857ab0bfeb22)

r+ with the following comments considered (not necessarily actioned):

Interval::of: check lim > 0 and give up once, at the start of the fn,
rather than doing it individually.  This would make it clearer it's
a standard limited-depth traversal backwards.

Interval::of, case LIR_rshui
you have   if (ins->oprnd2()->isImmI() && lim > 0) and then
           if ... (!x.hasOverflowed && (x.lo >= 0 || y > 0))
producing
   Interval(0, x.hi >> y)
what if y > 31  ?   Then x.hi >> y is undefined a la C++ standard.  Can that
ever happen?  Should the guards explicitly exclude it?


Interval::of, case LIR_rshi
ditto, does y need to be range-checked?
do the 1s in (1 << (31-y)) need to explicitly be made int64_ts 
before the shift?

Interval::of, case LIR_rshi
could you easily do better with
   Interval( MAX( -(1<<(31-y)), lo),  MIN((1<<(31-y)-1), hi)
as it stands it can produce a larger result interval than the left
operand, even though it is strictly moving the left operand interval
closer to zero.  But since this adds to the verification burden,
don't bother if it isn't necessary, I suppose.


Interval::of, case LIR_modi
   Seems to give semantics for the case y == 0.  Wouldn't it be safer
to say I don't know in that case?


bool isSane():
would this be clearer?  I think it's correct as it stands, but not
sure.

  return (hasOverflowed && lo == UNTRUSTWORTHY && hi == UNTRUSTWORTHY) ||
         (!hasOverflowed && lo <= hi && I32_MIN <= lo && hi <= I32_MAX);


How will you know if the analysis is punting on case which might be
useful to handle, eg, some op applied to non-overflowed operands,
that you actually could and should handle?  can you add some debug
printing to show those cases?


I think you can get away without a standalone exhaustive test program.
But if it gets any more aggressive than this, then it might become
necessary.
Attachment #472526 - Flags: review?(jseward) → review+
(In reply to comment #18)
> 
> Interval::of: check lim > 0 and give up once, at the start of the fn,
> rather than doing it individually.  This would make it clearer it's
> a standard limited-depth traversal backwards.

Good idea, I'll do it.

> Interval::of, case LIR_rshui
> you have   if (ins->oprnd2()->isImmI() && lim > 0) and then
>            if ... (!x.hasOverflowed && (x.lo >= 0 || y > 0))
> producing
>    Interval(0, x.hi >> y)
> what if y > 31  ?   Then x.hi >> y is undefined a la C++ standard.  Can that
> ever happen?  Should the guards explicitly exclude it?
> 
> Interval::of, case LIR_rshi
> ditto, does y need to be range-checked?
> do the 1s in (1 << (31-y)) need to explicitly be made int64_ts 
> before the shift?

Semantics of LIR are that only the bottom 5 bits of the RHS operand of shifts are considered.  I'll update it account for this.

> Interval::of, case LIR_rshi
> could you easily do better with
>    Interval( MAX( -(1<<(31-y)), lo),  MIN((1<<(31-y)-1), hi)
> as it stands it can produce a larger result interval than the left
> operand, even though it is strictly moving the left operand interval
> closer to zero.  But since this adds to the verification burden,
> don't bother if it isn't necessary, I suppose.

Yeah, I'll keep it simpler.

> Interval::of, case LIR_modi
>    Seems to give semantics for the case y == 0.  Wouldn't it be safer
> to say I don't know in that case?

It'll crash if that happens, but I'll add a y!=0 test.

> bool isSane():
> would this be clearer?  I think it's correct as it stands, but not
> sure.
> 
>   return (hasOverflowed && lo == UNTRUSTWORTHY && hi == UNTRUSTWORTHY) ||
>          (!hasOverflowed && lo <= hi && I32_MIN <= lo && hi <= I32_MAX);

Yes, that's stricter and thus better, I'll change.
 
> How will you know if the analysis is punting on case which might be
> useful to handle, eg, some op applied to non-overflowed operands,
> that you actually could and should handle?  can you add some debug
> printing to show those cases?

I won't, but the output would be huge.  I'm happy with the trade-off between simplicity and effectiveness as it stands.
blocking2.0: --- → ?
http://hg.mozilla.org/projects/nanojit-central/rev/8f99b0ecb758

http://hg.mozilla.org/tracemonkey/rev/e22d91ec5e97
http://hg.mozilla.org/tracemonkey/rev/fa09cd0f1fed

This was about a 3ms win on AWFY for sunspider.
blocking2.0: ? → ---
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey
We're hitting a nanojit assertion the tinderbox:

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1285078519.1285079617.21105.gz&fulltext=1#err1

stack included.
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey,fixed-in-tamarin
http://hg.mozilla.org/mozilla-central/rev/e22d91ec5e97
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.