Closed
Bug 575529
Opened 14 years ago
Closed 14 years ago
TM: use interval analysis to remove more overflow checks
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: n.nethercote, Assigned: n.nethercote)
Details
(Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey,fixed-in-tamarin)
Attachments
(2 files, 2 obsolete files)
4.23 KB,
patch
|
gal
:
review+
|
Details | Diff | Splinter Review |
12.04 KB,
patch
|
edwsmith
:
review+
jseward
:
review+
|
Details | Diff | Splinter Review |
TM currently has a function IsOverflowSafe() which is used to remove some overflow checks for demoted (integer) arithmetic operations. It's pretty weak. The attached patch does a basic interval analysis that does a better job. It does a great job for crypto-md5, roughly halving the size of the final LIR, and reducing the number of static overflow checks from 416 to 26. This is because of this function: function safe_add(x, y) { var lsw = (x & 0xFFFF) + (y & 0xFFFF); var msw = (x >> 16) + (y >> 16) + (lsw >> 16); return (msw << 16) | (lsw & 0xFFFF); } The new analysis can tell that no overflow checks are needed for any of the additions. This speeds up crypto-md5 by 1.25--1.50x (it varies across platforms). It also speeds crypto-sha1 by 1.05--1.10x. I'd appreciate a careful review, particularly of the interval arithmetic. I've checked it pretty carefully but more eyes are welcome. For a while I tried to do something smarter, esp. with LIR_andi, but it got complicated and I started getting nervous. One thing worth mentioning is that the analysis redoes some work. Eg. if you have an expression (a+b)+(c+d) it'll analyze a, b, c, d, a+b, c+d, (a+b)+(c+d). I could try to memoize prior results but it doesn't seem important. Another thing is that the checking for -0 after LIR_muli has changed. It may be too optimistic. For example, if you have x*2 where you don't know anything about the interval of x, I assume a check isn't needed. But it's unclear to me if x could be -0. We seem to guard against -0 occurring after div, mul and mod, but not after d2i which makes me nervous. (One notable thing about -0 checks is that they trigger quite often, even when they don't need to. So they're even more undesirable than overflow checks.) Beyond this patch, I'd love it if we could get more out of this analysis by important possible intervals for eg. values on the stack. If it could be done, we could possibly remove a lot more overflow checks. I'm not sure how hard it would be, though.
Attachment #454789 -
Flags: review?(gal)
Assignee | ||
Updated•14 years ago
|
Attachment #454789 -
Flags: review?(jseward)
Comment 1•14 years ago
|
||
Ed, I think you guys would like this, too. We should move it into NJ.
Comment 2•14 years ago
|
||
+1
Assignee | ||
Comment 3•14 years ago
|
||
Hmm, TM has some extra guards to avoid div-by-zero and mod-by-zero. The interval analysis could improve those cases as well.
Assignee | ||
Comment 4•14 years ago
|
||
I moved all the generic interval analysis stuff into NJ; I haven't split the NJ and TM patches yet though, doesn't seem worth it yet. I also cleaned up interval::of() a bit. I'm more confident now that the needsNegZeroCheck-related code is correct. But I'm still worried that we could be losing -0 in several of the cases in d2i() (which would be a pre-existing bug).
Attachment #454789 -
Attachment is obsolete: true
Attachment #455059 -
Flags: review?(gal)
Attachment #454789 -
Flags: review?(jseward)
Attachment #454789 -
Flags: review?(gal)
Assignee | ||
Updated•14 years ago
|
Attachment #455059 -
Flags: review?(jseward)
Comment 5•14 years ago
|
||
How much do you recurs repeatedly over the same LIR patterns? Would it be worth it storing the result of of() in a hash table?
Assignee | ||
Comment 6•14 years ago
|
||
(In reply to comment #5) > How much do you recurs repeatedly over the same LIR patterns? Would it be worth > it storing the result of of() in a hash table? It depends on the program, in particular how long its expression chains are. It appeared to be negligible for crypto-md5, which seems like a case with pretty long chains. So I suspect a hash table wouldn't be worth the effort, and might even slow it down.
Assignee | ||
Comment 7•14 years ago
|
||
(In reply to comment #6) > > It appeared to be negligible for crypto-md5, which seems like a case with > pretty long chains. To quantify that: 0.28M instructions out of 42.3M. In comparison, VisitFrameSlots<js::CountSlotsVisitor> accounts for 1.7M.
Assignee | ||
Comment 8•14 years ago
|
||
Comment on attachment 455059 [details] [diff] [review] patch, v2 Julian was worried about the propagation of overflows within the analysis. To cut a long story short, I think that, given the way that TM uses the analysis, it's safe. But the safety argument is subtle. Furthermore, the interval analysis code is in NJ and so I can't only consider how TM uses it. So adding an explicit "has overflowed" value might help here. It might also speed the analysis up on big expression chains, if I can stop working backwards as soon as an overflow value is hit. I'll work up a new patch.
Attachment #455059 -
Flags: review?(jseward)
Attachment #455059 -
Flags: review?(gal)
Assignee | ||
Comment 9•14 years ago
|
||
Note to self: I could strengthen the analysis considerably by looking at guards. Eg. if there's a "exit if i > 100" guard then I know that after the guard that i <= 100. I wonder if it's worth doing anything symbolically, too.
Assignee | ||
Comment 10•14 years ago
|
||
In bug 580468 comment 20 Travis Fisher suggested: "Regarding interval analysis and real-world testcases, it might have a good impact on some of the "emulators are slow" complaints, like bug 509986 or bug 514765. I would bet the hot core of an 8 bit emulator has a lot of integer ALU operations that can be statically shown to have 8 or 16 bit results." Though tracing the main loop of an emulator is probably a bad idea in general, JM should do much better on such coode, which would render the idea moot.
Assignee | ||
Comment 11•14 years ago
|
||
(In reply to comment #6) > (In reply to comment #5) > > How much do you recurs repeatedly over the same LIR patterns? Would it be worth > > it storing the result of of() in a hash table? > > It depends on the program, in particular how long its expression chains are. > It appeared to be negligible for crypto-md5, which seems like a case with > pretty long chains. That turns out to be untrue. I was handling 'or' pessimistically, which breaks up the expression chains. If I don't do that the expressions chains get very large and intertwined and I get an appalling blow-up. A function like this is the pathological case: function f(x, y) { var a = x + y; var b = a + a; var c = b + b; var d = c + c; var e = d + d; var f = e + e; var g = f + f; var h = g + g; return h; } > So I suspect a hash table wouldn't be worth the effort, > and might even slow it down. A hash table would clearly help with crypto-md5 and other extreme examples, but it would definitely slow the average case down -- CSE is the most expensive part of the NJ pipeline and it does a hash table lookup for many instructions, and this optimization is less generally applicable than CSE. If we could store the bounds in the instructions that would make things easy and fast, but that doesn't seem reasonable. Maybe having a counter that limits how far back the analysis goes would be a decent compromise. It's true for safe_add() that going back 2 or 3 operands is enough and it's probably true for other cases where this optimization is effective.
Assignee | ||
Comment 12•14 years ago
|
||
Just the TM patch. Fairly straightforward, most of the tricky stuff is in the NJ patch.
Attachment #455059 -
Attachment is obsolete: true
Attachment #472524 -
Flags: review?(gal)
Assignee | ||
Comment 13•14 years ago
|
||
Nanojit patch. - I added the notion of an 'overflowed' value that propagates through add/sub/mul like 'NaN'. - I simplified a number of the cases, eg. LIR_andi, LIR_rshi, LIR_rshui, so they're easier to understand but still cover the common cases. - I added a recursion limit on Interval::of() to avoid analysis blow-ups. Cachegrind results: --------------------------------------------------------------- | millions of instructions executed | | total | on-trace (may overestimate) | --------------------------------------------------------------- | 95.193 95.226 (------) | 25.159 25.160 (------) | 3d-cube | 57.410 57.322 (1.002x) | 23.099 23.012 (1.004x) | 3d-morph | 116.739 116.792 (------) | 44.065 44.066 (------) | 3d-raytrace | 73.043 73.048 (------) | 15.220 15.220 (------) | access-binary- | 133.642 133.646 (------) | 82.806 82.806 (------) | access-fannkuc | 37.526 37.541 (------) | 16.700 16.701 (------) | access-nbody | 61.691 61.695 (------) | 28.901 28.901 (------) | access-nsieve | 16.349 16.350 (------) | 3.249 3.249 (------) | bitops-3bit-bi | 45.911 45.914 (------) | 32.701 32.701 (------) | bitops-bits-in | 24.779 24.782 (------) | 12.019 12.019 (------) | bitops-bitwise | 63.619 63.623 (------) | 38.301 38.301 (------) | bitops-nsieve- | 49.008 49.010 (------) | 19.438 19.438 (------) | controlflow-re | 47.619 41.988 (1.134x) | 5.964 4.406 (1.353x) | crypto-md5 | 30.330 29.145 (1.041x) | 6.809 6.220 (1.095x) | crypto-sha1 | 97.604 97.642 (------) | 17.295 17.295 (------) | date-format-to | 83.587 83.603 (------) | 9.767 9.768 (------) | date-format-xp | 54.274 54.278 (------) | 31.292 31.292 (------) | math-cordic | 29.549 29.482 (1.002x) | 6.232 6.169 (1.010x) | math-partial-s | 31.019 31.033 (------) | 13.319 13.319 (------) | math-spectral- | 58.528 58.530 (------) | 34.592 34.592 (------) | regexp-dna | 39.746 39.752 (------) | 9.408 9.408 (------) | string-base64 | 118.457 118.207 (1.002x) | 25.728 25.504 (1.009x) | string-fasta | 139.147 139.133 (------) | 17.703 17.703 (------) | string-tagclou | 180.727 180.666 (------) | 21.897 21.897 (------) | string-unpack- | 58.414 58.415 (------) | 11.706 11.706 (------) | string-validat ------- | 1743.921 1736.835 (1.004x) | 553.382 550.862 (1.005x) | all crypto-md5 and crypto-sha1 benefit because no addxovi's occur in safe_add(), the other minor improvements are because some -0 checks are removed from some LIR_muli operations. Timing shows about a 2--3ms win on crypto-md5, other improvements are in the noise. It's worth noting that the crypto-md5 improvement is mostly in compile-time -- the number of guards drops greatly, which greatly reduces the number of stack stores that are necessary, which causes lots of code to become dead. Although the trace JIT gives us bad numbers on crypto-md5 because of the high compile-time costs, the code it generates is actually very good. I tried greatly increasing the amount of text processed by the benchmark greatly and got these figures: js -m: 5.3s js -j, before patch: 2.0s js -j, after patch: 0.8s I bet we'd thrash JSC and V8 on an input of this size.
Attachment #472526 -
Flags: review?(edwsmith)
Assignee | ||
Updated•14 years ago
|
Attachment #472526 -
Flags: review?(jseward)
Comment 14•14 years ago
|
||
Comment on attachment 472524 [details] [diff] [review] TM patch (against TM 52847:857ab0bfeb22) ChecksRequired instead of checksRequired seems to be the standard. I am not a big fan of reference parameters. Pointers are more explicit. Just a nit. Up to you.
Attachment #472524 -
Flags: review?(gal) → review+
Comment 15•14 years ago
|
||
Comment on attachment 472526 [details] [diff] [review] NJ patch (against TM 52847:857ab0bfeb22) any risk of header file collisions with MIN and MAX #defined in LIR.h? (I got one hit with eclipse's "find references" search, on macosx, in /Developer/SDKs/MacOSX10.4u.sdk/usr/include/sys/param.h, but it appears to be benign. Should Interval values be passed/returned by reference instead of by value? Not sure how bad the struct copy is on 32bit machines. sizeof(interval) = 17 or 20 depending on sizeof(bool). Would anything go wrong with LIR_rshui(x, 32)? IIRC, LIR follows x86 semantics, the low 5 bits of the shift value are used, the others are ignored. (gah, this might not be documented). the interval analyzer for case LIR_rshui is doing if (... y > 0), but maybe should be (... (y&31) > 0). (same question for rshi. note, lshi(x, c) is the same as muli(x, (1<<c)), missed opportunity? not important for TM code? (just asking) if there's a feasible way to exhaustively test the boundary conditions, it would be worth it; bugs will be hard to find in this code when they dont lead to crashes. R+ with the rshui logic double-checked and/or tested.
Attachment #472526 -
Flags: review?(edwsmith) → review+
Assignee | ||
Comment 16•14 years ago
|
||
(In reply to comment #15) > > any risk of header file collisions with MIN and MAX #defined in LIR.h? (I got I've changed them to NJ_MIN and NJ_MAX and moved them into nanojit.h. > Should Interval values be passed/returned by reference instead of by value? > Not sure how bad the struct copy is on 32bit machines. sizeof(interval) = 17 or > 20 depending on sizeof(bool). My cachegrind analysis was that the cost of the analysis was negligible. I originally tried passing by reference but it was a total pain because you frequently need to return a new interval from a function so it required dynamic memory allocations. > Would anything go wrong with LIR_rshui(x, 32)? IIRC, LIR follows x86 > semantics, the low 5 bits of the shift value are used, the others are ignored. > (gah, this might not be documented). the interval analyzer for case LIR_rshui > is doing if (... y > 0), but maybe should be (... (y&31) > 0). (same question > for rshi. I've masked out the high 27 bits. > note, lshi(x, c) is the same as muli(x, (1<<c)), missed opportunity? not > important for TM code? (just asking) Yeah, it's not important so far, so I figured not doing it gave me one less chance to get it wrong. > if there's a feasible way to exhaustively test the boundary conditions, it > would be worth it; bugs will be hard to find in this code when they dont lead > to crashes. R+ with the rshui logic double-checked and/or tested. Exhaustive testing is tricky; you'd have to restrict 8 bits values or less to make it plausible. I'll give it some more thought.
Assignee | ||
Comment 17•14 years ago
|
||
> Exhaustive testing is tricky; you'd have to restrict 8 bits values or less to
> make it plausible. I'll give it some more thought.
I just tried doing that, I found writing the testing code correctly more difficult than just looking at the interval analysis and checking them mentally. The only cases to worry about are addi, subi, andi, muli, rshi, rshui, modi and cmovi, and I've restricted the cases where it does something clever so much that I think they're pretty easy to mentally check. But we'll see what Julian thinks.
Comment 18•14 years ago
|
||
Comment on attachment 472526 [details] [diff] [review] NJ patch (against TM 52847:857ab0bfeb22) r+ with the following comments considered (not necessarily actioned): Interval::of: check lim > 0 and give up once, at the start of the fn, rather than doing it individually. This would make it clearer it's a standard limited-depth traversal backwards. Interval::of, case LIR_rshui you have if (ins->oprnd2()->isImmI() && lim > 0) and then if ... (!x.hasOverflowed && (x.lo >= 0 || y > 0)) producing Interval(0, x.hi >> y) what if y > 31 ? Then x.hi >> y is undefined a la C++ standard. Can that ever happen? Should the guards explicitly exclude it? Interval::of, case LIR_rshi ditto, does y need to be range-checked? do the 1s in (1 << (31-y)) need to explicitly be made int64_ts before the shift? Interval::of, case LIR_rshi could you easily do better with Interval( MAX( -(1<<(31-y)), lo), MIN((1<<(31-y)-1), hi) as it stands it can produce a larger result interval than the left operand, even though it is strictly moving the left operand interval closer to zero. But since this adds to the verification burden, don't bother if it isn't necessary, I suppose. Interval::of, case LIR_modi Seems to give semantics for the case y == 0. Wouldn't it be safer to say I don't know in that case? bool isSane(): would this be clearer? I think it's correct as it stands, but not sure. return (hasOverflowed && lo == UNTRUSTWORTHY && hi == UNTRUSTWORTHY) || (!hasOverflowed && lo <= hi && I32_MIN <= lo && hi <= I32_MAX); How will you know if the analysis is punting on case which might be useful to handle, eg, some op applied to non-overflowed operands, that you actually could and should handle? can you add some debug printing to show those cases? I think you can get away without a standalone exhaustive test program. But if it gets any more aggressive than this, then it might become necessary.
Attachment #472526 -
Flags: review?(jseward) → review+
Assignee | ||
Comment 19•14 years ago
|
||
(In reply to comment #18) > > Interval::of: check lim > 0 and give up once, at the start of the fn, > rather than doing it individually. This would make it clearer it's > a standard limited-depth traversal backwards. Good idea, I'll do it. > Interval::of, case LIR_rshui > you have if (ins->oprnd2()->isImmI() && lim > 0) and then > if ... (!x.hasOverflowed && (x.lo >= 0 || y > 0)) > producing > Interval(0, x.hi >> y) > what if y > 31 ? Then x.hi >> y is undefined a la C++ standard. Can that > ever happen? Should the guards explicitly exclude it? > > Interval::of, case LIR_rshi > ditto, does y need to be range-checked? > do the 1s in (1 << (31-y)) need to explicitly be made int64_ts > before the shift? Semantics of LIR are that only the bottom 5 bits of the RHS operand of shifts are considered. I'll update it account for this. > Interval::of, case LIR_rshi > could you easily do better with > Interval( MAX( -(1<<(31-y)), lo), MIN((1<<(31-y)-1), hi) > as it stands it can produce a larger result interval than the left > operand, even though it is strictly moving the left operand interval > closer to zero. But since this adds to the verification burden, > don't bother if it isn't necessary, I suppose. Yeah, I'll keep it simpler. > Interval::of, case LIR_modi > Seems to give semantics for the case y == 0. Wouldn't it be safer > to say I don't know in that case? It'll crash if that happens, but I'll add a y!=0 test. > bool isSane(): > would this be clearer? I think it's correct as it stands, but not > sure. > > return (hasOverflowed && lo == UNTRUSTWORTHY && hi == UNTRUSTWORTHY) || > (!hasOverflowed && lo <= hi && I32_MIN <= lo && hi <= I32_MAX); Yes, that's stricter and thus better, I'll change. > How will you know if the analysis is punting on case which might be > useful to handle, eg, some op applied to non-overflowed operands, > that you actually could and should handle? can you add some debug > printing to show those cases? I won't, but the output would be huge. I'm happy with the trade-off between simplicity and effectiveness as it stands.
Updated•14 years ago
|
blocking2.0: --- → ?
Assignee | ||
Comment 20•14 years ago
|
||
http://hg.mozilla.org/projects/nanojit-central/rev/8f99b0ecb758 http://hg.mozilla.org/tracemonkey/rev/e22d91ec5e97 http://hg.mozilla.org/tracemonkey/rev/fa09cd0f1fed This was about a 3ms win on AWFY for sunspider.
blocking2.0: ? → ---
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey
Comment 21•14 years ago
|
||
We're hitting a nanojit assertion the tinderbox: http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1285078519.1285079617.21105.gz&fulltext=1#err1 stack included.
Updated•14 years ago
|
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey,fixed-in-tamarin
Comment 23•14 years ago
|
||
http://hg.mozilla.org/mozilla-central/rev/e22d91ec5e97
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•