Bug 480494
Opened 15 years ago
Closed 15 years ago
[TM] Dromaeo: Drop in crypto/string processing performance
Categories
(Core :: JavaScript Engine, defect, P2)
RESOLVED
WONTFIX
People
(Reporter: rflazaro, Assigned: dmandelin)
Details
(Keywords: perf, regression)
Attachments
(2 files)
70.74 KB, text/plain
1.62 KB, patch (graydon: review-)
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090208 Minefield/3.2a1pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090208 Minefield/3.2a1pre

There is a "significant" drop in crypto/string processing performance. I am including crypto because it involves MD5 and SHA1 hashing, but it might be a string-only issue. See http://dromaeo.com/?id=61195,61196 for the comparison. I only have the 2/4 and 2/8 nightlies; I don't have time to narrow it right now.

Firefox 3.2a1pre ID:20090204032912
Firefox 3.2a1pre ID:20090208034355

Reproducible: Always

Steps to Reproduce:
1. Run http://dromaeo.com/?sunspider-crypto-aes|sunspider-crypto-md5|v8-crypto|sunspider-crypto-sha1 using Firefox 3.2a1pre ID:20090204032912
2. Run http://dromaeo.com/?sunspider-crypto-aes|sunspider-crypto-md5|v8-crypto|sunspider-crypto-sha1 using Firefox 3.2a1pre ID:20090208034355
3. Compare results

Actual Results:
The Firefox 3.2a1pre ID:20090208034355 build is "slower".

Expected Results:
The Firefox 3.2a1pre ID:20090208034355 build should exhibit the same or better performance than the previous build.
The changeset range is between http://hg.mozilla.org/mozilla-central/rev/4ada519b5c97 and http://hg.mozilla.org/mozilla-central/rev/688c44602a55. I don't have the time to narrow it down right now, but I suspect it's a string-related change.
Component: General → JavaScript Engine
Product: Firefox → Core
Updated•15 years ago
Assignee: nobody → general
QA Contact: general → general
I think I've narrowed the range to:

2/4 Firefox 3.2a1pre ID:20090204032912 http://dromaeo.com/?id=61195
2/5 Firefox 3.2a1pre ID:20090205033258 http://dromaeo.com/?id=61512

Comparison: http://dromaeo.com/?id=61195,61512

Checkin range: http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2%2F4%2F09&enddate=2%2F5%2F09

There are a whole lot of checkins in that range, and I don't see any string- or loop-specific code. But there is a "big" TM merge.
Checkin range based on changesets: http://hg.mozilla.org/tracemonkey/pushloghtml?fromchange=4ada519b5c97&tochange=5de9f1e51c68
Keywords: perf, regression
Updated•15 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.9.1?
Summary: Dromaeo: Drop in crypto/string processing performance → [TM] Dromaeo: Drop in crypto/string processing performance
Comment 4•15 years ago
Related to bug 481922? /be
I did not see bug 481922 improving any of the string tests I included. I expanded my test set to include other regressions which involve strings and loops.

Builds tested:
2/4 Firefox 3.2a1pre ID:20090204032912
2/5 Firefox 3.2a1pre ID:20090205033258
3/9 Firefox 3.2a1pre ID:20090309032542

Test: http://dromaeo.com/?sunspider-crypto-aes|sunspider-string-fasta|sunspider-access-fannkuch|sunspider-crypto-md5|sunspider-access-nsieve|sunspider-crypto-sha1

Comparison: http://dromaeo.com/?id=61947,61948,61950

My original report detected slowdowns of about 50% in MD5 Hashing and SHA1 Hashing. Looking back, I also see a slowdown for Prime Number Computation in my original regression range. My current test set shows slowdowns of about 50% for AES Encryption/Decryption, DNA Sequence Alignment, and Fannkuch. These are recent regressions; I have no regression range for these.
Including TM build 20090309021534 in the comparison (test 61951). It shows that (except for DNA Sequence Alignment) we are almost back to pre-regression performance. http://dromaeo.com/?id=61947,61948,61950,61951 Waiting for the merge to m-c... =)
Updated•15 years ago
Flags: blocking1.9.1? → blocking1.9.1+
Assignee
Comment 7•15 years ago
I just ran through this myself, comparing the two changesets you reported with this morning's tip, which is 2bfbc7fe3867. I will call them R0, R1, and Rt. Notable differences:

            R0     R1     Rt
    DNA    169    171    131
    MD5   1278    518   1100
    Prime  170    155    196
    SHA1  1003    571    969

As of today, it seems that we may have:
- a new regression in DNA: 170 -> 130
- possible mild regressions in MD5, SHA1
- improvement in Prime
Assignee
Comment 8•15 years ago
Detailed stats for DNA, R1 vs Rt (btw, "DNA" actually means string-fasta) are below. Ignore the non-JS time; it is irrelevant. I think Dromaeo runs everything for 20 seconds, so the key thing here is more the ratio of interpreted time vs. native time.

# Rt (current, slower)
david-mandelins-macbook-pro:scripts dmandelin$ python binlog.py
Activity duration summary:
    interpret      7982.501978
    monitor          12.656783
    record           15.870025
    compile          12.821750
    execute           7.658347
    native        12166.841877
    -------------------------
    Subtotal      20198.350760
    Non-JS        21191.393634
    =========================
    Total         41389.744394
Estimated speedup: 1.90x

Reasons for transitions to interpret:
    start          5928.450675 ms   3326    1782.456607 us
    abort           861.947463 ms    812    1061.511654 us
    timeoutExit     557.293237 ms      8   69661.654602 us
    backoff         519.918843 ms   1206     431.110152 us
    cold             93.904430 ms    687     136.687672 us
    extendOuter      15.672669 ms   1510      10.379251 us
    extend            3.275303 ms    174      18.823582 us
    mismatchExit      1.740493 ms   1465       1.188050 us
    monitor           0.240087 ms    398       0.603234 us
    extendInner       0.039302 ms     11       3.572893 us
    deepBailExit      0.019476 ms      2       9.737955 us

# R1 (faster)
david-mandelins-macbook-pro:scripts dmandelin$ python binlog.py
Activity duration summary:
    interpret      4974.553315
    monitor          31.304646
    record           25.471654
    compile          19.401827
    execute          38.807475
    native        14909.712752
    -------------------------
    Subtotal      19999.251669
    Non-JS        12888.297141
    =========================
    Total         32887.548810
Estimated speedup: 2.11x

Reasons for transitions to interpret:
    timeoutExit    3070.217471 ms   7584     404.828253 us
    extend          965.701927 ms   1440     670.626338 us
    start           580.653700 ms   4158     139.647355 us
    abort           125.226376 ms    756     165.643354 us
    extendOuter      91.527501 ms   9663       9.471955 us
    monitor          90.505205 ms   1251      72.346287 us
    cold             37.329944 ms   4448       8.392523 us
    mismatchExit     12.970392 ms   1910       6.790781 us
    deepBailExit      0.420799 ms     17      24.752861 us
Assignee
Comment 9•15 years ago
On string-fasta we're getting burned by blacklisting. The stats in my last comment subtly show this, but it was much easier to see with the visualization. At some points we spend over a second in the interpreter without a break. The regression most likely came from the new, more aggressive blacklisting. I can make it go away by setting BL_ATTEMPTS to 200. The attachment contains the aborts for my run. There are many--mostly jsQuery stuff.
Assignee
Comment 10•15 years ago
OK, I now have a more definitive diagnosis of Dromaeo string-fasta ("DNA Sequence Alignment"). This is the key function:

    function fastaRepeat(n, seq) {
        var seqi = 0, lenOut = 60;
        while (n>0) {
    ***     if (n<lenOut) lenOut = n;
            if (seqi + lenOut < seq.length) {
                ret = seq.substring(seqi, seqi+lenOut);
                seqi += lenOut;
            } else {
                var s = seq.substring(seqi);
                seqi = lenOut - s.length;
                ret = s + seq.substring(0, seqi);
            }
            n -= lenOut;
        }
    }

In the standard SS version, this function is called once. We compile 3 traces for this loop at the beginning, and then run natively until the function is done. Note that for the line marked ***, we always take the same branch (bypassing the 'then' statement), except on the last iteration of the loop, where we execute the 'then' statement.

If we call the function more than once (either due to adding a loop to the SS version, or just calling it a lot like Dromaeo does), bad things start happening:

1. On the second call, we have to record the 3 traces again, because the typemap is now different. The first time, it was OOISIIB and OOISIIS:B, while the second time (in Dromaeo), it is OOISIIBII:S, where the colon separates the stack part from the global part. That slot is for the local variable 's'. Why it would be different, I don't know. This seems like a bug to me.

2. We will take *** enough times with the new trace to try to extend the tree from that exit. But that extension trace exits the loop without our noticing, and then exits the function, and we abort at the LeaveFrame (because that would take us back to a frame not incorporated into this trace).

3. Eventually the aborts occur enough times (3) to cause blacklisting. The blacklisting is applied to the root of the trace tree, and it makes it so that js_MonitorLoopEdge is not called for that trace tree ever again. So now we have a tree with good compiled traces, but we can't call it.

I think each one of these events represents a defect, although I'm sure 3 is the easiest to solve in the short term:

1': In reality, the types on the second call are identical to those on the first call. We should figure out how to decide type compatibility more accurately.

2': I don't see why we have to abort the trace. We could consider simply stopping the trace at the frame boundary and then trying to start a new trace at that point.

3': Here we are blacklisting a tree, in the sense of never recording or tracing it again, simply because one of its branches couldn't be compiled. Also, my previous studies indicated that the LeaveFrame abort condition isn't very "harmful", in the sense that valid traces can often be formed for headers that get that condition. Because entering and leaving traces repeatedly is costly, and often leads to sub-interpreter performance, there is some merit to the idea of blacklisting a tree if one or more of its common paths can't be compiled. But a LeaveFrame path, it would seem, cannot be a common path. Thus, the extension point should be marked so that we don't try to record from it again, but the tree itself should not be blacklisted.
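A minimal driver that should reproduce the re-recording described in point 1. The seed string and iteration counts below are made up for illustration; the real Dromaeo harness calls fastaRepeat repeatedly with much larger inputs:

    // Hypothetical reduction, not from the benchmark: the second call
    // should reuse the traces recorded by the first, but per point 1 the
    // typemap differs across calls, forcing a re-record. Note that 's' is
    // assigned only on the wrap-around path, which is presumably why its
    // slot's type is not stable from one call to the next.
    var seed = "GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGG";
    for (var i = 0; i < 3; i++)
        fastaRepeat(2000, seed);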
Comment 11•15 years ago
This patch only blacklists the tree if we abort on a primary fragment (the first trace we record). This causes a significant (50ms) perf loss in the shell on SS. I am not sure where.
Comment 12•15 years ago
I can't measure a slowdown when I use the sunspider harness:

TEST                   COMPARISON            FROM                 TO              DETAILS
=============================================================================
** TOTAL **:           -                  1053.2ms +/- 0.1%   1053.9ms +/- 0.1%
=============================================================================
3d:                    *1.004x as slow*    156.1ms +/- 0.2%    156.8ms +/- 0.4%  significant
  cube:                ??                   44.7ms +/- 0.6%     44.9ms +/- 0.7%  not conclusive: might be *1.004x as slow*
  morph:               ??                   30.8ms +/- 0.4%     30.9ms +/- 0.7%  not conclusive: might be *1.003x as slow*
  raytrace:            *1.005x as slow*     80.6ms +/- 0.2%     81.0ms +/- 0.3%  significant
access:                -                   138.9ms +/- 0.2%    138.8ms +/- 0.2%
  binary-trees:        -                    40.1ms +/- 0.3%     40.1ms +/- 0.3%
  fannkuch:            -                    58.3ms +/- 0.2%     58.2ms +/- 0.2%
  nbody:               -                    26.3ms +/- 0.5%     26.2ms +/- 0.5%
  nsieve:              ??                   14.2ms +/- 0.9%     14.3ms +/- 0.9%  not conclusive: might be *1.007x as slow*
bitops:                ??                   36.0ms +/- 0.8%     36.1ms +/- 0.7%  not conclusive: might be *1.003x as slow*
  3bit-bits-in-byte:   -                     1.6ms +/- 8.8%      1.5ms +/- 9.4%
  bits-in-byte:        1.010x as fast        8.0ms +/- 0.7%      8.0ms +/- 0.7%  significant
  bitwise-and:         ??                    2.5ms +/- 5.7%      2.6ms +/- 5.5%  not conclusive: might be *1.032x as slow*
  nsieve-bits:         *1.008x as slow*     23.9ms +/- 0.5%     24.1ms +/- 0.4%  significant
controlflow:           -                    33.0ms +/- 0.5%     33.0ms +/- 0.3%
  recursive:           -                    33.0ms +/- 0.5%     33.0ms +/- 0.3%
crypto:                -                    63.6ms +/- 0.4%     63.6ms +/- 0.3%
  aes:                 -                    36.5ms +/- 0.7%     36.3ms +/- 0.4%
  md5:                 *1.008x as slow*     20.1ms +/- 0.4%     20.2ms +/- 0.6%  significant
  sha1:                *1.006x as slow*      7.0ms +/- 0.0%      7.0ms +/- 0.8%  significant
date:                  1.004x as fast      179.1ms +/- 0.2%    178.3ms +/- 0.2%  significant
  format-tofte:        1.009x as fast       70.8ms +/- 0.3%     70.2ms +/- 0.2%  significant
  format-xparb:        -                   108.3ms +/- 0.3%    108.2ms +/- 0.2%
math:                  -                    39.0ms +/- 0.8%     38.8ms +/- 0.5%
  cordic:              -                    19.1ms +/- 0.6%     19.0ms +/- 0.6%
  partial-sums:        -                    13.7ms +/- 1.6%     13.5ms +/- 1.1%
  spectral-norm:       ??                    6.2ms +/- 1.9%      6.2ms +/- 2.0%  not conclusive: might be *1.006x as slow*
regexp:                ??                   45.1ms +/- 0.4%     45.2ms +/- 0.3%  not conclusive: might be *1.002x as slow*
  dna:                 ??                   45.1ms +/- 0.4%     45.2ms +/- 0.3%  not conclusive: might be *1.002x as slow*
string:                ??                  362.5ms +/- 0.1%    363.2ms +/- 0.2%  not conclusive: might be *1.002x as slow*
  base64:              ??                   17.2ms +/- 0.6%     17.2ms +/- 0.7%  not conclusive: might be *1.003x as slow*
  fasta:               *1.005x as slow*     75.4ms +/- 0.2%     75.7ms +/- 0.2%  significant
  tagcloud:            ??                   99.5ms +/- 0.2%     99.8ms +/- 0.4%  not conclusive: might be *1.003x as slow*
  unpack-code:         -                   139.2ms +/- 0.2%    139.3ms +/- 0.3%
  validate-input:      -                    31.2ms +/- 0.5%     31.2ms +/- 0.5%
Comment 13•15 years ago
> 2. We will take *** enough times with the new trace to try to extend the tree
> from that exit. But that extension trace exits the loop without our noticing,
> and then exits the function, and we abort at the LeaveFrame (because that
> would take us back to a frame not incorporated into this trace).

This is a BUG. Can we reduce this to a smaller testcase, or do we know where and how we walk out of the loop without noticing?
Comment 14•15 years ago
Without the patch:

recorder: started(353), aborted(108), completed(520), different header(0), trees trashed(1), slot promoted(0), unstable loop variable(93), breaks(9), returns(3), unstableInnerCalls(60)
monitor: triggered(23292), exits(23292), type mismatch(0), global mismatch(12)

With the patch:

recorder: started(352), aborted(108), completed(519), different header(0), trees trashed(1), slot promoted(0), unstable loop variable(92), breaks(9), returns(3), unstableInnerCalls(60)
monitor: triggered(23292), exits(23292), type mismatch(0), global mismatch(12)

Bench shows a 5ms slowdown. The sunspider testsuite shows no slowdown. The difference in the stats is minimal and unstable. In other words, the bench result difference might be just noise. Aborting the trace is still a bug, but this patch could go in as an additional fix. Let's fix the underlying bug first, though.
Comment 15•15 years ago
I am tempted to convert the abort into an assert. We don't hit it in trace-tests.
Comment 16•15 years ago
https://bugzilla.mozilla.org/show_bug.cgi?id=482800 contains a suggested fix for the fastaRepeat issue.
Updated•15 years ago
Attachment #366853 - Flags: review?(graydon)
Comment 17•15 years ago
I'm hesitating to review this as r+. It is OK in the sense that the diagnosis is correct and the fix appears to change the diagnosed problem into the proposed new state of affairs. However, I'm concerned that we're just re-opening ourselves up to a "too many aborted compilations on a non-compilable leaf" scenario. The current behavior is "intentional" in the sense that falling back to interpreting is guaranteed to have a lower bound of "no slower than the interpreter". We have often seen cases where constant aborts make the JIT's presence *slow down* the browser relative to interpreter mode. Blacklisting tree roots aggressively saves us from these cases (mostly: it still doesn't save us from the auxiliary cost of entering and leaving traces too often).

I believe a more desirable fix would do the following (see the sketch below):

- backoff on a per-fragment basis, not moving to root but also not ignoring leaves
- blacklist on a per-fragment basis when attempts *for that frag* hit the limit
- only patch over LOOP opcodes in cases where we're looking at a root (afaict the only reason the current code insists on blacklisting at the roots)
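A minimal sketch of that per-fragment policy. BL_ATTEMPTS is a real tracer constant, but its value here and all of the bookkeeping below are invented for illustration; the actual logic is C++ in jstracer.cpp:

    // Illustrative model of per-fragment backoff/blacklisting. Each
    // fragment (tree root or branch) keeps its own abort count, so a
    // hopeless leaf is abandoned without killing the whole tree.
    var BL_ATTEMPTS = 32; // assumed value

    function patchOverLoopOpcode(frag) {
        // Hypothetical stand-in: for a root, stop monitoring this loop
        // header (the real code patches the LOOP opcode itself).
        frag.monitored = false;
    }

    function recordAborted(frag) {
        frag.aborts = (frag.aborts || 0) + 1;
        if (frag.aborts < BL_ATTEMPTS) {
            // Backoff: push the next recording attempt on this fragment
            // further out each time it fails.
            frag.hitsUntilNextAttempt = frag.aborts * 16;
            return;
        }
        // Blacklist only this fragment; sibling traces stay usable.
        frag.blacklisted = true;
        if (frag.isRoot)
            patchOverLoopOpcode(frag);
    }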
Assignee
Comment 18•15 years ago
(In reply to comment #17)
> I believe a more desirable fix would do the following:
> - backoff on a per-fragment basis, not moving to root but also not ignoring leaves
> - blacklist on a per-fragment basis when attempts *for that frag* hit the limit
> - only patch over LOOP opcodes in cases where we're looking at a root (afaict
>   the only reason the current code insists on blacklisting at the roots)

This seems dicey to me. If you blacklist just a non-root fragment or two, you can easily get into the situation of going on and off trace repeatedly through the loop, which, as you point out, hurts perf a lot. And blacklisting tree roots *does* save us from many of those (but possibly not all--I think inner and outer loops can still interact to create that problem). Not that we can't try it--sadly, it seems impossible to get away from empirical testing in blacklisting algorithm design, at least with what we know so far.

But maybe we should use a hybrid solution--backoff/blacklist at both the fragment and tree level. At the fragment level, we back off if we get to a situation where the fragment doesn't compile now, but we have reason to believe it will later. We blacklist if we think it will never compile. At the tree level, I'm not sure backoff makes sense, but we can blacklist if the level of "badness" (i.e., falling off trace too quickly) is too high. We could measure the number of iterations per trace execution for the first 20 or 100 or so tree entries after recording has stabilized, and if it's too low, blacklist that tree (a sketch of this heuristic follows). Eventually I think we will also want to think harder about inner/outer loop interaction wrt blacklisting.
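A sketch of the tree-level badness heuristic floated above; the sample size and threshold are invented for illustration:

    // Illustrative tree-level blacklisting: after recording stabilizes,
    // sample how many loop iterations each native trace entry delivers.
    // If the tree keeps falling off trace after only a few iterations,
    // entering it costs more than interpreting, so stop calling it.
    var SAMPLE_ENTRIES = 100;    // assumed ("first 20 or 100 or so")
    var MIN_ITERS_PER_ENTRY = 4; // assumed threshold

    function noteTreeEntry(tree, itersExecuted) {
        tree.entries = (tree.entries || 0) + 1;
        tree.iters = (tree.iters || 0) + itersExecuted;
        if (tree.entries == SAMPLE_ENTRIES &&
            tree.iters / tree.entries < MIN_ITERS_PER_ENTRY) {
            tree.blacklisted = true; // stop monitoring this loop edge
        }
    }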
Assignee
Updated•15 years ago
Assignee: general → dmandelin
Comment 19•15 years ago
I don't want to be holding this up; gal, do you have any feelings on dmandelin's or my comments here, either way? Or do you want to charge ahead with the patch as-is?
Comment 20•15 years ago
The underlying bug is fixed. We no longer leave the loop and abort there, so I don't think the patch is necessary for that particular benchmark, which makes it less critical to get in. You still seem to have concerns.
Comment 21•15 years ago
sayrer: please unblock (maybe verify first that the slowdown is gone?)
Updated•15 years ago
Priority: -- → P2
Assignee
Comment 22•15 years ago
We still have problems here. In fact, we've gone backwards from my last test on at least 4 benchmarks. Performance seems to be unstable on these. Extending my table from before, with Ru = tip of 4/2:

            R0     R1     Rt     Ru
    fasta  169    171    131    123   ("DNA Sequence Alignment")
    MD5   1278    518   1100    871
    Prime  170    155    196    149
    SHA1  1003    571    969    676

The visualization for fasta looks pretty much like what I did before, so I think the same causes still apply. I will recheck.
Assignee
Comment 23•15 years ago
OK, some real progress on fasta, finally. The problem is too many GCs. A GC during this test takes about .3 seconds and gets run about every .7 seconds, so this really kills our throughput. Runs that get 2 GCs are even slower. I don't know how these GCs are getting triggered yet--I'm going to dtrace it.
Assignee
Comment 24•15 years ago
The GC is getting triggered in js_NewGCThing by the rt->gcMallocBytes >= rt->gcMaxMallocBytes condition. Thus, the program is allocating about 32MB per .4s of mutator run time. After that, we do GC for .3 seconds, which is both a big pause time and a poor mutator throughput quotient. Running the GC that often doesn't seem that bad, as long as the GC can be made to run faster. I'm not sure why it's taking so long. I'm also not sure how much garbage is or is not being recovered. The problem could also really be more related to the string implementation code.

Perf probably dropped off when I made JS_realloc'd bytes for new memory objects count toward gcMallocBytes. This is necessary, because some apps GC way too late and start thrashing without it.
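A simplified model of that trigger, with field names mirroring the quoted condition; the real check is C++ inside js_NewGCThing, so treat this as a sketch of the policy only:

    // Every allocation charges its malloc'd bytes to the runtime;
    // crossing the budget forces a full GC (the ~0.3s pause observed).
    function runGC(rt) {
        // Hypothetical stand-in for the collector.
        rt.gcMallocBytes = 0;
    }

    function noteMalloc(rt, nbytes) {
        rt.gcMallocBytes += nbytes;
        if (rt.gcMallocBytes >= rt.gcMaxMallocBytes)
            runGC(rt);
    }

    // At ~32MB allocated per .4s of mutator time, a 32MB budget yields
    // roughly one .3s GC every .7s of wall clock, matching comment 23.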
Assignee
Comment 25•15 years ago
Still grinding through the revision history. In searching for the regressions, I first ended up finding a perf improvement. Changeset 81a29bf64b16, for bug 475474, produced a dramatic improvement in MD5 and SHA1, and smaller improvements in fasta (DNA Sequence Alignment) and access-nsieve (Primes). I don't know the exact details of that bug, but given that it makes the JIT handle global objects more flexibly/precisely, it's not surprising that it would help in-browser perf.

About half of this perf gain was lost by changeset 569acf636d50, so the next step is to track that down. After that, the perf was eventually regained, and after that, about 10-20% was lost again.
Assignee
Comment 26•15 years ago
The next interesting event was a severe regression in MD5 and SHA1, accompanied by a smaller regression in Primes. This happened at changeset 569acf636d50 for bug 469625, which I believe introduced the arrayProtoHasElements flag. The perf regression was detected and filed as bug 481251 and fixed around March 3.
Assignee
Comment 27•15 years ago
I think I have the complete story on MD5 (and a nearly complete one on SHA1):

long ago: MD5: 33, SHA1: 240

81a29bf64b16: MD5: 1000, SHA1: 780
This was the tracking-multiple-globals patch.

569acf636d50: MD5: 400, SHA1: 400
This fixed a bug where the tracer generated incorrect code for arrays with holes when an array prototype element had been set. This first solution was somewhat crude, relying on a global flag that deoptimized all array accesses once the jit thought an array prototype element had been set (and the jit used a very conservative check, so things got deoptimized quickly).

8761d29774e4: MD5: 900, SHA1: 700
This replaced the mechanism from the last bug. The new code uses guards to verify that the shapes of the objects on the prototype chain are the same as they were during trace recording. Thus, we don't get widespread deoptimization, but we do have moderately expensive checks for all sparse array accesses, so we didn't get back all the perf we had before fixing the bug.
There are sparse arrays in our crypto? :-(
Comment 29•15 years ago
(In reply to comment #28)
> There are sparse arrays in our crypto? :-(

Dense arrays, with holes. <Homer S. voice> Love that SunSpider, don't hate -- love! :-P.

/be
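To make "dense arrays, with holes" concrete, here is a tiny shell example (not from the benchmark) of a hole reading through Array.prototype, which is exactly what forces the tracer to guard the prototype chain:

    // A dense array with a hole at index 1. Reading the hole must consult
    // the prototype chain, so once Array.prototype can have elements the
    // tracer needs shape guards (or, in the cruder fix, a global
    // deoptimization flag) on array accesses that might hit a hole.
    var a = [1, , 3];        // dense, but a[1] is a hole
    print(a[1]);             // undefined
    Array.prototype[1] = 7;  // now the hole "sees" the prototype element
    print(a[1]);             // 7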
Comment 30•15 years ago
Tracking multiple globals has nothing to do with crypto, either -- that one is due to Dromaeo's harness. But it was a good fix. /be
Comment 31•15 years ago
I think this is WONTFIX. The maximum perf was due to incorrect semantics. We can probably optimize this further using raw arrays or other tricks, but the performance loss has been diagnosed and happens by design.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
Comment 32•12 years ago
Comment on attachment 366853 [details] [diff] [review]
patch

Long dead; clearing out the queue.
Attachment #366853 - Flags: review?(graydon) → review-