<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 2

•

15 years ago

I wonder how things correlate to the branchiness of the loops...

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 3

•

15 years ago

dvander, I have the new persistent oracle patch half-way working. I will upload it soon-ish. That will do much better blacklisting and avoid a lot of unnecessary compilation. That should help, too.

Reporter

Comment 4

•

15 years ago

bz: That's a good question. I am trying to find a good way to automatically correlate results with information like that, right now it's by hand. Working on a new script with the ability to selectively enable tracing on any combination of loops in a script. It performs a tree search to find a supposedly optimal result. The results are kind of hard to interpret at a glance so I'm still playing with it. Here's some samples. bitops-bitwise-and: > Method JIT: 2ms > Tracing JIT: 1ms > Combined: 1ms > Search found winning time 0ms! > Combos at this time: 1 > loop @ 27-30 traced 100% of the time This is pretty straightforward. We should always trace bitwise-and. access-binary-trees: > Method JIT: 11ms > Tracing JIT: 22ms > Combined: 11ms > Search found winning time 10ms! > Combos at this time: 1 > loop @ 41-54 traced 0% of the time > loop @ 46-52 traced 0% of the time > loop @ 32-58 traced 0% of the time Another easy one. We should never trace this. crypto-sha1: > Method JIT: 4ms > Tracing JIT: 2ms > Combined: 2ms > Search found winning time 2ms! > Combos at this time: 31 > loop @ 54-85 traced 54% of the time > loop @ 63-77 traced 100% of the time > loop @ 119-126 traced 51% of the time > loop @ 158-162 traced 100% of the time > loop @ 172-176 traced 35% of the time > loop @ 186-193 traced 32% of the time > loop @ 203-218 traced 41% of the time > loop @ 209-216 traced 58% of the time > loop @ 237-242 traced 25% of the time This seems to be saying that for a bunch of loops are dontcare for tracing, but others are worthwhile. Patch & more data tomorrow. I think v8 will be more interesting, since it's longer running.

David Anderson [:dvander] - inactive, e-mail if emergency

Reporter

Comment 5

•

15 years ago

Attached file exhaustive search — Details

http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/0de8f6369f8d Initial script only does exhaustive searches, which doesn't work for a few SS tests, and all of v8. Work on that forthcoming. For now, attached is the exhaustive analysis where we can do it. Method JIT alone got 376ms. Tracing JIT alone got 382ms. Combined, as-is, got 358ms. Optimal result is 326ms.

David Anderson [:dvander] - inactive, e-mail if emergency

Reporter

Comment 6

•

15 years ago

Attached file MCTS results — Details

Results for the four remaining interesting benchmarks, using a random search. I've no idea how accurate this is, but the results seem in line with the exhaustive searches on other tests (sometimes with more confidence). Method JIT total: 107ms Tracing JIT total: 187ms Combined total: 151ms Estimated optimum: 105ms This seems to be saying that cube, raytrace, fannkuch, and aes do not trace well. My takeaway lesson from SunSpider is that we should give up on loop nesting very quickly if it's not working out. It seems to never be worth it. v8 tomorrow.

Comment 7

•

15 years ago

> My takeaway lesson from SunSpider is that we should give up on loop > nesting very quickly if it's not working out. It seems to never be worth it. I wonder if only using tracing for fragments that self-loop would be effective.

Comment 8

•

15 years ago

I've been going over the SS data to try to make a start on simplifying the problem. Relative to the current JM+TM setup, where we try to trace everything, the biggest wins we can get are by not tracing anything in the following benchmarks: access-fannkuch 32 ms 3d-raytrace 13 crypto-aes 9 bitops-nsieve-bits 6 date-formate-tofte 6 Features that jumped out at me as possibly being useful to distinguish these: * fannkuch - (Note: It's probably fine/good to trace the first loop, a simple array init) - triply-nested loop within one function - loops inside conditionals - the outer loop has a total of 13 control-flow constructs inside it: 8 loops and 5 ifs. * 3d-raytrace - doubly-nested loops within one function - loop containing a call to a method with the same name as the current function -- this suggests recursion inside a loop, or a callout to a method in the inheritance hierarchy that might loop - loop with 6 function calls and 2 ifs * crypto-aes - (Note: tracing the first 3 loops is fine) - doubly-nested loops within one function - (Note: there are some other loops that look like they should trace well to me, but the best times seem into indicate they should not trace; not sure what's going on here.) * bitops-nsieve-bits - doubly-nested loops within one function * date-format-tofte - loop containing a return statement - loop containing a call to eval - loop containing 4 method calls Note that an equally important analysis that I have not done yet is to identify the loops that are most important *to* trace and their characteristics. *Tentative* summary based on the above, with caveat regarding one-sided analysis so far: * Strong indicators of loops not to trace: - statically detectable nesting - call to eval * Weaker indicators of loops not to trace: - large number of ifs - large number of calls, especially method calls |a.f()| and maybe-recursive calls - return statements

Comment 9

•

15 years ago

Just curious, would it help to change the hotloop constant, the number of iterations before recording? E.g. a higher value could prevent tracing certain loops with few iterations?

Comment 10

•

15 years ago

Now, the biggest wins available by tracing some or all loops relative to not tracing anything: date-format-xparb trace loops at 363, 422 6 ms math-partial-sums trace loop at 12 6 string-base64 trace all 6 access-nbody trace loops at 102, 105, 127, 178 5 math-spectral-norm trace loops at 12, 15, 25, 28 5 bitops-3bit-bits-in-byte trace all 5 string-validate-input trace all 4 bitops-bits-in-byte trace all 3 bitops-bitwise-and trace all 3 crypto-sha1 trace loop at 63 2 In general, the magnitudes of the times so far suggest that it's most important to avoid tracing just a few really bad loops that don't trace. Conversely, it's probably OK to lose an opportunity to trace a loop here or there--the loss is smallish. Observations on the loops we really want to trace: * date-format-xparb - loop with a total of 3 operators, including loop test - loop identical to one in tofte that we don't want to trace: 4 function calls, fairly short overall * math-partial-sums - no control flow, no array access, only functions/props called are in Math. - lots of arithmetic ops - (The loop we don't need trace here is an outer loop, but in a different function from the inner loop). * string-base64 - no control flow or function calls, lots of arithmetic, some array reads - 3 ifs, no function calls, lots of arithmetic, some array reads * access-nbody - a doubly-nested loop; outer loop has only one statement besides the inner loop; lots of arithmetic, no function calls other than Math - single loop with mostly arithmetic - single loop with one-line body that calls a function * math-spectral-norm - doubly-nested loops; small bodies for both loops; mostly arithmetic; inner loop calls a function that is just one arithmetic expression * bitops-3bit-bits-in-byte - doubly-nested loop; contains a function call that is mostly arithmetic * string-validate-input - loop with 4 function calls; 2 are one-line functions; there is nesting--inner loops are small, so are outer loops * bitops-bits-in-byte - triply-nested loop; mostly arithmetic * bitops-bitwise-and - mostly arithmetic * crypto-aes - single loop; calls several functions that are short arithmetic kernels Summarizing, the key factors that indicate a loop we want to trace are: - lots of arithmetic - short loop bodies - few if statements

Comment 11

•

15 years ago

(In reply to comment #9) > Just curious, would it help to change the hotloop constant, the number of > iterations before recording? E.g. a higher value could prevent tracing certain > loops with few iterations? That might help, but I suspect we need to be adaptive about it: it looks like some loops you can tell will or will not trace immediately, but other loops may be expected to pay off only for more iterations.

Comment 12

•

15 years ago

Putting comment 8 and comment 10 together: ** Basic features that indicate what to trace * If a loop contains one of these, do not trace it: - more than 2 nested loops (not including the outer loop; at any level) - call to eval - return out of loop * If a loop contains these, probably do not trace it: - more than 5 conditionals - maybe-recursive calls (calls to the same name/prop name) * If a loop contains these, question tracing it: - more than 4 function calls; methods seem to hurt tracing more than global function calls * If a loop has these characteristics, do trace it: - short loop body (What does this mean? Not sure yet, but "2 lines of code" or "5 or fewer operators" are a reasonable starting place.) - mostly arithmetic (What does this mean? Not sure yet, but it's mostly about the proportion of arithmetic operations to other kinds.) ** Additional notes - We can often guess a function call target statically and be right. This could be useful: if the targets to functions called from a loop have no loops or short loop bodies and are mostly arithmetic, then that loop probably traces well; if the functions have lots of control flow or are big, then the loop probably doesn't trace. - It looks like we want to be adaptive about nested loops. We could say "never trace nested loops" and not give up too much. But some loop nests trace pretty well. Maybe we can figure all that out statically, but we might be able to tell better when we're ready to try an outer loop by how the inner loop is tracing.

Comment 13

•

15 years ago

dmandelin, great analysis. You haven't mentioned crypto-md5. It's most distinctive feature is that the main loop is enormous -- almost 20,000 LIR instructions get put into the writer pipeline. (In comparison, most traced fragments have 10s or 100s of LIR instructions.) In one respect it's an excellent loop to trace, because it's nothing but arithmetic. But the compile-time cost is really high, and the loop isn't run enough times to make up that cost. (I'd love to know how we compare against V8/Nitro if this benchmark checksummed 10x or 100x more text.) Not tracing this loop seems the right decision. It contains many more than 4 function calls, so the heuristics in comment 12 would do the right thing.

Comment 14

•

15 years ago

(In reply to comment #13) > dmandelin, great analysis. Thanks. > You haven't mentioned crypto-md5. It's most distinctive feature is that the > main loop is enormous -- almost 20,000 LIR instructions get put into the > writer pipeline. (In comparison, most traced fragments have 10s or 100s of > LIR instructions.) In one respect it's an excellent loop to trace, because > it's nothing but arithmetic. But the compile-time cost is really high, and > the loop isn't run enough times to make up that cost. (I'd love to know how > we compare against V8/Nitro if this benchmark checksummed 10x or 100x more > text.) I left that one out because we only lose 3 ms by tracing it when we shouldn't. But I see that it does raise a point that may matter on the web more generally. I see that the main giant loop (in core_md5, right?) runs 3967 times, which I would expect would give a nice speedup and be well worth tracing. Is the trace compile time linear in the length of the loop body, or can it grow faster (either asymptotically, or at thresholds or something)? If linear, then I'm really surprised. Otherwise it suggests some kind of "don't trace 'giant loops'" element to the heuristics. > Not tracing this loop seems the right decision. Yes, that's what David's stats say as well. It looks like the other loops in md5 are don't cares. That reminds me of something I forgot to say earlier--many loops in SS are don't cares between tracejit and methodjit. I think that's a good thing, because it gives us a lot of freedom in the decision criteria to really get it right for the loops that matter. > It contains many more than 4 function calls, so the heuristics in comment 12 > would do the right thing. Another thing I forgot to mention earlier--I'm hoping those heuristics can actually be turned into 'features' that can then be plugged in to a machine learning algorithm to compute the best formula based on a training set.

Brendan Eich [:brendan]

Comment 15

•

15 years ago

The data set for crypto-md5 was cut down from an older (C?) benchmark when Maciej build SunSpider, because anything larger was too slow in IE at the time. Truly a bogus benchmark, and it may change (someone, perhaps not Apple, will do "Annuals" of SunSpider). Any way to take into account larger input size justifying tracing? /be

Comment 16

•

15 years ago

(In reply to comment #15) > The data set for crypto-md5 was cut down from an older (C?) benchmark when > Maciej build SunSpider, because anything larger was too slow in IE at the time. > Truly a bogus benchmark, and it may change (someone, perhaps not Apple, will do > "Annuals" of SunSpider). Any way to take into account larger input size > justifying tracing? Could we use information collected by Nick's loop bounds elimination to guess at the number of iterations and use that as another input to the decision?

Comment 17

•

15 years ago

I have an idea to make the code in md5 trace faster. Gimme a day.

Comment 18

•

15 years ago

(In reply to comment #17) > I have an idea to make the code in md5 trace faster. Gimme a day. Bug 575529 already speeds it up by 1.25--1.50x. I haven't landed that patch because it still needs some work but it also hasn't seemed that important so far... because it is such a bogus benchmark I'm reluctant to land patches that will only help it and nothing else.

travis fisher

Comment 19

•

15 years ago

Regarding interval analysis and real-world testcases, it might have a good impact on some of the "emulators are slow" complaints, like bug 509986 or bug 514765. I would bet the hot core of an 8 bit emulator has a lot of integer ALU operations that can be statically shown to have 8 or 16 bit results...

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 20

•

15 years ago

(In reply to comment #19) > Regarding interval analysis and real-world testcases, it might have a good > impact on some of the "emulators are slow" complaints, like bug 509986 or bug > 514765. I would bet the hot core of an 8 bit emulator has a lot of integer ALU > operations that can be statically shown to have 8 or 16 bit results... Both of those should be greatly helped by JM even without any interval analysis.

Reporter

Comment 21

•

15 years ago

Attached patch experimental profiling patch — Details — Splinter Review

Use some runtime rdtsc profiling to decide whether to trace or not. A quick attempt shows promise for v8, where earley-boyer traces 3X slower. http://pastebin.mozilla.org/762023 I'm going to keep this in mind but look to see if we can find deterministic heuristics first. Timing dependent results could make reproducibility and debugging very painful.

David Anderson [:dvander] - inactive, e-mail if emergency

Reporter

Comment 22

•

15 years ago

http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/9e7fa574c491 Two small heuristics. First, increase HOTLOOP to 4. We've wanted to do this for a while. The second is to blacklist a loop after it runs 300 times. I'm not too sure on the reliability of the latter, though the two combined show a 26% perf win for v8 and help mitigate the perf we're currently losing on SunSpider by 2.6%. I tried counting opcodes and using that as a metric, but it was highly unreliable.

David Anderson [:dvander] - inactive, e-mail if emergency

Reporter

Comment 23

•

15 years ago

For v8, it's important to note that by making two of the benchmarks much faster, we took 5-10% hits across the board elsewhere.

Comment 24

•

15 years ago

I am pretty sure my improved oracle will help a ton here. Should be ready soon, just have to clear out some patches from my queue.

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

15 years ago

Blocks: 585120

Reporter

Comment 25

•

15 years ago

The biggest loss right now is access-fannkuch. In JM it's okay to finally speculate on integers in arrays, without Oracle changes: http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/4e1c23ed3b2d The method JIT seems to dampen the losses we previously observed by making this change in TM. We're still losing ~5ms on a few benchmarks here and there, but combined is 6% faster than TM on the graphs, and 2.5% slower than JM alone.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 26

•

15 years ago

Attached patch patch to heuristically decide whether to trace a file (obsolete) — Details — Splinter Review

I've spent some time looking this over. The good news is that there seem to be a few simple tests we can do to decide whether to trace a given SunSpider benchmark. These are the tests: - Are there any short loops of the form |for (...; x<c; ...)|, where |c| is a constant <= 8? - Are there any triply nested loops? - Is eval called within a loop? If any of these hold, then we should not trace. Otherwise, trace. I've attached a patch that performs these tests on bytecode and then blacklists entire JS files from the tracer. With the patch, our SunSpider score drops from 624ms to 564ms on my x86 box. The patch is pretty preliminary, but I'm hoping to get some feedback about where to put the hooks for it, as well as whether this sort of filename-based classification is acceptable. It would be a lot nicer to make the trace decision on a loop-by-loop basis. This seems to be a lot more difficult because the decision to trace one loop depends on whether you trace nearby loops. In some cases, tracing one loop and not another can lead to really pathological performance (6x slower in one case). I'll keep working on this.

Comment 27

•

15 years ago

(In reply to comment #26) > It would be a lot nicer to make the trace decision on a loop-by-loop basis. > This seems to be a lot more difficult because the decision to trace one loop > depends on whether you trace nearby loops. In some cases, tracing one loop and > not another can lead to really pathological performance (6x slower in one > case). I'll keep working on this. Can you clarify this? Specifically: 1. What is the relationship between the loops that affect each other that way? Nested, side by side, etc? 2. What exactly is the effect you observed? Is the effect simply that tracing a certain loop L1 gives a perf loss, but tracing L1+L2 is a win? Or are there more things going on?

Comment 28

•

15 years ago

Oh, forgot to say, very promising results!

Comment 29

•

15 years ago

(In reply to comment #26) > > It would be a lot nicer to make the trace decision on a loop-by-loop basis. Indeed. I'm sure the file-by-file basis works well for SunSpider because they're mostly so small, but as JS programs get bigger I suspect finer-grained tracing will be more important.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 30

•

15 years ago

THe improved oracle patch really helps here. Basically if you make a mistake you detect it and won't ever do it again. Benchmarks often run repeatedly, so after the first time we are set until the HTTP cache is flushed.

Assignee

Comment 31

•

15 years ago

Attached patch patch to heuristically decide whether to trace a loop (obsolete) — Details — Splinter Review

This patch tries to classify on a loop-by-loop basis. The classification is done using a support vector machine. The parameters for the machine are in jsanalysis.data in the patch. This patch gets about the same performance boost as the previous one on SunSpider, although it's spread out a little differently between the benchmarks. In some cases, we're able to get better performance than either -m or -j by tracing just the right set of loops. To answer Dave's questions: (1) the biggest performance effect by far is due to nested loops, and (2) there are cases where tracing one loop and not another can drop you into the interpreter for the rest of the function. dvander is looking into the second issue. I'm not content with this patch. I want to make sure that it doesn't ruin performance on other benchmarks. I'd also like to extract more features from the loops so that we don't overtrain on the benchmarks.

Alon Zakai (:azakai)

Comment 32

•

15 years ago

(In reply to comment #31) > I'm not content with this patch. I want to make sure that it doesn't ruin > performance on other benchmarks. The Leave-One-Out loss used in the python training script should be a good indication of whether we are at risk of bad performance on other benchmarks. Even that can be biased, if the current benchmarks trained on are very similar to each other. To some degree that can be fixed by leaving more than one out (for example, train on one family of benchmarks, test on another, and vice versa).

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 33

•

15 years ago

(In reply to comment #32) > (In reply to comment #31) > > I'm not content with this patch. I want to make sure that it doesn't ruin > > performance on other benchmarks. > > The Leave-One-Out loss used in the python training script should be a good > indication of whether we are at risk of bad performance on other benchmarks. > > Even that can be biased, if the current benchmarks trained on are very similar > to each other. To some degree that can be fixed by leaving more than one out > (for example, train on one family of benchmarks, test on another, and vice > versa). I tried something like this. (By the way, thanks for all your help with SVMs, Alon.) It now takes a while to do SVM training. So rather than do the full leave-one-out analysis, I split the data into a training set and a validation set. On the combined SunSpider/V8 data, the result is that we classify only 47% of the validation loops correctly, which is worse than random. I also did a few timing experiments. First I trained on SunSpider and tested on V8 and SS. The results (using the SunSpider harness): V8 with -m: 5040ms V8 with -m -j: 4008ms V8 with SVM(SS): 5147ms SS with -m: 630ms SS with -m -j: 652ms SS with SVM(SS): 570ms So we do well on SS and pretty badly on V8. The notation SVM(x) means running with -m -j and deciding whether to trace using an SVM trained on benchmark x. Then I trained on V8 and tested on V8 and SunSpider: V8 with SVM(V8): 3697ms SS with SVM(V8): 669ms So we do well on V8 and badly on SS. Finally I trained on V8 and SS together. V8 with SVM(SS+V8): 3716ms SS with SVM(SS+V8): 570ms Here we do quite well on both--about a 10% speedup on SunSpider over -m and a 7% speedup on V8 over -m -j. tl;dr: We're definitely overtraining. We do well on benchmarks that we train for but not very well at all on the rest. To try and fix this, I'm going to investigate doing machine learning on the traces themselves rather than the bytecode.

Reece H. Dunn

Comment 34

•

15 years ago

Are you just using the SunSpider and V8 benchmarks to decide when to trace and when to method jit? There are likely other javascript code that will trace/jit differently. I'm thinking here things like a jquery-heavy site (for table manipulation, etc.), the SNES and GameBoy emulators in javascript, the javascript SWF flash interpreter ('gordon' IIRC), the IE9 demos (fishtank, Amazon shelf and others) and other things that interact with canvas and/or dom. It would be interesting to see how the decision to trace or not works out in browser -- possibly being able to see/analyse the data from a browser session through some browser UI (to monitor dom/canvas javascript). Great work so far.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 35

•

15 years ago

Interesting results. Regarding overtraining, I wonder what would happen if you excluded the "don't care" loops (e.g., a set of loops where making the wrong choice on those loops gives a perf loss <= N). Maybe part of the overtraining comes from trying to get the just-right result on those.

Assignee

Comment 36

•

15 years ago

(In reply to comment #34) > I'm thinking here things like a jquery-heavy site (for table manipulation, > etc.), the SNES and GameBoy emulators in javascript, the javascript SWF flash > interpreter ('gordon' IIRC), the IE9 demos (fishtank, Amazon shelf and others) > and other things that interact with canvas and/or dom. Yes, I plan to look at other code soon. (In reply to comment #35) > Interesting results. Regarding overtraining, I wonder what would happen if you > excluded the "don't care" loops (e.g., a set of loops where making the wrong > choice on those loops gives a perf loss <= N). Maybe part of the overtraining > comes from trying to get the just-right result on those. I sort of did that, using a method Alon suggested. For each loop I estimated its importance, roughly as "if I do the wrong thing, how much worse does performance get?" Then in the training data, I used multiple copies of each loop based on its importance. So the wrong answer on an important loop was much more significant than on an unimportant one.

Updated

•

15 years ago

blocking2.0: --- → ?

Updated

•

15 years ago

Blocks: 598655

Comment 37

•

15 years ago

Yeah, trace jit still needs to be worked on, so that it doesn't slow down method jit's performance. The GameBoy Color emulator seems to be hit when tracejit is turned on when methodjit is already on. But only by a few CPU % points.

Robert Sayre

Updated

•

15 years ago

blocking2.0: ? → final+

Comment 38

•

15 years ago

FWIW, I played a bit with different values for HOTLOOP, and 200 seems to work quite well: TEST COMPARISON FROM TO =============================================== ** TOTAL **: 1.032x as fast 2293.8ms +/- 0.4% 2222.9ms +/- 0.3% =============================================== v8: 1.032x as fast 2293.8ms +/- 0.4% 2222.9ms +/- 0.3% crypto: ?? 293.1ms +/- 2.5% 297.0ms +/- 1.8% deltablue: 1.110x as fast 444.6ms +/- 1.8% 400.4ms +/- 0.7% earley-boyer: 1.034x as fast 348.4ms +/- 0.1% 337.0ms +/- 0.3% raytrace: 1.014x as fast 363.6ms +/- 0.3% 358.6ms +/- 0.3% regexp: 1.025x as fast 298.3ms +/- 0.3% 291.0ms +/- 0.2% richards: 1.043x as fast 240.4ms +/- 1.2% 230.4ms +/- 0.7% splay: ?? 305.4ms +/- 1.1% 308.5ms +/- 0.5% It's interesting that even richards gets faster. Sunspider: TEST COMPARISON FROM TO ============================================================================= ** TOTAL **: *1.015x as slow* 390.5ms +/- 0.2% 396.3ms +/- 0.2% ============================================================================= 3d: *1.065x as slow* 77.0ms +/- 0.6% 82.0ms +/- 0.4% cube: 1.024x as fast 26.8ms +/- 0.8% 26.2ms +/- 0.7% morph: - 17.7ms +/- 1.1% 17.7ms +/- 0.9% raytrace: *1.174x as slow* 32.4ms +/- 0.6% 38.1ms +/- 0.4% access: 1.020x as fast 46.9ms +/- 0.8% 46.0ms +/- 0.4% binary-trees: 1.058x as fast 8.5ms +/- 2.2% 8.0ms +/- 0.9% fannkuch: - 22.1ms +/- 0.6% 22.0ms +/- 0.4% nbody: 1.072x as fast 7.4ms +/- 2.5% 6.9ms +/- 1.4% nsieve: *1.019x as slow* 8.9ms +/- 1.5% 9.0ms +/- 0.8% bitops: ?? 19.4ms +/- 1.5% 19.6ms +/- 1.3% 3bit-bits-in-byte: *1.29x as slow* 0.8ms +/- 19.0% 1.0ms +/- 6.6% bits-in-byte: ?? 8.8ms +/- 1.6% 8.9ms +/- 1.7% bitwise-and: - 2.2ms +/- 6.5% 2.2ms +/- 6.5% nsieve-bits: - 7.6ms +/- 2.5% 7.5ms +/- 2.5% controlflow: - 5.0ms +/- 0.0% 4.9ms +/- 1.9% recursive: - 5.0ms +/- 0.0% 4.9ms +/- 1.9% crypto: *1.016x as slow* 35.8ms +/- 1.3% 36.4ms +/- 0.6% aes: 1.048x as fast 21.0ms +/- 2.2% 20.0ms +/- 0.5% md5: *1.147x as slow* 10.0ms +/- 0.7% 11.4ms +/- 1.6% sha1: ?? 4.9ms +/- 2.3% 5.0ms +/- 1.4% date: *1.010x as slow* 43.4ms +/- 0.6% 43.8ms +/- 0.5% format-tofte: *1.014x as slow* 20.7ms +/- 0.8% 21.0ms +/- 0.7% format-xparb: ?? 22.7ms +/- 0.8% 22.8ms +/- 0.7% math: 1.024x as fast 26.1ms +/- 0.6% 25.5ms +/- 0.7% cordic: - 8.3ms +/- 2.2% 8.1ms +/- 1.6% partial-sums: ?? 12.0ms +/- 0.8% 12.1ms +/- 0.8% spectral-norm: 1.088x as fast 5.8ms +/- 2.8% 5.3ms +/- 3.3% regexp: - 18.5ms +/- 1.0% 18.3ms +/- 1.0% dna: - 18.5ms +/- 1.0% 18.3ms +/- 1.0% string: *1.012x as slow* 118.4ms +/- 0.4% 119.8ms +/- 0.3% base64: *1.045x as slow* 5.9ms +/- 1.6% 6.2ms +/- 2.5% fasta: *1.034x as slow* 20.3ms +/- 0.9% 21.0ms +/- 0.3% tagcloud: - 32.4ms +/- 0.6% 32.3ms +/- 0.6% unpack-code: *1.007x as slow* 48.6ms +/- 0.4% 48.9ms +/- 0.4% validate-input: ?? 11.2ms +/- 1.3% 11.3ms +/- 1.5% This gets 6 ms slower, mostly due to 3d-raytrace and crypto-md5. If we can make these faster this patch might actually be a win...

Comment 39

•

15 years ago

A more conservative value like 40 works even better. V8 drops to 2197 ms, a ~100 ms win. SS goes to 388.3 ms, a 2 ms win. 3d-raytrace becomes 4 ms faster instead of 6 ms slower. Tracing fewer loops might also make tuning a bit easier.

Updated

•

15 years ago

Blocks: 576688

Updated

•

15 years ago

Blocks: 576687

Updated

•

15 years ago

Blocks: 594249

Comment 40

•

15 years ago

I've determined that Firefox 4b7pre on Mac OS 10.6 still uses ~20% CPU than Google Chrome 6 with the GameBoy Color emulator ( http://grantgalitz.org/gameboy/ ) when the graphics have been taken out as a CPU usage factor.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 41

•

15 years ago

Sorry, I meant ~20% more CPU usage. Should help you understand my last comment.

Assignee

Updated

•

15 years ago

Assignee: dvander → wmccloskey

OS: Linux → All

Hardware: x86_64 → All

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

•

15 years ago

Depends on: 596364, 599214, 600016, 600414

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 42

•

15 years ago

Attached patch patch to dynamically decide whether to trace a loop (obsolete) — Details — Splinter Review

This patches uses a different approach from the previous two. For the most part, it appears to work much better. Rather than decide whether to trace by looking at the bytecode of a single function, as the previous patches did, this one profiles an entire trace to decide if it's worthwhile. It hooks into the interpreter in the same way that the trace recorder does. When a loop has executed enough iterations in the method JIT, we switch to the interpreter to profile it for one iteration. Profiling keeps track of which opcodes are executed and the types of the operands. It also tries to predict the eventual branchiness of the trace. When it's done, it decides whether the trace is likely to do well based on some of the same factors as in the last patches: - Some features cause us never to trace the loop: recursion, eval, more than three (dynamically) nested loops, or loops that are known to have few iterations. - If the loop, or any loop nested inside it, is too branchy (meaning lots of conditionals or polymorphic calls), then it will probably take a long time to compile so we don't trace. The branchiness computation is unusual. It is an estimate of the number of instructions we'll have to compile, taking into account the fact that branches may cause one instruction to be compiled many times. It is computed roughly as follows. If we see a branch, then there's a good chance that we'll have to compile all the instructions after the branch twice. For an n-way polymorphic call, we estimate n compilations. If the number of instructions to compile is drastically larger than the number of instructions profiled (by a factor of 50) then tracing is not allowed because we would spend too much time compiling. The number 50 is arbitrary, although it works well for these benchmarks. Fundamentally, this number should not be a constant. The longer the benchmark runs, the more time we have to compile. I have some ideas to fix this issue, but they would take time to implement. Right now, 50 seems to work decently well. If none of these bad features above appear in the profile, then we decide whether to trace based on the instruction mix. The following test is used: goodOps = (#FLOAT + #BIT + #INT)*10 + (#CALL + #NEW)*20 trace if goodOps >= total number of ops profiled where #FLOAT is, for example, the number of arithmetic operations with double operands. This formula is somewhat arbitrary, but it seems to work pretty well in practice. It's based on the idea that the tracer is good with arithmetic because it can specialize types, and it's good with function calls because it can inline. If we decide not to trace a loop, we blacklist it so that the method JIT no longer calls into the tracer for each iteration. If we decide to trace it, then we do recording and compilation as normal. When we decide to trace an outer loop, we immediately un-blacklist all its inner loops so that they can be traced as well. There's still some debugging code in there, which I find useful. I'm not sure whether to leave it in (adding some TMFLAGS option) or just remove it. I'll post performance data in the next comment.

Attachment #469995 - Attachment is obsolete: true

Attachment #470964 - Attachment is obsolete: true

Attachment #479499 - Flags: review?(dmandelin)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 43

•

15 years ago

Attached file performance results for profiling (obsolete) — Details

This attachment shows performance data for the above patch. It works pretty well on SS and V8, exceeding the performance of -m, -j, and -m -j. It works okay on Kraken, but -j and -m -j beat it. I also included some other benchmarks near the bottom, including the JSNES benchmark and some JS1K demos. It does quite well on these. There are certainly a lot of improvements that could be made to tune the patch more, but I feel like it's in good enough shape for people to start trying it out and finding problems.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 44

•

15 years ago

Comment on attachment 479502 [details] performance results for profiling >The -mjp column shows the running time for the new code. The next three columns use >TM tip. The delta column shows the difference between -mjp and min(-m, -j, -mj). >Negative values are bad. Positive values occur when we chooose exactly the right set >of loops to trace and do better than -m, -j, or -mj. > >--- Sunspider --- >Test -mjp -m -j -mj delta V8 >... >TOTAL 267.8 281.1 372.8 289.0 -12.0 253.6 would mean "delta" should be: min(m,j,mj) - mjp = 281.1 - 267.8 = 13.3 Each line looks correct, however. Numbers off or missing somewhere?

Assignee

Comment 45

•

15 years ago

(In reply to comment #44) > would mean "delta" should be: > min(m,j,mj) - mjp = 281.1 - 267.8 = 13.3 > > Each line looks correct, however. Numbers off or missing somewhere? The delta value in the TOTAL row is just the sum of the delta values above it.

Comment 46

•

15 years ago

(In reply to comment #45) I see. The sum of the deltas is -12.0, but the overall total is +13.3 for mjp vs. m|j|mj. In other words, mjp outperforms any of the 3 in the full test set, but not really on many individual tests itself. The overall total just isn't shown with these tables. There are however some other lines that are odd, for example access-nsieve shows a delta of -0.0 but it looks like it should be -0.1. Looks like rounding issues are at play here.

Comment 47

•

15 years ago

Comment on attachment 479499 [details] [diff] [review] patch to dynamically decide whether to trace a loop I really like the general approach. It incorporates most of we know so far about what traces well, and it seems more robust than the machine learning idea. It looks like we still need to figure out some thing for Kraken, but the SS/V8 results are excellent. Important review comments: 1. Can you explain how the memory management for the LoopProfileMap works? It looks like you are allocating the profiles along with the traces, and then letting Nanojit flushes clean them up, but it's not obvious to me how the lifetimes match up or what the invariants are. We should probably make a block comment explaining how it works. 2. Please add 1-line comments for all new data members, especially in the loop profile struct. 3. Similarly, please add comments for the non-obvious new functions. For example, "PCWithinLoop" is obvious to me, but "isCompilationExpensive" and the meaning of its depth parameter are not. 4. Calling ProfileLoopEdge even when we are not profiling, and then delegating to MonitorLoopEdge, seems weird to me. I think the entry point should still be named "MonitorLoopEdge", which will mean "do something with this loop edge related to tracing". That function will then decide whether to profile, record, or do nothing. Ideally, that entry point would just be a dispatch function and all the intelligence would live in functions specialized for the profiling or recording. 5. I don't like the open-coded finding of the profile struct in ProfileTracePoint. We have a history of open-coded caches, and they always seem to be a source of confusion and bugs. There is a function to find a profile--that one should be used, and other new ones added as needed. Also, the #if'd part about JS_MONOIC there is non-obvious to me, so please add some comments explaining what that is needed for. 6. How about a new name for "GetCallsitePolymorphism"? (I am on a one-man quest to purge the word "polymorphic" from computer science.) Maybe something like "GetCurrentTargetCount". Minor comments: 7. I saw a "JS_TRACE_MONITOR(cx).profile" but I think you also have an API "TRACE_PROFILER(cx)" for that. 8. ReadInitialBlacklist looks like debugging code, which we normally don't keep in the tree. 9. The profiler counts double/double arithmetic and int/int arithmetic, but it doesn't seem to count double/int arithmetic as an arithmetic op. Was this intentional? 10. "for (int i=int(numInnerLoops)-1; i>=0; i--) {" needs more spaces. On format nits, I also saw an if statement where the "then" part needs to be moved to the next line. 11. Why doesn't PICPCComparator just return |pc - pic->pc|? 12. " /* Locked and loaded with a recorder. Ask the interperter to go run some code. */" looks like a copy/paste error.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Updated

•

15 years ago

Depends on: 600764

Assignee

Comment 48

•

15 years ago

Attached file updated performance results (obsolete) — Details

I found a fix for the Gaussian blur problem. It had to do with Bug 600414. Here are the updated performance results. I think everything is mostly the same, except that -mjp (the version with profiling) is now only 0.6% slower than -mj on Kraken.

Attachment #479502 - Attachment is obsolete: true

Comment 49

•

15 years ago

Ok, so I tested the JS GameBoy Color emulator without the LCD being drawn out to. On the latest Firefox 4b7pre as of this post, the CPU usage actually is somewhere around 57% when a ROM is just generating NOPs to keep the emulated CPU running. In comparison, Chrome 7 drops to around an average of 36% on the exact same computer. So jaegermonkey should still have a consider way to go before catching up to chrome. This seems to match my earlier "guesstimate" of around a 25% CPU difference between FF4b7pre and the latest dev channel copy of chrome 7 (57 - 36 = 21, so 4% off isn't that bad considering the scale of the total difference). Both being on Mac OS 10.6 Intel.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Updated

•

15 years ago

Blocks: 602479

Assignee

Updated

•

15 years ago

Depends on: 603017

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 50

•

15 years ago

Attached patch updated patch — Details — Splinter Review

Here is the latest version of the patch. I've fixed the things Dave pointed out and made a few changes to improve performance.

Attachment #479499 - Attachment is obsolete: true

Attachment #482604 - Flags: review?(dmandelin)

Attachment #479499 - Flags: review?(dmandelin)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 51

•

15 years ago

Attached file updated performance results — Details

And here are the performance results for the latest patch.

Attachment #479981 - Attachment is obsolete: true

Comment 52

•

15 years ago

The results look great! The patch doesn't apply cleanly to tip--could you rebase it?

Comment 53

•

15 years ago

Comment on attachment 482604 [details] [diff] [review] updated patch >-/* Minimum number of times a loop must execute, or else it is blacklisted. */ >-#define MIN_LOOP_ITERS 8 >+/* >+ * If after running a trace MIN_LOOP_EXECS times, it hasn't done MIN_LOOP_ITERS >+ * iterations, we blacklist it. >+*/ >+#define MIN_LOOP_ITERS 200 >+#define MIN_LOOP_EXECS 10 MIN_LOOP_EXECS is probably not the right name. Maybe CHECK_LOOP_EXECS? ITER_CHECKPOINT? >+ JSStackFrame *fp = cx->fp(); >+ const char *prefix = ""; >+ if (iters == LOOP_COUNT_MAX) >+ prefix = ">"; >+ debug_only_printf(LC_TMMinimal, " [%.3f ms] Tree at line %u executed for %s%u iterations;" >+ " executed %u times; leave for %s at %s:%u (%s)\n", >+ double(t1-t0) / PRMJ_USEC_PER_MSEC, >+ f->treeLineNumber, prefix, (uintN)iters, f->execs, >+ getExitName(lr->exitType), >+ fp->script()->filename, >+ js_FramePCToLineNumber(cx, fp), >+ js_CodeName[fp->hasImacropc() ? *fp->imacropc() : *cx->regs->pc]); >+ This whole chunk should be within |#ifdef DEBUG|. >+ /* >+ * We try to keep a pointer to the loop profile inside the TRACE IC. >+ * We also keep a pointer inside a hashtable for when we need to >+ * look up nested loops (or when ICs are disabled). >+ * >+ * Memory for the profile is allocated in the dataAlloc for the >+ * trace monitor. Since this thing can get flushed periodically, >+ * we use epochs to decide if the profile in the MIC is valid, as >+ * follows. Every time the trace monitor is flushed, >+ * |tm->flushEpoch| is incremented. When storing the profile in >+ * the IC, we store the current |tm->flushEpoch| along with it. >+ * Before pulling a profile out of the IC, we check that its >+ * stored epoch is still up-to-date with |tm->flushEpoch|. >+ * This ensures that no flush has happened in between. >+ */ Love it. >+static inline bool >+PCWithinLoop(JSScript *script, jsbytecode *pc, LoopProfile *prof) >+{ >+ return script != prof->script || (pc >= prof->top && pc <= prof->bottom); >+} >+ >+static inline bool >+PCWithinLoop(JSScript *script, jsbytecode *pc, LoopProfile::InnerLoop &loop) >+{ >+ return script != loop.script || (pc >= loop.top && pc <= loop.bottom); >+} It's minor, but these should share code, by delegating to a helper function or using a template on the type of the last argument. Also, question about the first disjunct: is that intended to make this return true when the PC is in a different function that is called from the given loop? And is that conservative, or does it happen to be exact here? This is non-obvious enough to need a comment after all. :-) r+ with 4 indicated fixes.

Attachment #482604 - Flags: review?(dmandelin) → review+

Comment 54

•

15 years ago

Minor nitpick, if you don't mind. The new options don't mention the JIT: javascript.options.profiling.chrome javascript.options.profiling.content This would be clear as to what's being profiled, and would quickly show all 6 on filter of "jit" in about:config: javascript.options.jitprofiling.chrome javascript.options.jitprofiling.content

Updated

•

15 years ago

Attachment #461848 - Attachment is patch: false

Updated

•

15 years ago

Blocks: 604905

Sean Stangl [:sstangl]

Comment 55

•

15 years ago

The patch as pushed in http://hg.mozilla.org/tracemonkey/rev/55597c32701d breaks compilation with --disable-tracejit: ../jscntxt.cpp: In member function ‘void JSContext::updateJITEnabled()’: ../jscntxt.cpp:2304: error: ‘traceJitEnabled’ was not declared in this scope

Updated

•

15 years ago

Blocks: 605486

Comment 56

•

15 years ago

The JM+TM output on arewefastyet.com looks to need an update to use profiling.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Updated

•

15 years ago

Blocks: 531118

Assignee

Comment 57

•

15 years ago

http://hg.mozilla.org/tracemonkey/rev/339457364540

Whiteboard: fixed-in-tracemonkey

Updated

•

15 years ago

Blocks: 497458

Updated

•

15 years ago

Depends on: 606483

Updated

•

15 years ago

Depends on: 606494

Updated

•

15 years ago

Depends on: 606650

Updated

•

15 years ago

Depends on: 606897

Updated

•

15 years ago

Blocks: 606890

Updated

•

15 years ago

Depends on: 607201

http://hg.mozilla.org/mozilla-central/rev/339457364540

Updated

•

15 years ago

Depends on: 607539

Robert Sayre

Comment 58

•

15 years ago

Status: ASSIGNED → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Updated

•

15 years ago

Blocks: 607686

Updated

•

15 years ago

Depends on: 608224

Updated

•

15 years ago

Depends on: 608480

Updated

•

15 years ago

Depends on: 608733

Updated

•

15 years ago

Depends on: 608867

Oleg Romashin (:romaxa)

Updated

•

15 years ago

Depends on: 609206

Updated

•

15 years ago

Blocks: 609212

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

15 years ago

No longer blocks: 609212

Depends on: 609212

Reporter

Updated

•

15 years ago

No longer depends on: 583374

Alon Zakai (:azakai)

Updated

•

15 years ago

Depends on: 610296

Updated

•

15 years ago

No longer depends on: 610296

Updated

•

15 years ago

Depends on: 610580

Updated

•

15 years ago

Depends on: 608490

Alon Zakai (:azakai)

Updated

•

15 years ago

Depends on: 612019

Updated

•

15 years ago

Blocks: 569327