Closed Bug 580468 Opened 10 years ago Closed 9 years ago

JM: Tune Trace JIT Heuristics

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- final+

People

(Reporter: dvander, Assigned: billm)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(6 files, 5 obsolete files)

16.86 KB, text/plain
Details
7.63 KB, text/plain
Details
4.03 KB, text/plain
Details
4.69 KB, patch
Details | Diff | Splinter Review
85.54 KB, patch
dmandelin
: review+
Details | Diff | Splinter Review
4.75 KB, text/plain
Details
We should tune when we use the trace JIT to maximize our combined performance with the method JIT.
Attached file sunspider analysis v1
http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/628d8a832720

Simple idea so far is to manually instrument benchmarks with comments denoting loop bodies. A python script preprocesses these to time individual loops. It then runs the new benchmark suite with -m, -j, and then -m -j, and outputs a chart describing how long each loop took, and which mode is best for that loop.

More fine-grained statistics coming soon.

Attached is the initial spew for SunSpider.
Assignee: general → dvander
Status: NEW → ASSIGNED
I wonder how things correlate to the branchiness of the loops...
dvander, I have the new persistent oracle patch half-way working. I will upload it soon-ish. That will do much better blacklisting and avoid a lot of unnecessary compilation. That should help, too.
bz: That's a good question. I am trying to find a good way to automatically correlate results with information like that, right now it's by hand.

Working on a new script with the ability to selectively enable tracing on any combination of loops in a script. It performs a tree search to find a supposedly optimal result.

The results are kind of hard to interpret at a glance so I'm still playing with it. Here's some samples.

bitops-bitwise-and:
> Method JIT:  2ms
> Tracing JIT: 1ms
> Combined:    1ms
> Search found winning time 0ms!
> Combos at this time: 1
>	loop @ 27-30 traced 100% of the time

This is pretty straightforward. We should always trace bitwise-and.

access-binary-trees:
> Method JIT:  11ms
> Tracing JIT: 22ms
> Combined:    11ms
> Search found winning time 10ms!
> Combos at this time: 1
>	loop @ 41-54 traced 0% of the time
>	loop @ 46-52 traced 0% of the time
>	loop @ 32-58 traced 0% of the time

Another easy one. We should never trace this.

crypto-sha1:
> Method JIT:  4ms
> Tracing JIT: 2ms
> Combined:    2ms
> Search found winning time 2ms!
> Combos at this time: 31
>	loop @ 54-85 traced 54% of the time
>	loop @ 63-77 traced 100% of the time
>	loop @ 119-126 traced 51% of the time
>	loop @ 158-162 traced 100% of the time
>	loop @ 172-176 traced 35% of the time
>	loop @ 186-193 traced 32% of the time
>	loop @ 203-218 traced 41% of the time
>	loop @ 209-216 traced 58% of the time
>	loop @ 237-242 traced 25% of the time

This seems to be saying that for a bunch of loops are dontcare for tracing, but others are worthwhile.

Patch & more data tomorrow. I think v8 will be more interesting, since it's longer running.
Attached file exhaustive search
http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/0de8f6369f8d

Initial script only does exhaustive searches, which doesn't work for a few SS tests, and all of v8. Work on that forthcoming. For now, attached is the exhaustive analysis where we can do it.

Method JIT alone got 376ms. Tracing JIT alone got 382ms. Combined, as-is, got 358ms.

Optimal result is 326ms.
Attached file MCTS results
Results for the four remaining interesting benchmarks, using a random search. I've no idea how accurate this is, but the results seem in line with the exhaustive searches on other tests (sometimes with more confidence).

Method JIT total:    107ms
Tracing JIT total:   187ms
Combined total:      151ms
Estimated optimum:   105ms

This seems to be saying that cube, raytrace, fannkuch, and aes do not trace well. My takeaway lesson from SunSpider is that we should give up on loop nesting very quickly if it's not working out. It seems to never be worth it.

v8 tomorrow.
> My takeaway lesson from SunSpider is that we should give up on loop
> nesting very quickly if it's not working out. It seems to never be worth it.

I wonder if only using tracing for fragments that self-loop would be effective.
I've been going over the SS data to try to make a start on simplifying the problem. Relative to the current JM+TM setup, where we try to trace everything, the biggest wins we can get are by not tracing anything in the following benchmarks:

  access-fannkuch           32 ms
  3d-raytrace               13
  crypto-aes                 9
  bitops-nsieve-bits         6
  date-formate-tofte         6

Features that jumped out at me as possibly being useful to distinguish these:

* fannkuch
    - (Note: It's probably fine/good to trace the first loop, a simple array init)
    - triply-nested loop within one function
    - loops inside conditionals
    - the outer loop has a total of 13 control-flow constructs inside it: 8 loops and 5 ifs. 

* 3d-raytrace
    - doubly-nested loops within one function
    - loop containing a call to a method with the same name as the current function -- this suggests recursion inside a loop, or a callout to a method in the inheritance hierarchy that might loop
    - loop with 6 function calls and 2 ifs

* crypto-aes
    - (Note: tracing the first 3 loops is fine)
    - doubly-nested loops within one function
    - (Note: there are some other loops that look like they should trace well to me, but the best times seem into indicate they should not trace; not sure what's going on here.)

* bitops-nsieve-bits
    - doubly-nested loops within one function

* date-format-tofte
    - loop containing a return statement
    - loop containing a call to eval
    - loop containing 4 method calls

Note that an equally important analysis that I have not done yet is to identify the loops that are most important *to* trace and their characteristics.

*Tentative* summary based on the above, with caveat regarding one-sided analysis so far:

  * Strong indicators of loops not to trace:
     - statically detectable nesting
     - call to eval

  * Weaker indicators of loops not to trace:
     - large number of ifs
     - large number of calls, especially method calls |a.f()| and
                                         maybe-recursive calls
     - return statements
Just curious, would it help to change the hotloop constant, the number of iterations before recording? E.g. a higher value could prevent tracing certain loops with few iterations?
Now, the biggest wins available by tracing some or all loops relative to not tracing anything:

  date-format-xparb          trace loops at 363, 422              6 ms
  math-partial-sums          trace loop at 12                     6
  string-base64              trace all                            6
  access-nbody               trace loops at 102, 105, 127, 178    5
  math-spectral-norm         trace loops at 12, 15, 25, 28        5
  bitops-3bit-bits-in-byte   trace all                            5
  string-validate-input      trace all                            4
  bitops-bits-in-byte        trace all                            3
  bitops-bitwise-and         trace all                            3
  crypto-sha1                trace loop at 63                     2

In general, the magnitudes of the times so far suggest that it's most important to avoid tracing just a few really bad loops that don't trace. Conversely, it's probably OK to lose an opportunity to trace a loop here or there--the loss is smallish.

Observations on the loops we really want to trace:

* date-format-xparb
  - loop with a total of 3 operators, including loop test
  - loop identical to one in tofte that we don't want to trace: 4 function calls, fairly short overall

* math-partial-sums
  - no control flow, no array access, only functions/props called are in Math.
  - lots of arithmetic ops
  - (The loop we don't need trace here is an outer loop, but in a different
     function from the inner loop).

* string-base64
  - no control flow or function calls, lots of arithmetic, some array reads
  - 3 ifs, no function calls, lots of arithmetic, some array reads

* access-nbody
  - a doubly-nested loop; outer loop has only one statement besides the inner loop; lots of arithmetic, no function calls other than Math
  - single loop with mostly arithmetic
  - single loop with one-line body that calls a function

* math-spectral-norm
  - doubly-nested loops; small bodies for both loops; mostly arithmetic; inner loop calls a function that is just one arithmetic expression

* bitops-3bit-bits-in-byte
  - doubly-nested loop; contains a function call that is mostly arithmetic

* string-validate-input
  - loop with 4 function calls; 2 are one-line functions; there is nesting--inner loops are small, so are outer loops

* bitops-bits-in-byte
  - triply-nested loop; mostly arithmetic

* bitops-bitwise-and
  - mostly arithmetic

* crypto-aes
  - single loop; calls several functions that are short arithmetic kernels

Summarizing, the key factors that indicate a loop we want to trace are:

  - lots of arithmetic
  - short loop bodies
  - few if statements
(In reply to comment #9)
> Just curious, would it help to change the hotloop constant, the number of
> iterations before recording? E.g. a higher value could prevent tracing certain
> loops with few iterations?

That might help, but I suspect we need to be adaptive about it: it looks like some loops you can tell will or will not trace immediately, but other loops may be expected to pay off only for more iterations.
Putting comment 8 and comment 10 together:

** Basic features that indicate what to trace

* If a loop contains one of these, do not trace it:

  - more than 2 nested loops (not including the outer loop; at any level)
  - call to eval
  - return out of loop

* If a loop contains these, probably do not trace it:

  - more than 5 conditionals
  - maybe-recursive calls (calls to the same name/prop name)

* If a loop contains these, question tracing it:

  - more than 4 function calls; methods seem to hurt tracing more than global
    function calls

* If a loop has these characteristics, do trace it:

  - short loop body
    (What does this mean? Not sure yet, but "2 lines of code" or "5 or fewer
     operators" are a reasonable starting place.)
  - mostly arithmetic
    (What does this mean? Not sure yet, but it's mostly about the proportion
     of arithmetic operations to other kinds.)

** Additional notes

- We can often guess a function call target statically and be right. This could be useful: if the targets to functions called from a loop have no loops or short loop bodies and are mostly arithmetic, then that loop probably traces well; if the functions have lots of control flow or are big, then the loop probably doesn't trace.

- It looks like we want to be adaptive about nested loops. We could say "never trace nested loops" and not give up too much. But some loop nests trace pretty well. Maybe we can figure all that out statically, but we might be able to tell better when we're ready to try an outer loop by how the inner loop is tracing.
dmandelin, great analysis.

You haven't mentioned crypto-md5.  It's most distinctive feature is that the main loop is enormous -- almost 20,000 LIR instructions get put into the writer pipeline.  (In comparison, most traced fragments have 10s or 100s of LIR instructions.)  In one respect it's an excellent loop to trace, because it's nothing but arithmetic.  But the compile-time cost is really high, and the loop isn't run enough times to make up that cost.  (I'd love to know how we compare against V8/Nitro if this benchmark checksummed 10x or 100x more text.)

Not tracing this loop seems the right decision.  It contains many more than 4 function calls, so the heuristics in comment 12 would do the right thing.
(In reply to comment #13)
> dmandelin, great analysis.

Thanks.

> You haven't mentioned crypto-md5.  It's most distinctive feature is that the
> main loop is enormous -- almost 20,000 LIR instructions get put into the 
> writer pipeline.  (In comparison, most traced fragments have 10s or 100s of 
> LIR instructions.)  In one respect it's an excellent loop to trace, because 
> it's nothing but arithmetic.  But the compile-time cost is really high, and 
> the loop isn't run enough times to make up that cost.  (I'd love to know how 
> we compare against V8/Nitro if this benchmark checksummed 10x or 100x more 
> text.)

I left that one out because we only lose 3 ms by tracing it when we shouldn't. But I see that it does raise a point that may matter on the web more generally. 

I see that the main giant loop (in core_md5, right?) runs 3967 times, which I would expect would give a nice speedup and be well worth tracing. Is the trace compile time linear in the length of the loop body, or can it grow faster (either asymptotically, or at thresholds or something)? If linear, then I'm really surprised. Otherwise it suggests some kind of "don't trace 'giant loops'" element to the heuristics.

> Not tracing this loop seems the right decision.

Yes, that's what David's stats say as well. It looks like the other loops in md5 are don't cares.

That reminds me of something I forgot to say earlier--many loops in SS are don't cares between tracejit and methodjit. I think that's a good thing, because it gives us a lot of freedom in the decision criteria to really get it right for the loops that matter.

> It contains many more than 4 function calls, so the heuristics in comment 12
> would do the right thing.

Another thing I forgot to mention earlier--I'm hoping those heuristics can actually be turned into 'features' that can then be plugged in to a machine learning algorithm to compute the best formula based on a training set.
The data set for crypto-md5 was cut down from an older (C?) benchmark when Maciej build SunSpider, because anything larger was too slow in IE at the time. Truly a bogus benchmark, and it may change (someone, perhaps not Apple, will do "Annuals" of SunSpider). Any way to take into account larger input size justifying tracing?

/be
(In reply to comment #15)
> The data set for crypto-md5 was cut down from an older (C?) benchmark when
> Maciej build SunSpider, because anything larger was too slow in IE at the time.
> Truly a bogus benchmark, and it may change (someone, perhaps not Apple, will do
> "Annuals" of SunSpider). Any way to take into account larger input size
> justifying tracing?

Could we use information collected by Nick's loop bounds elimination to guess at the number of iterations and use that as another input to the decision?
I have an idea to make the code in md5 trace faster. Gimme a day.
(In reply to comment #17)
> I have an idea to make the code in md5 trace faster. Gimme a day.

Bug 575529 already speeds it up by 1.25--1.50x.  I haven't landed that patch because it still needs some work but it also hasn't seemed that important so far... because it is such a bogus benchmark I'm reluctant to land patches that will only help it and nothing else.
Regarding interval analysis and real-world testcases, it might have a good impact on some of the "emulators are slow" complaints, like bug 509986 or bug 514765.  I would bet the hot core of an 8 bit emulator has a lot of integer ALU operations that can be statically shown to have 8 or 16 bit results...
(In reply to comment #19)
> Regarding interval analysis and real-world testcases, it might have a good
> impact on some of the "emulators are slow" complaints, like bug 509986 or bug
> 514765.  I would bet the hot core of an 8 bit emulator has a lot of integer ALU
> operations that can be statically shown to have 8 or 16 bit results...

Both of those should be greatly helped by JM even without any interval analysis.
Use some runtime rdtsc profiling to decide whether to trace or not. A quick attempt shows promise for v8, where earley-boyer traces 3X slower.

http://pastebin.mozilla.org/762023

I'm going to keep this in mind but look to see if we can find deterministic heuristics first. Timing dependent results could make reproducibility and debugging very painful.
http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/9e7fa574c491

Two small heuristics. First, increase HOTLOOP to 4. We've wanted to do this for a while. The second is to blacklist a loop after it runs 300 times.

I'm not too sure on the reliability of the latter, though the two combined show a 26% perf win for v8 and help mitigate the perf we're currently losing on SunSpider by 2.6%. I tried counting opcodes and using that as a metric, but it was highly unreliable.
For v8, it's important to note that by making two of the benchmarks much faster, we took 5-10% hits across the board elsewhere.
I am pretty sure my improved oracle will help a ton here. Should be ready soon, just have to clear out some patches from my queue.
The biggest loss right now is access-fannkuch. In JM it's okay to finally speculate on integers in arrays, without Oracle changes:

http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/4e1c23ed3b2d

The method JIT seems to dampen the losses we previously observed by making this change in TM. We're still losing ~5ms on a few benchmarks here and there, but combined is 6% faster than TM on the graphs, and 2.5% slower than JM alone.
I've spent some time looking this over. The good news is that there seem to be a few simple tests we can do to decide whether to trace a given SunSpider benchmark. These are the tests:
- Are there any short loops of the form |for (...; x<c; ...)|, where |c| is a constant <= 8?
- Are there any triply nested loops?
- Is eval called within a loop?
If any of these hold, then we should not trace. Otherwise, trace.

I've attached a patch that performs these tests on bytecode and then blacklists entire JS files from the tracer. With the patch, our SunSpider score drops from 624ms to 564ms on my x86 box. The patch is pretty preliminary, but I'm hoping to get some feedback about where to put the hooks for it, as well as whether this sort of filename-based classification is acceptable.

It would be a lot nicer to make the trace decision on a loop-by-loop basis. This seems to be a lot more difficult because the decision to trace one loop depends on whether you trace nearby loops. In some cases, tracing one loop and not another can lead to really pathological performance (6x slower in one case). I'll keep working on this.
(In reply to comment #26)
> It would be a lot nicer to make the trace decision on a loop-by-loop basis.
> This seems to be a lot more difficult because the decision to trace one loop
> depends on whether you trace nearby loops. In some cases, tracing one loop and
> not another can lead to really pathological performance (6x slower in one
> case). I'll keep working on this.

Can you clarify this? Specifically:

1. What is the relationship between the loops that affect each other that way? Nested, side by side, etc?

2. What exactly is the effect you observed? Is the effect simply that tracing a certain loop L1 gives a perf loss, but tracing L1+L2 is a win? Or are there more things going on?
Oh, forgot to say, very promising results!
(In reply to comment #26)
> 
> It would be a lot nicer to make the trace decision on a loop-by-loop basis.

Indeed.  I'm sure the file-by-file basis works well for SunSpider because they're mostly so small, but as JS programs get bigger I suspect finer-grained tracing will be more important.
THe improved oracle patch really helps here. Basically if you make a mistake you detect it and won't ever do it again. Benchmarks often run repeatedly, so after the first time we are set until the HTTP cache is flushed.
This patch tries to classify on a loop-by-loop basis. The classification is done using a support vector machine. The parameters for the machine are in jsanalysis.data in the patch. This patch gets about the same performance boost as the previous one on SunSpider, although it's spread out a little differently between the benchmarks. In some cases, we're able to get better performance than either -m or -j by tracing just the right set of loops.

To answer Dave's questions: (1) the biggest performance effect by far is due to nested loops, and (2) there are cases where tracing one loop and not another can drop you into the interpreter for the rest of the function. dvander is looking into the second issue.

I'm not content with this patch. I want to make sure that it doesn't ruin performance on other benchmarks. I'd also like to extract more features from the loops so that we don't overtrain on the benchmarks.
(In reply to comment #31)
> I'm not content with this patch. I want to make sure that it doesn't ruin
> performance on other benchmarks.

The Leave-One-Out loss used in the python training script should be a good indication of whether we are at risk of bad performance on other benchmarks.

Even that can be biased, if the current benchmarks trained on are very similar to each other. To some degree that can be fixed by leaving more than one out (for example, train on one family of benchmarks, test on another, and vice versa).
(In reply to comment #32)
> (In reply to comment #31)
> > I'm not content with this patch. I want to make sure that it doesn't ruin
> > performance on other benchmarks.
> 
> The Leave-One-Out loss used in the python training script should be a good
> indication of whether we are at risk of bad performance on other benchmarks.
> 
> Even that can be biased, if the current benchmarks trained on are very similar
> to each other. To some degree that can be fixed by leaving more than one out
> (for example, train on one family of benchmarks, test on another, and vice
> versa).

I tried something like this. (By the way, thanks for all your help with SVMs, Alon.) It now takes a while to do SVM training. So rather than do the full leave-one-out analysis, I split the data into a training set and a validation set. On the combined SunSpider/V8 data, the result is that we classify only 47% of the validation loops correctly, which is worse than random.

I also did a few timing experiments. First I trained on SunSpider and tested on V8 and SS. The results (using the SunSpider harness):
  V8 with -m:      5040ms
  V8 with -m -j:   4008ms
  V8 with SVM(SS): 5147ms
  SS with -m:      630ms
  SS with -m -j:   652ms
  SS with SVM(SS): 570ms
So we do well on SS and pretty badly on V8. The notation SVM(x) means running with -m -j and deciding whether to trace using an SVM trained on benchmark x.

Then I trained on V8 and tested on V8 and SunSpider:
  V8 with SVM(V8):  3697ms
  SS with SVM(V8):  669ms
So we do well on V8 and badly on SS.

Finally I trained on V8 and SS together.
  V8 with SVM(SS+V8): 3716ms
  SS with SVM(SS+V8): 570ms
Here we do quite well on both--about a 10% speedup on SunSpider over -m and a 7% speedup on V8 over -m -j.

tl;dr: We're definitely overtraining. We do well on benchmarks that we train for but not very well at all on the rest. To try and fix this, I'm going to investigate doing machine learning on the traces themselves rather than the bytecode.
Are you just using the SunSpider and V8 benchmarks to decide when to trace and when to method jit? There are likely other javascript code that will trace/jit differently.

I'm thinking here things like a jquery-heavy site (for table manipulation, etc.), the SNES and GameBoy emulators in javascript, the javascript SWF flash interpreter ('gordon' IIRC), the IE9 demos (fishtank, Amazon shelf and others) and other things that interact with canvas and/or dom.

It would be interesting to see how the decision to trace or not works out in browser -- possibly being able to see/analyse the data from a browser session through some browser UI (to monitor dom/canvas javascript).

Great work so far.
Interesting results. Regarding overtraining, I wonder what would happen if you excluded the "don't care" loops (e.g., a set of loops where making the wrong choice on those loops gives a perf loss <= N). Maybe part of the overtraining comes from trying to get the just-right result on those.
(In reply to comment #34)
> I'm thinking here things like a jquery-heavy site (for table manipulation,
> etc.), the SNES and GameBoy emulators in javascript, the javascript SWF flash
> interpreter ('gordon' IIRC), the IE9 demos (fishtank, Amazon shelf and others)
> and other things that interact with canvas and/or dom.

Yes, I plan to look at other code soon.

(In reply to comment #35)
> Interesting results. Regarding overtraining, I wonder what would happen if you
> excluded the "don't care" loops (e.g., a set of loops where making the wrong
> choice on those loops gives a perf loss <= N). Maybe part of the overtraining
> comes from trying to get the just-right result on those.

I sort of did that, using a method Alon suggested. For each loop I estimated its importance, roughly as "if I do the wrong thing, how much worse does performance get?" Then in the training data, I used multiple copies of each loop based on its importance. So the wrong answer on an important loop was much more significant than on an unimportant one.
blocking2.0: --- → ?
Yeah, trace jit still needs to be worked on, so that it doesn't slow down method jit's performance. The GameBoy Color emulator seems to be hit when tracejit is turned on when methodjit is already on. But only by a few CPU % points.
blocking2.0: ? → final+
FWIW, I played a bit with different values for HOTLOOP, and 200 seems to work quite well:

TEST              COMPARISON            FROM                 TO

===============================================

** TOTAL **:      1.032x as fast    2293.8ms +/- 0.4%   2222.9ms +/- 0.3%

===============================================

  v8:             1.032x as fast    2293.8ms +/- 0.4%   2222.9ms +/- 0.3%
    crypto:       ??                 293.1ms +/- 2.5%    297.0ms +/- 1.8%
    deltablue:    1.110x as fast     444.6ms +/- 1.8%    400.4ms +/- 0.7%
    earley-boyer: 1.034x as fast     348.4ms +/- 0.1%    337.0ms +/- 0.3%
    raytrace:     1.014x as fast     363.6ms +/- 0.3%    358.6ms +/- 0.3%
    regexp:       1.025x as fast     298.3ms +/- 0.3%    291.0ms +/- 0.2%
    richards:     1.043x as fast     240.4ms +/- 1.2%    230.4ms +/- 0.7%
    splay:        ??                 305.4ms +/- 1.1%    308.5ms +/- 0.5%

It's interesting that even richards gets faster.

Sunspider:

TEST                   COMPARISON            FROM                 TO   

=============================================================================

** TOTAL **:           *1.015x as slow*  390.5ms +/- 0.2%   396.3ms +/- 0.2% 

=============================================================================

  3d:                  *1.065x as slow*   77.0ms +/- 0.6%    82.0ms +/- 0.4%
    cube:              1.024x as fast     26.8ms +/- 0.8%    26.2ms +/- 0.7%
    morph:             -                  17.7ms +/- 1.1%    17.7ms +/- 0.9% 
    raytrace:          *1.174x as slow*   32.4ms +/- 0.6%    38.1ms +/- 0.4%

  access:              1.020x as fast     46.9ms +/- 0.8%    46.0ms +/- 0.4%
    binary-trees:      1.058x as fast      8.5ms +/- 2.2%     8.0ms +/- 0.9%
    fannkuch:          -                  22.1ms +/- 0.6%    22.0ms +/- 0.4% 
    nbody:             1.072x as fast      7.4ms +/- 2.5%     6.9ms +/- 1.4%
    nsieve:            *1.019x as slow*    8.9ms +/- 1.5%     9.0ms +/- 0.8%

  bitops:              ??                 19.4ms +/- 1.5%    19.6ms +/- 1.3%
    3bit-bits-in-byte: *1.29x as slow*     0.8ms +/- 19.0%     1.0ms +/- 6.6%
    bits-in-byte:      ??                  8.8ms +/- 1.6%     8.9ms +/- 1.7%
    bitwise-and:       -                   2.2ms +/- 6.5%     2.2ms +/- 6.5% 
    nsieve-bits:       -                   7.6ms +/- 2.5%     7.5ms +/- 2.5% 

  controlflow:         -                   5.0ms +/- 0.0%     4.9ms +/- 1.9% 
    recursive:         -                   5.0ms +/- 0.0%     4.9ms +/- 1.9% 

  crypto:              *1.016x as slow*   35.8ms +/- 1.3%    36.4ms +/- 0.6%
    aes:               1.048x as fast     21.0ms +/- 2.2%    20.0ms +/- 0.5%
    md5:               *1.147x as slow*   10.0ms +/- 0.7%    11.4ms +/- 1.6%
    sha1:              ??                  4.9ms +/- 2.3%     5.0ms +/- 1.4%

  date:                *1.010x as slow*   43.4ms +/- 0.6%    43.8ms +/- 0.5%
    format-tofte:      *1.014x as slow*   20.7ms +/- 0.8%    21.0ms +/- 0.7%
    format-xparb:      ??                 22.7ms +/- 0.8%    22.8ms +/- 0.7%

  math:                1.024x as fast     26.1ms +/- 0.6%    25.5ms +/- 0.7%
    cordic:            -                   8.3ms +/- 2.2%     8.1ms +/- 1.6% 
    partial-sums:      ??                 12.0ms +/- 0.8%    12.1ms +/- 0.8%
    spectral-norm:     1.088x as fast      5.8ms +/- 2.8%     5.3ms +/- 3.3%

  regexp:              -                  18.5ms +/- 1.0%    18.3ms +/- 1.0% 
    dna:               -                  18.5ms +/- 1.0%    18.3ms +/- 1.0% 

  string:              *1.012x as slow*  118.4ms +/- 0.4%   119.8ms +/- 0.3% 
    base64:            *1.045x as slow*    5.9ms +/- 1.6%     6.2ms +/- 2.5%
    fasta:             *1.034x as slow*   20.3ms +/- 0.9%    21.0ms +/- 0.3%
    tagcloud:          -                  32.4ms +/- 0.6%    32.3ms +/- 0.6% 
    unpack-code:       *1.007x as slow*   48.6ms +/- 0.4%    48.9ms +/- 0.4%
    validate-input:    ??                 11.2ms +/- 1.3%    11.3ms +/- 1.5%

This gets 6 ms slower, mostly due to 3d-raytrace and crypto-md5. If we can make these faster this patch might actually be a win...
A more conservative value like 40 works even better. V8 drops to 2197 ms, a ~100 ms win. SS goes to 388.3 ms, a 2 ms win. 3d-raytrace becomes 4 ms faster instead of 6 ms slower.

Tracing fewer loops might also make tuning a bit easier.
I've determined that Firefox 4b7pre on Mac OS 10.6 still uses ~20% CPU than Google Chrome 6 with the GameBoy Color emulator ( http://grantgalitz.org/gameboy/  ) when the graphics have been taken out as a CPU usage factor.
Sorry, I meant ~20% more CPU usage. Should help you understand my last comment.
Assignee: dvander → wmccloskey
OS: Linux → All
Hardware: x86_64 → All
This patches uses a different approach from the previous two. For the most part, it appears to work much better. Rather than decide whether to trace by looking at the bytecode of a single function, as the previous patches did, this one profiles an entire trace to decide if it's worthwhile.

It hooks into the interpreter in the same way that the trace recorder does. When a loop has executed enough iterations in the method JIT, we switch to the interpreter to profile it for one iteration. Profiling keeps track of which opcodes are executed and the types of the operands. It also tries to predict the eventual branchiness of the trace. When it's done, it decides whether the trace is likely to do well based on some of the same factors as in the last patches:

- Some features cause us never to trace the loop: recursion, eval, more than three (dynamically) nested loops, or loops that are known to have few iterations.
- If the loop, or any loop nested inside it, is too branchy (meaning lots of conditionals or polymorphic calls), then it will probably take a long time to compile so we don't trace.

The branchiness computation is unusual. It is an estimate of the number of instructions we'll have to compile, taking into account the fact that branches may cause one instruction to be compiled many times. It is computed roughly as follows. If we see a branch, then there's a good chance that we'll have to compile all the instructions after the branch twice. For an n-way polymorphic call, we estimate n compilations.

If the number of instructions to compile is drastically larger than the number of instructions profiled (by a factor of 50) then tracing is not allowed because we would spend too much time compiling. The number 50 is arbitrary, although it works well for these benchmarks. Fundamentally, this number should not be a constant. The longer the benchmark runs, the more time we have to compile. I have some ideas to fix this issue, but they would take time to implement. Right now, 50 seems to work decently well.

If none of these bad features above appear in the profile, then we decide whether to trace based on the instruction mix. The following test is used:
  goodOps = (#FLOAT + #BIT + #INT)*10 + (#CALL + #NEW)*20
  trace if goodOps >= total number of ops profiled
where #FLOAT is, for example, the number of arithmetic operations with double operands.

This formula is somewhat arbitrary, but it seems to work pretty well in practice. It's based on the idea that the tracer is good with arithmetic because it can specialize types, and it's good with function calls because it can inline.

If we decide not to trace a loop, we blacklist it so that the method JIT no longer calls into the tracer for each iteration. If we decide to trace it, then we do recording and compilation as normal. When we decide to trace an outer loop, we immediately un-blacklist all its inner loops so that they can be traced as well.

There's still some debugging code in there, which I find useful. I'm not sure whether to leave it in (adding some TMFLAGS option) or just remove it.

I'll post performance data in the next comment.
Attachment #469995 - Attachment is obsolete: true
Attachment #470964 - Attachment is obsolete: true
Attachment #479499 - Flags: review?(dmandelin)
Attached file performance results for profiling (obsolete) —
This attachment shows performance data for the above patch. It works pretty well on SS and V8, exceeding the performance of -m, -j, and -m -j. It works okay on Kraken, but -j and -m -j beat it.

I also included some other benchmarks near the bottom, including the JSNES benchmark and some JS1K demos. It does quite well on these.

There are certainly a lot of improvements that could be made to tune the patch more, but I feel like it's in good enough shape for people to start trying it out and finding problems.
Comment on attachment 479502 [details]
performance results for profiling

>The -mjp column shows the running time for the new code. The next three columns use
>TM tip. The delta column shows the difference between -mjp and min(-m, -j, -mj).
>Negative values are bad. Positive values occur when we chooose exactly the right set
>of loops to trace and do better than -m, -j, or -mj.
>
>--- Sunspider ---
>Test                                     -mjp       -m       -j      -mj    delta       V8
>...
>TOTAL                                   267.8    281.1    372.8    289.0    -12.0    253.6

would mean "delta" should be:
min(m,j,mj) - mjp = 281.1 - 267.8 = 13.3

Each line looks correct, however. Numbers off or missing somewhere?
(In reply to comment #44)
> would mean "delta" should be:
> min(m,j,mj) - mjp = 281.1 - 267.8 = 13.3
> 
> Each line looks correct, however. Numbers off or missing somewhere?

The delta value in the TOTAL row is just the sum of the delta values above it.
(In reply to comment #45)
I see. The sum of the deltas is -12.0, but the overall total is +13.3 for mjp vs. m|j|mj. In other words, mjp outperforms any of the 3 in the full test set, but not really on many individual tests itself. The overall total just isn't shown with these tables.

There are however some other lines that are odd, for example access-nsieve shows a delta of -0.0 but it looks like it should be -0.1. Looks like rounding issues are at play here.
Comment on attachment 479499 [details] [diff] [review]
patch to dynamically decide whether to trace a loop

I really like the general approach. It incorporates most of we know so far about what traces well, and it seems more robust than the machine learning idea. It looks like we still need to figure out some thing for Kraken, but the SS/V8 results are excellent. Important review comments:

1. Can you explain how the memory management for the LoopProfileMap works? It looks like you are allocating the profiles along with the traces, and then letting Nanojit flushes clean them up, but it's not obvious to me how the lifetimes match up or what the invariants are. We should probably make a block comment explaining how it works.

2. Please add 1-line comments for all new data members, especially in the loop profile struct.

3. Similarly, please add comments for the non-obvious new functions. For example, "PCWithinLoop" is obvious to me, but "isCompilationExpensive" and the meaning of its depth parameter are not.

4. Calling ProfileLoopEdge even when we are not profiling, and then delegating to MonitorLoopEdge, seems weird to me. I think the entry point should still be named "MonitorLoopEdge", which will mean "do something with this loop edge related to tracing". That function will then decide whether to profile, record, or do nothing. Ideally, that entry point would just be a dispatch function and all the intelligence would live in functions specialized for the profiling or recording.

5. I don't like the open-coded finding of the profile struct in ProfileTracePoint. We have a history of open-coded caches, and they always seem to be a source of confusion and bugs. There is a function to find a profile--that one should be used, and other new ones added as needed. Also, the #if'd part about JS_MONOIC there is non-obvious to me, so please add some comments explaining what that is needed for.

6. How about a new name for "GetCallsitePolymorphism"? (I am on a one-man quest to purge the word "polymorphic" from computer science.) Maybe something like "GetCurrentTargetCount".

Minor comments:

7. I saw a "JS_TRACE_MONITOR(cx).profile" but I think you also have an API "TRACE_PROFILER(cx)" for that. 

8. ReadInitialBlacklist looks like debugging code, which we normally don't keep in the tree.

9. The profiler counts double/double arithmetic and int/int arithmetic, but it doesn't seem to count double/int arithmetic as an arithmetic op. Was this intentional?

10. "for (int i=int(numInnerLoops)-1; i>=0; i--) {" needs more spaces. On format nits, I also saw an if statement where the "then" part needs to be moved to the next line.

11. Why doesn't PICPCComparator just return |pc - pic->pc|?

12. " /* Locked and loaded with a recorder. Ask the interperter to go run some code. */" looks like a copy/paste error.
Depends on: 600764
Attached file updated performance results (obsolete) —
I found a fix for the Gaussian blur problem. It had to do with Bug 600414. Here are the updated performance results. I think everything is mostly the same, except that -mjp (the version with profiling) is now only 0.6% slower than -mj on Kraken.
Attachment #479502 - Attachment is obsolete: true
Ok, so I tested the JS GameBoy Color emulator without the LCD being drawn out to.
On the latest Firefox 4b7pre as of this post, the CPU usage actually is somewhere around 57% when a ROM is just generating NOPs to keep the emulated CPU running. In comparison, Chrome 7 drops to around an average of 36% on the exact same computer. So jaegermonkey should still have a consider way to go before catching up to chrome. This seems to match my earlier "guesstimate" of around a 25% CPU difference between FF4b7pre and the latest dev channel copy of chrome 7 (57 - 36 = 21, so 4% off isn't that bad considering the scale of the total difference). Both being on Mac OS 10.6 Intel.
Attached patch updated patchSplinter Review
Here is the latest version of the patch. I've fixed the things Dave pointed out and made a few changes to improve performance.
Attachment #479499 - Attachment is obsolete: true
Attachment #482604 - Flags: review?(dmandelin)
Attachment #479499 - Flags: review?(dmandelin)
And here are the performance results for the latest patch.
Attachment #479981 - Attachment is obsolete: true
The results look great!

The patch doesn't apply cleanly to tip--could you rebase it?
Comment on attachment 482604 [details] [diff] [review]
updated patch

>-/* Minimum number of times a loop must execute, or else it is blacklisted. */
>-#define MIN_LOOP_ITERS 8
>+/*
>+ * If after running a trace MIN_LOOP_EXECS times, it hasn't done MIN_LOOP_ITERS
>+ * iterations, we blacklist it.
>+*/
>+#define MIN_LOOP_ITERS 200
>+#define MIN_LOOP_EXECS 10

MIN_LOOP_EXECS is probably not the right name. Maybe CHECK_LOOP_EXECS? 
ITER_CHECKPOINT?

>+    JSStackFrame *fp = cx->fp();
>+    const char *prefix = "";
>+    if (iters == LOOP_COUNT_MAX)
>+        prefix = ">";
>+    debug_only_printf(LC_TMMinimal, "  [%.3f ms] Tree at line %u executed for %s%u iterations;"
>+                      " executed %u times; leave for %s at %s:%u (%s)\n",
>+                      double(t1-t0) / PRMJ_USEC_PER_MSEC,
>+                      f->treeLineNumber, prefix, (uintN)iters, f->execs,
>+                      getExitName(lr->exitType),
>+                      fp->script()->filename,
>+                      js_FramePCToLineNumber(cx, fp),
>+                      js_CodeName[fp->hasImacropc() ? *fp->imacropc() : *cx->regs->pc]);
>+    

This whole chunk should be within |#ifdef DEBUG|.

>+    /*
>+     * We try to keep a pointer to the loop profile inside the TRACE IC.
>+     * We also keep a pointer inside a hashtable for when we need to
>+     * look up nested loops (or when ICs are disabled).
>+     *
>+     * Memory for the profile is allocated in the dataAlloc for the
>+     * trace monitor. Since this thing can get flushed periodically,
>+     * we use epochs to decide if the profile in the MIC is valid, as
>+     * follows. Every time the trace monitor is flushed,
>+     * |tm->flushEpoch| is incremented. When storing the profile in
>+     * the IC, we store the current |tm->flushEpoch| along with it.
>+     * Before pulling a profile out of the IC, we check that its
>+     * stored epoch is still up-to-date with |tm->flushEpoch|.
>+     * This ensures that no flush has happened in between.
>+     */

Love it.

>+static inline bool
>+PCWithinLoop(JSScript *script, jsbytecode *pc, LoopProfile *prof)
>+{
>+    return script != prof->script || (pc >= prof->top && pc <= prof->bottom);
>+}
>+
>+static inline bool
>+PCWithinLoop(JSScript *script, jsbytecode *pc, LoopProfile::InnerLoop &loop)
>+{
>+    return script != loop.script || (pc >= loop.top && pc <= loop.bottom);
>+}

It's minor, but these should share code, by delegating to a helper function
or using a template on the type of the last argument.

Also, question about the first disjunct: is that intended to make this return
true when the PC is in a different function that is called from the given loop?
And is that conservative, or does it happen to be exact here? This is non-obvious
enough to need a comment after all. :-)

r+ with 4 indicated fixes.
Attachment #482604 - Flags: review?(dmandelin) → review+
Minor nitpick, if you don't mind. The new options don't mention the JIT:
javascript.options.profiling.chrome
javascript.options.profiling.content

This would be clear as to what's being profiled, and would quickly show all 6 on filter of "jit" in about:config:
javascript.options.jitprofiling.chrome
javascript.options.jitprofiling.content
Attachment #461848 - Attachment is patch: false
The patch as pushed in http://hg.mozilla.org/tracemonkey/rev/55597c32701d breaks compilation with --disable-tracejit:

../jscntxt.cpp: In member function ‘void JSContext::updateJITEnabled()’:
../jscntxt.cpp:2304: error: ‘traceJitEnabled’ was not declared in this scope
The JM+TM output on arewefastyet.com looks to need an update to use profiling.
Blocks: 531118
Blocks: 497458
Depends on: 606494
http://hg.mozilla.org/mozilla-central/rev/339457364540
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Duplicate of this bug: 594621
Depends on: 608733
Depends on: 609206
No longer blocks: 609212
Depends on: 609212
No longer depends on: 583374
Duplicate of this bug: 583374
Depends on: 610296
No longer depends on: 610296
Depends on: 612019
No longer blocks: 606890
Depends on: 606890
You need to log in before you can comment on or make changes to this bug.