Closed Bug 783008 Opened 12 years ago Closed 12 years ago

IonMonkey: Use better heuristics for recompile checks from JM on behalf of Ion.

Categories

(Core :: JavaScript Engine, defect)

Type: defect
Priority: Not set
Severity: normal


RESOLVED FIXED

People

(Reporter: djvj, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [ion:p2:fx18])

Attachments

(4 files)

In ShouldJaegerCompileCallee, JM makes the decision to re-compile a script using method-jit even if the script has been compiled with Ion.

This leads to a pathological case described below:

1. Functions A and B exist, with A calling B.
2. Functions A and B get compiled with method jit.
3. B's usecount gets high enough that B now gets recompiled with Ion
  * B's methodjit script is discarded
  * A's call IC entry for calling B's JM script is invalidated.
4. A's methodjit script runs and hits the call IC fallback when it calls B
5. A calls ShouldJaegerCompileCallee on B, which returns true.
6. A re-compiles B with methodjit _AGAIN_, even though B has an IonScript, and adds it to its IC.

A prime example of where this plays out is in access-binary-tree.

The top-level script gets compiled with JM, as does |bottomUpTree|.  Then, bottomUpTree gets recompiled with Ion, and its JM code discarded, and the call IC from the top-level script's JM code is invalidated.

On a subsequent call, the |ShouldJaegerCompileCallee| check returns true for bottomUpTree (it contains no loops), so the call IC fallback recompiles |bottomUpTree| with JM.

Changing |ShouldJaegerCompileCallee| to check whether the callee already has an IonScript, and return false in that case, leads to a roughly 5%/4% win on SunSpider 64-bit/32-bit respectively.

64-bit reference vs. patched Sunspider comparison: http://pastebin.mozilla.org/1757733
32-bit reference vs. patched Sunspider comparison: http://pastebin.mozilla.org/1757734
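The early-out described above can be sketched roughly as follows. All type names here are invented stand-ins for the real SpiderMonkey structures; the actual check lives in the method-JIT call IC fallback path:

```cpp
#include <cassert>

// Stand-in for the callee's script state; hypothetical fields.
struct Script {
    bool hasIonScript; // an IonScript has already been compiled for it
    bool hasLoops;     // input to the original loop-based heuristic
};

// Returns true when the JM call IC should (re)compile the callee with
// the method JIT. The proposed change: never do so if Ion code already
// exists, so the caller calls into the IonScript instead.
bool ShouldJaegerCompileCallee(const Script &callee) {
    if (callee.hasIonScript)
        return false;        // prefer the existing Ion code
    return !callee.hasLoops; // original heuristic: loop-free => JM
}
```

This is only a sketch of the policy, not the real signature; the point is the new first test, which prevents step 6 of the pathological sequence above.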

However, speaking with jandem about this on IRC we came to the conclusion that the general policy approach surrounding JM -> Ion calls should be investigated.



Note the huge win on binary-trees, but slowdowns everywhere else.  Binary-trees wins big because using the Ion script really helps: |bottomUpTree| is recursive, and once it gets past that initial expensive JM -> Ion call, it spends a lot of time in Ion code.

I suspect that in the other cases, there are a lot of smallish functions with high call counts that, when changed from JM -> JM calls to JM -> Ion calls, incur enough of an extra call overhead that it becomes a net loss.

V8 shows a net slowdown of about 3% with this change, which I think is because of this.
Kraken shows a slight win (0.5% or so) on 64-bit and no change in 32-bit.
Slightly embarrassing, but the fantastic results posted above are misleading, mainly because the bench I was using had a modification to the loop counters in binary-trees, causing the overall runtime of binary-trees to be greatly overrepresented in the benchmark and skewing the results.

However further investigation did yield some promising insights.

DESCRIPTION OF BEHAVIOUR:

Profiling SS, it turns out that in a number of cases we end up adding recompile checks to small JS functions compiled with methodjit.  These small functions are called frequently and are ion-compileable, so the methodjit generates a useCount increment for them and invalidates for recompile after about 10k iterations.

Examples include:
ShiftRows(), AddRoundKey(), SubBytes() in ss-crypto-aes
safe_add(), md5_cmn() in ss-crypto-md5
safe_add(), rol(), sha1_ft(), sha1_kt() in ss-crypto-sha1
arrayExists() in date-format-tofte
String.leftPad() in date-format-xparb

A lot of these tiny functions are worth compiling only in the case where they're called frequently from an ion-compiled caller.  In the other cases, ion-compiling them is a net loss because:

1. They run extremely quickly, so the delta between the ion-compiled code and the methodjit-compiled code is small.
2. We will spend a relatively high amount of time ion-compiling them.
3. We will invalidate working JM code (with recompile check) to force an ion recompile
4. Oftentimes we will then RE-compile the same code with JM, AGAIN, because the ShouldJaegerCompile heuristic in the JM call ICs returns true (e.g. for functions with no loops).

This leads to a number of circumstances where we do premature Ion compiles, and we invalidate good working JM code to do so.


APPROACH:

Part 1
------

We address the "premature invalidation of JM code" by changing the MaybeIonCompileable utility function used by recompileCheckHelper to distinguish between two cases:

1. JIT code that wants to invalidate itself after a certain number of runs (i.e. to recompile with JM inlining, or to force an attempted Ion recompile).

2. JIT code that still wants to keep track of useCounts, because we may ion-compile later, but doesn't want to explicitly invalidate itself for that purpose.

Currently, both cases are handled the same: useCount is incremented, and the mjit code is invalidated if the recompile-check limit is reached.  The new behaviour changes this so that in the second case we increment useCount, but skip the invalidation-for-recompile when the useCount reaches the threshold.


By keeping the useCount increments active in the second case, we allow later Ion => JM calls (which go through the general interpreter mechanisms for entering mjit code, and thus check to see if the function can be ion-compiled), to still re-compile the function if it turns out that it's getting called frequently from other ion-compiled functions or the interpreter.

However, it also greatly reduces the situations where we invalidate working JM code.
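The two-case split above can be sketched as a small tri-state decision. All names here are invented stand-ins; the real logic sits in mjit's recompileCheckHelper and its MaybeIonCompileable helper:

```cpp
#include <cassert>

// What kind of recompile check the method JIT should emit for a script.
enum RecompileCheckKind {
    RecompileCheck_None,       // script can never be ion-compiled
    RecompileCheck_Count,      // case 2: increment useCount only
    RecompileCheck_Invalidate  // case 1: increment, invalidate at threshold
};

// Stand-in for the per-script analysis results.
struct ScriptInfo {
    bool ionCompileable;         // Ion could compile this script
    bool wantsEagerIonRecompile; // worth invalidating working JM code for
};

RecompileCheckKind MaybeIonCompileable(const ScriptInfo &s) {
    if (!s.ionCompileable)
        return RecompileCheck_None;
    // Keep counting in all ion-compileable scripts so later Ion => JM
    // calls can still trigger a recompile, but only eager candidates
    // invalidate their own JM code.
    return s.wantsEagerIonRecompile ? RecompileCheck_Invalidate
                                    : RecompileCheck_Count;
}
```

The key property is that the middle state keeps useCounts live without ever throwing away working JM code.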

Part 2
------
The second step is to improve the heuristics for determining when we want JM code to invalidate itself for the sake of recompiling with Ion.

Currently, any ion-compileable function will have a recompile check added in mjit code.  This doesn't make sense: if we really end up calling that code from Ion a lot, we'll recompile it for Ion anyway (assuming useCounts are kept updated by the JM-compiled code) when going through the IM => JM call path.

This also intersects with those pathological cases in the JM IC call path where JM-code is invalidated, compiled with ion, and then also compiled with JM again because JM => JM calls are preferred (in ShouldJaegerCompileCallee).  We want to do a better job of preserving the "small" function code that's compiled with JM, because throwing it away and recompiling with Ion is inefficient when we know the caller is going to be JM.  Of note in this situation is that |ShouldJaegerCompileCallee| checks whether the callee has loops, and uses that as a heuristic for preferring Ion over JM for calls from JM ICs.

However, in many cases, just the presence of loops doesn't indicate much.  The loops may be small loops of a few iterations that aren't really the kind suited for ion compilation.  This is true in several of the crypto-md5 and crypto-sha1 cases, as well as date-format-tofte.

To obtain a better heuristic for distinguishing between "high use" loops and "small" loops, we can change the interpreter to keep track of a maximum loop iterations counter on a per-call basis.  On every script entry in the interpreter, the loop counter is reset, and incremented on every LOOPHEAD occurrence inside the script.  The maximum of these is kept track of in another field on the JSScript.

By looking at the maximum loop counter for a function, we can get a better sense of how many times a function goes around a loop in a given script.
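The per-call tracking described above can be sketched like so. Field and function names are invented; in reality the counters would live on the JSScript and be bumped by the interpreter at LOOPHEAD:

```cpp
#include <cassert>
#include <algorithm>
#include <initializer_list>

// Per-script loop-iteration profile, reset on every script entry.
struct LoopProfile {
    unsigned currentCount = 0; // LOOPHEADs seen in the current call
    unsigned maxLoopCount = 0; // maximum observed across all calls

    void onScriptEntry() { currentCount = 0; }
    void onLoopHead() {
        ++currentCount;
        maxLoopCount = std::max(maxLoopCount, currentCount);
    }
};

// Helper for illustration: run several calls, each with a given number
// of loop iterations, and return the recorded maximum.
unsigned SimulateCalls(LoopProfile &p, std::initializer_list<unsigned> iters) {
    for (unsigned n : iters) {
        p.onScriptEntry();
        for (unsigned i = 0; i < n; i++)
            p.onLoopHead();
    }
    return p.maxLoopCount;
}
```

A hot loop drives the maximum up quickly, while a function that only ever spins a handful of iterations per call keeps a small maximum, which is exactly the signal the heuristic wants.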

We now use this as a heuristic in both the |ShouldJaegerCompileCallee| check in the methodjit call IC fallback path, as well as the |MaybeIonCompileable| check used by mjit's recompileCheckHelper to insert the invalidation-for-recompile behaviour (however, it still inserts the useCount increment if the script may possibly be compileable by Ion at a later point in time).

This synchronizes both behaviours, and improves the number of cases where JM can choose not to invalidate its own scripts.  It still leaves open the possibility of scripts getting re-compiled with Ion by keeping the useCount active in the cases where ion-recompilation is possible.

Patch to follow.
Summary: IonMonkey: Jaeger IC recompiles with methodjit even if IonScript exists. Investigate policy changes to make this behave better. → IonMonkey: Use better heuristics for recompile checks from JM on behalf of Ion.
Patch for the above.

Posts general improvements of about 2-3 ms on both 32-bit and 64-bit SS.  Generally holds on other benches.

32-bit SS comparison: http://pastebin.mozilla.org/1761914
64-bit SS comparison: http://pastebin.mozilla.org/1761915
Attachment #652788 - Flags: review?(jdemooij)
Attachment #652788 - Flags: review?(jdemooij) → review?(dvander)
This is an additive patch on top of the previous one which slightly alters the inlining heuristics for JM (while preserving ion-recompileability, which is otherwise turned off for inlined JM functions).

It establishes a metric for a "simple function": namely one that is small, contains no large loops, and contains only "simple calls" (if any).

Simple calls are those to other functions where: the callee is a well-known singleton at the call-site, is small, has no large loops, and itself contains no calls.

This patch tries to establish a heuristic for early inlining, measuring those cases where the presence of calls to other functions is truly trivial.  It seems to work pretty well.
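A rough sketch of the "simple function" metric, using the 100-byte / 40-iteration thresholds that appear in the patch review below; all type names are hypothetical stand-ins for the real JSScript/analysis structures:

```cpp
#include <cassert>
#include <vector>

// Stand-in for per-function analysis data.
struct FnInfo {
    unsigned length;                     // bytecode length
    unsigned maxLoopCount;               // max loop iterations per call
    std::vector<const FnInfo *> callees; // well-known singleton callees
    bool hasUnknownCallees;              // call sites TI couldn't resolve
};

static const unsigned kShortScriptLength = 100;
static const unsigned kSmallLoopCount = 40;

// Small and free of large loops (the shared 100/40 check the review
// asks to factor into one place).
bool IsShortScript(const FnInfo &f) {
    return f.length < kShortScriptLength && f.maxLoopCount < kSmallLoopCount;
}

// "Simple": short, and every call is a "simple call" -- a singleton
// callee that is itself short and contains no calls of its own.
bool IsSimpleFunction(const FnInfo &f) {
    if (!IsShortScript(f) || f.hasUnknownCallees)
        return false;
    for (const FnInfo *callee : f.callees) {
        if (!IsShortScript(*callee) || !callee->callees.empty() ||
            callee->hasUnknownCallees)
            return false;
    }
    return true;
}
```

This is a model of the heuristic, not the real implementation; the actual predicate also consults type information, which is why the review asks for it to live in methodjit/Compiler.cpp.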

Overall, with both patches (this one and the previous), sunspider 64-bit shows a 2.5% gain (~4.4ms) on my linux box, and sunspider 32-bit shows a 2.2% gain (~4ms).
Comment on attachment 652788 [details] [diff] [review]
Improve JM recompile check heuristics and synchronize ShouldJaegerCompileCallee with recompileCheck logic

Review of attachment 652788 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/methodjit/InvokeHelpers.cpp
@@ +263,4 @@
>          return true;
>  
> +    if (callee->length < 100 && !callee->analysis()->hasFunctionCalls() &&
> +        callee->getMaxLoopCount() < 40)

This logic appears to be duplicated from Compiler.cpp - possible to hoist it out somewhere?
Attachment #652788 - Flags: review?(dvander) → review+
Attachment #653394 - Flags: review?(dvander)
Attachment #653394 - Flags: review?(dvander)
Attachment #653394 - Flags: review?(bhackett1024)
Whiteboard: [ion:p2:fx18]
Comment on attachment 653394 [details] [diff] [review]
More aggressively inline "simple" JM functions

Review of attachment 653394 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, sorry for the slow review.

In the medium term I'm hoping we'll be able to kill inlining in JM entirely (two separate inlining compilers is too many), and I'm curious how performance compares if the simple/short script heuristics are used to adjust use count triggers for IM compilation rather than JM inlining.  Whenever off thread compilation is turned on by default, compiling short Ion scripts earlier should be attractive as we can continue to run JM code on the main thread.

::: js/src/jsanalyze.cpp
@@ +2045,5 @@
>      return cv;
>  }
>  
> +bool
> +ScriptAnalysis::isSimple(JSContext *cx)

I think this function should be a non-member static function in methodjit/Compiler.cpp: it uses type information, so shouldn't be in jsanalyze.cpp, and is only used for compilation heuristics.

@@ +2082,5 @@
> +            return false;
> +
> +        // If the callee is large, has big loops, or has its own function calls,
> +        // then the caller is not simple.
> +        if (callee->length > 100 || callee->getMaxLoopCount() >= 40 ||

Echoing dvander's comment, the repeated uses of 100 and 40 need to be factored into a shared function, IsShortScript() or something.

::: js/src/methodjit/Compiler.cpp
@@ +97,5 @@
>      chunkEdges(CompilerAllocPolicy(cx, *thisFromCtor())),
>      stubcc(cx, *thisFromCtor(), frame),
>      debugMode_(cx->compartment->debugMode()),
>      inlining_(false),
> +    forceIncUseCount_(false),

With bug 774253 in, I don't think this field is necessary anymore: JM now only includes recompile checks for triggering Ion compilation, not for compiling new JM code with calls inlined.

@@ +532,5 @@
>          }
>  
>          CHECK_STATUS(checkAnalysis(outerScript));
> +
> +        if (outerScript->length < 100 && outerScript->getMaxLoopCount() < 40 &&

Fix use of 100/40.

@@ +3972,5 @@
>          return false;
>  
> +    JaegerSpew(JSpew_Scripts, "KVKV IsSimple(%s:%d) (len=%d) -> %s\n",
> +                    script->filename, (int) script->lineno, (int) script->length,
> +                    script->analysis()->isSimple(cx) ? "yes" : "no");

The isSimple call here is expensive and will occur even in release builds.  This should be modified to only call isSimple in debug builds.

@@ +3978,4 @@
>      // If this script is small, doesn't have any function calls, and doesn't have
>      // any big loops, then throwing in a recompile check and causing an invalidation
>      // when we otherwise wouldn't have would be wasteful.
> +    if (!inlining && (script->length >= 100 || script->getMaxLoopCount() >= 40 ||

Fix use of 100/40.
Attachment #653394 - Flags: review?(bhackett1024) → review+
(In reply to Brian Hackett (:bhackett) from comment #6)

> @@ +3972,5 @@
> >          return false;
> >  
> > +    JaegerSpew(JSpew_Scripts, "KVKV IsSimple(%s:%d) (len=%d) -> %s\n",
> > +                    script->filename, (int) script->lineno, (int) script->length,
> > +                    script->analysis()->isSimple(cx) ? "yes" : "no");
> 
> The isSimple call here is expensive and will occur even in release builds. 
> This should be modified to only call isSimple in debug builds.

Argh.  That's temporary printf-debugging code.  I gotta remember to grep my patches for KVKV before submitting.
Blocks: 777561
The commit of this bug caused some major test regressions in tbpl so it was backed out.  Not to mention I was disappointed in general with the improvements posted by the second patch (about 0.6ms on SS in v8).

While tinkering with inlining behaviour I collected some general insights that would probably do well to be noted down:

Just for notes, in the following text, a "Simple" function is one that:
  * has no "large" loops (i.e. more than 40 loop iterations per function call in the interpreter)
  * is "small" (less than 100 bytes in bytecode).
  * contains either no calls, OR
  * all calls are well-characterized by TI with a single callee, and the callee itself has no large loops, no function calls in its body, and is also small.

Observations:
Disabling ion-compilation of simple functions entirely actually yielded significant gains in several places, most significantly in date-format-xparb, but sharp losses in particular on 3bits-bits-in-byte and bits-in-byte.

Overall the result was a net win, but heuristics should be about good intuition leading to good results, not toggling various options until some overall gain is observed.  Taking a look at the specific functions that were being compiled, and which benches benefitted from that change versus which benches did worse, there was a reasonably clear pattern that emerged.

Loop-heavy, very hot, highly numerical code did well when compiled early with Ion - Ion spends a lot of time to generate very tight code, and many of its techniques (GVN, proper regalloc, etc.) are particularly appropriate for code doing a lot of direct value manipulation.  This is why 3bits-bits-in-byte and bits-in-byte took losses when JM-only was toggled for simple functions: their loop computations are almost purely numerical.  They were perfect Ion code but were getting selected against for ion compilation because they were "small" and "simple".

JS functions that were heavy on uninlined calls and object access, however, seem to gain less benefit from early ion compilation.  Which makes sense - the runtime of the compiled versions of these functions is more likely to be influenced by the overhead costs of VM calls, call ICs, getprop ICs, etc.


This suggests some approaches for choosing between JM and Ion.  In particular, if we can identify "lightweight" code - namely code that's heavily composed of local slot access, argument access, stack operations, and numerical operations - it might make sense to give it to Ion on a preferential basis.  For more heavyweight functions which contain a lot of un-inlineable function calls, or look like they're heavy on "object oriented stuff", it may pay to delay Ion-compilation slightly and wait for confirmation of the function's hotness before using Ion's heavy compile on it.


So I wrote code today to classify opcodes into various weights.  I have it working now to the point where it seems to correctly classify "lightweight" functions vs. "heavyweight" ones.  Tomorrow I'll see about tying that into the useCount limits for JM vs. Ion compilation and see what it does.
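A purely illustrative sketch of what such an opcode-weighting scheme could look like. The opcode categories, weights, and threshold below are all invented for illustration, not the actual classification written for the bug:

```cpp
#include <cassert>
#include <vector>

// Hypothetical coarse opcode categories.
enum class OpKind { LocalGet, ArgGet, StackOp, NumericOp, Call, PropAccess };

// Weight "heavyweight" operations: ones dominated by VM-call and IC
// overhead rather than raw computation.
int OpWeight(OpKind op) {
    switch (op) {
      case OpKind::Call:       return 5; // VM calls, call ICs
      case OpKind::PropAccess: return 3; // getprop/setprop ICs
      default:                 return 0; // local/arg/stack/numeric ops
    }
}

// A script counts as heavyweight when weighty ops dominate its body,
// suggesting Ion compilation could be delayed until it's proven hot.
bool IsHeavyweight(const std::vector<OpKind> &ops) {
    int weight = 0;
    for (OpKind op : ops)
        weight += OpWeight(op);
    return !ops.empty() && weight * 2 > (int)ops.size();
}
```

Lightweight functions (mostly slot access and arithmetic) would then get earlier Ion compilation, and heavyweight ones a higher useCount trigger.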
Mmmmm, strong echoes of the trace profiler (bug 580468).

I'm hoping that off thread Ion compilation will make highly specialized tuning of this sort unnecessary.  Instead of needing to weigh the cost vs. benefit of compiling a script, just queue up scripts with useCount > N and keep a core busy compiling code according to which scripts are hottest at the moment.  This isn't quite ready yet, too much work is being done on the main thread (see bug 785905 comment 0), but will be soon.
Hah :)  I didn't realize I was treading on such well trodden ground.

Nevertheless, I think there are a few points I can make which distinguish what I'm getting at from profiling in the trace case:

1. Ion does well even on general purpose code - the decision I'm trying to make is not a question of whether or not to trigger Ion compilation, but how to give preference towards ion-compiling certain functions early.

It's almost never a bad decision to compile with Ion for the long term, which is something that seems to not have applied to the trace compiler.  Ion _will_ generate better code than JM on pretty much everything, from what I've seen.  But compiling heavyweight functions may incur an up-front cost that we may want to avoid, and compiling lightweight functions may provide an up-front benefit we may really want to tap.

The issue we want to address is what to compile first, and how early to kick off that compilation.

2. Whereas the trace profiler seemed to have been oriented towards a general purpose decision engine that affected all compiles, I'm limiting the scope of this investigation to identifying the outliers.

If through a quick check we can identify functions that are obviously extremely numerically oriented, we can choose for earlier Ion compilation.  Likewise if we can identify functions that are obviously extremely polymorphically call and property-access heavy, we can de-emphasize those for early compilation.

The stuff that's not very obviously one way or the other can just use the default policy of UseCounts, and it'll be fine.

3. This actually fits well with the off-thread compilation.  The order in which we throw things at the compiler will still affect the performance benefits we see.

From a scheduling perspective, we want to prioritize the order of compiles to functions that are quick to compile and yield the greatest early benefit.

UseCount is also a heuristic - just an extremely coarse one.  With just a little bit more work, we should be able to easily identify at least some fraction of the outlier cases that will have a high performance impact (both positive and negative).

That information is _there_, we shouldn't ignore it.
Well my stabs at doing this ended up not yielding very much in terms of tangible gains.  A simple weighting scheme yielded too much fluctuation in performance between the various benches, especially on sunspider, and trying to resolve them would have basically led to making finer and finer "tuning" - effectively overfitting the benchmarks.  Not a good plan.  Idea abandoned (for now..)


Going back to the drawing board, I went over the compilation behaviour once again.  In JM we insert recompile checks for ion-compilation at every loop entry.  This includes trivial loops such as the following (just for example) in 3d-cubes:

// multiplies two matrices
function MMulti(M1, M2) {
  var M = [[],[],[],[]];
  var i = 0;
  var j = 0;
  for (; i < 4; i++) {
    j = 0;
    for (; j < 4; j++) M[i][j] = M1[i][0] * M2[0][j] + M1[i][1] * M2[1][j] + M1[i][2] * M2[2][j] + M1[i][3] * M2[3][j];
  }
  return M;
}

//multiplies matrix with vector
function VMulti(M, V) {
  var Vect = new Array();
  var i = 0;
  for (;i < 4; i++) Vect[i] = M[i][0] * V[0] + M[i][1] * V[1] + M[i][2] * V[2] + M[i][3] * V[3];
  return Vect;
}

function VMulti2(M, V) {
  var Vect = new Array();
  var i = 0;
  for (;i < 3; i++) Vect[i] = M[i][0] * V[0] + M[i][1] * V[1] + M[i][2] * V[2];
  return Vect;
}

Recompile checks in these loops are quite useless.  The value of recompile checks in loops is to catch situations where we avoid compiling a function because we get stuck in a long-running, high-iteration loop.  In the above situations, that's not a concern, and the recompile check is just extra bloat in the middle of hot code.  We'll recompile the function anyway when we later enter it.

The previous patch introduced some interpreter-based loop profiling that can be leveraged to make the evaluation that a function contains no "big" loop.  We can use this to selectively disable recompile checks inside these trivial loops.

Preliminary results show a 2.5% improvement on sunspider overall, and improvements on a number of v8 benches, except for v8-raytrace which regresses.  I'll look into why that's happening tomorrow.
> It's almost never a bad decision to compile with Ion for the long term,
> which is something that seems to not have applied to the trace compiler. 

The second half of that sentence is very, very true :)
That's an interestingly qualified statement... :)

Ok, updates on the loop-profiling stuff.  My initial stab at eliminating loop recompile checks used the |maxLoopCount()| metric collected by the interpreter.  Basically, it completely turned off useCount increments and recompile checks in those functions where the "biggest number of loop edges taken per function call" in the interpreter didn't exceed 32.

In retrospect, that mechanism might have been too aggressive.  Just because a loop ran for a small number of iterations in the first few runs of a function doesn't mean it won't run for more iterations in subsequent runs.  If we use the early data to completely disable the information collecting for the future, we risk never recompiling functions that dynamically incur more loop iterations after JM compilation.

Today I wrote a more conservative incarnation of that approach, leading to some more conservative gains (namely, 1% on 64-bit SS on Linux).  A new "loop analysis" phase was added to jsanalyze.cpp.  The purpose of this phase is to identify the following pattern:

for (var i = X; i < Y; i++) { ... }

Where X and Y are small constants, no writes to i happen within the body (and |i| is a proper localvar, and not aliased), and the body of the loop contains no breaks beyond the loop (this still leaves exceptions unaccounted for, but I think it's safe to leave them as a pathological unhandled case).

In these loops, as long as the total number of iterations is small, we can hoist the useCount increments into a single operation after the loop runs.  Furthermore, in cases with nested small constant loops like these, we can hoist the useCount increments multiple levels.

Lastly, if a function contains only small constant loops like these, then we can leave out recompile checks altogether at loop edges, treating the function as having no loops at all.  The determination being: "they may as well be treated as if they were unrolled".
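The small-constant-loop test described above might look roughly like this, with the 32-iteration cutoff taken from the earlier comment; all field names are invented stand-ins for the analysis results:

```cpp
#include <cassert>

// Stand-in for what the loop-analysis phase learns about one loop of
// the form: for (var i = X; i < Y; i++) { ... }
struct LoopInfo {
    bool boundsAreConstant;   // X and Y are small constants
    int initValue;            // X
    int limitValue;           // Y (exclusive bound: i < Y)
    bool inductionVarWritten; // body writes to i, or i is aliased
    bool hasNonLoopBreaks;    // body can branch past the loop
};

static const int kMaxUnrollableTripCount = 32;

// True when the whole trip count is statically known and small, so the
// per-iteration useCount increments can be hoisted to one add after the
// loop, and the loop-edge recompile check dropped.
bool IsSmallConstantLoop(const LoopInfo &loop) {
    if (!loop.boundsAreConstant || loop.inductionVarWritten ||
        loop.hasNonLoopBreaks)
        return false;
    int trips = loop.limitValue - loop.initValue;
    return trips >= 0 && trips <= kMaxUnrollableTripCount;
}
```

The MMulti/VMulti loops quoted below, with constant bounds of 3 or 4 iterations, are exactly the kind this predicate would accept.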

Three patches:
1. Add loop analysis to jsanalyze (there's a part in here which is technically quadratic time if there exist multiple large nested loops within a function.. I expect to fix that down the line, or just disable the analysis for really big functions entirely).

2. Change JM to use the results of the analysis to hoist useCount increments outside of the loop.

3. Alter JM to use the "all loops in function are small constant loops" metric to leave out useCount increments and recompile checks within a function entirely.

Bench results show a 1% gain in Linux-SS-64, and no movement anywhere else.  One thing to note is that this approach can be adapted for IonMonkey as well.  It'll only affect things if Ion is not compiling with inlining for a given function, but it might still yield marginal improvements there.

Here are the bench results, patches following shortly:

32-bit Sunspider (no movement): http://pastebin.mozilla.org/1783968
64-bit Sunspider (1% gain): http://pastebin.mozilla.org/1783969
32-bit V8 (no movement): http://pastebin.mozilla.org/1783989
64-bit V8 (no movement): http://pastebin.mozilla.org/1784035
32-bit Kraken (no movement): http://pastebin.mozilla.org/1783988
64-bit Kraken (no movement): http://pastebin.mozilla.org/1783990
Attachment #656683 - Flags: review?(jwalden+bmo)
Attachment #656683 - Flags: review?(jwalden+bmo)
So the loop analysis stuff is probably not going to go in - too finicky.  I don't have any other major ideas for changing JM => Ion compilation heuristics, so I'm closing this bug.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED