Implement trip count prediction

Status: NEW
Assignee: Unassigned
Component: Core :: JavaScript Engine: JIT
Priority: P5   Severity: normal
Reported: 5 years ago   Last modified: 2 years ago
Reporter: Pericles Alves
Tracking flags: (Not tracked)
Attachments: (2 attachments)

(Reporter)

Description

5 years ago
For a while now, I have been trying to implement a form of trip count prediction in SpiderMonkey. My idea is to predict the number of iterations of simple loops that are controlled by interval comparisons (<, <=, >, and >=), e.g., while (i < N) {... i++;} or for (var i = N; i >= 0; i--) {}. The predicted number of iterations for a given loop would be the absolute difference between the operands of the conditional expression that controls it in the first iteration, e.g., n = 100; i = 10; while (i < n) {... i++;} => predicted number of iterations = 100 - 10. We can then use this information to predict the "hotness" of a script and try to trigger an earlier compilation if a loop is predicted to be hot.
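As a minimal sketch of the arithmetic described above (the function name is made up for illustration; this is not SpiderMonkey code):

```cpp
#include <cstdint>
#include <cstdlib>

// Predicted trip count for an interval-controlled loop: the absolute
// difference between the compare operands on the first iteration.
// Assumes integer operands and a stride of 1.
int64_t PredictTripCount(int64_t induction, int64_t bound) {
    return std::llabs(bound - induction);
}
```

For the example above, PredictTripCount(10, 100) yields 90; for a counting-down loop such as for (var i = 100; i >= 0; i--), PredictTripCount(100, 0) yields 100 (the actual count is 101, so the prediction is approximate).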

The prediction code would be inserted in the Interpreter and in the code generated for loop compare instructions in Baseline. To increase precision, we could perform a quick bytecode analysis to verify that the last SET operation before a loop conditional is an increment/decrement. This would, in most cases, avoid instrumenting loops with a stride different from 1. To avoid a lack of type info, we can insert a warmup phase: even if a script is predicted to iterate many times, we only compile it after a number X of iterations, where X is a tiny fraction of usesBeforeCompile. Though an early loop exit (break/return) could result in a terrible prediction, previous experiments showed that this approach gives an exact guess for ~70% of the loops analysed while navigating through the pages in the Alexa index.
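The warmup gate described here might be sketched roughly like this (the function name, the 3% fraction, and the "predicted hot" criterion are illustrative assumptions, not the actual patch):

```cpp
#include <cstdint>

// Even if a loop is predicted to be hot, delay compilation until a tiny
// fraction of the usual usesBeforeCompile threshold has elapsed, so that
// type information has a chance to accumulate first.
bool ShouldCompileEarly(uint32_t useCount, int64_t predictedTripCount,
                        uint32_t usesBeforeCompile) {
    const double kWarmupFraction = 0.03;  // "tiny fraction" (assumed value)
    uint32_t warmup =
        static_cast<uint32_t>(kWarmupFraction * usesBeforeCompile);
    // Treat the loop as hot if it is predicted to iterate at least as many
    // times as the normal compilation threshold (assumed criterion).
    bool predictedHot =
        predictedTripCount >= static_cast<int64_t>(usesBeforeCompile);
    return predictedHot && useCount >= warmup;
}
```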

In http://code.google.com/p/dynamic-loop-prediction/#Predicting_trip_counts I explained this idea (the diagrams and results on the page are not up to date with the latest implementation, which works with --ion-parallel-compile=on). As future uses for this, Nicolas B. Pierron suggested possibly using this information for things like:
- Avoid generating OSR blocks for cold loops. This would be good in parallel compilation mode, to avoid hitting a different loop when compilation is done.
- Parallel JS: have more dynamic information about memory likely to be heavily accessed (arrays inside hot loops).
(Reporter)

Comment 1

5 years ago
Created attachment 823792 [details] [diff] [review]
Use trip count prediction for earlier Baseline/Ion compilation.

Current results (64-bit Mac OS X):

SunSpider 1.0:
--------------
3d-cube: 0
3d-morph: 0
3d-raytrace: 0
access-binary-trees: 0
access-fannkuch: -1%
access-nbody: 0
access-nsieve: 2%
bitops-3bit-bits-in-byte: 0
bitops-bits-in-byte: 0
bitops-bitwise-and: 2%
bitops-nsieve-bits: 0
controlflow-recursive: 0
crypto-aes: 0
crypto-md5: 6%
crypto-sha1: -2%
date-format-tofte: 2%
date-format-xparb: 11%
math-cordic: 2%
math-partial-sums: 8%
math-spectral-norm: -2%
regexp-dna: 1%
string-base64: 1%
string-fasta: 0
string-tagcloud: 0
string-unpack-code: 1%
string-validate-input: 9%
TOTAL: 2%

Kraken 1.1:
-----------
ai-astar: 0
audio-beat-detection: 4%
audio-dft: 0
audio-fft: 3%
audio-oscillator: 0
imaging-gaussian-blur: -1%
imaging-darkroom: 0
imaging-desaturate: 1%
json-parse-financial: 2%
json-stringify-tinderbox: 5%
stanford-crypto-aes: 3%
stanford-crypto-ccm: 0
stanford-crypto-pbkdf2: 0
stanford-crypto-sha256-iterative: 0
TOTAL: 1%

V8 v6:
------
crypto: 0
deltablue: 0
earley-boyer: 0
raytrace: 0
regexp: -1%
richards: 2%
splay: 2%
TOTAL: 0%
Attachment #823792 - Flags: review?(nicolas.b.pierron)
Thanks for the patch!

Would you mind measuring the interpreter and baseline parts separately? I think the interpreter changes make a lot of sense: the interpreter is really slow and triggering baseline compilation immediately shouldn't have any (noticeable) negative effect on performance.

However triggering Ion compilation sooner may confuse (inlining) heuristics and is more likely to backfire, for instance we could recompile more often when type information is incomplete etc.
(Reporter)

Comment 3

5 years ago
Here are the separate measures:

Interpreter level only
**********************
SunSpider 1.0:
--------------
3d-cube: 0
3d-morph: 0
3d-raytrace: 0
access-binary-trees: 0
access-fannkuch: -2%
access-nbody: 0
access-nsieve: 0
bitops-3bit-bits-in-byte: 0
bitops-bits-in-byte: 0
bitops-bitwise-and: 0
bitops-nsieve-bits: 0
controlflow-recursive: 0
crypto-aes: 0
crypto-md5: 5%
crypto-sha1: 0
date-format-tofte: 1%
date-format-xparb: 0
math-cordic: 0
math-partial-sums: -2%
math-spectral-norm: -2%
regexp-dna: 0
string-base64: 0
string-fasta: 0
string-tagcloud: -2%
string-unpack-code: -1%
string-validate-input: 0
TOTAL: 0

Kraken 1.1:
-----------
ai-astar: 0
audio-beat-detection: 0
audio-dft: 0
audio-fft: 0
audio-oscillator: 0
imaging-gaussian-blur: 0
imaging-darkroom: 0
imaging-desaturate: 0
json-parse-financial: 2%
json-stringify-tinderbox: 0
stanford-crypto-aes: 0
stanford-crypto-ccm: 0
stanford-crypto-pbkdf2: 0
stanford-crypto-sha256-iterative: 0
TOTAL: 0

V8 v6:
------
crypto: 0
deltablue: 0
earley-boyer: 0
raytrace: 0
regexp: 0
richards: 0
splay: 0
TOTAL: 0

Baseline level only
*******************
SunSpider 1.0:
--------------
3d-cube: 0
3d-morph: 0
3d-raytrace: 0
access-binary-trees: -2%
access-fannkuch: -2%
access-nbody: -2%
access-nsieve: 0
bitops-3bit-bits-in-byte: -6%
bitops-bits-in-byte: -5%
bitops-bitwise-and: 2%
bitops-nsieve-bits: 4%
controlflow-recursive: 0
crypto-aes: 0
crypto-md5: 0
crypto-sha1: 0
date-format-tofte: 0
date-format-xparb: 18%
math-cordic: 0
math-partial-sums: 6%
math-spectral-norm: 0
regexp-dna: 0
string-base64: 5%
string-fasta: 0
string-tagcloud: 0
string-unpack-code: -2%
string-validate-input: 7%
TOTAL: 1%

Kraken 1.1:
-----------
ai-astar: 0
audio-beat-detection: 4%
audio-dft: 0
audio-fft: 0
audio-oscillator: 2%
imaging-gaussian-blur: -2%
imaging-darkroom: 0
imaging-desaturate: 1%
json-parse-financial: 1%
json-stringify-tinderbox: 5%
stanford-crypto-aes: 3%
stanford-crypto-ccm: 0
stanford-crypto-pbkdf2: 0
stanford-crypto-sha256-iterative: -1%
TOTAL: 1%

V8 v6:
------
crypto: 0
deltablue: 0
earley-boyer: 0
raytrace: 0
regexp: -1%
richards: 1%
splay: 0
TOTAL: 0

Though the interpreter is slow, looking at the output of the Tracelogger (the last update made by Hannes Verschore), the scripts spend very little time in it. So triggering Baseline compilation earlier really has almost no effect (I actually inserted that part because it helped the Baseline level results a little). Though I also added a little piece of code to try to perform earlier inlining, the confusion in the inlining heuristics really does happen (I noticed that looking at some IONFLAGS outputs). The counters of outer and inner scripts fall out of sync, and I think that's why I didn't get any more speedup.
Comment on attachment 823792 [details] [diff] [review]
Use trip count prediction for earlier Baseline/Ion compilation.

Review of attachment 823792 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for attaching the patch to Bugzilla, I am sure we can make this worth it.

::: js/src/jit/BaselineCompiler.cpp
@@ +572,5 @@
> +
> +    // Try to use Trip Count Prediction information.
> +    if (js_IonOptions.tcp) {
> +        // Avoid jumping into Ion before gathering enough type info. 
> +        uint32_t tcpWarmup = 0.03 * js_IonOptions.usesBeforeCompile;

I was thinking of the Jalapeño paper, and I was wondering when we can safely estimate that the cost of the compilation would be masked by the estimated time spent in baseline.  From what I understand, the trip-count prediction is used to give an estimated time.

Also, as Jan mentioned, the trigger for inlining a function is higher than 30 runs.  Have a look at http://dxr.mozilla.org/mozilla-central/source/js/src/jit/IonBuilder.cpp#l4002 .  Also note that the default value for inlining is http://dxr.mozilla.org/mozilla-central/source/js/src/jit/Ion.h#l126, and the tcpWarmup factor (which should probably go next to the tcp flag in js_IonOptions) is way below the inlining factor.  Can you try with a tcpIonFactor set to 0.25?

Another option would be to have a ratio of the use count of the current script to decide if it is worth inlining.  On the other hand, we still want a lower cap to prevent compiling a function that has a use count of 3, as we probably want it to be compiled by baseline first.
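A rough sketch of this ratio-plus-lower-cap idea (all names and constants here are illustrative assumptions, not SpiderMonkey's actual heuristic):

```cpp
#include <cstdint>

// Decide whether a callee is worth inlining based on a ratio of the
// caller's use count, with a lower cap so that barely-used functions
// (e.g. a use count of 3) still go through baseline first.
bool WorthInlining(uint32_t calleeUseCount, uint32_t callerUseCount) {
    const uint32_t kMinUseCount = 10;   // assumed lower cap
    const double kCallerRatio = 0.125;  // assumed fraction of caller's uses
    if (calleeUseCount < kMinUseCount)
        return false;  // prefer baseline for barely-used functions
    return calleeUseCount >=
           static_cast<uint32_t>(kCallerRatio * callerUseCount);
}
```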

Hannes (h4writer) was looking into having 2 modes of compilation: one which does no inlining at all, and another one which inlines like crazy while the first one is running.  I think it would be good to check with h4writer's patches.

@@ +1412,5 @@
> +
> +    // Guard that both operands are integers.
> +    Label exitAugment;
> +    masm.branchTestInt32(Assembler::NotEqual, R0, &exitAugment);
> +    masm.branchTestInt32(Assembler::NotEqual, R1, &exitAugment);

nit: We have alias of the assembler conditions to make sense of the test vs. cmp conditions:
         NotEqual -> NotZero
Attachment #823792 - Flags: review?(nicolas.b.pierron)
For information, h4writer recently changed the condition to no longer be relative to the number of executions of the caller, see bug 942105.
(Reporter)

Comment 6

5 years ago
nbp, thank you for the review.

I was looking at h4writer's last patches and tested my changes on top of them. They made almost no difference: they helped the SunSpider numbers a little (as shown below), but still I get no speedup in V8 or Kraken.

3d-cube: 0
3d-morph: 0
3d-raytrace: 2%
access-binary-trees: 2%
access-fannkuch: 7%
access-nbody: 0
access-nsieve: 0
bitops-3bit-bits-in-byte: 0
bitops-bits-in-byte: 0
bitops-bitwise-and: 0
bitops-nsieve-bits: -5%
controlflow-recursive: 0
crypto-aes: 0
crypto-md5: 0
crypto-sha1: 0
date-format-tofte: 1%
date-format-xparb: 10%
math-cordic: 2%
math-partial-sums: 10%
math-spectral-norm: 0
regexp-dna: 0
string-base64: 3%
string-fasta: 0
string-tagcloud: 0
string-unpack-code: 5%
string-validate-input: 7%
TOTAL: 3%

Just so I understand, the reason you suggested a tcp warmup factor closer to the inlining factor is that you consider the latter a good threshold for type warmup? Anyhow, increasing the warmup factor to 0.25 or 0.125 (same as inlining) mitigates the speedups in all suites. I guess that's because we are reducing the number of uses that the prediction can "save".

As for the part about using this to give an estimated time, this technique works really well for single loops (that 70% hit rate I mentioned), as it was implemented in the code emitted for loop conditionals, but not for a whole script. If you have a nested loop, for instance, it's hard to predict the combined number of iterations of the outer and inner loops. So this prediction can be used to give a lower bound, but not an upper bound (at least as it is implemented now).
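The lower-bound limitation is easy to see with a tiny illustrative nested loop: instrumenting only the outer compare predicts 10 iterations, while the actual work done is 10 * 1000.

```cpp
// Instrumenting the outer loop's compare would predict a trip count of 10,
// but the script actually performs 10 * 1000 units of work, so the
// prediction is only a lower bound on the total iteration count.
int NestedLoopWork() {
    int work = 0;
    for (int i = 0; i < 10; i++)        // predicted trip count: 10
        for (int j = 0; j < 1000; j++)  // inner loop multiplies real work
            work++;
    return work;
}
```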

This week I'm planning on taking a look at those other patches from h4writer that you mentioned (inlining while running a simpler version). I'm thinking about doing this only for functions with a really hot loop, as this kind of prediction is good only for lower bounds.

Quick question: I was playing around with ParallelJS over the last few days (I was trying to use it in an image processing app I was writing, to figure out how I could possibly use prediction to improve it), but it turns out that both APIs (ParallelArray and the <method>Par in Array) were much slower than the serial version and almost never executed in parallel. Do you happen to know if I need anything more than a thread-safe build to run it?
Hannes, do you have a Bug where you have attached the patches for trying the early Ion compilation?

PJS first runs sequentially to ensure that everything has been traversed at the beginning, then it compiles every function used in a parallel execution mode, and starts running slices of the array with Ion's code.  If you have any speed issues, you should ask Shu (shu) or Nikos (nmatsakis) on IRC.  Normally a thread-safe build is enough to run PJS tests, and you should get error messages if we bail out of the parallel execution mode.
Flags: needinfo?(hv1989)
(In reply to Nicolas B. Pierron [:nbp] from comment #7)
> Hannes, do you have a Bug where you have attached the patches for trying the
> early Ion compilation?

The WIP patch in bug 939614. (Haven't tested it yet on PJS)
Flags: needinfo?(hv1989)

Comment 9

4 years ago
Created attachment 8398934 [details] [diff] [review]
Use trip count prediction for earlier BaselineIon compilation  for 29 fix.patch

I tested the patch on the 29.0 beta without PGO.

1)Sunspider
Before 183.9ms +/- 1.4%
After  185.9ms +/- 0.8%

2)V8
Before 16897
After  16893

3)Kraken
Before 1524.7ms +/- 2.1%
After  1501.1ms +/- 1.8%

4)Octane
Before 17488
After  17673

5)JSBench http://jsbench.cs.purdue.edu/
Before 103.60ms ± 3.70%
After  103.99ms ± 3.76%

6)http://qme.mydns.jp/javascript.html
Before 77255
After  76578
Flags: needinfo?
xunxun, thanks for remeasuring this. If you use the needinfo? flag, please make sure to specify a person, otherwise nobody gets notified (except for the usual flood of bugmail).

ni?'ing nbp as he gave the most in-depth feedback before.
Flags: needinfo? → needinfo?(nicolas.b.pierron)
Comment on attachment 8398934 [details] [diff] [review]
Use trip count prediction for earlier BaselineIon compilation  for 29 fix.patch

Review of attachment 8398934 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/BaselineJIT.cpp
@@ +278,5 @@
>      if (IsJSDEnabled(cx) || cx->runtime()->parallelWarmup > 0) {
>          if (osr)
>              return Method_Skipped;
> +    } else if (script->incUseCount() <= js_JitOptions.baselineUsesBeforeCompile &&
> +            script->predictedHotness < JSScript::HOT_FOR_BASELINE) {

The question I wonder about is: is it worth it to save ourselves the 10 iterations in the Interpreter, compared to saving the 100-1000 iterations in Baseline?

Hannes is working on a new tracelogger interface: https://tl-beta.paas.allizom.org/

This highlights that on Octane we should not expect to save more than ~2%, while we could save up to 25% on SunSpider, if we only instrument baseline.  This might also help with tuning the value you use to trigger the compilation at the right time.

::: js/src/jit/BytecodeAnalysis.cpp
@@ +216,5 @@
>      return true;
>  }
> +
> +bool
> +BytecodeAnalysis::isIntervalLoopCompare(jsbytecode *pc)

BytecodeAnalysis is going away these days.

@@ +246,5 @@
> +
> +    if (pcIt < (script_->code() + patternOffset))
> +        return false;
> +
> +    // Step backwards through the bytecode, looking for the first SET operation.

This sounds unsafe.  Can't we keep an old pc?
(In reply to xunxun from comment #9)
> I test the patch for 29.0 beta without PGO.

As far as I know, we have disabled PGO on the JS engine, as most of the code which might need PGO is produced by the engine.

> 1)Sunspider
> Before 183.9ms +/- 1.4%
> After  185.9ms +/- 0.8%
> 
> 3)Kraken
> Before 1524.7ms +/- 2.1%
> After  1501.1ms +/- 1.8%

Without any other win, this is hard to accept, and from what I understand, TCP is supposed to help us on cold start-up by compiling early.

> 4)Octane
> Before 17488
> After  17673

Octane is the successor of the V8 benchmark; there is little value in checking both.  Do you have a breakdown?  Is there one major benchmark which benefits from it, such as CodeLoad, or is this mostly noise?

> 5)JSBench http://jsbench.cs.purdue.edu/
> Before 103.60ms ± 3.70%
> After  103.99ms ± 3.76%

JSBench was an attempt to extract features of the Web, but from what I heard it does not reflect the same properties as the web pages from which it was extracted.

> 6)http://qme.mydns.jp/javascript.html
> Before 77255
> After  76578

From what I understand, this benchmark is looking at how fast the compiler can remove the code.
Flags: needinfo?(nicolas.b.pierron)
(In reply to Nicolas B. Pierron [:nbp] from comment #12)
> As far as I know, we have disabled PGO on the JS engine, as most of the code
> which might need PGO is produced by the engine.

https://hg.mozilla.org/mozilla-central/file/d8b2e3738785/js/src/moz.build#l401
I tested PGO versus non-PGO on Windows with Baseline only (Ion disabled) recently, and found that SunSpider gets about 5% faster, whereas Octane and Kraken don't really benefit. But it's worth noting that the current PGO profiling pass runs SunSpider but *doesn't run* Octane or Kraken - so it's possible we could gain more there. Still, Octane and Kraken run mostly in Ion, so it probably wouldn't make a very big difference.
Pericles, are you still interested in working on this?
Flags: needinfo?(periclesroalves)
Priority: -- → P5
(Reporter)

Comment 16

2 years ago
Nope :( I've moved on to other projects. Sorry
Flags: needinfo?(periclesroalves)