Closed Bug 768288 Opened 7 years ago Closed 7 years ago

IonMonkey: Inline functions with loops

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: dvander, Assigned: h4writer)

References

(Depends on 1 open bug, Blocks 3 open bugs)

Details

(Whiteboard: [ion:t])

Attachments

(2 files, 3 obsolete files)

This isn't terribly important, but could be worth about 2-3% on earley-boyer. sc_assq is fairly hot and manually inlining it gave some improvement.
Duplicate of this bug: 720990
Blocks: 768572
Attached patch Patch (obsolete) — Splinter Review
This wins about 500 points on v8-richards (14500 -> 15000) but regresses deltablue, so that needs investigation first.
Whiteboard: [ion:t]
Duplicate of this bug: 820625
My measurements now show a 5%-7% win on earley-boyer and 6% on richards, but a 7% loss on deltablue.
Assignee: general → hv1989
Blocks: 765980
That's a net win, right? I say take it :) might be even bigger with the inline-arguments-use patch.
Looks like there is a big difference in running the V8 scripts separately or not. When running the whole benchmark, the timings go to:
 0% on richards :(
-5% on deltablue
 3% on earley-boyer

That makes it a net loss. Therefore we first need to investigate that regression, or find out why richards loses its improvement.
Depends on: 823884
WIP patch does the following (WIP means just works, code barely readable):

Richards: 10598 to 11247
DeltaBlue: 12285 to 12629
Crypto: 18270 to 18759
RayTrace: 8856 to 8663
EarleyBoyer: 13917 to 14063
RegExp: 2031 to 2042
Splay: 11316 to 11026
NavierStokes: 23502 to 23697
----
Score (version 7): 10592 to 10711

After some more testing this comes down to 6% on richards and 1.5% on earley-boyer. 1%-2% in total.
Also now I know what is needed to get this:
1) Don't inline regularly bailing functions. (The places where we can bail increase when we inline them.) (I tried a low barrier of max. 10 bailouts and that worked.)
2) Bug 823884: Don't mark the top function as uncompileable when an inlined function aborts. Just mark it as uninlineable. (Needed for earley-boyer. It will now compile a function with a loop and arguments, and arguments_get_elem will abort and regress us enormously.)
3) I also disabled inlining a function's calls to itself. Not sure if that's required though...
4) Remove the loop check to disable inlining.
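Points 1 and 4 amount to a small gate in the inlining decision. A minimal sketch, assuming hypothetical field and constant names (not the actual IonMonkey API):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-callee data; real IonMonkey tracks this elsewhere.
struct CalleeInfo {
    uint32_t bailCount;  // bailouts observed so far for this script
    bool hasLoops;       // point 4: loops no longer block inlining by themselves
};

// The "low barrier of max. 10 times" mentioned above.
const uint32_t inlineMaxBailCount = 10;

bool canInline(const CalleeInfo &callee) {
    // Point 1: inlining multiplies the places where we can bail,
    // so skip callees that already bail often.
    if (callee.bailCount >= inlineMaxBailCount)
        return false;
    // Point 4: the old "has a loop => never inline" check is gone.
    return true;
}
```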
Attached patch Inline loops (obsolete) — Splinter Review
Implements point 1 and 4
Attachment #643848 - Attachment is obsolete: true
Attachment #696298 - Flags: review?
Attachment #696298 - Flags: review? → review?(nicolas.b.pierron)
Comment on attachment 696298 [details] [diff] [review]
Inline loops

Review of attachment 696298 [details] [diff] [review]:
-----------------------------------------------------------------

Using bailouts as a metric for inlining targets is a strange but good idea.
Namely: an inlined frame doing too many bailouts will cause the caller to return to the interpreter, and thus not be executed as fast as possible.

Still, you should reset the bailout counter after the invalidation of the IonScript, or better, store this counter on the IonScript / IonCode.  Especially since the bailout counter only matters for Ion-compiled functions.

::: js/src/ion/Ion.h
@@ +148,5 @@
>  
> +    // Maximum number of bailouts the callee may have to allow inlining.
> +    //
> +    // Default: 10
> +    uint32_t inlineMaxBailCount;

This metric only matters to scripts having a loop, since we are not supposed to inline a function after it has been used for a while in JM.  Comment that above.

Also, you should reset the counter after an invalidation of the IonScript to prevent deoptimization, ensuring that, for example, the first GC will not forbid this optimization by invalidating active frames.
Attachment #696298 - Flags: review?(nicolas.b.pierron)
Depends on: 857360
Blocks: 856625
I have a few thoughts on this issue, having played earlier with heuristics for JM => Ion calls for scripts with loops.

An important observation is the following:

For functions with high-iteration, long-running loops, the value of inlining is low, because the cost of the call will become a very small fraction of the cost of the runtime of the function.

The place where disabling inlining of functions with loops really hits us is with functions where the loops are "small".  Either loops across small arrays that act as vectors, "for (name in object)" type loops, or (what I'm running into right now with argsobj optimization) loops over an arguments object passed in from the caller.

It's possible to detect this by having a heuristic that measures the "cardinality" of loops within a script (the interpreter actually does this already because of some tuning changes I did for choosing between JM and Ion compilation).  The basic idea is:

1. Keep 2 variables in script (loopCount and maxLoopCount, both initialized to 0), counting loops taken within the script.
2. Every time we enter the script, do:  |maxLoopCount = max(maxLoopCount, loopCount); loopCount = 0|
3. Every time we go around a loop, do |loopCount++|
4. When compiling with Ion, check whether |maxLoopCount| is under a certain threshold (e.g. 40).  In that case, treat the script as not containing loops.

I can implement loop counting in baseline to assist with this.
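The four steps above can be sketched as plain counters; all names here are hypothetical stand-ins for whatever fields the real implementation would add to the script:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Step 1: two counters per script, both starting at 0 (illustrative names).
struct LoopCounts {
    uint32_t loopCount = 0;     // iterations since the last script entry
    uint32_t maxLoopCount = 0;  // highest iteration count seen for one entry
};

// Step 2: on every script entry, fold the previous run into the maximum.
void onScriptEntry(LoopCounts &c) {
    c.maxLoopCount = std::max(c.maxLoopCount, c.loopCount);
    c.loopCount = 0;
}

// Step 3: on every trip around a loop.
void onLoopBackedge(LoopCounts &c) {
    c.loopCount++;
}

// Step 4: under the threshold, pretend the script contains no loops.
bool treatAsLoopFree(const LoopCounts &c, uint32_t threshold = 40) {
    return std::max(c.maxLoopCount, c.loopCount) < threshold;
}
```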
That was an interesting read. It gives me some new insights ;). But I don't think we need to look into adding this idea to baseline.

I've redone this patch this week and I got the deltablue/richards improvements, i.e. the same speed as v8. The only reason why I didn't post yet is that I have a really big regression on raytrace. This is kinda weird and I need to find out why. I've had a lot of theories, but they were all wrong. So I'm still chasing the real reason.

I do think adding this loopCount/maxLoopCount threshold would give the performance back on raytrace, but doesn't address the real reason of the performance hit...
Blocks: 861596
I found the issue in raytrace. To test this, I adjusted raytrace.js to just loop renderScene 100 times. The time difference between the function :680 inlined or not is 100ms. So 500ms vs 600ms.

The compilation of :680 takes 2ms. In :709 we inline :680 multiple times, increasing the compilation time of :709 to 12ms. Now, since we have background-thread compilation, the compilation time isn't a real issue. The issue is that function :709 isn't compiled yet. So we are stuck in BC/JM/interpreter until function :709 is fully compiled, losing the time delta between the speed of IM and BC/JM/interpreter. That is at most 12ms every time. In the testcase we GC 10 times, which gives us 120ms. So really close to the 100ms we are indeed losing.

Just for interest I include the breakdown. (the time is in ms and the description tells what has just finished). Most time goes to GVN and generate LIR:

ion_compile,v8/raytrace.js:680
0.0 Compile start
0.700327838834 Split
0.701115215455 Renumber Blocks
0.710153848321 Dominator tree
0.721660251918 Eliminate phis
0.722406076463 Phi reverse mapping
0.744747711995 apply types
0.769270479363 Alias analysis
0.792744585025 Eliminate dead resume point
1.08217803304 GVN
1.11643666302 UCE
1.16298977749 LICM
1.17035083333 Beta
1.18438134946 Range Analysis
1.18812807899 De-Beta
1.19324743557 Truncate
1.20048806082 EAA
1.21032745148 DCE
1.2221197897 Edge case analysis
1.23651230188 Bounds check removal
1.91628172307 Generate LIR
2.09893830925 Generate code

ion_compile,v8/raytrace.js:709
0.0 Compile start
2.68146953493 Split
2.68759881823 Renumber Blocks
2.72924949204 Dominator tree
2.7910057363 Eliminate phis
2.79546307676 Phi reverse mapping
2.88392533303 apply types
2.97173402446 Alias analysis
3.05626996172 Eliminate dead resume point
5.31791774802 GVN
5.47176043468 UCE
5.8717548005 LICM
5.89957638011 Beta
5.96631394354 Range Analysis
5.98146354862 De-Beta
6.00464819829 Truncate
6.0326028854 EAA
6.06857219321 DCE
6.11694585201 Edge case analysis
6.16851268219 Bounds check removal
11.5257774288 Generate LIR
12.1990675433 Generate code

I don't think we need to improve the speed of those passes (to fix this bug). I think the next step is to find out why our heuristics didn't stop this inlining. Script :680 is too big to get inlined. It already takes 2ms to compile alone. Most scripts take 0.1ms...
The script is not too large to get inlined.

inlineMaxTotalBytecodeLength is 1000 bytes, and this one is just a few hundred bytes in length.

I suspect that there should be an individual limit (per-script), and MaxTotalBytecodeLength should be used only for the sum of multiple scripts in multi-inlining situations.
This patch gives a 2% - 4% improvement on v8. Noisy because of v8-splay. But definitely 2% at least ;). It improves richards, deltablue and crypto.

Note: there is a 4% regression on raytrace. I know the issue and will post a follow-up. The issue is not related to loops, but is a general inlining issue; it just becomes visible on raytrace when enabling loop inlining. Raytrace is coincidentally in a sweet spot when not loop inlining. A solution to reach this sweet spot deliberately, not by accident, will probably need some refactoring of IonMonkey. Atm loop inlining is more important than winning this regression back (v8 improves + it apparently fixes some dromaeo issues).

Patch includes:
1) Disable inlining when IonScript->bailoutExpected is set.

2) Add more spew information to canInlineTarget. Did this already for makeInlineDecisions in an earlier bug.

3) Remove always inlining of scripts with length 1. This was a heuristic added for deltablue before inlining of a subset of scripts was added.

4) Add a script-specific max. inline depth. This tracks the maximum inline depth before a script can't inline its current inlined scripts anymore. Improves score on richards, crypto and raytrace.

5) Disable inlining when the caller is bigger than inlineMaxTotalBytecodeLength. Currently we only looked at the callee. But raytrace.js:709 is 4000+ bytes; we shouldn't inline anything into it. This fixes my previous comment about :709 taking 12ms to compile. It is already that big and we are inlining :680 3 times into it. And :680 (with all inlined scripts) already takes 2ms...

6) Remove the "isIonInlineable" check for loops in jsanalyze.cpp. Improves richards, deltablue. Regresses raytrace horribly (22%), because of the sweet-spot issue.
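Points 4 and 5 boil down to two extra checks in the inlining decision. A rough sketch with illustrative names and the 1000-byte limit from point 5 (not the actual IonBuilder code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the fields the patch consults.
struct ScriptInfo {
    uint32_t bytecodeLength;  // script size in bytecode bytes
    uint32_t maxInlineDepth;  // point 4: script-specific remaining depth
};

const uint32_t inlineMaxTotalBytecodeLength = 1000;

// Point 5: a huge caller (like raytrace.js:709 at 4000+ bytes) should not
// inline anything, or its compile time balloons while we stay stuck in
// BC/JM/interpreter.
bool callerTooLarge(const ScriptInfo &caller) {
    return caller.bytecodeLength > inlineMaxTotalBytecodeLength;
}

// Point 4: refuse to inline once the current depth reaches the script's
// recorded maximum inline depth.
bool depthAllowsInlining(const ScriptInfo &script, uint32_t inliningDepth) {
    return inliningDepth < script.maxInlineDepth;
}
```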
Attachment #696298 - Attachment is obsolete: true
Attachment #737449 - Flags: review?(kvijayan)
Depends on: 861832
Comment on attachment 737449 [details] [diff] [review]
Inlining loops (finally successful)

Review of attachment 737449 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for taking care of this so quickly :)  Much appreciated.

::: js/src/ion/IonBuilder.cpp
@@ +3243,5 @@
> +    IonBuilder *baseBuilder = this;
> +    while (baseBuilder->callerBuilder_)
> +        baseBuilder = baseBuilder->callerBuilder_;
> +
> +    // Caller must not be excessively large.

I really don't think that we should disable ALL inlining for large callers.  If the function being inlined is TINY, then we should go ahead and inline it anyway.  Examples in raytrace:709 of tiny function calls which should be inlined:

Color.prototype.multiplyScalar
Vector.prototype.add
Color.prototype.multiply
Vector.prototype.subtract
Color.prototype.add
Vector.prototype.normalize
Vector.prototype.dot
(and more)

These should all be inlined, because :709 (rayTrace) is called at least 10,000 times, and there are like 20 or so calls like that in its body.

The thing is, once we inline these small functions, we want to avoid inlining other large functions within those.  I think there's a simple way to do this:

Add a "fatCompile_" flag to IonBuilder that's set to true for large scripts (for top-level IonBuilders), or inherited from parent IonBuilder (for inline builders).

If "fatCompile_" is true, then allow only small functions to be inlined.
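A minimal sketch of the suggested flag, assuming hypothetical names and thresholds: large top-level scripts only inline tiny callees, and inline builders inherit the restriction from their parent.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative size cutoffs; the real values would need tuning.
const uint32_t largeScriptLength = 1000;
const uint32_t tinyCalleeLength = 100;

struct Builder {
    bool fatCompile_;

    // Top-level builder: a large script makes this a "fat" compile.
    explicit Builder(uint32_t scriptLength)
        : fatCompile_(scriptLength > largeScriptLength) {}

    // Inline builder: inherit the flag from the parent builder.
    Builder(const Builder &parent, uint32_t /*calleeLength*/)
        : fatCompile_(parent.fatCompile_) {}

    bool mayInline(uint32_t calleeLength) const {
        // In a fat compile, only tiny callees (Vector.add and friends) qualify.
        if (fatCompile_)
            return calleeLength <= tinyCalleeLength;
        return true;
    }
};
```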

@@ +3301,5 @@
> +    if (!baseBuilder->script()->isInlineDepthCheckDisabled()) {
> +        uint32_t updatedInlineDepth = globalMaxInlineDepth - inliningDepth_ - 1;
> +        if (updatedInlineDepth < baseBuilder->script()->maxInlineDepth())
> +            baseBuilder->script()->setMaxInlineDepth(updatedInlineDepth);
> +    }

I now get what you're doing here.  It's nice.
Attachment #737449 - Flags: review?(kvijayan) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/e6f337d218b4

FYI I enabled inlining of small functions into large callers. I did it differently than suggested; it could be done much more easily ;). This updated heuristic improved the regression on raytrace from 4% to 2%. Thanks!
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/786614886674 since auto-regress/bug648747.js somehow managed to fail for you.
This also regressed SS crypto-aes on AWFY..
It regresses ss-raytrace, ss-cube and ss-crypto-aes, giving a 3.3% regression on SS. It also reports less improvement than on my machine, but that could be due to a single data point; over multiple measurements we might see a bigger improvement. I'll leave it in for now and discuss (after we have some more measurements) what to do and whether we can take this hit. We need loop inlining and I think it will be hard not to make it an SS hit.
Backing out for SS regressions, there have been several runs on awfy now to see what numbers there look like.

https://hg.mozilla.org/integration/mozilla-inbound/rev/119c05c4437e
(In reply to Brian Hackett (:bhackett) from comment #21)
> Backing out for SS regressions, there have been several runs on awfy now to
> see what numbers there look like.
> 
> https://hg.mozilla.org/integration/mozilla-inbound/rev/119c05c4437e

Especially on the Unagi [1], where the regression on sunspider is around 10%.

[1] http://arewefastyet.com/#machine=14
https://hg.mozilla.org/mozilla-central/rev/8ba06fb1bba0
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Quick fix to inline loops in small functions. Gives all improvements, without the regression on raytrace.

Richards: 22420
DeltaBlue: 22454
Crypto: 24188
RayTrace: 35343
EarleyBoyer: 26685
RegExp: 4059
Splay: 19123
NavierStokes: 28260
----
Score (version 7): 19960

to

Richards: 24959
DeltaBlue: 25449
Crypto: 24770
RayTrace: 37369
EarleyBoyer: 27109
RegExp: 4004
Splay: 19721
NavierStokes: 28288
----
Score (version 7): 20842
Attachment #747594 - Flags: review?(kvijayan)
forgot to qref
Attachment #747594 - Attachment is obsolete: true
Attachment #747594 - Flags: review?(kvijayan)
Attachment #747597 - Flags: review?(kvijayan)
Comment on attachment 747597 [details] [diff] [review]
Inline small functions with loops

Review of attachment 747597 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #747597 - Flags: review?(kvijayan) → review+
This is now officially cursed! Four attempts to create a patch. One with a regression causing a backout, and even when it doesn't regress anything, it gets backed out for a jittest failure. (Even though I ran it locally to make sure there was no failure. Thanks shu for debugging it, since I still cannot repro.)

As a result this is going to try server.
https://tbpl.mozilla.org/?tree=Try&rev=0f2ca646a8ce
If this is green and it gets backed out on m-i, it's a sign of the gods and I'll add it to my ignore list.
https://hg.mozilla.org/mozilla-central/rev/6336284c7f1f
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Sorry for bothering, but I can't see the improvements mentioned in comment 25 on arewefastyet.com. Am I missing something?
(In reply to Coyote from comment #33)
> Sorry for bothering, but I can't see the improvements mentioned in comment
> 25 on arewefastyet.com. Do I miss something?

No problem. The big difference is that the report is on "v8", not "octane". Awfy only reports octane now. Since the improvement was for the original "v8" tests and not for the extra tests in "octane", I listed those. It is easier to see the difference there, since there is less noise. But the improvement was 1% for x86 octane and 2% for x64 octane.

deltablue: 20% (10% on x86, very noisy)
earley-boyer: 1%
richards: 10%
Ah ok, I assumed that the improvements would also improve the "V8-part" tests on Octane. Why is there such a difference between V8 and Octane?

Thanks for your explanation so far, I know this isn't contributing or something, I'm just curious :-)
Oh, it improved the "v8-part" tests on octane. The improvement listed above is on the octane bench. It is just harder to see on octane, since it contains more benchmarks.
Thanks for the explanation!
Depends on: 879727
No longer depends on: 879727
Depends on: 879727
Depends on: 900683
You need to log in before you can comment on or make changes to this bug.