Closed
Bug 768288
Opened 9 years ago
Closed 8 years ago
IonMonkey: Inline functions with loops
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
mozilla23
People
(Reporter: dvander, Assigned: h4writer)
References
(Depends on 1 open bug, Blocks 3 open bugs)
Details
(Whiteboard: [ion:t])
Attachments
(2 files, 3 obsolete files)
13.36 KB,
patch
|
djvj
:
review+
|
Details | Diff | Splinter Review |
1.89 KB,
patch
|
djvj
:
review+
|
Details | Diff | Splinter Review |
This isn't terribly important, but would could be worth about 2-3% on earley-boyer. sc_assq is fairly hot and manually inlining it gave some improvement.
Comment 2•9 years ago
|
||
This wins about 500 points on v8-richards (14500 -> 15000) but regresses deltablue, so that needs investigation first.
![]() |
Reporter | |
Updated•9 years ago
|
Whiteboard: [ion:t]
Assignee | ||
Comment 4•8 years ago
|
||
My measurements now show a 5%-7% on earley-boyer, 6% on richards, but 7% loss on deltablue.
Assignee: general → hv1989
Blocks: 765980
![]() |
Reporter | |
Comment 5•8 years ago
|
||
That's a net win, right? I say take it :) might be even bigger with the inline-arguments-use patch.
Assignee | ||
Comment 6•8 years ago
|
||
Looks like there is a big difference in running the V8 scripts separately or not. When running the whole benchmark, the timings go to: 0% on richards :( -5% on deltablue 3% on earley-boyer That makes it a net loss. Therefore the need to first investigate that regression or investigate why richards looses its improvement.
Assignee | ||
Comment 7•8 years ago
|
||
WIP patch does the following (WIP means just works, code barely readable): Richards: 10598 to 11247 DeltaBlue: 12285 to 12629 Crypto: 18270 to 18759 RayTrace: 8856 to 8663 EarleyBoyer: 13917 to 14063 RegExp: 2031 to 2042 Splay: 11316 to 11026 NavierStokes: 23502 to 23697 ---- Score (version 7): 10592 to 10711 After some more testing this comes down to 6% on richards and 1.5% on earley-boyer. 1%-2% in total. Also now I know what is needed to get this: 1) Don't inline regularly bailing function. (The places where we can bail will increase if we inline them) (I tried a low barrier of max. 10 times and that worked) 2) Bug 823884: Don't mark top function as uncompileable when an inlined function abort. Just mark it as uninlineable. (Needed for earley-boyer. It will compile now function with loop and arguments and arguments_get_elem will abort and regress us enormeously) 3) I also disabled inlining function calls to itself. Not sure if that's required though... 4) Remove the loop check to disable inlining.
Assignee | ||
Comment 8•8 years ago
|
||
Implements point 1 and 4
Attachment #643848 -
Attachment is obsolete: true
Attachment #696298 -
Flags: review?
Assignee | ||
Updated•8 years ago
|
Attachment #696298 -
Flags: review? → review?(nicolas.b.pierron)
Comment 9•8 years ago
|
||
Comment on attachment 696298 [details] [diff] [review] Inline loops Review of attachment 696298 [details] [diff] [review]: ----------------------------------------------------------------- Using bailouts as a metric for inlining targets is a strange but good idea. Namely: and inline frames doing too many bailouts will cause the caller to return to the interpreter and thus not being executed as fast as possible. Still, you should reset the bailout counter after the invalidation of the IonScript, or better, store this counter on the IonScript / IonCode. Especially since the bailout counter only matters for Ion-compiled functions. ::: js/src/ion/Ion.h @@ +148,5 @@ > > + // Maximum number of bailouts the callee may have to allow inlining. > + // > + // Default: 10 > + uint32_t inlineMaxBailCount; This metric only matters to script having a for loop since we are not supposed to inline a function after it has been used for a while in JM. Comment that above. Also you should reset the counter to prevent deoptimization happening after an invalidation of the IonScript to ensure that for example, the first GC will not forbid such optimization by invalidating active frames.
Attachment #696298 -
Flags: review?(nicolas.b.pierron)
Comment 10•8 years ago
|
||
I have a few thoughts on this issue, having played earlier with heuristics for JM => Ion calls for scripts with loops. An important observation is the following: For functions with high-iteration, long-running loops, the value of inlining is low, because the cost of the call will become a very small fraction of the cost of the runtime of the function. The place where disabling inlining of functions with loops really hits us is with functions where the loops are "small". Either looping cross small arrays that act as vectors, or "for (name in object)" type loops, or (what I'm running into right now with argsobj optimization) loops over arguments object passed in from the caller. It's possible to do this by having a heuristic that measures the "cardinality" of loops within a script (the interpreter actually does this already because of some tuning changes I did to choosing between JM and Ion compilation). The basic idea is: 1. Keep 2 variables in script (loopCount and maxLoopCount, both initialized to 0), counting loops taken within the script. 2. Every time we enter the script, do: |maxLoopCount = max(maxLoopCount, loopCount); loopCount = 0| 3. Every time we go around a loop, do |loopCount++| 4. When compiling with Ion, check |maxLoopCount| for values under a certain threshold (e.g. 40). In that case, treat them as not containing loops. I can implement loop counting in baseline to assist with this.
Assignee | ||
Comment 11•8 years ago
|
||
That was an interesting read. It gives me some new insights ;). But I don't think we need to look into adding this idea to baseline. I've redone this patch this week and I got the deltablue/richards improvements. I.e. same speed as v8. The only reason why I didn't post yet, is because I have an really big regression on raytrace. This is kinda weird and I need to find out why. I've had a lot of theories, but they are all wrong. So I'm still chasing the real reason. I do think adding this loopCount/maxLoopCount threshold would give the performance back on raytrace, but doesn't address the real reason of the performance hit...
Assignee | ||
Comment 12•8 years ago
|
||
I found the issue in raytrace. To test this, I adjusted raytrace.js to just loop renderScene 100 times. The time difference between the function :680 inlined or not is 100ms. So 500ms vs 600ms. The compilation of :680 takes 2ms. In :709 we inline :680 multiple times, increasing the compilation of :709 to 12ms. Now since we have background thread compilation the compilation time isn't a real issue. The issue is that function :709 isn't compiled. So we are stuck in BC/JM/interpreter until function :709 is fully compiled. Loosing the time delta between the speed of IM and BC/JM/Interpreter. That is at max 12ms every time. In the testcase we gc 10 times. Gives us 120ms. So really close to the 100ms we are indeed loosing. Just for interest I include the breakdown. (the time is in ms and the description tells what has just finished). Most time goes to GVN and generate LIR: ion_compile,v8/raytrace.js:680 0.0 Compile start 0.700327838834 Split 0.701115215455 Renumber Blocks 0.710153848321 Dominator tree 0.721660251918 Eliminate phis 0.722406076463 Phi reverse mapping 0.744747711995 apply types 0.769270479363 Alias analysis 0.792744585025 Eliminate dead resume point 1.08217803304 GVN 1.11643666302 UCE 1.16298977749 LICM 1.17035083333 Beta 1.18438134946 Range Analysis 1.18812807899 De-Beta 1.19324743557 Truncate 1.20048806082 EAA 1.21032745148 DCE 1.2221197897 Edge case analysi 1.23651230188 Bounds check removal 1.91628172307 Generate LIR 2.09893830925 Generate code ion_compile,v8/raytrace.js:709 0.0 Compile start 2.68146953493 Split 2.68759881823 Renumber Blocks 2.72924949204 Dominator tree 2.7910057363 Eliminate phis 2.79546307676 Phi reverse mapping 2.88392533303 apply types 2.97173402446 Alias analysis 3.05626996172 Eliminate dead resume point 5.31791774802 GVN 5.47176043468 UCE 5.8717548005 LICM 5.89957638011 Beta 5.96631394354 Range Analysis 5.98146354862 De-Beta 6.00464819829 Truncate 6.0326028854 EAA 6.06857219321 DCE 6.11694585201 Edge case analysi 6.16851268219 Bounds check removal 11.5257774288 Generate LIR 12.1990675433 Generate code I don't think we need to improve the speed of those passes (to fix this bug). I think the next step is to find why our heuristics don't stop this from inlining. Script :680 is too big to get inlined. It already takes 2ms to compile alone. Most scripts take 0.1ms...
Comment 13•8 years ago
|
||
The script is not too large to get inlined. inlineMaxTotalBytecodeLength is 1000 bytes, and this one is just few hundred bytes in length. I suspect that there should be an individual limit (per-script), and MaxTotalBytecodeLength should be used only for the sum of multiple scripts in multi-inlining situations.
Assignee | ||
Comment 14•8 years ago
|
||
This patch gives a 2% - 4% improvement on v8. Noisy because of v8-splay. But definitely 2% at least ;). It improves richards, deltablue and crypto. Note: there is a 4% regression on raytrace. I know the issue and will post a follow-up. The issue is not related with loops, but a general inlining issue. It just is visible on raytrace when enable loop inlining. Raytrace is coincidental on a sweet spot, when not loop inlining. A solution to get this sweet spot, not by accident probably will needs some refactoring of IonMonkey. Atm loop inlining is more important than getting this regression back (v8 improves + apparently fixes some drameo issues) Patch includes: 1) Disable inlining when IonScript->bailoutExpected is set. 2) Add more spew information to canInlineTarget. Did this already for makeInlineDecisions in earlier bug. 3) Remove always inlining of scripts with length 1. This was a heuristic added for deltablue before inlining of a subset of scripts was added. 4) Add a script specific max. inline depth. This tracks the maximum inline depth before a script can't inline it's current inlined scripts anymore. Improves score on richards, crypto and raytrace. 5) Disable inlining when caller is bigger than inlineMaxTotalBytecodeLength. Currently we only looked to callee. But for raytrace.js:709 is 4000+. We shouldn't inline anything in it. This fixes my previous comment about 709 taking 12ms to compile. It is already that big and we are inlining :680 3 times in it. And :680 (with all inlined scripts) takes already 2ms... 6) Remove the "isIonInlineable" for loops in jsanalyze.cpp. Improves richards, deltablue. Regresses raytrace horribly (22%), because of the sweetspot issue.
Attachment #696298 -
Attachment is obsolete: true
Attachment #737449 -
Flags: review?(kvijayan)
Comment 15•8 years ago
|
||
Comment on attachment 737449 [details] [diff] [review] Inlining loops (finally successful) Review of attachment 737449 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for taking care of this so quickly :) Much appreciated. ::: js/src/ion/IonBuilder.cpp @@ +3243,5 @@ > + IonBuilder *baseBuilder = this; > + while (baseBuilder->callerBuilder_) > + baseBuilder = baseBuilder->callerBuilder_; > + > + // Caller must not be excessively large. I really don't think that we should disable ALL inlining for large callers. If the function being inlined is TINY, then we should go ahead and inline it anyway. Examples in raytrace:709 of tiny function calls which should be inlined: Color.prototype.multiplyScalar Vector.prototype.add Color.prototype.multiply Vector.prototype.subtract Color.prototype.add Vector.prototype.normalize Vector.prototype.dot (and more) These should all be inlined, because :709 (rayTrace) is called at least 10,000 times, and there are like 20 or so calls like that in its body. The thing is, once we inline these small functions, we want to avoid inlining other large functions within those. I think there's a simple way to do this: Add a "fatCompile_" flag to IonBuilder that's set to true for large scripts (for top-level IonBuilders), or inherited from parent IonBuilder (for inline builders). If "fatCompile_" is true, then allow only small functions to be inlined. @@ +3301,5 @@ > + if (!baseBuilder->script()->isInlineDepthCheckDisabled()) { > + uint32_t updatedInlineDepth = globalMaxInlineDepth - inliningDepth_ - 1; > + if (updatedInlineDepth < baseBuilder->script()->maxInlineDepth()) > + baseBuilder->script()->setMaxInlineDepth(updatedInlineDepth); > + } I now get what you're doing here. It's nice.
Attachment #737449 -
Flags: review?(kvijayan) → review+
Assignee | ||
Comment 16•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/e6f337d218b4 FYI I enabled inlining of small functions into large callers. I did it different than suggested. It could be done much easier ;). This updated heuristic improved the regression on raytrace from 4% to 2%. Thanks!
Comment 17•8 years ago
|
||
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/786614886674 since auto-regress/bug648747.js somehow managed to fail you.
Assignee | ||
Comment 18•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/8ba06fb1bba0 Repushed with fix for auto-regress/bug648747.js
Comment 19•8 years ago
|
||
This also regressed SS crypto-aes on AWFY..
Assignee | ||
Comment 20•8 years ago
|
||
It regresses ss-raytrace, ss-cube and ss-crypto-aes. Giving a 3.3% regression on ss. Also it reports less improvement than on my machine. But that could be that single dot. Could be over multiple measurements we see a bigger improvement. I'll leave it in for now and discuss (after we have some more measurements) what to do and if we can take this hit. We need loop inlining and I think it will be hard to not make it a SS hit.
Comment 21•8 years ago
|
||
Backing out for SS regressions, there have been several runs on awfy now to see what numbers there look like. https://hg.mozilla.org/integration/mozilla-inbound/rev/119c05c4437e
Comment 22•8 years ago
|
||
(In reply to Brian Hackett (:bhackett) from comment #21) > Backing out for SS regressions, there have been several runs on awfy now to > see what numbers there look like. > > https://hg.mozilla.org/integration/mozilla-inbound/rev/119c05c4437e Especially on the Unagi [1] where the regression on sunspider is around ~10%. [1] http://arewefastyet.com/#machine=14
Comment 23•8 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/8ba06fb1bba0
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Updated•8 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 24•8 years ago
|
||
Backout: https://hg.mozilla.org/mozilla-central/rev/119c05c4437e
Assignee | ||
Comment 25•8 years ago
|
||
Quick fix to inline loops in small functions. Gives all improvements, without the regression on raytrace. Richards: 22420 DeltaBlue: 22454 Crypto: 24188 RayTrace: 35343 EarleyBoyer: 26685 RegExp: 4059 Splay: 19123 NavierStokes: 28260 ---- Score (version 7): 19960 to Richards: 24959 DeltaBlue: 25449 Crypto: 24770 RayTrace: 37369 EarleyBoyer: 27109 RegExp: 4004 Splay: 19721 NavierStokes: 28288 ---- Score (version 7): 20842
Attachment #747594 -
Flags: review?(kvijayan)
Assignee | ||
Comment 26•8 years ago
|
||
forgot to qref
Attachment #747594 -
Attachment is obsolete: true
Attachment #747594 -
Flags: review?(kvijayan)
Attachment #747597 -
Flags: review?(kvijayan)
Comment 27•8 years ago
|
||
Comment on attachment 747597 [details] [diff] [review] Inline small functions with loops Review of attachment 747597 [details] [diff] [review]: ----------------------------------------------------------------- Looks good.
Attachment #747597 -
Flags: review?(kvijayan) → review+
Assignee | ||
Comment 28•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/8011f4e535fa
Comment 29•8 years ago
|
||
Sorry, I backed this out on inbound because of jit-test failures: https://tbpl.mozilla.org/php/getParsedLog.php?id=22795467&tree=Mozilla-Inbound https://hg.mozilla.org/integration/mozilla-inbound/rev/58c6702b5d84
Assignee | ||
Comment 30•8 years ago
|
||
This is now offically cursed! Four attempts to create a patch. One with a regression causing a backout and even when it doesn't regression anything. It get's backed out for jittest failure. (Even when I ran it locally to make sure there was no failure. Thanks shu for debugging it, since I still cannot repro.). As a result this is going to try server. https://tbpl.mozilla.org/?tree=Try&rev=0f2ca646a8ce If this is green and it gets backed out on m-i, it's a sign of the gods and I'll add it to my ignore list.
Assignee | ||
Comment 31•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/6336284c7f1f
Comment 32•8 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/6336284c7f1f
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Comment 33•8 years ago
|
||
Sorry for bothering, but I can't see the improvements mentioned in comment 25 on arewefastyet.com. Do I miss something?
Assignee | ||
Comment 34•8 years ago
|
||
(In reply to Coyote from comment #33) > Sorry for bothering, but I can't see the improvements mentioned in comment > 25 on arewefastyet.com. Do I miss something? No problem. The big difference is that the report is on "v8" not "octane". Awfy reports only octane now. Since the improvement was for the original "v8" tests and not for the extra tests in "octane". I listed that. It is easier/better to see the difference, since there is less noise. But the improvement was a 1% for x86 octane and 2% for x64 octane. deltablue: 20% (10% on x86, very noisy) earley-boyer: 1% richards: 10%
Comment 35•8 years ago
|
||
Ah ok, I assumed that the improvements would also improve the "V8-part" tests on Octane. Why is there such a difference between V8 and Octane? Thanks for your explanation so far, I know this isn't contributing or something, I'm just curious :-)
Assignee | ||
Comment 36•8 years ago
|
||
Oh, it improved the "v8-part" tests on octane. The improvement listed above is on the octane bench. It is just harder to see on octane, since it contains more benchmarks.
Comment 37•8 years ago
|
||
Thanks for the explanation!
You need to log in
before you can comment on or make changes to this bug.
Description
•