Open Bug 847160 Opened 12 years ago Updated 3 months ago

IonMonkey: Better code for self-hosted functions and aggressive inlining

Categories

(Core :: JavaScript Engine, defect)


People

(Reporter: evilpie, Unassigned)

References

(Blocks 3 open bugs)

Attachments

(2 files)

Attached file testcase
Julian Viereck just showed me this http://jsperf.com/iter-array testcase on twitter. The forEach case is unbelievable slow compared to everything else. We have multiple problems here: - We don't inline ArrayForEach, which might be worthwhile - The generated code for ForEach is very bad, we don't hoist bounds checks in the inner loop and use a CallGeneric for the callback - The call to the callback in ArrayForEach should be completely inlined. I attached a shell testcase that shows this problem.
Also, from the results of that test there is a big performance regression on the different simple for loops, counting up/down:

browser (runs) -- down / up / up with length cache
FF18 (3) -- 839,777 / 839,777 / 839,777
FF19 (5) -- 821,707 / 821,707 / 821,707
FF20 (1) -- 147,997 / 50,489 / 50,489
FF22 (4) -- 207,446 / 119,751 / 199,751

I am not sure what is causing the slowdown from 19 to 20 on those tests.
That's interesting. I wonder if this regressed since landing the self-hosting or if the slight differences to my performance tests make all the difference here. Back when I last tested this in December, there wasn't any speed difference at all between the forEach and the numeric for loop versions of my tests.
Yeah, so this regressed. Strongly. The attached testcase takes ~10x as long as it does when commenting out the forEach version and activating the numeric loop version. With methodjit (i.e., --no-ion), it runs slightly faster, but the numeric loop version is only ~2x faster. Maybe the regression is somehow related to the changes for ParallelArrays?
Flags: needinfo?(nmatsakis)
When did this regress? Should we be tracking this for some release?
Keywords: regression
I am not sure what changes would have caused the regression, but of course I wouldn't rule it out.
Flags: needinfo?(nmatsakis)
Mmh, so this didn't regress after all: the very first Nightly containing the self-hosted array forEach implementation[1] is just as fast as today's. Having said that, my testing before the landing clearly showed speed-parity with numeric for loops at some point. Unfortunately, I only did ad hoc testing without really documenting the results or making them reproducible. Luckily, evilpie already identified what needs to be improved to speed this up. [1]: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/11/2012-11-23-03-08-27-mozilla-central/
OS: Linux → All
Hardware: x86_64 → All
Um. Both today's build and the builds right after landing show self-hosted forEach comparable to manual numeric loops on the testcase in bug 602132, say. So what's special about the testcase(s) here?
Furthermore, if I take the attachments in this bug and use a manually self-hosted forEach, I don't see them get much faster. That is, the self-hosted forEach does just as well as a hand-rolled forEach _function_ but not nearly as well as numerical iteration with everything inlined. It's entirely possible that we used to only compare to hand-rolled forEach..... The main hit is in fact that we don't inline the call to the callback in ArrayForEach.
(In reply to Boris Zbarsky (:bz) from comment #7) > So what's special about the testcase(s) here? The answer to that is surprisingly simple: these testcases run inside a test() function whereas the ones in bug 602132 run in the top-level script. If you change the latter to run inside a function, times roughly quadruple. Evilpie, do you know of any heuristics in IM that cause these differences?
Oh, just did another quick variation I hadn't thought of before: Moving just the forEach callback out of the test() function to the top-level script restores the speed. So it looks like inner functions aren't inlined into ArrayForEach and somehow throw off some optimizations like hoisted bounds checks.
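A sketch of the two variants described in the last couple of comments (all names are illustrative, not from the attached testcase):

```javascript
var arr = [];
for (var i = 0; i < 1000; i++)
  arr.push(i);

// Variant 1 (fast): the callback lives at the top-level script, so it is
// not a closure over function-local state.
var topLevelSum = 0;
function topLevelCallback(v) { topLevelSum += v; }
function testTopLevel() {
  topLevelSum = 0;
  arr.forEach(topLevelCallback);
  return topLevelSum;
}

// Variant 2 (slow): the callback is an inner function closing over a
// local variable -- the closure case that, per the comments above, was
// not inlined into ArrayForEach at the time.
function testInner() {
  var sum = 0;
  arr.forEach(function (v) { sum += v; });
  return sum;
}
```

Both variants compute the same sum; only the placement of the callback differs.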
Did some investigation into why forEach is so slow (in particular, why we don't inline) beyond the obvious "we don't have enough TI resolution to inline". That is, using callsite cloning or the landing of bug 804676 does *not* solve these problems. This comment also applies to bug 851120.

There are two things we want to be inlineable for self-hosted higher-order operations like forEach to be fast: the call to forEach itself (for nested iteration, like the one in bug 851120), and the call to the kernel.

The call to forEach is not inlineable because:
- For semantic reasons (like keeping functions' .length property spec-compliant), forEach uses |arguments[i]|, which is NYI in Ion.
- jsanalyze.cpp marks functions that contain loops as not inlineable. bhackett and h4writer have said on IRC that this should be enabled sometime in the future?
- Heavyweight functions are not inlineable.

The call to the kernel function is not inlineable because:
- TypeObjects are not inlineable. See bug 847694.
- Closures aren't inlineable. Related to heavyweight functions not being inlineable.
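To illustrate the |arguments[i]| point: this is a simplified sketch of how a self-hosted forEach ends up touching |arguments| (this is not SpiderMonkey's actual builtins/Array.js source). The optional thisArg cannot be a named parameter, because Array.prototype.forEach.length must be 1 per spec, so it is read via |arguments| instead:

```javascript
function ArrayForEach(callbackfn /*, thisArg */) {
  var O = Object(this);
  var len = O.length >>> 0;
  if (typeof callbackfn !== "function")
    throw new TypeError(callbackfn + " is not a function");

  // Reading thisArg through |arguments| keeps ArrayForEach.length at 1,
  // but any use of |arguments| was NYI in Ion at the time, blocking inlining.
  var T = arguments.length > 1 ? arguments[1] : undefined;

  for (var k = 0; k < len; k++) {
    if (k in O)
      callbackfn.call(T, O[k], k, O);
  }
}
```

The loop body here is also why the jsanalyze.cpp restriction bites: the function both contains a loop and uses |arguments|, tripping two of the inlining blockers at once.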
Inlining closures probably goes hand in hand with some kind of closure conversion, accounting for deopting on |f.caller| and all that.
FWIW, I don't think there's any technical reason why we couldn't inline heavyweight functions.
So, I looked at this again, and inlining doesn't seem to make much of a difference, for some reason. Adding `SetScriptHints(ArrayForEach, {cloneAtCallsite: true, inline: true});` to builtins/Array.js causes both forEach and the kernel to be inlined, but doesn't change the runtime of the first testcase at all. Interestingly, an opt shell is only a few percent faster than a debug one for the benchmark, and I have no idea why. Will look at it with the tracelogger.
Is this still a major regression from for-loops? I worked on optimizing these a while back. Inlining ArrayForEach directly is not that big of a deal: for most hot loops, we'll spend most of our time inside the forEach loop, so a higher overhead for entering and leaving the loop is not such a big issue. OTOH, inlining the kernel would have a bigger impact. I do think we are able to inline the kernel these days, given that we're cloning ArrayForEach at the callsite, and TI will be able to pick up the kernel as a singleton type. If you're seeing that opt is only a few percent faster than debug, it just means that we're spending most of our time in tight, jitted code.
(In reply to Kannan Vijayan [:djvj] from comment #15) > Is this still a major regression from for-loops? I worked on optimizing > these a while back. It is, sadly. Changing the first attached testcase to use a numeric loop makes it ~3x faster. More importantly, in the jsperf benchmark from comment 0, the forEach version is 97%(!) slower than the other versions.
Assignee: general → nobody
On the jsperf, Nightly / Chrome 43:

forEach: 249,016 / 9,064
for, counting up: 1,070,746 / 401,057
for, counting up, caching length: 1,062,803 / 426,321
for, counting down: 1,065,931 / 145,648

forEach is 78% slower on Firefox and 98% on Chrome.
While this isn't fully solved, it's gotten much better: On the jsperf test, forEach is half as fast (~800kOps/s vs 1600kOps/s) in Nightly, but 1% as fast (~7kOps/s vs. 670kOps/s) on Chrome. Pretty sure the loop version is just completely optimized out for us, making the benchmark largely meaningless. The shell test cases are both as fast with forEach as they are with for loops: 600ms and 62ms for the first and second test, respectively. Other engines don't fare that well, with d8's 62 seconds(!) for the first test being the worst. Is there much left to do here, I wonder?
> On the jsperf test, forEach is half as fast (~800kOps/s vs 1600kOps/s) in Nightly, Fwiw, it's ~465kOps/s vs ~3400kOps/s for me on a Mac nightly....
(In reply to Boris Zbarsky [:bz] from comment #19) > Fwiw, it's ~465kOps/s vs ~3400kOps/s for me on a Mac nightly.... Huh, interesting. My results hold up in a fresh profile on a 2012 rMBP with current Nightly.
Oh, in a clean profile with a single tab I see numbers more like yours. In a clean profile with 10 copies of gmail opened in background tabs I see numbers a lot more like mine.
Severity: normal → S3

We are about 6.5x slower here.

Nightly: https://share.firefox.dev/3yAfBAD (5.5s)
Chrome: https://share.firefox.dev/3Av8D0j (841ms)

cc :iain, bthrall

Looks like the problem here is that we don't inline forEach because we had previously done OSR inside it.

Blocks: 1912835