Closed
Bug 610296
Opened 14 years ago
Closed 11 years ago
Closures as a way to modularize JS code slows us down
Categories
(Core :: JavaScript Engine, defect)
RESOLVED
FIXED
People
(Reporter: azakai, Unassigned)
Attachments
(2 files, 1 obsolete file)
Attached is a demo of the Bullet physics engine (compiled from C++ using emscripten) with WebGL (through the GLGE rendering engine). Instructions for how to run it are in the README.txt file.
In the file bullet.js, there are two possibilities - use a closure for the Bullet library, or not. At the very top and very bottom of the file are a few lines marked with //=== FAST or //=== SLOW; uncomment one set (the same way at top and bottom) to test each variant. The difference is that in FAST we do not use a closure, whereas in SLOW we do. It is just a few lines of code, but it makes a big difference in speed.
On Minefield, compared to Chrome 7, I get the following (all numbers are frames per second; more is better):
Fast (without closure)
----------------------
Firefox: 15-40 during the 'action', stabilize later on 55-60
Chrome: 45-70, 85
Slow (with closure)
-------------------
Firefox: 5-20 during the 'action', stabilize later on 25
Chrome: 45-70, 85 (unchanged)
For Firefox the closure greatly affects performance; for Chrome it doesn't seem to matter.
Note that even without the closure Chrome is faster. This might be because that closure pattern is also used in GLGE, the rendering engine in the attached demo. This seems to be a common pattern in large JavaScript projects, since it's useful as a way to modularize things (|var|s in the closure are hidden and only available to that module, etc.).
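For reference, the module pattern at issue looks roughly like this (a minimal sketch; the names `Bullet`, `internalState`, and `step` are hypothetical stand-ins, not the actual bullet.js code):

```javascript
// SLOW variant: the whole library lives inside a closure (an IIFE).
// Locals like `internalState` are hidden from every other module, and
// the exported functions reach them through the enclosing scope.
var Bullet = (function () {
  var internalState = 0; // private to the module

  function step(dt) {
    internalState += dt; // closed-over variable access
    return internalState;
  }

  return { step: step }; // exported API
}());

// FAST variant: everything at top level, no closure.
var internalState2 = 0; // visible to everyone
function step2(dt) {
  internalState2 += dt; // plain global access
  return internalState2;
}
```

Both variants expose the same API; the difference the bug measures is purely how the engine resolves the captured variables.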
Comment 1•14 years ago
Alon, any chance you can extract a minimal-ish testcase here? We put in a fair amount of work to make closures very fast in the tracer, but that might not have migrated to JM yet... so whether you're tracing or not might matter a good bit. Or perhaps the access patterns involved here are slow on both. Hard to tell with a huge chunk of JS...
See also bug 517164.
Reporter
Comment 2•14 years ago
I can't seem to make a minimal test case that has this problem. I did, however, manage to simplify the original testcase into one that runs in the shell - so all the browser stuff (WebGL, rendering, etc.) is irrelevant.
Run the shell test by running bullet_demo.js. Note that you need to apply the included patch, as otherwise the shell will complain about exceeding the script stack space (see bug 502836).
Results:
//=== FAST (no closure)
tm -m 0.926
v8 1.196
tm 1.398
tm -j -m 4.383
tm -j 11.258
//=== SLOW (with closure)
v8 0.635
tm -m 2.344
tm -j -m 17.124
tm 26.853
tm -j 44.706
So it looks like there are two issues here: first, the closures slow us down, and this happens even without either JIT enabled. Second, the tracing JIT slows us down (by a factor of 10 or more).
Reporter
Comment 3•14 years ago
Adding blocking on |JM: Tune Trace JIT Heuristics|, since as mentioned in the comment above, in the shell test the tracing JIT slows us down by a factor of 10.
Blocks: 580468
Comment 4•14 years ago
You ran your shell tests with heuristics disabled.
Here are the numbers on my machine:
~/test% foreach args ("" "-m" "-j" "-m -j" "-m -j -p")
foreach? echo "args: $args"
foreach? ~/mozilla/tracemonkey/obj-firefox-opt/dist/bin/js $args bullet_demo.js
foreach? end
args:
simulation time: 19.571 seconds
args: -m
simulation time: 1.487 seconds
args: -j
simulation time: 41.726 seconds
args: -m -j
simulation time: 8.931 seconds
args: -m -j -p
simulation time: 3.492 seconds
That last run is with the heuristics enabled. So the heuristics certainly help over raw -m -j, though they don't quite get us down to where -m is on its own.
So it really sounds like the issues here are:
1) Closures are slow in _JM_ (known issue; see comment 1).
2) This particular testcase doesn't trace well, for whatever reason, but unrelated to the speed of closures.
3) Heuristics don't get us down to quite the JM speed.
I suggest filing a separate bug for #3, and possibly a separate bug for #2. #1 presumably depends on bug 517164.
No longer blocks: 580468
Depends on: ClosurePerf
Comment 5•14 years ago
Oh, and you don't have to patch things to change the script space quota; in the JS shell, I just did:
stackQuota(100000000);
at the top of bullet_demo.js and it worked fine.
Comment 6•14 years ago
I'm also not quite sure why you needed the two separate files, btw...
In any case, a profile of the -m run shows that we spend 82% of our time under js_FindPropertyHelper (split 50-50 between self time in js::PropertyCache::fill and self time in js_LookupPropertyWithFlags), so it's not the closure access per se that's being slow; it's finding where the heck to look for the values.
In both functions, the vast majority of the time is spent (according to shark) on lea instructions in a loop like so:
0x100105270 cmp rsi, qword ptr [rdx+16] Loop start[1] jsscope.h:833
0x100105274 jz 0x1001051a4 jsscope.h:833
0x10010527a lea rax, qword ptr [rdx+48] jsscope.h:832
0x10010527e mov rdx, qword ptr [rdx+48] jsscope.h:832
0x100105282 test rdx, rdx jsscope.h:832
0x100105285 jnz 0x100105270 Loop end[1] jsscope.h:832
In fact, it's the _same_ loop in both functions; it's inlined into both of them. In my tree, jsscope.h lines 832 and 833 are:
832 for (spp = startp; js::Shape *shape = *spp; spp = &shape->parent) {
833 if (shape->id == id) {
in Shape::search.
So the point is that we're likely just hitting L2 misses all over or something.
In fill(), we hit this loop because of this check, I think:
74 if (!pobj->nativeContains(*shape)) {
None of which is sounding like closure access per se being slow.
So why are we ending up with all the resolve/fill calls (are we missing propcache or pics somewhere? Too many distinct shapes?)?
And how deep is the chain in search?
And can we get rid of this code in fill() somehow?
Comment 7•14 years ago
So I instrumented that loop to print the number of iterations it goes through (the number of times the loop body is entered). Here are the statistics for just the part after the testcase prints "simulating" (the second column is the iteration count; the first column is the number of times we saw that iteration count):
4 3
4 4
4 5
11 2
11 6
19 100
29 136
63 184
64 230
73 202
94 3744
99 182
144 83
166 524
213 147
323 80
22649 1
Why are those things like 3744 and 524 and whatnot not being hashed?
Comment 9•14 years ago
Numbers on my machine with my proposed patch for bug 610370:
args:
simulation time: 0.584 seconds
args: -m
simulation time: 0.64 seconds
args: -j
simulation time: 7.861 seconds
args: -m -j
simulation time: 3.589 seconds
args: -m -j -p
simulation time: 1.125 seconds
So getting better, though -m being no faster than interp is sadmaking.
v8 can't run the attached testcase, so I have no idea how we compare to that.
A new profile of the -m run shows that we're now spending:
40% of our time in mjit-generated code
20% under stubs::GetElem; calling js::PropertyTable::search from js_GetProperty is half of this! The rest is GetElem and js_GetProperty self time.
16% under stubs::SetElem; again PropertyTable::search is about half of this, though this time we end up calling js_LookupPropertyWithFlags on the way there(?).
9% is the mjit compiler itself, from ic::Call. I wonder why... do we keep generating new Function objects it has to compile?
4% under stubs::CallName
The rest is minor stuff. So yes, now things look like possible closure access, if the GetElem and SetElem are on closures... dvander, does that seem possible?
Comment 10•14 years ago
Oh, and issues 2 and 3 from comment 4 remain. Please file bugs?
Comment 11•14 years ago
This bug should be meta and track concrete bugs that get patches with tests, as bz said. Best not to make this bug about one of N concrete problems/solutions.
GetElem and SetElem are for indexed expressions, but you cannot access upvars except by unqualified names. Something else is causing these ops. What?
/be
Comment 12•14 years ago
One more thing. With the patch for bug 610370, the "FAST" -m time is 0.485 seconds here. The SLOW one is 0.566 seconds. So yeah, at that point most of the cost is not closure-related at all, though they do give some slowdown (perhaps in that mjit-generated code?).
I'd assume the GetElem is mostly the HEAP access and whatnot, but I'm not sure why we end up in js::PropertyTable::search, then...
Comment 13•14 years ago
Fwiw, there's also a performance issue with the load() call when using the closure, but I think that might be bug 501908, given that setFunctionKinds is where all the time is. Worth remeasuring once that's fixed.
Depends on: 501908
Comment 14•14 years ago
Alon, can you make an all-in-one .js file for the shell testcase? Thanks.
It looks like HEAP is just an array, which seems not to be a dense array. Even indexed property names end up atoms identifying shapes mapping slots in such an object. If so, why is HEAP not a dense array?
/be
Reporter
Comment 15•14 years ago
(In reply to comment #14)
> Alon, can you make an all-in-one .js file for the shell testcase? Thanks.
>
I'm trying to simplify the test case now, but making it into one file won't work without the patch for the stack quota, since calling stackQuota() as bz suggested must be done before loading the file. So for now we need two files, one of which is a shell file that just calls stackQuota() and load()s the actual benchmark.
Comment 16•14 years ago
It's not dense at least because INDEX_TOO_SPARSE (capacity is 8, index is 1.5 million or so). But even if I make INDEX_TOO_SPARSE always return false, it's not-dense because array_trace makes sure that too-sparse arrays become not-dense during gc. And if I disable _that_ too, then my runtime drops to 0.28 seconds under -m (about 2x faster than with the slowarray HEAP).
That's for the SLOW case (with the closure). The closure-less case goes to 0.195 seconds under -m.
Note that -m -j -p is still a good bit slower than -m here, and we do still need a bug on that...
Of course the dense array issue is moot if the testcase uses typed arrays, once JM implements fast paths for them. ;)
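The dense-vs-sparse distinction above can be sketched as follows (a minimal illustration; the 1.5 million index comes from the comment, while the `1 << 21` buffer size is an arbitrary stand-in for the testcase's HEAP size):

```javascript
// Writing a plain Array at an index far beyond its capacity makes the
// engine fall back to sparse (slow, hash-like) storage - this is the
// INDEX_TOO_SPARSE situation described above.
var slowHeap = [];
slowHeap[1500000] = 42; // capacity ~8, index ~1.5 million -> sparse

// A typed array is always dense: one flat buffer, fixed length,
// so indexed reads/writes stay on the fast path.
var fastHeap = new Float64Array(1 << 21);
fastHeap[1500000] = 42;
```

This is why switching the testcase's HEAP to a typed array sidesteps the slow-array problem entirely, once the JIT has fast paths for typed arrays.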
Comment 17•14 years ago
Should be fine to just have one file and then run:
js -e "stackQuota(100000000)" -f filename
But in any case, the two-file thing is fine by me.
In case you care, btw, the time for v8 shell on that testcase on my machine is 0.481 seconds on the FAST version and 0.21 seconds on the SLOW version. So they're actually _faster_ on the SLOW version than the FAST one.
So if the array were dense, we'd be more or less on par with them under -m...
Reporter
Comment 18•14 years ago
(In reply to comment #17)
> Should be fine to just have one file and then run:
>
> js -e "stackQuota(100000000)" -f filename
>
Ah, I didn't know about that, nice. Attached is a single-file shell testcase, runnable with
./js [-j -m -p etc.] -e "stackQuota(100000000)" -f bullet_test.js
Regarding typed arrays, the test can use them. It's disabled right now (to make the comparison to v8 easier), but you can change
var TYPED_ARRAYS = false;
to true, to use typed arrays. Might be worth benchmarking that later on.
Filed bug 610438 and bug 610440 for issues #2 and #3 mentioned before, which we said should have their own bugs.
Attachment #488816 -
Attachment is obsolete: true
Reporter
Comment 19•13 years ago
Checked on latest tracemonkey, results with js -m are now
no closure : 2.503
with closure: 3.328
So the closure one is 32% slower.
Reporter
Updated•13 years ago
Blocks: ClosurePerf
No longer depends on: ClosurePerf
Reporter
Comment 20•13 years ago
Some current measurements:
no closure
m-c -m -j -p 2.6
jm -m -j -p 2.9
jm -m -j -p -n 4.0
v8 3.1
closure
m-c -m -j -p 3.3
jm -m -j -p 3.6
jm -m -j -p -n 4.5
v8 3.7
So closures make us about 25% slower. Interestingly this now seems to happen in v8 as well.
(I filed the slowness in jm -n as bug 678687)
Comment 21•12 years ago
It seems like this simple test case is either reproducing this problem in isolation, or a similar one:
http://jsperf.com/const-vs-var-in-closures
The FF numbers show dramatically impaired access performance for variables local to the closure, compared with globals. V8 doesn't reproduce this particular glitch, though, so maybe there's more going on?
Comment 22•12 years ago
Opera seems to have a similar problem, but not as pronounced as Fx (at least judging by the graphs).
This needs to be fixed. Everybody is using closures today.
Anybody have any idea why var and const are so much slower in a closure?
Comment 23•12 years ago
> Anybody have any idea why var and const are so much slower in a closure?
Sure. If you read the bug, you will too. ;)
Comment 24•12 years ago
I'm also experiencing this slowdown in a jsperf because I'm using closures.
The jsperf is this one: http://jsperf.com/oop-benchmark/79
The dejavu (after optimization) use a closure and it slows down things by a great margin (~45% in FF 17 and about 75%~ in FF 19 aurora!!!)
Comment 25•12 years ago
A question. Are the closures involved in the real-life use cases (as opposed to synthetic benchmarks) run once, or run multiple times?
Comment 26•12 years ago
Typically the code inside a closure is only run once. But closures typically have inner functions that are exported to outside the closure. Those are executed several times, and they are slower. E.g.:
(function () {
var test = 1;
var MyObject = function () {};
MyObject.prototype.foo = function () { return test; };
return MyObject;
}());
Comment 27•12 years ago
At this time, calling new MyObject() and calling foo will be ~30% slower!
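A minimal sketch of the kind of micro-benchmark behind a number like that (a hypothetical harness, not the actual jsperf code; `FlatObject` and `time` are made-up names for illustration):

```javascript
// Closure version: foo() reads `test` through the enclosing scope,
// exactly as in the example above.
var ClosureObject = (function () {
  var test = 1;
  function ClosureObject() {}
  ClosureObject.prototype.foo = function () { return test; };
  return ClosureObject;
}());

// Flat version: foo() reads a global instead of a closed-over var.
var test2 = 1;
function FlatObject() {}
FlatObject.prototype.foo = function () { return test2; };

// Crude timer: construct and call foo() a million times,
// returning elapsed milliseconds.
function time(Ctor) {
  var start = Date.now(), sum = 0;
  for (var i = 0; i < 1e6; i++) sum += new Ctor().foo();
  return Date.now() - start;
}

// Compare time(ClosureObject) against time(FlatObject); the reported
// slowdown is the gap between the two.
```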
Comment 28•12 years ago
(In reply to André Cruz from comment #26)
> Typically the code inside the closures are only run once. But closures
> tipically have functions inside that are exported to outside the closure.
> Those are executed several times and they are slower. E.g.:
>
> (function () {
> var test = 1;
>
> var MyObject = function () {};
> MyObject.prototype.foo = function () { return test; };
>
> return MyObject;
> }());
This run-once closure pattern is something I've wanted to optimize for a long time. The problem here is that we treat the outermost closure as if it can run many times, which causes all sorts of deoptimizations to functions within the closure. If we recognized that the outer closure only runs once and that these deoptimizations are not necessary, performance for those inner functions would be much better. I just filed bug 821361 for this case.
Comment 29•12 years ago
This may (or may not) provide some interesting insights into V8 closure optimizations, just FYI: http://mrale.ph/blog/2012/09/23/grokking-v8-closures-for-fun.html
Comment 30•11 years ago
The "slow" version is now ever so slightly faster than the "fast" one.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED