Closed Bug 610296 Opened 14 years ago Closed 11 years ago

Closures as a way to modularize JS code slows us down

Categories

(Core :: JavaScript Engine, defect)

Type: defect
Priority: Not set
Severity: normal

Status: RESOLVED FIXED

People

(Reporter: azakai, Unassigned)

Attachments

(2 files, 1 obsolete file)

982.61 KB, application/x-bzip
803.08 KB, application/octet-stream
Attached file Demo
Attached is a demo of the Bullet physics engine (compiled from C++ using emscripten) with WebGL (through the GLGE rendering engine). Instructions for how to run it are in the README.txt file.

In the file bullet.js there are two possibilities: use a closure for the Bullet library, or don't. At the very top and the very bottom of the file are a few lines marked with //=== FAST or //=== SLOW; uncomment them (the same way at the top and bottom) to test each way. The difference is that FAST does not use a closure, whereas SLOW does. Only a few lines of code differ, but they make a big difference in speed.
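
For reference, a minimal sketch of what the two variants amount to (the names here are hypothetical, not the actual bullet.js contents):

  //=== FAST: library code at top level, nothing wrapped
  var state = 0;                          // visible to all code in the page
  function simulate() { return ++state; }

  //=== SLOW: the same code wrapped in a closure
  var Bullet = (function () {
    var state = 0;                        // hidden, closure-local
    function simulate() { return ++state; }
    return { simulate: simulate };        // export the public API
  }());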

On minefield, compared to Chrome 7, I get the following (all numbers are frames per second, more is better):

Fast (without closure)
----------------------

Firefox: 15-40 during the 'action', stabilizing later at 55-60
Chrome:  45-70,                                           85

Slow (with closure)
-------------------

Firefox:  5-20 during the 'action', stabilizing later at 25
Chrome:  45-70,                                          85 (unchanged)


For Firefox the closure greatly affects performance; for Chrome it doesn't seem to matter.

Note that even without the closure Chrome is faster. This might be because the same closure pattern is also used in GLGE, the rendering engine used in the attached demo. This seems to be a common pattern in large JavaScript projects, since it's useful as a way to modularize things (|var|s in the closure are hidden and only available to that module, etc.).
Blocks: WebJSPerf
Alon, any chance you can extract a minimal-ish testcase here?  We put in a fair amount of work to make closures very fast in the tracer, but that might not have migrated to JM yet... so whether you're tracing or not might matter a good bit.  Or perhaps the access patterns involved here are slow on both.  Hard to tell with a huge chunk of JS...

See also bug 517164.
Attached file Shell test (obsolete)
I can't seem to make a minimal test case that has this problem. I did, however, manage to simplify the original testcase into one that runs in the shell, so all the browser stuff (WebGL, rendering, etc.) is irrelevant.

Run the shell test by running bullet_demo.js. Note that you need to apply the included patch, as otherwise the shell will complain about exceeding the script stack space (see bug 502836).

Results:

//=== FAST (no closure)

tm -m      0.926
v8         1.196
tm         1.398
tm -j -m   4.383
tm -j     11.258

//=== SLOW (with closure)

v8         0.635
tm -m      2.344
tm -j -m  17.124
tm        26.853
tm -j     44.706

So it looks like there are two issues here: first, closures slow us down, and this happens even with both JITs disabled. Second, the tracing JIT slows us down (by a factor of 10 or more).
Adding blocking on |JM: Tune Trace JIT Heuristics|, since, as mentioned in the comment above, in the shell test the tracing JIT slows us down by a factor of 10.
Blocks: 580468
You ran your shell tests with heuristics disabled.

Here are the numbers on my machine:

~/test% foreach args ("" "-m" "-j" "-m -j" "-m -j -p")
foreach? echo "args: $args"
foreach? ~/mozilla/tracemonkey/obj-firefox-opt/dist/bin/js $args bullet_demo.js 
foreach? end
args: 
simulation time: 19.571 seconds
args: -m
simulation time: 1.487 seconds
args: -j
simulation time: 41.726 seconds
args: -m -j
simulation time: 8.931 seconds
args: -m -j -p
simulation time: 3.492 seconds

That last run is with the heuristics enabled.  So the heuristics certainly help over raw -m -j, though they don't quite get us down to where -m is on its own.

So it really sounds like the issues here are:

1)  Closures are slow in _JM_ (known issue; see comment 1).
2)  This particular testcase doesn't trace well, for whatever reason, but
    unrelated to the speed of closures.
3)  Heuristics don't get us down to quite the JM speed.

I suggest filing a separate bug for #3, and possibly a separate bug for #2.  #1 presumably depends on bug 517164.
No longer blocks: 580468
Depends on: ClosurePerf
Oh, and you don't have to patch things to change the script space quota; in the JS shell, I just did:

  stackQuota(100000000);

at the top of bullet_demo.js and it worked fine.
I'm also not quite sure why you needed the two separate files, btw...

In any case, a profile of the -m run shows that we spend 82% of our time under js_FindPropertyHelper (split 50-50 between self time in js::PropertyCache::fill and self time in js_LookupPropertyWithFlags), so it's not the closure access per se that's being slow; it's finding where the heck to look for the values.

In both functions, the vast majority of the time is spent (according to Shark) on lea instructions in a loop like so:

0x100105270 cmp rsi, qword ptr [rdx+16]  Loop start[1]      jsscope.h:833
0x100105274 jz 0x1001051a4                                  jsscope.h:833
0x10010527a lea rax, qword ptr [rdx+48]                     jsscope.h:832
0x10010527e mov rdx, qword ptr [rdx+48]                     jsscope.h:832
0x100105282 test rdx, rdx                                   jsscope.h:832
0x100105285 jnz 0x100105270              Loop end[1]        jsscope.h:832

In fact, it's the _same_ loop in both functions; it's inlined into both of them.  In my tree, jsscope.h lines 832 and 833 are:

832        for (spp = startp; js::Shape *shape = *spp; spp = &shape->parent) {
833            if (shape->id == id) {

in Shape::search.

So the point is that we're likely just hitting L2 misses all over or something.

In fill(), we hit this loop because of this check, I think:

74     if (!pobj->nativeContains(*shape)) {

None of which is sounding like closure access per se being slow.

So why are we ending up with all the resolve/fill calls (are we missing propcache or pics somewhere?  Too many distinct shapes?)?

And how deep is the chain in search?

And can we get rid of this code in fill() somehow?
So I instrumented that loop to print the number of iterations it goes through (the number of times the loop body is entered).  Here are the statistics for just the part after the testcase prints "simulating" (the second column is the iteration count; the first column is the number of times we saw that iteration count):

   4  3
   4  4
   4  5
  11  2
  11  6
  19  100
  29  136
  63  184
  64  230
  73  202
  94  3744
  99  182
 144  83
 166  524
 213  147
 323  80
22649 1

Why are those things like 3744 and 524 and whatnot not being hashed?
I filed bug 610370 on the lack of hashing there.
Depends on: 610370
Numbers on my machine with my proposed patch for bug 610370:

args: 
simulation time: 0.584 seconds
args: -m
simulation time: 0.64 seconds
args: -j
simulation time: 7.861 seconds
args: -m -j
simulation time: 3.589 seconds
args: -m -j -p
simulation time: 1.125 seconds

So getting better, though -m being no faster than interp is sadmaking.

v8 can't run the attached testcase, so I have no idea how we compare to that.

A new profile of the -m run shows that we're now spending:

 40% of our time in mjit-generated code
 20% under stubs::GetElem; calling js::PropertyTable::search from
     js_GetProperty is half of this!  The rest is GetElem and js_GetProperty
     self time.
 16% under stubs::SetElem; again PropertyTable::search is about half of this,
     though this time we end up calling js_LookupPropertyWithFlags on the way
     there(?).
  9% is the mjit compiler itself, from ic::Call.  I wonder why.... do we keep
     generating new Function objects it has to compile?
  4% under stubs::CallName

The rest is minor stuff.  So yes, now things look like possible closure access, if the GetElem and SetElem are on closures...  dvander, does that seem possible?
Oh, and issues 2 and 3 from comment 4 remain.  Please file bugs?
This bug should be meta and track concrete bugs that get patches with tests, as bz said. Best not to make this bug about one of N concrete problems/solutions.

GetElem and SetElem are for indexed expressions, but you cannot access upvars except by unqualified names. Something else is causing these ops. What?

/be
One more thing.  With the patch for bug 610370, the "FAST" -m time is 0.485 seconds here.  The SLOW one is 0.566 seconds.  So yeah, at that point most of the cost is not closure-related at all, though they do give some slowdown (perhaps in that mjit-generated code?).

I'd assume the GetElem is mostly the HEAP access and whatnot, but I'm not sure why we end up in js::PropertyTable::search, then...
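
For context, emscripten-compiled code does essentially all of its loads and stores as indexed expressions on HEAP, which is presumably where the GetElem/SetElem traffic comes from; a sketch of the pattern (hypothetical names, not the actual generated code, and it assumes a HEAP array already exists):

  function load(ptr) {
    return HEAP[ptr];       // indexed read  => GetElem
  }
  function store(ptr, value) {
    HEAP[ptr] = value;      // indexed write => SetElem
  }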
Fwiw, there's also a performance issue with the load() call when using the closure, but I think that might be bug 501908, given that setFunctionKinds is where all the time is.  Worth remeasuring once that's fixed.
Depends on: 501908
Alon, can you make an all-in-one .js file for the shell testcase? Thanks.

It looks like HEAP is just an array, which seems not to be a dense array. Even indexed property names end up as atoms identifying shapes that map slots in such an object. If so, why is HEAP not a dense array?

/be
(In reply to comment #14)
> Alon, can you make an all-in-one .js file for the shell testcase? Thanks.
> 

I'm trying to simplify the test case now, but making it into one file won't work without the patch for the stack quota, since calling stackQuota() as bz suggested must happen before the file is loaded. So for now we need two files: a driver file that just calls stackQuota() and load()s the actual benchmark, plus the benchmark itself.
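
So the driver file is just two lines, something like this (a sketch; the driver filename is whatever you pass to the shell):

  stackQuota(100000000);    // must run before the big script is loaded
  load("bullet_demo.js");   // then load the actual benchmark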
It's not dense at least because INDEX_TOO_SPARSE (capacity is 8, index is 1.5 million or so).  But even if I make INDEX_TOO_SPARSE always return false, it's not-dense because array_trace makes sure that too-sparse arrays become not-dense during gc.  And if I disable _that_ too, then my runtime drops to 0.28 seconds under -m (about 2x faster than with the slowarray HEAP).

That's for the SLOW case (with the closure).  The closure-less case goes to 0.195 seconds under -m.

Note that -m -j -p is still a good bit slower than -m here, and we do still need a bug on that...

Of course the dense array issue is moot if the testcase uses typed arrays, once JM implements fast paths for them.  ;)
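
For illustration, the kind of store that trips INDEX_TOO_SPARSE (my example, not from the testcase):

  var HEAP = [];         // fresh array, tiny initial capacity (8 slots)
  HEAP[1500000] = 0;     // index vastly exceeds capacity, so the engine
                         // deems the array too sparse and switches it to
                         // the slow, non-dense representation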
Should be fine to just have one file and then run:

  js -e "stackQuota(100000000)" -f filename

But in any case, the two-file thing is fine by me.

In case you care, btw, the time for v8 shell on that testcase on my machine is 0.481 seconds on the FAST version and 0.21 seconds on the SLOW version.  So they're actually _faster_ on the SLOW version than the FAST one.

So if the array were dense, we'd be more or less on par with them under -m...
Depends on: 610438
Depends on: 610440
Attached file single-file shell test
(In reply to comment #17)
> Should be fine to just have one file and then run:
> 
>   js -e "stackQuota(100000000)" -f filename
> 

Ah, I didn't know about that, nice. Attached is a single-file shell testcase, runnable with

  ./js [-j -m -p etc.] -e "stackQuota(100000000)" -f bullet_test.js

Regarding typed arrays, the test can use them. It's disabled right now (to make the comparison to v8 easier), but you can change

  var TYPED_ARRAYS = false;

to true, to use typed arrays. Might be worth benchmarking that later on.
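
Presumably the toggle selects between two HEAP representations along these lines (a sketch; the size and exact setup in bullet_test.js will differ):

  var TYPED_ARRAYS = false;
  var HEAP = TYPED_ARRAYS
    ? new Float64Array(16 * 1024 * 1024)   // fixed-size, always dense
    : new Array(16 * 1024 * 1024);         // plain array, can go sparse/slow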

Filed bug 610438 and bug 610440 for issues #2 and #3 mentioned before, which we said should get their own bugs.
Attachment #488816 - Attachment is obsolete: true
Depends on: 663138
Checked on the latest tracemonkey; results with js -m are now:

no closure  : 2.503
with closure: 3.328

So the closure one is 32% slower.
Blocks: ClosurePerf
No longer depends on: ClosurePerf
Some current measurements:

no closure

m-c -m -j -p      2.6
jm  -m -j -p      2.9
jm  -m -j -p -n   4.0
v8                3.1

closure

m-c -m -j -p      3.3
jm  -m -j -p      3.6
jm  -m -j -p -n   4.5
v8                3.7

So closures make us about 25% slower. Interestingly, this now seems to happen in v8 as well.

(I filed the slowness in jm -n as bug 678687)
It seems like this simple test case is either reproducing this problem in isolation, or a similar one:

http://jsperf.com/const-vs-var-in-closures

The FF results show dramatically impaired access performance for vars that are local to the closure, compared with globals. V8 doesn't reproduce this particular glitch, though, so maybe there's more going on?
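
The comparison on that jsperf page presumably boils down to something like this (my reconstruction, not the exact page contents):

  var g = 1;                          // global var: fast to read in FF
  var M = (function () {
    var c = 1;                        // closure-local var: much slower in FF
    return {
      readGlobal:  function () { return g; },
      readClosure: function () { return c; }
    };
  }());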
Opera seems to have a similar problem, but not as pronounced as Fx (at least judging by the graphs).

This needs to be fixed. Everybody is using closures today.

Anybody have any idea why var and const are so much slower in a closure?
> Anybody having any idea why var and const is so much slower in a closure?

Sure. If you read the bug, you will too.  ;)
I'm also experiencing this slowdown in a jsperf because I'm using closures.
The jsperf is this one: http://jsperf.com/oop-benchmark/79

The dejavu case (after optimization) uses a closure, and it slows things down by a great margin (~45% in FF 17 and ~75% in FF 19 Aurora!)
A question.  Are the closures involved in the real-life use cases (as opposed to synthetic benchmarks) run once, or run multiple times?
Typically the code inside the closure is only run once. But closures typically have inner functions that are exported to outside the closure. Those are executed several times, and they are slower. E.g.:

var MyObject = (function () {
  var test = 1;

  var MyObject = function () {};
  MyObject.prototype.foo = function () { return test; };

  return MyObject;
}());
At this time, calling new MyObject() and calling foo will be ~30% slower!
Depends on: 821361
(In reply to André Cruz from comment #26)
> Typically the code inside the closures are only run once. But closures
> tipically have functions inside that are exported to outside the closure.
> Those are executed several times and they are slower. E.g.:
> 
> var MyObject = (function () {
>   var test = 1;
>
>   var MyObject = function () {};
>   MyObject.prototype.foo = function () { return test; };
>
>   return MyObject;
> }());

This run-once closure pattern is something I've wanted to optimize for a long time.  The problem here is that we treat the outermost closure as if it can run many times, which causes all sorts of deoptimizations to functions within the closure.  If we recognized that the outer closure only runs once and that these deoptimizations are not necessary, performance for those inner functions would be much better.  I just filed bug 821361 for this case.
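
To make the distinction concrete, a sketch (mine, not from the bug) contrasting a closure that genuinely runs many times with a run-once module IIFE:

  // Runs many times: each call creates a fresh scope, so the engine
  // really does need one environment per call.
  function makeCounter() {
    var n = 0;
    return function () { return ++n; };
  }
  var c1 = makeCounter(), c2 = makeCounter();   // two independent scopes

  // Runs exactly once: the outer function can never execute again, so its
  // scope could in principle be optimized like a singleton/global scope.
  var Mod = (function () {
    var hidden = 1;
    return { get: function () { return hidden; } };
  }());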
This may (or may not) provide some interesting insights into V8 closure optimizations, just FYI: http://mrale.ph/blog/2012/09/23/grokking-v8-closures-for-fun.html
Blocks: 885526
The "slow" version is now ever so slightly faster than the "fast" one.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED