We need to measure perf of various closure-related operations with microbenchmarks to see where we are relative to the competition and where we need to improve most.
I reran the dhtml kitchen tests linked in bug 517164 (http://dhtmlkitchen.com/jstest/scope-chain-performance-iframe.html) on my MacBook Pro and got:
                         Fx3  Fx3.5  Fx3.6  Fx tip  Safari 4
GlobalAssign             737    412    357     395        92
LocalAssign               53     42     38      41        55
ClosureAssign            782    834    730     774        61
ClosureToGlobalAssign   1731    420    354     401        95
GlobalRead               410    401    374     400        77
LocalRead                 97     70     71      70       105
ClosureRead              501    562    531     525        67
ClosureToGlobalRead     1437    402    376     393        95
ObjPropertyAssign        396    404    360     395        98
ObjCreate               1251   1000    952    1004       558
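The test names are fairly self-describing; as a rough guide, here is my guess
at the shape of the first three tests (hedged reconstructions -- the actual
code is at the URL above, and the function names here are mine):

  var g = 0;

  // GlobalAssign: assign to a global variable inside the timed loop.
  function globalAssign(N) {
    for (var i = 0; i < N; i++)
      g = i;
  }

  // LocalAssign: assign to a local variable inside the timed loop.
  function localAssign(N) {
    var l = 0;
    for (var i = 0; i < N; i++)
      l = i;
  }

  // ClosureAssign: assign to an upvar from inside a nested function.
  function closureAssign(N) {
    var c = 0;
    (function () {
      for (var i = 0; i < N; i++)
        c = i;
    })();
  }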
Some microbenchmark results (see forthcoming attachment for code, but the names should suggest what the tests are, with the understanding that 'nc' means 'null closure', 'fc' means 'flat closure', and 'xc' means 'general closure'; the three kinds are sketched after the table):
                             TM     jsc      v8
create_fc.js            1106.93  349.96   25.96
create_nc.js             300.93  110.96  182.96
create_xc.js             890.93  109.96    9.96
incr_xc.js                 4.02   16.64    8.76
incr_xc_inactive.js       53.46   15.53    8.99
incr_xc_loopinside.js    343.17   10.16    3.19
read_fc.js                23.4    15.01   10.52
read_xc.js                 4.23   17.9    11.81
read_xc_eval.js            2.88   96.12   21.79
read_xc_inactive.js       33.09   15.42   12.23
read_xc_loopinside.js    201.5     6.04    3.25
write_xc.js                6.6    23.57   13.99
write_xc_inactive.js      26.82   22.75   13.39
write_xc_loopinside.js   262.27    4.94    3.59
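For anyone not steeped in the closure terminology, here is a rough
illustration of the three kinds (this is my sketch, not the benchmark code,
which is in the forthcoming attachment):

  // 'nc' (null closure): the inner function uses no upvars, so no enclosing
  // scope needs to be captured at creation time.
  function make_nc() {
    return function () { return 42; };
  }

  // 'fc' (flat closure): the inner function only reads upvars whose values
  // are fixed by creation time, so the values can be copied into the
  // closure when it is created.
  function make_fc() {
    var x = 1;   // never reassigned after this point
    return function () { return x; };
  }

  // 'xc' (general closure): the upvar is mutated, so the closure must share
  // the live variable with its enclosing scope.
  function make_xc() {
    var x = 0;
    return function () { return ++x; };
  }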
Oops, I forgot to say in the previous comment that the numbers are microseconds per iteration, with the loop overhead from an empty loop subtracted out.
I should also note that in at least one case the kind of closure that was created was not what I wanted (read_xc_inactive actually uses a flat closure). I don't think that's too important, though.
- We are slow at creating closures. I am not sure why. I suspect that our
  allocator is slow, but that probably doesn't explain all of it.
- For the tests with short names (read_xc, write_xc, incr_xc), we are really
  fast. This is because in these tests, the upvar is defined in the trace
  entry frame, so we can access it 'through the tracker', i.e., the same way
  as for local variables.
  One important consequence is that if we were to remove the guard we emit at
  function calls on the callee parent identity, we would not be able to use
  the tracker in this way: that guard ensures that the closure we are
  evaluating is defined in the trace entry frame. We could partially mitigate
  that with a fast path guarded by a test that the callee parent is the same
  as the active callee.
- For the 'inactive' tests, we are 1.25x-6x slower than the competition. In
  these tests, the function defining the upvar has returned by the time we
  read the variable. In that case, the upvar must be set into a slot in the
  call object. Thanks to bz we have a pretty efficient path for that now,
  but it still needs a shape guard and a box operation.
- For the 'loopinside' tests, we are atrociously slow. The reason is that the
  upvar is defined off the trace (which is pretty typical of actual code that
  uses upvars, AFAICT), in which case we generate a call to a fairly complex
  builtin for every access. (Sketches of the three access shapes appear after
  the conclusion below.)
Conclusion: for high upvar access performance, we need to eliminate shape guards, boxing, and especially calls to builtins everywhere we can. The pressure to do those things is greater if we eliminate the callee parent identity guard.
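To make the three access cases concrete, here are my reconstructions of the
read benchmark shapes (hedged guesses; the actual code is in the forthcoming
attachment):

  // read_xc: the upvar is defined in the trace entry frame. The loop where
  // the trace is entered and the variable live in the same function, so on
  // trace x can be accessed through the tracker, like a local.
  function read_xc(N) {
    var x = 1, t = 0;
    function get() { return x; }
    for (var i = 0; i < N; i++)
      t += get();
    return t;
  }

  // read_xc_inactive: the defining function has returned by the time we
  // read, so x lives in a call object slot and each read needs a shape
  // guard and an unboxing operation. (As noted above, in the actual test
  // this one ended up as a flat closure.)
  function makeReader() {
    var x = 1;
    return function () { return x; };
  }
  function read_xc_inactive(N) {
    var get = makeReader(), t = 0;
    for (var i = 0; i < N; i++)
      t += get();
    return t;
  }

  // read_xc_loopinside: the loop runs inside the closure, so the frame
  // defining x is off the trace and each access turns into a builtin call.
  function read_xc_loopinside(N) {
    var x = 1;
    return (function () {
      var t = 0;
      for (var i = 0; i < N; i++)
        t += x;
      return t;
    })();
  }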
Oops^2. The numbers are actually nanoseconds per iteration.
Created attachment 423107: JS closure microbenchmark suite and harness
I tried Python versions of 3 read_xc tests for comparison. Units are the same as in comment 2: nanoseconds per iteration (with loop overhead subtracted out):
It's strange that the first two are so much slower. My guess would be that Python has a high function call cost: the first two call a function inside the test loop, while the third does not. The loop overhead itself was measured as 204 ns, so the accuracy on the loopinside number is probably not that great. My guess is that in these tests, everything is kind of slow, so upvars can be moderately slow without it being very noticeable.
What I really want to do is try these 3 tests in Lua. I will do that on Monday but if anyone who actually knows Lua gets the urge, by all means go ahead.
Lua, done the same way as Python:
The results look pretty much the same as Python, except 2x faster.
read2.scheme 50 (sum up the read variables instead of storing)
It appears that the compiler does something smart if we just try to read a variable that is defined in a still-active function. When that option is taken away (by summing the read values instead of just storing them), it seems to go at about Python speed.
- We need to make creating closures faster. SFX is about 2x our speed there,
  and I think we could get there by standard optimization of our existing
  closure creation code. v8 is 5x faster: that may be more difficult to
  reach, and it suggests we should study v8's closure creation code.
- We need to make all 'normal' closure variable accesses (including reads and
  writes, for all major kinds of closures, at all nesting levels) require no
  calls to builtins, guards, or boxing operations in our compiled systems.
  This means we need to make the access paths dead simple--anything that
  requires searching lists, multiple options, resolve hooks, or piles of
  wrappers is out.
- For 'local' (same-trace) on-trace access to closure variables, we are the
  fastest. We should try to preserve this advantage and achieve it more
  often. Broadly, I think the trick is to identify (heuristically) which
  closure accesses can be done this way and generate appropriate guards.
- v8 shows the best performance for pretty much everything (other than us in
  the same-trace case), and jsc and Lua are tied for second. We should study
  all three to learn their techniques and see whether they are applicable to
  JM and TM.
I reran the Lua benchmarks using LuaJIT. The results were kind of confusing: they showed an execution time of 0.5ns/iter for the empty loop, and also for all of the read benchmarks. The compiler is probably throwing away some unused values, but a time of 0.5ns/iter is hard to believe even for an empty loop. (TM is 3ns/iter for an empty loop.) When I modified the benchmark to do an addition with the read closure value and print it out at the end, the time went up to 1.4ns/iter. There is probably some extra compiler magic going on, so I can't say what this means yet, but we should definitely look at what they are doing.
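For reference, the pattern I used to defeat the dead-code elimination,
rendered in JS (a sketch of the general idea, not the exact benchmark code):

  var x = 1;
  function get() { return x; }

  var sum = 0;
  for (var i = 0; i < 1000000; i++)
    sum += get();   // use the read value so the read can't be dropped
  print(sum);       // observe the result so the loop itself stays live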