We need to measure perf of various closure-related operations with microbenchmarks to see where we are relative to the competition and where we need to improve most.
I reran the dhtml kitchen tests linked in bug 517164 (http://dhtmlkitchen.com/jstest/scope-chain-performance-iframe.html) on my MacBook Pro and got: Fx3 Fx3.5 Fx3.6 Fx tip Safari 4 GlobalAssign 737 412 357 395 92 LocalAssign 53 42 38 41 55 ClosureAssign 782 834 730 774 61 ClosureToGlobalAssign 1731 420 354 401 95 GlobalRead 410 401 374 400 77 LocalRead 97 70 71 70 105 ClosureRead 501 562 531 525 67 ClosureToGlobalRead 1437 402 376 393 95 ObjPropertyAssign 396 404 360 395 98 ObjCreate 1251 1000 952 1004 558
Some microbenchmark results (see forthcoming attachment for code, but the names should suggest what the tests are, with the understanding that 'nc' means 'null closure', 'fc' means 'flat closure', and 'xc' means general closure): TM jsc v8 create_fc.js 1106.93 349.96 25.96 create_nc.js 300.93 110.96 182.96 create_xc.js 890.93 109.96 9.96 incr_xc.js 4.02 16.64 8.76 incr_xc_inactive.js 53.46 15.53 8.99 incr_xc_loopinside.js 343.17 10.16 3.19 read_fc.js 23.4 15.01 10.52 read_xc.js 4.23 17.9 11.81 read_xc_eval.js 2.88 96.12 21.79 read_xc_inactive.js 33.09 15.42 12.23 read_xc_loopinside.js 201.5 6.04 3.25 write_xc.js 6.6 23.57 13.99 write_xc_inactive.js 26.82 22.75 13.39 write_xc_loopinside.js 262.27 4.94 3.59
Oops, I forgot to say in the previous comment that the numbers are microseconds per iteration, with the loop overhead from an empty loop subtracted out. I should also note that in at least one case the kind of closure that was created was not what I wanted (read_xc_inactive actually uses a flat closure). I don't think that's too important, though. Notable findings: - We are slow at creating closures. I am not sure why. I suspect that our allocator is slow, but that probably doesn't explain all of it. - For the tests with short names (read_xc, write_xc, incr_xc), we are really fast. This is because in these tests, the upvar is defined in the trace entry frame, so we can access them 'through the tracker', i.e., the same way as for local variables. One important issue with that fact is that if we were to remove the guard we emit at function calls on the callee parent identity, we would not be able to use the tracker in this way: that guard ensures that the closure we are evaluating is defined in the trace entry frame. We could partially mitigate with a fast path guarded by testing that the callee parent is the same as the active callee. - For the 'inactive' tests, we are 1.25x-6x slower than the competition. In these tests, the function defining the upvar has returned by the time we read the variable. In that case, the upvar must be set into a slot in the call object. Thanks to bz we have a pretty efficient path for that now, but it still needs a shape guard and a box operation. - For the 'loopinside' tests, we are atrociously slow. The reason is that the upvar is defined off the trace (which is pretty typical of actual code that uses upvars, AFAICT), in which case we generate a call to a fairly complex builtin. Conclusion: for high upvar access performance, we need to eliminate shape guards, boxing, and especially calls to builtins everywhere we can. The pressure to do those things is greater if we eliminate the callee parent identity guard.
Oops^2. The numbers are actually nanoseconds per iteration.
I tried Python versions of 3 read_xc tests for comparison. Units are the same as in comment 2: nanoseconds per iteration (with loop overhead subtracted out): read.py 285 read_inactive.py 277 read_loopinside.py 32 It's strange that the first two are so much slower. My guess would be that Python has a high function call cost, because the first two call a function inside the test loop, while the second does not. The loop overhead itself was measured as 204, so the accuracy on the loopinside number is probably not that great. My guess is that in these tests, everything is kind of slow, so upvars can be moderately slow and it's not very noticeable. What I really want to do is try these 3 tests in Lua. I will do that on Monday but if anyone who actually knows Lua gets the urge, by all means go ahead.
Lua, done the same way as Python: read.lua 126 read_inactive.lua 97 read_loopinside.lua 18 The results look pretty much the same as Python, except 2x faster.
With ChezScheme: read.scheme 0 read2.scheme 50 (sum up the read variables instead of storing) read_inactive.scheme 33 read_loopinside.scheme 0 It appears that the compiler does something smart if we just try to read a variable that is defined in a still-active function. When that option is removed, it seems to do something that goes at about Python speed.
Conclusions: - We need to make creating closures faster. SFX is about 2x our speed there, and I think we could get there by standard optimization of our existing closure creation code. v8 is 5x faster: that may be more difficult to reach and suggests we should study v8's closure creation code. - We need to make all 'normal' closure variable accesses (including reads and writes for all major kinds of closures at all nesting levels) not need any calls to builtins, guards, or boxing operations in our compiled systems. This means we need to make the access paths dead simple--anything that requires searching lists, multiple options, resolve hooks, or piles of wrappers is out. - For 'local' (to the same trace) on-trace access of closures, we are the fastest. We should try to preserve this advantage, and achieve it more often. Broadly, I think the trick is to identify (heuristically) which closure accesses can be done this way and generate appropriate guards. - v8 shows the best performance for pretty much everything (other than us for the same-trace case), and jsc and Lua are tied for second. We should study all three to learn their techniques and see if they are applicable to JM and TM.
I reran the Lua benchmarks using LuaJIT. The results were kind of confusing: they showed an execution time of 0.5ns/iter for the empty loop, and also for all of the read benchmarks. The compiler is probably throwing away some unused values, but a time of 0.5ns/iter is hard to believe even for an empty loop. (TM is 3ns/iter for an empty loop.) When I modified the benchmark to do an addition with the read closure value and print it out at the end, the time went up to 1.4ns/iter. There is probably some extra compiler magic going on, so I can't say what this means yet, but we should definitely look at what they are doing.