Last Comment Bug 541489 - Measure closure performance with microbenchmarks
: Measure closure performance with microbenchmarks
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: All All
: -- normal with 1 vote (vote)
: ---
Assigned To: David Mandelin [:dmandelin]
:
: Jason Orendorff [:jorendorff]
Mentors:
Depends on:
Blocks: ClosurePerf 541488
  Show dependency treegraph
 
Reported: 2010-01-22 13:36 PST by David Mandelin [:dmandelin]
Modified: 2010-01-25 13:54 PST (History)
4 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
JS closure microbenchmark suite and harness (4.62 KB, application/octet-stream)
2010-01-22 17:36 PST, David Mandelin [:dmandelin]
no flags Details

Description David Mandelin [:dmandelin] 2010-01-22 13:36:12 PST
We need to measure perf of various closure-related operations with microbenchmarks to see where we are relative to the competition and where we need to improve most.
Comment 1 David Mandelin [:dmandelin] 2010-01-22 14:53:56 PST
I reran the dhtml kitchen tests linked in bug 517164 (http://dhtmlkitchen.com/jstest/scope-chain-performance-iframe.html) on my MacBook Pro and got:

	               Fx3     Fx3.5  Fx3.6  Fx tip   Safari 4
GlobalAssign	       737	 412	357	395	 92
LocalAssign	        53 	  42	 38	 41	 55
ClosureAssign	       782	 834	730	774	 61
ClosureToGlobalAssign 1731	 420	354	401	 95
GlobalRead	       410	 401	374	400	 77
LocalRead        	97	  70	 71	 70	105
ClosureRead	       501	 562	531	525	 67
ClosureToGlobalRead   1437	 402	376	393	 95
ObjPropertyAssign      396	 404	360	395	 98
ObjCreate	      1251	1000	952	1004	558
Comment 2 David Mandelin [:dmandelin] 2010-01-22 17:06:49 PST
Some microbenchmark results (see forthcoming attachment for code, but the names should suggest what the tests are, with the understanding that 'nc' means 'null closure', 'fc' means 'flat closure', and 'xc' means general closure):

                               TM        jsc        v8

create_fc.js                1106.93    349.96     25.96
create_nc.js                 300.93    110.96    182.96
create_xc.js                 890.93    109.96      9.96
incr_xc.js                     4.02     16.64      8.76
incr_xc_inactive.js           53.46     15.53      8.99
incr_xc_loopinside.js        343.17     10.16      3.19
read_fc.js                    23.4      15.01     10.52
read_xc.js                     4.23     17.9      11.81
read_xc_eval.js                2.88     96.12     21.79
read_xc_inactive.js           33.09     15.42     12.23
read_xc_loopinside.js        201.5       6.04      3.25
write_xc.js                    6.6      23.57     13.99
write_xc_inactive.js          26.82     22.75     13.39
write_xc_loopinside.js       262.27      4.94      3.59
Comment 3 David Mandelin [:dmandelin] 2010-01-22 17:25:37 PST
Oops, I forgot to say in the previous comment that the numbers are microseconds per iteration, with the loop overhead from an empty loop subtracted out.

I should also note that in at least one case the kind of closure that was created was not what I wanted (read_xc_inactive actually uses a flat closure). I don't think that's too important, though.

Notable findings:

 - We are slow at creating closures. I am not sure why. I suspect that our
   allocator is slow, but that probably doesn't explain all of it.

 - For the tests with short names (read_xc, write_xc, incr_xc), we are really
   fast. This is because in these tests, the upvar is defined in the trace 
   entry frame, so we can access them 'through the tracker', i.e., the same
   way as for local variables. 

   One important issue with that fact is that if we were to remove the guard
   we emit at function calls on the callee parent identity, we would not be
   able to use the tracker in this way: that guard ensures that the closure
   we are evaluating is defined in the trace entry frame. We could partially
   mitigate with a fast path guarded by testing that the callee parent is the
   same as the active callee.

 - For the 'inactive' tests, we are 1.25x-6x slower than the competition. In
   these tests, the function defining the upvar has returned by the time we
   read the variable. In that case, the upvar must be set into a slot in 
   the call object. Thanks to bz we have a pretty efficient path for that now,
   but it still needs a shape guard and a box operation.

 - For the 'loopinside' tests, we are atrociously slow. The reason is that the
   upvar is defined off the trace (which is pretty typical of actual code that
   uses upvars, AFAICT), in which case we generate a call to a fairly complex
   builtin.

Conclusion: for high upvar access performance, we need to eliminate shape guards, boxing, and especially calls to builtins everywhere we can. The pressure to do those things is greater if we eliminate the callee parent identity guard.
Comment 4 David Mandelin [:dmandelin] 2010-01-22 17:32:00 PST
Oops^2. The numbers are actually nanoseconds per iteration.
Comment 5 David Mandelin [:dmandelin] 2010-01-22 17:36:22 PST
Created attachment 423107 [details]
JS closure microbenchmark suite and harness
Comment 6 David Mandelin [:dmandelin] 2010-01-22 17:43:55 PST
I tried Python versions of 3 read_xc tests for comparison. Units are the same as in comment 2: nanoseconds per iteration (with loop overhead subtracted out):

  read.py             285
  read_inactive.py    277
  read_loopinside.py   32

It's strange that the first two are so much slower. My guess would be that Python has a high function call cost, because the first two call a function inside the test loop, while the second does not. The loop overhead itself was measured as 204, so the accuracy on the loopinside number is probably not that great. My guess is that in these tests, everything is kind of slow, so upvars can be moderately slow and it's not very noticeable.

What I really want to do is try these 3 tests in Lua. I will do that on Monday but if anyone who actually knows Lua gets the urge, by all means go ahead.
Comment 7 David Mandelin [:dmandelin] 2010-01-25 12:43:57 PST
Lua, done the same way as Python:

  read.lua               126
  read_inactive.lua       97
  read_loopinside.lua     18

The results look pretty much the same as Python, except 2x faster.
Comment 8 David Mandelin [:dmandelin] 2010-01-25 13:22:33 PST
With ChezScheme:

  read.scheme               0
  read2.scheme             50   (sum up the read variables instead of storing)
  read_inactive.scheme     33
  read_loopinside.scheme    0

It appears that the compiler does something smart if we just try to read a variable that is defined in a still-active function. When that option is removed, it seems to do something that goes at about Python speed.
Comment 9 David Mandelin [:dmandelin] 2010-01-25 13:40:57 PST
Conclusions:

  - We need to make creating closures faster. SFX is about 2x our speed there,
    and I think we could get there by standard optimization of our existing
    closure creation code. v8 is 5x faster: that may be more difficult to reach
    and suggests we should study v8's closure creation code.

  - We need to make all 'normal' closure variable accesses (including reads and
    writes for all major kinds of closures at all nesting levels) not need any
    calls to builtins, guards, or boxing operations in our compiled systems. 
    This means we need to make the access paths dead simple--anything that 
    requires searching lists, multiple options, resolve hooks, or piles of
    wrappers is out.
    
  - For 'local' (to the same trace) on-trace access of closures, we are the
    fastest. We should try to preserve this advantage, and achieve it more
    often. Broadly, I think the trick is to identify (heuristically) which
    closure accesses can be done this way and generate appropriate guards.

  - v8 shows the best performance for pretty much everything (other than us
    for the same-trace case), and jsc and Lua are tied for second. We should
    study all three to learn their techniques and see if they are applicable
    to JM and TM.
Comment 10 David Mandelin [:dmandelin] 2010-01-25 13:54:49 PST
I reran the Lua benchmarks using LuaJIT. The results were kind of confusing: they showed an execution time of 0.5ns/iter for the empty loop, and also for all of the read benchmarks. The compiler is probably throwing away some unused values, but a time of 0.5ns/iter is hard to believe even for an empty loop. (TM is 3ns/iter for an empty loop.) When I modified the benchmark to do an addition with the read closure value and print it out at the end, the time went up to 1.4ns/iter. There is probably some extra compiler magic going on, so I can't say what this means yet, but we should definitely look at what they are doing.

Note You need to log in before you can comment on or make changes to this bug.