Open Bug 460050 Opened 14 years ago Updated 5 years ago

go game benchmark is ~1.5x as slow as v8

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

x86
All
defect

Tracking

()

People

(Reporter: blizzard, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(8 files, 3 obsolete files)

As reported here:

http://www.apejet.org/aaron/blog/2008/10/15/benchmarking-firefox-31-beta-1/

Firefox 3.1b1 jit disabled (Mac) 	1108
Firefox 3.1b1 jit enabled (Mac) 	2158
Firefox 3.1b1 jit disabled (Win) 	642
Firefox 3.1b1 jit enabled (Win) 	1444

Looks like at least in this benchmark turning tracemonkey on hurts performance quite a bit.  The benchmark is listed in this post:

http://www.apejet.org/aaron/blog/2008/10/04/browser-benchmarking-using-go-game/

John, would you be willing to look at this and see what kind of structure the code takes?  Might be a recursion issue or something else that throws tracing for a loop.
OS: Mac OS X → All
Christopher asked that I try to create a small test case.  I've tried to do just that by focusing on one bit of my code that I know gets called a lot, but that doesn't use recursion (since that's a known issue).

I tested the attached file on my Mac Mini using FF3.1b1 both with and without TraceMonkey.  It's about 1.4 times slower with TraceMonkey in my tests.  The 2x slowdown in the game code may be a combination of multiple problems.
Thanks, Aaron.  Hopefully this will help us track down what's going on here.
Flags: wanted1.9.1?
Flags: blocking1.9.1+
I show a ~20% loss now (min 119 vs min 140); improved, but still not what we want.  Do we trace String as function (rather than constructor)?
We never get that far.  Here's what a verbose log has to say:

 fgrep -i abort ~/log.txt | sort | uniq -c | sort -nr
  11 abort: 4147: fp->scopeChain is not global or active call object
  11 Abort recording (line 62, pc 11): JSOP_NAME.
   5 Abort recording (line 62, pc 8): No compatible inner tree.

Those aborts on fp->scopeChain are the same as the JSOP_NAME aborts.  I'll attach the testcase in a second, but line 62 is the fun(arr[i]) call in |function eachl|.  Overall stats for one call to doTest (the testcase does 10 calls, for what it's worth):

recorder: started(18), aborted(16), completed(3), different header(0), trees trashed(0), slot promoted(0), unstable loop variable(0), breaks(0), returns(1), unstableInnerCalls(5)
monitor: triggered(99999), exits(99999), type mismatch(0), global mismatch(4)

Looks like the bits we jit are too small (probably due to the heavyweight functions) and we have too many exits as a result.  :(

One weird thing is that in the testcase coming up the only heavyweight function seems to be doTest() (at least according to dis()).  Not sure why that only bytes us when we go to call fun().
Depends on: upvar2
Someone try this patch to see if it helps:

@@ -7378,17 +7396,18 @@ TraceRecorder::record_JSOP_ANONFUNOBJ()
 TraceRecorder::record_JSOP_ANONFUNOBJ()
 {
     JSFunction* fun;
     JSFrameRegs& regs = *cx->fp->regs;
     JSScript* script = cx->fp->script;
     LOAD_FUNCTION(0); // needs script, regs, fun
 
     JSObject* obj = FUN_OBJECT(fun);
-    if (OBJ_GET_PARENT(cx, obj) != cx->fp->scopeChain)
+    JSObject* parent = OBJ_GET_PARENT(cx, obj);
+    if (parent && parent != cx->fp->scopeChain)
         ABORT_TRACE("can't trace with activation object on scopeChain");
 
     stack(0, INS_CONSTPTR(obj));
     return true;
 }
 
 bool
 TraceRecorder::record_JSOP_NAMEDFUNOBJ()

/be
Uh...  not at all.  We don't hit that abort in the testcase, so I don't see how it _could_ help.
bz: duh, right -- I did want to call that abort to your attention, in case you've seen it in other benchmark spew. I trigger it with home-brew testcases; I'll file shortly.

/be
What bug was filed for the problems referenced in comment 8?
This is an even simpler test case that slows down in the JIT. (Note that the original Go benchmark may have more problematic features than this one, but I'm sure it at least does have this one problematic feature.) The problem is accessing the variable in the enclosing lexical scope via a closure, which causes an abort in TM.

I just listed to a bewildering discussion about what to do about this. I can't explain it in detail, but the idea was something like detecting conditions under which standard optimizations (conversion to argument, or display structures) can be applied together with tracing.
Right.  That testcase is what bug 452498 is all about, right?
Priority: -- → P2
Assignee: general → dmandelin
It looks we're breaking even on this now. dmandelin, please close this bug if your results match mine.
Do we have a way to add this bug's benchmark to our ever-growing perf test setup?

/be
I still get a 1.5x slowdown turning on the JIT for the the shell test case given here. It appears to be due to the 2 places where we abort: the upvar reference on line 65, and the call to 'String(s)' on line 70.
Blocks: 467263
Bug 482349 filed on String(v) conversion calls.

/be
(In reply to comment #12)
> It looks we're breaking even on this now. dmandelin, please close this bug if
> your results match mine.

I'm still getting a slowdown with bz's second attchment, but not the third one from dmandelin.
I just tried again with Andreas's patch for the String part. I no longer get those aborts, but total run time is still about 1.5x. 

Something complicated is going on: I printed out the individual times, and I saw that the first run is fast, and then the rest are slow. I don't know why yet.
The first call runs in about 17ms. The second takes about 43, and goes through 3 big record/compile cycles and then falls off trace all the time. The first call goes through just 1 big record/compile cycle and then does everything on trace. That's kind of weird given the aborts.
Currently I'm seeing a 50-100% slowdown on the various browser and shell tests here. In the case of the shell test, it's all upvar, so I'm going to wait on that.
The shell test cases are now fixed and run roughly 2x faster with tracing. (Exception: my test that runs fewer iterations is about the same speed, presumably due to tracing startup costs.)

But the browser test doesn't seem to be any better yet.
Still not sure what the problem is in the browser, but I should have that soon. While trying to figure that out, I discovered that the shell tests are unrolling one of the loops, apparently due to bug 485959.
Depends on: 485959
Changeset 9ad17d099fdd, which fixed bug 485959, undid the recent improvement in the shell cases, so they are now slow again.

The current problem is that the upvar access can't be traced because of an 'upvar out of reach' abort: The upvar is defined in an outer loop that contains the trace where it is accessed. I need more time to think about potential solutions.
Short take first: For now, accessing a variable in an enclosing lexical scope (other than the global scope) inside a loop is very bad for performance. Refactoring programs to avoid this construct will help. See below for an example. I tried it for the shell case and got a 2.2x speedup for jit vs. non-jit. 

At first I was hoping that we could make this fast by converting the upvar access in eachI to a parameter pass, or in SpiderMonkey terms, use the "flat closure" optimization. I.e., transform this:

  function f() {
    var a = [];
    function eachI(fun) {
      for (var i = 0; i < a.length; ++i) fun(a[i]);
    }
    ...
    eachI(p);
  }

to this:

  function f() {
    var a = [];
    function eachI(fun, a1) {
      for (var i = 0; i < a1.length; ++i) fun(a1[i]);
    }
    ...
    eachI(p, a);
  }

But SM doesn't allow that, partly because it thinks upvars are preferable to null closures, but more fundamentally, because function definitions are hoisted, i.e., the function definition is executed at the top of the enclosing function. I'm not sure why hoisting makes the flat closure optimization invalid, but apparently it does. Anyway, looking at that stuff made me think it was complicated enough to not be the right thing to do.

So, a better solution seems to be to teach the tracer to access upvars that were not defined on the current trace. The main problem is ensuring that any trace that calls the current trace or branches to it writes out any updates it made to those variables. Alternatively, we could just ensure the trace doesn't get called if those variables have been updated.
The interesting question here is what level of specialization is right. We can for example assert that we are being called with a sufficiently deep native frame (rp > rpbase + depth), in which case we know where the slot is we are looking for, and we know statically the type. We can compile this case for nested loops. However, we usually compile the nested loop first, in which case rp == rpbase and we have to look at the VM stack via the cx->fp->down chain. We could also try a hybrid approach. Generate conditional code that walks cx->fp->down if rp == rpbase. We also have to be careful since cx->fp is not updated. So if rp == rpbase + 1, then cx->fp is already "one down" since we have one native frame on the call stack. We would hence need a series of branches. The fast path has one comparison (rp - rpbase) == n, and the slow path a series of conditions that walk down cx->fp to the frame we want (minus any existing native frames), and then boxes the value. We can probably not safely unbox inside a conditional branch, so even the slow path would have to box the value so we can unbox after the two paths merge again, which of course blows. Here again the question is how much specialization we want. We could forgo the unboxing and just check and MISMATCH_EXIT if its not what we expected. How many type unstable upvars do we have? Thats the million dollar question.
(In reply to comment #23)
> Short take first: For now, accessing a variable in an enclosing lexical scope
> (other than the global scope) inside a loop is very bad for performance.
> Refactoring programs to avoid this construct will help. See below for an
> example. I tried it for the shell case and got a 2.2x speedup for jit vs.
> non-jit.

We should try to fix this, people won't rewrite their code, should not have to anyway, and we'll be uncompetitive at closure-heavy code.

I think we are close to being able to fix. Let's see...

> At first I was hoping that we could make this fast by converting the upvar
> access in eachI to a parameter pass, or in SpiderMonkey terms, use the "flat
> closure" optimization. I.e., transform this:
> 
>   function f() {
>     var a = [];
>     function eachI(fun) {
>       for (var i = 0; i < a.length; ++i) fun(a[i]);
>     }
>     ...
>     eachI(p);
>   }
> 
> to this:
> 
>   function f() {
>     var a = [];
>     function eachI(fun, a1) {
>       for (var i = 0; i < a1.length; ++i) fun(a1[i]);
>     }
>     ...
>     eachI(p, a);
>   }
> 
> But SM doesn't allow that, partly because it thinks upvars are preferable to
> null closures, but more fundamentally, because function definitions are
> hoisted, i.e., the function definition is executed at the top of the enclosing
> function. I'm not sure why hoisting makes the flat closure optimization
> invalid, but apparently it does. Anyway, looking at that stuff made me think it
> was complicated enough to not be the right thing to do.

Hoisting means the closure for eachI will be formed at the very top of f, before a has been initialized, and even before (conceptually, or per ECMA-262) it has been bound (a is hoisted too, but vars come after functions and do not recreate existing bindings). In practical implementations a would have a local var slot, so be "bound", but if eachI were a flat closure it would capture the default-initialized undefined value stored there.

If you make eachI a lambda expression stored in a local var, all is better. Again not a reasonable rewrite IMHO, and not one we should seem to evangelize.

We could try deferring the point at which eachI's closure is formed to its point of declaration. I considered this when doing the upvar2 work, but it's tricky in general. However in examples such as this one, it could be done.

> So, a better solution seems to be to teach the tracer to access upvars that
> were not defined on the current trace.

This is true in general, yes. We need this capability to compete.

> The main problem is ensuring that any
> trace that calls the current trace or branches to it writes out any updates it
> made to those variables. Alternatively, we could just ensure the trace doesn't
> get called if those variables have been updated.

Or as you pointed out we could recompile the inner loop once the outer has been recorded. Or maybe that's the same thing as what you just wrote.

(In reply to comment #24)
> The interesting question here is what level of specialization is right. We can
> for example assert that we are being called with a sufficiently deep native
> frame (rp > rpbase + depth), in which case we know where the slot is we are
> looking for, and we know statically the type. We can compile this case for
> nested loops. However, we usually compile the nested loop first, in which case
> rp == rpbase and we have to look at the VM stack via the cx->fp->down chain.

This makes the same mistake I keep making, of confusing static and dynamic linkage. cx->fp->down (should rename it link, to avoid mixing "up" and "down" metaphorical directions re: "upvar") is the dynamic link, not the static link. You can have a longer and unrelated dynamic call chain than static scope chain:

function f(a) {
  function g() { return h(); }
  function h() { return i(); }
  function i() { return a; }
  return i();
}

Ignoring tail call optimization, we have g, h, and i all defined at static level 1, with their bodies at level 2. The upvar a in i is at level 1 (f's body). But the dynamic link from i's activation to h's to g's has to be skipped to reach f's activation and find a. We use the old Algol optimization of the static link into an array, cx->display, indexed by level and maintained on entry and exit.

Anyway, this doesn't alter your point, but I wanted to repeat it since it is easy to confuse static and dynamic chains, not that doing so creates dynamic scoping (horrors) -- the bug is instead that a is not found, or if it is bound to a slot, the wrong stack slot (possibly no valid slot) is accessed.

> We
> could also try a hybrid approach. Generate conditional code that walks
> cx->fp->down if rp == rpbase. We also have to be careful since cx->fp is not
> updated. So if rp == rpbase + 1, then cx->fp is already "one down" since we
> have one native frame on the call stack. We would hence need a series of
> branches. The fast path has one comparison (rp - rpbase) == n, and the slow
> path a series of conditions that walk down cx->fp to the frame we want (minus
> any existing native frames), and then boxes the value. We can probably not
> safely unbox inside a conditional branch, so even the slow path would have to
> box the value so we can unbox after the two paths merge again, which of course
> blows. Here again the question is how much specialization we want. We could
> forgo the unboxing and just check and MISMATCH_EXIT if its not what we
> expected. How many type unstable upvars do we have? Thats the million dollar
> question.

Very few, IMHO. As good or better type stability as any other var. Indeed for these upvars whose initialization seems to dominate their use in closures, with no other assignments, there is only one value, not just only one type.

/be
Based on the last 2 comments and some more discussion with Brendan, here is a proposal for tracing out-of-reach upvars.

Purpose: To trace code where a loop (trace) refers to upvars that are defined in a stack frame that is not part of the trace. This is the exact situation in the bug.

This doesn't work now because the implementation can use values on a trace only if they are known to the tracker. Variables defined in a stack frame not part of the trace are not known to the tracker, so they cannot be used on trace.

Solution: When we start tracing, we will scan the upvars that are accessed by the active stack frame (there is a list) and import all of them. In this way, they will be known to the tracker, so it should be fairly easy to modify record_JSOP_GETUPVAR to use them, simply by referring to the existing value. If the values are never used on the current trace, they will be dead and optimized away.

The other part of the solution is that we have to box and save back the imported upvars to the interpreter copy. Presumably this can be patterned after the existing code that does this for globals. I also believe the existing mechanisms ensure that if the values are not written on the trace, code to copy them back is not generated.

This design preserves the current rule that traces are entered and exited from with the interpreter state fully up to date. This should give good results, especially for typical read-only upvar access.

Special considerations: The trace (starting with function "f" active) might call into a different function ("g") that also accesses upvars. If g accesses a subset of the upvars that f accesses, then g can use the same tracker entries and everything will be fine. If g accesses other upvars, then we have the original problem of untracked upvars again. For now, I propose not worrying about it. In the future, we could either import new upvars on the fly, or abort and leave a note about the extra upvars needed, and import those the next time we try to compile from the same start point.
I recently abstracted the helper that, given a context, a static level for the code using the upvar, and an upvar cookie (see jsscript.h), gives you the current value of the upvar. From jsinterp.h:

/*
 * Given an active context, a static scope level, and an upvar cookie, return
 * the value of the upvar.
 */
extern jsval
js_GetUpvar(JSContext *cx, uintN level, uintN cookie);

It would be easy to have this return jsval& instead of jsval.

/be
#26: we don't really have to import ahead of time. We already import lazily for globals. Look at lazilyImportGlobalSlot.
Depends on: 458838
As of this minute, on TM we're getting a 1.5x-2x speedup on the test cases filed here. Can we get a retest of the original benchmark?
I think we still have a slowdown on the original benchmark, with a current t-m build (which includes the fix for bug 458838).  It's hard to say for sure because the results are so noisy (vary by a factor of 2 without jit enabled; if nothing else moves with more pieces on the board are slower, so longer games give higher averages), but when the game just starts, a build with jit disabled is visually clearly faster than one with jit enabled.  As comment 1 says, there might well be multiple issues with the original testcase, and we should keep filing bugs blocking this one as we find them.

Looking into whether there are remaining aborts now.
Well, there are some aborts around (fp->scopechain, getters, etc, almost all from jquery).  But I doubt those are what causes the slowness.  I ran the go game in auto mode in a debug build with TRACEMONKEY=stats for about 20 turns (that took a good long while), then quit.  jitstats:

recorder: started(80422), aborted(240), completed(285667), different header(1), trees trashed(0), slot promoted(0), unstable loop variable(58412), breaks(1), returns(0), unstableInnerCalls(1)
monitor: triggered(2342521), exits(2342503), type mismatch(0), global mismatch(5211)

So we're bouncing on and off trace like mad (see triggered and exits); I have no idea why |completed| >> |started|, but I assume there's a good explanation.  In any case, the unstable loop variable number is also pretty high here...

Is there any sort of nice way to log why we're side-exiting?
OK, I made the printf on side exit not be debug_only_v and got a nice 35MB log in 4 moves.  If I take the output file log.txt and run it through:

grep "leaving trace" ~/log.txt | grep apejet | sed 's/, cycles=[0-9]*//' | sed 's/, lr=0x[0-9a-f]*//' | sort | uniq -c | sort -n -r | head

to filter out the "random" parts of the output and get an idea of what the most important side exits are, I get:

117143 leaving trace at http://www.apejet.org/sites/games/go2/board.js:388@71, op=getprop, exitType=0, sp=1, calldepth=0
34029 leaving trace at http://www.apejet.org/sites/games/go2/board.js:281@188, op=getprop, exitType=0, sp=3, calldepth=0
28866 leaving trace at http://www.apejet.org/sites/games/go2/board.js:572@58, op=getprop, exitType=0, sp=1, calldepth=0
27066 leaving trace at http://www.apejet.org/sites/games/go2/board.js:631@45, op=ifne, exitType=3, sp=3, calldepth=0
23420 leaving trace at http://www.apejet.org/sites/games/go2/board.js:388@74, op=getargprop, exitType=0, sp=1, calldepth=0
13966 leaving trace at http://www.apejet.org/sites/games/go2/board.js:387@110, op=lt, exitType=3, sp=2, calldepth=0
5824 leaving trace at http://www.apejet.org/sites/games/go2/board.js:282@222, op=nextiter, exitType=0, sp=2, calldepth=0
1140 leaving trace at http://www.apejet.org/sites/games/go2/board.js:631@41, op=getelem, exitType=0, sp=5, calldepth=0
 856 leaving trace at http://www.apejet.org/sites/games/go2/board.js:12@91, op=lt, exitType=3, sp=2, calldepth=0
 754 leaving trace at http://www.apejet.org/sites/games/go2/board.js:282@223, op=ifne, exitType=3, sp=3, calldepth=0

Line 388 of the file in question is:

  if(touching[j].chainId != i.chainId) {

Line 281 is:

  if(touching[t].stone == 1) touchingBlack = true;

Line 572 is:

  if(touching[j].eye != 0) {

Line 631 is:

  cloneIntersection[prop] = this[prop];

Line 387 is:

  for(var j = 0; j < touching.length; j++) {

That covers all the side exits we hit at least 10000 times in that log.

Those all look like either BRANCH_EXIT or LOOP_EXIT, right?  I have no idea why we're branching, though...
Attached image TraceVis output
Attached is a tracevis visualization. Ignore the left half--that's just browser startup and my navigating to the game. Once it starts playing, it seems to spend most of its time recording and compiling traces, and in the interpreter. I also got about 200 flushes in <10 seconds of execution.

Combined with bz's observations, this makes me think we are getting trace explosion--the program has many branches and/or types, so many traces are generated. This has 2 potential bad effects: (1) if we get tail duplication, then we compile each jsop many times, which is very inefficient, and (2) if we compile too many traces, we can OOM the trace cache before really even getting to run any code, and then we end up just repeating.

Next step, I think, is to figure out better exactly why we're getting so many traces.
We found that the reference to 'turn' on line 180 of board.js is causing the global shape to change. These frequent global shape changes cause many flushes, so the tracer can only run a few traces before a flush is needed.

bz provided this stack trace:

[6:36pm] bz: #0  js_LeaveTraceIfGlobalObject (cx=0x134a000, obj=0x1dbe69c0) at jscntxt.h:1476
[6:36pm] bz: #1  0x003044e6 in js_MakeScopeShapeUnique (cx=0x134a000, scope=0x1dc61c90) at jsscope.h:340
[6:36pm] bz: #2  0x0030d5fe in PurgeProtoChain (cx=0x134a000, obj=0x1dbe69c0, id=504945780) at /Users/bzbarsky/mozilla/tracemonkey/mozilla/js/src/jsobj.cpp:3575
[6:36pm] bz: #3  0x0030d725 in js_PurgeScopeChainHelper (cx=0x134a000, obj=0x1dbe69c0, id=504945780) at /Users/bzbarsky/mozilla/tracemonkey/mozilla/js/src/jsobj.cpp:3601
[6:36pm] bz: #4  0x0030d783 in js_PurgeScopeChain (cx=0x134a000, obj=0x129d1dc0, id=504945780) at jsobj.h:620
[6:36pm] bz: #5  0x0030e2e6 in js_DefineNativeProperty (cx=0x134a000, obj=0x129d1dc0, id=504945780, value=22, getter=0x2bdbfc <js_GetCallArg>, setter=0x2bdc28 <SetCallArg>, attrs=69, flags=4, shortid=0, propp=0x0, cacheResult=0) at /Users/bzbarsky/mozilla/tracemonkey/mozilla/js/src/jsobj.cpp:3759
[6:39pm] bz: #6  0x002c02ab in call_resolve (cx=0x134a000, obj=0x129d1dc0, idval=504945780, flags=4, objp=0xbfffc170) at /Users/bzbarsky/mozilla/tracemonkey/mozilla/js/src/jsfun.cpp:947
[6:39pm] bz: #7  0x0030b749 in js_LookupPropertyWithFlags (cx=0x134a000, obj=0x129d1dc0, id=504945780, flags=4, objp=0xbfffc200, propp=0xbfffc1fc) at /Users/bzbarsky/mozilla/tracemonkey/mozilla/js/src/jsobj.cpp:3897
[6:39pm] bz: #8  0x0030f582 in js_FindPropertyHelper (cx=0x134a000, id=504945780, cacheResult=1, objp=0xbfffc5cc, pobjp=0xbfffc5c8, propp=0xbfffc5ac) at /Users/bzbarsky/mozilla/tracemonkey/mozilla/js/src/jsobj.cpp:4038
[6:39pm] bz: #9  0x002e448a in js_Interpret (cx=0x134a000) at /Users/bzbarsky/mozilla/tracemonkey/mozilla/js/src/jsinterp.cpp:5259
[6:39pm] bz: #10 0x002f7b78 in js_Invoke (cx=0x134a000, argc=1, vp=0x14fba20, flags=0) at jsinterp.cpp:1375

The key seems to be call_resolve. This calls js_DefineNativeProperty, which in turn purges the scope chain, thus giving the global object a new shape. I don't entirely understand this code, but I have a guess:

The function that contains line 180 is an anonymous closure that is passed to an 'each'-type function. Thus, the function is considered to possibly escape and is compiled as heavyweight. In order to look up the variable, we must then use JSOP_NAME, which in turn accesses/creates the 'Call' object representing the JS activation record. This object is unique for each invocation of the anonymous function. Thus, every time we call the outer function, we are looking in a fresh 'Call' object and go down the path where we call js_DefineNativeProperty, which flushes all the scopes up to and including the global scope.

I *really* don't understand this well enough to know how to fix it, but at a high level, it seems the fundamental problem is in doing the flush--looking up an existing property on a call object should not require updating the global object.

brendan, igor, any ideas on how to fix this?
go.js line 4 has 'var turn = 1;'.

As the stack shows, a lookup on a Call object may lazily define a formal param or local variable on that Call object, by reflecting on compiler-created local name metadata in the executing function object.

If that param or local has the same name as a global property, then the global object will be reshaped in case any cache entries or trace guards assumed that nothing in that Call object yet shadows 'turn' in the global object.

This happens only if the Call object is flagged as a delegate (linked on a scope and/or proto chain). And of course only if 'turn' names a global and a local.

Since the Call object in question is for a function enclosing the currently executing function, and the currently executing function is heavyweight too, the outer function's Call object is indeed flagged as a delegate -- it is no longer at the head of the scope chain, therefore some other object(s) may delegate to it. In this case the other object is the Call object for the innermost function, the lambda passed to this.eachInt starting at line 177 that is executing in the topmost interpreter frame on cx.

You should be able to verify that object is at the head of the scope chain, and the JSSLOT_PARENT fslot links to the outer lambda's Call object.

In Firefox 3, the property cache purges that would reshape objects with shadowed names had limited effects. Stale cache entries would not be hit, thanks to the entry capability (which includes the shape of the target object) not matching. New entries could be filled quickly.

With TraceMonkey, reshaping the global bails off trace and flushes the world. This is a big difference.

Our choices seem to be either:

(1) to avoid reshaping on shadowing, somehow, based on more precise knowledge of shadowing not always causing cache entries and trace guards to become stale -- it's pretty clear in this case, the 'turn' upvar being defined is not invalidating any inner-scope caching of the global 'turn', i.e. there are no such stale entries or guards -- or else:

(2) limit the effects of global shape changes on the JIT.

I'm supposed to be asleep, so I'll have to save more thought for later today.

/be
Dave, would you test and measure? I've got other patches on the boil. Igor should vet this patch since he knows the optimized Call object machinery well.

Igor: this may be too expensive, I haven't benchmarked. Alternative is brute force: evolve the cacheResult bool optional parameter to js_DefineNativeProperty into a defineHow flags param, defaulting to 0 but with two enum bits optional:

JSDNP_CACHE_RESULT // aka FROM_INTERPRETER ;-)
JSDNP_DONT_PURGE   // passed by call_resolve to suppress PurgeScopeChain

Thoughts? This might miss other paths that should not purge, but I can't see any right now: call_resolve is the funnel.

/be
Attachment #377134 - Flags: review?(igor)
We should spin off a separate bug blocking this one for comment 35 and comment 36 now that we've identified _a_ problem.  Once we fix it, there's no guarantee that this bug (which is the overall tracker for "this go game is slow") will be fixed.
(In reply to comment #36)
> Created an attachment (id=377134) [details]

The observation from the bug 490364 is that it is only necessary to purge the parent chain for the Call object. For other objects it sufficient to purge just the prototype chain alone. The observation from this patch is that it is necessary to purge the parent chain for the Call object only when adding new properties.

Now, since the Call object does not (and must not!) leak to scripts, the only way it can gain a property is via explicit var or function definition from the eval. So the conclusion is that it is necessary to purge the parent chain only after DEFVAR and DEFFUN called from the eval. 

Did I miss something? For example, presumably a debugger can get access to a Call object and add property there bypassing DEFVAR and DEFFUN. But it should be a debugger bug, right?
(In reply to comment #37)
> We should spin off a separate bug blocking this one for comment 35 and comment
> 36 now that we've identified _a_ problem. 

The bug 490364 is a good place for that.
Depends on: 490364
(In reply to comment #38)
> (In reply to comment #36)
> The observation from this patch is that it is
> necessary to purge the parent chain for the Call object only when adding new
> properties.

New properties not seen by the compiler when the Call object's function was compiled, yes.

> Did I miss something? For example, presumably a debugger can get access to a
> Call object and add property there bypassing DEFVAR and DEFFUN. But it should
> be a debugger bug, right?

Our API made this case work, so it should continue to work at this point in the release cycle. In general, eval (but not indirect eval -- that should use only the eval function object's global scope) is like a debugger eval-in-frame. We should optimize around this for now.

/be
Attachment #377134 - Flags: review?(igor) → review-
Comment on attachment 377134 [details] [diff] [review]
the answer came to me instead of sleep :-/

>+            /*
>+             * This id names a formal parameter or local variable. Therefore
>+             * even if it happens to shadow a global or upvar of the same name,
>+             * the inner binding cannot invalidate any property cache entries
>+             * or derived trace guards for the outer binding. Only eval-created
>+             * var bindings in a Call object could happen after a reference to
>+             * a free or upvar was evaluated.


This assumes that an upvar reference is always resolved at compile-time. But this is not the case, for example, if the static level exceeds JS_DISPLAY_SIZE. So this assumption that only eval-added locals need a scope chain purge does not hold.
(In reply to comment #41)
> This assumes that an upvar reference is always resolved at compile-time. But
> this is not the case, for example, if the static level exceeds JS_DISPLAY_SIZE.
> So this assumption that only eval-added locals need a scope chain purge does
> not hold.

Given this the suggestion from the comment 38 dwould not work as is. It should be combined with purging the parent chain after call_resolve defines a new property.
OK, with brendan's patch (which we should spin off to bug 490364, with all the discussion up through comment 42, sounds like), the initial stones seem to be just as fast as with jit disabled (but not faster, as far as I can tell).  jitstats from running the game for a bit that way:

recorder: started(340), aborted(144), completed(1874), different header(1), trees trashed(0), slot promoted(0), unstable loop variable(543), breaks(3), returns(0), unstableInnerCalls(1)
monitor: triggered(1610101), exits(1610097), type mismatch(0), global mismatch(7)

Note the much lower "started", "completed", and "unstable loop variable" stats.  Note also that triggered and exits are still pretty high...
And the locations and distribution of the exits still more or less match that from comment 32.
We no longer seem to end up bailing out in js_LeaveTrace.  But we still seem to be leaving on the same spots, which I presume are due to shape checks still.  We do have a lot of shape-changes going on for non-global objects.  For example, this code:

  this.eachInt(function(i) {
    cloneBoard.setInt(i.pos.x, i.pos.y, i.clone());
  });

in Board.prototype.clone seems to trigger a shape change... due to js_GetMethod calling js_GetPropertyHelper calling js_FillPropertyCache calling js_MakeScopeShapeUnique calling js_GenerateShape while getting the "clone" method (presumably from "i").  I have no idea why this causes a shape change, exactly...  Brendan?  Igor?
(In reply to comment #45)
> calling js_GetPropertyHelper calling js_FillPropertyCache calling
> js_MakeScopeShapeUnique calling js_GenerateShape while getting the "clone"
> method (presumably from "i").  I have no idea why this causes a shape change,
> exactly...  Brendan?  Igor?

I guess the comments in http://hg.mozilla.org/tracemonkey/file/9caa852c758c/js/src/jsinterp.cpp#l211 should be made more clear. The property cache caches functions by their value so any change in the value of the property must update scope's shape. That shape change prepares for that.
Why's the value changing, though?  "clone" is just a method on the prototype of |i|; there should be no value change I would think.
Comment on attachment 377134 [details] [diff] [review]
the answer came to me instead of sleep :-/

Obsoleting here so patching can continue in the blocking bug.

/be
Attachment #377134 - Attachment is obsolete: true
(In reply to comment #47)
> Why's the value changing, though?  "clone" is just a method on the prototype of
> |i|; there should be no value change I would think.

What changes is the cache and derived trace guards -- evaluating the callee in i.clone() makes a value-dependent, not id- or slot-dependent, entry/guard. In order to invalidate correctly, this brands i's scope so any updates to function-valued properties regenerate shape. Then GC_WRITE_BARRIER checks SCOPE_BRANDED on every store into any slot in a live object.

The branding step regenerates i's shape too, in case there are entries/guards that would hang onto stale information -- but this is a bug, nothing can be stale since the scope has not been branded yet. All prior cache entries / trace guards for this object are still valid.

The patch for bug 471214 will address this directly; noting dependency.

/be
Depends on: 471214
(In reply to comment #49)
> The branding step regenerates i's shape too, in case there are entries/guards
> that would hang onto stale information -- but this is a bug, nothing can be
> stale since the scope has not been branded yet. All prior cache entries / trace
> guards for this object are still valid.

This is not true, for the following reason: a direct or immediate prototype hit at the same pc but on a different object (prototype) of the same shape will get the same single method value, but the scope for that unrelated object of same shape will not be branded.

I used regeneration during branding to force branding of all scopes among the cohort of same-shaped objects. In other words, a branded shape must be distinct from unbranded shapes.

More in bug 471214 soon. Focusing on this and bug 492355 (in better shape, it will be fixed sooner).

/be
OK, commenting out the js_MakeScopeShapeUnique call in js_FillPropertyCache when branding seems to also give a visible speedup... or maybe I'm just imagining it.  We're still bouncing on and off of trace a lot, though.  Those stats are not changed, that I can see.
I did some more digging into the aborts here.

It seems that we actually abort the trace any time the following code pattern happens:

  var foo = bar;
  this.eachInt(function() {
    // Use foo here
  });

with the fp->scopeChain is not global or active call object abort.  I've filed bug 492914 on that.

We also abort trace on this interesting edge case where ++ is used to increment a property of an object.  Filed bug 492915 on that.

Once I eliminated all of those (by changing the JS code to avoid them), things got noticeably faster (about 30-50% speedup from those changes plus nixing all the jquery usage).  But we still have many (hundreds of thousands in some cases, millions to tens of millions in others) trace exits.  So we still need to figure out what's going on with all the side-exiting here.  The loop exits are mostly due to the fact that we take a while to actually trace through eachInt, I think.  But for the side exits we need to figure out why they're happening (maybe by trying to minimize the JS some).
Depends on: 492915, 492914
Depends on: 492918
Comment on attachment 377134 [details] [diff] [review]
the answer came to me instead of sleep :-/

I was wrong about r- for this patch - the comments are right and the inner binding can never access the outer binding hidden by a local or arguments name. I will incorporate this into the patch for the bug 491126.
Attachment #377134 - Flags: review-
(In reply to comment #53)
> (From update of attachment 377134 [details] [diff] [review])
> I was wrong about r- for this patch - the comments are right and the inner
> binding can never access the outer binding hidden by a local or arguments name.
> I will incorporate this into the patch for the bug 491126.

Oh, I didn't know you r-'ed it for that reason -- re-reading comment 41 though, I see what you were thinking. Just to confirm: local args and vars unless created by eval always shadow outer bindings, no matter how lazily resolved. By induction this applies to upvars too. So -- excluding eval- or debugger-created "very late" bindings -- we can optimize to avoid reshaping the global even if a local property shadows a global of the same name.

/be
I got tired of runnning a browser.  This prints out the moves as they happen so you can keep track of progress and prints out the stats at the end.  No other output, no canvas, no jquery.
Oh, and I set the randomWeight to 0 in that shell testcase so it always runs exactly the same thing.  Might help with debugging.  ;)
(In reply to comment #56)
> Created an attachment (id=377491) [details]
> Full go game js shell testcase

Thanks for this nice test. 

On my box it shows 25% slowdown with the jit enabled. The patch from the bug 490364 comment 10 (which is based on the patch from the comment 36) has not changed that. With that patch I do see a reduction in number of js_PurgeScopeChain calls, but that did not help the test case.
Some notes on that js shell testcase, running it with jit on and the patches from bug 490364 and bug 492914 applied:

1)  It's about a 20% slowdown from interpreted (145s vs 120s).
2)  Increasing MAX_BRANCHES to 512 gives us interpreter parity, as far as I can
    tell.  Further increasing it to 16384 has no effect.
3)  Increasing BL_ATTEMPTS makes the script die partway through.  See bug
    493068.
4)  Increasing MAXPEERS to 99 while not changing MAX_BRANCHES has no effect.
Depends on: 493068
OK.  So I went back to digging at the branch exits.  Running for 10 turns (5 per player) and then exiting gives me about 400k branch exits. 150k of these are:

leaving trace at /Users/bzbarsky/foo.js:309@71, op=getprop, lr=0x132a6f4, exitType=0, sp=1, calldepth=0

Looking at the corresponding verbose log,

    SideExit=0x132a6f4 exitType=0
    24
    ld4506 = ld $stack0[24]
    JSVAL_TAGMASK
    and2418 = and ld4506, JSVAL_TAGMASK
    2
    eq3619 = eq and2418, 2
    JSVAL_INT
    and2419 = and ld4506, JSVAL_INT
    or534 = or and2419, eq3619
    eq3620 = eq or534, 0
    xt2551: xt eq3620 -> pc=0x30f69f imacpc=0x0 sp+8 rp+0

That looks like the guard in the isNumber case in unbox_jsval.  mrbkap posits that this is the unbox at the end of ::prop, given that the op here is getprop.  Line 309 of the testcase is:

      if(touching[j].chainId != i.chainId) {

The above seems to suggest that the jsval is not coming out as JSVAL_INT or JSVAL_DOUBLE here... but I added some code right before this line to throw if typeof(i.chainId) or typeof(touching[j].chainId) is not "number", and that code doesn't throw.

Anyone have any bright ideas on why we're exiting here?
OK, so I did some digging into ::guard, and the log is just totally misleading.  The SideExit line corresponds to the code _before_ it (even though there's a blank line in the log between them), not to the code that immediately follows it.  I'll file a bug to fix the log to make things clearer.  So the correct guard here (blank line elided) is:

    shape = ld ld4504[16]
    406
    guard(kshape) = eq shape, 406
    xf3364: xf guard(kshape) -> pc=0x30f69f imacpc=0x0 sp+8 rp+0
    SideExit=0x132a6f4 exitType=0

kshape puts us in either ::test_property_cache or ::record_SetPropHit.  I went ahead and edited the guard names in those functions to make them different, and it's the test_property_cache guard.

I don't know which property is being accessed here, exactly... any way to tell that?
(In reply to comment #61)
> I don't know which property is being accessed here, exactly... any way to tell
> that?

I've had success before modifying one or more recording functions to call a builtin that prints out some perf/debugging info. As to reversing property ids back to names, I don't know much about that offhand.
Filed bug 493246 to improve the logging.

Looking at the log some more, we have:

    ld4504 = ld $stack0[0]

That's the thing whose shape we're guarding on, right?  What would it correspond to, if this is the loop in question:

    for(var j = 0; j < touching.length; j++) {
      if(touching[j].chainId != i.chainId) {
        touchingUniqueChainIds[touching[j].chainId] = true;
      }
    }
?
Given that the following guard is the isNumber() guard, I would bet the kshape guard is on the shape of touching[j].

So here's a question... we side-exit on that exact guard 150000 times in a small subpart of the game (first 10 moves).  How come we never create a trace with the new kshape?
How many kshapes are in play? I notice that that code is inside a closure passed to eachInt. Could it be yet another bug where millions of fresh shapes are being created? Otherwise, max-branches could cause that kind of thing, but I think your test with a really big value pretty much proves that it's not a small number of shapes that just barely overflows max-branches.
For what it's worth, I just redid the max_branches test, and while it doesn't give me any change on the full game (140 seconds before and after) on just the first 10 moves the larger max_branches goes from 3 seconds to 15 seconds....  So we do in fact get limited by max_branches, but it actually sorta helps...

I also tried calling gc() after every move, both with the normal max_branches and with the bigger one.  It gave a tiny speedup, but not much of one.  That would have blown away the "stale" trees, right?
OK, a bit more data.  Comment 59 through comment 66 didn't have the "make branding not change scope" patch in my tree.  Adding that makes us get that branch exit a lot less on the "first 10 moves" testcase, though not _that_ much less on the full-game testcase (drops by a few tens of thousands, but still similar order of magnitude).

So thinking about it some more, it might be that the shape we're looking at is not that of the touching[j] but that of the scope we're getting 'j' from...
So if I apply the branding hack, comment out the one assert it triggers and test, then times drop by aboug 2-3 seconds across the board (for the first 10 moves, first 20 moves, and whole game).  In other words, it only helps during the first 10 moves...

I just ran a build with that diff for 20 moves to get a verbose log.  Now the most common exits are loop exits, but we still have a number or branch exits.

Now the most common branch exit is different though.  Or rather, same line, but different guard:

leaving trace at /Users/bzbarsky/foo.js:309@71, op=getprop, lr=0x2eebb4, exitType=0, sp=1, calldepth=0

    About to try emitting guard code for SideExit=0x2eebb4 exitType=0
    shape = ld eq4002[16]
    141
    ld4884 = eq shape, 141
    obj: xf ld4884 -> pc=0x30f69f imacpc=0x0 sp+8 rp+0

where we have:

    eq4002 = ld eq4001[0]
    eq4001 = ld sp[0]

This is definitely a shape guard on that same line.  Not sure for which object.  The next guard looks like a CallIteratorNext guard or some such....
For extra fun, after about 12 moves in the only js_GenerateShape calls I see are coming from Board.prototype.findChainLengthsByChainId (which builds up an object that probably has a different shape every time it's called).

So I have no idea why we keep failing that shape guard above...
Depends on: 493316
I tried using the JS shell to localize the problematic code.  I ended up running into something weird.  This test case has three tests in it:
1) run findChains() 100 times
2) run findChainTouchingCountsByChainIdAndStone() 100 times
3) run both functions 100 times

If I enable all tests, here are the times that I get:
Without JIT:
test 1 (findChains) run time: 456
test 2 (findChainTouching...) run time: 624
test 3 (both) run time: 1087

With JIT:
test 1 (findChains) run time: 430              <- faster
test 2 (findChainTouching...) run time: 749    <- slower
test 3 (both) run time: 1173                   <- slower

Ok, so findChains runs faster under JIT, but findChainTouching... doesn't, right?

If I comment out the call for test 1.
Without JIT:
test 2 (findChainTouching...) run time: 255
test 3 (both) run time: 1096

With JIT:
test 2 (findChainTouching...) run time: 120    <- faster
am3_shelltest.js:140: TypeError: counts[chainId] is undefined

In this case, it looks like findChainTouching... *is* faster under JIT.  Then, for some reason test 3 crashes.  I don't understand why the script would error with the JIT, but not without it (maybe this is known or correct behavior).  It might have something to do with findChainTouching... running before findChains (this would happen by skipping test 1), so I did a test where I run test 2, then test 1--no crash or error.  Oh, but if I run the tests in 2, 1, 3 order, then I do get the error again at test 3.

The error message "counts[chainId] is undefined" confuses me because three lines previously (line 137), I check for undefined (counts[chainId] == undefined) to ensure that counts[chainId] is set.  The condition (typeof counts[chainId] == 'undefined') doesn't work any better.

If I run just test 3 on its own, then the crash goes away and JIT is a little slower (1074 vs 1164).
> I don't understand why the script would error with the JIT, but not without it

Sounds like a serious bug to me!  Can you possibly reduce the testcase that shows the behavior difference as much as possible and file as a bug blocking this one?  And thank you for reducing this, by the way; it's great to have someone who understands the go code in the large doing that.

Sounds somewhat similar to bug 493068, by the way...  Sadly, I suspect the speedup is due to the incorrect execution.
I've started working on trimming down the test case and will continue when I get back from work.  Can someone confirm that they get the same error?  Thanks.
Aaron, I can definitely confirm that behavior.  I filed bug 493657 to track that.  Please attach whatever you can manage to reduce the testcase to there?
Depends on: 493657
I applied gal's patch from attachment 378251 [details] [diff] [review] (bug 493657) to changeset: 28527:aac8141341bf and re-ran some test using attachment 377996 [details] (Testcase to try to localize the problem using reduced go code).

Fortunately, the TypeError appears to be gone, but I'm still seeing slowdowns under JIT with runTest3().

I ran just runTest3() with different maxes in the for loop.  Here are the results:

Without JIT:
max=  100    1111 ms
max= 1000   14134 ms
max=10000  149546 ms

With JIT:
max=  100    1193 ms ( 7% slower)
max= 1000   18273 ms (29% slower)
max=10000  204842 ms (37% slower)

I don't know why the percent would go up so much.  I can't imagine that the startup time would have that much of an effect.
Attached file 31 line testcase (obsolete) —
I've reduced the pretty far and while retaining the slowdown on my build.  Here are the numbers that I get:

Without JIT 
run time: 2103

With JIT 
run time: 2989
Attachment #377996 - Attachment is obsolete: true
Attached file 29 line testcase
This is a slightly modified version of the 31 line testcase.  I removed one anonymous function, and swapped the order of the two main functions.

Without JIT
run time: 1405

With JIT
run time: 2229

This one is quite a bit faster overall than the previous test case, but is still slower under JIT.
Attachment #378554 - Attachment is obsolete: true
That last testcase basically shows the following features:

1) funcA() has a lambda that modifies a variable it closes over.  That can't be
   traced (see bug 492918).  It leads to the following aborts:

    abort: 9486: JSOP_BINDNAME must return global object on trace
    Abort recording of tree /Users/bzbarsky/test.js:5@8 at 
      /Users/bzbarsky/test.js:15@0: bindname.

   (3 times in a row) and 

    Abort recording of tree /Users/bzbarsky/test.js:22@69 at
     /Users/bzbarsky/test.js:14@8: lambda.

   (3 times in a row).

2) This means the outer loop cannot get traced.
3) The loop that calls funcB a bunch of times _does_ get traced, but that loop
   body does almost nothing.
4) The jitstats look like so:

    recorder: started(5), aborted(6), completed(2), different header(0),
              trees trashed(0), slot promoted(0), unstable loop variable(1),
              breaks(0), returns(0), unstableInnerCalls(0)
    monitor: triggered(996009), exits(996009), type mismatch(0),
             global mismatch(0)

So it sounds like the slowdown is just due to the jit entry/exit overhead for the very thin loop....  That matches what I was seeing on the original testcase.  Are there calls to such functions (ones that can be traced but are fast and ones that can't and are slower to start with) right next to each other in the original go game?
Actually, I'm not quite sure what's going on there.  We seem to have a trigger+exit for every iteration of the |for(var prop in obj)| loop.  We do complete two traces, though.  Do we record a trigger+exit even if the inner loop exits into the outer loop?  Most of the exits (almost all, in fact) are loop exits for the for/in loop...
And in particular, if I inline the loop around the funcB() call into the main loop, then I only get 1000 trace starts/exits.  If the loop is inside the manyTimes function, I get 1,000,000 as above.  If I use manyTimes to call funcB but don't use it in funcA, then I get 1000 trace starts/exits.

So it sounds like in this case we fail to trace the "call manyTimes with funcB" thing as a single trace start/stop.  we do claim to have competed two loops, though.  I have no idea what they could be other than the manyTimes loop and the for/in loop...  And if that's what they are, how come we didn't glue them together?
Note that the upvar in funcA, named n:

function funcA() {  // was findChains
    var n = 0;
    manyTimes(function() {
        n += 1;
    });
}

is accessed from a lambda called while funcA is still active. This is a case we can trace, akin to how we handle upvars accessed from non-escaping (non-funarg, i.e., Algol-procedure-like functions) above the static level of the trace's entry frame. I encouraged dmandelin to steal the bug from my list.

That bug is really two bugs: one for the case shown above, where a funarg sets an upvar but the funarg is called only while the upvar's containing function is still active; and the other (which I claim is rare) where funcA returned and later (via some heap reference to a function), the lambda was invoked.

/be
> (which I claim is rare)

Agreed (and in particular nonexistent in the go game).  Does fixing the we-think-common case for 1.9.1 feasible?

Note that while that would help this reduced testcase it might not help the go game that much; I seem to recall that when I hacked all the upvars to use expicit closure args to functions I was still getting a slowdown over interp.

I'm more interested in the trace-gluing behavior (or misbehavior) on that last testcase...
In reply to the last question of comment 77.

Attachment 378554 [details] has more similar code structure to the original go game.
Here's how things line up:
manyTimes = eachInt
funcA = doCaptures
funcB = findChainTouchingCountsByChainIdAndStone
test for loop is most similar to (doCaptures or callees) and their callees 
(both of these functions call the equivalant of funcA and funcB right in a row)

doCaptures is called by isValidMove (which is called in an eachInt)
doCaptures is also called by updateState
updateState is called by makeMove which is called by evalMove which is called
in an eachInt by chooseMove

So, to answer the question: Yes, I think.
(In reply to comment #80)
> Note that the upvar in funcA, named n:
> is accessed from a lambda called while funcA is still active.

Do we have a separated bug to optimize escaping upvars? It should be straightforward to implement something similar to js_GetUpvar that looks into the corresponding Call object and not the frame after the frame is put.
Boris: we can do something for 1.9.1 if we have a bit more time than just today. I agree the failure to weld inner and outer trees and avoid too many traces is a bad bug too.

IMHO we must block on any worse-than-noise regression from 3.0.x perf. 3.0.x is no longer competitive in the JS performance wars, so this makes me sad, but 2x worse is unacceptable.

Igor: we're using bug 492918 for both mutating an upvar and escaping function uses an upvar. We can split if necessary, but it looks like one patch will fix both aspects.

/be
For what it's worth, with bug 458838 fixed we're at 1.3x slowdown, not 2x.  Still worse-than-noise, but not nearly as bad...
I did a bit more of poking around at the full go testcase, and I see a bunch of aborts like:

  Abort recording of tree /Users/bzbarsky/test.js:704@3 at
    /Users/bzbarsky/test.js:590@11: lambda_fc.

It looks like that opcode always aborts.  I have no idea why we hit it here, nor why we end up trying to trace that loop so many times and fail each time on that lambda_fc...
(In reply to comment #86)
> I did a bit more of poking around at the full go testcase, and I see a bunch of
> aborts like:
> 
>   Abort recording of tree /Users/bzbarsky/test.js:704@3 at
>     /Users/bzbarsky/test.js:590@11: lambda_fc.

We need a bug on this one. I didn't trace it because forming a flat closure on trace needs the display maintained by the interpreter, or a longer trip up the static link (AKA call object parent slot, if not optimized away). Been talking to dmandelin, who has been developing the machinery to support this.

> It looks like that opcode always aborts.  I have no idea why we hit it here,

A flat closure is one whose upvars are initialized only, and the initialization dominates the closure. No intervening escaping functions, but layers of Algol-like procedures are cool.

> nor why we end up trying to trace that loop so many times and fail each time on
> that lambda_fc...

That seems like a failure to blacklist. Andreas or Graydon may be able to say.

/be
I am looking at LAMBDA_FC but I am not sure we want to block on that. Its quite a bit of extra code (doable though).
Filed bug 494269 on lambda_fc (comment 87).
Depends on: 494269
this is more under control now, and andreas says fixing it fully is going to be hard(tm). we have to ship without it.
Flags: blocking1.9.1+ → blocking1.9.1-
Flags: wanted1.9.1?
Depends on: 496240
So fundamentally, we seem to have a problem with this part:

Board.prototype.eachInt = function(func) {
  for(var y = 0; y < this.size; y++) {
    for(var x = 0; x < this.size; x++) {
      func(this.getInt(x, y));
    }
  }  
};

This is called all over in this game, with different values of |func| depending on what's being done to the board.  Since we guard on callee, each invocation with a different function has to trace new branches for these loops, right?  And that means aborting whatever is recording in the caller now, etc.
Component: JavaScript Engine → jemalloc
Component: jemalloc → JavaScript Engine
(In reply to comment #92)
> This is called all over in this game, with different values of |func| depending
> on what's being done to the board.  Since we guard on callee, each invocation
> with a different function has to trace new branches for these loops, right? 

Yes. Bug 518487 is for a PIC, which could be useful here, although it would require further work to be able to 'un-inline' the call.
Depends on: TM:PIC
Depends on: 563852
Depends on: 590037
For what it's worth, numbers as of today look like this (all measurements in opt shells using the attached shell testcase):

Interp: 733ms/move
-m:     428ms/move
-j:     694ms/move
-m -j:  427ms/move
v8:     155ms/move
Depends on: 599035
Depends on: 601106
Board.prototype.getInt starts like this:

if(typeof(x) == "number" && typeof(y) == "number") {

Board.prototype.getInt is called almost 50 million times. 
Blocking bug 591172 because of this.
Depends on: 591172
Current Performance Numbers:

JS Shell Testcase [attachment 350176 [details]]
Interp: sample01 runs: 10 min: 73 avg: 75 max: 82
JM: sample01 runs: 10 min: 11 avg: 11 max: 12
JM+TI: sample01 runs: 10 min: 9 avg: 10 max: 10
d8: sample01 runs: 10 min: 6 avg: 7 max: 7

Full JS Shell Testcase [attachment 377491 [details]]
Interp: 1481
JM: 496
JM+TI: 397
d8: 265

29 Line Testcase [attachment 378561 [details]]
Interp: loop count: 1000, run time: 688
JM: loop count: 1000, run time: 62
JM+TI: loop count: 1000, run time: 48
d8: loop count: 1000, run time: 35
Summary: go game benchmark is 2x slower with tracemonkey enabled → JM+TI: go game benchmark is ~1.4x slower than v8
Current Performance Numbers:

JS Shell Testcase [attachment 350176 [details]]
Interp: sample01 runs: 10 min: 77 avg: 77 max: 78
Baseline: sample01 runs: 10 min: 8 avg: 9 max: 9
IM: sample01 runs: 10 min: 2 avg: 3 max: 4
d8: sample01 runs: 10 min: 4 avg: 4 max: 5

Full JS Shell Testcase [attachment 377491 [details]]
Interp: 1203
Baseline: 460
IM: 279
d8: 185

29 Line Testcase [attachment 378561 [details]]
Interp: loop count: 1000, run time: 526
Baseline: loop count: 1000, run time: 93
IM: loop count: 1000, run time: 24
d8: loop count: 1000, run time: 17

So, we're faster on the first testcase, but still 50% and 40% slower on the others
Summary: JM+TI: go game benchmark is ~1.4x slower than v8 → IM: go game benchmark is ~1.5x as slow as v8
The numbers have improved very slightly (from 279 to 258 for the full, and from 24 to 22 for the 29 line testcase), but the overall situation is essentially unchanged. Except for v8 dropping from 185 to 156 for the full testcase.
Assignee: dmandelin → nobody
Component: JavaScript Engine → JavaScript Engine: JIT
Summary: IM: go game benchmark is ~1.5x as slow as v8 → go game benchmark is ~1.5x as slow as v8
My numbers on the full, in-browser benchmark, averaged over 5 runs, are:
FF Nightly: 71.72 ms/turn
Chrome 49:  86.05 ms/turn
Full JS Shell Testcase [attachment 377491 [details]]
SM: 235ms
d8: 160ms
jsc: 107ms(!)

29 Line Testcase [attachment 378561 [details]]
SM: 16ms
d8: 21ms
jsc: 36ms
Priority: P2 → P3
You need to log in before you can comment on or make changes to this bug.