See report here: http://hacks.mozilla.org/2009/07/tracemonkey-overview/#comment-2035 If you run 'example 3' in that blog post over and over again the time it takes increases with every run, starting at 3ms and eventually topping out at around 200ms. If you keep running it eventually resets and starts running at 3ms again. (Blowing a cache, eventually?) Tested here on the 3.5.1 final release on Windows 7.
Repro'd here as described on Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729) Was also able to reproduce with a bleeding-edge tracemonkey release-build js shell on a very slow SPARC, except instead of resetting it got VERY slow -- after a slow build to 300ms it jumped to over 30s and stayed there. Now here's where it gets interesting: If I call JS_GC() each time the loop runs, the time intervals stay the same, but instead of getting stuck at > 30s, they jump back down into the 30ms range.
Sorry, I meant JS_MaybeGC() in my loop. With JS_GC() in the loop, it runs at the fastest speed every time.
Comment on attachment 389142 [details]
js shell testcase that shows the problem

Please use text/plain for .js file attachments. Easier to preview, and we don't want to run 'em on click in the browser anyway (AFAIK -- anyone disagree?). /be
So what's happening here is that each time we run the loop, |mul| has a different value, so we can't reuse our existing trace for that loop and have to create a new one. Things slow down more and more as we accumulate traces compiled for that code location (presumably it takes longer and longer to realize that none of the traces are relevant to us). This continues until the 34th iteration, at which point performance _really_ takes a nosedive (to much slower than interp). I'm not sure why, exactly. I tried bumping MAXEXIT, BL_BACKOFF, MAXPEERS, and MAX_BRANCHES, and none of those change the "34" number. We should try to figure out what the deal is there. In any case, the reset observed in the browser is presumably due to GC clearing all of our cached traces. Brendan, I just forgot to switch from "autodetect" when attaching...
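The shape of the problem can be sketched with a minimal reduction (hypothetical -- this is not the actual attachment 389142 testcase): every call builds a closure over a different |mul|, so the callee's scope chain differs on each run, the parent guard fails, and a fresh branch trace gets recorded per call until the branch limit is hit.

```javascript
// Hypothetical reduction of the pattern described above (not the real
// attachment-389142 testcase): each call to makeLoop closes over a
// different `mul`, so the inner loop's callee has a different parent
// (Call object) every time, defeating the previously recorded traces.
function makeLoop(mul) {
  return function () {
    var sum = 0;
    for (var i = 0; i < 100000; i++)
      sum += i * mul;   // upvar access: per-call parent guard fails
    return sum;
  };
}

var results = [];
for (var run = 0; run < 40; run++)   // past the branch limit, per comment 5
  results.push(makeLoop(run + 1)());
```

Timing each outer iteration of a reduction like this is how the slowdown curve in comment 0 shows up: early runs are fast, then each run pays for searching an ever-larger pile of dead traces.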
We shouldn't autodetect .js as anything but text/plain here -- it's the dominant case. I wonder if there's a bug on that!
This bug depends on bug 471214 but the eval may be an added nasty. Leaving this dependent rather than a dup until I can verify that a fresh fix for bug 471214 also fixes this bug (thanks to the eval cache). /be
For what it's worth, for that testcase the stats look like this:

recorder: started(7), aborted(1), completed(70), different header(0), trees trashed(0), slot promoted(0), unstable loop variable(4), breaks(0), returns(0), unstableInnerCalls(0), blacklisted(0)
monitor: triggered(100226), exits(100226), type mismatch(0), global mismatch(0)

Here's the relevant part of a full log:

leaving trace at /Users/bzbarsky/test.js:10@86, op=call, lr=0x27ef38, exitType=BRANCH, sp=3, calldepth=0, cycles=21237
trying to attach another branch to the tree (hits = 0)

And I just checked: I totally failed to run make in the right place when testing in comment 5. MAX_BRANCHES is absolutely where it's at for this testcase, as it should have been.
We should figure out what's going on here other than the eval thing. What seems to be happening is that once we hit MAX_BRANCHES we continue entering the loop and then exiting it on the branch exit, and we do that on every return to the loop header. So in the testcase here, on the 34th iteration of the function, we enter/exit trace for that second loop about 100000 times. If I bump the iteration count by 1, we do that for both the 34th and 35th iterations...
So there seem to be two side exits with the same address here, or something... in any case, the relevant guards are:

OBJ_GET_PRIVATE(cx, callee_obj) -2
ld317 = ld $stack0
and69 = and ld317, -2
eq175 = eq and69, OBJ_GET_PRIVATE(cx, callee_obj)
xf213: xf eq175 -> pc=0x30d846 imacpc=0x0 sp+24 rp+0

OBJ_GET_PARENT(cx, callee_obj)
ld318 = ld $stack0
eq176 = eq ld318, OBJ_GET_PARENT(cx, callee_obj)
xf214: xf eq176 -> pc=0x30d846 imacpc=0x0 sp+24 rp+0

Which makes sense: we're guarding on calling the same function with the same scope chain.
So even if we make this particular case not branch-explode.... is there any way we can avoid trying to execute that tree on every return to the loop header if we keep taking branch exits and we've hit MAX_BRANCHES? I guess we keep hoping that we'll manage to hit one of our existing branches, but in this case that just can't happen. It seems that if we limit it to trees that have hit MAX_BRANCHES, the overhead of the checking might be worth it to avoid slowdown over interp.... On the other hand, if the loop were thicker and the failing guard further down, it might not be a loss, and might even be a win, to try entering trace on each iteration. So not sure what the best course of action is here.
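The heuristic being weighed above could be phrased like this (a plain-JS sketch of the idea, not actual SpiderMonkey code -- the field names are made up for illustration):

```javascript
// Sketch of the heuristic discussed above, in plain JS rather than
// SpiderMonkey internals (field names are hypothetical): once a tree
// has hit its branch limit and keeps side-exiting, stop attempting it
// so we fall back to the interpreter instead of paying the trace
// enter/exit cost on every single loop iteration.
function shouldExecuteTree(tree, MAX_BRANCHES) {
  if (tree.branchCount < MAX_BRANCHES)
    return true;  // still room to attach new branch traces
  // Saturated tree: only keep trying if recent attempts actually
  // stayed on trace instead of immediately taking a branch exit.
  return tree.consecutiveBranchExits < tree.exitThreshold;
}
```

The check only fires for saturated trees, so (per the comment above) the common "thick loop, guard far down" case would keep entering trace as before.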
There's no eval thing to figure out. We cache evals, so we get the same JSFunction every time, but different parent (scope chain link) because the lambda uses upvars that are mutated. This is a distinct bug from bug 471214 after all. The trick here is to avoid guarding on the parent of the eval-created function object. If we indeed optimize its prod and i upvars to access active slots of the outer function's Call object, then we shouldn't guard on parent at all. dmandelin, want to take this one? /be
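A small illustration of the point above (plain runnable JS): the eval cache can hand back the same compiled script every time, but each call still produces a function object with a different scope chain, so a guard on the callee's parent fails on every call.

```javascript
// Each call to make() direct-evals identical source. SpiderMonkey's
// eval cache means the same JSFunction script is reused, but the
// resulting lambda's parent is a fresh Call object per call -- which
// is exactly what the parent guard in comment 10 trips over.
function make(x) {
  return eval("(function () { return x; })");
}
var f = make(1), g = make(2);
// Identical source, distinct objects, different upvar bindings.
```

Since the behavior of f and g is fully determined by their upvars, optimizing the upvar reads (as suggested above) would make the parent guard unnecessary.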
If we compute the maximum UPVAR_FRAME_SKIP when inlining a call to a function fun having upvars that are not optimized such that FUN_FLAT_CLOSURE(fun), and the skip amount is "in range" by one or another of our upvar optimizations, then we can avoid guarding on parent. This could be an important optimization. Have to check the popular benchmarks. /be
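For concreteness, the frame-skip distances in question look like this (a hypothetical illustration, with the skip amounts in comments keyed to the UPVAR_FRAME_SKIP idea above):

```javascript
// From the innermost lambda, `a` lives one frame up (in middle's Call
// object) and `x` lives two frames up (in outer's). If both skip
// amounts are known when recording, the upvar reads can go straight
// to the right Call object's slots, and no parent guard is needed.
function outer(x) {
  function middle(a) {
    return function () { return a + x; };  // skip(a) = 1, skip(x) = 2
  }
  return middle(10);
}
```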
I'm hearing from dmandelin, gal, and danderson that this is not going to be fixed in time for 1.9.2, as it's a really hard problem. So, to make sure Sayre sees this, I'm leaving it as blocking1.9.2+ and at the same time nominating for blocking1.9.3. Sayre, please update accordingly.
On mozilla-central, I can get the test case to go from 2ms to about 10ms, but then it goes back down again.
What happens here is that _if_ you have JM and jitprofiling both on, after we hit the MAX_BRANCHES point we stop trying to run the loop in the tracer altogether and just run it in the mjit. So we end up doing it 2-4x slower than a single tracer loop run, but faster than interp, of course. No idea whether that makes this bug wfm; I guess we do have bug 504881 filed on the real underlying issue here.