declining TM performance with repeated runs of test case

Status: RESOLVED WORKSFORME
Component: Core :: JavaScript Engine
Priority: P1
Severity: normal
Reported: 8 years ago
Last modified: 7 years ago
People

(Reporter: blizzard, Assigned: dvander)

Tracking

Version: Trunk
Points: ---
Bug Flags: wanted-next+
Tracking Flags: blocking2.0 betaN+

Attachments

(1 attachment)

Description (Reporter)

8 years ago
See report here:

http://hacks.mozilla.org/2009/07/tracemonkey-overview/#comment-2035

If you run 'example 3' in that blog post over and over again the time it takes increases with every run, starting at 3ms and eventually topping out at around 200ms.  If you keep running it eventually resets and starts running at 3ms again.  (Blowing a cache, eventually?)

Tested here on the 3.5.1 final release on Windows 7.

Comment 1

8 years ago
Repro'd here as described on Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729)

Was also able to reproduce with a bleeding-edge tracemonkey release-build js shell on a very slow SPARC, except instead of resetting it got VERY slow -- after a slow build to 300ms it jumped to over 30s and stayed there.

Now here's where it gets interesting: If I call JS_GC() each time the loop runs, the time intervals stay the same, but instead of getting stuck at > 30s, they jump back down into the 30ms range.

Comment 2

8 years ago
Sorry, JS_MaybeGC() in my loop.

With JS_GC() in the loop, it runs at fastest speed every time.
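The shell-side workaround from comment 2 can be sketched as follows. This is a hypothetical harness, not the actual attachment; `gc()` is assumed to be the shell builtin (Node exposes an equivalent under `--expose-gc`), and `timedRun` is an illustrative name:

```javascript
// Hypothetical timing harness: forcing a full GC before each run flushes
// the accumulated traces, so every run starts with an empty trace cache
// and stays at top speed, matching the JS_GC()-in-the-loop observation.
function timedRun(body) {
  if (typeof gc === "function")
    gc(); // shell builtin; silently skipped where unavailable
  var t0 = Date.now();
  body();
  return Date.now() - t0; // elapsed milliseconds
}
```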
Created attachment 389142 [details]: js shell testcase that shows the problem

Comment on attachment 389142 [details]:

Please use text/plain for .js file attachments. Easier to preview and we don't want to run 'em on click in the browser anyway (AFAIK -- anyone disagree?).

/be
Attachment #389142 - Attachment mime type: application/x-javascript → text/plain
So what's happening here is that each time we run the loop, |mul| has a different value, so we can't reuse our existing trace for that loop and have to create a new one. Things slow down more and more as we end up with more traces compiled for that code location (presumably it takes longer and longer to realize that none of the traces are relevant to us).

This continues until the 34th iteration.  At that point, performance _really_ takes a nosedive (to much slower than interp).  I'm not sure why, exactly.  I tried bumping MAXEXIT, BL_BACKOFF, MAXPEERS, MAX_BRANCHES, and none of those change the "34" number.  We should try to figure out what the deal is there.

In any case, the reset observed in browser is presumably due to gc clearing all of our cached traces.
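The attachment itself isn't reproduced above, but the mechanism described suggests a testcase shaped roughly like this sketch (the names |run| and |f| are illustrative; only |mul| comes from the comment):

```javascript
// Each call closes over a fresh |mul|, so the compiled trace for the
// inner loop guards on a value/scope that never matches again and a
// new trace must be recorded on every outer iteration.
function run(mul) {
  var f = eval("(function (x) { return x * mul; })"); // eval-cached script, fresh closure
  var sum = 0;
  for (var i = 0; i < 100000; i++)
    sum += f(i);
  return sum;
}
for (var iter = 1; iter <= 40; iter++) {
  var t0 = Date.now();
  run(iter);
  // elapsed time climbs with iter until the trace cache is flushed
}
```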

Brendan, I just forgot to switch from "autodetect" when attaching...
We shouldn't autodetect .js as anything but text/plain here -- it's the dominant case.  I wonder if there's a bug on that!
This bug depends on bug 471214 but the eval may be an added nasty. Leaving this dependent rather than a dup until I can verify that a fresh fix for bug 471214 also fixes this bug (thanks to the eval cache).

/be
Depends on: 471214
For what it's worth, for that testcase the stats look like this:

recorder: started(7), aborted(1), completed(70), different header(0), trees trashed(0), slot promoted(0), unstable loop variable(4), breaks(0), returns(0), unstableInnerCalls(0), blacklisted(0)
monitor: triggered(100226), exits(100226), type mismatch(0), global mismatch(0)

Here's the relevant part of a full log:

leaving trace at /Users/bzbarsky/test.js:10@86, op=call, lr=0x27ef38, exitType=BRANCH, sp=3, calldepth=0, cycles=21237
trying to attach another branch to the tree (hits = 0)

And I just checked: I totally failed to run make in the right place when testing in comment 5.  MAX_BRANCHES is absolutely where it's at for this testcase, as it should have been.
No longer depends on: 471214
We should figure out what's going on here other than the eval thing.

What seems to be happening is that once we hit MAX_BRANCHES we continue entering the loop and then exiting it on the branch exit.  And we do that on every return to the loop header.

So in the testcase here, on the 34th iteration of the function, we enter/exit trace for that second loop about 100000 times.

If I bump the iteration count by 1, we do that for both the 34th and 35th iteration...
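A toy model of that enter/exit behavior (not engine code; MAX_BRANCHES = 32 is an assumed value, and |simulate| is purely illustrative):

```javascript
// Once a tree has MAX_BRANCHES branch traces, every return to the loop
// header still enters the tree, fails the guard, and side-exits: pure
// overhead on each of the ~100000 iterations, with nothing new recorded.
var MAX_BRANCHES = 32; // assumed cap; the real constant lives in the tracer
function simulate(loopIters) {
  var branches = 0, enterExits = 0;
  for (var i = 0; i < loopIters; i++) {
    enterExits++;                 // enter tree, fail guard, side-exit
    if (branches < MAX_BRANCHES)
      branches++;                 // still room: record a new branch trace
    // else: saturated -- we keep paying the enter/exit cost anyway
  }
  return { enterExits: enterExits, branches: branches };
}
```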
Depends on: 471214
So there seem to be two side exits with the same address here, or something... in any case, the relevant guards are:

    OBJ_GET_PRIVATE(cx, callee_obj)
    -2
    ld317 = ld $stack0[16]
    and69 = and ld317, -2
    eq175 = eq and69, OBJ_GET_PRIVATE(cx, callee_obj)
    xf213: xf eq175 -> pc=0x30d846 imacpc=0x0 sp+24 rp+0

    OBJ_GET_PARENT(cx, callee_obj)
    ld318 = ld $stack0[12]
    eq176 = eq ld318, OBJ_GET_PARENT(cx, callee_obj)
    xf214: xf eq176 -> pc=0x30d846 imacpc=0x0 sp+24 rp+0

Which makes sense: guarding on calling the same function with the same scope chain.
So even if we make this particular case not branch-explode.... is there any way we can avoid trying to execute that tree on every return to the loop header if we keep taking branch exits and we've hit MAX_BRANCHES?  I guess we keep hoping that we'll manage to hit one of our existing branches, but in this case that just can't happen.

It seems that if we limit it to trees that have hit MAX_BRANCHES, the overhead of the checking might be worth it to avoid slowdown over interp....

On the other hand, if the loop were thicker and the failing guard further down, it might not be a loss, and might even be a win, to try entering trace on each iteration.  So not sure what the best course of action is here.
There's no eval thing to figure out. We cache evals, so we get the same JSFunction every time, but different parent (scope chain link) because the lambda uses upvars that are mutated. This is a distinct bug from bug 471214 after all.

The trick here is to avoid guarding on the parent of the eval-created function object. If we indeed optimize its prod and i upvars to access active slots of the outer function's Call object, then we shouldn't guard on parent at all.

dmandelin, want to take this one?

/be
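The parent mismatch can be illustrated in plain JS (|makeAdder| is a hypothetical name; the point is that the eval cache can reuse the compiled script while each call still yields a distinct function object with a distinct scope chain):

```javascript
// Same eval'd source each call, but a fresh closure whose parent is the
// current activation -- exactly the OBJ_GET_PARENT guard that fails on trace.
function makeAdder(n) {
  return eval("(function (x) { return x + n; })");
}
var a = makeAdder(1), b = makeAdder(2);
// Distinct objects over distinct captured scopes:
// a !== b, while a(10) === 11 and b(10) === 12
```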
No longer depends on: 471214
Flags: wanted1.9.2?
Flags: blocking1.9.2?
I filed bug 504881 on comment 11.

Updated

8 years ago
Flags: blocking1.9.2? → blocking1.9.2+
If we compute the maximum UPVAR_FRAME_SKIP when inlining a call to a function fun having upvars that are not optimized such that FUN_FLAT_CLOSURE(fun), and the skip amount is "in range" by one or another of our upvar optimizations, then we can avoid guarding on parent.

This could be an important optimization. Have to check the popular benchmarks.

/be
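A sketch of the frame-skip idea in JS terms (illustrative names; UPVAR_FRAME_SKIP is the engine's notion of how many call frames up the upvar lives):

```javascript
// When |outer| is inlined on trace, the upvar |k| always sits a known
// number of frames up (skip = 1 here), so it can be read straight out
// of the active Call object's slot -- no guard on |inner|'s parent needed.
function outer(k) {
  function inner(x) { return x + k; } // upvar k, one frame up
  var sum = 0;
  for (var i = 0; i < 10; i++)
    sum += inner(i);
  return sum;
}
```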
OS: Windows NT → All
Hardware: x86 → All
Hearing from dmandelin, gal, and danderson that this is not going to be fixed in time for 1.9.2, as it's a really hard problem.  So, to make sure Sayre sees this, I'm leaving it as blocking1.9.2+ and at the same time nominating for blocking1.9.3.

Sayre, please update accordingly.
Assignee: general → dvander
blocking2.0: --- → ?
OS: All → Windows NT
Priority: -- → P1
Hardware: All → x86
OS: Windows NT → All
Hardware: x86 → All

Updated

8 years ago
Flags: wanted1.9.2?
Flags: wanted1.9.2+
Flags: wanted-next+
Flags: blocking1.9.2+
Version: 1.9.1 Branch → Trunk

Updated

8 years ago
blocking2.0: ? → beta1
Flags: wanted1.9.2+
Depends on: 510554

Updated

7 years ago
blocking2.0: beta1+ → beta2+

Updated

7 years ago
blocking2.0: beta2+ → betaN+

Comment 16

7 years ago
On mozilla-central, I can get the test case to go from 2ms to about 10ms, but then it goes back down again.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → WORKSFORME
What happens here is that _if_ you have JM and jitprofiling both on, after we hit the MAX_BRANCHES point we stop trying to run the loop in the tracer altogether and just run it in the mjit.  So we end up doing it 2-4x slower than a single tracer loop run, but faster than interp, of course.

No idea whether that makes this bug wfm; I guess we do have bug 504881 filed on the real underlying issue here.