mandelbrot "benchmark" is a lot slower with profiling on

RESOLVED FIXED

Status

()

defect
RESOLVED FIXED
9 years ago
9 years ago

People

(Reporter: bzbarsky, Assigned: billm)

Tracking

({regression})

Trunk
x86
macOS
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking2.0 betaN+)

Details

(Whiteboard: [jsperf] [fixed-in-tracemonkey][hardblocker], )

Attachments

(1 attachment, 3 obsolete attachments)

I see 105ms with profiling or jm disabled; 355ms with profiling enabled or tm disabled.

I thought we had this in trace-tests... but that might be as a stress-test only.  :(

Building debug browser now to get some more data.
Note that the core part of the code is this:

        for (var i = 0, curReal = realRange.min;
             i < numCols;
             ++i, curReal += realStep) {
          for (var j = 0, curImag = imagRange.max;
               j < numRows;
               ++j, curImag += imagStep) {
            var n = computeEscapeSpeed(curReal, curImag);
            addPoint(points, n, i, j)
          }
        }

where computeEscapeSpeed runs a loop anywhere between 1 and 105 times.  Most calls to computeEscapeSpeed run the loop 105 times, but most of the early calls run it just once.
So we first start profiling with that inner for loop there (presumably because the loop inside computeEscapeSpeed has never run enough iterations at that point yet to get profiled?).  We end up deciding not to trace it because the nested loop inside computeEscapeSpeed (which has run 0 iterations, according to the "NESTED" spew!) has numSelfOps == numSelfOpsMult == 0, so that isCompilationExpensive for that loop returns true.

Then we later profile the loop inside computeEscapeSpeed and end up aborting profiling (not sure why yet).  So we blacklist that loop.  And that sets numSelfOps to MAX_PROFILE_OPS for that innermost loop, which means all loops containing it get blacklisted too.  So we blacklist everything.
If I change the >= to > in this line:

    if (numSelfOpsMult >= numSelfOps*100000)

then things seem to be good.  Or at least, I only see us start profiling for that inner for loop twice, each time decide to trace it and move on.  Not sure why I never see us profile the outer loop....

Also not sure yet why we ended up aborting profiling.  Trying to figure that out now.
blocking2.0: --- → ?
Keywords: regression
Looks like we abort profiling if we return out of a loop we're currently profiling?  But unlike aborting tracing, which will back off and try again later it looks like aborting profiling permanently blackists the loop?  Is that right?

In any case, the abort happens when we take the return in computeEscapeSpeed, which is the common case for that function.

In fact, it looks like any loop that returns from inside the loop before hitting 16384 jsops will hit the abort and get blacklisted.  In particular, any loop that uses return instead of a break or loop condition and happens to return before it's looped enough times to hit 16384 jsops will hit this case...  Would it make more sense to just decide() at that point instead of aborting, possibly?
Whiteboard: [jsperf]
I think there are a lot of things independently going wrong here. I still have a few things to track down, but here's what I've found so far.

- The problem with the zero-iteration loop in comment 3 was tripping us up. The fix there seems good to me.

- JM wasn't compiling the main script because it used const. I filed bug 607751 to compile those opcodes. The weird thing here is that if JM can't compile a script, then we run in the interpreter, which avoids the profiler entirely. I think this is the correct thing to do: lacking the method jit, we probably should just trace as much as possible, which is what happens without the profiler.

- I don't think that aborting due to a return statement is bad. This was done by design. First, we only abort if the return breaks out of the loop being profiled (as opposed to some nested loop).

Second, I want to emphasize an important part of profiling: If we profile a loop and decide not to trace it, it's not the end of the world. We may end up profiling an outer loop, deciding that we do want to trace it, and then we'll trace the original loop as well. This is why the "un-blacklisting" mechanism was needed for profiling.

In this example, we decide not to trace the computeEscapeSpeed loop due to the return. However, we will still trace it when we decide to trace the |for (var j = 0, ...)| loop.

In general, a return statement in a loop suggests that we won't execute many iterations of the loop, so we probably don't want it to be a top-level trace. It's just another "short loop" indicator.

- This test case also seems to exhibit a problem with tracer integration, where we start recording a loop when the interpreter is running in JSINTERP_SAFEPOINT mode. It seems like this is undesirable.

- I still haven't figured out why we don't trace the outer loop. Once I figure that out, I'll post a patch for these issues.

Anyway, thanks for the great test case!
Posted patch patch (obsolete) — Splinter Review
This patch fixes the issue with loops with loops of zero iterations.

It also fixes the problem with the integration between the profiler and the JITs. With the patch, we trace both the inner and outer main loops and performance with -mjp is that same as with -j.

The problem was that, while running in the interpreter, waiting to reach a safe point, we needed to make important profiling decisions. Previously, the profiler could only be called from the method JIT, never the interpreter. This patch fixes the problem. As a result, you can run in weird modes like -j -p. There's no reason to do this, but it should work now.

I pushed to the tryserver, since it's slightly risky change.
Assignee: general → wmccloskey
Status: NEW → ASSIGNED
Attachment #486474 - Flags: review?(dvander)
Comment on attachment 486474 [details] [diff] [review]
patch

>+    if (prof->profiled) {
>+        if (prof->traceOK) {
>+            return RecordLoopEdge(cx, inlineCallCount, prof->execOK);
>+        } else {
>+            return MONITOR_NOT_RECORDING;
>+        }

No return-after-else, just: 

>+        if (prof->traceOK)
>+            return RecordLoopEdge(cx, inlineCallCount, prof->execOK);
>+        return MONITOR_NOT_RECORDING;
Attachment #486474 - Flags: review?(dvander) → review+
> We may end up profiling an outer loop, deciding that we do want to trace it,
> and then we'll trace the original loop as well. 

See, that's my concern.  Once we _abort_ profiling of a loop X (which is different from just profiling and deciding not to trace), then we won't trace any loops that contain X, as far as I can tell.

In particular, AbortProfiling will set numSelfOps for X to MAX_PROFILE_OPS.  So isCompilationExpensive() for X will return true.  And isCompilationExpensive() for a loop containing X will return true if it returns true for X.

The only way I can see this working is if we sometimes blow away the LoopProfile structure for X and create a new one, before we start profiling the ancestor loop.  As far as I can see this only happens on a ResetJIT call.

Am I just missing something?
(In reply to comment #8)
> Am I just missing something?

No, I was wrong. I forgot about that assignment to numSelfOps in AbortProfiling. I'm pretty sure I added it to avoid tracing something in one of the SunSpider date benchmarks. But I just took out the assignment and I'm not seeing any regressions. It's definitely a good idea to take it out, because it's a silly thing to do in general.

I'll post an updated patch with this change and a fix for another problem I found this morning.
Posted patch additional patch (obsolete) — Splinter Review
This patch applies on top of the previous one. It removes the line in AbortProfiling. It also fixes a problem with PCWithinLoop. The problem was that it didn't work if you had already left the function where the loop was. Now it compares frame pointers. I should have picked up on this when you asked about it, Dave. Oh well.
Attachment #486674 - Flags: review?(dmandelin)
We may still want to set some flag to prevent tracing of the loop being aborted (this time around; we could clear the flag the next time we profile the loop)...  I do agree that a priori loops that return while being profiled should probably not get traced, since they'd just trace-abort if they did the return while recording.
Posted patch additional additional patch (obsolete) — Splinter Review
Good point. I think this should do it.
Attachment #486687 - Flags: review?(bzbarsky)
Comment on attachment 486687 [details] [diff] [review]
additional additional patch

r=me
Attachment #486687 - Flags: review?(bzbarsky) → review+
Comment on attachment 486674 [details] [diff] [review]
additional patch

>-PCWithinLoop(JSScript *script, jsbytecode *pc, T& loop)
>-{
>-    return script != loop.script || (pc >= loop.top && pc <= loop.bottom);
>+PCWithinLoop(JSStackFrame *fp, jsbytecode *pc, T& loop)
>+{
>+    return fp > loop.fp || (fp == loop.fp && pc >= loop.top && pc <= loop.bottom);
> }

|fp > loop.fp| doesn't look right to me. I know it didn't work before the new stacks, but I guess it probably does do the right thing now. Luke, does this seem safe to you? Or should it walk the stack to check? 

>     /*
>      * When we hit a nested loop while profiling, we record where it occurs
>      * and how many iterations we execute it.
>      */
>     struct InnerLoop {
>-        JSScript *script;
>+        JSStackFrame *fp;
>         jsbytecode *top, *bottom;
>         uintN iters;

I'd prefer 'entryFp', 'startFp', or something like that. 

Otherwise looks good.
Attachment #486674 - Flags: review?(dmandelin)
(In reply to comment #14)
> Luke, does this seem safe to you? Or should it walk the stack to check?

The two exceptions I can think of are:
 - the 'floating frames' stored in generators (which are copies of inactive generators, hence, should not appear here)
 - reentering the VM (content -> C++ -> chrome), but it looks like that aborts profiling.

> I'd prefer 'entryFp', 'startFp', or something like that. 

By the by, we've been trying to move toward not capitalizing either of the letters in the register-like acronyms, viz. sp, pc, and fp.  (E.g., fp->prevpc(), prevPc or prevPC)

Updated

9 years ago
blocking2.0: ? → beta8+
Posted patch new patchSplinter Review
Unfortunately, one of the previous patches asserted on the tryserver. I fixed the problem and added some additional assertion coverage. This patch subsumes the previous ones. Could you re-review, Dave? I think the fp comparison is safe, based on what Luke said.
Attachment #486474 - Attachment is obsolete: true
Attachment #486674 - Attachment is obsolete: true
Attachment #486687 - Attachment is obsolete: true
Attachment #489328 - Flags: review?(dmandelin)
Comment on attachment 489328 [details] [diff] [review]
new patch

I just gave it a light review since I had already reviewed it and you checked out the stack size issue. Looks good.
Attachment #489328 - Flags: review?(dmandelin) → review+
http://hg.mozilla.org/tracemonkey/rev/d5257e1f0c8b
Whiteboard: [jsperf] → [jsperf] fixed-in-tracemonkey
Whiteboard: [jsperf] fixed-in-tracemonkey → [jsperf]
http://hg.mozilla.org/tracemonkey/rev/08794e076ded
Whiteboard: [jsperf] → [jsperf] fixed-in-tracemonkey

Comment 21

9 years ago
http://hg.mozilla.org/mozilla-central/rev/d5257e1f0c8b
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
I'm reopening this bug, since the patches that actually landed here, whatever they do, don't fix it (as one can trivially see by trying a Nov 23 nightly, or a current trunk nightly, on the testcase).  I still get 325ms or so, unless I disable methodjit or jitprofiling, in which case I get 100ms.

I'm not sure what should happen with the blocking+ flag here: leave as 8, or move to 9?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [jsperf] fixed-in-tracemonkey → [jsperf]
I landed the lower-risk stuff. This includes some new assertions and bug fixes:

http://hg.mozilla.org/tracemonkey/rev/4d66cf4385f0
http://hg.mozilla.org/tracemonkey/rev/cb3f0ac5494f

More to come later, I hope.

Comment 25

9 years ago
Blocking beta8+? Probably it should be beta9+. Beta8 was released on Dec 21..
Eddward: please do read the bug before commenting.  Comment 23 paragraph two in particular.
blocking2.0: beta8+ → betaN+
http://hg.mozilla.org/tracemonkey/rev/94c6d2a90ae6

This is now running in ~100ms for me.
Whiteboard: [jsperf] → [jsperf] [fixed-in-tracemonkey]
Yeah, looks like at this point -mjp is only about a 10% slowdown over -j here; that's reasonable.  Thanks!

Updated

9 years ago
Whiteboard: [jsperf] [fixed-in-tracemonkey] → [jsperf] [fixed-in-tracemonkey][hardblocker]
http://hg.mozilla.org/mozilla-central/rev/94c6d2a90ae6
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.