Closed Bug 665815 Opened 9 years ago Closed 8 years ago

TI: raytrace much slower with -n

Categories

(Core :: JavaScript Engine, defect)

Other Branch
defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: azakai, Assigned: bhackett)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 2 obsolete files)

Attached file raytrace benchmark
Attached is a raytrace benchmark (compiled from c++, and after running closure compiler). It runs much slower on JM with -n than without:

TM -m -j -p      3.841
JM -m -j -p      3.654
JM -m -j -p -n  13.582

TM and JM are very similar without -n, but with -n JM becomes very slow.

All other testcases I tried have shown -n to be much faster on compiled code than without -n, this is the only exception I have found.
Looked a bit at this, the problem is with this huge "function Fa" at line 130. What happens is that we end up in the interpreter (via the interpoline) and we can't re-enter JM since CanMethodJITAtBranch aborts because RecursiveMethodJIT returns true. RECURSIVE_METHODJIT_LIMIT is 10 at the moment, if I change it to 12 we're faster with -n:

js -m -n :  2.481 (with limit 12)
d8       :  3.501
js -m    :  5.685
js -m -n : 17.109 (with limit 10)

RecursiveMethodJIT has this comment ":XXX: should be more robust", now is probably the time to do so ;)
The limit is there because of this stupid behavior we have where if the same stack frame keeps compiling and then getting kicked into the interpoline, we will keep building up VMFrames and eventually exhaust the stack.  What we should do is ensure that each stack frame is the entry frame for at most one VMFrame, should not be too hard.  I should be back and working sometime Thursday, will look then.
Attached patch patch (obsolete) — Splinter Review
Fix for blowing the C stack with lots of Interpret and JaegerShot frames when we keep switching between the interpreter and mjit due to recompilation.  JaegerShot is not necessarily required to finish the entry frame, but can instead finish up at a valid interpreter state anywhere within its entry frame.  Interpret() watches for this when calling JaegerShot at loop boundaries or when making an inline call (one not through Invoke, also this needs a new term now that the JITs do real inlining) and picks up where the JIT left off.

For any series of inline frames, there will be at most one Interpret invocation and at most two JaegerShot invocations.  (Two Jaegershot activations are possible when Invoke calls the JIT directly without going through Interpret; after recompiling the JIT may end up calling Interpret to finish the frame, which can call JaegerShot again).

One effect of this is that when recompiling a script we throw not just that script's frames into the interpreter, but also all of its down frames associated with the same VMFrame.  Those down frames will go back onto the JIT if they hit loop heads again, so this doesn't matter much (no SS effect).
Assignee: general → bhackett1024
Attachment #543805 - Flags: review?(luke)
Comment on attachment 543805 [details] [diff] [review]
patch

Review of attachment 543805 [details] [diff] [review]:
-----------------------------------------------------------------

Sweet, glad to see this getting fixed.

::: js/src/jsinterp.cpp
@@ +2442,5 @@
> +             * the new frame and its return value offset by one value from
> +             * their placement in a direct call. When popping the frame, we
> +             * need to watch for this and correct the return value and sp.
> +             */
> +            if ((op == JSOP_FUNCALL || op == JSOP_FUNAPPLY) && cx->typeInferenceEnabled()) {

Holy moly this is hacky.  There must be a less hacky way -- something that is "local" to the call/apply optimization.  Ideally the compiler would make it so that errors were handled correctly and, failing that, js_InternalInterpret, which already has cases for each stage of LOWERED_CALL/APPLY.

::: js/src/methodjit/MethodJIT.cpp
@@ +883,3 @@
>      fp->scopeChain();
>      if (fp->isFunctionFrame() && fp->script()->usesArguments)
>          fp->ensureCoherentArgCount();

I thought we talked about removing both of these (end of bug 657412 comment 28)...

@@ +914,5 @@
> +            ? JSINTERP_SKIP_TRAP
> +            : JSINTERP_REJOIN;
> +        ok = Interpret(cx, fp, mode);
> +        fp->putActivationObjects();
> +        fp->returnValue();

The enclosing ContextStack pop operation will handle putActivationObjects.  To avoid confusion from the markActivationObjectsAsPut below, it seems like you could just return early (after asserting 'fp == cx->fp()').  Also I'm not sure what purpose 'returnValue' is serving here, so perhaps it can go too

::: js/src/methodjit/MethodJIT.h
@@ +387,5 @@
>      JSC::ExecutableAllocator *execAlloc_;    // allocator for jit code
>      Trampolines              trampolines;    // force-return trampolines
>      VMFrame                  *activeFrame_;  // current active VMFrame
> +    JaegerStatus             lastUnfinished_;// result status of last VM frame,
> +                                             // if unfinished

Can we just have JaegerTrampoline return the JaegerStatus directly instead of adding a global state?  The only interesting lastUnfinished_ values flow out of js_InternalInterpret which only gets called from specific trampolines, so this seems doable.
Attachment #543805 - Flags: review?(luke)
> ::: js/src/jsinterp.cpp
> @@ +2442,5 @@
> > +             * the new frame and its return value offset by one value from
> > +             * their placement in a direct call. When popping the frame, we
> > +             * need to watch for this and correct the return value and sp.
> > +             */
> > +            if ((op == JSOP_FUNCALL || op == JSOP_FUNAPPLY) && cx->typeInferenceEnabled()) {
> 
> Holy moly this is hacky.  There must be a less hacky way -- something that
> is "local" to the call/apply optimization.  Ideally the compiler would make
> it so that errors were handled correctly and, failing that,
> js_InternalInterpret, which already has cases for each stage of
> LOWERED_CALL/APPLY.

Yeah, felt pretty dirty writing this but the basic problem is that there is no real way to localize this to the call/apply optimization: recompilation of a frame deep inside the call/apply could happen, the recompiler and InternalInterpret would know nothing about the lowered frame on the stack, and the frame ends up getting popped by the interpreter which was not aware of the optimization either.  The fundamental issue is that wherever possible the method JIT should leave a valid interpreter state; that doesn't hold here, but I don't see a way for the optimization to leave a valid state without creating worse problems.

That said, I'll work on making this cleaner with a flag on the callee frame indicating it was from a lowered call/apply, which the interpreter can inspect when popping it.  That needs more changes on the mjit's call path, but shouldn't slow things down (the caller is responsible for writing the callee's flags).
(In reply to comment #4)
> > +        fp->putActivationObjects();
> > +        fp->returnValue();
> 
> The enclosing ContextStack pop operation will handle putActivationObjects. 
> To avoid confusion from the markActivationObjectsAsPut below, it seems like
> you could just return early (after asserting 'fp == cx->fp()').  Also I'm
> not sure what purpose 'returnValue' is serving here, so perhaps it can go too

Both the putActivationObjects and returnValue calls are necessary because of the different ABI between JaegerTrampoline() and Interpret().  mjit frames always produce the return value and JaegerTrampoline always writes them to fp->rval without setting the HAS_RVAL flag, and will also put the activation objects without unsetting the HAS_ARGS_OBJ or HAS_CALL_OBJ flags.

In contrast, Interpret() does not always set fp->rval and does not put the activation objects of its entry frame.

We set HAS_RVAL below (so the caller doesn't think the rval is undefined), and mark the activation objects as put (so the caller doesn't double put).  For this to work after calling Interpret the frame needs to be updated to match this state.
Attached patch bigger, cleaner patch (obsolete) — Splinter Review
This adds a LOWERED_CALL_APPLY flag for lowered call/apply frames, which is passed around when constructing call frames.  Generalizes MaybeConstruct to InitialFrameFlags to capture this info, so complexity doesn't increase much but a fair amount of code is touched.

This fixes the frame coherency stuff in EnterMethodJIT (sorry, forgot to do that earlier), but keeps the lastUnfinished() thing in JaegerCompartment.  It's a little hard to transmit this state in the Interpoline's return path because InternalInterpret can also return a code pointer, and I'd like to avoid changing the 7 or 8 interpolines we have now (not all of which work the same way) when this bit of crud is pretty well isolated.  Can fix the next time we need to make a more significant change to all the interpolines.
Attachment #543805 - Attachment is obsolete: true
Attachment #544351 - Flags: review?(luke)
(In reply to comment #6)
> Both the putActivationObjects and returnValue calls are necessary because of
> the different ABI between JaegerTrampoline() and Interpret().

I was aware of the difference; my point was that it shouldn't matter: our callers have to deal with the fact that RunScript might called either JaegerShot or Interpret so they already handle both cases (activation objects put or no).

> We set HAS_RVAL below (so the caller doesn't think the rval is undefined),

'returnValue' doesn't set the bit, it just initializes fp->rval_.  However, on return from Interpret, fp->rval_ and the HAS_RVAL bit should be in a coherent state.

> and mark the activation objects as put (so the caller doesn't double put). 

that's why I said return early

Just to test whether I'm full of it I made this change:

diff --git a/js/src/methodjit/MethodJIT.cpp b/js/src/methodjit/MethodJIT.cpp
@@ -901,18 +901,18 @@ mjit::EnterMethodJIT(JSContext *cx, Stac
         ok = Interpret(cx, fp, mode);
-        fp->putActivationObjects();
-        fp->returnValue();
+        JS_ASSERT(fp == cx->fp());
+        return ok ? Jaeger_Returned : Jaeger_Throwing;
     }

and jit_tests.py --jitflags=mn seems to pass.  This isn't exactly a proof, but can you come up with a counter-example?
(In reply to comment #8)
> 'returnValue' doesn't set the bit, it just initializes fp->rval_.  However,
> on return from Interpret, fp->rval_ and the HAS_RVAL bit should be in a
> coherent state.

I was referring to the markReturnValue() call, which sets the bit without writing fp->rval_.

> Just to test whether I'm full of it I made this change:
> 
> diff --git a/js/src/methodjit/MethodJIT.cpp b/js/src/methodjit/MethodJIT.cpp
> @@ -901,18 +901,18 @@ mjit::EnterMethodJIT(JSContext *cx, Stac
>          ok = Interpret(cx, fp, mode);
> -        fp->putActivationObjects();
> -        fp->returnValue();
> +        JS_ASSERT(fp == cx->fp());
> +        return ok ? Jaeger_Returned : Jaeger_Throwing;
>      }
> 
> and jit_tests.py --jitflags=mn seems to pass.  This isn't exactly a proof,
> but can you come up with a counter-example?

You should get test failures with --jitflags=mna (there is much less recompilation activity with --jitflags=mn).

I'll add the premature return, though I don't understand the strong preference for it (though it would be nice if Interpret and JaegerTrampoline behaved exactly the same way and none of this was necessary).
Great job with the frame flag solution to the lowered-call/apply business!  Yeah, thinking a bit more about it, I think you are right that the only reasonable solution is to special-case it in the interp.  Plus, this frame flag can be used in StackIter.

Btw, looking at StackIter just now, I see an #if 0 around the js_ReconstructStackDepth assertion... any plan to remove that?  If there are short intervals where the stack is incoherent you can silence it with cx->stackIterAssertionEnabled.  Should I file that separately or would you?

One more general comment: it seems like the only point in the code that needs the generality of InitialFrameFlags is in the UncachedCall mjit code and pushInlineFrame (which UncachedCall calls).  With that understanding, I think what you want to do to avoid generalizing most of the code unnecessarily is to keep using MaybeConstruct everywhere except in these few places where INITIAL_LOWERED can actually flow.  Then, use a simple injection function:
  InitialFrameFlags ToInitialFrameFlags(MaybeConstruct)
where a MaybeConstruct value flows into an InitialFrameFlags parameter (I think the only place this will be is when JSOP_CALL calls pushInlineFrame).

That should shrink the patch and keep this lowered business mostly out of the way.
(In reply to comment #9)
> You should get test failures with --jitflags=mna (there is much less
> recompilation activity with --jitflags=mn).

No failures with --jitflags=mna

> I'll add the premature return, though I don't understand the strong
> preference for it

b/c its strictly redundant code (not caring about perf, but from a logical perspective so the reader doesn't have to ask, as I did, "why is this necessary?").

> (though it would be nice if Interpret and JaegerTrampoline
> behaved exactly the same way and none of this was necessary).

David and I have a plan (he considers it blocking IonMonkey) to remove putActivationObjects altogether, that will tidy up half of this sad business.  The return-value business just stinks altogether and probably we can just clean it up by design in IM by changing the calling convention.
Attachment #544351 - Flags: review?(luke)
(In reply to comment #10)
> Btw, looking at StackIter just now, I see an #if 0 around the
> js_ReconstructStackDepth assertion... any plan to remove that?  If there are
> short intervals where the stack is incoherent you can silence it with
> cx->stackIterAssertionEnabled.  Should I file that separately or would you?

Argh, before merging to TM I disabled this and pushed to try to test if it was causing the Windows debug timeouts (since those timeouts are debug-only, came in on a merge from TM and this assert had slowed down other stuff I was suspicious).  It wasn't, and I thought I removed the #if 0 but apparently not.  Go ahead and remove the '#if 0' (or I can).  Sorry!
(In reply to comment #11)
> (In reply to comment #9)
> > You should get test failures with --jitflags=mna (there is much less
> > recompilation activity with --jitflags=mn).
> 
> No failures with --jitflags=mna

Weird, I was getting failures (figured I would need to move those calls in from the tail of InternalInterpret, which didn't have the premature-return option, but ran without them to double check they were actually needed).  Not to harp on it but the reason they are needed is, again, the ABI.  After the Interpret call fp->rval_ may not be coherent (if HAS_RVAL is unset), and after we do markReturnValue the bit is set and the frame will look to the caller like it is returning trash.
(In reply to comment #11)
> David and I have a plan (he considers it blocking IonMonkey) to remove
> putActivationObjects altogether, that will tidy up half of this sad
> business.  The return-value business just stinks altogether and probably we
> can just clean it up by design in IM by changing the calling convention.

Anything filed, or want to talk about it in person / on IRC?  This interacts pretty heavily with the reentrant closures stuff, which generalizes putActivationOUbjects into fp->functionPrologue.
Attached patch patch v3Splinter Review
This still uses InitialFrameFlags everywhere, UncachedInlineCall can still access most of the call machinery so the only place where a lowered function frame cannot be pushed is Invoke.  Also good to avoid fragmentation from having another enum and another enum value aliased to StackFrame::CONSTRUCTING, and I suspect in the future we will want to be able to set other initial flags on Invoke frames.
Attachment #544351 - Attachment is obsolete: true
Attachment #544573 - Flags: review?(luke)
(In reply to comment #13) 
> Not to harp on it but the reason they are needed is, again, the ABI.  After
> the Interpret call fp->rval_ may not be coherent (if HAS_RVAL is unset), and
> after we do markReturnValue the bit is set and the frame will look to the
> caller like it is returning trash.

But markReturnValue doesn't get called when you return early... your patch is doing all this right so I have no complaints... so what are you harping on?  If you are explaining why those calls were necessary when you *didn't* return early, then don't worry, I got that -- in my original comment, even.

(In reply to comment #14)
> Anything filed, or want to talk about it in person / on IRC?  This interacts
> pretty heavily with the reentrant closures stuff, which generalizes
> putActivationOUbjects into fp->functionPrologue.

At the moment there is just an informal plan (in particular, how to handle unexpected evalInFrame) bit it would be great to know how it interacts with your stuff and see if you are thinking along the same lines.

(In reply to comment #16)
> This still uses InitialFrameFlags everywhere, UncachedInlineCall can still
> access most of the call machinery so the only place where a lowered function
> frame cannot be pushed is Invoke.  Also good to avoid fragmentation from
> having another enum and another enum value aliased to
> StackFrame::CONSTRUCTING, and I suspect in the future we will want to be
> able to set other initial flags on Invoke frames.

I know its mostly just for the benefit of Invoke, but Invoke is a rather central piece of the JS engine so I'd like to keep it cleaner than average.  In particular, from an interface perspective, MaybeConstruct makes sense ("is this Invoke constructing?") and InitialFrameFlags does not ("frame? flags?").  If new frame flags are needed in the future, I'd like to approach it by separately considering interface changes to Invoke and representation changes in StackFrame::initCallFrame.  I know I'm being fussy here, but could you pretty please just make the minimal change to keep MaybeConstruct in js::Invoke's interface?  (Define MaybeConstruct right above js::Invoke in jsinterp.h and set CONSTRUCT = INITIAL_CONSTRUCT and NO_CONSTRUCT = INITIAL_NONE.)
Comment on attachment 544573 [details] [diff] [review]
patch v3

Review of attachment 544573 [details] [diff] [review]:
-----------------------------------------------------------------

With those and these nits fixed, looks great.

::: js/src/jsinferinlines.h
@@ +395,5 @@
> +UseNewTypeAtEntry(JSContext *cx, StackFrame *fp)
> +{
> +    return fp->isConstructing() && cx->typeInferenceEnabled() &&
> +        fp->prev() && fp->prev()->isScriptFrame() &&
> +        UseNewType(cx, fp->prev()->script(), fp->prev()->pcQuadratic(cx->stack, fp));

style is to align 'fp->prev()' with 'fp->isConstructing()'.

::: js/src/methodjit/InvokeHelpers.cpp
@@ +442,5 @@
>      f.script()->types.monitor(f.cx, f.pc(), args.rval());
>  }
>  
>  void
> +stubs::UncachedCallHelper(VMFrame &f, uint32 argc, bool lowered, UncachedCallResult *ucr)

Seems like it would be simpler below to just take an InitialFrameFlags arg.

::: js/src/vm/Stack-inl.h
@@ -142,5 @@
>                             HAS_SCOPECHAIN |
>                             HAS_ANNOTATION |
>                             HAS_HOOK_DATA |
> -                           HAS_CALL_OBJ |
> -                           HAS_ARGS_OBJ |

nice

@@ +284,5 @@
>  StackFrame::numActualArgs() const
>  {
> +    /*
> +     * args.nactual is always coherent, except for method JIT frames where the
> +     * callee does not access its arguments. The JIT requires that all frames

"except where the callee does not access its arguments" + " and the number of formal arguments matches the number of actual arguments".

@@ +287,5 @@
> +     * args.nactual is always coherent, except for method JIT frames where the
> +     * callee does not access its arguments. The JIT requires that all frames
> +     * which do not have an arguments object and use their arguments have a
> +     * coherent args.nactual (even though the below code may not use it), as
> +     * JIT code may access the field directly.

Really?  I thought we only set it if FixupArity is called which only happens if there is an arity mismatch.  Also, where does args.nactual get accessed directly from JIT code?

::: js/src/vm/Stack.h
@@ +815,5 @@
>       *
> +     * The method JIT requires that HAS_SCOPECHAIN be set for all frames which
> +     * use NAME or related opcodes that can access the scope chain (so it does
> +     * not have to test the bit). To ensure this, we always instantiate the
> +     * scope chain when pushing frames in the VM, and only instantiate it when

s/instantiate/initialize/
Attachment #544573 - Flags: review?(luke) → review+
(In reply to comment #18)
> Really?  I thought we only set it if FixupArity is called which only happens
> if there is an arity mismatch.  Also, where does args.nactual get accessed
> directly from JIT code?

This used to be the case, but with the lazy arguments stuff we now access args.nactual directly from jitcode (for JSOP_LENGTH and for bounds checking lazy args accesses), and in scripts which have/use lazy arguments we always set it during the prologue if there was no argc mismatch.
(In reply to comment #19)
Thanks. Aha, I see that I was looking at generatePrologue in mozilla-central :P
hg.mozilla.org/projects/jaegermonkey/rev/4bb2b60db2e2

Will pull to TM after checking stability.
Whiteboard: fixed-in-jaegermonkey
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Whiteboard: fixed-in-jaegermonkey
You need to log in before you can comment on or make changes to this bug.