Closed Bug 587707 Opened 10 years ago Closed 10 years ago

JM: Inline frame creation

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking


RESOLVED FIXED
Tracking Status
blocking2.0 --- -
status2.0 --- wanted

People

(Reporter: dvander, Assigned: bhackett1024)

References

Details

Attachments

(2 files, 8 obsolete files)

With bhackett's work in bug 586533 we can start moving pieces of CreateLightFrame into method headers.
Attached patch gutted call path (obsolete) — Splinter Review
This patch adds an inline call path, jury-rigs it in using the native call MIC, and strips out pretty much everything that isn't necessary to run the microbenchmark below.  Missing pieces that need to go back before things work include the stack overflow check, and writes to ncode/savedPC/cx->fp (maybe others).

For this microbenchmark:

var start = new Date();

function bar() { var a; }
function foo() {
  for (var ind = 0; ind < 1200000; ind++)
    bar();
}
foo();

var end = new Date();
print("Time: " + (end - start));

I get these times:

V8: 6.8ms
JSC: 7.9ms
JM (old): 36ms
JM (new): 9.6ms

If I replace the call with a GETGNAME I get about the same perf difference with JSC, so I think the baseline cost of a call is in line here.
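For reference, a hypothetical sketch of that GETGNAME variant of the microbenchmark above (the global `g` and the loop body are my own illustration, not from the actual test that was run):

```javascript
// Hypothetical GETGNAME variant: the scripted call to bar() is replaced
// by a read of the global 'g', isolating loop overhead from call overhead.
var g = 0;
function fooName() {
  var x;
  for (var ind = 0; ind < 1200000; ind++)
    x = g;                  // GETGNAME instead of a scripted call
  return x;
}

var start = new Date();
var result = fooName();
var end = new Date();
// The shell benchmark above uses print(); console.log is the node equivalent.
(typeof print === "function" ? print : console.log)("Time: " + (end - start));
```

Comparing this loop's time against the call loop gives the call cost net of loop overhead.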
Assignee: general → bhackett1024
Attached patch more gutting (obsolete) — Splinter Review
This removes several additional things.

References to f.scriptedReturn are removed (not sure how this interacts with bug 587833).  All loads/stores of the return site are through fp->ncode, and scripted call/return are implemented as jumps.

Entry frame return is not special cased.  fp->ncode for the initial frame points to a JaegerTrampolineReturn symbol.

f.fp and f.cx->fp are not maintained for scripted calls, but are written out when doing a stub or native call (the number of scripted/native calls will stay the same, the number of stub calls will trend down).  It would be nice to only need to write one of these.

Still just working on microbenchmarks, and still missing the stack limit check.  Time for the loop in comment 1 is cut to 7.5ms, which, given loop overhead, means faster calls than JSC and a similar cost to V8.
Attachment #467269 - Attachment is obsolete: true
(In reply to comment #3)
> This is very pleasing to me.

You, me, the trees, and the bees -- pretty much everyone. Oh, and the JS Hobbits too, down on the JS Shire. Don't forget them ;-).

Yay, bhackett!!

/be
The data above show a current slowdown (relative to JSC) of 23 ms per 1M calls.  I'm not sure what machine that is; it could be faster than AWFY.  Anyway, we make 1.26M scripted calls in SunSpider, so we should be able to get 30ms or so from this.
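A quick back-of-envelope sketch of that arithmetic, using the figures from this comment (the variable names are mine):

```javascript
// Scale the measured per-call slowdown vs JSC to SunSpider's call count.
var msPerCall = 23 / 1e6;            // 23 ms slowdown per 1M calls
var sunSpiderCalls = 1.26e6;         // scripted calls made in SunSpider
var estimateMs = msPerCall * sunSpiderCalls;
// estimateMs comes out just under 29 ms, i.e. "30ms or so"
```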
This data is from my macbook, which is consistently 20-25% faster than the AWFY machine.  I think we should be able to get a 35-40ms total improvement on AWFY for calls.  This depends on scripted call ICs landing (bug 587698), then some pretty straightforward work to strip the rest of the call/return path down.  I'll add dependencies to this bug for the things that need to happen.
Depends on: 589193, 588978
Depends on: 588993
This fixes the gutted stack frame so that 'this' and return values work (including via SETRVAL/POPV), and savedPC is lazily computed using the callee's ncode.  For scavenging after bug 587698 lands.
Attachment #467400 - Attachment is obsolete: true
Attached patch patch fixing test failures (obsolete) — Splinter Review
jstests and trace-tests pass with this, except a couple tests stressing stack checks and MIC GC safety (which will be fixed by rebasing on 587698).
Attachment #468859 - Attachment is obsolete: true
blocking2.0: --- → ?
This bug overlapped somewhat with bug 539144 and bug 587698.  After those landed, I filed bug 595073 and bug 595074 to deal with specific pieces that remained.  I didn't pore over the patch, though -- are there any other pieces left to land?
I've been working on a patch that does several things from this patch, including both bug 595073 and bug 595074: fix rval, fix savedpc (which becomes callerpc, unlike in this patch, so it only needs to be written once per frame), turn calls into jumps, and only store f.regs.fp on calling into a native or stub.  These are the bits which can be done with no loss to the VM; I should have a working-for-x86 patch later today.

More interesting is what to do about making fp->fun (aka exec), fp->argc, fp->scopeChain, and even the stack writes of the function and 'this' value lazy.  *All* of these can be skipped in the common cases of CALLGNAME, or a CALLPROP which goes through a prototype, by embedding into the JIT code the assumption that methods on singleton objects (global objects, prototype objects) do not get overwritten, backed by write barriers on those objects (with a MIC flush and active-stack fixup if a method *is* overwritten).  A CALLGNAME + CALL would then require only the stack overflow check, writes to fp->flags, fp->prev, and fp->ncode, and a few direct jumps to patch things together.  The problem is that making things like exec/scopeChain lazy hurts the VM.  I looked into where these were being accessed in SS/V8 and found three categories.

- Heavyweight functions and NAME/etc. opcodes, related to eval.  Not much to do about these, but this is the smallest category.

- Making instances of builtin classes (Array, Object, ...).  js_GetClassPrototype first goes to the scope chain of the topmost frame, before looking at cx->globalObject.  I don't understand why this is needed; comments on cx->globalObject suggest it should == cx->fp()->scopeChain()->getGlobal().

- Calling Invoke (biggest category).  IIRC most of the SS ones are in string-unpack-code, and should be addressed by bug 581893.  There are a few million Invokes between v8-deltablue and v8-raytrace, in deltablue mostly because of slow paths being taken on calls.

I'd like to either fix the latter two cases or know they're fixable before cutting down the call path per the above, need to poke around some more.
>  Making instances of builtin classes (Array, Object, ...). 

For these, in compile-and-go scripts, we should be able to special-case and bake in the global object.  There are cases where cx->globalObject != cx->fp()->scopeChain()->getGlobal(); Brendan or Luke could tell you which.
Sounds good.  Incidentally, the fp->args.nactual write is already skipped in the common case when argc == fun->nargs.  As for lazifying the scopeChain write, is your plan significantly different than bug 593882 comment 1?  Also, what is your plan to skip fp->fun?  It can easily be rematerialized from fp->formalArgs()[-2], but formalArgs() is computed using fun->nargs.  Recompute from the pc?
(In reply to comment #12)
> Sounds good.  Incidentally, the fp->args.nactual write is already skipped in
> the common case when argc == fun->nargs.  As for lazifying the scopeChain
> write, is your plan significantly different than bug 593882 comment 1?  Also,
> what is your plan to skip fp->fun?  It can easily be rematerialized from
> fp->formalArgs()[-2], but formalArgs() is computed using fun->nargs.  Recompute
> from the pc?

Yes, the plan for scopeChain would be the same as described in bug 593882 comment 1 (the patch in this bug already does what's described there, including making sure frames with JSOP_NAME/etc. opcodes always have scope chains).

fp->fun would be skipped not by trying to rematerialize it from fp->formalArgs()[-2], but from the caller's MIC information.  When you need the frame's function and it's not there, it must have been invoked from a monomorphic call site.  You can find the corresponding callee by doing a binary search of the prev frame's MICs for the current frame's ncode (in the same way this patch recovers the savedpc).  This would allow us to avoid writing both fp->fun and fp->formalArgs()[-2].

fp->formalArgs()[-1] does not have to be written when it is NULL (CALLGNAME etc.) and we know from an IC that the callee doesn't have any JSOP_THIS opcodes.

Again, the problem with all this is that making recovery of the function from a frame involve a binary search will only be a net speedup if the function is asked for very rarely.
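A plain-JavaScript sketch of that lookup; the MIC record shape here (a returnAddress/callee/pc triple) is a hypothetical stand-in for the real SpiderMonkey structures:

```javascript
// Hypothetical sketch: the caller's MICs, sorted by the return address
// (ncode) their call sites push, are binary-searched with the callee
// frame's ncode.  A hit identifies the monomorphic call site, and with
// it the callee function and the caller's pc.
function findCallSite(mics, ncode) {
  var lo = 0, hi = mics.length - 1;
  while (lo <= hi) {
    var mid = (lo + hi) >> 1;
    if (mics[mid].returnAddress === ncode)
      return mics[mid];              // { returnAddress, callee, pc }
    else if (mics[mid].returnAddress < ncode)
      lo = mid + 1;
    else
      hi = mid - 1;
  }
  return null;                       // ncode not from a monomorphic site
}
```

The cost tradeoff in the comment above falls out directly: each lookup is O(log n) in the number of the caller's MICs, versus a single load when fp->fun is written eagerly.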
(In reply to comment #13)
> (the patch in this bug already does what's described there, including
> making sure frames with JSOP_NAME/etc. opcodes always have scope chains).

Oops, didn't read closely enough :/

> fp->fun would be skipped not by trying to rematerialize it from
> fp->formalArgs()[-2], but from the caller's MIC information.

Cool!  So far I've had good luck trading call-path speed for fp-member access, but I've been adding branches, not binary searches ;-)

One idea that I talked about with dvander that we dismissed as too gross is stealing X bits from the flags to store fun->nargs (when, of course, fun->nargs < 2^X), thereby allowing fp->fun to be recovered from fp->formalArgs()[-2].  This of course doesn't let you avoid storing the callee to the stack as you suggested, but it might be a useful compromise if the binary search is too expensive.
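A minimal sketch of that compromise in plain JavaScript; the shift, field width, sentinel, and helper names are all hypothetical, chosen only so the nargs field sits above the JSFRAME_* flag bits quoted elsewhere in this bug:

```javascript
// Hypothetical packing: steal X high bits of the frame flags word to
// cache fun->nargs when it fits, so fp->fun could later be recovered
// via fp->formalArgs()[-2] without consulting fun->nargs.
var NARGS_SHIFT = 24;                        // flag bits occupy the low word
var NARGS_BITS = 4;                          // X = 4 in this sketch
var NARGS_SENTINEL = (1 << NARGS_BITS) - 1;  // marks "nargs didn't fit"

function packNargs(flags, nargs) {
  var field = nargs < NARGS_SENTINEL ? nargs : NARGS_SENTINEL;
  return flags | (field << NARGS_SHIFT);
}

function unpackNargs(packed) {
  var field = (packed >>> NARGS_SHIFT) & NARGS_SENTINEL;
  return field === NARGS_SENTINEL ? -1 : field;  // -1: recover another way
}
```

When the field holds the sentinel, the VM would fall back to a slower recovery path (e.g. the MIC search described above), so the scheme only pays off for small arities.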
Attached file WIP combo patch (obsolete) —
This has the rval/savedpc/fp/calls-are-jumps fixes mentioned in comment 10.  Passes tests and trace-tests, x86 only.  At AWFY scale, this looks to be worth about 4ms on JM SS, 90ms on JM V8, 20ms on JM+TM V8.
Attachment #469713 - Attachment is obsolete: true
4ms is 1/15 of how much we're behind on SunSpider on AWFY.  Sounds pretty good to me!  20ms is 1/8 of how much we're behind Nitro on V8, for that matter.
Attached patch x64 patch (obsolete) — Splinter Review
This works in the shell on OS X for both x86 and x64.  I've modified the Windows and SUN versions to suit, though I haven't tested them.  The try server will test Windows; what about SUN?  What should I do about ARM?  I don't know how its instruction set works and don't have a machine to test things on.
Attachment #477333 - Attachment is obsolete: true
Attachment #477503 - Flags: feedback?(dvander)
Cc'ing ginn.chen for Sun help. It's not a tier 1 platform, so we sometimes break it but take fixes as soon as they come in.

/be
In TrampolineSUNWX86.s,

+/ void JaegerTrampolineReturn()
+.global JaegerTrampolineReturn
+.type   JaegerTrampolineReturn @function
+JaegerTrampolineReturn:
+    movl  %edx, 0x18(%ebx)
+    movl  %ecx, 0x1C(%ebx)

You missed a comma, should be
+.type   JaegerTrampolineReturn, @function

Besides this typo, it passed tests on Solaris x86.
I didn't try x64, because it is blocked by another bug, we can fix it later.

Thanks!
A few small comments:

Is it valid for FrameRegsIter::operator++ to use fp->prevPC()?  It seems like it could be iterating over frames pushed by the mjit.

The idea behind the initCallFrame{CallerHalf,EarlyPrologue,LatePrologue} trio is that these functions say in C++ what the generated mjit code does.  Since the generated prologue doesn't set fp->prevpc_, it seems like initCallFrameCallerHalf shouldn't either, if possible.

For consistency, could you make pc lowercase in prevPC and setPrevPC?
Comment on attachment 477503 [details] [diff] [review]
x64 patch

Only high-level comment is that emitReturn() has a lot of duplication and complexity now. Especially in that deep if-chain that does register shuffling. We should try to simplify that.
Attachment #477503 - Flags: feedback?(dvander) → feedback+
Attached patch working patch (obsolete) — Splinter Review
This patch is green on try, working for all platforms it tests.  Does that include Windows x64?  If not, is there a way to test that?  (I can't get in to the jswindows box.)
Attachment #477503 - Attachment is obsolete: true
Attachment #478245 - Flags: review?(lw)
Attachment #478245 - Flags: review?(dvander)
Attached patch patch fixing comments (obsolete) — Splinter Review
This fixes issues from comments 20 and 21.
Attachment #478245 - Attachment is obsolete: true
Attachment #478265 - Flags: review?(lw)
Attachment #478265 - Flags: review?(dvander)
Attachment #478245 - Flags: review?(lw)
Attachment #478245 - Flags: review?(dvander)
> Does that include Windows x64?

No.
Win64 isn't producing builds right now as it is, so with no try coverage and no builds, I don't think you should especially worry about it tbh.  (Though we should make sure you can get into the jswindows box!)

What do the numbers look like for this patch?
I haven't tested this patch, but the numbers should be the same as comment 15.  The ncode changes enable further optimizations which can cut the call path down to the stack overflow check plus 3 writes in the common case (a direct global function call).
blocking2.0: ? → -
status2.0: --- → wanted
Comment on attachment 478265 [details] [diff] [review]
patch fixing comments

Mmm, I like the new JSStackFrame::setPrev.

> FrameRegsIter::FrameRegsIter(JSContext *cx)
> {
>+    this->cx = cx;

Perhaps

 : cx(cx)

instead?

>+    /* Compute the next frame if it wasn't supplied. */
>+    if (!next) {
>+        next = regs->fp;
>+        while (next->prev_ != this) {
>+            JS_ASSERT(next->prev_);
>+            next = next->prev_;
>+        }
>+    }

If 'this' can be *any* frame in the context, then this linear search will not find frames that have been set aside with JS_SaveFrameChain or eval-not-in-frame or skipped with eval-in-frame.  (In general, a context contains a forest of trees of stack frames.)  FrameRegsIter/AllFramesIter also won't work because their linearization of said forest will introduce false 'next' relationships between frames.  Thus, I think you need something that explicitly deals with segments -- something like http://pastebin.mozilla.org/796357.
would *almost* work

Posting partial review now before my video drivers cause my thinkpad to restart...
Comment on attachment 478265 [details] [diff] [review]
patch fixing comments

> would *almost* work

Editing mistake in comment 27: remove the *almost* from "would *almost* work".

> inline void
> PutActivationObjects(JSContext *cx, JSStackFrame *fp)
> {
>     JS_ASSERT(fp->isFunctionFrame() && !fp->isEvalFrame());
> 
>+    if (!fp->hasActivationObjects())
>+        return;
>+
>+    /*
>+     * The method JIT always calls PutActivationObjects before returning,
>+     * and the interpreter may try to call it again.  This check ensures
>+     * we only try to put the objects once in a frame.
>+     */
>+    if (fp->hasPutActivationObjects())
>+        return;
>+    fp->setPutActivationObjects();
>+
>     /* The order is important as js_PutCallObject needs to access argsObj. */
>     if (fp->hasCallObj()) {
>         js_PutCallObject(cx, fp);
>-    } else if (fp->hasArgsObj()) {
>+    } else {
>         js_PutArgsObject(cx, fp);
>     }
> }

Instead of adding a new JSFRAME_PUT_OBJECTS flag, could PutActivationObjects (or js_PutCallObject/js_PutArgsObject) just unset JSFRAME_HAS_CALL_OBJ/JSFRAME_HAS_ARGS_OBJ?

>-    inline void initEvalFrame(JSScript *script, JSStackFrame *prev, uint32 flags);
>+    inline void initEvalFrame(JSScript *script,
>+                              JSStackFrame *prev, jsbytecode *prevpc, uint32 flags);

Could you put 'prev' and/or 'prevpc' on the same line as 'script'?

>+    jsbytecode *prevpc() {
>+        JS_ASSERT(flags_ & JSFRAME_HAS_PREVPC);
>+        return prevpc_;
>+    }

Could you add '&& prev_ != NULL'?

>-JSStackFrame::initEvalFrame(JSScript *script, JSStackFrame *downFrame, uint32 flagsArg)
>+JSStackFrame::initEvalFrame(JSScript *script,
>+                            JSStackFrame *prev, jsbytecode *prevpc, uint32 flagsArg)

Same indentation nit as above.

>-    flags_ = flagsArg | (downFrame->flags_ & (JSFRAME_FUNCTION |
>-                                              JSFRAME_GLOBAL |
>-                                              JSFRAME_HAS_CALL_OBJ));
>+    flags_ = flagsArg | JSFRAME_HAS_PREVPC |
>+        (prev->flags_ & (JSFRAME_FUNCTION |
>+                              JSFRAME_GLOBAL |
>+                              JSFRAME_HAS_CALL_OBJ));

Could you align '(prev->flags_' with 'flagsArg' and align JSFRAME_GLOBAL and HAS_CALL_OBJ with JSFRAME_FUNCTION?

>-    scopeChain_ = &downFrame->scopeChain();
>-    JS_ASSERT_IF(isFunctionFrame(), &callObj() == &downFrame->callObj());
>-    /* savedpc initialized by pushExecuteFrame */
>-    prev_ = downFrame;
>+    scopeChain_ = &prev->scopeChain();
>+    JS_ASSERT_IF(isFunctionFrame(), &callObj() == &prev->callObj());
>+
>+    setPrev(prev, prevpc);

This looks good: its nice to have prev_/prevpc_ initialized in the JSStackFrame::init* function with the rest of stack frame members as opposed to a special case in the StackSpace::push* function.  Previously, I put the prev_/prevpc_-initialization in StackSpace so that you could attempt to reason about savedpc_ locally, but, with your changes, I think it would be way simpler to just set prev_/prevp_ in the JSStackFrame::init* functions.  If its not too much trouble, do you think you could do this?  If not, I can change it later.

>@@ -1237,33 +1237,34 @@ obj_eval(JSContext *cx, uintN argc, Valu
>     JSStackFrame *callerFrame = (staticLevel != 0) ? caller : NULL;
>+    jsbytecode *callerpc = callerFrame ? callerFrame->pc(cx) : NULL;
>     if (!script) {
[snip]
>-                Execute(cx, scopeobj, script, callerFrame, JSFRAME_EVAL, vp);

You added an extra callerpc argument to Execute, presumably so that Execute wouldn't have to do prev->pc(cx), which is quite expensive if prev is an mjit frame.  However, (I think) the only remotely-hot call to Execute is from obj_eval (the rest pass prev == NULL or are eval-in-frame (debugger-only)), and here you do callerFrame->pc(cx) anyways.  Now, the hot eval case has callerFrame == cx->regs->fp, so callerFrame->pc(cx) should be cheap.  So is there any reason to still add this extra 'callerpc' argument to Execute if Execute can just do prev->pc(cx) at no loss of performance?

>           BEGIN_CASE(JSOP_RETURN)
>           {
>-            FrameEntry *fe = frame.peek(-1);
>-            frame.storeTo(fe, Address(JSFrameReg, JSStackFrame::offsetOfReturnValue()), true);
>-            frame.pop();
>-            emitReturn();
>+            emitReturn(true);
>           }
>           END_CASE(JSOP_RETURN)

can drop the { }

>           BEGIN_CASE(JSOP_RETRVAL)
>-            emitReturn();
>+            emitReturn(false);
>           END_CASE(JSOP_RETRVAL)

With 3 callsites, it may not be worth the trouble, but it would be more readable to change the bool parameter to an enum parameter a la emitReturn(RETURN_TOP_VALUE)/emitReturn(RETURN_FRAME_RVAL), or something.

>-    inline void assemble()
>+    inline DataLabelPtr assemble(void *ncode)

'inline' is redundant here

> static bool
>-InlineReturn(VMFrame &f, JSBool ok)
>+InlineReturn(VMFrame &f, JSBool ok, JSBool storeReturnValue)
> {
[snip]
>-    Value *newsp = fp->actualArgs() - 1;
>-    newsp[-1] = fp->returnValue();
>-    cx->stack().popInlineFrame(cx, fp->prev(), newsp);
>+    if (storeReturnValue) {
>+        Value *newsp = fp->actualArgs() - 1;
>+        newsp[-1] = fp->returnValue();
>+        cx->stack().popInlineFrame(cx, fp->prev(), newsp);
>+    }
> 
>     return ok;
> }

Is it valid to skip the popInlineFrame if !storeReturnValue?  Maybe I'm missing something, but it seems like this is an optimization, not required for correctness, in which case: is it worth the param/test if the only caller passing !storeReturnValue is RunTracer?

>+    void getValueComponents(const Value &val, RegisterID type, RegisterID payload) {

(After explaining why this is ok w/ doubles) dvander suggested a renaming to loadValueComponents for consistency.

And lastly, a question:

>     void fixScriptStack(uint32 frameDepth) {
>         /* sp = fp + slots() + stackDepth */
>         addPtr(Imm32(sizeof(JSStackFrame) + frameDepth * sizeof(jsval)),
>                JSFrameReg,
>                ClobberInCall);
> 
>         /* regs->sp = sp */
>         storePtr(ClobberInCall,
>-                 FrameAddress(offsetof(VMFrame, regs) + offsetof(JSFrameRegs, sp)));
>+                 FrameAddress(offsetof(VMFrame, regs.sp)));
>+
>+        /* regs->fp = fp */
>+        storePtr(JSFrameReg, FrameAddress(offsetof(VMFrame, regs.fp)));

IIUC, this wins if we call lots of functions that themselves never take a slow path.  OTOH, if most activations take >1 slow paths, this loses.  Arguably, if we are taking a lot of slow paths, we are losing anyways, so an extra store is dwarfed.  So the optimization makes sense, but I wanted to know if you have any measurements for the percent of activations that take 0, 1, >1 slow paths?
> Is it valid to skip the popInlineFrame if !storeReturnValue?  Maybe I'm missing
> something, but it seems like this is an optimization, not required for
> correctness, in which case: is it worth the param/test if the only caller
> passing !storeReturnValue is RunTracer?

This patch shifts the responsibility of popping the frame in JM from the callee to the caller; thus, we do not have to do a load/branch to check for the entry frame every time we return from a function.  The activation objects still need to be put because that is the callee's responsibility.  For the other calls to InlineReturn, the call site code which pops the frame will not execute (unwinding frames after an exception, interpreting an un-JIT'able method).

> >     void fixScriptStack(uint32 frameDepth) {
> >         /* sp = fp + slots() + stackDepth */
> >         addPtr(Imm32(sizeof(JSStackFrame) + frameDepth * sizeof(jsval)),
> >                JSFrameReg,
> >                ClobberInCall);
> > 
> >         /* regs->sp = sp */
> >         storePtr(ClobberInCall,
> >-                 FrameAddress(offsetof(VMFrame, regs) + offsetof(JSFrameRegs, sp)));
> >+                 FrameAddress(offsetof(VMFrame, regs.sp)));
> >+
> >+        /* regs->fp = fp */
> >+        storePtr(JSFrameReg, FrameAddress(offsetof(VMFrame, regs.fp)));
> 
> IIUC, this wins if we call lots of functions that themselves never take a slow
> path.  OTOH, if most activations take >1 slow paths, this loses.  Arguably, if
> we are taking a low of slow paths, we are losing anyways, so an extra store is
> dwarfed.  So the optimization makes sense, but I wanted to know if you have any
> measurements for the percent of activations that take 0, 1, >1 slow paths?

Previously, we had to do two memory accesses for the f.fp value on each call --- store the new one on a push, and restore the old one on a pop.  With this mechanism we write once on each stub and native call.  For SS we do about 1.25 million scripted calls, 900k native calls, and a few hundred thousand stub calls (I don't know how many), so this should help.  V8 has ~50 million scripted calls, very few native calls, and I don't know how many stub calls (hopefully not 100 million!).
(In reply to comment #29)
> > Is it valid to skip the popInlineFrame if !storeReturnValue?  Maybe I'm missing
> > something, but it seems like this is an optimization, not required for
> > correctness, in which case: is it worth the param/test if the only caller
> > passing !storeReturnValue is RunTracer?

Ah, I see.  It seems then that this parameter really deserves a good comment and better name.  Again, an enum instead of a bool param type would really help readability -- perhaps RETURN_LIKE_JIT_CODE / RETURN_LIKE_INTERP ?

> For SS we do about 1.25
> million scripted calls, 900k native calls, and a few hundred thousand stub
> calls (I don't know how many), so this should help.V8 has ~50 million
> scripted calls, very few native calls, and I don't know how many stub calls
> (hopefully not 100 million!).

Sounds good to me, thanks!
Attachment #478265 - Attachment is obsolete: true
Attachment #479117 - Flags: review?(lw)
Attachment #479117 - Flags: review?(dvander)
Attachment #478265 - Flags: review?(lw)
Attachment #478265 - Flags: review?(dvander)
Comment on attachment 479117 [details] [diff] [review]
patch fixing comments

Looks good to me.  Only nits:

>+    /* Get the frame whose prev() is fp, which may be in any segment. */
>+    inline JSStackFrame * computeNextFrame(JSStackFrame *fp);

JSStackFrame *computeNextFrame

>+JSStackFrame *
>+JSContext::computeNextFrame(JSStackFrame *fp)
>+{
>+    JSStackFrame *next = NULL;
>+    for (js::StackSegment *ss = currentSegment; ; ss = ss->getPreviousInContext()) {
>+        JS_ASSERT(ss);
>+        JSStackFrame *end = ss->getInitialFrame()->prev();
>+        for (JSStackFrame *f = ss->getCurrentFrame(); f != end; next = f, f = f->prev()) {
>+            if (f == fp)
>+                return next;
>+        }
>+        if (end != ss->getPreviousInContext()->getCurrentFrame())
>+            next = NULL;
>+    }
>+}

If ss is null, we'll crash cleanly/observably/unexploitably on the next line; I've been told the style is to leave off non-null asserts in such cases.

> JS_REQUIRES_STACK inline
> FrameRegsIter::FrameRegsIter(JSContext *cx)
>+    : cx(cx)

SM style is two spaces before the :

>+    /* Compute the next frame if it wasn't supplied. */
>+    if (!next)
>+        next = cx->computeNextFrame(this);

The code reads well; the comment probably isn't necessary.

>     JSFRAME_HAS_ARGS_OBJ       = 0x20000, /* frame has an argsobj in JSStackFrame::args */
>     JSFRAME_HAS_HOOK_DATA      = 0x40000, /* frame has hookData_ set */
>-    JSFRAME_HAS_ANNOTATION     = 0x80000  /* frame has annotation_ set */
>+    JSFRAME_HAS_ANNOTATION     = 0x80000, /* frame has annotation_ set */
>+
>+    /*
>+     * Whether the prevpc_ value is valid.  If not set, the ncode_ value is
>+     * valid and prevpc_ can be recovered using it.
>+     */
>+    JSFRAME_HAS_PREVPC         = 0x100000,
>+
>+    /*
>+     * For use by compiled functions, at function exit indicates whether rval_
>+     * has been assigned to.  Otherwise the return value is carried in registers.
>+     */
>+    JSFRAME_RVAL_ASSIGNED      = 0x200000

This may be a stupid request, but all the other comments are short and line up nicely... do you think maybe these comments can be similarly short?  Most of what makes them long is basically the same as the other lazy members: if flag not set, recompute.

>+/*
>+ * Clean up a frame and return.  popFrame indicates whether to additionally pop
>+ * the frame and store the return value on the caller's stack.  The frame will
>+ * normally be popped by the caller on return from a call into JIT code,
>+ * so must be popped here when that caller code will not execute.  This can be
>+ * either because of a call into an un-JITable script, or because the call is
>+ * throwing an exception.
>+ */
>+static bool
>+InlineReturn(VMFrame &f, JSBool ok, JSBool popFrame = JS_TRUE);
> 
[snip]
> static bool
>-InlineReturn(VMFrame &f, JSBool ok)
>+InlineReturn(VMFrame &f, JSBool ok, JSBool storeReturnValue)
> {

Need to rename in definition.  Great comment, btw.  To make sure it isn't missed, could you put it on the definition?  That's usually where I look when I want to know what an internal function does.
Attachment #479117 - Flags: review?(lw) → review+
Comment on attachment 479117 [details] [diff] [review]
patch fixing comments

This patch is awesome. Just a few comments:

>+        JSC::RepatchBuffer fastRepatch((uint8*)fastNcode.executableAddress() - 32, 64);
>+        fastRepatch.repatch(fastNcode, joinPoint);
>+
>+        /* Patch the write of ncode in the slow path. */
>+        if (callPatches[i].hasSlowNcode) {
>+            JSC::CodeLocationDataLabelPtr slowNcode =
>+                stubCode.locationOf(callPatches[i].slowNcodePatch);
>+            JSC::RepatchBuffer slowRepatch((uint8*)slowNcode.executableAddress() - 32, 64);
>+            slowRepatch.repatch(slowNcode, joinPoint);
>+        }

Pass a |false| as the third parameter to the RepatchBuffer constructor, to avoid flipping memory protection bits here. Or, use LinkBuffer with its (uint8 *, size_t) constructor.

> extern "C" void JS_FASTCALL
> PushActiveVMFrame(VMFrame &f)
> {
>     f.previous = JS_METHODJIT_DATA(f.cx).activeFrame;
>     JS_METHODJIT_DATA(f.cx).activeFrame = &f;
>+
>+    f.regs.fp->setNativeReturnAddress(JS_FUNC_TO_DATA_PTR(void*, JaegerTrampolineReturn));

Nice trick!


>+    /* Store any known return value */
>+    masm.move(Imm32(0), JSReturnReg_Data);
>+    masm.move(ImmType(JSVAL_TYPE_UNDEFINED), JSReturnReg_Type);

Use loadValueAsComponents(UndefinedValue(), ...

>+     * Load a return value from POPV or SETRVAL into the return registers,
>+     * otherwise return undefined.
>+     */
>+    masm.move(Imm32(0), JSReturnReg_Data);
>+    masm.move(ImmType(JSVAL_TYPE_UNDEFINED), JSReturnReg_Type);

Ditto.

> 
>+    /* If this is a RETURN opcode then the top value on the stack is the return value. */
>+    FrameEntry *fe = (JSOp(*PC) == JSOP_RETURN) ? frame.peek(-1) : NULL;

Pass the |fe| as an inparam to emitReturn() instead.

>             /* if (hasCallObj() || hasArgsObj()) stubs::PutActivationObjects() */
>             Jump putObjs = masm.branchTest32(Assembler::NonZero,
>                                              Address(JSFrameReg, JSStackFrame::offsetOfFlags()),
>                                              Imm32(JSFRAME_HAS_CALL_OBJ | JSFRAME_HAS_ARGS_OBJ));
>             stubcc.linkExit(putObjs, Uses(frame.frameDepth()));

Not relevant to this bug, but potential follow-up: I don't think we need to sync the entire frame if !hasCallObj().

>+    /* REVIEW: what is this code here for? */
> #if (defined(JS_NO_FASTCALL) && defined(JS_CPU_X86)) || defined(_WIN64)
>     masm.callLabel = masm.label();
> #endif

This is some hackery to make debug-mode recompilation work on platforms that have special alignment concerns.

r=me with those addressed
Attachment #479117 - Flags: review?(dvander) → review+
http://hg.mozilla.org/tracemonkey/rev/81881086131a

SS:

JM: 4ms (as predicted)
JM+TM: 8.5ms (more than predicted, maybe bogus)
TM: -2.5ms

V8:

JM: 115ms (more than predicted)
JM+TM: 27ms (more than predicted)
TM: -62ms

Is the interpreter calling fp->pc() a lot?  That's the only reason I can think of for why pure TM would be slower with this patch.  The TM regression may be bogus, TM on V8 is pretty chaotic and exhibits random swings >= this frequently.
(In reply to comment #34)
> Is the interpreter calling fp->pc() a lot?

On every function return in popInlineFrame.  This seems like a fair tradeoff to me :)
Well, after a couple more (non-performance) checkins the TM delta is down to -15ms, looks like the regression was mostly a random fluctuation.  The 8.5ms gain on SS for JM+TM is still there (still seems weird).
Duplicate of this bug: 595074
Brian, your checkin is bustage on Win64.

ml64.exe -o TrampolineMasmX64.obj  -c /c/Workspace/hg.mozilla.org/tracemonkey/js/src/methodjit/TrampolineMasmX64.asm
Microsoft (R) Macro Assembler (x64) Version 9.00.30729.01
Copyright (C) Microsoft Corporation.  All rights reserved.

MASM : warning A4018:invalid command-line option : -o
 Assembling: c:/Workspace/hg.mozilla.org/tracemonkey/js/src/methodjit/TrampolineMasmX64.asm
c:/Workspace/hg.mozilla.org/tracemonkey/js/src/methodjit/TrampolineMasmX64.asm(123) : fatal error A1010:unmatched block nesting : JaegerTrampoline
Attached patch fixSplinter Review
Attachment #480021 - Flags: review?(dvander)
Attachment #480021 - Flags: review?(dvander) → review+
http://hg.mozilla.org/mozilla-central/rev/81881086131a
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Duplicate of this bug: 551761