Closed Bug 495331 Opened 15 years ago Closed 14 years ago

Trace JSOP_LAMBDA for non-flat, non-null closures

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: dmandelin)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 7 obsolete files)

See bug 492918.
Blocks: 494268
No longer blocks: 492918
Depends on: 471214
I've started working on this, and I have some code that kind of works, but there are some major problems to be solved before this gets very far:

1. Creating a heavyweight closure requires the current scope chain. 

If the *parent function* (the function that immediately contains the closure) is not heavyweight, then we can get current scope chain on trace as the parent of the current callee object (i.e., fp->scopeChain == fp->callee->[PARENT]). That's easy to trace.

If the parent function *is* heavyweight, then we currently have no way to get the scope chain on trace. Possible solutions:

 - The simplest way to fix this would seem to be to put the scope chain into the on-trace frame, as was done for |arguments|.

 - An alternative would be to change the interpreter so that there is a link from a callee object (which is a Function object) to its own call object, if it has one. (Currently, a callee object only has a direct link to the parent of its call object.)

2. Every first call to a new lambda will exit trace.

When we call a function on trace, we guard two things (see guardCallee):

  (a)          callee == record_callee
  (b)          callee->[PARENT] == record_callee->[PARENT]

   where record_callee is the callee Function object observed while recording
     and callee        is the callee Function object on trace

With heavyweight lambda callees, generally we will fail guard (b) every time. (For example, this happens in the go game reduced test case in bug 460050.) The reason is that the parent object will be different for each activation of the parent function. The result will be bad perf as we leave trace and create a new branch for each call. Possible solutions:

  - Omit guard (b). I don't know what the purpose of guard (b) actually is, so I don't know if this is safe or what else would need to be changed to make it safe.

  - Dynamically patch code pertaining to the protected properties. This is not easy to explain, but it's similar to a PIC. The idea is that guard (b) must be protecting some properties or other, that make later code valid. Assume WLOG that a call trace looks like this:

      guard G1 that callee parent == P1 (guard (b))
      code  C1 that embeds P1
      code  D1 that does not embed P1

If we now try to call with P2, we could patch this to:

      guard G2 that callee parent == P2 (guard (b))
      code  C2 that embeds P2
      code  D1 that does not embed P1

Basically, we replace all "references" to P1 with P2.
(In reply to comment #1)
>  - The simplest way to fix this would seem to be to put the scope chain into
> the on-trace frame, as was done for |arguments|.

Just to be extra clear, you mean on the JIT's native stack, not in FrameInfo.

>  - An alternative would be to change the interpreter so that there is a link
> from a callee object (which is a Function object) to its own call object, if it
> has one. (Currently, a callee object only has a direct link to the parent of
> its call object.)

This is a non-starter. You can't safely mutate a function object this way, it'll break recursion. It's also ineffecient to save and restore in the callee's parent fslot. It's also nasty on its face (mutation, I mean).

We have optimized away Call objects in many cases, perhaps there's a third way:

 - Add assignment conversion and heap-box mutated uvpars converting them to "invisible" refs to the boxed ref (see Chez Scheme); and do whatever else it takes to reduce Call object occurrences.

Another optimization that might help some cases (I'm not sure why the heavyweight lambda case is the focus here -- does this bug need resummarizing?) could be the one mentioned in bug 471214 comment 28: a non-escaping closure with mutated upvars but where all are in range of the recorder.

>   - Omit guard (b). I don't know what the purpose of guard (b) actually is, so
> I don't know if this is safe or what else would need to be changed to make it
> safe.

This is related to the point from bug 471214 comment 28. That comment suggests removing guard (b) when we know we don't rely on any outer scope objects, only on active frame slots, even though some may be mutated by closures.

Another way to eliminate guard (b) is to arrange for the right scope chain to be constructed on trace so that the callee's parent is guaranteed to connect to it (as the seed for the callee's Call object if heavyweight). Then there's no need to guard on recording-time callee parent matching. We just need to know that the Call objects that were created by the interpreter, witnessed by the recorder, will be created in the same structure on-trace.

But to do this, we have to create Call objects on-trace, yet they won't ever (while on-trace) have JSStackFrame private data -- they would have null private data with their reserved slots filled in from the JIT's native stack before the activation that created such a Call object returned.

/be


> 
>   - Dynamically patch code pertaining to the protected properties. This is not
> easy to explain, but it's similar to a PIC. The idea is that guard (b) must be
> protecting some properties or other, that make later code valid. Assume WLOG
> that a call trace looks like this:
> 
>       guard G1 that callee parent == P1 (guard (b))
>       code  C1 that embeds P1
>       code  D1 that does not embed P1
> 
> If we now try to call with P2, we could patch this to:
> 
>       guard G2 that callee parent == P2 (guard (b))
>       code  C2 that embeds P2
>       code  D1 that does not embed P1
> 
> Basically, we replace all "references" to P1 with P2.
(In reply to comment #2)
> (In reply to comment #1)
> Another way to eliminate guard (b) is to arrange for the right scope chain to
> be constructed on trace so that the callee's parent is guaranteed to connect to
> it (as the seed for the callee's Call object if heavyweight).

I mean "seed for the callee's Call object's parent". The callee is statically scoped so that its parent becomes the parent of its Call object. Only if it is heavyweight, of course!

Dave, can you classify the causes of heavyweight JSOP_LAMBDAs that you're trying to JIT here? Outer heavyweights due to inner lambda escaping *and* using mutable upvars, I get. But inner lambda being heavyweight, I'm interested in the details of why.

/be
Summary: Trace JSOP_LAMBDA → Trace JSOP_LAMBDA for heavyweight closures
(In reply to comment #3)
> Dave, can you classify the causes of heavyweight JSOP_LAMBDAs that you're
> trying to JIT here? Outer heavyweights due to inner lambda escaping *and* using
> mutable upvars, I get. But inner lambda being heavyweight, I'm interested in
> the details of why.

I might be wrong about the terminology here. Here is an example of what I want to make trace, from the go benchmark:

    function funcA() {  // was findChains
        var n = 0;
        manyTimes(function() {
            n += 1;
        });
    }

Here, we get an abort on the creation of the inner function. The function disassembles to this:

flags: LAMBDA
main:
00000:  bindname "n"
00003:  dup
00004:  getxprop "n"
00007:  one
00008:  add
00009:  setname "n"
00012:  pop
00013:  stop

I guess this isn't a heavyweight. And so I guess it doesn't need a call object, which may make the scope chain a bit easier to compute. But I think it does need a real scope chain.

(Thinking about this helped--I was confused about what functions actually were heavyweight before.)
Summary: Trace JSOP_LAMBDA for heavyweight closures → Trace JSOP_LAMBDA for non-flat, non-null closures
(In reply to comment #2)
> (In reply to comment #1)
> >  - The simplest way to fix this would seem to be to put the scope chain into
> > the on-trace frame, as was done for |arguments|.
> 
> Just to be extra clear, you mean on the JIT's native stack, not in FrameInfo.

Yes.

> >  - An alternative would be to change the interpreter so that there is a link
> > from a callee object (which is a Function object) to its own call object, if it
> > has one. (Currently, a callee object only has a direct link to the parent of
> > its call object.)
> 
> This is a non-starter. You can't safely mutate a function object this way,
> it'll break recursion. It's also ineffecient to save and restore in the
> callee's parent fslot. It's also nasty on its face (mutation, I mean).

I'm fine with that--I didn't really want to make a cross-cutting interpreter change like that.

> We have optimized away Call objects in many cases, perhaps there's a third way:
> 
>  - Add assignment conversion and heap-box mutated uvpars converting them to
> "invisible" refs to the boxed ref (see Chez Scheme); and do whatever else it
> takes to reduce Call object occurrences.

That sounds interesting. Per comment 4, maybe we don't have as big an issue with Call objects as I thought, but it may be useful in the future.

> Another optimization that might help some cases (I'm not sure why the
> heavyweight lambda case is the focus here -- does this bug need resummarizing?)
> could be the one mentioned in bug 471214 comment 28: a non-escaping closure
> with mutated upvars but where all are in range of the recorder.

I suspect those are fairly rare. Before I got started on this, we traced jsop_getupvar iff the upvars were in recorder range, but that didn't seem to cover many cases.

> >   - Omit guard (b). I don't know what the purpose of guard (b) actually is, so
> > I don't know if this is safe or what else would need to be changed to make it
> > safe.
> 
> This is related to the point from bug 471214 comment 28. That comment suggests
> removing guard (b) when we know we don't rely on any outer scope objects, only
> on active frame slots, even though some may be mutated by closures.

IIUC, you are saying that guard (b) protects against potential situations where we bake the callee functions' parent into the trace, and so those are the situations we have to look for. As we discussed, I think there may not be any such situations, but I'll have to sit down for a while to come up with a proofoid on that.

> Another way to eliminate guard (b) is to arrange for the right scope chain to
> be constructed on trace so that the callee's parent is guaranteed to connect to
> it (as the seed for the callee's Call object if heavyweight). Then there's no
> need to guard on recording-time callee parent matching. We just need to know
> that the Call objects that were created by the interpreter, witnessed by the
> recorder, will be created in the same structure on-trace.
> 
> But to do this, we have to create Call objects on-trace, yet they won't ever
> (while on-trace) have JSStackFrame private data -- they would have null private
> data with their reserved slots filled in from the JIT's native stack before the
> activation that created such a Call object returned.
> 
> /be
> 
> 
> > 
> >   - Dynamically patch code pertaining to the protected properties. This is not
> > easy to explain, but it's similar to a PIC. The idea is that guard (b) must be
> > protecting some properties or other, that make later code valid. Assume WLOG
> > that a call trace looks like this:
> > 
> >       guard G1 that callee parent == P1 (guard (b))
> >       code  C1 that embeds P1
> >       code  D1 that does not embed P1
> > 
> > If we now try to call with P2, we could patch this to:
> > 
> >       guard G2 that callee parent == P2 (guard (b))
> >       code  C2 that embeds P2
> >       code  D1 that does not embed P1
> > 
> > Basically, we replace all "references" to P1 with P2.
As noted, too many dimensions to keep track of at once:

function weight: light vs. heavy (does not vs. does need Call object)
escaping or not: funarg
uses mutated upvars (prevents flat closure optimization)
uses no free variables (static analysis binds upvars to outer decls, no globals)

Confusingly, null closure uses no free variables but also can't be optimized to a flat closure. It's lightweight by definition.

So there's a trumping order: heavyweight trumps flat trumps null trumps the nameless left-over "interpreted lightweight" function kind. The encoding via JSFUN_ flags has a bit for JSFUN_HEAVYWEIGHT, two bits for FUN_KIND(fun).

Hope this helps.

Agree we want to get rid of that callee parent guard. It is a bug now thanks to your upvar work. File it separately if you like. Thanks,

/be
Attached patch WIP (obsolete) — Splinter Review
This patch needs a lot of cleanup, but it passes trace-test at least. Doing more testing now.
Assignee: general → dmandelin
Status: NEW → ASSIGNED
Attached patch Patch (obsolete) — Splinter Review
Here it is. Design notes for reviewers & posterity:

The basic idea is to do on trace what is done in the interpreter. The key problem that needs to be solved is that for this kind of closure, JSOP_LAMBDA needs to be passed a real scope chain, and we have not so far had real scope chains on trace.

My approach was to add the scope chain to the on-trace frame. When heavyweight functions are called on trace, a Call object is added to the head of the scope chain, just as in the interpreter. When a heavyweight function returns, the variables are imported into the call object, just as in the interpreter.
Attachment #397187 - Attachment is obsolete: true
Attachment #398550 - Flags: review?(jorendorff)
Attached patch Patch 2 (unrotted) (obsolete) — Splinter Review
This rotted a touch since the original review request (hint, hint) so resubmitting.
Attachment #398550 - Attachment is obsolete: true
Attachment #402246 - Flags: review?(jorendorff)
Attachment #398550 - Flags: review?(jorendorff)
I'm really sorry for the lateness of this review. My queue has been a mess, and I keep clearing out the easy stuff. :-P

A couple preliminary comments.

In jsbuiltins.h:
>         { (intptr_t) &name, argtypes, cse, fold, nanojit::ABI_FASTCALL...
> #endif
> 
>+#define _JS_DEFINE_CALLINFO_CDECL(linkage, name, crtype, cargtypes, ar...
>+    _JS_TN_LINKAGE(linkage, crtype) name cargtypes;                   ...
>+    _JS_CI_LINKAGE(linkage) const nanojit::CallInfo _JS_CALLINFO(name)...
>+        { (intptr_t) &name, argtypes, cse, fold, nanojit::ABI_CDECL _J...

Instead of adding this, can we just make js_CloneFunctionObject fastcall? It
doesn't *appear* to be used outside the engine (according to MXR).

There's some precedent for this approach. We even did it for a FRIEND function,
js_CloseIterator.

In jstracer.cpp:
>+JS_DEFINE_CALLINFO_CDECL_3(extern, OBJECT, js_CloneFunctionObject, CONTEXT, FUNCTION, OBJECT, 0, 0)

Just for consistency's sake, please use JS_DECLARE_CALLINFO in jsbuiltins.h and
put this line in jsfun.cpp right after the function definition (like you did
for js_PutCallObjectOnTrace).

In jstracer.cpp:
>+        visitor.setStackSlotKind("scopeChain");
>+        if (!visitor.visitStackSlots((jsval*) &fp->scopeChain, 1, fp))
>+            return false;

Heh, nice. This abstraction breakage deserves a comment.

If we're going to perpetrate this kind of hack anyway, it seems like we might
as well do the same for callobj and argsobj and keep them as fields of type JSObject*.

There are at least a few places where after your patch there's still code like:
  if (fp->callobj)
or:
  fp->callobj = NULL;
If it's going to be a jsval, we should say `if (!JSVAL_IS_NULL(fp->callobj))`
and `fp->callobj = JSVAL_NULL;` Same goes for argsobj.  But I would prefer to
use your fp->scopeChain hack.

And could you please post a patch with `-U8 -p`?
(In reply to comment #10)
> Instead of adding this, can we just make js_CloneFunctionObject fastcall? It
> doesn't *appear* to be used outside the engine (according to MXR).
> 
> There's some precedent for this approach. We even did it for a FRIEND
> function, js_CloseIterator.

Sounds reasonable to me. The last few existing builtins I needed to call from trace were stored in function-pointer-typed fields, so I couldn't make them fastcall. I think that biased my mind away from doing that. There's also the fact that fastcall isn't particularly great on x86. But shortening the patch is always welcome.

> If we're going to perpetrate this kind of hack anyway, it seems like we might
> as well do the same for callobj and argsobj and keep them as fields of type
> JSObject*.

I put callobj back. It doesn't need to be modified for this patch. In the earliest versions, I thought I was going to be importing callobj instead of scopeChain, so I made that change. But changing the type of scopeChain was really unappealing, so I backed off from that, and forgot to undo the callobj change. Thanks for catching that.

argsobj seems fairly happy the way it is, so I'm inclined to leave it be for now.
Attachment #402246 - Attachment is obsolete: true
Attachment #402415 - Flags: review?(jorendorff)
Attachment #402246 - Flags: review?(jorendorff)
In jsfun.cpp:
>+JSBool JS_FASTCALL
>+js_PutCallObjectOnTrace(JSContext *cx, JSObject *scopeChain, uint32 nargs, jsval *argv,
>+                        uint32 nvars, jsval *slots)
>+{
>+    JSObject *callobj = scopeChain;
>+    while (!callobj->hasClass(&js_CallClass)) {
>+        JS_ASSERT(js_IsCacheableNonGlobalScope(callobj));
>+        callobj = callobj->getParent();
>+    }
>+    JS_ASSERT(!callobj->getPrivate());

What's the case where this scopeChain itself isn't the object we're looking
for? Can we do this loop at record time instead of at run time?

In jsfun.h:
>@@ -159,16 +159,18 @@ struct JSFunction {
>+    uintN countVars() const { return u.i.nvars; }

This could assert that FUN_INTERPRETED(this), like countArgsAndVars().

>@@ -3197,16 +3209,29 @@ FlushNativeStackFrame(JSContext* cx, uns
>                 /*
>                  * SynthesizeFrame sets scopeChain to NULL, because we can't calculate the
>                  * correct scope chain until we have the final callee. Calculate the real
>                  * scope object here.
>                  */
>+                if (fp->scopeChain) {
>+                    [...]                    
>+                }
>+
>+                JS_ASSERT(fp->scopeChain);
>                 if (!fp->scopeChain) {
>                     fp->scopeChain = OBJ_GET_PARENT(cx, JSVAL_TO_OBJECT(fp->argv[-2]));

Not feeling confident here?  :)

I'm pretty confident that fp->scopeChain can't be null there, because the
FlushNativeStackFrameVisitor now populates it. When we EnterFrame, we set that
native-stack slot to a non-null value. So please remove the
`if (fp->scopeChain) {}` wrapping that code and delete the `if (!fp->scopeChain)`
block.

> /*
>  * If we have created an |arguments| object for the frame, we must copy the
>  * argument values into the object as properties in case it is used after
>  * this frame returns.
>  */
> JS_REQUIRES_STACK void
> TraceRecorder::putArguments()

I think this should be renamed to putActivationObjects (after
JSStackFrame::putActivationObjects), and the comment should be updated.

>+    if (have_args) {
>+        LIns* argsobj_ins = get(&cx->fp->argsobj);
>         LIns* args[] = { args_ins, argsobj_ins, cx_ins };
>         lir->insCall(&js_PutArguments_ci, args);
>     }

It's not necessary to emit this call to js_PutArguments if nargs is 0, right?

After TraceRecorder::putArguments:
>+JS_DEFINE_CALLINFO_4(extern, OBJECT, js_CreateCallObjectOnTrace, CONTEXT, FUNCTION, OBJECT, OBJECT, 0, 0)

I'm sorry, didn't notice this. Could you move this after the function definition in jsfun.cpp?

>@@ -9232,16 +9308,38 @@ TraceRecorder::record_EnterFrame()
>+    set((jsval *) &fp->scopeChain, INS_NULL(), true);
>+
>+    LIns* callee_ins = get(&cx->fp->argv[-2]);
>+    LIns* scopeChain_ins = stobj_get_parent(callee_ins);
>+
>+    if (cx->fp->fun && JSFUN_HEAVYWEIGHT_TEST(cx->fp->fun->flags)) {
>+        JSAtom *lambdaName = (cx->fp->fun->flags & JSFUN_LAMBDA) 
>+            ? cx->fp->fun->atom : NULL;
>+        if (lambdaName)
>+            ABORT_TRACE("can't call named lambda heavyweight on trace");
>+
>+        LIns* fun_ins = INS_CONSTPTR(cx->fp->fun);
>+
>+        LIns* args[] = { scopeChain_ins, callee_ins, fun_ins, cx_ins };
>+        LIns* call_ins = lir->insCall(&js_CreateCallObjectOnTrace_ci, args);
>+        guard(false, lir->ins_eq0(call_ins), snapshot(OOM_EXIT));
>+
>+        set((jsval *) &fp->scopeChain, call_ins);
>+    } else {
>+        set((jsval *) &fp->scopeChain, scopeChain_ins);
>+    }

The first set() added here is redundant, right? It always gets overwritten.

In the guard, that needs to be lir->ins_peq0(call_ins), peq0 not eq0, according
to dvander.

It seems like a shame to do this store on every function call. Can't we usually
avoid it (i.e. in cases where we know we're not going to need the callobj)?
Have you measured the performance cost?

>@@ -11005,17 +11103,17 @@ TraceRecorder::record_JSOP_GETELEM()
>                     // Guard that the argument has the same type on trace as during recording.
>                     LIns* typemap_ins;
>-                    if (callDepth == depth) {
>+                    if (depth == 0) {
>                         // In this case, we are in the same frame where the arguments object was created.
>                         // The entry type map is not necessarily up-to-date, so we capture a new type map
>                         // for this point in the code.

Hmm, what's this? A bugfix?

>@@ -11266,17 +11364,17 @@ TraceRecorder::record_JSOP_CALLNAME()

> JS_REQUIRES_STACK JSRecordingStatus
> TraceRecorder::record_JSOP_CALLNAME()
> {
>     JSObject* obj = cx->fp->scopeChain;
>     if (obj != globalObj) {
>         jsval* vp;
>         LIns* ins;
>         NameResult nr;
>         CHECK_STATUS(scopeChainProp(obj, vp, ins, nr));
>         stack(0, ins);
>         stack(1, INS_CONSTOBJ(globalObj));
>         return JSRS_CONTINUE;
>     }
>-    LIns* obj_ins = scopeChain();
>+    LIns* obj_ins = entryScopeChain();
>     JSObject* obj2;
>     jsuword pcval;
> 
>     CHECK_STATUS(test_property_cache(obj, obj_ins, obj2, pcval));

This seems wrong to me, in the case where the entry scope chain is in a nested
function where some global names are shadowed, but we currently have
callDepth > 0 and we're in a different function, and in the present lexical
scope those names aren't shadowed.

I'm not sure. It might be that we don't emit CALLNAME in this case, or we don't
trace the cases where it matters. I would try to write a trace-test
illustrating this, but I'm running out of time for today.

I think at this point in the code we know that cx->fp->scopeChain == globalObj,
and I *think* that means we can just
  LIns* obj_ins = global_ins;
but I'm not sure about that either.

>@@ -11757,17 +11855,17 @@
> JS_REQUIRES_STACK JSRecordingStatus
> TraceRecorder::name(jsval*& vp, LIns*& ins, NameResult& nr)
> {
>     JSObject* obj = cx->fp->scopeChain;
>     if (obj != globalObj)
>         return scopeChainProp(obj, vp, ins, nr);
> 
>     /* Can't use prop here, because we don't want unboxing from global slots. */
>-    LIns* obj_ins = scopeChain();
>+    LIns* obj_ins = entryScopeChain();

Same issue here.

In record_JSOP_LAMBDA():
>+        if (OBJ_GET_PARENT(cx, FUN_OBJECT(fun)) != globalObj)
>+            return JSRS_STOP;

The last line should say `ABORT_TRACE(something);`, right?

>+    LIns* call_ins = lir->insCall(&js_CloneFunctionObject_ci, args);
>+    guard(false,
>+          addName(lir->ins2(LIR_eq, call_ins, INS_CONSTPTR(0)),

Style nit:  addName(lir->ins_peq0(call_ins),

> TraceRecorder::record_JSOP_LAMBDA_FC()
> {
>     JSFunction* fun;
>     fun = cx->fp->script->getFunction(getFullIndex());
> 
>-    LIns* scopeChain_ins = get(&cx->fp->argv[-2]);
>-    JS_ASSERT(scopeChain_ins);
>-
>     LIns* args[] = {
>-        scopeChain_ins,
>+        scopeChain(),
>         INS_CONSTFUN(fun),
>         cx_ins
>     };
>     LIns* call_ins = lir->insCall(&js_AllocFlatClosure_ci, args);

In the interpreter, we use cx->fp->scopeChain without bothering to call
js_GetScopeChain. It looks like you're changing this to do the same thing here.
Cool, but is there any reason we can't just parent the new lambda to the global
object in both cases? That would save a load or two on trace.

In trace-test/tests/basic/lambda.js:
>+[...]
>+function h() {
>+  for (var i = 0; i < 10; ++i) {
>+    var vf = f();
>+    assertEq(vf(), 1);
>+    assertEq(vf(), 2);
>+    for (var j = 0; j < 10; ++j) {
>+      assertEq(vf(), j + 3);
>+    }
>+  }
>+}
>+
>+h();
>+
>+checkStats({
>+  recorderAborted: 9,      // Inner tree is trying to grow
>+});

This is flunking "guard (b)", as it's called in comment 1, right?  In other
words, "major problem 2" from comment 1 has not been solved?

How does this patch affect the benchmarks?

>+++ b/js/src/trace-test/tests/basic/namedLambda.js
>@@ -0,0 +1,16 @@
>+var f = function ff() {
>+  var k = 0;
>+  var counter = function q() {
>+      return ++k;
>+  }
>+  return counter;
>+}
>+
>+function g() {
>+  for (var i = 0; i < 10; ++i) {
>+    var c = f();
>+    print(c());
>+  }
>+}
>+
>+g();

This test should be asserting the result of c() instead of printing it, and it should probably checkStats() too.

I don't feel comfortable stamping this without perf measurements.
Attachment #402415 - Flags: review?(jorendorff)
(In reply to comment #12)

Excellent review.

> In jsfun.cpp:
> >+JSBool JS_FASTCALL
> >+js_PutCallObjectOnTrace(JSContext *cx, JSObject *scopeChain, uint32 nargs, jsval *argv,
> >+                        uint32 nvars, jsval *slots)
> >+{
> >+    JSObject *callobj = scopeChain;
> >+    while (!callobj->hasClass(&js_CallClass)) {
> >+        JS_ASSERT(js_IsCacheableNonGlobalScope(callobj));
> >+        callobj = callobj->getParent();
> >+    }
> >+    JS_ASSERT(!callobj->getPrivate());
> 
> What's the case where this scopeChain itself isn't the object we're looking
> for? Can we do this loop at record time instead of at run time?

In general, of course, fp->scopeChain can be something else. But this function is called only on exit from a function, and I think at that point there can't be a with block or anything like that on the scope chain. I changed this loop to an assert and it passes all the shell tests. (I'll be doing try server soon.)

> >+    if (have_args) {
> >+        LIns* argsobj_ins = get(&cx->fp->argsobj);
> >         LIns* args[] = { args_ins, argsobj_ins, cx_ins };
> >         lir->insCall(&js_PutArguments_ci, args);
> >     }
> 
> It's not necessary to emit this call to js_PutArguments if nargs is 0, right?

For the right definition of 'nargs', yes. But I think I have it covered, because have_args is defined as:

    bool have_args = cx->fp->argsobj && cx->fp->argc;

I tried pretty hard to make that function not record any unnecessary work.

[ storing to the scopeChain slot in the native stack frame. ]
> It seems like a shame to do this store on every function call. Can't we usually
> avoid it (i.e. in cases where we know we're not going to need the callobj)?
> Have you measured the performance cost?

It could be done, but it's a bit difficult. The current design for snapshots wants the tracker to have an instruction for every slot in the native stack frame area. It will assert if not. (That's also the reason for the extra store of NULL above the first set. I reduced that cost by pushing the extra store inside the case that needs it.)

I could work on relaxing the tracker requirement if you think it's necessary. I assumed it would be hard and not worth the cost.

> 
> >@@ -11005,17 +11103,17 @@ TraceRecorder::record_JSOP_GETELEM()
> >                     // Guard that the argument has the same type on trace as during recording.
> >                     LIns* typemap_ins;
> >-                    if (callDepth == depth) {
> >+                    if (depth == 0) {
> >                         // In this case, we are in the same frame where the arguments object was created.
> >                         // The entry type map is not necessarily up-to-date, so we capture a new type map
> >                         // for this point in the code.
> 
> Hmm, what's this? A bugfix?

Yes, for a bug I detected while writing this. I don't remember exactly what it was at this point.

> >@@ -11266,17 +11364,17 @@ TraceRecorder::record_JSOP_CALLNAME()
> 
> > JS_REQUIRES_STACK JSRecordingStatus
> > TraceRecorder::record_JSOP_CALLNAME()
> > {
> >     JSObject* obj = cx->fp->scopeChain;
> >     if (obj != globalObj) {
> >         jsval* vp;
> >         LIns* ins;
> >         NameResult nr;
> >         CHECK_STATUS(scopeChainProp(obj, vp, ins, nr));
> >         stack(0, ins);
> >         stack(1, INS_CONSTOBJ(globalObj));
> >         return JSRS_CONTINUE;
> >     }
> >-    LIns* obj_ins = scopeChain();
> >+    LIns* obj_ins = entryScopeChain();
> >     JSObject* obj2;
> >     jsuword pcval;
> > 
> >     CHECK_STATUS(test_property_cache(obj, obj_ins, obj2, pcval));
> 
> This seems wrong to me, in the case where the entry scope chain is in a nested
> function where some global names are shadowed, but we currently have
> callDepth > 0 and we're in a different function, and in the present lexical
> scope those names aren't shadowed.
> 
> I'm not sure. It might be that we don't emit CALLNAME in this case, or we don't
> trace the cases where it matters. I would try to write a trace-test
> illustrating this, but I'm running out of time for today.
> 
> I think at this point in the code we know that cx->fp->scopeChain == globalObj,
> and I *think* that means we can just
>   LIns* obj_ins = global_ins;
> but I'm not sure about that either.

I was just preserving the exact operation of the original code for that part. But I think you are right: in both this case and the one for JSOP_NAME, that line occurs only in the case where the scope chain is the global object.

> > TraceRecorder::record_JSOP_LAMBDA_FC()
> > {
> >     JSFunction* fun;
> >     fun = cx->fp->script->getFunction(getFullIndex());
> > 
> >-    LIns* scopeChain_ins = get(&cx->fp->argv[-2]);
> >-    JS_ASSERT(scopeChain_ins);
> >-
> >     LIns* args[] = {
> >-        scopeChain_ins,
> >+        scopeChain(),
> >         INS_CONSTFUN(fun),
> >         cx_ins
> >     };
> >     LIns* call_ins = lir->insCall(&js_AllocFlatClosure_ci, args);
> 
> In the interpreter, we use cx->fp->scopeChain without bothering to call
> js_GetScopeChain. It looks like you're changing this to do the same thing here.
> Cool, but is there any reason we can't just parent the new lambda to the global
> object in both cases? That would save a load or two on trace.

That won't work. Example:

    var k = 1;
    var r = 10;

    var e = function() {
      eval('var r = 20');

      var f = function() {
	var k = 2;

	var g = function() {
	  print('g_k ' + k);
	  print('g_r ' + r);
	}
	return g;
      }
      return f();
    }

    var g = e();
    dis(g);
    g();


'g' ends being a flat closure that uses JSOP_NAME to read the value of r. If I change it to just use the global object as the scope chain, it prints '10' for r, which is wrong.

> In trace-test/tests/basic/lambda.js:
> >+[...]
> >+function h() {
> >+  for (var i = 0; i < 10; ++i) {
> >+    var vf = f();
> >+    assertEq(vf(), 1);
> >+    assertEq(vf(), 2);
> >+    for (var j = 0; j < 10; ++j) {
> >+      assertEq(vf(), j + 3);
> >+    }
> >+  }
> >+}
> >+
> >+h();
> >+
> >+checkStats({
> >+  recorderAborted: 9,      // Inner tree is trying to grow
> >+});
> 
> This is flunking "guard (b)", as it's called in comment 1, right?  In other
> words, "major problem 2" from comment 1 has not been solved?

Correct. That's bug 510554.

> How does this patch affect the benchmarks?
> 
> >+++ b/js/src/trace-test/tests/basic/namedLambda.js
> >@@ -0,0 +1,16 @@
> >+var f = function ff() {
> >+  var k = 0;
> >+  var counter = function q() {
> >+      return ++k;
> >+  }
> >+  return counter;
> >+}
> >+
> >+function g() {
> >+  for (var i = 0; i < 10; ++i) {
> >+    var c = f();
> >+    print(c());
> >+  }
> >+}
> >+
> >+g();
> 
> This test should be asserting the result of c() instead of printing it, and it
> should probably checkStats() too.

The purpose of this test is simply to ensure that named lambdas, which aren't traced, don't assert. I updated the test to explain that.
 
> I don't feel comfortable stamping this without perf measurements.

I'll put that in a separate comment.
OSX 10.5 SunSpider (TO=with patch):

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           1.011x as fast    999.7ms +/- 0.8%   989.0ms +/- 0.1%     significant

=============================================================================

  3d:                  -                 148.4ms +/- 0.6%   148.2ms +/- 0.1% 
    cube:              1.012x as fast     41.7ms +/- 0.6%    41.2ms +/- 0.3%     significant
    morph:             *1.018x as slow*   29.4ms +/- 0.5%    30.0ms +/- 0.2%     significant
    raytrace:          -                  77.4ms +/- 0.8%    77.0ms +/- 0.1% 

  access:              1.010x as fast    155.9ms +/- 0.8%   154.3ms +/- 0.1%     significant
    binary-trees:      1.062x as fast     44.1ms +/- 0.9%    41.6ms +/- 0.3%     significant
    fannkuch:          *1.015x as slow*   68.6ms +/- 1.0%    69.6ms +/- 0.2%     significant
    nbody:             ??                 28.7ms +/- 0.8%    28.8ms +/- 0.4%     not conclusive: might be *1.001x as slow*
    nsieve:            -                  14.5ms +/- 1.1%    14.4ms +/- 1.0% 

  bitops:              ??                 41.0ms +/- 1.1%    41.4ms +/- 0.5%     not conclusive: might be *1.011x as slow*
    3bit-bits-in-byte: *1.102x as slow*    1.8ms +/- 7.0%     1.9ms +/- 3.5%     significant
    bits-in-byte:      -                  10.6ms +/- 1.6%    10.4ms +/- 1.4% 
    bitwise-and:       -                   2.3ms +/- 5.8%     2.3ms +/- 5.8% 
    nsieve-bits:       *1.017x as slow*   26.3ms +/- 1.2%    26.7ms +/- 0.5%     significant

  controlflow:         1.031x as fast     38.9ms +/- 1.6%    37.8ms +/- 0.3%     significant
    recursive:         1.031x as fast     38.9ms +/- 1.6%    37.8ms +/- 0.3%     significant

  crypto:              -                  59.9ms +/- 1.1%    59.6ms +/- 0.8% 
    aes:               -                  33.9ms +/- 1.6%    33.5ms +/- 1.2% 
    md5:               ??                 16.7ms +/- 1.1%    16.8ms +/- 0.6%     not conclusive: might be *1.007x as slow*
    sha1:              ??                  9.3ms +/- 1.6%     9.3ms +/- 1.4%     not conclusive: might be *1.004x as slow*

  date:                1.024x as fast    143.1ms +/- 0.6%   139.7ms +/- 0.2%     significant
    format-tofte:      1.031x as fast     74.6ms +/- 0.5%    72.4ms +/- 0.2%     significant
    format-xparb:      1.018x as fast     68.5ms +/- 0.8%    67.4ms +/- 0.2%     significant

  math:                1.014x as fast     35.8ms +/- 0.8%    35.4ms +/- 0.5%     significant
    cordic:            1.057x as fast     11.2ms +/- 1.8%    10.6ms +/- 1.3%     significant
    partial-sums:      -                  17.1ms +/- 0.5%    17.1ms +/- 0.5% 
    spectral-norm:     ??                  7.5ms +/- 1.9%     7.7ms +/- 1.7%     not conclusive: might be *1.021x as slow*

  regexp:              1.010x as fast     46.3ms +/- 0.7%    45.8ms +/- 0.3%     significant
    dna:               1.010x as fast     46.3ms +/- 0.7%    45.8ms +/- 0.3%     significant

  string:              1.011x as fast    330.2ms +/- 1.0%   326.7ms +/- 0.1%     significant
    base64:            -                  14.3ms +/- 1.2%    14.1ms +/- 0.7% 
    fasta:             -                  71.4ms +/- 1.0%    71.0ms +/- 0.2% 
    tagcloud:          1.023x as fast    101.2ms +/- 1.0%    98.9ms +/- 0.2%     significant
    unpack-code:       -                 113.0ms +/- 1.0%   112.4ms +/- 0.1% 
    validate-input:    -                  30.4ms +/- 0.9%    30.2ms +/- 0.4%
V8 baseline:
david-mandelins-macbook-pro:v8 dmandelin$ ../opt/shell/js -j run.js
Richards: 1276
DeltaBlue: 63
Crypto: 1016
RayTrace: 260
EarleyBoyer: 361
----
Score: 378
david-mandelins-macbook-pro:v8 dmandelin$ ../opt/shell/js -j run.js
Richards: 1278
DeltaBlue: 62
Crypto: 1019
RayTrace: 261
EarleyBoyer: 361
----
Score: 377
david-mandelins-macbook-pro:v8 dmandelin$ ../opt/shell/js -j run.js
Richards: 1276
DeltaBlue: 63
Crypto: 982
RayTrace: 242
EarleyBoyer: 323
----
Score: 361

V8 with patch:

david-mandelins-macbook-pro:v8 dmandelin$ ../opt/shell/js -j run.js
Richards: 1248
DeltaBlue: 69
Crypto: 1023
RayTrace: 320
EarleyBoyer: 343
----
Score: 396
david-mandelins-macbook-pro:v8 dmandelin$ ../opt/shell/js -j run.js
Richards: 1248
DeltaBlue: 69
Crypto: 1041
RayTrace: 262
EarleyBoyer: 347
----
Score: 382
david-mandelins-macbook-pro:v8 dmandelin$ ../opt/shell/js -j run.js
Richards: 1249
DeltaBlue: 69
Crypto: 983
RayTrace: 263
EarleyBoyer: 347
----
Score: 378
david-mandelins-macbook-pro:v8 dmandelin$ ../opt/shell/js -j run.js
Richards: 1249
DeltaBlue: 69
Crypto: 983
RayTrace: 265
EarleyBoyer: 347
----
Score: 379

There are a couple of weirdly outlying runs, but throwing those out, it looks the same overall, but with different subtest results.
Attachment #402415 - Attachment is obsolete: true
Attachment #402930 - Flags: review?(jorendorff)
Awesome patch and review. Good to simplify/constrain things to use the globalObj where possible instead of cx->fp->scopeChain, but curses for the ability of eval to add vars to its dynamic scope (ES5 strict mode cures this).

/be
Comment on attachment 402930 [details] [diff] [review]
Patch 4 (fixes review notes from 3 except as prev noted)

>+js_CreateCallObjectOnTrace(JSContext *cx, JSFunction *fun, JSObject *callee, JSObject *scopeChain)
>+{
>+    JSAtom *lambdaName = (fun->flags & JSFUN_LAMBDA) ? fun->atom : NULL;
>+    JS_ASSERT(!lambdaName);

#ifdef DEBUG these two lines to avoid warnings on some platforms about set but not used lambdaName in non-DEBUG builds.

>+js_PutCallObjectOnTrace(JSContext *cx, JSObject *scopeChain, uint32 nargs, jsval *argv,
>+                        uint32 nvars, jsval *slots)
>+{
>+    JS_ASSERT(scopeChain->hasClass(&js_CallClass));
>+    JS_ASSERT(!scopeChain->getPrivate());
>+
>+    uintN n = nargs + nvars;
>+    if (n != 0) {
>+        JS_LOCK_OBJ(cx, callobj);
>+        memcpy(scopeChain->dslots, argv, nargs * sizeof(jsval));
>+        memcpy(scopeChain->dslots + nargs, slots, nvars * sizeof(jsval));
>+        JS_UNLOCK_OBJ(cx, scopeChain);

Separate bug, pre-existing:

Call objects should never be used (much less mutated) across threads, so the locking here can go. If the call object was ever locked, we could assert that OBJ_SCOPE(scopeChain)->title.ownercx == cx. But we don't know anyone has locked it, and I claim no one should! So at most we might assert ownercx is null and there's no fat lock.

The phear may be that a debugger on another thread is racing to set a new var in the call object (or set a new property of an arguments object, which is also locked similarly). But I suspect we are missing a great deal more thread safety elsewhere in jsfun.cpp.

Igor, dragging you in via addl. review, no need to r+/- but would welcome your thoughts.

Dave, sorry for the drive-by, want to get to the bottom of this, seems like now is a good time ;-).

>+TraceRecorder::scopeChain()
>+{
>+    return cx->fp->callee()
>+        ? get((jsval*) &cx->fp->scopeChain)
>+        : entryScopeChain();

Nit: prevailing style wants three more spaces, so ? and : underhang the 'c' in cx. OR, maybe this all fits on one line without exceeding the modeline's tw= or emacs equiv.?

>+        JSAtom *lambdaName = (cx->fp->fun->flags & JSFUN_LAMBDA) 
>+            ? cx->fp->fun->atom : NULL;

Similar nit.

>+        if (lambdaName)
>+            ABORT_TRACE("can't call named lambda heavyweight on trace");
>+
>+        LIns* fun_ins = INS_CONSTPTR(cx->fp->fun);
>+
>+        LIns* args[] = { scopeChain_ins, callee_ins, fun_ins, cx_ins };
>+        LIns* call_ins = lir->insCall(&js_CreateCallObjectOnTrace_ci, args);
>+        guard(false, lir->ins_peq0(call_ins), snapshot(OOM_EXIT));
>+
>+        set((jsval *) &fp->scopeChain, call_ins);
>+    } else {
>+        set((jsval *) &fp->scopeChain, scopeChain_ins, true);

Why do you need initializing = true here but not in the "then" case?

If you didn't it would be nice to fuse the two set calls; maybe worth it in any event?

/be
Attachment #402930 - Flags: review?(igor)
(In reply to comment #18)
> >+        set((jsval *) &fp->scopeChain, call_ins);
> >+    } else {
> >+        set((jsval *) &fp->scopeChain, scopeChain_ins, true);
> 
> Why do you need initializing = true here but not in the "then" case?

Duh, never mind -- lost the crucial context.

/be
I've got the nits fixed but I'll wait for the reviews before posting so I don't have to cancel the current requests and cause extra confusion. Getting rid of locks is always a good thing.
(In reply to comment #18)
> Call objects should never be used (much less mutated) across threads, so the
> locking here can go. If the call object was ever locked, we could assert that
> OBJ_SCOPE(scopeChain)->title.ownercx == cx. But we don't know anyone has locked
> it, and I claim no one should!

The debugger can access the Call object and then accesses its properties as for ordinary object. Moreover, a debugger can do it from a different thread in principle if its knows that the thread that runs the JS is suspended.
(In reply to comment #21)
> The debugger can access the Call object and then accesses its properties as for
> ordinary object. Moreover, a debugger can do it from a different thread in
> principle if its knows that the thread that runs the JS is suspended.

I meant we should not deoptimize for that case and use locking for Call objects due to potential debugger activities. But we cannot insist that the object is single-threaded.
Comment on attachment 402930 [details] [diff] [review]
Patch 4 (fixes review notes from 3 except as prev noted)

>+JSObject * JS_FASTCALL
>+js_CreateCallObjectOnTrace(JSContext *cx, JSFunction *fun, JSObject *callee, JSObject *scopeChain)
>+{
>+    JSAtom *lambdaName = (fun->flags & JSFUN_LAMBDA) ? fun->atom : NULL;
>+    JS_ASSERT(!lambdaName);

Nit: add a helper inline function to JSFunction like isNamedLambda and use it in js_PutCallObject, TraceRecorder::record_EnterFrame and here to replace the assert with JS_ASSERT(!fun->isNamedLambda())

>+
>+    JSObject *callobj = js_NewObjectWithGivenProto(cx, &js_CallClass, NULL, scopeChain);
>+    if (!callobj ||
>+        !js_EnsureReservedSlots(cx, callobj, fun->countArgsAndVars())) {
>+        return NULL;
>+    }
>+    STOBJ_SET_SLOT(callobj, JSSLOT_CALLEE, OBJECT_TO_JSVAL(callee));
>+    return callobj;

Nit: factor these lines into a separated inline function and use it here and in js_GetCallObject.

>+JSBool JS_FASTCALL

Do we have a bug to support void functions called from the trace ?

>+js_PutCallObjectOnTrace(JSContext *cx, JSObject *scopeChain, uint32 nargs, jsval *argv,
>+                        uint32 nvars, jsval *slots)
>+{
>+    JS_ASSERT(scopeChain->hasClass(&js_CallClass));
>+    JS_ASSERT(!scopeChain->getPrivate());
>+
>+    uintN n = nargs + nvars;
>+    if (n != 0) {
>+        JS_LOCK_OBJ(cx, callobj);
>+        memcpy(scopeChain->dslots, argv, nargs * sizeof(jsval));
>+        memcpy(scopeChain->dslots + nargs, slots, nvars * sizeof(jsval));
>+        JS_UNLOCK_OBJ(cx, scopeChain);
>+    }

Nit: again, factor the common memcpy code here and in js_PutCallObject into inline helper. Also, as discussed, drop the lock/unlock pair as the Call object must not be a subject of concurrent access.


>diff --git a/js/src/jstracer.cpp b/js/src/jstracer.cpp
>@@ -1985,17 +1991,17 @@ NativeStackSlots(JSContext *cx, >         if (fp->argv)
>-            slots += fp->script->nfixed + 1 /*argsobj*/;
>+            slots += fp->script->nfixed + 2 /*argsobj,scopeChain*/;


Add a named constant for that number 2 for better maintenance and to guide the reader that this 2 is not the same as the one below. 

>         if (depth-- == 0) {
>             if (fp->argv)
>                 slots += 2/*callee,this*/ + argSlots(fp);

>@@ -5290,17 +5283,17 @@ SynthesizeFrame(JSContext* cx, const Fra
...
>-           script->nfixed + 1/*argsobj*/;
>+           script->nfixed + 2/*argsobj,scopeChain*/;

Again, use a named constant here. r+ with this addressed.
Attachment #402930 - Flags: review?(igor)
Attachment #402930 - Flags: review?(jorendorff) → review+
Pushed to TM as 9cc88d291fc0.
Whiteboard: fixed-in-tracemonkey
Backed out because of talos sunspider regressions.
Whiteboard: fixed-in-tracemonkey
Blocks: 532696
Attached patch Patch 4 rebased to tip (obsolete) — Splinter Review
This still seems to work, but there's a massive slowdown on SunSpider benchmarks that use recursion. (I developed the original patch before recursion, and it messes with stack frames, so that's not too surprising. It may just break recursion completely.) 

I also get a 5-10% slowdown on the other benchmarks, but I'm not sure that's real. (I sometimes get up to 15% variances on SS for no discernable reason.)

I'll check into the recursion issue to see what we can do.
First try disabled tracing recursion. This one doesn't. I still need to test SunSpider on a desktop to see if it causes perf regressions, but it's looking OK right now.
Attachment #421379 - Attachment is obsolete: true
Attachment #421382 - Attachment is patch: true
Attachment #421382 - Attachment mime type: application/octet-stream → text/plain
Attached patch Patch 6 (patch 5 as intended) (obsolete) — Splinter Review
Patch 5 was missing a pair of local changes.
Attachment #421382 - Attachment is obsolete: true
The latest version of this patch seems to cause no change on SunSpider, but a 3% regression on v8 (as run by the SunSpider harness), mostly in v8-deltablue. Is that landable or not?
(In reply to comment #29)
> The latest version of this patch seems to cause no change on SunSpider, but a
> 3% regression on v8 (as run by the SunSpider harness), mostly in v8-deltablue.
> Is that landable or not?

Why did v8-deltablue slow down?
Attached patch Patch 7Splinter Review
dvander informed me of an opportunity for optimization that might make the v8-deltablue regression go away, and it did.

Investigating the 3% slowdown, I found that we take about 2M trace exits on deltablue, and our JIT perf is 50% worse than interpreted. We figured the slowdown was probably related to increased import/export costs from the larger frame, which are insignificant for 'good' test cases, but could matter if we do 2M exits.

The optimization is to cache the temporary typemap used in LeaveTree inside TraceMonitor, which saves 1-2 |realloc| calls per exit (in deltablue, but that's probably about right in general). As a result, with the new patch, we speed up deltablue by 3%, compared to a 3% slowdown in the previous version. Overall, this new patch has no effect on SS or v8 (actually, it seems like it might speed them both up, but by less than 1%). 

jorendorff already reviewed the basic JSOP_LAMBDA logic, which is unchanged from the version he reviewed, so I'm sending to dvander to evaluate the recursion fixes and typemap caching. In particular, I think the lack of encapsulation is kind of yucky, but it seemed not worthwhile to make C++y wrappers for something that is used in only one place.
Attachment #421701 - Attachment is obsolete: true
Attachment #421747 - Flags: review?(dvander)
Attachment #421747 - Flags: review?(dvander) → review+
http://hg.mozilla.org/tracemonkey/rev/910ee7db07de
Whiteboard: fixed-in-tracemonkey
Blocks: 536564
http://hg.mozilla.org/mozilla-central/rev/910ee7db07de
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Depends on: 552599
Depends on: 583757
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: