Closed Bug 480132 Opened 11 years ago Closed 11 years ago

SpiderMonkey clones too many blocks into the heap


(Core :: JavaScript Engine, defect)

Not set





(Reporter: jimb, Assigned: jimb)



(Keywords: verified1.9.1, Whiteboard: [tb3needs])


(6 files, 4 obsolete files)

In JavaScript, it's not possible to tell in advance which local variables may be allocated solely on the stack, and which must live in the heap because they have been closed over by a function object.  When possible, it's cheaper to allocate them on the stack only.

SpiderMonkey initially allocates all local variables on the stack, and points fp->blockChain at immutable, compiler-allocated block objects to track them.  If we do close over those blocks, we create clones of those immutable block objects, put the clones on fp->scopeChain, and then copy the variables from the stack into the clones when we pop them.

The first time SpiderMonkey clones any blocks, it sets JSFRAME_POP_BLOCKS in fp->flags.  Once that flag is set, we clone all subsequent lexical blocks as soon as we enter them, whether they are closed over or not.  Thus, once a frame creates a single closure (that actually captures a lexical block), all subsequent lexical block entry and exit in that frame becomes slower and increases GC load --- even if those blocks are never closed over.

However, SpiderMonkey is already doing all the bookkeeping it needs to clone only those blocks that actually serve as parents of closures.  The attached patch eliminates JSFRAME_POP_BLOCKS and clones only when we actually create a closure (or enter a 'with' block, *sigh*).

Taking advantage of that would be independent of the work in bug 452498 ("upvar, part deux"), which pursues a different strategy for reducing block capture.

I also suspect that a lot of the assertion failures like that in bug 477733 are a result of the JIT not being able to tell whether it needs to clone blocks, since JSFRAME_POP_BLOCKS is imprecise, and can be clear when we record, but set later.  If SpiderMonkey were to adopt the technique implemented in this patch, it would be simpler for the tracer to recognize the situations it can't handle.
Assignee: general → jim
Comment on attachment 364108 [details] [diff] [review]
Clone lexical blocks only when needed.  Interpreter-only version.

Jason, would you be willing to look this over?  If you have suggestions for the JIT side, that'd be especially cool.
Attachment #364108 - Flags: review?(jorendorff) didn't notice, but Sunspider says this change is a slowdown.  This was pretty mysterious, as SunSpider doesn't use ENTERBLOCK at all, and so should be unaffected by the changes to js_Interp, and never make it into the changed portions of js_GetScopeChain.  oprofile has a user interface only a mother could love, but I think what's going on is that several other call sites for js_GetScopeChain inline js_GSC's first test, to avoid calling it altogether --- and those tests are now wrong, so js_GetScopeChain is being called more often; we're incurring the function call overhead again.

I'll put in some instrumentation to verify this, and see if I can update the call sites.
I was wrong --- js_GetScopeChain is called exactly as many times with and without the patch.  The heuristics at the call sites mean now something slightly different ("not in the static scope of any blocks" vs. "scope chain is already current"), but they work just as well to protect code that doesn't use let.

The slowdown is due to the new code's reference to js_WithClass.  Mentioning that causes GCC to emit code to find the global offset table:

     c11:	53                   	push   %ebx
     c12:	83 ec 1c             	sub    $0x1c,%esp
     c15:	8b 45 0c             	mov    0xc(%ebp),%eax
     c18:	e8 fc ff ff ff       	call   c19 <js_GetScopeChain+0xd>
			c19: R_386_PC32	__i686.get_pc_thunk.bx
     c1d:	81 c3 02 00 00 00    	add    $0x2,%ebx

All this code disappears if I replace '&js_WithClass' with 'NULL'.  It's annoying that GCC insists on establishing the GOT pointer before it needs it.
(The point being that the above code runs upon entry to js_GetScopeChain, before the first 'if', and thus lies on the fast path even though the GOT pointer is dead on the fast path.)
I've changed the patches so that the js_GetScopeChain fast path no longer tries to compute the GOT address.  It is still 2.3% slower.

I ran SunSpider with these patches applied and not applied, 4800 runs on each, with oprofile collecting data.  The attached graph shows the results of the comparison, looking at time spent in jsinterp.cpp:
- Unpatched SM is on the left, patched SM is on the right.
- Each horizontal row of pixels corresponds to one source line.
- Labels down the sides show line numbers; the two files are aligned a la diff, so the line numbers don't line up.
- Excess cycles on the left show up in green; on the right, in red.
- The horizontal scale is linear, so area is proportional to time.

Since control never reaches a modified line, there are no lines that show counts only on one side or the other.

I've looked at the places where there is significant red.  One is just before line 3900 in the patched SM.  That is the code for JSOP_SUB, a use of BINARY_OP(-).  I've gone through and looked at the disassembly for all six basic blocks that result from the expansion of BINARY_OP(-); they're identical.

So I'm not really sure what to attribute the slowdown to.  I'd love to blame it on code cache line conflicts; I'm going to re-profile with a different event counter and see.
Here oprofile counted L1I_MISSES events (number of instruction fetch misses).  You can see that the patched profile, on the right, is doing a lot worse in this regard.
Attached file microbenchmark showing improvement (obsolete) —
For what it's worth, here's a microbenchmark showing the improvement this patch makes.  SunSpider shows SpiderMonkey to be 2.63x as fast with the patch applied.
Use this one instead.
Attachment #365743 - Attachment is obsolete: true
With the patch above:

TEST           COMPARISON            FROM                 TO             DETAILS


** TOTAL **:   2.25x as fast     251.3ms +/- 2.0%   111.5ms +/- 4.6%     significant


  controlflow: 2.25x as fast     251.3ms +/- 2.0%   111.5ms +/- 4.6%     significant
    blocks:    2.25x as fast     251.3ms +/- 2.0%   111.5ms +/- 4.6%     significant
(In reply to comment #10)
> With the patch above:

can you attach full sunspider results, patch vs. no patch?
Attachment #365994 - Attachment mime type: application/octet-stream → text/plain
In the nsieve-only results, I'm seeing jumps in runtime in JSOP_SETELEM, JSOP_GETLOCAL, and on the END_CASE for JSOP_LOOP.
Here are the profile counts I get for the head of JSOP_SETELEM unpatched:

               :          BEGIN_CASE(JSOP_SETELEM)
               :            rval = FETCH_OPND(-1);
  4793  0.1937 : 80e8fe3:	mov    -0x5c(%ebp),%edx
     6 2.4e-04 : 80e8fe6:	mov    -0x4(%edx),%eax
   152  0.0061 : 80e8fe9:	mov    %eax,-0x44(%ebp)
               :            FETCH_OBJECT(cx, -3, lval, obj);
    73  0.0030 : 80e8fec:	mov    -0xc(%edx),%eax
  5020  0.2029 : 80e8fef:	test   $0x7,%al
               : 80e8ff1:	je     80ed50d <js_Interpret+0x63d5>

Here are the corresponding lines, patched:

               :          BEGIN_CASE(JSOP_SETELEM)
               :            rval = FETCH_OPND(-1);
  5006  0.2023 : 80e8ed9:	mov    -0x5c(%ebp),%edx
 44201  1.7864 : 80e8edc:	mov    -0x4(%edx),%eax
  9404  0.3801 : 80e8edf:	mov    %eax,-0x44(%ebp)
               :            FETCH_OBJECT(cx, -3, lval, obj);
  3112  0.1258 : 80e8ee2:	mov    -0xc(%edx),%eax
  1852  0.0748 : 80e8ee5:	test   $0x7,%al
               : 80e8ee7:	je     80ed63d <js_Interpret+0x6859>

The first instruction is fetching regs.sp; the second is fetching the value at the top of the stack.  I don't see why that second instruction should be significantly slower.
I'm going to do a longer profiling run, to have more confidence I'm not looking at noise.
Is there a pattern in the alignment of regs.sp?
I'm going to try a little VTune on this. 

With the SETELEM thing, it's pretty unobvious what's going on, especially because if the first instruction is slow, it could be making the second one slow. It might even be a symptom of an L1 miss on the first instruction. You could look at indirect branch mispredict rates. You could also try inserting NOPs between the first and second instructions to see if it does or doesn't have an effect.
As of now, I'm not seeing much of any perf difference. I got +8 ms the last time I tried it. I suspect this is just the usual noise in js_Interpret from GCC moving blocks around or frobulating its register allocator or whatever.

Also, there is no good reason to believe this would cause a perf difference: it only touches js_GetScopeChain, and the total execution time of this function is negligible in all the SS benchmarks I looked at.
Out of paranoia, I checked that the bytecodes being executed were not affected by the patch; they're not.
This should get in and bake, no need to worry about noise-level effects.

Igor should prolly review (I owe him two reviews, so I'm ducking; he did the last major improvement to js_GetScopeChain during Firefox 3 era, with mrbkap and me as buddies).

Blocks: 477733
Attachment #364108 - Flags: review?(igor)
Comment on attachment 364108 [details] [diff] [review]
Clone lexical blocks only when needed.  Interpreter-only version.

Does this passes the shell test suite? 

>diff --git a/js/src/jsinterp.cpp b/js/src/jsinterp.cpp
>--- a/js/src/jsinterp.cpp
>+++ b/js/src/jsinterp.cpp
>@@ -663,11 +663,9 @@ js_FreeStack(JSContext *cx, void *mark)
> JSObject *
> js_GetScopeChain(JSContext *cx, JSStackFrame *fp)
> {
>-    JSObject *obj, *cursor, *clonedChild, *parent;
>-    JSTempValueRooter tvr;
>-    obj = fp->blockChain;
>-    if (!obj) {
>+    JSObject *sharedBlock = fp->blockChain;
>+    if (! sharedBlock) {

Nit: no space after !.

>         if (!js_GetCallObject(cx, fp))
>             return NULL;
>+        /* We know we must clone everything on blockChain.  */ 
>+        limitBlock = limitClone = NULL;

Nit: blank line before the comments.

>+        while (OBJ_GET_CLASS(cx, limitClone) == &js_WithClass)
>+            limitClone = OBJ_GET_PARENT(cx, limitClone);
>+        JS_ASSERT(limitClone);

Nit: blank line before the comments.

>+    if (! innermostNewChild)
>+        return NULL;

Nit: no space after !.

>+    JSTempValueRooter tvr;
>+    JS_PUSH_TEMP_ROOT_OBJECT(cx, innermostNewChild, &tvr);

Nit: use JSAutoTempValueRooter here (with JSObject* constructor case added).

>+        if (sharedBlock == limitBlock || ! sharedBlock)
>+            break;

Nit: no space after !.

>+        if (! clone) {

Nit: no space after !.

>+    JS_ASSERT_IF(limitBlock && 
>+                 OBJ_GET_CLASS(cx, limitBlock) == &js_BlockClass && 
>+                 OBJ_GET_PRIVATE(cx, limitClone) == fp,
>+                 sharedBlock != NULL);

Nit: Use just sharedBlock, not sharedBlock != NULL, as the condition.

>+#ifdef DEBUG
>+            JS_ASSERT(fp->blockChain == OBJ_GET_PARENT(cx, obj));
>             /*

Nit: a blank line before the comment.
Attachment #364108 - Flags: review?(jorendorff)
This version should cover Igor's comments.  Passes js/tests and js/src/trace-test.js with JIT enabled and disabled.

A "script block" is an object of class Block allocated by the byte
compiler and associated with a script.  Script blocks are never
modified, and may be used as a prototype for a "closure block":

A "closure block" is an object of class Block that holds variables
that have been closed over (although we actually leave the variables
on the stack until we leave their dynamic scope).  A closure block is
a clone of a script block (its prototype is a script block).

Adjust the meanings of fp->blockChain and fp->scopeChain:

  fp->blockChain is always the innermost script block in whose static
  scope we're executing.

  fp->scopeChain is the current scope chain, including 'call' objects
  and closure blocks for those function calls and blocks in whose
  static scope we are currently executing, and 'with' objects for with
  statements; the chain is typically terminated by a global object.
  However, as an optimization, the young end of the chain omits block
  objects we have not yet needed to clone.

Closures need fully reified scope chains, so have js_GetScopeChain
reify any closure blocks missing from the young end of fp->scopeChain
by cloning script blocks as needed from fp->blockChain.  Thus, if we
never actually close over a particular block, we never place a closure
block for it on fp->scopeChain.

Have JSOP_ENTERBLOCK and JSOP_LEAVEBLOCK always keep fp->blockChain
current.  When JSOP_LEAVEBLOCK pops a block from fp->blockChain that
has been cloned on fp->scopeChain, pop fp->scopeChain as well.

Remove the JSFRAME_POP_BLOCKS flag, as it is no longer needed.

Ensure that the JIT won't have to create closure blocks or call
js_PutBlockObject; it can't handle those things yet.  Note our current
script block when we begin recording.  Abort recording if we leave
that block; we can't tell in advance whether it will need to be "put"
in future trace invocations.  Abort recording if we call
js_GetScopeChain while in the static scope of lexical blocks.  Remove
JIT tests based on JSFRAME_POP_BLOCKS.

Verify that generators capture the correct value for blockChain.

Add a constructor to JSAutoTempValueRooter for rooting JSObject
Attachment #364108 - Attachment is obsolete: true
Attachment #367130 - Flags: review?(igor)
Attachment #364108 - Flags: review?(igor)
Comment on attachment 367130 [details] [diff] [review]
Bug 480132: Clone lexical blocks only when needed.

> JSBool
>@@ -6741,51 +6800,57 @@ js_Interpret(JSContext *cx)
>+            obj2 = fp->scopeChain;
>+            while ((clasp = OBJ_GET_CLASS(cx, obj2)) == &js_WithClass)
>+                obj2 = OBJ_GET_PARENT(cx, obj2);
>+            if (clasp == &js_BlockClass &&
>+                OBJ_GET_PRIVATE(cx, obj2) == fp) {
>+                obj2 = OBJ_GET_PROTO(cx, obj2);

Nit: do not reuse here obj2 for something that have very different meaning. A new local like lastClonedBlock as the target of the last assignment together with an assert that it is a static lexical block would be very helpful to clarify what is going.

And thanks for nice comments in the patch!
Attachment #367130 - Flags: review?(igor) → review+
I pushed this

since the tree was quiet. I think I addressed igor's nit correctly, mea culpa if not. The tree was green and quiet, so worth getting the testing in, I think.
well, that didn't last long:

Looks like the nit fix busted the static analysis compile. At least we'll get a run in on all the boxes.
Revised for Igor's last suggestion; he was concerned about the second use of obj2 in the section he quoted, not the first.  I think this should pass the static analysis, because the new variable is local to a block being jumped over, so the 'error' label isn't within its scope.

That log message... a lot of that should be source comments, not a log message.

Pushed to TM as changeset:
Attachment #367130 - Attachment is obsolete: true
Previous patch failed static analysis checks: we should be leaving trace, not aborting recording.  Simply because we reach js_GetScopeChain somehow while recording doesn't mean that we will or will not reach it from trace --- but we cannot stay on trace after cloning blocks, because JITted code does not call js_PutBlockObject.
Attachment #367355 - Attachment is obsolete: true
Closed: 11 years ago
Resolution: --- → FIXED
Flags: blocking1.9.1+
FYI I was getting "Assertion failure: !(fp->flags & JSFRAME_POP_BLOCKS)" from unit test in my Thunderbird 3.0 (Gecko 1.9.1) debug build on Windows 2003 server. I applied the patch from attachment 367356 [details] [diff] [review] on this bug, and the problem went away. This does not seem to appear in the tinderbox builds for some reason, only on my local build.
(In reply to comment #32)
> FYI I was getting "Assertion failure: !(fp->flags & JSFRAME_POP_BLOCKS)" from
> unit test

FTR this is covered by bug 483857.
Blocks: 483857
Jim: what's the next step to getting this landed on the branch?
Whiteboard: [tb3needs]
(In reply to comment #35)
> Landed in 1.9.1:

Adding fixed1.9.1 keyword...
Keywords: fixed1.9.1
No backout within the last 3 weeks. I think we can safely mark this bug as verified now.
OS: Linux → All
Hardware: x86 → All
Target Milestone: --- → mozilla1.9.2a1
Version: unspecified → Trunk
You need to log in before you can comment on or make changes to this bug.