Closed Bug 554626 Opened 10 years ago Closed 10 years ago

Fix for bug 497789 causes big increase in L2 misses

Categories

(Core :: JavaScript Engine, defect, P1)

Other Branch
defect

Tracking

()

RESOLVED FIXED
mozilla1.9.3a4

People

(Reporter: jseward, Assigned: brendan)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 3 obsolete files)

Bug 497789 contains a fix which majorly improves TM performance on
v8-deltablue, but causes slowdowns of 17% on v8-raytrace and 6% on
v8-earley-boyer.

Initial investigations on the lossage for v8-raytrace indicate this is
mostly IPC lossage due to a doubling of the L2 miss rate.  For initial
analysis and a bunch of background numbers see the following:

https://bugzilla.mozilla.org/show_bug.cgi?id=497789#c105
https://bugzilla.mozilla.org/show_bug.cgi?id=497789#c106
https://bugzilla.mozilla.org/show_bug.cgi?id=497789#c107
https://bugzilla.mozilla.org/show_bug.cgi?id=497789#c108
https://bugzilla.mozilla.org/show_bug.cgi?id=497789#c109

From these it is apparent that the fix not only doubles the L2 miss
rate, it also nearly doubles the heap space requirement for running
v8-raytrace.

The attached file contains before/after stats for L2 miss rates broken
down on a per-function basis.  From this it would appear that the
increase is due to maintenance/use of a hash table that has something
to do with the property tree.
(In reply to comment #0)
> 
> From these it is apparent that the fix not only doubles the L2 miss
> rate, it also nearly doubles the heap space requirement for running
> v8-raytrace.

That sounds worrisome. Do we understand why it happened?
hmm, I see the patch in bug 497789 uses JSDHashTable. js::HashTable outperforms JSDHashTable in microbenchmarks, so maybe it will help here. Probably won't do much about the heap space increase, though.
I'm looking into this now.  It seems many of the misses arise
from js::PropertyTree::getChild calling JS_DHashTableOperate
calling ChangeTable.
If most of the overhead is in ChangeTable, then the problem would seem to be that the hash table is growing a lot then shrinking a lot over and over.  A simple fix to reduce the frequency of resizes is to use JS_DHashTableSetAlphaBounds to set a higher maxAlpha and lower minAlpha.
Julian, could you post the stack trace that leads to getChild.
The stack won't tell us enough. I'm looking into this. Shape generation number is too high at end of benchmark, and combined with Julian's data this suggests too many empty scope shapes. More when I know more.

Luke: the table is not shrinking except during GC and there's no GC run during v8-raytrace.js's execution.

There's a followup bug to use jshashtable.h instead of jsdhash.h but it won't help here.

/be
Assignee: general → brendan
Status: NEW → ASSIGNED
OS: Mac OS X → All
Priority: -- → P1
Hardware: x86 → All
Target Milestone: --- → mozilla1.9.3a4
Oh, right, I forgot, ChangeTable is also called when the table size stays the same but the number of tombstones grows too high.
(In reply to comment #5)

Err, being very confused by Callgrind here.  But anyway, what I think
I have is this, the numbers being the sum (L2 rd misses + L2 wr
misses):

1498232 js::PropertyTree::getChild

The trail leading to here goes directly back like this:

1498232 JSScope::getChildProperty
1498488 JSScope::addPropertyHelper
1498787 JSScope::putProperty
1498636 js_DefineNativeProperty
1498587 js_DefineProperty
1498289 args_resolve
1499144 js_LookupPropertyWithFlags <cycle 1>

Then splits, there are 5 callers to js_LookupPropertyWithFlags

  78039  js_Interpret
  134321 js_Arguments
  218606 JS_GetElement
  221164 js_GetArgsValue
  835872 js_GetLengthProperty

chasing caller of js_GetLengthProperty:

    835872  js_IsArrayLike

in turn called by

       792170 js_fun_apply

then we're back at js_Interpret.

Does that make any sense?
Interesting. Looks like this is the arguments object resolve hook constantly defining properties.
From what I can see, we touch arguments objects a lot here. Every time we read an indexed property of the arguments object (0, 1, 2, ...) we run the resolve hook and create a property on it on the fly. That seems to be slow. I will try to find out whether this is properly shared in the prop tree, or not even that. In general we should consider a denser (dense array-like) representation of arguments objects (I think).
Comment 10 is right on the money about how we badly deoptimize arguments.length and arguments[i] accesses, the latter in particular from js_fun_apply. I am fixing that.

But the shape proliferation is due to another bug: arguments objects each get their own scope on creation, with its own shape, instead of sharing an emptyScope owned by their proto.

This is because all arguments objects within a given scope-context (scope chain ending in a given global object) have Object.prototype as their proto, and since the class of the newborn differs from the class of its proto, Object.prototype cannot provide a shared emptyScope for each new arguments object to share.

This is important to fix too. Patch soon.

/be
Attached patch proposed patch (obsolete) — Splinter Review
This looks like it'll be a huge perf win.

/be
Attachment #435537 - Flags: review?(igor)
Julian, thanks again for all the help.

/be
Comment on attachment 435537 [details] [diff] [review]
proposed patch

From a quick look so far:

> static JSObject *
> NewArguments(JSContext *cx, JSObject *parent, uint32 argc, JSObject *callee)
> {
>-    JSObject *argsobj = js_NewObject(cx, &js_ArgumentsClass, NULL, parent, 0);
>-    if (!argsobj || !js_EnsureReservedSlots(cx, argsobj, argc))
>+    JSObject *argsobj = js_NewGCObject(cx);
>+    if (!argsobj)
>+        return NULL;
>+
>+    argsobj->map = cx->runtime->emptyArgumentsScope;
>+    cx->runtime->emptyArgumentsScope->hold();

Is this hold call truly necessary since the empty scope persists over runtime lifetime?

>+
>+    JSObject *proto;
>+    if (!js_GetClassPrototype(cx, parent, JSProto_Object, &proto))
>+        return NULL;

>+    argsobj->init(&js_ArgumentsClass, proto, parent, JSVAL_NULL);

The init call must always follow the js_NewGCObject call not to expose uninitialized object to the GC.
Attachment #435537 - Flags: review?(igor) → review-
Comment on attachment 435537 [details] [diff] [review]
proposed patch

More comments:

>@@ -2094,21 +2095,39 @@ js_fun_apply(JSContext *cx, uintN argc, 
...
>+    if (aobj && aobj->isArguments()) {
>+        JSStackFrame *fp = (JSStackFrame *) aobj->getPrivate();
>+        if (fp) {
>+            memcpy(sp, fp->argv, argc * sizeof(jsval));
>+            for (i = 0; i < argc; i++) {
>+                if (aobj->dslots[i] == JSVAL_HOLE) // suppress deleted element
>+                    sp[i] = JSVAL_VOID;
>+            }
>+        } else {
>+            JS_STATIC_ASSERT(JSSLOT_ARGS_COPY_START == JS_INITIAL_NSLOTS);
>+            memcpy(sp, aobj->dslots, argc * sizeof(jsval));
>+            for (i = 0; i < argc; i++) {
>+                if (sp[i] == JSVAL_HOLE)
>+                    sp[i] = JSVAL_VOID;
>+            }

if/else here can be merged into 

            JS_STATIC_ASSERT(JSSLOT_ARGS_COPY_START == JS_INITIAL_NSLOTS);
            memcpy(sp, fp ? fp->argv : aobj->dslots, argc * sizeof(jsval));
            for (i = 0; i < argc; i++) {
                if (sp[i] == JSVAL_HOLE)
                    sp[i] = JSVAL_VOID;
            }

Also the code avoids locking - see below.

>diff --git a/js/src/jsfun.h b/js/src/jsfun.h
>--- a/js/src/jsfun.h
>+++ b/js/src/jsfun.h
> extern JSClass js_ArgumentsClass;
>+
>+inline bool
>+JSObject::isArguments() const
>+{
>+    return getClass() == &js_ArgumentsClass;
>+}

Defining the inline here looks arbitrary. jsobjinlines.h is a better place IMO. Alternatively extern JSClass js_ArgumentsClass can e moved into jsobj.h

> 
> BEGIN_CASE(JSOP_CALLPROP)
>@@ -1885,29 +1892,46 @@ BEGIN_CASE(JSOP_GETELEM)
>             rval = STRING_TO_JSVAL(str);
>             goto end_getelem;
>         }
>     }
> 
>     VALUE_TO_OBJECT(cx, -2, lval, obj);
>     if (JSVAL_IS_INT(rval)) {
>         if (obj->isDenseArray()) {
>-            jsuint length;
>-
>-            length = js_DenseArrayCapacity(obj);
>-            i = JSVAL_TO_INT(rval);
>-            if ((jsuint)i < length &&
>-                i < obj->fslots[JSSLOT_ARRAY_LENGTH]) {
>-                rval = obj->dslots[i];
>+            jsuint idx = JSVAL_TO_INT(rval);

Nit: add a cast from int to jsuint around JSVAL_TO_INT.

>+
>+            if (idx < jsuint(obj->fslots[JSSLOT_ARRAY_LENGTH]) &&
>+                idx < js_DenseArrayCapacity(obj)) {
>+                rval = obj->dslots[idx];
>                 if (rval != JSVAL_HOLE)
>                     goto end_getelem;
> 
>                 /* Reload rval from the stack in the rare hole case. */
>                 rval = FETCH_OPND(-1);
>             }
>+        } else if (obj->isArguments()
>+#ifdef JS_TRACER
>+                   && !GetArgsPrivateNative(obj)
>+#endif
>+                  ) {
>+            uint32 arg = uint32(JSVAL_TO_INT(rval));
>+
>+            if (arg < GetArgsLength(obj)) {
>+                JSStackFrame *afp = (JSStackFrame *) obj->getPrivate();
>+                if (afp) {
>+                    rval = afp->argv[arg];
>+                    goto end_getelem;
>+                }
>+
>+                rval = OBJ_GET_SLOT(cx, obj, JSSLOT_ARGS_COPY_START + arg);

The patch assumes in other places that the argument object is single-threaded so accessing its dynamic slots does not require locking. Thus the code should either use locking everywhere or avoid it completly and use STOBJ_GET_LOT here and in PutArguments. Also, to follow other examples the code should statically assert that JSSLOT_ARGS_COPY_START == JS_INITIAL_NSLOTS and access the dslots directly.

>diff --git a/js/src/jsprvtd.h b/js/src/jsprvtd.h
> 
> /*
>- * Maximum supported value of Arguments.length. It bounds the maximum number
>- * of arguments that can be supplied to the function call using
>- * Function.prototype.apply. This value also gives the maximum number of
>- * elements in the array initializer.
>+ * Maximum supported value of arguments.length. It bounds the maximum number of
>+ * arguments that can be supplied via the second (so-called |argArray|) param
>+ * to Function.prototype.apply. This value also bounds the number of elements
>+ * parsed in an array initialiser.
>  */
> #define JS_ARGS_LENGTH_MAX      (JS_BIT(24) - 1)

Nit: add a static assert that JS_ARGS_LENGTH_MAX <= int jsval max.
(In reply to comment #14)
> > static JSObject *
> > NewArguments(JSContext *cx, JSObject *parent, uint32 argc, JSObject *callee)
> > {
> >-    JSObject *argsobj = js_NewObject(cx, &js_ArgumentsClass, NULL, parent, 0);
> >-    if (!argsobj || !js_EnsureReservedSlots(cx, argsobj, argc))
> >+    JSObject *argsobj = js_NewGCObject(cx);
> >+    if (!argsobj)
> >+        return NULL;
> >+
> >+    argsobj->map = cx->runtime->emptyArgumentsScope;
> >+    cx->runtime->emptyArgumentsScope->hold();
> 
> Is this hold call truly necessary since the empty scope persists over runtime
> lifetime?

It is needed to balance either this drop at end of js_GetMutableScope:

    static_cast<JSEmptyScope *>(scope)->drop(cx);

or this one:

            static_cast<JSEmptyScope *>(scope)->dropFromGC(cx);

from FinalizeObject, in the case where the arguments object never was mutated in a way that gave it its own scope.

> >+    JSObject *proto;
> >+    if (!js_GetClassPrototype(cx, parent, JSProto_Object, &proto))
> >+        return NULL;
> 
> >+    argsobj->init(&js_ArgumentsClass, proto, parent, JSVAL_NULL);
> 
> The init call must always follow the js_NewGCObject call not to expose
> uninitialized object to the GC.

Ok, fixing -- thanks. I'm relying on a newborn root here, aren't I? Yuck.

(In reply to comment #15)
> >@@ -2094,21 +2095,39 @@ js_fun_apply(JSContext *cx, uintN argc, 
> ...
> >+    if (aobj && aobj->isArguments()) {
> >+        JSStackFrame *fp = (JSStackFrame *) aobj->getPrivate();
> >+        if (fp) {
> >+            memcpy(sp, fp->argv, argc * sizeof(jsval));
> >+            for (i = 0; i < argc; i++) {
> >+                if (aobj->dslots[i] == JSVAL_HOLE) // suppress deleted element
> >+                    sp[i] = JSVAL_VOID;
> >+            }
> >+        } else {
> >+            JS_STATIC_ASSERT(JSSLOT_ARGS_COPY_START == JS_INITIAL_NSLOTS);
> >+            memcpy(sp, aobj->dslots, argc * sizeof(jsval));
> >+            for (i = 0; i < argc; i++) {
> >+                if (sp[i] == JSVAL_HOLE)
> >+                    sp[i] = JSVAL_VOID;
> >+            }
> 
> if/else here can be merged into 
> 
>             JS_STATIC_ASSERT(JSSLOT_ARGS_COPY_START == JS_INITIAL_NSLOTS);
>             memcpy(sp, fp ? fp->argv : aobj->dslots, argc * sizeof(jsval));
>             for (i = 0; i < argc; i++) {
>                 if (sp[i] == JSVAL_HOLE)
>                     sp[i] = JSVAL_VOID;
>             }

This is not equivalent, since the first loop in the patch copies from fp->argv but tests aobj->dslots[i] against JSVAL_HOLE.

> Also the code avoids locking - see below.

MT arguments objects? Just say no ;-).

> >+++ b/js/src/jsfun.h
> > extern JSClass js_ArgumentsClass;
> >+
> >+inline bool
> >+JSObject::isArguments() const
> >+{
> >+    return getClass() == &js_ArgumentsClass;
> >+}
> 
> Defining the inline here looks arbitrary. jsobjinlines.h is a better place IMO.
> Alternatively extern JSClass js_ArgumentsClass can e moved into jsobj.h

We already have the JSObject::is<Class> tests farmed out to the .h files where the extern JSClass (sometimes friend, sometimes not) declarations live. I was just following that pattern and prefer it for modularity reasons (ok, you can't have it both ways, but putting all the JSClass externs together seems slightly less modular, with precedent favoring keeping them subsidiary).

> The patch assumes in other places that the argument object is single-threaded
> so accessing its dynamic slots does not require locking. Thus the code should
> either use locking everywhere or avoid it completly and use STOBJ_GET_LOT here
> and in PutArguments. Also, to follow other examples the code should statically
> assert that JSSLOT_ARGS_COPY_START == JS_INITIAL_NSLOTS and access the dslots
> directly.

Ok, good points.

Thanks for the fast turnaround. I will update the patch today, in a few hours.

/be
(In reply to comment #15)
> > #define JS_ARGS_LENGTH_MAX      (JS_BIT(24) - 1)
> 
> Nit: add a static assert that JS_ARGS_LENGTH_MAX <= int jsval max.

You already did, in jsfun.h:

jsfun.h:JS_STATIC_ASSERT(JS_ARGS_LENGTH_MAX <= JS_BIT(30));
jsfun.h:JS_STATIC_ASSERT(jsval((JS_ARGS_LENGTH_MAX << 1) | 1) <= JSVAL_INT_MAX);

But this reminds me: why is JS_ARGS_LENGTH_MAX #define'd in jsprvtd.h instead of in jsfun.h?

$ grep -l JS_ARGS_LENGTH_MAX *.h *.cpp | sort > /tmp/1
$ grep -l jsfun.h *.cpp | sort > /tmp/2
$ comm -3 /tmp/[12]
	jsapi.cpp
	jscntxt.cpp
	jsdbgapi.cpp
	jsdtracef.cpp
	jsemit.cpp
	jsexn.cpp
jsfun.h
	jsgc.cpp
	jsiter.cpp
	jsobj.cpp
	json.cpp
	jsopcode.cpp
jsops.cpp
jsprvtd.h
	jsregexp.cpp
jsscope.cpp
	jsscript.cpp
jsstr.cpp
	jstracer.cpp
	jsxml.cpp

And:

$ grep -l '#include "jsfun.h"' *.h
jsinterp.h
jsscopeinlines.h
jsscriptinlines.h

so it seems JS_ARGS_LENGTH_MAX could be moved to jsfun.h with only two direct #include "jsfun.h" being added (to jsscope.cpp, jsstr.cpp, which already have indirect dependencies on jsfun.h). What do you think?

/be
(In reply to comment #16)
> > Is this hold call truly necessary since the empty scope persists over runtime
> > lifetime?
> 
> It is needed to balance either this drop at end of js_GetMutableScope:
> 
>     static_cast<JSEmptyScope *>(scope)->drop(cx);

Right, I will file a bug to avoid that drop for known shared empty scopes.

> > >+    argsobj->init(&js_ArgumentsClass, proto, parent, JSVAL_NULL);
> > 
> > The init call must always follow the js_NewGCObject call not to expose
> > uninitialized object to the GC.
> 
> Ok, fixing -- thanks. I'm relying on a newborn root here, aren't I? Yuck.

This is not a problem. The real one is that js_NewGCObject returns something that must be initialized before the GC can ever reach it perhaps via newborn scanning.

> This is not equivalent, since the first loop in the patch copies from fp->argv
> but tests aobj->dslots[i] against JSVAL_HOLE.

Yep, I was wrong. Still the code needs comments that at least points into the arguments code in jsfun.cpp.

> 
> > Also the code avoids locking - see below.
> 
> MT arguments objects? Just say no ;-).

Whatever the preference, it should be consistent.
(In reply to comment #17)
> so it seems JS_ARGS_LENGTH_MAX could be moved to jsfun.h with only two direct
> #include "jsfun.h" being added (to jsscope.cpp, jsstr.cpp, which already have
> indirect dependencies on jsfun.h). What do you think?

Yes, lets do it.
Blocks: 555722
Attached patch proposed patch, v2 (obsolete) — Splinter Review
Attachment #435537 - Attachment is obsolete: true
Attachment #435669 - Flags: review?(igor)
Comment on attachment 435675 [details] [diff] [review]
interdiff of two patches (not sure why bugzilla interdiff failed)

> PutArguments(JSContext *cx, JSObject *argsobj, jsval *args)
> {
>     uint32 argc = GetArgsLength(argsobj);
>-    JS_LOCK_OBJ(cx, argsobj);
>     for (uint32 i = 0; i != argc; ++i) {
>-        jsval v = STOBJ_GET_SLOT(argsobj, JSSLOT_ARGS_COPY_START + i);
>+        jsval v = argsobj->dslots[i];
>         if (v != JSVAL_HOLE)
>-            STOBJ_SET_SLOT(argsobj, JSSLOT_ARGS_COPY_START + i, args[i]);
>+            argsobj->dslots[i] = args[i];

The final nit: if we ever change the number of fixed slots in the object it would be necessary to find all those places with hard-coded assumption that JSSLOT_ARGS_COPY_START == JS_INITIAL_NSLOTS. To avoid that and to make the slot access self-documenting I suggest to add a helper method like ArgumentCopySlots(obj) { return obj->dslots; } and always use it when accessing argument slots.
Attachment #435669 - Flags: review?(igor) → review+
(In reply to comment #22)
> (From update of attachment 435675 [details] [diff] [review])
> > PutArguments(JSContext *cx, JSObject *argsobj, jsval *args)
> > {
> >     uint32 argc = GetArgsLength(argsobj);
> >-    JS_LOCK_OBJ(cx, argsobj);
> >     for (uint32 i = 0; i != argc; ++i) {
> >-        jsval v = STOBJ_GET_SLOT(argsobj, JSSLOT_ARGS_COPY_START + i);
> >+        jsval v = argsobj->dslots[i];
> >         if (v != JSVAL_HOLE)
> >-            STOBJ_SET_SLOT(argsobj, JSSLOT_ARGS_COPY_START + i, args[i]);
> >+            argsobj->dslots[i] = args[i];
> 
> The final nit: if we ever change the number of fixed slots in the object it
> would be necessary to find all those places with hard-coded assumption that
> JSSLOT_ARGS_COPY_START == JS_INITIAL_NSLOTS. To avoid that and to make the slot
> access self-documenting I suggest to add a helper method like
> ArgumentCopySlots(obj) { return obj->dslots; } and always use it when accessing
> argument slots.

Sure -- I was really pulling the wrong direction here, given njn's bug 555429.

/be
Attached patch patch to commitSplinter Review
I took the liberty of shortening a few names and using const instead of #define.

/be
Attachment #435669 - Attachment is obsolete: true
Attachment #435675 - Attachment is obsolete: true
Attachment #435726 - Flags: review+
Any word on the resulting perf win?
On my noisy MBP:

    raytrace:     1.96x as fast      2345.6ms +/- 3.6%    1198.9ms +/- 3.7%     significant

/be
>     raytrace:     1.96x as fast      2345.6ms +/- 3.6%    1198.9ms +/- 3.7%    

Awesome.  Is that the only one affected?
(In reply to comment #27)
> >     raytrace:     1.96x as fast      2345.6ms +/- 3.6%    1198.9ms +/- 3.7%    
> 
> Awesome.  Is that the only one affected?

$ grep -wl arguments v8/*.js t/*.js
v8/earley-boyer.js
v8/raytrace.js
t/crypto-md5.js
t/crypto-sha1.js
t/string-unpack-code.js

So earley-boyer might speed up, along with the SunSpider ones that use arguments, but of course it depends on frequency at runtime of arguments property gets and sets.

Here's the full v8-suite comparison:

$ ./sunspider-compare-results --shell=$HOME/nojs --v8-suite


TEST              COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:      1.073x as fast    24762.3ms +/- 1.0%   23074.4ms +/- 0.9%     significant

=============================================================================

  v8:             1.073x as fast    24762.3ms +/- 1.0%   23074.4ms +/- 0.9%     significant
    crypto:       1.033x as fast     3605.6ms +/- 1.9%    3490.0ms +/- 2.4%     significant
    deltablue:    -                  5198.9ms +/- 1.6%    5124.9ms +/- 3.2% 
    earley-boyer: 1.041x as fast     2654.7ms +/- 3.0%    2549.9ms +/- 2.9%     significant
    raytrace:     1.96x as fast      2345.6ms +/- 3.6%    1198.9ms +/- 3.7%     significant
    regexp:       *1.063x as slow*   1591.5ms +/- 2.5%    1691.8ms +/- 3.1%     significant
    richards:     1.038x as fast     7921.9ms +/- 1.0%    7628.8ms +/- 1.7%     significant
    splay:        -                  1444.1ms +/- 4.9%    1390.1ms +/- 3.4% 

I haven't had time (moving households) to run SunSpider tests, and my MBP is as noisy as anyone's. Maybe someone else can try. I'll rebase the patch now that the great back-out/re-add seems to be over.

/be
http://hg.mozilla.org/tracemonkey/rev/bf9c4630fa38

/be
Whiteboard: fixed-in-tracemonkey
(In reply to comment #28)
> t/crypto-md5.js
> t/crypto-sha1.js

These don't really use arguments, just have "arguments" in some comment text.
http://hg.mozilla.org/mozilla-central/rev/bf9c4630fa38

/be
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
This seems to have broken at least some uses of jquery; see bug 597816
You need to log in before you can comment on or make changes to this bug.