Closed Bug 561506 Opened 14 years ago Closed 14 years ago

JM: optimize adding properties with PIC

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: bhackett1024)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 7 obsolete files)

Attached patch WIP (obsolete) — Splinter Review
Currently, JM adds properties a bit slower than the interpreter, because it has to use the slow path for that (which means doing what the interpreter does, plus a bit of overhead). It should instead do pretty much what the tracer does for that case.

I've started this out, and reached the point of a WIP that can run a simple microbenchmark and show a speedup there. It doesn't look too hard, but it does require a cross-platform method for calling functions with 3 (or maybe more) arguments, which we don't have yet (we just have our standard stub calls and 1-argument calls).
This looks a little tricky to do well, so I slowed down to take some measurements. First, I measured how long it takes to do a property add in various systems:

  JM      63 ns
  JM+TM   25 ns
  SM      50 ns
  JSC     10 ns
  V8      47 ns

JM is slower than SM because right now we do a pure slow stub call, after trying the PIC, so it's basically SM+overhead. Note that JSC is super-fast somehow. I think they are doing roughly what we do.

Second, I measured how many slow stub calls we do. I think most of the ones left are for adds, although I don't know for sure. In SunSpider, there are 125k left, basically all in access-binary trees. In V8, there are about 3M left, 2M in earley-boyer, 1M in raytrace, .7M in deltablue, and .2M in splay.

So, if we can make this as fast in JM as on trace, by doing basically the same thing, then we have left about 5 ms of win on SS and 120 ms on V8. I think this is fairly important to get done at some point, but it doesn't have to be high priority, as nothing depends on it and the perf win is modest and predictable. Also, it will be easier and more effective if bug 558451 is done first.
Depends on: 558451
Now some stats on which paths are taken through the fast-case add. I did this by instrumenting the fast case in jsops.cpp. I measured two dimensions: bump slot vs. realloc slots, and extend scope vs. do AddProperty.

                          slot allocation         scope update
  program                 bump    realloc      extend    addprop         total

  access-binary-trees   126213          0      126213          0        126213
  v8-deltablue          537997     109287      647284          0        647284
  v8-earley-boyer      2126231          0     2126231          0       2126231
  v8-raytrace          1154537       9005     1163542          0       1163542
  v8-splay               60350      24471       84821          0         84821

If we just do the common case, we are only leaving 5ms on V8 of win behind, and the fast case should be simpler and faster.
Attached patch WIP 2 (obsolete) — Splinter Review
This version fully inlines the common path. It still has some bugs.
Attachment #441226 - Attachment is obsolete: true
Great work. We need to get property trees per thread (bug 511591), as adding a property involves taking a lock, which probably accounts for much of our slowdown compared to JSC (confirmation needed).

/be
Do we need to take a lock in the predictable-shape-evolution case?  It seems like, for ST objects, no shared state needs to mutate.
The lock is for adding to the property tree. We tolerate races to lookup and find nothing, then add, so we tolerate dups, but adding a new tree node itself requires a lock or you can race off the end of a kids-chunk list.

So the GC lock is taking for shortish critical sections when adding, but not when looking. If the property tree node already exists, no locking required. So maybe the locking cost isn't the problem -- hard to say without more data.

/be
(In reply to comment #6)
> The lock is for adding to the property tree. We tolerate races to lookup and
> find nothing, then add, so we tolerate dups, but adding a new tree node itself
> requires a lock or you can race off the end of a kids-chunk list.

And add two kids chunks list at the same next link, leaking one.

> So the GC lock is tak[en] for shortish critical sections when adding, but not
> when looking. If the property tree node already exists, no locking required. So
> maybe the locking cost isn't the problem -- hard to say without more data.

Dave, can you profile and find out why add costs us, in detail? Thanks,

/be
(In reply to comment #7)
> Dave, can you profile and find out why add costs us, in detail? Thanks,

My stats are for basic shell builds, so locking doesn't apply. Shark says that of the total time within js_AddProperty, 40% of it is for js_IdIsIndex (via js_AddProperty -> JSScope::extend -> JSScope::updateFlags). Otherwise, the time seems to be distributed across the active parts of the function.

There do seem to be a lot of opportunities for upfront specialization for the hot paths. My WIP 2 still has bugs, so it may be slower when it is done, but currently it is showing a time of about 4ns per add.

Another relevant issue here is that js_GetMutableScope can't be practically inlined into JIT code, so I do the slow path for the case where the base object has a shared mutable scope. Of course, we could still get some speedup by calling out to a simplified builtin, but I think it would be nicer to do after bug 558451.
Attached patch Patch (obsolete) — Splinter Review
OK, this is it. It passes shell tests. My laptop is no good for perf testing, which I want to do before landing.
Attachment #441675 - Attachment is obsolete: true
I'm still getting a few test failures on the v8-v4 benchmarks.
Attached patch Patch 2 (obsolete) — Splinter Review
Forgot to update the shape in the previous version. Strangely, that version actually passed all of our existing shell tests.
Attachment #441950 - Attachment is obsolete: true
Attached patch Patch 3 (obsolete) — Splinter Review
Attachment #442460 - Attachment is obsolete: true
After further analysis, I think I'm going to put this on hold for now. Roughly, it looks like this could be done the easy way for a puny <50ms win on V8, or the hard way for a win of up to 150ms. 50ms is hardly big enough to care about, and 150ms for a lot of effort doesn't seem worth it right now. And it appears that there is essentially no effect on SunSpider no matter how we do this.

I'll document what I learned for when we get back to this. In the interpreter, a property add takes about 50 ns:

  do prop cache stuff           50 ns

In the tracer, it's about 25 ns, which breaks down something like this:

  actually do property add       7 ns
  function call overhead         8 ns
  call js_IdIsIndex             10 ns

So we can take 3 steps to speed things up:

1. Do it the tracer way instead of the interpreter way. This is relatively easy; it just means compiling two guards and a call to js_AddProperty.

One thing to note is that the tracer guards on JSRuntime::protoHazardShape, but this seems to be kind of unfortunate. In v8-earley-boyer (in a test with PICs), protoHazardShape changes after only 2000 adds or so (out of 2M), invalidating the PIC. It's probably better to check the shapes up the prototype chain instead.

2. Don't call js_IdIsIndex. This call can be pushed up to PIC generation time; basically we can just not generate a fast path if it returns true.

3. Inline js_AddProperty for non-empty, non-table scopes. The case for table scopes can't be inlined, but that case is rare (nonexistent in the benchmarks). Empty scopes are common (1M of the 2M adds in earley-boyer). It might help if we started objects with a mutable scope if they are created for a constructor that does sets (or we otherwise know statically at object creation point that they will be mutated). Then that case could be inlined as well. Even inlining the easy case is a bunch of work--it requires calls to lock/unlock functions and/or js_AllocSlots in the slow cases.
Assignee: dmandelin → general
We lose about 7ms on adding properties in binary-trees. This should become easier once the scope changes land.
Blocks: JaegerSpeed
This should be pretty easy now. Brian, you game?

/be
Assignee: general → bhackett1024
Sure.  Word of warning, I'll be gone next week (last trip of the year).  Assuming TM merges into JM today, this should be done and stable before then.
Depends on: 592412
Attached patch addprop patch (obsolete) — Splinter Review
This patch gives me 4ms on SS, and 150ms on V8.  Existing tests and trace-tests pass, though this adds another test which breaks, and depends on bug 592412 to figure out the proper fix.
Attachment #442467 - Attachment is obsolete: true
I put up a patch for bug 592412, but it changes only the set-not-add (i.e., set of a pre-existing property) path. Is that relevant here?

/be
The trick here is that this PIC makes addprop look like setprop with some extra tests and assignments.  Since setprop is currently broken for the joined functions, I don't know yet what the right fix for addprop is.  I suspect that both addprop and setprop should slowpath direct assignments of 'function() {}', which will use your patch in bug 592412.  But I'm still working through how the code works, e.g. I don't know why this is correct:

function f() {
  var x, y;
  for (var i = 0; i < 2; i++) {
    x = function() { return 0; };
    if (i == 0)
      y = x;
  }
  print(x + " " + y);
}
f();
er, make that print(x == y), which is 'false' in TM tip (correctly so).
Is the setprop IC broken for joined functions?
Yes, and so is setprop for the interpreter and tracer.
Comment on attachment 470981 [details] [diff] [review]
addprop patch

>diff --git a/js/src/jsobj.h b/js/src/jsobj.h
>--- a/js/src/jsobj.h
>+++ b/js/src/jsobj.h
>@@ -325,18 +325,23 @@ struct JSObject {
>         OWN_SHAPE       = 0x80
>     };
> 
>     enum {
>         JS_NSLOTS_BITS  = 24,
>         JS_NSLOTS_LIMIT = JS_BIT(JS_NSLOTS_BITS)
>     };
> 
>-    uint32      flags: 32-JS_NSLOTS_BITS,   /* flags */
>-                freeslot: JS_NSLOTS_BITS;   /* next free slot in abstract slot space */
>+    union {
>+        struct {
>+            uint32      flags: 32-JS_NSLOTS_BITS,   /* flags */
>+                        freeslot: JS_NSLOTS_BITS;   /* next free slot in abstract slot space */
>+        };
>+        uint32 flagsAndFreeslot;
>+    };
>     uint32      objShape;

Nit: line up declarators if possible, at least flagsAndFreeslot and objShape.

Does the union and struct wrapping lose packing in 64 bits on any targets? If so, I say use big fields and bit twiddling. BUT:

I want to get rid of freeslot. I'm trying to file a bug on it (Minefield crashed and lost my work :-(). There, bug 592556. Want me to prioritize it and make this bug depend on it?

/be
dmandelin came up with a solution for the flags/freeslot bitfield in the shape merge. see JSObject::flagsOffset()
(In reply to comment #24)
> dmandelin came up with a solution for the flags/freeslot bitfield in the shape
> merge. see JSObject::flagsOffset()

Wow, that is something.

Getting rid of freeslot would be good, especially if it's not just an Imm32 write to update the flags/freeslot (i.e. two objects with the same shape have the same flags/freeslot.  I know this doesn't hold for DELEGATE, just realized that is partly broken in this patch.  Are there other JSObject::flags which could differ between two objects with the same shape?)

I'd be surprised if the flagsAndFreeSlot union changed the struct layout on a 64-bit target (union is between two 32-bit quantities), but I haven't tested.
JSObject::flagsOffset is clearly IS_LITTLE_ENDIAN-only -- please ifdef it with a #error in the #else clause.

DELEGATE and SYSTEM look to me (and I should know :-) like the flag bits that can change without any necessary shape change. However, SYSTEM may be saved because of disjoint system/non-system prototypes, and I hope SYSTEM is going away with compartments (mrbkap would know).

/be
(In reply to comment #19)
> The trick here is that this PIC makes addprop look like setprop with some extra
> tests and assignments.  Since setprop is currently broken for the joined
> functions, I don't know yet what the right fix for addprop is.

I would separate cases since a program location is usually either adding or setting, rarely doing both depending on object -- and if so, the object shapes for the two cases will be disjoint sets.

> I suspect that
> both addprop and setprop should slowpath direct assignments of 'function() {}',
> which will use your patch in bug 592412.  But I'm still working through how the
> code works, e.g. I don't know why this is correct:
> 
> function f() {
>   var x, y;
>   for (var i = 0; i < 2; i++) {
>     x = function() { return 0; };
>     if (i == 0)
>       y = x;
>   }
>   print(x + " " + y);
> }
> f();

The joined function object optimization applies only to setprop (not to assignment to an unqualified name, whether SETLOCAL or other) and initprop. See the JSOP_SETMETHOD and JSOP_INITMETHOD bytecodes. This tier of specialization is done by the compiler, but other tiers (funobj is a null closure with the right parent, obj->canHaveMethodBarrier()) await runtime.

/be
Attached patch updated patch (obsolete) — Splinter Review
This fixes the addprop IC to work correctly on SETMETHOD, and disables the setprop IC for SETMETHOD on a non-method shape, so that when bug 592412 is fixed JM should be fixed too.  The addprop IC is also fixed to always guard on the flagsAndFreeslot value, to catch delegate/system objects slipping in (this guard could be removed if making an object delegate/system were to change its shape).

I tested the flagsAndFreeslot thing and found no effect on the offsets/size of JSObject in either x86 or x64 on OS X.  JSObject::flagsOffset isn't enough for this patch, as the addprop IC needs to be able to read the combined value for guarding on.

I also tested the effect of purging PICs at every GC (as this patch does).  On V8 this affects the time by maybe 2-4ms, i.e. not worth bothering to improve right now.
Attachment #470981 - Attachment is obsolete: true
Attachment #471201 - Flags: review?(dmandelin)
Comment on attachment 471201 [details] [diff] [review]
updated patch

>diff --git a/js/src/jsobj.h b/js/src/jsobj.h
>--- a/js/src/jsobj.h
>+++ b/js/src/jsobj.h
>-    uint32      flags: 32-JS_NSLOTS_BITS,   /* flags */
>-                freeslot: JS_NSLOTS_BITS;   /* next free slot in abstract slot space */
>+    union {
>+        struct {
>+            uint32 flags: 32-JS_NSLOTS_BITS,   /* flags */
>+                   freeslot: JS_NSLOTS_BITS;   /* next free slot in abstract slot space */
>+        };
>+        uint32  flagsAndFreeslot;
>+    };
>     uint32      objShape;                   /* copy of lastProp->shape, or override if different */

This is clearly better than the search-for-the-bit address computation. Can you change that function to use offsetof(JSObject, flagsAndFreeslot) and see if the getelem stub generator is OK with that?

>diff --git a/js/src/methodjit/Compiler.cpp b/js/src/methodjit/Compiler.cpp
>--- a/js/src/methodjit/Compiler.cpp
>+++ b/js/src/methodjit/Compiler.cpp
>-        if (pics[i].kind == ic::PICInfo::SET) {
>+        if (pics[i].kind == ic::PICInfo::SET ||
>+            pics[i].kind == ic::PICInfo::SETMETHOD) {

I think the repeated checks for a kind of SET or SETMETHOD would be better as a new "isSet" or "isSetKind" method on the pic class. But I don't want to delay this patch over such a minor issue, so change that only if it is quick.

Also, the generateStub function is getting too long (probably was too long before this patch anyway), but I don't want to delay over that either.

>diff --git a/js/src/methodjit/PolyIC.cpp b/js/src/methodjit/PolyIC.cpp
>--- a/js/src/methodjit/PolyIC.cpp
>+++ b/js/src/methodjit/PolyIC.cpp
>+                for (size_t ind = 0; ind < chainLength; ind++)
>+                    masm.loadPtr(Address(pic.shapeReg, offsetof(JSObject, proto)), pic.shapeReg);

|i| is the SM canonical variable name here.

>+            if (pic.kind == ic::PICInfo::SETMETHOD) {
>+                /*
>+                 * Guard that the value is equal to the shape's method.
>+                 * We already know it is a function, so test the payload.
>+                 * REVIEW: is this necessary?  This mismatch does not seem to
>+                 * occur anywhere in tests/trace-tests.
>+                 */
>+                JS_ASSERT(shape->isMethod());
>+                JSObject *funobj = &shape->methodObject();
>+                if (pic.u.vr.isConstant) {
>+                    JS_ASSERT(funobj == &Valueify(pic.u.vr.u.v).toObject());
>+                } else {
>+                    Jump mismatchedFunction =
>+                        masm.branchPtr(Assembler::NotEqual, pic.u.vr.u.s.data, ImmPtr(funobj));
>+                    if (!slowExits.append(mismatchedFunction))
>+                        return false;
>+                }
>+            }

I'm not entirely sure, but I think JSOP_SETMETHOD is generated only for statements of the form:

  [object expr].m = [function expr]

where the function doesn't need to be cloned (at least not right away). So, I think isConstant will always be true, so all we really need is the assertion.

Might want to check with brendan on that.

>+                /* Check capacity. */
>+                masm.load32(Address(pic.shapeReg, -sizeof(Value)), pic.shapeReg);
>+                Jump overCapacity = masm.branch32(Assembler::LessThanOrEqual, pic.shapeReg,
>+                                                  Imm32(shape->slot));
>+                if (!slowExits.append(overCapacity))
>+                    return false;

I think here you need to use loadPrivate instead of load32: the capacity value is a private uint32.

>+            /* REVIEW: object locking not necessary for natives, right? */
>+            if (!obj->ensureClassReservedSlots(f.cx))
>+                return false;

Correct.
Attachment #471201 - Flags: review?(dmandelin)
(In reply to comment #29)
> |i| is the SM canonical variable name here.

Thanks -- canonical names are important for grep'ability and consistency. Squash any other |ind| deviations, rs=me.

> >+            if (pic.kind == ic::PICInfo::SETMETHOD) {
> >+                /*
> >+                 * Guard that the value is equal to the shape's method.
> >+                 * We already know it is a function, so test the payload.
> >+                 * REVIEW: is this necessary?  This mismatch does not seem to
> >+                 * occur anywhere in tests/trace-tests.
> >+                 */
> >+                JS_ASSERT(shape->isMethod());
> >+                JSObject *funobj = &shape->methodObject();
> >+                if (pic.u.vr.isConstant) {
> >+                    JS_ASSERT(funobj == &Valueify(pic.u.vr.u.v).toObject());
> >+                } else {
> >+                    Jump mismatchedFunction =
> >+                        masm.branchPtr(Assembler::NotEqual, pic.u.vr.u.s.data, ImmPtr(funobj));
> >+                    if (!slowExits.append(mismatchedFunction))
> >+                        return false;
> >+                }
> >+            }
> 
> I'm not entirely sure, but I think JSOP_SETMETHOD is generated only for
> statements of the form:
> 
>   [object expr].m = [function expr]
> 
> where the function doesn't need to be cloned (at least not right away).

Correct, although runtime checking is required since [object expr] can't be proven to be well-behaved (have no magic setter, e.g., that copies the joined function object).

> So, I think isConstant will always be true, so all we really need is the
> assertion.

The function is definitely constant -- it comes from JSOP_LAMBDA and it has not been unjoined (cloned), as proven by shape->isMethod().

/be
(In reply to comment #30)
> Correct, although runtime checking is required since [object expr] can't be
> proven to be well-behaved (have no magic setter, e.g., that copies the joined
> function object).

The shape guards should cover this, right?  For addprop we know there's no such property in the object, and no scripted setter in the prototype chain.
(In reply to comment #31)
> (In reply to comment #30)
> > Correct, although runtime checking is required since [object expr] can't be
> > proven to be well-behaved (have no magic setter, e.g., that copies the joined
> > function object).
> 
> The shape guards should cover this, right?  For addprop we know there's no such
> property in the object, and no scripted setter in the prototype chain.

Yes, sorry I didn't confirm this. My comment was about setprop vs. initprop. We won't fill the relevant cache with the joined function object in the first place if the runtime checks fail, we'll clone and fill differently (different shapes). So shape guards suffice.

/be
Attached patch updated patchSplinter Review
This removes the check on the incoming object in SETMETHOD adds.  It also adds a few more disables in update().
Attachment #471201 - Attachment is obsolete: true
Attachment #471280 - Flags: review?(dmandelin)
(In reply to comment #29)
> I think here you need to use loadPrivate instead of load32: the capacity value
> is a private uint32.

Private uint32s and pointers are represented differently on 32 and 64 bit systems.  On 64 bits there is an extra shift for pointers, but not for uint32s.  loadPrivate is doing the pointer version, so I think load32 is correct here (there needs to be a payloadOf though, will fix that).
Attachment #471280 - Flags: review?(dmandelin) → review+
(In reply to comment #34)
> (In reply to comment #29)
> > I think here you need to use loadPrivate instead of load32: the capacity value
> > is a private uint32.
> 
> Private uint32s and pointers are represented differently on 32 and 64 bit
> systems.  On 64 bits there is an extra shift for pointers, but not for uint32s.
>  loadPrivate is doing the pointer version, so I think load32 is correct here
> (there needs to be a payloadOf though, will fix that).

I'm confused about the question now, because the updated patch does use loadPrivate. In any case, the only important considerations are:

- That there is some abstraction layer involved in reading out the capacity in dslots[-1] to protect against changes, and
- The code for doing that in the addprop stubs uses the same idiom as the code that reads dslots[-1] in JSOP_GETFCSLOT in Compiler.cpp.

Anything with those properties is good.
(In reply to comment #35)
> I'm confused about the question now, because the updated patch does use
> loadPrivate. In any case, the only important considerations are:
> 
> - That there is some abstraction layer involved in reading out the capacity in
> dslots[-1] to protect against changes, and
> - The code for doing that in the addprop stubs uses the same idiom as the code
> that reads dslots[-1] in JSOP_GETFCSLOT in Compiler.cpp.
> 
> Anything with those properties is good.

Oops, that shouldn't be loadPrivate.  I don't see anywhere GETFCSLOT loads from dslots[-1], nor anywhere else in JM.  Capacity checks on dense arrays are accessing a private uint32, and those just use masm.payloadOf.
(In reply to comment #36)
> (In reply to comment #35)
> > I'm confused about the question now, because the updated patch does use
> > loadPrivate. In any case, the only important considerations are:
> > 
> > - That there is some abstraction layer involved in reading out the capacity in
> > dslots[-1] to protect against changes, and
> > - The code for doing that in the addprop stubs uses the same idiom as the code
> > that reads dslots[-1] in JSOP_GETFCSLOT in Compiler.cpp.
> > 
> > Anything with those properties is good.
> 
> Oops, that shouldn't be loadPrivate.  I don't see anywhere GETFCSLOT loads from
> dslots[-1], nor anywhere else in JM.  Capacity checks on dense arrays are
> accessing a private uint32, and those just use masm.payloadOf.

You're right. GETFCSLOT is doing something else; the dense array check is the right thing to match.
Tried landing this but backed out as compilers that aren't gcc-4.2 or msvc don't like anonymous structures.  So back to the old JSObject::flagsOffset (at least until bug 592556 lands).  There were a couple assert failures too, so retrying on the try server.
Last tryserver push didn't crash but timed out on a few mochitests.  Disabling the pic when overriding a prototype property fixes things locally so will tryserver that, but this should be revisited
 (10ms or so on v8-splay).
http://hg.mozilla.org/tracemonkey/rev/fc7630d987f8

Fixed a stupid bug in the patch where objects with addProperty hooks weren't filtered out, and a trickier bug in existing code where resetting a PIC could use the wrong slow path address with insufficient stack syncing, leading to later incorrect behavior.

For JM on AWFY this looks like 7ms on SS and 240ms on V8, for JM+TM 3ms and 180ms.
Backing this out because of intermittent Talos failures (don't show up on tryserver) which I don't know how to debug.  If someone wants to pick this up feel free, otherwise I'll look at it again next week.
Relanded as http://hg.mozilla.org/tracemonkey/rev/9a8b156c7396 and seems to have stuck. I suspect bug 593918 was the problem.
Whiteboard: fixed-in-tracemonkey
Yay! Now, JSObject::freeslot is *mine* (bug 592556).

/be
Blocks: 592554
Blocks: 593124
Blocks: 594686
http://hg.mozilla.org/mozilla-central/rev/9a8b156c7396
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.