Closed Bug 547327 Opened 14 years ago Closed 13 years ago

Estimate object size for JS objects

Categories

(Core :: JavaScript Engine, defect)

RESOLVED WONTFIX

People

(Reporter: gwagner, Assigned: billm)

Details

Attachments

(5 files, 2 obsolete files)

Dynamically typed languages have the problem that an object's eventual size is unknown. Every object starts out at the same size and is grown as needed at runtime. Allocating objects too big wastes memory, while allocating them too small is a performance bottleneck, since dynamic growth is expensive.
The idea is to create a feedback loop based on object creation location.
Two assumptions have to be confirmed:
1) Objects are allocated in loops, so they share the same PC location.
2) Objects grow early in their lifetime.

Ideally what we want to see is that objects in loops are modified before the next object with the same program location is created. 
We can monitor the object size and allocate the next object with a different size.
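The feedback loop described above could be sketched roughly like this standalone model (not SpiderMonkey code; the class and names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical model: map each allocation-site "pc" to the largest slot
// count observed for objects born there, and use it to pre-size the next
// allocation from the same site.
class SiteSizeFeedback {
    std::unordered_map<uintptr_t, uint32_t> observed_;  // pc -> max slots seen
    static const uint32_t kDefaultSlots = 2;            // assumed baseline size

  public:
    // How many slots to pre-allocate for an object created at `pc`.
    uint32_t predict(uintptr_t pc) const {
        auto it = observed_.find(pc);
        return it == observed_.end() ? kDefaultSlots : it->second;
    }

    // Called when an object created at `pc` grows; records the new size
    // so the next object from the same site starts out larger.
    void record(uintptr_t pc, uint32_t slots) {
        uint32_t &entry = observed_[pc];
        if (slots > entry)
            entry = slots;
    }
};
```

Under the two assumptions above, `record` would fire during the first loop iterations and `predict` would then pre-size every later allocation from the same PC.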
Assignee: general → anygregor
Attached image Facebook
Let's start by measuring how many objects are created within a loop.

Results for Facebook:
We see that we have "loopy" code: about 10 PC locations allocate more than 1000 objects each, and 2 of them allocate over 7000 objects.
Attached image Canopy benchmark
The canopy benchmark also creates objects in a loop.
9 PC locations create more than 9000 objects.
This is great!

/be
I wanted to combine this with increasing the size of objects so we can store a link back to the creating pc which we then patch as objects grow.
(In reply to comment #4)
> I wanted to combine this with increasing the size of objects so we can store a
> link back to the creating pc which we then patch as objects grow.

That's silly. If this approach wins, you get 1e6 objects all uselessly wasting a word to point to the same pc.

Better to use a PIC-like approach (could even use the property cache in the bytecode interpreter) and map pc to likely shape or shapes. If shape does not correlate with true size (due to private data or reserved slots or what not) then make a "size shape".

But I hope we don't need a birth-pc per object.

/be
(In reply to comment #5)
> (In reply to comment #4)
> > I wanted to combine this with increasing the size of objects so we can store a
> > link back to the creating pc which we then patch as objects grow.
> 
> That's silly. If this approach wins, you get 1e6 objects all uselessly wasting
> a word to point to the same pc.

Sorry, I meant silly in the nicest possible way ;-). To explain better, the 1e6 objects would all be small, so the optimization wouldn't win and the overhead of the pc would be relatively high.

With something analogous to a PIC, you get MRU recycling so the hot objects hog the limited cache space, and those are the ones you care about. You could even bias things toward the big hot objects, so the small hot objects drop out of the cache fast.

/be
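The MRU-style, size-biased cache described above might look something like this toy sketch (all names hypothetical; the eviction score is just one possible way to bias toward big hot objects):

```cpp
#include <cassert>
#include <cstdint>

// One entry of a small, PIC-like size-prediction cache.
struct SizeCacheEntry {
    uintptr_t pc = 0;
    uint32_t slots = 0;
    uint32_t hits = 0;
};

// Fixed-size cache: hot sites hog the limited space, and the eviction
// score (hits * slots) makes small hot objects drop out before big ones.
class SizePIC {
    static const int kEntries = 4;
    SizeCacheEntry entries_[kEntries];

  public:
    // Returns predicted slots for `pc`, or 0 on a cache miss.
    uint32_t lookup(uintptr_t pc) {
        for (auto &e : entries_) {
            if (e.pc == pc) {
                e.hits++;
                return e.slots;
            }
        }
        return 0;
    }

    // Inserts or updates an entry, evicting the lowest-scoring one.
    void insert(uintptr_t pc, uint32_t slots) {
        SizeCacheEntry *victim = &entries_[0];
        for (auto &e : entries_) {
            if (e.pc == pc) { e.slots = slots; return; }
            if (uint64_t(e.hits) * e.slots < uint64_t(victim->hits) * victim->slots)
                victim = &e;
        }
        *victim = SizeCacheEntry{pc, slots, 1};
    }
};
```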
Yeah, PIC sounds fine, too. I liked the idea of having a creation site id in objects because it's the closest you can get to a notion of "type". Objects with the same creation site should behave similarly, probably even more similarly than objects that are accidentally compatible at a structural level. But yeah: PIC, origin pointer, whatever, just patch me that new so we don't grow and copy all the time ;)
I've seen enough JS that does not construct same-sized objects at the same pc. It's easy to make an object then throw it down one of several control flow paths that add different numbers of properties.

All good to measure and study in real code, though -- for that, a test patch adding a pc per object to make the experiment easy and complete (not lossy like a cache) would be fine. Whatever's easiest.

/be
Yes, I saw very "dynamic" objects created at the same PC location. I am working on some graphs that also include the mean and standard deviation for objects created at the same PC location.
I tried to measure the overhead introduced by dslots allocation.
For creating 600,000 objects in a loop and setting properties, I got the following numbers:
0 slots alloc: 65 free: 25
1 slots alloc: 163 free: 88
2 slots alloc: 182 free: 88
3 slots alloc: 195 free: 87
4 slots alloc: 277 free: 151
5 slots alloc: 291 free: 150

We can clearly see the jump after the initial 3 slots.
Ideally, this patch should linearize the allocation time and remove the free() overhead.
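The jump in the numbers above is consistent with a layout where slot storage past the initial dynamic capacity forces a realloc. A toy model of that cost pattern (the capacities are assumptions for illustration, not the engine's actual values):

```cpp
#include <cassert>
#include <cstdlib>

// Toy object with out-of-line slot storage ("dslots") that starts at an
// assumed capacity of 3 and doubles on growth; the second realloc is the
// "jump after the initial 3 slots" seen in the measurements.
class ToyObject {
    static const size_t kInitialCapacity = 3;  // assumed initial dslots size
    double *dslots_ = nullptr;
    size_t capacity_ = 0;
    size_t used_ = 0;

  public:
    int allocations = 0;  // counts malloc/realloc calls

    void addSlot(double v) {
        if (used_ == capacity_) {
            size_t newcap = capacity_ ? capacity_ * 2 : kInitialCapacity;
            dslots_ = static_cast<double *>(
                std::realloc(dslots_, newcap * sizeof(double)));
            capacity_ = newcap;
            allocations++;
        }
        dslots_[used_++] = v;
    }

    ~ToyObject() { std::free(dslots_); }
};
```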
FWIW, V8 seems to specially mark functions that consist of simple assignments to this.props, which would let them easily pre-allocate (and pre-shape) at least to the extent that the constructor is involved.  Might be worth looking at.
See bug 536564, which led to these flags:

/*
 * Flag signifying that the current function seems to be a constructor that
 * sets this.foo to define "methods", at least one of which can't be a null
 * closure, so we should avoid over-specializing property cache entries and
 * trace inlining guards to method function object identity, which will vary
 * per instance.
 */
#define TCF_FUN_UNBRAND_THIS   0x100000

/*
 * "Module pattern", i.e., a lambda that is immediately applied and the whole
 * of an expression statement.
 */
#define TCF_FUN_MODULE_PATTERN 0x200000

We don't insist on straight-line "constructor" code, so we don't try to guess shape at compile time. All we want here is to avoid overspecializing. For the general case, bhackett's static analysis patch (bug 557407) is the way to go. To take advantage of that we need Luke's bigger values patch.

/be
Bug 589398 moves 'this' construction to the jitted prologue of scripted constructors and primitive-to-object-return-value logic to the jitted epilogue.  With this, one simple way to predict object size (specifically, with bug 584917, FinalizeKind) for calls to scripted constructors is to have the constructor epilogue patch the prologue based on the dynamic number of slots in the returned object.
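The patch-the-prologue-from-the-epilogue scheme could be modeled like this (illustrative names only; the real version would patch jitted machine code rather than a struct field):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-constructor JIT data: the size baked into the
// allocation prologue, plus a flag so the epilogue patches it only once.
struct JitCtorInfo {
    uint32_t prologueSlots = 2;  // assumed default allocation size
    bool patched = false;
};

// "Prologue": decide the allocation size for `new F()`.
inline uint32_t ctorPrologueSize(const JitCtorInfo &info) {
    return info.prologueSlots;
}

// "Epilogue": after the constructor returns, patch the prologue once
// with the dynamic number of slots in the returned object.
inline void ctorEpiloguePatch(JitCtorInfo &info, uint32_t actualSlots) {
    if (!info.patched) {
        info.prologueSlots = actualSlots;
        info.patched = true;  // later calls would skip the patch stub
    }
}
```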
We have dynamic object sizes, so let's try it!
Attached patch WIP patch (obsolete) — Splinter Review
Here's my attempt at doing this. It's kind of preliminary.

It uses a field in the constructor's JIT data structure to store the estimated object size. The constructor allocates objects of this size. Then when it finishes, it fills in the field based on the object's current capacity. I had to change SLOT_CAPACITY_MIN to 2 so that we don't initially grow by too much in the constructor. I don't know if this is a good idea or not.

Additionally, I incorporated an idea from Brian. When we call JSObject::growSlots, the patch tries to find the constructor using the prototype object's constructor property. If it can find it, it fills in the JIT field with the larger capacity. This takes care of objects where the constructor doesn't fill in all the fields (which apparently happens in v8-splay).

My main questions:
1. Is it okay to set SLOT_CAPACITY_MIN to 2?
2. Looking up the constructor property seems potentially slow (although growSlots is presumably a fairly expensive operation as it is). Is there a better way to do this? Can we guarantee that this property will be in a particular slot?
Attachment #483327 - Flags: feedback?(lw)
This looks like a good start.  A few comments:

1. the heuristic added to growSlots seems like it could be pretty expensive, in particular the getProperty.  I would measure the effect of that in isolation on top of the rest of the patch.
2. to avoid computing GetGCObjectKind on every 'new', you could do it once up front and store the FinalizeKind, not the number of slots.
3. instead of loading JITScript::ctorNumSlots in every new prologue and storing to it in every epilogue, it seems like it would be faster to bake a constant into the prologue that gets set only once by the epilogue.  That is, you would first generate an epilogue that calls a stub which (1) patches the prologue based on JSObject::capacity and then (2) nops out the stub call.
(In reply to comment #16)
> 1. the heuristic added to growSlots seems like it could be pretty expensive, in
> particular the getProperty.  I would measure the effect of that in isolation on
> top of the rest of the patch.

Not all objects are created by new on a constructor, notably object and array initialisers. One idea is to allocate a JSObject flag to set if and only if the object was created by new F for some function object F.

Another observation is that such objects are by default (and almost always) of class Object, with their proto also of class Object but not Object.prototype. So instead of proto->getProperty, do a proto->nativeLookup and cut through a lot of the layering. You don't want to risk wandering up the proto's proto to Object.prototype in the case of someone deleting F.prototype.constructor.

/be
Why not associate the size field with the pc and store the pc in the object?
> Not all objects are creaeted by new on a constructor, notably object and array
> initialisers.

Right, I think the goal was just to attack a subset of situations where we have a good (although, of course, not perfect), efficient indicator.  Object/array initializers seem like an even simpler case where we can just bake in the FinalizeKind without any runtime sampling.

Perhaps a more general strategy -- where we collect size data in growSlots and feed it into a pc-keyed cache which is consulted for future allocations -- would totally subsume these hard-coded solutions. But perhaps the general solution would also impose a general overhead on object allocation/growth, which, from what I'm hearing, is one of the next hot paths we need to attack stack-frame-evisceration-style.
bug 584917 already takes care of object sizes for object/array initializers (which, if non-empty, are presumed not to grow further), Array(N), and Array(a,b,c).

Keying things to the PC would help with bare Object() and scripted new objects where the constructor is called in multiple places with different usage profiles (weird code).  Not sure that's worth an extra word per object.  Also, bug 557407 adds a TypeObject field to JSObject which has that info for bare objects/arrays (not scripted new objects).

I think the .constructor traversal could be avoided by storing the data in the JSObject flags of the prototype object itself, which can be accessed easily from both the new objects and the JIT code for the function.  The prototype needs to be fetched anyways during construction, so use a few bits and mask the finalize kind.
(In reply to comment #20)
> I think the .constructor traversal could be avoided by storing the data in the
> JSObject flags of the prototype object itself, which can be accessed easily
> from both the new objects and the JIT code for the function.  The prototype
> needs to be fetched anyways during construction, so use a few bits and mask the
> finalize kind.

This sounds good.

/be
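Storing the predicted kind in spare flag bits of the prototype might look like the following sketch (the 3-bit shift/mask scheme mirrors the constants quoted elsewhere in this bug; the helpers themselves are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Three spare high bits of the object flags word hold the predicted
// finalize kind for objects constructed with this object as prototype.
const uint32_t FLAGS_GCKIND_SHIFT = 29;
const uint32_t FLAGS_GCKIND_MASK = 0xe0000000;

// Stores kind+1 so that 0 can mean "no prediction recorded yet".
inline uint32_t setStoredGCKind(uint32_t flags, uint32_t kind) {
    return (flags & ~FLAGS_GCKIND_MASK) |
           (((kind + 1) << FLAGS_GCKIND_SHIFT) & FLAGS_GCKIND_MASK);
}

// Returns kind+1, or 0 when nothing has been stored.
inline uint32_t getStoredGCKind(uint32_t flags) {
    return (flags & FLAGS_GCKIND_MASK) >> FLAGS_GCKIND_SHIFT;
}
```

Since the prototype has to be fetched during construction anyway, reading the prediction this way avoids the .constructor lookup entirely.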
comment 17 might be saying 'no' -- I'm not quite sure what 'such objects' refers to -- but wouldn't the prototype of most objects be Object.prototype and, if so, would this cause a collision problem for storing predicted FinalizeKind on the prototype?
Comment 20 only addresses objects created with scripted new (which is what this bug should focus on, as objects other than bare Object(), {} and [] should be pretty well covered).  For these, barring bizarre behavior like 'Foo.prototype = Object.prototype', different functions should have different objects in their .prototype property.  Those objects will eventually inherit from Object.prototype through their __proto__ chains, but the bits would only be added to the immediate __proto__ of objects which call allocSlots.
(In reply to comment #22)
> comment 17 might be saying 'no' -- I'm not quite sure what 'such objects'
> refers to

Objects created by user-defined constructor functions.

> -- but wouldn't the prototype of most objects be Object.prototype

No, the proto of each new instance is F.prototype for user-defined constructor function F. The proto of F.prototype is Object.prototype. As Brian notes in comment 23, this is "usually" not "always", so needs checking.

/be
Ah, I missed that we were talking about non-scripted new, thanks.
Attached patch patch (obsolete) — Splinter Review
Since we're not going to do bug 606631 right now, I thought I'd put some more effort into this one. It improves our v8-bench score by about 150. It doesn't have any effect on SunSpider.

The patch is pretty small. It uses a few flag bits to store the estimated object kind in the prototype, as Brian suggested. I also added a little bytecode analysis to guess the size of an object based on its constructor. The reason for this is to avoid allocating the first object with a different shape than remaining objects. Doing so caused us to hit MAX_BRANCHES in v8-richards, which was a big performance problem.
Attachment #483327 - Attachment is obsolete: true
Attachment #492478 - Flags: review?(lw)
Attachment #483327 - Flags: feedback?(lw)
Comment on attachment 492478 [details] [diff] [review]
patch

Nice patch and great idea with the js_GuessObjectSizeFromConstructor.

>+    while (pc < end) {
...
>+        const JSCodeSpec *cs = &js_CodeSpec[JSOp(*pc)];
>+        ptrdiff_t oplen = cs->length;
>+        if (oplen < 0)
>+            oplen = js_GetVariableBytecodeLength(pc);
>+        pc += oplen;
>+    }

We should really factor this out into a bytecode iterator one of these days... I count at least 4 copies of this loop.
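A minimal version of the wished-for bytecode iterator might look like this (the opcode-length table and variable-length helper are stand-ins, not js_CodeSpec or js_GetVariableBytecodeLength):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef uint8_t jsbytecode;

// Hypothetical per-opcode lengths; -1 marks a variable-length op.
static const int kOpLength[3] = {1, 3, -1};

// Stand-in for a variable-length decoder: a length byte follows the opcode.
static ptrdiff_t variableLength(const jsbytecode *pc) {
    return 1 + pc[1];
}

// Wraps the "look up the opcode length, handle variable-length ops,
// advance" loop so callers don't have to copy it.
class BytecodeIterator {
    const jsbytecode *pc_, *end_;

  public:
    BytecodeIterator(const jsbytecode *pc, const jsbytecode *end)
        : pc_(pc), end_(end) {}

    bool done() const { return pc_ >= end_; }
    const jsbytecode *pc() const { return pc_; }

    void next() {
        ptrdiff_t oplen = kOpLength[*pc_];
        if (oplen < 0)
            oplen = variableLength(pc_);
        pc_ += oplen;
    }
};
```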

>+            proto->setStoredGCKind(kind+1);

The +1/-1 stuff is a bit confusing (before you realize it's avoiding confusion with gckind == 0).  Perhaps you could store the gckind directly and represent 'unset' with some larger-than-max-finalizekind constant?  It might also be a touch faster, saving the +/-.

>@@ -2893,17 +2928,29 @@ js_CreateThisFromTrace(JSContext *cx, Cl
>+        unsigned stored = proto->getStoredGCKind();
>+        if (stored == 0) {
>+            uintN count = js_GuessObjectSizeFromConstructor(ctor);
>+            kind = gc::GetGCObjectKind(count);
>+            proto->setStoredGCKind(kind+1);
>+        } else {
>+            kind = gc::FinalizeKind(stored-1);
>+        }

Can you factor out this then-branch into some GetOrComputeStoredGCKind (or whatever name makes sense to you) helper?

>+            slots = JS_MAX(actualCapacity, GetGCKindSlots(gc::FinalizeKind(stored-1)));

I added non-macro js::Max/js::Min in jstl.h that you could use.  I mention it here since CSE might not catch the switch statement in GetGCKindSlots.

>+        GCKIND_BIT0     = 0x20000000,
>+        GCKIND_BIT1     = 0x40000000,
>+        GCKIND_BIT2     = 0x80000000
>     };
> 
>+    const static int32 FLAGS_GCKIND_SHIFT = 29;
>+    const static int32 FLAGS_GCKIND_MASK = 0xe0000000;

Perhaps add:

  JS_STATIC_ASSERT(GCKIND_BIT0 == 1 << FLAGS_GCKIND_SHIFT);
  JS_STATIC_ASSERT(FLAGS_GCKIND_MASK == (GCKIND_BIT0 | GCKIND_BIT1 | GCKIND_BIT2));

r+ these changes
Attachment #492478 - Flags: review?(lw) → review+
http://hg.mozilla.org/tracemonkey/rev/d75da3b12098
Status: NEW → ASSIGNED
Whiteboard: fixed-in-tracemonkey
Backed out due to static_assert failure. I guess this will wait until post-FF4.

http://hg.mozilla.org/tracemonkey/rev/8bb016a281c0
Whiteboard: fixed-in-tracemonkey
I really love this patch, but it is not blocking and has no approval. It shouldn't land right now, probably, or we need to discuss approval (I am all for it, btw).
I did some measurements with this patch in the browser because the benchmarks don't seem to benefit.

alloc: malloc calls in JSObject::allocSlots(JSContext *cx, size_t newcap)
grow: realloc calls in JSObject::growSlots(JSContext *cx, size_t newcap)
shrink: realloc calls in JSObject::shrinkSlots(JSContext *cx, size_t newcap)

trunk
         alloc,    grow,  shrink
V8:      903155,  114175,     560
kraken: 2219090, 1708672,  517938
SS:      164097,   24229,    2409

with patch
V8:      170698,  236086,    1676
kraken: 2213440, 2915037,  520887
SS:      163319,   39044,    3909

For the V8 benchmarks, Richards and DeltaBlue benefit the most because 70% of all alloc calls in the trunk build happen during Richards, DeltaBlue and Crypto.
I also see a slight regression in the splay benchmark.
The kraken benchmark suffers from an increased number of realloc calls.
The SS benchmark doesn't really benefit and also shows an increased number of realloc calls.
Attached patch patch v2Splinter Review
Thanks for the measurements, Gregor. I tracked down some problems and fixed them. I'll post some new data in the next attachment.

Luke, I'm asking for re-review since the patch has changed somewhat. It's still pretty small, though. I realized it would be nice to have this for Firefox 5.
Assignee: anygregor → wmccloskey
Attachment #492478 - Attachment is obsolete: true
Attachment #524498 - Flags: review?(luke)
Attached file measurements
This includes data on slot array allocation (same as Gregor's) and timing.

The slot array data shows three pairs of columns. The first column in the pair shows how many allocations there were on trunk. The second column shows the change with the patch. A negative change is good. There were a lot of benefits for V8 and almost none for anything else. I spot-checked the allocations that weren't eliminated, and they seem to be mostly arrays.

The timing data is pretty standard. I left out SS since it's unaffected. V8 gets a 2% improvement. Kraken gets worse by a little (< 1%). The Kraken regression is caused by some weird behavior in ai-astar. It allocates a ton of objects and then later adds some properties to them. This totally defeats the heuristics in this patch. I filed bug 648321 on this problem. For now I think we should just accept the regression, since it's small.
(In reply to comment #33)
> V8 gets a 2% improvement. Kraken gets worse by a little (< 1%). The Kraken
> regression is caused by some weird behavior in ai-astar. It allocates a ton of
> objects and then later adds some properties to them. This totally defeats the
> heuristics in this patch. I filed bug 648321 on this problem. For now I think
> we should just accept the regression, since it's small.

Hmmm, to me, with the regression, this patch looks like kind of a wash. And I would imagine the Kraken pattern happens in real code, too. I'd vote for not taking the patch as-is given the current perf numbers. Is there anything simple that might fix this up?
I can put a patch together for bug 648321 quickly and we can see what the performance story is then.  I suspect ai-astar will improve with both patches applied, as then the hot loop will access properties as inline slots rather than with an extra indirection.

(I'm going to be doing some TI property stuff in the next couple days, and will be working on top of this and bug 648321).

Even though that adds back some of the complexity bug 584917 removed, this shouldn't be a big deal as one of the things bug 584917 did was encapsulate slot accesses so we can change the object slot layout without much pain.
Attachment #524498 - Flags: review?(luke)
It looks like TI took care of this.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX