561506 - JM: optimize adding properties with PIC

Reporter

Description

15 years ago

Attached patch WIP (obsolete)

Currently, JM adds properties a bit slower than the interpreter because it has to take the slow path, which means doing what the interpreter does plus a bit of overhead. It should instead do pretty much what the tracer does for that case. I've started on this and have a WIP that can run a simple microbenchmark and show a speedup there. It doesn't look too hard, but it does require a cross-platform method for calling functions with 3 (or maybe more) arguments, which we don't have yet (we just have our standard stub calls and 1-argument calls).

David Mandelin [:dmandelin]

Reporter

Comment 1

15 years ago

This looks a little tricky to do well, so I slowed down to take some measurements.

First, I measured how long it takes to do a property add in various systems:

  JM      63 ns
  JM+TM   25 ns
  SM      50 ns
  JSC     10 ns
  V8      47 ns

JM is slower than SM because right now we do a pure slow stub call, after trying the PIC, so it's basically SM+overhead. Note that JSC is super-fast somehow. I think they are doing roughly what we do.

Second, I measured how many slow stub calls we do. I think most of the ones left are for adds, although I don't know for sure. In SunSpider, there are 125k left, basically all in access-binary-trees. In V8, there are about 3M left: 2M in earley-boyer, 1M in raytrace, .7M in deltablue, and .2M in splay.

So, if we can make this as fast in JM as on trace, by doing basically the same thing, then there is about 5 ms of win left on SS and 120 ms on V8.

I think this is fairly important to get done at some point, but it doesn't have to be high priority, as nothing depends on it and the perf win is modest and predictable. Also, it will be easier and more effective if bug 558451 is done first.
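The 5 ms and 120 ms figures appear to follow from multiplying the remaining slow stub calls by the per-add gap between JM and JM+TM. The sketch below redoes that arithmetic under that assumption; the function name `estimatedWinMs` is made up for illustration and is not part of any patch on this bug.

```cpp
// Rough estimate of the win, in milliseconds, if every remaining slow-path
// add became as fast as the trace path.
// ASSUMPTION: the saving per add is simply (slow ns - fast ns); the comment
// does not spell out how its totals were derived.
double estimatedWinMs(double slowCalls, double slowNsPerAdd, double fastNsPerAdd) {
    const double nsPerMs = 1e6;
    return slowCalls * (slowNsPerAdd - fastNsPerAdd) / nsPerMs;
}
```

With the numbers from the comment, `estimatedWinMs(125e3, 63, 25)` gives roughly 4.75 ms for SunSpider and `estimatedWinMs(3e6, 63, 25)` gives roughly 114 ms for V8, in line with the rounded 5 ms and 120 ms.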

Depends on: 558451

David Mandelin [:dmandelin]

Reporter

Comment 2

15 years ago

Now some stats on which paths are taken through the fast-case add. I did this by instrumenting the fast case in jsops.cpp. I measured two dimensions: bump slot vs. realloc slots, and extend scope vs. do AddProperty.

                         slot allocation       scope update
  program                bump      realloc     extend     addprop    total
  access-binary-trees     126213         0      126213          0     126213
  v8-deltablue            537997    109287      647284          0     647284
  v8-earley-boyer        2126231         0     2126231          0    2126231
  v8-raytrace            1154537      9005     1163542          0    1163542
  v8-splay                 60350     24471       84821          0      84821

If we just do the common case, we leave only about 5 ms of win on V8 behind, and the fast case should be simpler and faster.
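As a cross-check on the "5 ms left behind" figure, the sketch below sums the realloc column for the V8 programs and multiplies by a per-add saving. The 38 ns used in the note after the code is an assumption (the JM minus JM+TM gap from comment 1); the struct and function names are hypothetical.

```cpp
#include <cstdint>

// Per-program counts copied from the table in this comment (V8 rows only).
struct AddCounts {
    const char* program;
    std::uint64_t bumpSlot;
    std::uint64_t reallocSlots;
};

const AddCounts kV8Counts[] = {
    {"v8-deltablue",    537997, 109287},
    {"v8-earley-boyer", 2126231, 0},
    {"v8-raytrace",     1154537, 9005},
    {"v8-splay",        60350,  24471},
};

// Adds that would stay on the slow path if only the bump-slot case is inlined.
std::uint64_t totalReallocAdds() {
    std::uint64_t total = 0;
    for (const AddCounts& c : kV8Counts)
        total += c.reallocSlots;
    return total;
}

// Win left behind in milliseconds, given an assumed saving per add in ns.
double leftoverWinMs(double nsSavedPerAdd) {
    return static_cast<double>(totalReallocAdds()) * nsSavedPerAdd / 1e6;
}
```

`totalReallocAdds()` comes to 142763, and `leftoverWinMs(38.0)` to about 5.4 ms, which matches the estimate in the comment if the 38 ns assumption holds.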

David Mandelin [:dmandelin]

Reporter

Comment 3

15 years ago

Attached patch WIP 2 (obsolete)

This version fully inlines the common path. It still has some bugs.

Attachment #441226 - Attachment is obsolete: true

Brendan Eich [:brendan]

Comment 4

15 years ago

Great work. We need to get property trees per thread (bug 511591), as adding a property involves taking a lock, which probably accounts for much of our slowdown compared to JSC (confirmation needed). /be

Mike Shaver (:shaver emeritus)

Comment 5

15 years ago

Do we need to take a lock in the predictable-shape-evolution case? It seems like, for ST objects, no shared state needs to mutate.

Brendan Eich [:brendan]

Comment 6

15 years ago

The lock is for adding to the property tree. We tolerate races to lookup and find nothing, then add, so we tolerate dups, but adding a new tree node itself requires a lock or you can race off the end of a kids-chunk list. So the GC lock is taken for shortish critical sections when adding, but not when looking. If the property tree node already exists, no locking required. So maybe the locking cost isn't the problem -- hard to say without more data. /be
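A minimal sketch of the protocol described here: look up without a lock, take a lock only to link in a new node, and tolerate duplicates when two threads race past the lookup. This is a toy model, not SpiderMonkey's property tree; `TreeNode`, `findKid`, `getOrAddKid`, and `gTreeLock` are all hypothetical names, and a plain sibling list stands in for the kids-chunk list.

```cpp
#include <atomic>
#include <mutex>

// Toy property-tree node: children form a singly linked sibling list.
struct TreeNode {
    int id;
    std::atomic<TreeNode*> firstKid{nullptr};
    std::atomic<TreeNode*> nextSibling{nullptr};
    explicit TreeNode(int i) : id(i) {}
};

// Stand-in for the lock taken around tree mutation.
std::mutex gTreeLock;

// Lock-free lookup: walks the kid list using acquire loads only.
TreeNode* findKid(TreeNode* parent, int id) {
    for (TreeNode* k = parent->firstKid.load(std::memory_order_acquire); k;
         k = k->nextSibling.load(std::memory_order_acquire)) {
        if (k->id == id)
            return k;
    }
    return nullptr;
}

// Returns an existing kid without locking, or links in a new one under the
// lock. There is deliberately no re-check under the lock, so two racing
// threads may both add a node for the same id (a tolerated duplicate).
// Nodes are never freed in this sketch.
TreeNode* getOrAddKid(TreeNode* parent, int id) {
    if (TreeNode* existing = findKid(parent, id))
        return existing;

    TreeNode* fresh = new TreeNode(id);
    std::lock_guard<std::mutex> guard(gTreeLock);
    // Fully initialise the node before publishing it, so unlocked readers
    // never observe a half-linked list.
    fresh->nextSibling.store(parent->firstKid.load(std::memory_order_relaxed),
                             std::memory_order_relaxed);
    parent->firstKid.store(fresh, std::memory_order_release);
    return fresh;
}
```

The lock serialises writers so that two adds cannot both link at the same position and lose one node, while the common case of an already existing node takes no lock at all.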

Brendan Eich [:brendan]

Comment 7

15 years ago

(In reply to comment #6)
> The lock is for adding to the property tree. We tolerate races to lookup and
> find nothing, then add, so we tolerate dups, but adding a new tree node itself
> requires a lock or you can race off the end of a kids-chunk list.

And add two kids-chunk lists at the same next link, leaking one.

> So the GC lock is tak[en] for shortish critical sections when adding, but not
> when looking. If the property tree node already exists, no locking required. So
> maybe the locking cost isn't the problem -- hard to say without more data.

Dave, can you profile and find out why add costs us, in detail? Thanks,

/be

David Mandelin [:dmandelin]

Reporter

Comment 8

15 years ago

(In reply to comment #7)
> Dave, can you profile and find out why add costs us, in detail? Thanks,

My stats are for basic shell builds, so locking doesn't apply.

Shark says that of the total time within js_AddProperty, 40% of it is for js_IdIsIndex (via js_AddProperty -> JSScope::extend -> JSScope::updateFlags). Otherwise, the time seems to be distributed across the active parts of the function. There do seem to be a lot of opportunities for upfront specialization for the hot paths. My WIP 2 still has bugs, so it may be slower when it is done, but currently it is showing a time of about 4ns per add.

Another relevant issue here is that js_GetMutableScope can't be practically inlined into JIT code, so I do the slow path for the case where the base object has a shared mutable scope. Of course, we could still get some speedup by calling out to a simplified builtin, but I think it would be nicer to do after bug 558451.

David Mandelin [:dmandelin]

Reporter

Comment 9

15 years ago

Attached patch Patch (obsolete)

OK, this is it. It passes shell tests. My laptop is no good for perf testing, which I want to do before landing.

Attachment #441675 - Attachment is obsolete: true

David Mandelin [:dmandelin]

Reporter

Comment 10

15 years ago

I'm still getting a few test failures on the v8-v4 benchmarks.

David Mandelin [:dmandelin]

Reporter

Comment 11

15 years ago

Attached patch Patch 2 (obsolete)

Forgot to update the shape in the previous version. Strangely, that version actually passed all of our existing shell tests.

Attachment #441950 - Attachment is obsolete: true

David Mandelin [:dmandelin]

Reporter

Comment 12

15 years ago

Attached patch Patch 3 (obsolete)

Attachment #442460 - Attachment is obsolete: true

David Mandelin [:dmandelin]

Reporter

Comment 13

15 years ago

After further analysis, I think I'm going to put this on hold for now. Roughly, it looks like this could be done the easy way for a puny <50ms win on V8, or the hard way for a win of up to 150ms. 50ms is hardly big enough to care about, and 150ms for a lot of effort doesn't seem worth it right now. And it appears that there is essentially no effect on SunSpider no matter how we do this.

I'll document what I learned for when we get back to this.

In the interpreter, a property add takes about 50 ns:

  do prop cache stuff          50 ns

In the tracer, it's about 25 ns, which breaks down something like this:

  actually do property add      7 ns
  function call overhead        8 ns
  call js_IdIsIndex            10 ns

So we can take 3 steps to speed things up:

1. Do it the tracer way instead of the interpreter way. This is relatively easy; it just means compiling two guards and a call to js_AddProperty. One thing to note is that the tracer guards on JSRuntime::protoHazardShape, but this seems to be kind of unfortunate. In v8-earley-boyer (in a test with PICs), protoHazardShape changes after only 2000 adds or so (out of 2M), invalidating the PIC. It's probably better to check the shapes up the prototype chain instead.

2. Don't call js_IdIsIndex. This call can be pushed up to PIC generation time; basically we can just not generate a fast path if it returns true.

3. Inline js_AddProperty for non-empty, non-table scopes. The case for table scopes can't be inlined, but that case is rare (nonexistent in the benchmarks). Empty scopes are common (1M of the 2M adds in earley-boyer). It might help if we started objects with a mutable scope if they are created for a constructor that does sets (or we otherwise know statically at object creation point that they will be mutated). Then that case could be inlined as well. Even inlining the easy case is a bunch of work--it requires calls to lock/unlock functions and/or js_AllocSlots in the slow cases.
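The first two steps can be illustrated with a toy model: the index check runs once when the PIC is generated, and the fast path guards on the object's shape plus the shapes up its prototype chain rather than on a global hazard shape. This is only a sketch under simplifying assumptions. `ToyObject`, `ToyAddPIC`, `toyIdIsIndex`, `generateAddPIC`, and `tryFastAdd` are hypothetical names, the index test is far cruder than js_IdIsIndex, and a real PIC would emit machine code instead of walking a vector.

```cpp
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

struct ToyObject {
    std::uint32_t shape;
    ToyObject* proto;          // nullptr terminates the prototype chain
    std::vector<int> slots;
};

// Crude stand-in for js_IdIsIndex: true if the id consists only of digits.
bool toyIdIsIndex(const std::string& id) {
    if (id.empty())
        return false;
    for (unsigned char c : id) {
        if (!std::isdigit(c))
            return false;
    }
    return true;
}

struct ToyAddPIC {
    bool hasFastPath = false;
    std::uint32_t shapeBefore = 0;           // guard: shape of the base object
    std::uint32_t shapeAfter = 0;            // shape to store after the add
    std::vector<std::uint32_t> protoShapes;  // guard: shapes up the proto chain
};

// "PIC generation time": the index check happens here, once, so the fast
// path never needs it. No fast path is generated for index-like ids.
ToyAddPIC generateAddPIC(const ToyObject& obj, const std::string& id,
                         std::uint32_t shapeAfter) {
    ToyAddPIC pic;
    if (toyIdIsIndex(id))
        return pic;
    pic.hasFastPath = true;
    pic.shapeBefore = obj.shape;
    pic.shapeAfter = shapeAfter;
    for (const ToyObject* p = obj.proto; p; p = p->proto)
        pic.protoShapes.push_back(p->shape);
    return pic;
}

// Fast path: returns false when any guard fails, meaning the caller must
// fall back to the slow stub.
bool tryFastAdd(const ToyAddPIC& pic, ToyObject& obj, int value) {
    if (!pic.hasFastPath || obj.shape != pic.shapeBefore)
        return false;
    const ToyObject* p = obj.proto;
    for (std::uint32_t expected : pic.protoShapes) {
        if (!p || p->shape != expected)
            return false;
        p = p->proto;
    }
    if (p)
        return false;              // the chain grew since the PIC was generated
    obj.slots.push_back(value);    // the "bump slot" case
    obj.shape = pic.shapeAfter;    // the add must also update the shape
    return true;
}
```

Guarding on the chain's own shapes means an unrelated prototype change elsewhere in the runtime does not invalidate this PIC, which is the concern raised about protoHazardShape.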

Assignee: dmandelin → general

David Mandelin [:dmandelin]

Reporter

Comment 14

14 years ago

We lose about 7ms on adding properties in binary-trees. This should become easier once the scope changes land.

Blocks: JaegerSpeed

Brendan Eich [:brendan]

Comment 15

14 years ago

This should be pretty easy now. Brian, you game? /be

Assignee: general → bhackett1024

WIP, 15 years ago, David Mandelin [:dmandelin], 11.18 KB, patch
WIP 2, 15 years ago, David Mandelin [:dmandelin], 15.70 KB, patch
Patch, 15 years ago, David Mandelin [:dmandelin], 17.06 KB, patch
Patch 2, 15 years ago, David Mandelin [:dmandelin], 17.48 KB, patch
Patch 3, 15 years ago, David Mandelin [:dmandelin], 17.50 KB, patch
addprop patch, 14 years ago, Brian Hackett [Laid off!], 14.23 KB, patch
updated patch, 14 years ago, Brian Hackett [Laid off!], 19.88 KB, patch
updated patch, 14 years ago, Brian Hackett [Laid off!], 21.39 KB, patch, dmandelin: review+