Open Bug 650284 Opened 9 years ago Updated 5 years ago

Investigate performance of accesses to own and inherited properties

Categories: (Core :: JavaScript Engine, defect, minor)
Platform: x86_64 Windows 7
People: (Reporter: daniel.baulig, Unassigned)
References: (Depends on 1 open bug, Blocks 1 open bug)
Attachments: (2 files)

User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0
Build Identifier: 4.0

The jsPerf test case linked above indicates that accessing an object's own member is considerably slower than accessing the same member when it is inherited through the object's prototype chain. Considering how ECMAScript works (or should work?), this is illogical and should not be the case. For that reason I suspect there is a bug in the ES implementation.

Reproducible: Always

Steps to Reproduce:
1. Go to http://jsperf.com/prototype-chain/4
2. Run the test.
Actual Results:  
Accessing Proto.member and Proto.method is considerably slower than accessing Level1Child or Level2Child members.

Expected Results:  
Accessing Proto.method and Proto.member should be at least as fast as accessing Level1Child and Level2Child members.

As you can see, the other browsers give the expected results. I also considered that the JIT might cause the difference, since Proto members are accessed first and then later again through the prototype chain. If the first test triggers the JIT, then maybe the code path is already compiled when the other two tests run. This however is not the case. You can verify this by reloading the page and then running the test cases one by one manually in reverse order. The effect is the same.
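For reference, a sketch of the object setup the jsPerf page appears to use (reconstructed from the names in the report; the exact property bodies are assumptions):

```javascript
// Reconstructed shape of the jsperf.com/prototype-chain objects.
// The names (Proto, Level1Child, Level2Child) come from the report;
// the property bodies are assumptions.
var Proto = {
  member: 0,
  method: function () { return this.member; }
};
var Level1Child = Object.create(Proto);       // inherits member/method
var Level2Child = Object.create(Level1Child); // two levels deep

// The surprising result: accessing Proto.member/Proto.method directly
// was slower than the same accesses via Level1Child and Level2Child.
Proto.member += 1;
var viaChild = Level2Child.method(); // resolves method and member up the chain
```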
Attached file Shell testcase
This is only an issue in TraceMonkey.  The numbers I see reported for this testcase look like this:

Interp: 11871 11183
JM: 1246 1305
TM: 808 305

The difference for TM is that the first number includes a large number of samples under js::MethodWriteBarrier which are not present in the second number.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Accessing object own members slower, than accessing members inherited by prototype → TM: Accessing object own method members is slower than accessing members inherited from prototype
Summary: TM: Accessing object own method members is slower than accessing members inherited from prototype → TM: Accessing object own members is slower than accessing members inherited from prototype
In fact, you don't need the method() calls at all to reproduce this.

What's happening is that every write to the "member" own property of an object that is branded or has a method barrier (which Proto is in this case) needs to call methodWriteBarrier on the object.  The relevant comments in TraceRecorder::nativeSet are:

       // Setting a function-valued property might need to rebrand the
       // object. Call the method write barrier. Note that even if the
       // property is not function-valued now, it might be on trace.

I wonder whether checking for "function-valued" on trace would be cheaper than unconditionally calling the write barrier.

I also wonder whether this is something type inference would just help.

Daniel, this isn't exactly a bug.  The two objects you're setting properties on have very different behavior on set after optimizations, since one of them has function-valued properties and the other does not and therefore the former can't optimize property writes as well as the latter.  If you add a function-valued property directly on Level1Child in my testcase, you will see the times become equal.  Note that after jitting what the browser actually does is quite different from what the spec says to do, because it short-circuits a bunch of the steps the spec calls for.  You can see that by comparing the "TM" times above to the "Interp" times (which _also_ short-circuit the prototype chain lookup, by the way).
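The point about function-valued properties can be sketched like this (a minimal sketch; the object names are taken from the testcase, the bodies are assumed):

```javascript
// Proto carries a function-valued property, so (under TraceMonkey's
// branding scheme) writes to its own properties had to go through the
// method write barrier.
var Proto = {
  member: 0,
  method: function () { return this.member; } // function-valued => branding
};

// Level1Child has only a plain data property, so its writes could be
// optimized without the barrier.
var Level1Child = Object.create(Proto);
Level1Child.member = 0;

// Per the comment above: adding a function-valued property directly on
// Level1Child would put both objects on the same (slower) path:
// Level1Child.ownMethod = function () {};
```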
Depends on: TypeInference
(In reply to comment #2)
> I also wonder whether this is something type inference would just help.

The idea here for type inference is that we should know call targets statically (simple in this case for the method() calls) and not need branding to cut the number of memory ops needed to resolve call targets dynamically.

branding has the dual problem that, as shown in the comment you quoted, it can slow down things which aren't involved with method properties at all.  I hit this when optimizing SETGNAME in JM+TI (we know the slot we're accessing and the value we're writing, but in the future the property may be method-valued and with branding we can't overwrite it without a shape change), and resolved this by disabling branding when inference is enabled.  Going forward I think we will want to remove branding entirely.

FWIW, the times I get from JM+TI:

js -m -n  499 1121
js -j     1276 410
js -m     1104 1147

The Level1Child case is comparable to plain -m because it is made with Object.create, whose result currently has totally unknown properties (should get fixed sometime).  If I doctor the testcase to use the more typical scripted 'new':

function Foo() {}
Foo.prototype = Proto;
var Level1Child = new Foo();

I get these times (-j and -m are unchanged):

js -m -n  488 290
Current js shell numbers for the attached testcase:
Interp: 34312 28926
-j:     34478 29039
-m:     1543  1510
-m -n:  610   1600

Given that TM is obsolete at this point, I'm changing the title to reflect the future JM+TI work discussed in comment #3.
Summary: TM: Accessing object own members is slower than accessing members inherited from prototype → JM+TI: Accessing object own members is slower than accessing members inherited from prototype
On Nightly

Ion off
Local 49,022,280 ±0.16% fastest
Member 28,845,034 ±37.94% 57% slower
Child 5,813,390 ±4.10% 89% slower
GrandChild 5,889,825 ±1.29% 88% slower

Ion on
Local 69,643,123 ±0.33% fastest
Member 53,768,355 ±5.40% 27% slower
Child 1,413,836 ±0.16% 98% slower
GrandChild 1,420,589 ±0.15% 98% slower

A lot slower on the last 2. Expected? Not important?
> A lot slower on the last 2. Expected? Not important?

Maybe to both. There certainly are some weird things going on, here.

First the good news: accessing own members of an object isn't slower anymore. Well, it is when running in the interpreter, but not in Baseline or Ion. However, it is *much* slower than in v8. For the attached testcase, I get these timings:

SpiderMonkey (with IM):
direct: 1514
__proto__: 1586
Object.create: 1700
function instance with prototype: 1560

d8:
direct: 162
__proto__: 176
Object.create: 191
function instance with prototype: 177

So we're ~9x as slow as v8 in all of these scenarios.

Note also that the Object.create version is ~10% slower than the others. That might or might not be expected, I don't know.

So far, so bad. Now for the really weird part. The jsperf test paints a *very* different picture. There, accessing the inherited properties is much slower than the own property - almost 50x as slow. I modified the test in http://jsperf.com/prototype-chain/5 to reflect the scenarios in the attached shell testcase, with results resembling those of the previous version.

From the results in http://jsperf.com/prototype-chain/4, it looks like we regressed this somewhere between FF6 and FF9.

At least, we're ~3x as fast as Chromium on the local access part in the jsperf harness - again going completely against the shell results.

I guess that leaves us with three questions:
- Why are we that much slower than v8 in the shell testcase?
- What on earth is going on in that jsperf test?
- When and why exactly did those performance numbers in the jsperf test change?
In case that wasn't clear, this bug has now officially morphed into something entirely different. Sorry about that.
Summary: JM+TI: Accessing object own members is slower than accessing members inherited from prototype → Investigate performance of accesses to own and inherited properties
jsperf, unlike the shell testcase:

1)  Runs in a browser.
2)  Runs inside a |new Function()|, not at global scope.

So sometimes it will measure things quite differently from a near-equivalent shell testcase...
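The wrapping difference can be sketched like this, assuming a jsperf-style harness (the harness details here are assumptions, not jsperf's actual code):

```javascript
// In the shell testcase the snippet runs at global scope, so variable
// accesses compile to global-name (gname) ops. A jsperf-style harness
// instead builds the test body with |new Function()|:
globalThis.Proto = { member: 0 };  // must be a real global to be visible below
var test = new Function("Proto.member += 1;");
test(); // the body runs in function scope, not at the harness's top level
```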
(In reply to Boris Zbarsky (:bz) from comment #8)
> jsperf, unlike the shell testcase:
> 
> 1)  Runs in a browser.

One of these days, we should show measurements on awfy that are actually relevant to our users ...

> 2)  Runs inside a |new Function()|, not at global scope.

Ah, that's a good hint. Since bug 646597, that should not matter anymore, but it probably does.

> 
> So sometimes it will measure things quite differently from a near-equivalent
> shell testcase...

Sure. It's just that this changed behavior in interesting ways that seem worthwhile to investigate.
Right.  I was just pointing out possible starting avenues for investigation.  ;)
Blocks: 885526
Just re-discovered this bug. Current numbers follow.

First test case:
v8:
211
178

SpiderMonkey:
3426
3436


Second test case:

v8:
direct: 163
__proto__: 185
Object.create: 184
function instance with prototype: 181

SpiderMonkey:
direct: 1497
__proto__: 1493
Object.create: 1526
function instance with prototype: 1554

So, we're still between 8x and 15x slower than v8 here.

efaust, this looks like it might interest you.
Flags: needinfo?(efaustbmo)
> So, we're still between 8x and 15x slower than v8 here.

You should try replacing this bit:

  const LOOPCOUNT = 10000000;

with:

  var LOOPCOUNT = 10000000;

because afaict we just fail to ion-compile a getgname for a const for some reason (or at least JIT inspector claims no ion code).  I'm not sure why it fails, exactly.  I thought of bug 956072, but that's fixed...

Alternately, wrap the code in a function so that we're not doing a gname.  That also makes it faster (and the const/var difference goes away)...
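The function-wrapping variant suggested above might look like this (a minimal sketch; the loop body is assumed rather than taken from the attached testcase):

```javascript
// Wrapping the benchmark in a function makes LOOPCOUNT and the loop
// counter locals, so no global-name (gname) accesses are emitted and
// the const-vs-var distinction goes away.
function runBenchmark() {
  var LOOPCOUNT = 10000000;
  var obj = { member: 0 };  // assumed stand-in for the testcase's object
  for (var i = 0; i < LOOPCOUNT; i++) {
    obj.member += 1;
  }
  return obj.member;
}
```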
NI myself for the const vs var issue; that seems silly.
Flags: needinfo?(jdemooij)
Depends on: 973526
Filed bug 973526 for the const issue, Ion should compile JSOP_SETCONST.
Flags: needinfo?(jdemooij)
Blocks: 835306
(In reply to Boris Zbarsky [:bz] from comment #12)
> > So, we're still between 8x and 15x slower than v8 here.
> 
> You should try replacing this bit:
> 
>   const LOOPCOUNT = 10000000;
> 
> with:
> 
>   var LOOPCOUNT = 10000000;
> 
> because afaict we just fail to ion-compile a getgname for a const for some
> reason (or at least JIT inspector claims no ion code).  I'm not sure why it
> fails, exactly.  I thought of bug 956072, but that's fixed...

Oh, right - I should've seen the const.

With that fixed, the numbers are:

SpiderMonkey:
direct: 106
__proto__: 540
Object.create: 130
function instance with prototype: 123

v8:
direct: 167
__proto__: 197
Object.create: 185
function instance with prototype: 180

jsc:
direct: 35
__proto__: 36
Object.create: 45
function instance with prototype: 73


Two observations:
- JSC's get/setprop is *fast*
- the competition can apparently deal with __proto__ being set at least once without deoptimizing.
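The "set once in the beginning" pattern under discussion looks roughly like this (a sketch; the names are assumptions):

```javascript
// The [[Prototype]] is mutated exactly once, at setup time...
var Proto = { member: 1 };
var obj = {};
obj.__proto__ = Proto;  // one-time __proto__ set

// ...and the hot code afterwards never touches __proto__ again.
var sum = 0;
for (var i = 0; i < 1000; i++) {
  sum += obj.member;    // inherited read; ideally fully optimizable
}
```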
Waldo, you worked on the __proto__ stuff and recently added a warning about it being slow. Is there any chance to get the set-once-in-the-beginning case fast?
Flags: needinfo?(efaustbmo) → needinfo?(jwalden+bmo)
(In reply to Till Schneidereit [:till] from comment #15)
> Waldo, you worked on the __proto__ stuff and recently added a warning about
> it being slow. Is there any chance to get the set-once-in-the-beginning case
> fast?

It seems unlikely. Setting __proto__ almost certainly nukes TI for a whole bunch of things, making later optimizations hard.
> - JSC's get/setprop is *fast*

Sort of.  If you actually look at how much time it's taking per iteration, and compare to an empty loop, you discover that JSC is doing 10 get/sets in about 3 instructions on my hardware.

In other words, a sufficiently smart compiler optimized away your microbenchmark.  Simply moving the "Proto.member = 0;" line (and equivalent for other loops) out of the loop makes JSC take a lot more time.  Now it's _still_ faster at that point than we are, but not by nearly as much...
(In reply to Boris Zbarsky [:bz] from comment #17)
> > - JSC's get/setprop is *fast*
> 
> Sort of.  If you actually look at how much time it's taking per iteration,
> and compare to a empty loop, you discover that JSC is doing 10 get/sets in
> about 3 instructions on my hardware.

Hum, I thought I had checked for that by removing single lines from the 10 `+= 1` part. Doing that reduced the time JSC takes for the changed test by 1.5ms/removed line.
LOOPCOUNT is 10000000 = 1e7.  1.5ms/LOOPCOUNT is therefore 0.15ns.  I suspect your hardware is clocked in the 2-3 GHz range, so 0.3-0.5ns per cycle.  Since you're updating the same memory location, it's unlikely you're getting funky multiple-instructions-per-cycle benefits from out of order execution and whatnot.  More likely, you're spending 1.5ms/line in JSC's JIT warmup period...

You could check by increasing LOOPCOUNT by a factor of 10 and seeing whether you then win 15ms per removed line.
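The empty-loop comparison described here can be sketched as a rough harness (Date.now() resolution is coarse, so treat the numbers as estimates only):

```javascript
// Estimate the per-iteration cost of the get/set, net of loop overhead,
// by subtracting an empty baseline loop from the measured loop.
var LOOPCOUNT = 10000000;

function time(fn) {
  var start = Date.now();
  fn();
  return Date.now() - start;
}

var baseline = time(function () {
  for (var i = 0; i < LOOPCOUNT; i++) {}
});

var obj = { member: 0 };
var measured = time(function () {
  for (var i = 0; i < LOOPCOUNT; i++) obj.member += 1;
});

// Nanoseconds per get/set pair. For tiny deltas this is noisy (it can
// even come out negative); increasing LOOPCOUNT shrinks the share that
// JIT warmup and timer noise contribute, which is the check suggested above.
var nsPerOp = (measured - baseline) * 1e6 / LOOPCOUNT;
```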
(In reply to Boris Zbarsky [:bz] from comment #19)
> You could check by increasing LOOPCOUNT by a factor of 10 and seeing whether
> you then win 15ms per removed line.

Almost exactly 20ms per removed line, in fact. Compared to 110ms for us. But in any case, you're right: moving the `member = 0` out of the loop reduces the speed advantage to about 33%. Stupid microbenchmark is stupid. As was to be expected.
(In reply to Till Schneidereit [:till] (on vacation March 1st to 23rd) from comment #15)
> Waldo, you worked on the __proto__ stuff and recently added a warning about
> it being slow. Is there any chance to get the set-once-in-the-beginning case
> fast?

I'm not sure __proto__ has much of anything to do with this; it's really the underlying |SetClassAndProto| function that's at issue.

It looks like the issue might be that when we have the object whose [[Prototype]] is being mutated, we walk its [[Prototype]] chain and generate new shapes, or setUncacheableProto, on each -- because any object with any of those on its [[Prototype]] chain suddenly will look different to some property accesses.  That's perfectly reasonable.

What's not so reasonable is that we include the starting object in that loop.  But the starting object *doesn't* necessarily have any objects whose lookup behavior will change, if the starting object changes.  If the starting object's a singleton, no biggie, we just generate a new shape, which probably doesn't hurt much.  But if it's not, and I don't see why |Proto| would be in your shell testcase, then we |setUncacheableProto|, which probably makes JITs go sadfaces.

So probably that loop should start with |obj->getProto()|.  And as far as |obj| is concerned, we probably should only futz with it if |obj->isDelegate()|.
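The proposed change could be sketched in pseudocode roughly like this (SpiderMonkey internals heavily simplified; any helper name not mentioned in the comment above is invented for illustration):

```cpp
// Pseudocode sketch of the SetClassAndProto invalidation loop.
// Today the loop starts at |obj| itself; the proposal is to start at
// obj->getProto() and only touch |obj| when it is a delegate.
if (obj->isDelegate()) {
    // Something may inherit lookups from obj, so it does need handling:
    // a singleton just gets a new shape (cheap); otherwise
    // setUncacheableProto, which is expensive for the JITs.
    InvalidateForProtoChange(obj);  // invented helper name
}
for (JSObject *proto = obj->getProto(); proto; proto = proto->getProto())
    InvalidateForProtoChange(proto);
```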
Flags: needinfo?(jwalden+bmo)
Assignee: general → nobody