Closed Bug 641047 Opened 13 years ago Closed 13 years ago

Analyze ai-astar performance


(Core :: JavaScript Engine, defect)

(Reporter: dmandelin, Assigned: dmandelin)




(1 file)

Bill says V8 is faster here, and it's probably because of cache behavior. This bug is to study performance on ai-astar and see what there is to learn about what slows us down and how it relates to cache behavior.
More specifically, V8 takes ~280ms and we take ~1070ms.  That accounts for over half of the lead that V8 has on us for all of Kraken.

The following is copied from

Key features. The benchmark spends almost all of its time in the loop in the following function.

Array.prototype.findGraphNode = function(obj) {
    for (var i = 0; i < this.length; i++) {
        if (this[i].pos == obj.pos) { return this[i]; }
    }
    return false;
};

Mozilla-specific things.  Removing unnecessary function guards gains ~1.15x (Bug 606892).

LICM also helps by another ~1.2x (Bug 545406). 

So just improving LIR with those two patches would give ~1.38x, taking us down to ~775ms.  The LICM patch as implemented reorders guards in a way that I'm not certain is valid, however.
Bill got some basic numbers for SpiderMonkey/V8/Crankshaft from linux 'perf'. They are copied here with some analysis (numbers are in millions):

               SM       V8      CS
    L1       1658     1660     644      L1 read requests
    L2        168       75      73      L2 read requests
    L3        155       50      46      L3 read/write requests
    Insts    3511     4800    1663      instructions retired
    Cycles   3395     2536    1043      CPU cycles

1. V8 vs. Crankshaft

The difference seems to be entirely due to compiler optimizations. Reasoning: Crankshaft cuts down instructions to 35% of V8. Similarly, L1 loads, which we expect to see for local variables, temporaries, and such, are cut to 40% of V8. Those are the kinds of things we expect compiler opts to cut. L2 and L3 are unchanged; those are the ones we expect to see for object graph traversal.

2. SM vs. V8

The difference is likely due to cache misses. That's just because that's the only thing that's worse in SM: L1 loads are the same, and instructions are much better. With our lower instruction count we'd expect to be ~600M cycles better than V8, but we are instead ~800M worse, leaving ~1400M cycles of slowdown unexplained. An L3 hit costs 30-35 cycles more than an L1 hit, so 100M extra L3 hits can explain up to 3000M cycles of slowdown, more than enough to explain the difference. In a simple test of iterating over arrays, I also found that extra L3 hits seem to cost more like 15 cycles each in practice (presumably due to overlap), which is very consistent with the slowdown seen here.

Now, in the benchmark, we have 47M iterations of the loop in comment 1. Note that in V8/CS, the number of L3 hits is about equal to loopIters, and in SM it's about 3 * loopIters. This suggests that V8 takes 1 L3 hit per iteration and we take 3. Clearly, the random access pattern is expected to generate at least 1 L3 hit per iteration in order to read this[i].pos. For SM, to read this[i].pos, we need to read this[i]'s:

  objShape       at offset 12 (on 32-bit)
  slots          at offset 36 (on 32-bit)
  the slot value in heap

That should be 3 cache misses.

Putting objShape and slots next to each other starting on a 16-byte boundary should get those two into the same cache line, which should get rid of at least one cache miss. 

I'm not sure how to get rid of the other extra miss. Getting the object slots into the object might do it, if the prefetcher helps us or something. Currently we start with 4 slots inline, and this object has 5 slots. So sizing objects according to the constructor might do it. Failing that, generational GC that allocates the slot array near the object header might help too.
I like the idea of trying to group together objShape and slots. Also, we could give the patch in bug 547327 a try here. I don't remember how it does on ai-astar.
TIL cache lines on Core/Nehalem are 64 bytes. Our tests were on 64-bit, where it looks like objShape is at offset 20 and slots is at offset 60. So they could be on the same cache line if aligned properly. OTOH, if the object is 64-byte-aligned, then slots would split a cache line. That used to be really bad; I'm not sure if it still is, but it would certainly mean an extra miss (beyond the one required to read objShape).

In order to understand this fully I'll need to check on our allocator behavior and see what alignments we get. I also intend to make a C++ model of this code so that we can study this in more detail.

Bug 547327 sounds like a good thing to try.
Attached patch: Model program
The attached C++ program models the IC'd/traced version of the hot loop by accessing random objects in a set of 10,000 (same as ai-astar.js), 50M times (vs 47M in ai-astar.js). I tested the effect of various layout changes on a 64-bit machine:

 L3 hits (M)     configuration

 137             Orig: original model
 106             MoveShape: move objShape to just before 'slots'
 100             MoveShape+Inline: make slot array inline (7 slots)
  50             MoveShape+Inline+Pad: pad to get objShape/slots at/after offset 64
  50             MoveShape+Inline+Tiny: make obj header 16 bytes long

Here are the object layouts, giving just the offsets of the 3 fields that are read, in the order they are read.

               Orig    MoveShape    +Inline       +Pad      +Tiny

  objShape       20           60         60         64          0
  slot           64           64         64         72          8
  slots[3]     heap         heap         96        108         32

  sizeof(obj)   104          104        128        128         64

Cache lines are 64 bytes. Object size is 104 in Orig.

Detailed analysis (recommendations in following comment):

0. Best case. objShape is always a miss, for 50M misses minimum.

1. Orig. slots[3] is in the heap, so that's always a miss.

slots may or may not be a miss. Object size is 104, and gcd(104, 64) is 8, so the objects in the array start at 8 different positions with respect to the start of a cache line. If objShape is at 4 or 12, then slot is at 48 or 56 in the same cache line; otherwise it is in a different cache line. That predicts a miss 3/4 of the time, so 37.5M misses.

This gives a total of 137.5M predicted L3 hits, which agrees exactly with observation (within the precision of our measurements).

2. MoveShape. The idea here is to move shape close to slots so they are likely to be in the same cache line.

We miss slots[3] always, just as in Orig. 

But now shape and slots are right next to each other, so we miss on slots only when shape lands at byte 60 of its cache line, which puts slots at the start of the next line. That predicts a miss 1/8 of the time, so 6.25M misses here.

Total prediction is 106.25M L3 hits, again agreeing exactly with observation.

3. MoveShape+Inline. Here we grow the inline slot array so that the object size is a multiple of the cache line size.

For this, I increased the inline slot array to 7 slots, giving a total object size of 128 and allowing these objects (which have 5 properties) to be represented using the inline array.

Because object size is exactly 2 cache lines, the layout with respect to cache lines is the same for every object. It takes one cache line to get objShape, at 60, and another to get both slots and slots[3], at 64 and 96, so 2 cache lines per iter, so 100M L3 hits.

4. MoveShape+Inline+Pad. In MoveShape+Inline, objShape is just barely in a different cache line, so here we pad the beginning to move it to the second line, then take away a fixed slot so we stay at 128 bytes size.

Now the 3 addresses we want to read are in the same cache line (the second of the object), so we need only 1 L3 hit per iteration.

5. MoveShape+Inline+Tiny. Another way to get everything in the same cache line is to make the object only 1 cache line. I did this by assuming we could boil the object down to (map, objShape, slots ptr, 5 inline slots), for a total size of 64. This of course also gets us to 1 L3 hit/iteration.

A. Long-term

A.64. 64-byte cache lines, 64-bit addresses (x64)

+ We should try to make a vanilla object fit exactly in one cache line, so that we get 1 L1/L2 miss per object access.

With 64-byte cache lines and 64-bit pointers, this would mean hopefully 2-5 fixed values and then 3-6 inline slots. So that's about as many inline slots as we get now, with much better cache behavior, as well as smaller objects.

Flexible object sizes don't seem to buy us anything here for cache behavior, but they still help us if the allocator is slow. And 32-byte objects would save memory, so doing that would also be nice.

A.32. 64-byte cache lines, 32-bit addresses (x86)

+ We should try to make one or two vanilla objects fit exactly in one cache line, so that we get 1 L1/L2 miss per object access.

With 32-bit pointers, we could do 2-5 fixed values and then 11-14 inline slots. But that might be overkill on the inline slots: maybe there it is better to do 32-byte objects to save memory for no penalty with caching. We might even get a few more L2 hits for some workloads.

+ Flexible object sizes could help with cache behavior here: if the vanilla object is 32 bytes, making 64-byte objects when we have more properties avoids a cache miss. The benefits for slow allocators also still apply.

B. Short-term

+ We should almost certainly move objShape next to slots.

B.64. 64-byte cache lines, 64-bit addresses (x64)

+ We should consider increasing the slot array so that the object size is 128 bytes. We need to check for performance regressions and excessive memory bloat.

+ We should consider padding the first half of the object to move objShape to the next cache line. We would need to check regressions/memory.

B.32. 64-byte cache lines, 32-bit addresses (x86)

+ We should consider having 3 inline slots, so that objects are 64 bytes. We lose an inline slot doing that, though, which might hurt us more than fixing object size helps.

+ Failing that, squeezing out two fields would help. Maybe 2 of proto, parent, emptyShapes? I'm not sure what's easiest.
Andreas has a patch to remove parent in bug 638316.

proto could go in map, somehow. It would take a dependent load to get to it, but that could be a don't-care these days due to all the caching of lookups with fast shape guards. More in bug 637933.

Great analysis.

More analysis of the compiler optimizations relevant to ai-astar in bug 649693 (which are in place in JM+TI now).
Depends on: 649693
According to AWFY, we have these times for ai-astar:

- Chrome: 247ms
- JM+TI:  173ms

I declare victory!
Closed: 13 years ago
Resolution: --- → FIXED