Analyze v8 earley-boyer performance

NEW
Assigned to

Status

()

8 years ago
5 years ago

People

(Reporter: dmandelin, Assigned: dmandelin)

Tracking

(Blocks: 1 bug)

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments)

(Assignee)

Description

8 years ago
V8/previous (i.e., no Crankshaft) was 4x our score; V8 with Crankshaft is 4.5x our score. The benchmark has two separate tests: Earley and Boyer. We are 5x slower on Earley and 3x slower on Boyer (vs. v8/prev). They combine with geometric mean to get the subscore, so that's how we get 4x. 

This is our worst score vs. v8/prev. We know Crankshaft has a bunch of new compiler optimizations that are coming via type inference, but we need to know what's going on in the ones where we are much slower even in v8/prev.
(Assignee)

Updated

8 years ago
Blocks: 642003
About 10-15% of both combined is in object creation (see bug 638015), but I'm not sure what the earley/boyer breakdown is.
(Assignee)

Comment 2

8 years ago
Created attachment 519992 [details]
JS profiler output
(Assignee)

Comment 3

8 years ago
Initial observations:

- Our high-level breakdown is 27% jitcode, 12% GC, 3% bytecompile/activate, and 58% stub calls, mostly arguments, getCallObject, callname, putActivation, createThis, and lambda. v8/prev is 5x faster, so to reach that speed we probably need to eliminate pretty much all the time spent in GC and stubs.

- 3 basic Scheme functions, sc_Pair, sc_list (which also calls sc_Pair), and sc_append, account for 36% of total time. Within that, arguments is 11% (of total, not the 36%), GC is 4%, and PutActivationObjects is 3%.

- 52% of total time is in 4 different local lambdas, which are moderately big.
(Assignee)

Comment 4

8 years ago
// sc_Pair
0.093433 0.811502   earley-boyer.js          539
                     0.059688 0.639   STUB_CREATETHIS            539
                     0.019956 0.214   RUN_MJITCODE               539
                     0.010885 0.117   STUB_UNKNOWN               539
                     0.002903 0.031   RUN_MJITCODE               540

Here is a microbenchmark that calls sc_Pair the same number of times as 100 iterations of 'earley':

    function sc_Pair(car, cdr) {
        this.car = car;
        this.cdr = cdr;
    }

    function f() {
        var p = new sc_Pair(0, null);

        for (var i = 0; i < 872245; ++i) {
            new sc_Pair(3, p);
            //p = new sc_Pair(4, p);
        }
    }

    f();

It can be run with short-lived or long-lived objects by commenting out either line inside the loop. In the short-lived version, v8/nocs is about 3x faster; with long-lived, v8/nocs is about 1.5x faster. That must be from the generational GC having to copy more. 

We GC during this test, but it doesn't show up in the profile. xperfview shows 10% of our time in a finalizer function, so that must be the "unknown stub call".

The JS profile shows most of our time here in stubs::CreateThis. According to xperf, that's divided among JSObject::init, js_CreateThisForFunctionWithProto, CreateThis, InitScopeForObject, and js_NewGCObject.

So, to do better on this test, we need:

1. generational GC with fast allocator (fixes 1/3 of our slowness)
2. radically simpler object creation path (fixes 2/3 of our slowness)

Eventually, we should also:

3. inline small constructors
4. transform prefixes of constructors that only set properties to something with no guards, that just creates an object with the right size slot array, fills in the slots, and sets the final shape.

Regarding #2, I hope we can shrink objects and then create them with a path something like this:

 allocate obj header from bump allocator
 store map pointer (baked in for ctor function)
 fill in flags fields etc with minimal work also
 allocate right-sized slot array from bump allocator if needed
(Assignee)

Comment 5

8 years ago
// sc_list
0.195755 0.195755   earley-boyer.js          788
                     0.079826 0.408   STUB_ARGUMENTS             789
                     0.041364 0.211   GC                         790
                     0.035015 0.179   RUN_MJITCODE               788
                     0.014877 0.076   IC_DISABLEDGETELEM         791
                     0.008890 0.045   STUB_PUTACTIVATIONOBJECTS   792
                     0.004717 0.024   STUB_UNKNOWN               788
                     0.003628 0.019   STUB_UNKNOWN               792
                     0.003266 0.017   RUN_MJITCODE               790
                     0.002358 0.012   STUB_UNKNOWN               789
                     0.001814 0.009   STUB_UNKNOWN               791

The code:

  787  function sc_list() {
  788      var res = null;
  789      var a = arguments;
  790      for (var i = a.length-1; i >= 0; i--)
  791          res = new sc_Pair(a[i], res);
  792      return res;
  793  }

Here is a microbenchmark version, on which v8/nocs is about 4.5x faster:

    function sc_Pair() {
    }

    function sc_list() {
        var res = null;
        var a = arguments;
        for (var i = a.length-1; i >= 0; i--)
            res = sc_Pair(a[i], res);
        return res;
    }

    var t0 = new Date;
    for (var i = 0; i < 1000000; ++i)
        sc_list(1, 2, 3);
    print(new Date - t0);

The profile shows that the slowdown is almost entirely caused by the implementation of |arguments|. I think there are many things we could do about that, but the big ones are:

5. Do escape analysis for arguments. If it doesn't escape (or, even more restrictively, if it is only used in .length and [expr] expressions), then transform the function so that no arguments object is created. If arguments is not mutated, we can go as far as replacing a.length with a load and a[i] with a guarded jsop_arg/undefined.

6. Specialize functions that use nonescaping/"nice" arguments. First, arguments.length becomes a constant, and the guards can be eliminated. We could also do unrolling, although trying it here by hand didn't help.

The big win is clearly from just not creating |arguments|.
(Assignee)

Comment 6

8 years ago
sc_append is the same story as sc_list: all arguments stuff.

Most of the time outside the 3 big Scheme functions is spent in 4 ugly closures. They are fairly similar, so I will just pick one, a closure 'named' |derive_trees|:

0.105225 0.624456   earley-boyer.js         4257
                     0.031386 0.298   STUB_GETCALLOBJECT        4257
                     0.025218 0.240   STUB_PUTACTIVATIONOBJECTS  4324
                     0.023403 0.222   RUN_MJITCODE              4257
                     0.015965 0.152   STUB_LAMBDA               4260
                     0.003810 0.036   STUB_UNKNOWN              4256
                     0.001633 0.016   STUB_UNKNOWN              4257
                     0.001089 0.010   STUB_UNKNOWN              4324
                     0.001089 0.010   STUB_UNKNOWN              4260
                     0.000907 0.009   RUN_MJITCODE              4260
                     0.000726 0.007   RUN_MJITCODE              4324

First off, we have stubs::GetCallObject and stubs::PutActivationObjects, which take up 55% of the time in this closure between them. The inner function does not escape, so a better escape analysis could avoid the activation objects entirely. Failing that, we would just have to make creating activation objects very fast.

Second, we have 22% of time in jitcode that is charged to the first line of the function. This might be some kind of prologue code. All of these closures have a lot of time charged to mjit on the first line, and some have few arguments, so it's probably not arguments handling. Maybe there is something else slow about our call setup.

  Another thing to note is that many of these functions have tail calls (sometimes recursive, sometimes not), so a transformation there might help us.

Finally, have 12% of time in stubs::Lambda. I think this is another case where we just want to be much smaller and faster. We should be able to boil down function creation to something like creating an object with mostly baked-in attributes and a code pointer.

One feature this function doesn't have that the others do is jsop_callname. That's a big part of our slowdown.

Here's a microbenchmark that has hits callname:

   function f() {
        var cl1 = function() {
            var cl2 = function(z, q) {
                if (z == 0)
                    return;
                cl2(z - 1, q);
            };
            cl2(1000, 0);
        }
        cl1();
    }

    var t0 = new Date;
    for (var i = 0; i < 1000; ++i)
        f();
    print(new Date - t0);

And here's what we need to do:

7. IC callname.
8. Radically simpler function creation path
9. More precise escape analysis for closures so we can eliminate more callobjs
10. Do something with tail recursion
11. Maybe do something with tail calls generally
12. Figure out why we seem to spend more time in prologue code.

For #12, it might be useful to augment the code map to further break down the contents of certain jsops for finer-grained reporting.
Nice analysis!

What does the patch in bug 638015 do for (1) and (2)? The object allocation path is actually pretty simple in the best case and we can specialize a lot more than the C++ helper - it could be much better, of course, but short-term we can get a head start. We have to take a slow path if the free list is empty, but that will go away with the new GC.

For (3), that sounds right, I think Bill saw big gains in bug 606631, but we should probably have a more generalized inlining mechanism so we just get the win naturally.

(5) is a great idea, and doesn't sound terribly hard.

(7) Should be easy.

(8) Do you mean cloning lambdas? Yeah - we should be able to completely inline that, especially in the case where no cloning is needed. (Aside, we also don't track in the FrameState whether an object is a funobj or not, so often we have needless clasp guards.)

For (9) and maybe (12), we could look at reviving bug 592965. Inlining GetCallObject sounds hard until we can get rid of the malloc() call, and at that point maybe allocating the call objects might not be so expensive.
(Assignee)

Comment 8

8 years ago
(In reply to comment #7)
> Nice analysis!
> 
> What does the patch in bug 638015 do for (1) and (2)? The object allocation
> path is actually pretty simple in the best case and we can specialize a lot
> more than the C++ helper - it could be much better, of course, but short-term
> we can get a head start. We have to take a slow path if the free list is empty,
> but that will go away with the new GC.

sc_Pair benchmark:

  js -a -m                    67
  js -a -m w/ bug 638015      52
  v8 --nocrankshaft            9
(Assignee)

Updated

8 years ago
Summary: Analyze earley-boyer performance → Analyze v8 earley-boyer performance

Comment 9

7 years ago
Update: a bunch of things have been inlined/optimized.  Fixing the number of iterations to the number that v8 executes (allowing us to compare time), I get:
 js -m -n           5504ms
 d8 --nocrankshaft  2184ms
 d8                 1523ms

Still focusing on --nocrankshaft, we are 2.5x slower.  Bill has some measurements that predict that reducing GC pause times and cache misses will only get us maybe half the way there, so we've been looking together at what will take us the rest of the way.  Here's some findings:

Manual rdtsc shows that 41% of our time is spent under three functions: stubs::Lambda, stubs::FunctionFramePrologue and stubs::FunctionFrameEpilogue.  In contrast, callgrind shows that V8 (crankshaft or no) spends <1% in C++.  This raises the question: do they just have fast inline paths or are they actually doing some smart closure optimizations?  Instrumenting JM and V8 (crankshaft and no) to count number of Call (Context in V8) and Function (Closure in V8) objects created shows almost the exact same number (11.9M Call objects; 5.9M Function objects).  Thus, it's not closure optimization that is holding us back.

Looking at FastNewClosureStub::Generate, we'll still be behind since our Call objects require relatively more work to create.  Obj-shrink will improve things.  However, we've also had a plan to "redo" scope chains to not just JSObjects and instead have a new GC thing specifically for scopes.  In static contexts (no eval, with, debugMode), creating a new static activation record should be really fast (just "outer" pointer + slots, IIUC).

Lastly, if you look at the "hot" sites in early/boyer, loop3 (1.7M Function objects created, 2.5M Call objects) doesn't need either a call or cloned function object and loop2 (4.2M Call objects created) doesn't need a Call object.  So that is 8.4M/17.8M = 47% of all call/function objects.  Determining this just requires local escape analysis to prove that loop2 only gets invoked at a single static distance from its enclosing function.
Created attachment 563544 [details]
earley-boyer modified to avoid closures

I was interested in how earley-boyer would perform if we could ignore the cost of all the closure stuff. I lifted the most commonly called inner functions to the top level. I passed all the upvars they had been using as arguments. This works because none of the important closures in earley-boyer actually escape. I've attached the modified benchmark.

Here are the running times. Unmodified is the normal benchmark and modified is the one attached here, without closures.

              SpiderMonkey          V8
unmodified           7.32s       2.99s
modified             4.36s       2.51s

So we gain a lot more than V8 does. However, we can't blame everything on closure performance.

These results are also a little optimistic for us. It's unlikely we'll be able to get away without ever creating closure objects. A generational GC helps a lot with these objects, because they die immediately. Here's how our are GC times are affected:

unmodified: 0.358s spent in GC
modified: 0.069s spent in GC

Interestingly, this suggests that 80% of the GC work in earley-boyer is to collect closure-related objects.

We execute 1.78x as many instructions as V8 does (and our running time is 1.75x of V8). We take 65x as many L3 misses. It would be great to get a better idea of why we're still slower. I'm guessing it's partly due to cache misses and partly due to worse code being generated.

Comment 11

7 years ago
Nice experiment!  The SM unmodified/modified times line up with the 41%-in-3-stubs figure in comment 9.  I confirmed in callgrind that the modified earley-boyer spends very little time is spent in stubs: 3.5% in StrictlyEqual, 3% in FixupArity.

What happens to SM vs. V8 relative L3 miss rate (and time) if you make the V8 engine's nursery as big as the SM heap (at the point where we decide to GC)?
(In reply to Luke Wagner [:luke] from comment #11)
> What happens to SM vs. V8 relative L3 miss rate (and time) if you make the
> V8 engine's nursery as big as the SM heap (at the point where we decide to
> GC)?

I'll try to do this, but I'm not sure it will work. Somehow we definitely need a way to simulate a standard mark-and-sweep collector in V8. Unfortunately, every time I make non-trivial changes to V8 it just crashes in generated code.

I was interested in how much of a price we pay for the remaining stubs. I modified the benchmark again to specialize sc_list and sc_append for the number of arguments passed in (it was always 1, 2, or 3). This also allowed me to eliminate uses of arguments. Here's how it looks with this change (on top of the closure thing):

              SpiderMonkey          V8
unmodified           7.32s       2.99s
no-closures          4.36s       2.51s
no-args/closures     3.90s       2.40s

This seems like an optimization we could pull off. Certainly IonMonkey should be able to do it, and maybe even JM (especially if we're already going to inline the function). It doesn't appear to be something V8 does, since they get faster. However, they seem to pay less of a cost for this sort of thing, so we ought to be able to at least match them.
I added some code to the methodjit to get rid of the strict equality stub calls. Doing object/object comparisons takes us down to 3.775s. Doing string/string comparisons (which is easy because both strings are atoms in this case) gets us down to 3.675s.
Oops, I made a pretty silly mistake. I was using a 64-bit SpiderMonkey build. Here are the revised numbers.

unmodified: 7.2s
no-closures: 3.8s
no-args/closures: 3.43s
no-args/closures, fast ===: 3.07s
Also, here's what happens if I split up earley and boyer (using the no-args/closures version):

           SM      V8
earley  1.27s   0.68s
boyer   1.86s   1.78s

So we do reasonably well on boyer and we get killed on earley. Interestingly, although both earley and boyer allocate a lot, boyer allocates 2x as many sc_Pair cells as earley. So there must be something else going on in earley that's making us slow.
Created attachment 563850 [details]
earley-boyer modified with manual closures

I've attached a revised version of the earlier modified earley-boyer. The previous one passed upvars in as additional arguments. This one behaves more realistically. It creates a "closure" object (just an object literal) that holds all the upvars. All local closed-over variables are accessed through the closure. The closure object is passed as an argument to the function being invoked so that it can access its upvars.

I think this mirrors pretty closely how we've discussed implementing call objects in the future. In fact, it's a little pessimistic. Since call objects don't have to be JSObjects, we could avoid filling in all the garbagey fields like proto, lastProp, etc. One thing I'm unsure about is function cloning. I still don't understand the details of when and why this needs to happen, so maybe I've missed something.

This code also contains the modifications I made so that sc_list and sc_append don't need to access |arguments|. I also made a test that includes *just* the sc_list and sc_append changes without the closure stuff. Here are the times for these:

                                     SM      V8
modified-sc_list                  6.43s   2.99s
modified-sc_list-and-closures     3.43s   3.10s

V8 is actually faster without the closure modifications. This isn't too surprising, I think. It makes sense that native closures done right would be faster than manually implemented ones.

In summary, if we can implement closures efficiently *and* we fix the arguments issue, then we could probably be within 10% of V8 on this benchmark. Hopefully the remaining 10% could be made up by objshrink.



Separately, I was interested in how much work we would have to do regarding arguments. So I made a version of the benchmark with only the closure modifications.

                                     SM      V8
baseline                          6.84s   3.00s
modified-sc_list                  6.43s   2.99s
modified-closures                 3.86s   3.21s
modified-sc_list-and-closures     3.43s   3.10s

So the sc_list changes get us about 0.4s and they get V8 about ~0.1s (or less depending on which pair you look at). I tried breaking down the 0.4s a little more. It seems like about half of it is from FixupArgs and the other half is from accessing the arguments object.

Comment 17

7 years ago
Comment on attachment 563850 [details]
earley-boyer modified with manual closures

>/************* GENERATED FILE - DO NOT EDIT *************/

yeah whateva

Comment 18

7 years ago
So, IIUC, your modified earley-boyer creates a commensurate number of call objects (via object literals instead of call objects) but it does not create the 5.9M function objects (I see it only creates ~100K).  To answer your question: all function cloning really does is remember the enclosing call object (via obj->parent) so that, when the function object is called, that call object goes on the scope chain.  In theory, if it wasn't for the fact that functions are real objects that can have properties, a function value would just be a pair (JSObject *closure, JSFunction *code).

So, since your benchmark creates a ton of objects, do you have any estimate on what generational GC would do to our time?
Crap. I'm not sure how I screwed this up, but it looks like all the V8 numbers are wrong. I recompiled V8 to add some GC instrumentation and it got a lot faster. Somehow my build must have been screwed up.

I guess I'll get back to this later. It looks like we have a lot further to go, though.
It turns out my V8 numbers were wrong because I accidentally used a version of V8 where the young generation was fixed to be 512K. Normally it starts at 512K and grows to 8M based on a heuristic.

However, this mistake spurred me to look at V8 running time versus nursery size. Here is the data. The final row shows the results for SpiderMonkey/TI, just for comparison.

Nursery InstrCnt   L1miss    L2miss    L3miss    GCTime   TotalTime
  512   17.026B  175.294M   16.326M    0.195M    1.061s    3.023s
 1024   16.216B  163.877M   12.454M    0.379M    0.884s    2.848s
 2048   14.275B  130.160M    9.041M    0.485M    0.507s    2.449s
 4096   12.974B  114.727M    8.314M    1.190M    0.264s    2.222s
 8192   12.126B   92.158M    5.083M    1.736M    0.094s    2.092s
16384   12.057B   92.218M    5.187M    2.039M    0.074s    2.009s
32768   11.895B   86.108M    4.372M    2.414M    0.029s    1.976s
49152   11.842B   79.849M    4.690M    2.897M    0.016s    2.018s
65536   11.812B   77.747M    4.960M    3.338M    0.007s    1.963s
*SM*    35.396B  767.335M   25.097M   20.107M    0.296s    7.222s

The nursery size is given in kilobytes. The most interesting thing is that the larger you make the nursery, the faster V8 gets (see last column). Using a 64MB nursery doesn't seem like it should be all that different from doing mark and sweep collection. However, it's interesting to notice that they spend a miniscule amount of time in GC with a 64MB nursery. I wonder why that is, and if we could do better? I'll try to understand how they possible GC so little.

The other interesting phenomenon is that L1 and L2 misses go *down* as the nursery gets bigger. I guess this isn't too surprising, since 512K is already larger than my L2 cache (256K). I'm kinda guessing that they take the extra misses during the GC, and so as they spend less time in the GC, they take fewer misses. However, it would be hard to confirm this.

L3 misses go up as the nursery gets bigger, which kinda makes sense. However, they don't seem to be affecting performance at all. If they were on the critical path, they would incur a pretty significant cost (about 1.11s, using my cache latencies). Since they don't, I guess they must not be on the critical path.

It would be interesting to try to separate out these results based on whether mutator or GC code is running. I'm guessing there's a way to do this with a profiler and some scripting.
(Assignee)

Comment 21

7 years ago
Great experiment. 

Isn't the decrease in GC time with larger nursery just the usual thing where the GC time for a copying collector goes to zero as you increase the semispace size? At first I was quite surprised that they spend much more time in GC with a small nursery. But I guess that would be because they end up copying "everything", taking extra work? It's mildly fascinating that the GC time is about the same with the size they actually use.

I compute a different estimate for the effect of L3 misses, or more to the point, main memory accesses. From 512K to 64MB, they take 3.143M more L3 misses. Chris said you had recently found main memory latency to be 150 cycles. L3 hit latency is 30 cycles, so that's 120 cycles additional latency. I don't know your CPU speed but assuming some kind of proportionality I get 3.143e6 * 120 / 2.933e9 = 0.129 seconds "total latency difference"

For the L3 hits, I compute the "total latency difference" as ((16.326e6-.195e6)-(4.960e6-3.338e6)) * 18 / 2.933e9 = 0.090 s, which mostly cancels the main memory total latency difference. The L2 hits come out as ((175.294e6-16.326e6)-(77.747e6-4.960e6)) * 8 / 2.933e9 = 0.234 s. So if the latencies just add up, I would expect the large nursery case to be even faster than it is. 

But it looks like the changes in memory hits actually don't matter much, because the IPC hardly changes: from 1.92 at the top to 2.05 at the bottom. That does suggest that cache latencies are overall blocking things less, which does make sense from the numbers above.

Interestingly, SM's IPC is only 1.67, which suggests all of our extra cache misses are slowing us down, maybe 0-20%. I would tend to think that's mostly from object sizes. Is there a way we can tell if it's from object sizes or from our GC architecture?
You need to log in before you can comment on or make changes to this bug.