Last Comment Bug 580018 - Make our GC not make v8 splay.js slow
: Make our GC not make v8 splay.js slow
Status: RESOLVED WORKSFORME
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: All All
: -- normal (vote)
: ---
Assigned To: general
:
: Jason Orendorff [:jorendorff]
Mentors:
Depends on:
Blocks: WebJSPerf
  Show dependency treegraph
 
Reported: 2010-07-19 13:38 PDT by David Mandelin [:dmandelin]
Modified: 2014-04-25 15:15 PDT (History)
16 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
add splay.js to js/src/v8 suite (13.12 KB, patch)
2010-07-26 13:42 PDT, Brian Hackett (:bhackett)
no flags Details | Diff | Splinter Review

Description David Mandelin [:dmandelin] 2010-07-19 13:38:54 PDT
Our score on v8-splay is about 1/3 that of JSC (i.e., they run the basic test function in 1/3 the time). We spend about 1/3 of our time on that test in GC, so we cannot have a good score unless we make the GC better there. For the record, JSC also spends about 1/3 of their time in GC on that test, so they are about 3x as fast on both GC and the mutator for that test.

gwagner collected some further numbers on this in bug 556133, pasted below. If I read them right, there are 2 GC cycles, one with a relatively long mark phase, and the other running a lot of object and string finalizers.

splay:
AppTime,  Total,   Mark,  Sweep, FinObj, FinStr,  Destroy,  newChunks, destoyChunks
     0.0,  157.7,  131.3,   26.3,   19.3,    6.9,      0.0,         54,          0 
   373.7,  508.4,    0.8,  507.6,  297.8,  209.7,      0.1,         11,          0

All numbers are cycles 10e6 except newChunks and destroyChunks represent absolute numbers.
Total... 	Total GC time
Mark...  	Total Mark time
Sweep... 	Total Sweep time
FinObj		Time spent in Finalize Object
FinStr		Time spent in Finalize String
Destroy		Time spent in sweep Compartments, Expire GC chunks
newChunks	Chunks allocated since last GC
destroyChunks	Chunks destroyed since last GC
Comment 1 Gregor Wagner [:gwagner] 2010-07-19 15:07:27 PDT
So for comparison I was looking at the webkit GC times for the v8-splay benchmark.
Once again they have lazy finalization and run the finalizer when a new object at the same slot is allocated. 
The GC strategy is to reset all the mark bits and mark all live objects. 
For a new allocation they have to find the first unmarked object slot and run the finalizer for this slot.

reset: 0.859354
reset: 1.464454
reset: 2.792359
reset: 7.718746
reset: 13.524234
reset: 28.450631
reset: 57.795398
reset: 115.305404
reset: 226.262655
destroy: 154.760660

All numbers are 1E6 cycles measured with rdtsc. 
reset ....  reset the heap (remark all objects)
destroy ... destroy the heap

They have more reset cycles and the marking takes longer once the heap gets bigger. If we add all numbers we are not that much slower. I will double check if I got the right numbers but I can't see a 3x performance gap.
Comment 2 Andreas Gal :gal 2010-07-19 15:15:22 PDT
Gregor and I have been playing with the idea of scanning the (last GC's) mark bitmap during allocation instead of assembling free lists. That gets more expensive as the heap fills up, but it avoids touching a ton of pages at the end of the GC phase.
Comment 3 Igor Bukanov 2010-07-20 01:49:00 PDT
(In reply to comment #2)
> Gregor and I have been playing with the idea of scanning the (last GC's) mark
> bitmap during allocation instead of assembling free lists. That gets more
> expensive as the heap fills up, but it avoids touching a ton of pages at the
> end of the GC phase.

xpconnect finalizers should runduring the GC. So to implement the above it would be necessary to separate ordinary objects from objects with external finalizers. On the other hand, if the win in js shell would be clear we can see if the finalizers can be delayed.
Comment 4 Andreas Gal :gal 2010-07-20 01:59:29 PDT
Using bitmaps does not preclude finalization. Its completely orthogonal.

Also, finalizers do not have to run during GC, neither builtin objects, nor DOM/XPConnect objects. Some finalizers have to run on the main thread, but thats about it.
Comment 5 Igor Bukanov 2010-07-20 03:06:41 PDT
gal - do you still have that marking inlining patch around? It showed a nice win the last time.

As for finalization I think we should try to move at least string finalization to a separated thread.
Comment 6 Igor Bukanov 2010-07-20 03:19:32 PDT
(In reply to comment #4)
> Also, finalizers do not have to run during GC, neither builtin objects, nor
> DOM/XPConnect objects. Some finalizers have to run on the main thread, but
> thats about it.

IIRC some finalizers skips locks as they assume that they are run during the GC and everything is serialized. Also JSGC_FINALIZE_END hook from XPCJSRuntime::GCCallback assumes that it runs after finalization, see, for examle, MarkAutoRootsAfterJSFinalize etc.
Comment 7 Brian Hackett (:bhackett) 2010-07-26 08:50:33 PDT
(In reply to comment #0)
> gwagner collected some further numbers on this in bug 556133, pasted below. If
> I read them right, there are 2 GC cycles, one with a relatively long mark
> phase, and the other running a lot of object and string finalizers.

Isn't the second GC cycle running at process teardown, after the end time has already been collected?  When I use Date() to compute the time taken on v8-splay.js, I get this behavior:

get start time
call SplaySetup - 1032ms
--- the first GC cycle is in here, trying to collect pieces of the still-reachable giant splay tree. hence the long mark phase.
call SplayRun - 10ms
call SplayTeardown - 5ms
get end time
--- the second GC cycle is after JS finishes and the process is exiting.  SplayTeardown made everything unreachable so mark is quick, but everything needs to be destroyed so there are many finalizers to call.

Problem: timing v8-splay.js (using the one pulled from webkit/SunSpider/tests/v8-v4 I take it?) is not timing the same thing as the V8 benchmark website.  To improve the former we need to make SplaySetup faster, to improve the latter we need to make SplayRun faster.
Comment 8 David Mandelin [:dmandelin] 2010-07-26 11:32:07 PDT
(In reply to comment #7)
> (In reply to comment #0)
> > gwagner collected some further numbers on this in bug 556133, pasted below. If
> > I read them right, there are 2 GC cycles, one with a relatively long mark
> > phase, and the other running a lot of object and string finalizers.
> 
> Isn't the second GC cycle running at process teardown, after the end time has
> already been collected?

Looks like you're right. I based my comments on a straight Shark run, so there wasn't a super-easy way to isolate just the middle run part.
Comment 9 Brian Hackett (:bhackett) 2010-07-26 13:42:30 PDT
Created attachment 460324 [details] [diff] [review]
add splay.js to js/src/v8 suite

This patch adds splay.js to the suite in js/src/v8, which stresses the inner run() method in the same way as the V8 benchmark website, as opposed to the sunspider harness method of timing the entire thing exactly once.  I like this as it stresses the code in the same way the website does; this makes a big difference for splay.js, not so much the other benchmarks.

For the website version, here is the time breakdown for the inner run() loop on my laptop:

SlowCall: 3.01 ms (12803 count)
NewString: 3.33 ms (211200 count)
Relational.>: 4.24 ms (147983 count)
Relational.<: 5.71 ms (197472 count)
NewArray: 12.57 ms (204800 count)
NewDenseArrayObject: 12.74 ms (204800 count)
String: 16.48 ms (6400 count)
SetName: 25.74 ms (263354 count)
InitPropOrMethod: 34.84 ms (806400 count)
ensureDenseArrayElements: 36.86 ms (204800 count)
NewFinalizableGCThing: 53.02 ms (857680 count)
ConcatN: 66.93 ms (204800 count)
NewInitObject: 76.5 ms (403200 count)
GC: 335.3 ms (2 count)
Comment 10 Brian Hackett (:bhackett) 2010-07-27 15:29:18 PDT
Looking at splay.js some more, this benchmark is pretty messed up.

// This benchmark is based on a JavaScript log processing module used
// by the V8 profiler to generate execution time profiles for runs of
// JavaScript applications, and it effectively measures how fast the
// JavaScript engine is at allocating nodes and reclaiming the memory
// used for old nodes. Because of the way splay trees work, the engine
// also has to deal with a lot of changes to the large tree object
// graph.

...

function SplayRun() {
  // Replace a few nodes in the splay tree.
  for (var i = 0; i < kSplayTreeModifications; i++) {
    var key = InsertNewNode();
    var greatest = splayTree.findGreatestLessThan(key);
    if (greatest == null) splayTree.remove(key);
    else splayTree.remove(greatest.key);
  }
}

After reading these, I thought the idea of this benchmark was to insert and remove nodes from the splay tree more or less at random.  What it actually does is always remove the node it just added.  If I change this line in findGreatestLessThan:

    return this.root_;

to:

    return this.root_.left;

V8's score on splay goes from 11490 to 2292, and its overall score goes from 7333 to 5598.
Comment 11 Brian Hackett (:bhackett) 2010-07-27 15:37:04 PDT
Forgot to say: score for JSC or JM (anything using a mark/sweep collector) is unaffected by this change.
Comment 12 Brendan Eich [:brendan] 2010-07-27 18:40:28 PDT
Nick has had run-ins with splay too, even filed a chromium issue on its bogosity.

/be
Comment 13 Nicholas Nethercote [:njn] 2010-07-27 18:44:32 PDT
(In reply to comment #12)
> Nick has had run-ins with splay too, even filed a chromium issue on its
> bogosity.

Yeah, bug 562553, which links to http://code.google.com/p/v8/issues/detail?id=690

It's a totally synthetic benchmarks.  I'm sick of synthetic benchmarks.  When is the SPEC corporation gonna come out with a JS suite?  SPEC ain't perfect but it's made the world of C and Fortran benchmarking a lot better.  Hennessy and Patterson would be appalled by SunSpider and V8.
Comment 14 Brendan Eich [:brendan] 2010-07-27 22:23:57 PDT
Nick, good news is Andreas and I visited MSR today, working with Ben Livshits, Ben Zorn, and Jan Vitek's student Gregor Richards (interning this summer with the Bens) on JSBench, successor to JSMeter, based on real web workload measurements.

http://research.microsoft.com/en-us/projects/jsmeter/

More when there's more to say.

/be
Comment 15 Gregor Wagner [:gwagner] 2011-07-27 12:59:25 PDT
Our splay score in the browser is now: 6120
With 150 other tabs open: 5764

Chrome canary with a single tab gets: 3251

I guess we can close this bug?
Comment 16 Nicholas Nethercote [:njn] 2011-07-27 17:29:34 PDT
It's also interesting to see that AWFY says that on July 15 V8's perf on splay got a lot worse.  It was getting ~3900 consistently, now it's bimodal, getting either either ~2800 or ~3300.
Comment 17 Gregor Wagner [:gwagner] 2011-07-28 12:44:02 PDT
(In reply to comment #16)
> It's also interesting to see that AWFY says that on July 15 V8's perf on
> splay got a lot worse.  It was getting ~3900 consistently, now it's bimodal,
> getting either either ~2800 or ~3300.

Bill was this your instrumentation patch on 7/15?
Comment 18 Bill McCloskey (:billm) 2011-07-28 13:08:41 PDT
No, I think Nick was talking about V8 (as in the JS engine, not the benchmark).
Comment 19 Gregor Wagner [:gwagner] 2011-07-28 13:21:48 PDT
(In reply to comment #18)
> No, I think Nick was talking about V8 (as in the JS engine, not the
> benchmark).

Ahh. Too many overloaded terms!

Note You need to log in before you can comment on or make changes to this bug.