580018 - Make our GC not make v8 splay.js slow

Reporter

Description

•

15 years ago

Our score on v8-splay is about 1/3 that of JSC (i.e., they run the basic test function in 1/3 the time). We spend about 1/3 of our time on that test in GC, so we cannot have a good score unless we make the GC better there. For the record, JSC also spends about 1/3 of their time in GC on that test, so they are about 3x as fast on both GC and the mutator for that test. gwagner collected some further numbers on this in bug 556133, pasted below. If I read them right, there are 2 GC cycles, one with a relatively long mark phase, and the other running a lot of object and string finalizers. splay: AppTime, Total, Mark, Sweep, FinObj, FinStr, Destroy, newChunks, destoyChunks 0.0, 157.7, 131.3, 26.3, 19.3, 6.9, 0.0, 54, 0 373.7, 508.4, 0.8, 507.6, 297.8, 209.7, 0.1, 11, 0 All numbers are cycles 10e6 except newChunks and destroyChunks represent absolute numbers. Total... Total GC time Mark... Total Mark time Sweep... Total Sweep time FinObj Time spent in Finalize Object FinStr Time spent in Finalize String Destroy Time spent in sweep Compartments, Expire GC chunks newChunks Chunks allocated since last GC destroyChunks Chunks destroyed since last GC

Gregor Wagner [:gwagner]

Comment 1

•

15 years ago

So for comparison I was looking at the webkit GC times for the v8-splay benchmark. Once again they have lazy finalization and run the finalizer when a new object at the same slot is allocated. The GC strategy is to reset all the mark bits and mark all live objects. For a new allocation they have to find the first unmarked object slot and run the finalizer for this slot. reset: 0.859354 reset: 1.464454 reset: 2.792359 reset: 7.718746 reset: 13.524234 reset: 28.450631 reset: 57.795398 reset: 115.305404 reset: 226.262655 destroy: 154.760660 All numbers are 1E6 cycles measured with rdtsc. reset .... reset the heap (remark all objects) destroy ... destroy the heap They have more reset cycles and the marking takes longer once the heap gets bigger. If we add all numbers we are not that much slower. I will double check if I got the right numbers but I can't see a 3x performance gap.

Andreas Gal :gal

Comment 2

•

15 years ago

Gregor and I have been playing with the idea of scanning the (last GC's) mark bitmap during allocation instead of assembling free lists. That gets more expensive as the heap fills up, but it avoids touching a ton of pages at the end of the GC phase.

Igor Bukanov

Comment 3

•

15 years ago

(In reply to comment #2) > Gregor and I have been playing with the idea of scanning the (last GC's) mark > bitmap during allocation instead of assembling free lists. That gets more > expensive as the heap fills up, but it avoids touching a ton of pages at the > end of the GC phase. xpconnect finalizers should runduring the GC. So to implement the above it would be necessary to separate ordinary objects from objects with external finalizers. On the other hand, if the win in js shell would be clear we can see if the finalizers can be delayed.

Andreas Gal :gal

Comment 4

•

15 years ago

Using bitmaps does not preclude finalization. Its completely orthogonal. Also, finalizers do not have to run during GC, neither builtin objects, nor DOM/XPConnect objects. Some finalizers have to run on the main thread, but thats about it.

Igor Bukanov

Comment 5

•

15 years ago

gal - do you still have that marking inlining patch around? It showed a nice win the last time. As for finalization I think we should try to move at least string finalization to a separated thread.

Igor Bukanov

Comment 6

•

15 years ago

(In reply to comment #4) > Also, finalizers do not have to run during GC, neither builtin objects, nor > DOM/XPConnect objects. Some finalizers have to run on the main thread, but > thats about it. IIRC some finalizers skips locks as they assume that they are run during the GC and everything is serialized. Also JSGC_FINALIZE_END hook from XPCJSRuntime::GCCallback assumes that it runs after finalization, see, for examle, MarkAutoRootsAfterJSFinalize etc.

Brian Hackett [Laid off!]

Comment 7

•

15 years ago

(In reply to comment #0) > gwagner collected some further numbers on this in bug 556133, pasted below. If > I read them right, there are 2 GC cycles, one with a relatively long mark > phase, and the other running a lot of object and string finalizers. Isn't the second GC cycle running at process teardown, after the end time has already been collected? When I use Date() to compute the time taken on v8-splay.js, I get this behavior: get start time call SplaySetup - 1032ms --- the first GC cycle is in here, trying to collect pieces of the still-reachable giant splay tree. hence the long mark phase. call SplayRun - 10ms call SplayTeardown - 5ms get end time --- the second GC cycle is after JS finishes and the process is exiting. SplayTeardown made everything unreachable so mark is quick, but everything needs to be destroyed so there are many finalizers to call. Problem: timing v8-splay.js (using the one pulled from webkit/SunSpider/tests/v8-v4 I take it?) is not timing the same thing as the V8 benchmark website. To improve the former we need to make SplaySetup faster, to improve the latter we need to make SplayRun faster.

David Mandelin [:dmandelin]

Reporter

Comment 8

•

15 years ago

(In reply to comment #7) > (In reply to comment #0) > > gwagner collected some further numbers on this in bug 556133, pasted below. If > > I read them right, there are 2 GC cycles, one with a relatively long mark > > phase, and the other running a lot of object and string finalizers. > > Isn't the second GC cycle running at process teardown, after the end time has > already been collected? Looks like you're right. I based my comments on a straight Shark run, so there wasn't a super-easy way to isolate just the middle run part.

Blocks: WebJSPerf
No longer blocks: JaegerSpeed

Brian Hackett [Laid off!]

Comment 9

•

15 years ago

Attached patch add splay.js to js/src/v8 suite — Details — Splinter Review

This patch adds splay.js to the suite in js/src/v8, which stresses the inner run() method in the same way as the V8 benchmark website, as opposed to the sunspider harness method of timing the entire thing exactly once. I like this as it stresses the code in the same way the website does; this makes a big difference for splay.js, not so much the other benchmarks. For the website version, here is the time breakdown for the inner run() loop on my laptop: SlowCall: 3.01 ms (12803 count) NewString: 3.33 ms (211200 count) Relational.>: 4.24 ms (147983 count) Relational.<: 5.71 ms (197472 count) NewArray: 12.57 ms (204800 count) NewDenseArrayObject: 12.74 ms (204800 count) String: 16.48 ms (6400 count) SetName: 25.74 ms (263354 count) InitPropOrMethod: 34.84 ms (806400 count) ensureDenseArrayElements: 36.86 ms (204800 count) NewFinalizableGCThing: 53.02 ms (857680 count) ConcatN: 66.93 ms (204800 count) NewInitObject: 76.5 ms (403200 count) GC: 335.3 ms (2 count)

Brian Hackett [Laid off!]

Comment 10

•

15 years ago

Looking at splay.js some more, this benchmark is pretty messed up. // This benchmark is based on a JavaScript log processing module used // by the V8 profiler to generate execution time profiles for runs of // JavaScript applications, and it effectively measures how fast the // JavaScript engine is at allocating nodes and reclaiming the memory // used for old nodes. Because of the way splay trees work, the engine // also has to deal with a lot of changes to the large tree object // graph. ... function SplayRun() { // Replace a few nodes in the splay tree. for (var i = 0; i < kSplayTreeModifications; i++) { var key = InsertNewNode(); var greatest = splayTree.findGreatestLessThan(key); if (greatest == null) splayTree.remove(key); else splayTree.remove(greatest.key); } } After reading these, I thought the idea of this benchmark was to insert and remove nodes from the splay tree more or less at random. What it actually does is always remove the node it just added. If I change this line in findGreatestLessThan: return this.root_; to: return this.root_.left; V8's score on splay goes from 11490 to 2292, and its overall score goes from 7333 to 5598.

Brian Hackett [Laid off!]

Comment 11

•

15 years ago

Forgot to say: score for JSC or JM (anything using a mark/sweep collector) is unaffected by this change.

Brendan Eich [:brendan]

Comment 12

•

15 years ago

Nick has had run-ins with splay too, even filed a chromium issue on its bogosity. /be

Nicholas Nethercote [inactive]

Comment 13

•

15 years ago

(In reply to comment #12) > Nick has had run-ins with splay too, even filed a chromium issue on its > bogosity. Yeah, bug 562553, which links to http://code.google.com/p/v8/issues/detail?id=690 It's a totally synthetic benchmarks. I'm sick of synthetic benchmarks. When is the SPEC corporation gonna come out with a JS suite? SPEC ain't perfect but it's made the world of C and Fortran benchmarking a lot better. Hennessy and Patterson would be appalled by SunSpider and V8.

Brendan Eich [:brendan]

Comment 14

•

15 years ago

Nick, good news is Andreas and I visited MSR today, working with Ben Livshits, Ben Zorn, and Jan Vitek's student Gregor Richards (interning this summer with the Bens) on JSBench, successor to JSMeter, based on real web workload measurements. http://research.microsoft.com/en-us/projects/jsmeter/ More when there's more to say. /be

Gregor Wagner [:gwagner]

Comment 15

•

14 years ago

Our splay score in the browser is now: 6120 With 150 other tabs open: 5764 Chrome canary with a single tab gets: 3251 I guess we can close this bug?

David Mandelin [:dmandelin]

Reporter

Updated

•

14 years ago

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → WORKSFORME

Nicholas Nethercote [inactive]

Comment 16

•

14 years ago

It's also interesting to see that AWFY says that on July 15 V8's perf on splay got a lot worse. It was getting ~3900 consistently, now it's bimodal, getting either either ~2800 or ~3300.

Gregor Wagner [:gwagner]

Comment 17

•

14 years ago

(In reply to comment #16) > It's also interesting to see that AWFY says that on July 15 V8's perf on > splay got a lot worse. It was getting ~3900 consistently, now it's bimodal, > getting either either ~2800 or ~3300. Bill was this your instrumentation patch on 7/15?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 18

•

14 years ago

No, I think Nick was talking about V8 (as in the JS engine, not the benchmark).

Gregor Wagner [:gwagner]

Comment 19

•

14 years ago

(In reply to comment #18) > No, I think Nick was talking about V8 (as in the JS engine, not the > benchmark). Ahh. Too many overloaded terms!