Our score on v8-splay is about 1/3 that of JSC (i.e., they run the basic test function in 1/3 the time). We spend about 1/3 of our time on that test in GC, so we cannot have a good score unless we make the GC better there. For the record, JSC also spends about 1/3 of their time in GC on that test, so they are about 3x as fast on both GC and the mutator for that test.

gwagner collected some further numbers on this in bug 556133, pasted below. If I read them right, there are 2 GC cycles, one with a relatively long mark phase, and the other running a lot of object and string finalizers.

splay:
  AppTime  Total  Mark   Sweep  FinObj  FinStr  Destroy  newChunks  destroyChunks
      0.0  157.7  131.3   26.3    19.3     6.9      0.0         54              0
    373.7  508.4    0.8  507.6   297.8   209.7      0.1         11              0

All numbers are 1e6 cycles, except newChunks and destroyChunks, which are absolute counts.

Total ........... total GC time
Mark ............ total mark time
Sweep ........... total sweep time
FinObj .......... time spent finalizing objects
FinStr .......... time spent finalizing strings
Destroy ......... time spent sweeping compartments and expiring GC chunks
newChunks ....... chunks allocated since the last GC
destroyChunks ... chunks destroyed since the last GC
So for comparison I was looking at the WebKit GC times for the v8-splay benchmark. Once again, they have lazy finalization and run the finalizer when a new object at the same slot is allocated. The GC strategy is to reset all the mark bits and mark all live objects. For a new allocation they have to find the first unmarked object slot and run the finalizer for that slot.

reset:     0.859354
reset:     1.464454
reset:     2.792359
reset:     7.718746
reset:    13.524234
reset:    28.450631
reset:    57.795398
reset:   115.305404
reset:   226.262655
destroy: 154.760660

All numbers are 1E6 cycles measured with rdtsc.

reset ..... reset the heap (re-mark all objects)
destroy ... destroy the heap

They have more reset cycles, and the marking takes longer once the heap gets bigger. If we add up all the numbers, we are not that much slower. I will double-check that I got the right numbers, but I can't see a 3x performance gap.
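A minimal sketch of this lazy-finalization scheme, with hypothetical `Cell`/`Block` names (not WebKit's actual types): the collector only resets and rebuilds mark bits; the allocator scans forward for the first unmarked slot and runs the finalizer on whatever dead object still occupies it before reusing the memory.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

// Hypothetical names; a sketch of lazy finalization, not WebKit's code.
constexpr std::size_t kCellsPerBlock = 64;

struct Cell {
    bool constructed = false;  // a dead object may still sit here, unfinalized
    int payload = 0;
};

struct Block {
    std::array<Cell, kCellsPerBlock> cells;
    std::bitset<kCellsPerBlock> markBits;  // set by the marker for live cells
    std::size_t cursor = 0;                // next slot to consider

    // GC "reset": clear all mark bits; the marker then re-sets bits for live
    // objects. Dead objects are NOT finalized here; that work is deferred.
    void resetMarks() { markBits.reset(); cursor = 0; }

    // Lazy sweep: find the first unmarked slot, finalize whatever dead
    // object occupies it, and hand the slot to the caller.
    Cell* allocate() {
        while (cursor < kCellsPerBlock) {
            std::size_t i = cursor++;
            if (!markBits.test(i)) {
                if (cells[i].constructed)
                    finalize(cells[i]);  // the deferred finalizer runs now
                cells[i].constructed = true;
                markBits.set(i);         // newly allocated counts as live
                return &cells[i];
            }
        }
        return nullptr;  // block full; a real allocator would grab a new block
    }

    static void finalize(Cell& c) { c.constructed = false; c.payload = 0; }
};
```

This matches the cost profile described above: GC itself touches only the bitmap, while allocation pays for the scan and the finalizer, which is why the reset numbers grow as the heap gets bigger.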
Gregor and I have been playing with the idea of scanning the (last GC's) mark bitmap during allocation instead of assembling free lists. That gets more expensive as the heap fills up, but it avoids touching a ton of pages at the end of the GC phase.
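A sketch of what scanning the last GC's mark bitmap at allocation time might look like (names hypothetical): instead of sweeping every arena into a free list at the end of GC, the allocator walks bitmap words and takes the first zero bit, so arenas that see no allocation traffic are never touched. `__builtin_ctzll` is the GCC/Clang count-trailing-zeros builtin.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: find the next free cell by scanning the mark-bitmap
// words left over from the last GC, rather than consuming a prebuilt free
// list. A zero bit means the cell was dead at the last GC and may be reused.
constexpr std::size_t kWords = 4;        // 4 * 64 = 256 cells per arena
constexpr std::size_t kBitsPerWord = 64;

struct Arena {
    uint64_t markWords[kWords];          // last GC's mark bitmap
    std::size_t scanWord = 0;            // resume point for the scan

    // Returns a cell index, or -1 if every cell is still live.
    // The scan gets more expensive as the heap fills (more set bits to
    // skip), but no pass over the whole heap is needed at GC end.
    long allocateCell() {
        for (; scanWord < kWords; ++scanWord) {
            uint64_t w = markWords[scanWord];
            if (w != ~uint64_t(0)) {
                // Lowest zero bit of w == lowest set bit of ~w.
                unsigned bit = unsigned(__builtin_ctzll(~w));
                markWords[scanWord] |= uint64_t(1) << bit;  // now occupied
                return long(scanWord * kBitsPerWord + bit);
            }
        }
        return -1;
    }
};
```

Scanning a word at a time is what keeps the per-allocation cost small when the heap is sparse: a fully live word is skipped with one compare rather than 64 per-cell checks.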
(In reply to comment #2)
> Gregor and I have been playing with the idea of scanning the (last GC's) mark
> bitmap during allocation instead of assembling free lists. That gets more
> expensive as the heap fills up, but it avoids touching a ton of pages at the
> end of the GC phase.

XPConnect finalizers should run during the GC. So to implement the above it would be necessary to separate ordinary objects from objects with external finalizers. On the other hand, if the win in the JS shell turns out to be clear, we can see whether those finalizers can be delayed.
Using bitmaps does not preclude finalization. It's completely orthogonal. Also, finalizers do not have to run during GC, neither for builtin objects nor for DOM/XPConnect objects. Some finalizers have to run on the main thread, but that's about it.
gal - do you still have that marking inlining patch around? It showed a nice win the last time. As for finalization, I think we should try to move at least string finalization to a separate thread.
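A sketch of what off-thread string finalization could look like (all names hypothetical, not the actual SpiderMonkey API): the sweep phase only detaches the dead strings into a list, and a helper thread frees their character buffers, so the mutator never blocks on free(). Finalizers that must run on the main thread, as discussed above, would be excluded from this path.

```cpp
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical sketch of deferred string finalization on a helper thread.
struct GCString {
    char* chars;  // heap buffer the finalizer must free
};

// Sweep only detaches the dead strings; no free() happens here.
std::vector<GCString*> sweepDeadStrings(std::vector<GCString*>& dead) {
    std::vector<GCString*> out;
    out.swap(dead);  // leaves `dead` empty for the next GC cycle
    return out;
}

// Runs on the helper thread; frees character data and the headers.
void finalizeStrings(std::vector<GCString*> list) {
    for (GCString* s : list) {
        std::free(s->chars);
        delete s;
    }
}

// Hand the dead strings to a background thread; returns how many were queued.
int finalizeOffThread(std::vector<GCString*>& dead) {
    int queued = int(dead.size());
    std::thread helper(finalizeStrings, sweepDeadStrings(dead));
    helper.join();  // a real engine would join lazily, before the next GC
    return queued;
}
```

The join here is immediate only to keep the sketch self-contained; the point of the design is that the join can be deferred until the next GC needs the memory back.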
(In reply to comment #4)
> Also, finalizers do not have to run during GC, neither builtin objects, nor
> DOM/XPConnect objects. Some finalizers have to run on the main thread, but
> thats about it.

IIRC some finalizers skip locks, as they assume that they are run during the GC and everything is serialized. Also, the JSGC_FINALIZE_END hook from XPCJSRuntime::GCCallback assumes that it runs after finalization; see, for example, MarkAutoRootsAfterJSFinalize etc.
(In reply to comment #0)
> gwagner collected some further numbers on this in bug 556133, pasted below. If
> I read them right, there are 2 GC cycles, one with a relatively long mark
> phase, and the other running a lot of object and string finalizers.

Isn't the second GC cycle running at process teardown, after the end time has already been collected? When I use Date() to compute the time taken on v8-splay.js, I get this behavior:

  get start time
  call SplaySetup - 1032ms
    --- the first GC cycle is in here, trying to collect pieces of the
        still-reachable giant splay tree, hence the long mark phase.
  call SplayRun - 10ms
  call SplayTeardown - 5ms
  get end time
    --- the second GC cycle is after JS finishes and the process is exiting.
        SplayTeardown made everything unreachable so mark is quick, but
        everything needs to be destroyed, so there are many finalizers to call.

Problem: timing v8-splay.js (using the one pulled from webkit/SunSpider/tests/v8-v4, I take it?) is not timing the same thing as the V8 benchmark website. To improve the former we need to make SplaySetup faster; to improve the latter we need to make SplayRun faster.
(In reply to comment #7) > (In reply to comment #0) > > gwagner collected some further numbers on this in bug 556133, pasted below. If > > I read them right, there are 2 GC cycles, one with a relatively long mark > > phase, and the other running a lot of object and string finalizers. > > Isn't the second GC cycle running at process teardown, after the end time has > already been collected? Looks like you're right. I based my comments on a straight Shark run, so there wasn't a super-easy way to isolate just the middle run part.
Created attachment 460324 [details] [diff] [review]
add splay.js to js/src/v8 suite

This patch adds splay.js to the suite in js/src/v8, which stresses the inner run() method in the same way as the V8 benchmark website, as opposed to the SunSpider harness method of timing the entire thing exactly once. I like this as it stresses the code in the same way the website does; this makes a big difference for splay.js, not so much for the other benchmarks.

For the website version, here is the time breakdown for the inner run() loop on my laptop:

  SlowCall:                   3.01 ms  (12803 count)
  NewString:                  3.33 ms  (211200 count)
  Relational.>:               4.24 ms  (147983 count)
  Relational.<:               5.71 ms  (197472 count)
  NewArray:                  12.57 ms  (204800 count)
  NewDenseArrayObject:       12.74 ms  (204800 count)
  String:                    16.48 ms  (6400 count)
  SetName:                   25.74 ms  (263354 count)
  InitPropOrMethod:          34.84 ms  (806400 count)
  ensureDenseArrayElements:  36.86 ms  (204800 count)
  NewFinalizableGCThing:     53.02 ms  (857680 count)
  ConcatN:                   66.93 ms  (204800 count)
  NewInitObject:             76.5 ms   (403200 count)
  GC:                       335.3 ms   (2 count)
Forgot to say: score for JSC or JM (anything using a mark/sweep collector) is unaffected by this change.
Nick has had run-ins with splay too, even filed a chromium issue on its bogosity. /be
(In reply to comment #12)
> Nick has had run-ins with splay too, even filed a chromium issue on its
> bogosity.

Yeah, bug 562553, which links to http://code.google.com/p/v8/issues/detail?id=690

It's a totally synthetic benchmark. I'm sick of synthetic benchmarks. When is the SPEC corporation gonna come out with a JS suite? SPEC ain't perfect but it's made the world of C and Fortran benchmarking a lot better. Hennessy and Patterson would be appalled by SunSpider and V8.
Nick, good news is Andreas and I visited MSR today, working with Ben Livshits, Ben Zorn, and Jan Vitek's student Gregor Richards (interning this summer with the Bens) on JSBench, successor to JSMeter, based on real web workload measurements. http://research.microsoft.com/en-us/projects/jsmeter/ More when there's more to say. /be
Our splay score in the browser is now: 6120
With 150 other tabs open: 5764
Chrome canary with a single tab gets: 3251

I guess we can close this bug?
It's also interesting to see that AWFY says that on July 15 V8's perf on splay got a lot worse. It was getting ~3900 consistently; now it's bimodal, getting either ~2800 or ~3300.
(In reply to comment #16)
> It's also interesting to see that AWFY says that on July 15 V8's perf on
> splay got a lot worse. It was getting ~3900 consistently; now it's bimodal,
> getting either ~2800 or ~3300.

Bill, was this your instrumentation patch on 7/15?
No, I think Nick was talking about V8 (as in the JS engine, not the benchmark).
(In reply to comment #18) > No, I think Nick was talking about V8 (as in the JS engine, not the > benchmark). Ahh. Too many overloaded terms!