Closed Bug 505004 Opened 15 years ago Closed 13 years ago

TM: GC scopes instead of doing reference counted malloc/free

Categories: Core :: JavaScript Engine (defect)
Platform: x86, macOS
Priority: Not set
Severity: normal

Tracking: RESOLVED DUPLICATE of bug 569422

People: Reporter: gal; Assigned: gal

Attachments: (1 file)

In a threadsafe build, the following code is heavily gated on atomic updating of scope->nrefs and JS_malloc/JS_free.

for (var i = 0; i < 1000000; ++i)
    ({}).x = 1;
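
For illustration, here is a conceptual sketch of the pre-patch hot path this loop stresses (hypothetical types and helper names, not the real JSScope layout): every short-lived scope pays an atomic reference-count update plus a JS_malloc/JS_free round trip, and the atomic RMWs are what serialize the threadsafe build.

// Conceptual sketch only -- hypothetical stand-ins, not the real JSScope layout.
#include <atomic>
#include <cstdint>
#include <cstdlib>

struct RefCountedScope {
    std::atomic<uint32_t> nrefs;    // stand-in for scope->nrefs
    // ... property table, shape data, etc. ...
};

static inline void HoldScope(RefCountedScope *scope) {
    scope->nrefs.fetch_add(1, std::memory_order_relaxed);       // atomic RMW on every hold
}

static inline void DropScope(RefCountedScope *scope) {
    if (scope->nrefs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        std::free(scope);           // stand-in for JS_free: another allocator round trip
}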

The attached patch makes JSScope GC'd instead of reference counted. It's a net loss of 3ms in an ST shell and a win of about 7ms in an MT shell.

This was a crazy 4am hack with Blake. Once I wake up tomorrow I will do some profiling to figure out why we lose time in the ST shell.

We had to bump the heap size from 64MB to 128MB in the shell to cope with all the extra GC use. GC allocating scopes is probably a good thing for fragmentation, but it also means that we put more pressure on our already sub-optimal GC allocator path (and don't get me started on the GC collection path).
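
(For reference, the shell-side bump amounts to something like the following -- a sketch only, assuming the shell still sizes its heap via JS_NewRuntime and JS_SetGCParameter(JSGC_MAX_BYTES, ...); the helper name is hypothetical.)

#include "jsapi.h"

/* Sketch of the 64MB -> 128MB bump, not the actual shell diff. */
static JSRuntime *NewShellRuntime()
{
    JSRuntime *rt = JS_NewRuntime(128L * 1024L * 1024L);           /* was 64MB */
    if (rt)
        JS_SetGCParameter(rt, JSGC_MAX_BYTES, 128L * 1024L * 1024L);
    return rt;
}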

The first obvious improvement would be to increase the size of scopes to 64 bytes so we don't do random integer mul/div during allocation. More to follow.
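
(The mul/div comes from mapping a thing pointer back to its slot within an arena, which divides by the thing size; a power-of-two size turns that into a shift. A hypothetical illustration, not the real jsgc.cpp arithmetic:)

#include <stddef.h>

/* Hypothetical arena math, for illustration only. */
static inline size_t ThingIndex(char *arenaBase, char *thing, size_t thingSize)
{
    return (size_t)(thing - arenaBase) / thingSize;   /* 40- or 48-byte things: real integer division */
}

static inline size_t ThingIndexPow2(char *arenaBase, char *thing)
{
    return (size_t)(thing - arenaBase) >> 6;          /* 64-byte things: just a shift */
}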
Attached patch: patch (Splinter Review)
Assignee: general → gal
This patch also fixes a random bug we ran across where we try to unlock a scope after failing to allocate it. scope will be NULL, and we die. The condition is rare and the result is a safe NULL crash.
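
(A minimal sketch of the pattern being fixed; the helper names below are hypothetical stand-ins, not the real SpiderMonkey allocation and locking code.)

struct JSContext;
struct Scope;

Scope *AllocScope(JSContext *cx);                 /* assumed: returns NULL on OOM */
void UnlockScope(JSContext *cx, Scope *scope);    /* assumed: not NULL-safe */

static bool MakeScope(JSContext *cx, Scope **out)
{
    Scope *scope = AllocScope(cx);
    if (!scope)
        return false;                             /* bail before any unlock: scope is NULL here */
    UnlockScope(cx, scope);                       /* only reached with a live scope */
    *out = scope;
    return true;
}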
I have some initial numbers, but I don't believe them; I suspect I'm
measuring the wrong builds/versions, in some way.  Basically they
show that the with-patch version runs the test loop in Comment #0
in about 2/3 the time of the without-patch version.  That might be
because the baseline version uses a 64M heap instead of a 128M heap;
but changing the heap to 128M for the baseline makes very little difference.

What am I measuring wrong?


Using TM   changeset:   30372:362d7b098a1e
and patch in Comment #1 of https://bugzilla.mozilla.org/show_bug.cgi?id=505004

gcc (SUSE Linux) 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036]

on openSUSE 11.0, Core2 Duo, 2.4GHz, 4MB L2



cd WS_505004_ST_base/js/src/BuildR             # WITHOUT patch

CC='gcc -m32 -g' CXX='g++ -m32 -g' AR=ar ../configure --disable-debug \
        --enable-optimize --target=i686-pc-linux-gnu

make clean
make -j 2

Identical configure and build in WS_505004_ST_with/js/src/BuildR    # WITH patch


## _with is much faster than _base, viz:

spinup && time ./WS_505004_ST_base/js/src/BuildR/js -j ./test505004.js
produces 0.240, 0.248 user seconds

spinup && time ./WS_505004_ST_with/js/src/BuildR/js -j ./test505004.js
produces 0.168, 0.164 user seconds


## actual event numbers back that up, and are exactly repeatable
## over multiple runs.

vTRUNK --smc-check=all --tool=cachegrind --branch-sim=yes \
       ./WS_505004_ST_base/js/src/BuildR/js -j ./test505004.js

==4779== I   refs:      889,861,347
==4779== I1  misses:          5,404
==4779== L2i misses:          4,162
==4779== I1  miss rate:        0.00%
==4779== L2i miss rate:        0.00%
==4779==
==4779== D   refs:      508,062,017  (309,778,129 rd   + 198,283,888 wr)
==4779== D1  misses:      3,365,261  (  2,082,981 rd   +   1,282,280 wr)
==4779== L2d misses:      3,243,398  (  1,968,923 rd   +   1,274,475 wr)
==4779== D1  miss rate:         0.6% (        0.6%     +         0.6%  )
==4779== L2d miss rate:         0.6% (        0.6%     +         0.6%  )
==4779==
==4779== L2 refs:         3,370,665  (  2,088,385 rd   +   1,282,280 wr)
==4779== L2 misses:       3,247,560  (  1,973,085 rd   +   1,274,475 wr)
==4779== L2 miss rate:          0.2% (        0.1%     +         0.6%  )
==4779==
==4779== Branches:      113,140,157  (111,118,104 cond +   2,022,053 ind)
==4779== Mispredicts:       916,896  (    916,020 cond +         876 ind)
==4779== Mispred rate:          0.8% (        0.8%     +         0.0%   )


vTRUNK --smc-check=all --tool=cachegrind --branch-sim=yes \
       ./WS_505004_ST_with/js/src/BuildR/js -j ./test505004.js

==4783== I   refs:      569,920,442
==4783== I1  misses:          5,635
==4783== L2i misses:          4,208
==4783== I1  miss rate:        0.00%
==4783== L2i miss rate:        0.00%
==4783==
==4783== D   refs:      314,945,299  (182,886,956 rd   + 132,058,343 wr)
==4783== D1  misses:      2,404,633  (    853,986 rd   +   1,550,647 wr)
==4783== L2d misses:      2,367,303  (    818,047 rd   +   1,549,256 wr)
==4783== D1  miss rate:         0.7% (        0.4%     +         1.1%  )
==4783== L2d miss rate:         0.7% (        0.4%     +         1.1%  )
==4783==
==4783== L2 refs:         2,410,268  (    859,621 rd   +   1,550,647 wr)
==4783== L2 misses:       2,371,511  (    822,255 rd   +   1,549,256 wr)
==4783== L2 miss rate:          0.2% (        0.1%     +         1.1%  )
==4783==
==4783== Branches:       64,471,757  ( 64,429,510 cond +      42,247 ind)
==4783== Mispredicts:        82,575  (     81,695 cond +         880 ind)
==4783== Mispred rate:          0.1% (        0.1%     +         2.0%   )
Cachegrind counts instructions, memory references, (simulated)
I1, D1, and L2 accesses and misses, and (simulated) direct and indirect
branches and mispredicts, at per-program, per-function, and per-line
granularity.  I am thinking of extending it to count bus-lock events
(lock prefixes) too.
Are you saying that you don't believe your numbers because they're too good?  The 3ms/7ms thing from comment 0 is referring to SunSpider, not to the loop, as far as I can tell.
Comment on attachment 389284 [details] [diff] [review]
patch

>diff --git a/js/src/jsapi.cpp b/js/src/jsapi.cpp
>--- a/js/src/jsapi.cpp
>+++ b/js/src/jsapi.cpp
>@@ -2561,17 +2561,17 @@ JS_SetGCCallbackRT(JSRuntime *rt, JSGCCa
>     rt->gcCallback = cb;
>     return oldcb;
> }
> 
> JS_PUBLIC_API(JSBool)
> JS_IsAboutToBeFinalized(JSContext *cx, void *thing)
> {
>     JS_ASSERT(thing);
>-    return js_IsAboutToBeFinalized(cx, thing);
>+    return js_IsAboutToBeFinalized(thing);
> }
> 
> JS_PUBLIC_API(void)
> JS_SetGCParameter(JSRuntime *rt, JSGCParamKey key, uint32 value)
> {
>     switch (key) {
>       case JSGC_MAX_BYTES:
>         rt->gcMaxBytes = value;
>diff --git a/js/src/jsarray.cpp b/js/src/jsarray.cpp
>--- a/js/src/jsarray.cpp
>+++ b/js/src/jsarray.cpp
>@@ -1260,27 +1260,27 @@ js_MakeArraySlow(JSContext *cx, JSObject
...
>     for (uint32 i = 0; i < capacity; i++) {
>         jsid id;
>         JSScopeProperty *sprop;
> 
>         if (!JS_ValueToId(cx, INT_TO_JSVAL(i), &id))
>-            goto out_bad;
>+            return JS_FALSE;
...
>         sprop = scope->add(cx, id, NULL, NULL, i + JS_INITIAL_NSLOTS,
>                            JSPROP_ENUMERATE, 0, 0);
>         if (!sprop)
>-            goto out_bad;
>+            return JS_FALSE;
...
>@@ -1292,20 +1292,16 @@ js_MakeArraySlow(JSContext *cx, JSObject
...
>     obj->map = &scope->map;
>     return JS_TRUE;
>-
>-  out_bad:
>-    JSScope::destroy(cx, scope);
>-    return JS_FALSE;

At this point you may as well use true for the final return and false for the two early ones -- no others remain, and you are touching almost all the lines (or the lines right after them) involved.

>@@ -2059,16 +2059,21 @@ fail:
>     return NULL;
> }
> 
> extern JSObject* js_NewGCObject(JSContext *cx, uintN flags)
> {
>     return NewGCThing<JSObject>(cx, flags);
> }
> 
>+extern JSScope* js_NewGCScope(JSContext *cx, uintN flags)
>+{
>+    return NewGCThing<JSScope>(cx, flags);
>+}
>+
> extern JSString* js_NewGCString(JSContext *cx, uintN flags)
> {
>     return NewGCThing<JSString>(cx, flags);
> }
> 
> extern JSFunction* js_NewGCFunction(JSContext *cx, uintN flags)

Fix all of these style breaks -- lose the extern, put a space before the *, and a newline after the *.

>+void
>+PurgeDeadWaitingScopes(JSRuntime *rt)

Rename to UnlinkDeadScopesToShare.

>@@ -165,16 +167,19 @@ struct JSGCThing {
>  * Allocates a new GC thing of the given size. After a successful allocation
>  * the caller must fully initialize the thing before calling any function that
>  * can potentially trigger GC. This will ensure that GC tracing never sees junk
>  * values stored in the partially initialized thing.
>  */
> extern JSObject*
> js_NewGCObject(JSContext *cx, uintN flags);
> 
>+extern JSScope*
>+js_NewGCScope(JSContext *cx, uintN flags);

I didn't review these names, but looking now, they seem misleading. "GCScope" in particular sounds like a local root ("handle") scope. js_NewObjectGCThing, js_NewScopeGCThing, etc. would be better.

Even better: static methods JSGCThing::newObject, JSGCThing::newScope, etc. members!

Separately, is there any reason these are not inline?
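
For concreteness, roughly the shape such static inline members could take -- a sketch only, assuming NewGCThing<T> were visible wherever JSGCThing is defined:

/* Sketch; not part of the patch. */
struct JSGCThing {
    /* ... existing members ... */

    static inline JSObject *newObject(JSContext *cx, uintN flags) {
        return NewGCThing<JSObject>(cx, flags);
    }
    static inline JSScope *newScope(JSContext *cx, uintN flags) {
        return NewGCThing<JSScope>(cx, flags);
    }
};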

Great patch otherwise, had one like it in my q until the big hg corruption lossage the other week.

/be
Hit me for a review on a revised patch, I'll plus it quickly with nits picked and static inline JSGCThing "new" helpers (C++ allows static inlines, right?).

/be
The methods are not inline because that would require exposing the guts of all GCThings into jsgc.h, which is used by most sites defining them (include order and typedef order issue).

The revised patch still slows down ST shell. Until that's resolved, there won't be an r=? coming, but I'll send it your way when it's done :)
(In reply to comment #8)
> The methods are not inline because that would require exposing the guts of all
> GCThings into jsgc.h, which is used by most sites defining them (include order
> and typedef order issue).

So? Wouldn't it help with perf? Allocator fast paths require inlining in the experience of savvy GC hackers I know (Lars Hansen has testified to this re: his experience at Opera).

We could make a jsgcguts.h but why? The existing jsgc.h is already a private .h by convention.

> The revised patch still slows down ST shell.

You mean SS?

> Until that's resolved, there won't be an r=? coming, but I'll send it your way
> when it's done :)

Prompt finalization due to ref-counting can reduce working set size. OTOH there are fewer cycles spent adjusting ref-counts and actually finalizing.

Could it be that working set size increase is to blame?

GC does not run during SS, please correct if wrong.

/be
> 
> So? Wouldn't it help with perf? Allocator fast paths require inlining in the
> experience of savvy GC hackers I know (Lars Hansen has testified to this re:
> his experience at Opera).

Of course it would. I have a separate patch (two subsequent patches, actually) to create an inlined allocator fast path. It will be all good. I promise :)
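
(For the curious, such a fast path is usually a per-context free-list pop that avoids the GC lock; a sketch under that assumption -- the free-list field is hypothetical, and this is not the actual follow-up patch:)

/* Hypothetical inlined fast path; cx->gcFreeList is an assumed field. */
template <typename T>
static inline T *NewGCThingFast(JSContext *cx, uintN flags)
{
    JSGCThing *thing = cx->gcFreeList;            /* assumed per-context free list */
    if (JS_LIKELY(thing != NULL)) {
        cx->gcFreeList = thing->next;             /* pop without taking the GC lock */
        return reinterpret_cast<T *>(thing);
    }
    return NewGCThing<T>(cx, flags);              /* slow path: refill the arena, maybe GC */
}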

> 
> We could make a jsgcguts.h but why? The existing jsgc.h is already a private .h
> by convention.
> 
> > The revised patch still slows down ST shell.
> 
> You mean SS?

Yeah. We win big on micro-benchmarks and lose on SS in the ST shell (in the MT shell we win).

> 
> > Until that's resolved, there won't be an r=? coming, but I'll send it your way
> > when it's done :)
> 
> Prompt finalization due to ref-counting can reduce working set size. OTOH there
> are fewer cycles spent adjusting ref-counts and actually finalizing.
> 

There is no prompt finalization. We only finalize the scope when the GC runs, so working set _should_ be the same?

> Could it be that working set size increase is to blame?
> 
> GC does not run during SS, please correct if wrong.

My guess is on cache alignment and working set size. Trying to measure both right now. JSScope is currently 40 bytes. It probably wants to be 48.
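
(Illustration of the kind of padding being considered -- hypothetical, not the actual layout change: 48-byte things land on 16-byte boundaries inside an arena, while 40-byte things do not.)

/* Hypothetical padding sketch; size checked at compile time. */
struct PaddedScope {
    unsigned char body[40];                       /* stand-in for today's 40-byte JSScope */
    unsigned char pad[8];                         /* bump to 48 */
};
typedef int AssertPaddedScopeIs48[sizeof(PaddedScope) == 48 ? 1 : -1];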

> 
> /be

I will start throwing the reviews your way for the fast path in NewGC*.
(In reply to comment #10)
> Of course it would. I have a separate patch (two subsequent patches, actually)
> to create an inlined allocator fast path. It will be all good. I promise :)

Sure, separate bug(s), ideally one. Which bug #?

> > Prompt finalization due to ref-counting can reduce working set size. OTOH there
> > are fewer cycles spent adjusting ref-counts and actually finalizing.
> > 
> 
> There is no prompt finalization.

I was talking about ref-counting (pre-patch status quo), obviously.

> We only finalize the scope when the GC runs,
> so working set _should_ be the same?

No, with GC'ed scopes average and high-water working set both go up.

> JSScope is currently 40 bytes. It probably wants to be 48.

Is that why your patch (forgot to note this when reviewing) adds an align member in the wrong place? To pack things you want to replace the nrefs member with that align member.

But gross memory effects might be at work here too.

/be
(In reply to comment #11)
> > There is no prompt finalization.
> 
> I was talking about ref-counting (pre-patch status quo), obviously.

We won't collect the garbage scopes until a second GC, IOW.

/be
Blocks: 505308
No longer blocks: 505308
Blocks: 505308
(In reply to comment #12)
> (In reply to comment #11)
> > > There is no prompt finalization.
> > 
> > I was talking about ref-counting (pre-patch status quo), obviously.
> 
> We won't collect the garbage scopes until a second GC, IOW.

This isn't true (good to confirm). But we are mmapping more. Is that hurting? Measure time in kernel via shark w/ vs w/o patch. If it's ~3ms there you go.

/be
Ok, so if I bump the heap size to 128MB, performance is not affected.

If I apply the patch and bump to 128MB, we slow down by 3ms.

The first cachegrind is for a TM tip with a 128MB heap.

==77915== I   refs:      3,469,750,852
==77915== I1  misses:       12,199,015
==77915== L2i misses:           24,778
==77915== I1  miss rate:          0.35%
==77915== L2i miss rate:          0.00%
==77915== 
==77915== D   refs:      1,778,989,508  (1,088,616,084 rd   + 690,373,424 wr)
==77915== D1  misses:        4,714,862  (    3,423,605 rd   +   1,291,257 wr)
==77915== L2d misses:        1,393,297  (      563,817 rd   +     829,480 wr)
==77915== D1  miss rate:           0.2% (          0.3%     +         0.1%  )
==77915== L2d miss rate:           0.0% (          0.0%     +         0.1%  )
==77915== 
==77915== L2 refs:          16,913,877  (   15,622,620 rd   +   1,291,257 wr)
==77915== L2 misses:         1,418,075  (      588,595 rd   +     829,480 wr)
==77915== L2 miss rate:            0.0% (          0.0%     +         0.1%  )

The second cachegrind is with the vanilla patch.

==78181== I   refs:      3,474,627,445
==78181== I1  misses:       10,094,191
==78181== L2i misses:           24,926
==78181== I1  miss rate:          0.29%
==78181== L2i miss rate:          0.00%
==78181== 
==78181== D   refs:      1,785,796,320  (1,091,143,782 rd   + 694,652,538 wr)
==78181== D1  misses:        4,729,886  (    3,431,171 rd   +   1,298,715 wr)
==78181== L2d misses:        1,383,531  (      562,321 rd   +     821,210 wr)
==78181== D1  miss rate:           0.2% (          0.3%     +         0.1%  )
==78181== L2d miss rate:           0.0% (          0.0%     +         0.1%  )
==78181== 
==78181== L2 refs:          14,824,077  (   13,525,362 rd   +   1,298,715 wr)
==78181== L2 misses:         1,408,457  (      587,247 rd   +     821,210 wr)
==78181== L2 miss rate:            0.0% (          0.0%     +         0.1%  )

The third cachegrind is with the patch and JSScope increased from 40 to 48 bytes for better cache alignment.

==78402== I   refs:      3,467,989,687
==78402== I1  misses:       10,039,708
==78402== L2i misses:           24,955
==78402== I1  miss rate:          0.28%
==78402== L2i miss rate:          0.00%
==78402== 
==78402== D   refs:      1,781,801,386  (1,089,024,301 rd   + 692,777,085 wr)
==78402== D1  misses:        4,733,289  (    3,436,530 rd   +   1,296,759 wr)
==78402== L2d misses:        1,383,874  (      561,900 rd   +     821,974 wr)
==78402== D1  miss rate:           0.2% (          0.3%     +         0.1%  )
==78402== L2d miss rate:           0.0% (          0.0%     +         0.1%  )
==78402== 
==78402== L2 refs:          14,772,997  (   13,476,238 rd   +   1,296,759 wr)
==78402== L2 misses:         1,408,829  (      586,855 rd   +     821,974 wr)
==78402== L2 miss rate:            0.0% (          0.0%     +         0.1%  )
I reduce instruction misses a bit. I incur roughly 20,000 more D1 misses, but I doubt that adds up to 3ms. I also incur roughly 10,000 fewer L2 misses, so it should even out either way.
(In reply to comment #14)
Looks like you're pretty much at the noise floor with these numbers.

No mispredict info?  It is occasionally significant -- use --branch-sim=yes.

I'll poke at getting lock-prefix counting working in cachegrind tomorrow.
Ok, I measured again with branch prediction emulation:

WITHOUT:

==78801== I   refs:      3,527,838,694
==78801== I1  misses:       10,132,028
==78801== L2i misses:           24,982
==78801== I1  miss rate:          0.28%
==78801== L2i miss rate:          0.00%
==78801== 
==78801== D   refs:      1,812,992,896  (1,108,678,395 rd   + 704,314,501 wr)
==78801== D1  misses:        5,112,744  (    3,798,345 rd   +   1,314,399 wr)
==78801== L2d misses:        1,400,396  (      571,182 rd   +     829,214 wr)
==78801== D1  miss rate:           0.2% (          0.3%     +         0.1%  )
==78801== L2d miss rate:           0.0% (          0.0%     +         0.1%  )
==78801== 
==78801== L2 refs:          15,244,772  (   13,930,373 rd   +   1,314,399 wr)
==78801== L2 misses:         1,425,378  (      596,164 rd   +     829,214 wr)
==78801== L2 miss rate:            0.0% (          0.0%     +         0.1%  )
==78801== 
==78801== Branches:        425,488,216  (  413,654,797 cond +  11,833,419 ind)
==78801== Mispredicts:      19,714,845  (   14,325,387 cond +   5,389,458 ind)
==78801== Mispred rate:            4.6% (          3.4%     +        45.5%   )

WITH patch + 48 byte JSScope objects:

==79062== I   refs:      3,442,961,046
==79062== I1  misses:       12,457,269
==79062== L2i misses:           24,711
==79062== I1  miss rate:          0.36%
==79062== L2i miss rate:          0.00%
==79062== 
==79062== D   refs:      1,770,936,838  (1,082,251,561 rd   + 688,685,277 wr)
==79062== D1  misses:        4,892,878  (    3,562,977 rd   +   1,329,901 wr)
==79062== L2d misses:        1,385,574  (      556,323 rd   +     829,251 wr)
==79062== D1  miss rate:           0.2% (          0.3%     +         0.1%  )
==79062== L2d miss rate:           0.0% (          0.0%     +         0.1%  )
==79062== 
==79062== L2 refs:          17,350,147  (   16,020,246 rd   +   1,329,901 wr)
==79062== L2 misses:         1,410,285  (      581,034 rd   +     829,251 wr)
==79062== L2 miss rate:            0.0% (          0.0%     +         0.1%  )
==79062== 
==79062== Branches:        418,439,373  (  406,864,626 cond +  11,574,747 ind)
==79062== Mispredicts:      18,913,171  (   13,968,106 cond +   4,945,065 ind)
==79062== Mispred rate:            4.5% (          3.4%     +        42.7%   )
I get noticeably more L2 refs, but fewer L2 misses in the end. Everything else improves, yet we are 3ms slower. Weird.
Use Shark to check for actual L2 cache misses.
(In reply to comment #19)
> Use shark to check for actual L2 cache misses

And time in the kernel, if not system call counts and latencies -- comment 13 was ignored, or else I missed the results (?).

/be
#20: patience, patience, working on it. I already bumped the heap size to 128MB on the other candidate used for comparison, but there is no guarantee the extra memory is used. I will try to measure that too in a sec.
Does this bug depend on bug 506174 then? Still want that missing time accounted for -- my money's on our way-too-many mmap calls.

/be
Depends on: 503080
Depends on: 506125
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE