Closed Bug 505004 Opened 15 years ago Closed 13 years ago

TM: GC scopes instead of doing reference counted malloc/free

Categories: Core :: JavaScript Engine (defect)
Platform: x86, macOS
Priority: Not set
Severity: normal

Tracking: RESOLVED DUPLICATE of bug 569422

People: Reporter: gal; Assigned: gal

Attachments: (1 file)

In a threadsafe build, the following code is heavily gated on atomic updating of scope->nrefs and JS_malloc/JS_free.

for (var i = 0; i < 1000000; ++i)
    ({}).x = 1;
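
For illustration, here is a conceptual sketch of the pre-patch hot path this loop stresses (hypothetical types and helper names, not the real JSScope layout): every short-lived scope pays an atomic reference-count update plus a JS_malloc/JS_free round trip, and the atomic RMWs are what serialize the threadsafe build.

// Conceptual sketch only -- hypothetical stand-ins, not the real JSScope layout.
#include <atomic>
#include <cstdint>
#include <cstdlib>

struct RefCountedScope {
    std::atomic<uint32_t> nrefs;    // stand-in for scope->nrefs
    // ... property table, shape data, etc. ...
};

static inline void HoldScope(RefCountedScope *scope) {
    scope->nrefs.fetch_add(1, std::memory_order_relaxed);       // atomic RMW on every hold
}

static inline void DropScope(RefCountedScope *scope) {
    if (scope->nrefs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        std::free(scope);           // stand-in for JS_free: another allocator round trip
}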

The attached patch makes JSScope GC'd instead of reference counted. It's a net loss of 3ms in an ST shell and a win of about 7ms in an MT shell.

This was a crazy 4am hack with Blake. Once I wake up tomorrow I will do some profiling to figure out why we lose time in the ST shell.

We had to bump the heap size from 64MB to 128MB in the shell to cope with all the extra GC use. GC allocating scopes is probably a good thing for fragmentation, but it also means that we put more pressure on our already sub-optimal GC allocator path (and don't get me started on the GC collection path).
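
(For reference, the shell-side bump amounts to something like the following -- a sketch only, assuming the shell still sizes its heap via JS_NewRuntime and JS_SetGCParameter(JSGC_MAX_BYTES, ...); the helper name is hypothetical.)

#include "jsapi.h"

/* Sketch of the 64MB -> 128MB bump, not the actual shell diff. */
static JSRuntime *NewShellRuntime()
{
    JSRuntime *rt = JS_NewRuntime(128L * 1024L * 1024L);           /* was 64MB */
    if (rt)
        JS_SetGCParameter(rt, JSGC_MAX_BYTES, 128L * 1024L * 1024L);
    return rt;
}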

The first obvious improvement would be to increase the size of scopes to 64 bytes so we don't do random integer mul/div during allocation. More to follow.
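
(The mul/div comes from mapping a thing pointer back to its slot within an arena, which divides by the thing size; a power-of-two size turns that into a shift. A hypothetical illustration, not the real jsgc.cpp arithmetic:)

#include <stddef.h>

/* Hypothetical arena math, for illustration only. */
static inline size_t ThingIndex(char *arenaBase, char *thing, size_t thingSize)
{
    return (size_t)(thing - arenaBase) / thingSize;   /* 40- or 48-byte things: real integer division */
}

static inline size_t ThingIndexPow2(char *arenaBase, char *thing)
{
    return (size_t)(thing - arenaBase) >> 6;          /* 64-byte things: just a shift */
}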
Attached patch: patch (Splinter Review)
Assignee: general → gal
This patch also fixes a random bug we ran across where we try to unlock a scope after failing to allocate it. scope will be NULL, and we die. The condition is rare and the result is a safe NULL crash.
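
(A minimal sketch of the pattern being fixed; the helper names below are hypothetical stand-ins, not the real SpiderMonkey allocation and locking code.)

struct JSContext;
struct Scope;

Scope *AllocScope(JSContext *cx);                 /* assumed: returns NULL on OOM */
void UnlockScope(JSContext *cx, Scope *scope);    /* assumed: not NULL-safe */

static bool MakeScope(JSContext *cx, Scope **out)
{
    Scope *scope = AllocScope(cx);
    if (!scope)
        return false;                             /* bail before any unlock: scope is NULL here */
    UnlockScope(cx, scope);                       /* only reached with a live scope */
    *out = scope;
    return true;
}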
I have some initial numbers, but I don't believe them; I suspect I'm
measuring the wrong builds/versions, in some way.  Basically they
show that the with-patch version runs the test loop in Comment #0
in about 2/3 the time of the without-patch version.  That might be
because the baseline version uses a 64M heap instead of a 128M heap;
but changing the heap to 128M for the baseline makes very little difference.

What am I measuring wrong?


Using TM   changeset:   30372:362d7b098a1e
and patch in Comment #1 of https://bugzilla.mozilla.org/show_bug.cgi?id=505004

gcc (SUSE Linux) 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036]

on openSUSE 11.0, Core2 Duo, 2.4GHz, 4MB L2



cd WS_505004_ST_base/js/src/BuildR             # WITHOUT patch

CC='gcc -m32 -g' CXX='g++ -m32 -g' AR=ar ../configure --disable-debug \
        --enable-optimize --target=i686-pc-linux-gnu

make clean
make -j 2

Identical configure and build in WS_505004_ST_with/js/src/BuildR    # WITH patch


## _with is much faster than _base, viz:

spinup && time ./WS_505004_ST_base/js/src/BuildR/js -j ./test505004.js
produces 0.240, 0.248 user seconds

spinup && time ./WS_505004_ST_with/js/src/BuildR/js -j ./test505004.js
produces 0.168, 0.164 user seconds


## actual event numbers back that up, and are exactly repeatable
## over multiple runs.

vTRUNK --smc-check=all --tool=cachegrind --branch-sim=yes \
       ./WS_505004_ST_base/js/src/BuildR/js -j ./test505004.js

==4779== I   refs:      889,861,347
==4779== I1  misses:          5,404
==4779== L2i misses:          4,162
==4779== I1  miss rate:        0.00%
==4779== L2i miss rate:        0.00%
==4779==
==4779== D   refs:      508,062,017  (309,778,129 rd   + 198,283,888 wr)
==4779== D1  misses:      3,365,261  (  2,082,981 rd   +   1,282,280 wr)
==4779== L2d misses:      3,243,398  (  1,968,923 rd   +   1,274,475 wr)
==4779== D1  miss rate:         0.6% (        0.6%     +         0.6%  )
==4779== L2d miss rate:         0.6% (        0.6%     +         0.6%  )
==4779==
==4779== L2 refs:         3,370,665  (  2,088,385 rd   +   1,282,280 wr)
==4779== L2 misses:       3,247,560  (  1,973,085 rd   +   1,274,475 wr)
==4779== L2 miss rate:          0.2% (        0.1%     +         0.6%  )
==4779==
==4779== Branches:      113,140,157  (111,118,104 cond +   2,022,053 ind)
==4779== Mispredicts:       916,896  (    916,020 cond +         876 ind)
==4779== Mispred rate:          0.8% (        0.8%     +         0.0%   )


vTRUNK --smc-check=all --tool=cachegrind --branch-sim=yes \
       ./WS_505004_ST_with/js/src/BuildR/js -j ./test505004.js

==4783== I   refs:      569,920,442
==4783== I1  misses:          5,635
==4783== L2i misses:          4,208
==4783== I1  miss rate:        0.00%
==4783== L2i miss rate:        0.00%
==4783==
==4783== D   refs:      314,945,299  (182,886,956 rd   + 132,058,343 wr)
==4783== D1  misses:      2,404,633  (    853,986 rd   +   1,550,647 wr)
==4783== L2d misses:      2,367,303  (    818,047 rd   +   1,549,256 wr)
==4783== D1  miss rate:         0.7% (        0.4%     +         1.1%  )
==4783== L2d miss rate:         0.7% (        0.4%     +         1.1%  )
==4783==
==4783== L2 refs:         2,410,268  (    859,621 rd   +   1,550,647 wr)
==4783== L2 misses:       2,371,511  (    822,255 rd   +   1,549,256 wr)
==4783== L2 miss rate:          0.2% (        0.1%     +         1.1%  )
==4783==
==4783== Branches:       64,471,757  ( 64,429,510 cond +      42,247 ind)
==4783== Mispredicts:        82,575  (     81,695 cond +         880 ind)
==4783== Mispred rate:          0.1% (        0.1%     +         2.0%   )
Cachegrind counts instructions, memory references, (simulated)
I1, D1, and L2 accesses and misses, and (simulated) direct and indirect
branches and mispredicts, at per-program, per-function, and per-line
granularity.  I am thinking of extending it to count bus-lock events
(lock prefixes) too.
Are you saying that you don't believe your numbers because they're too good?  The 3ms/7ms thing from comment 0 is referring to SunSpider, not to the loop, as far as I can tell.
Comment on attachment 389284 [details] [diff] [review]
patch

>diff --git a/js/src/jsapi.cpp b/js/src/jsapi.cpp
>--- a/js/src/jsapi.cpp
>+++ b/js/src/jsapi.cpp
>@@ -2561,17 +2561,17 @@ JS_SetGCCallbackRT(JSRuntime *rt, JSGCCa
>     rt->gcCallback = cb;
>     return oldcb;
> }
> 
> JS_PUBLIC_API(JSBool)
> JS_IsAboutToBeFinalized(JSContext *cx, void *thing)
> {
>     JS_ASSERT(thing);
>-    return js_IsAboutToBeFinalized(cx, thing);
>+    return js_IsAboutToBeFinalized(thing);
> }
> 
> JS_PUBLIC_API(void)
> JS_SetGCParameter(JSRuntime *rt, JSGCParamKey key, uint32 value)
> {
>     switch (key) {
>       case JSGC_MAX_BYTES:
>         rt->gcMaxBytes = value;
>diff --git a/js/src/jsarray.cpp b/js/src/jsarray.cpp
>--- a/js/src/jsarray.cpp
>+++ b/js/src/jsarray.cpp
>@@ -1260,27 +1260,27 @@ js_MakeArraySlow(JSContext *cx, JSObject
...
>     for (uint32 i = 0; i < capacity; i++) {
>         jsid id;
>         JSScopeProperty *sprop;
> 
>         if (!JS_ValueToId(cx, INT_TO_JSVAL(i), &id))
>-            goto out_bad;
>+            return JS_FALSE;
...
>         sprop = scope->add(cx, id, NULL, NULL, i + JS_INITIAL_NSLOTS,
>                            JSPROP_ENUMERATE, 0, 0);
>         if (!sprop)
>-            goto out_bad;
>+            return JS_FALSE;
...
>@@ -1292,20 +1292,16 @@ js_MakeArraySlow(JSContext *cx, JSObject
...
>     obj->map = &scope->map;
>     return JS_TRUE;
>-
>-  out_bad:
>-    JSScope::destroy(cx, scope);
>-    return JS_FALSE;

At this point you may as well use true for the final return and false for the two early ones -- no others remain, and you are touching almost all the lines (or the lines right after them) involved.

>@@ -2059,16 +2059,21 @@ fail:
>     return NULL;
> }
> 
> extern JSObject* js_NewGCObject(JSContext *cx, uintN flags)
> {
>     return NewGCThing<JSObject>(cx, flags);
> }
> 
>+extern JSScope* js_NewGCScope(JSContext *cx, uintN flags)
>+{
>+    return NewGCThing<JSScope>(cx, flags);
>+}
>+
> extern JSString* js_NewGCString(JSContext *cx, uintN flags)
> {
>     return NewGCThing<JSString>(cx, flags);
> }
> 
> extern JSFunction* js_NewGCFunction(JSContext *cx, uintN flags)

Fix all of these style breaks -- lose the extern, put a space before the *, and a newline after the *.

>+void
>+PurgeDeadWaitingScopes(JSRuntime *rt)

Rename to UnlinkDeadScopesToShare.

>@@ -165,16 +167,19 @@ struct JSGCThing {
>  * Allocates a new GC thing of the given size. After a successful allocation
>  * the caller must fully initialize the thing before calling any function that
>  * can potentially trigger GC. This will ensure that GC tracing never sees junk
>  * values stored in the partially initialized thing.
>  */
> extern JSObject*
> js_NewGCObject(JSContext *cx, uintN flags);
> 
>+extern JSScope*
>+js_NewGCScope(JSContext *cx, uintN flags);

I didn't review these names, but looking now, they seem misleading. "GCScope" in particular sounds like a local root ("handle") scope. js_NewObjectGCThing, js_NewScopeGCThing, etc. would be better.

Even better: static methods JSGCThing::newObject, JSGCThing::newScope, etc. members!

Separately, is there any reason these are not inline?
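
For concreteness, roughly the shape such static inline members could take -- a sketch only, assuming NewGCThing<T> were visible wherever JSGCThing is defined:

/* Sketch; not part of the patch. */
struct JSGCThing {
    /* ... existing members ... */

    static inline JSObject *newObject(JSContext *cx, uintN flags) {
        return NewGCThing<JSObject>(cx, flags);
    }
    static inline JSScope *newScope(JSContext *cx, uintN flags) {
        return NewGCThing<JSScope>(cx, flags);
    }
};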

Great patch otherwise, had one like it in my q until the big hg corruption lossage the other week.

/be
Hit me for a review on a revised patch, I'll plus it quickly with nits picked and static inline JSGCThing "new" helpers (C++ allows static inlines, right?).

/be
The methods are not inline because that would require exposing the guts of all GCThings into jsgc.h, which is used by most sites defining them (include order and typedef order issue).

The revised patch still slows down ST shell. Until that's resolved, there won't be an r=? coming, but I'll send it your way when it's done :)
(In reply to comment #8)
> The methods are not inline because that would require exposing the guts of all
> GCThings into jsgc.h, which is used by most sites defining them (include order
> and typedef order issue).

So? Wouldn't it help with perf? Allocator fast paths require inlining in the experience of savvy GC hackers I know (Lars Hansen has testified to this re: his experience at Opera).

We could make a jsgcguts.h but why? The existing jsgc.h is already a private .h by convention.

> The revised patch still slows down ST shell.

You mean SS?

> Until that's resolved, there won't be an r=? coming, but I'll send it your way
> when it's done :)

Prompt finalization due to ref-counting can reduce working set size. OTOH there are fewer cycles spent adjusting ref-counts and actually finalizing.

Could it be that working set size increase is to blame?

GC does not run during SS, please correct if wrong.

/be
> 
> So? Wouldn't it help with perf? Allocator fast paths require inlining in the
> experience of savvy GC hackers I know (Lars Hansen has testified to this re:
> his experience at Opera).

Of course it would. I have a separate patch (two subsequent patches, actually) to create an inlined allocator fast path. It will be all good. I promise :)
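
(For the curious, such a fast path is usually a per-context free-list pop that avoids the GC lock; a sketch under that assumption -- the free-list field is hypothetical, and this is not the actual follow-up patch:)

/* Hypothetical inlined fast path; cx->gcFreeList is an assumed field. */
template <typename T>
static inline T *NewGCThingFast(JSContext *cx, uintN flags)
{
    JSGCThing *thing = cx->gcFreeList;            /* assumed per-context free list */
    if (JS_LIKELY(thing != NULL)) {
        cx->gcFreeList = thing->next;             /* pop without taking the GC lock */
        return reinterpret_cast<T *>(thing);
    }
    return NewGCThing<T>(cx, flags);              /* slow path: refill the arena, maybe GC */
}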

> 
> We could make a jsgcguts.h but why? The existing jsgc.h is already a private .h
> by convention.
> 
> > The revised patch still slows down ST shell.
> 
> You mean SS?

Yeah. We win big on micro-benchmarks and lose on SS in the ST shell (in the MT shell we win).

> 
> > Until that's resolved, there won't be an r=? coming, but I'll send it your way
> > when it's done :)
> 
> Prompt finalization due to ref-counting can reduce working set size. OTOH there
> are fewer cycles spent adjusting ref-counts and actually finalizing.
> 

There is no prompt finalization. We only finalize the scope when the GC runs, so working set _should_ be the same?

> Could it be that working set size increase is to blame?
> 
> GC does not run during SS, please correct if wrong.

My guess is on cache alignment and working set size. Trying to measure both right now. JSScope is currently 40 bytes. It probably wants to be 48.
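
(Illustration of the kind of padding being considered -- hypothetical, not the actual layout change: 48-byte things land on 16-byte boundaries inside an arena, while 40-byte things do not.)

/* Hypothetical padding sketch; size checked at compile time. */
struct PaddedScope {
    unsigned char body[40];                       /* stand-in for today's 40-byte JSScope */
    unsigned char pad[8];                         /* bump to 48 */
};
typedef int AssertPaddedScopeIs48[sizeof(PaddedScope) == 48 ? 1 : -1];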

> 
> /be

I will start throwing the reviews your way for the fast path in NewGC*.
(In reply to comment #10)
> Of course it would. I have a separate patch (two subsequent patches, actually)
> to create an inlined allocator fast path. It will be all good. I promise :)

Sure, separate bug(s), ideally one. Which bug #?

> > Prompt finalization due to ref-counting can reduce working set size. OTOH there
> > are fewer cycles spent adjusting ref-counts and actually finalizing.
> > 
> 
> There is no prompt finalization.

I was talking about ref-counting (pre-patch status quo), obviously.

> We only finalize the scope when the GC runs,
> so working set _should_ be the same?

No, with GC'ed scopes average and high-water working set both go up.

> JSScope is currently 40 bytes. It probably wants to be 48.

Is that why your patch (forgot to note this when reviewing) adds an align member in the wrong place? To pack things you want to replace the nrefs member with that align member.

But gross memory effects might be at work here too.

/be
(In reply to comment #11)
> > There is no prompt finalization.
> 
> I was talking about ref-counting (pre-patch status quo), obviously.

We won't collect the garbage scopes until a second GC, IOW.

/be
Blocks: 505308
No longer blocks: 505308
Blocks: 505308
(In reply to comment #12)
> (In reply to comment #11)
> > > There is no prompt finalization.
> > 
> > I was talking about ref-counting (pre-patch status quo), obviously.
> 
> We won't collect the garbage scopes until a second GC, IOW.

This isn't true (good to confirm). But we are mmapping more. Is that hurting? Measure time in kernel via shark w/ vs w/o patch. If it's ~3ms there you go.

/be
Ok, so if I bump the heap size to 128MB, performance is not affected.

If I apply the patch and bump to 128MB, we slow down by 3ms.

The first cachegrind is for a TM tip with a 128MB heap.

==77915== I   refs:      3,469,750,852
==77915== I1  misses:       12,199,015
==77915== L2i misses:           24,778
==77915== I1  miss rate:          0.35%
==77915== L2i miss rate:          0.00%
==77915== 
==77915== D   refs:      1,778,989,508  (1,088,616,084 rd   + 690,373,424 wr)
==77915== D1  misses:        4,714,862  (    3,423,605 rd   +   1,291,257 wr)
==77915== L2d misses:        1,393,297  (      563,817 rd   +     829,480 wr)
==77915== D1  miss rate:           0.2% (          0.3%     +         0.1%  )
==77915== L2d miss rate:           0.0% (          0.0%     +         0.1%  )
==77915== 
==77915== L2 refs:          16,913,877  (   15,622,620 rd   +   1,291,257 wr)
==77915== L2 misses:         1,418,075  (      588,595 rd   +     829,480 wr)
==77915== L2 miss rate:            0.0% (          0.0%     +         0.1%  )

The second cachegrind is with the vanilla patch.

==78181== I   refs:      3,474,627,445
==78181== I1  misses:       10,094,191
==78181== L2i misses:           24,926
==78181== I1  miss rate:          0.29%
==78181== L2i miss rate:          0.00%
==78181== 
==78181== D   refs:      1,785,796,320  (1,091,143,782 rd   + 694,652,538 wr)
==78181== D1  misses:        4,729,886  (    3,431,171 rd   +   1,298,715 wr)
==78181== L2d misses:        1,383,531  (      562,321 rd   +     821,210 wr)
==78181== D1  miss rate:           0.2% (          0.3%     +         0.1%  )
==78181== L2d miss rate:           0.0% (          0.0%     +         0.1%  )
==78181== 
==78181== L2 refs:          14,824,077  (   13,525,362 rd   +   1,298,715 wr)
==78181== L2 misses:         1,408,457  (      587,247 rd   +     821,210 wr)
==78181== L2 miss rate:            0.0% (          0.0%     +         0.1%  )

The third cachegrind is with the patch and JSScope increased from 40 to 48 bytes for better cache alignment.

==78402== I   refs:      3,467,989,687
==78402== I1  misses:       10,039,708
==78402== L2i misses:           24,955
==78402== I1  miss rate:          0.28%
==78402== L2i miss rate:          0.00%
==78402== 
==78402== D   refs:      1,781,801,386  (1,089,024,301 rd   + 692,777,085 wr)
==78402== D1  misses:        4,733,289  (    3,436,530 rd   +   1,296,759 wr)
==78402== L2d misses:        1,383,874  (      561,900 rd   +     821,974 wr)
==78402== D1  miss rate:           0.2% (          0.3%     +         0.1%  )
==78402== L2d miss rate:           0.0% (          0.0%     +         0.1%  )
==78402== 
==78402== L2 refs:          14,772,997  (   13,476,238 rd   +   1,296,759 wr)
==78402== L2 misses:         1,408,829  (      586,855 rd   +     821,974 wr)
==78402== L2 miss rate:            0.0% (          0.0%     +         0.1%  )
I reduce instruction misses a bit. I incur roughly 20,000 more D1 misses, but I doubt that adds up to 3ms. I also incur roughly 10,000 fewer L2 misses, so it should even out either way.
(In reply to comment #14)
Looks like you're pretty much at the noise floor with these numbers.

No mispredict info?  It is occasionally significant -- use --branch-sim=yes.

I'll poke at getting lock-prefix counting working in cachegrind tomorrow.
Ok, I measured again with branch prediction emulation:

WITHOUT:

==78801== I   refs:      3,527,838,694
==78801== I1  misses:       10,132,028
==78801== L2i misses:           24,982
==78801== I1  miss rate:          0.28%
==78801== L2i miss rate:          0.00%
==78801== 
==78801== D   refs:      1,812,992,896  (1,108,678,395 rd   + 704,314,501 wr)
==78801== D1  misses:        5,112,744  (    3,798,345 rd   +   1,314,399 wr)
==78801== L2d misses:        1,400,396  (      571,182 rd   +     829,214 wr)
==78801== D1  miss rate:           0.2% (          0.3%     +         0.1%  )
==78801== L2d miss rate:           0.0% (          0.0%     +         0.1%  )
==78801== 
==78801== L2 refs:          15,244,772  (   13,930,373 rd   +   1,314,399 wr)
==78801== L2 misses:         1,425,378  (      596,164 rd   +     829,214 wr)
==78801== L2 miss rate:            0.0% (          0.0%     +         0.1%  )
==78801== 
==78801== Branches:        425,488,216  (  413,654,797 cond +  11,833,419 ind)
==78801== Mispredicts:      19,714,845  (   14,325,387 cond +   5,389,458 ind)
==78801== Mispred rate:            4.6% (          3.4%     +        45.5%   )

WITH patch + 48 byte JSScope objects:

==79062== I   refs:      3,442,961,046
==79062== I1  misses:       12,457,269
==79062== L2i misses:           24,711
==79062== I1  miss rate:          0.36%
==79062== L2i miss rate:          0.00%
==79062== 
==79062== D   refs:      1,770,936,838  (1,082,251,561 rd   + 688,685,277 wr)
==79062== D1  misses:        4,892,878  (    3,562,977 rd   +   1,329,901 wr)
==79062== L2d misses:        1,385,574  (      556,323 rd   +     829,251 wr)
==79062== D1  miss rate:           0.2% (          0.3%     +         0.1%  )
==79062== L2d miss rate:           0.0% (          0.0%     +         0.1%  )
==79062== 
==79062== L2 refs:          17,350,147  (   16,020,246 rd   +   1,329,901 wr)
==79062== L2 misses:         1,410,285  (      581,034 rd   +     829,251 wr)
==79062== L2 miss rate:            0.0% (          0.0%     +         0.1%  )
==79062== 
==79062== Branches:        418,439,373  (  406,864,626 cond +  11,574,747 ind)
==79062== Mispredicts:      18,913,171  (   13,968,106 cond +   4,945,065 ind)
==79062== Mispred rate:            4.5% (          3.4%     +        42.7%   )
I get noticeably more L2 refs, but fewer L2 misses in the end. Everything else improves, yet we are 3ms slower. Weird.
Use Shark to check for actual L2 cache misses.
(In reply to comment #19)
> Use shark to check for actual L2 cache misses

And time in the kernel, if not system call counts and latencies -- comment 13 was ignored, or else I missed the results (?).

/be
#20: patience, patience, working on it. I already bumped the heap size to 128MB on the other candidate used for comparison, but there is no guarantee the extra memory is used. I will try to measure that too in a sec.
Does this bug depend on bug 506174 then? Still want that missing time accounted for -- my money's on our way-too-many mmap calls.

/be
Depends on: 503080
Depends on: 506125
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE