make mark bits persistent until next GC

RESOLVED WONTFIX

Status


Core
JavaScript Engine
RESOLVED WONTFIX
8 years ago
8 years ago

People

(Reporter: gal, Assigned: gal)

Tracking

Trunk
x86
Mac OS X
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(5 obsolete attachments)

(Assignee)

Description

8 years ago
This patch does a couple things:

- Introduce a per-arena flag that indicates whether any bits are set in the mark bitmap or not.
- This allows us to "clear" the mark bitmap with a single `a->info.marked = false` assignment, speeding up finalization. The bitmap itself also becomes persistent until the next GC run.
- To make room for the extra bit I threw out the fine-grained re-marking logic for MarkDelayedChildren. That path is never hit in our embedding, and causes all sorts of pain and code complexity. It still works, it just goes from very slow to super slow.
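The per-arena flag scheme described above can be sketched as follows. This is a minimal illustration, not the actual SpiderMonkey layout — `ArenaInfo`, `MarkThing`, `IsMarked`, and `ClearMarks` are hypothetical names chosen for clarity, and the lazy-reset-on-first-mark detail is one plausible way to make the stale bits harmless:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: each arena keeps a mark bitmap plus one "marked"
// flag. Setting the first mark bit in a GC cycle lazily resets the stale
// bitmap; "clearing" all marks afterwards is a single one-bit store.
struct ArenaInfo {
    uint64_t markBitmap[4];  // one bit per GC thing in the arena
    bool     marked;         // true iff any bit in markBitmap is valid
};

inline void MarkThing(ArenaInfo &a, size_t index) {
    if (!a.marked) {
        // First mark this cycle: the old bits are stale, reset them.
        std::memset(a.markBitmap, 0, sizeof a.markBitmap);
        a.marked = true;
    }
    a.markBitmap[index / 64] |= uint64_t(1) << (index % 64);
}

inline bool IsMarked(const ArenaInfo &a, size_t index) {
    return a.marked &&
           (a.markBitmap[index / 64] & (uint64_t(1) << (index % 64))) != 0;
}

// After finalization, "clearing" the whole bitmap is one store, and the
// bits themselves persist until the next GC marks the arena again.
inline void ClearMarks(ArenaInfo &a) { a.marked = false; }
```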
(Assignee)

Comment 1

8 years ago
Created attachment 430488 [details] [diff] [review]
patch
Assignee: general → gal
(Assignee)

Updated

8 years ago
Blocks: 549806
(Assignee)

Updated

8 years ago
Attachment #430488 - Flags: review?(igor)

Comment 2

8 years ago
(In reply to comment #0)
> This patch does a couple things:
> 
> - Introduce a per-arena flag that indicates whether any bits are set in the
> mark bitmap or not.

Have you measured the impact of this on the marking phase? When I added the flag for doubles it showed a noticeable slowdown in marking. Still, since the vast majority of doubles are garbage, it was an overall win. For objects and even strings the situation is not the same, and the overhead of the extra check could outweigh the savings. In addition, I have a patch in the works to inline the marking, so the fast inlined path would check if the thing is marked and skip the call if so. The extra check for a marked arena (which could be branch-mispredicted) would surely harm that.

Comment 3

8 years ago
(In reply to comment #0)
> - To make room for the extra bit I threw out the fine-grained re-marking logic
> for MarkDelayedChildren. That path is never hit in our embedding,

This is simply not true. IIRC the original reports about stack overflow during the GC came from web developers who accidentally created infinite lists or similar structures. By no means do I imply that this is a frequent scenario, but we should not forget about that case.

Now, that aside, the last patch for the bug 516832 uses the MarkDelayedChildren and friends to delay marking of objects for suspended threads.
(Assignee)

Comment 4

8 years ago
Oh, absolutely, overly deep lists are definitely a possible DoS scenario we have to handle, but it's not something where efficiency matters enough to justify the code complexity and storage needs of the extra delayed-marking bitmap.
(Assignee)

Comment 5

8 years ago
I will split out the delayed marking part into a separate patch and benchmark it.
(Assignee)

Updated

8 years ago
Attachment #430488 - Attachment is obsolete: true
Attachment #430488 - Flags: review?(igor)
(Assignee)

Updated

8 years ago
Depends on: 550413
(Assignee)

Comment 6

8 years ago
Created attachment 430595 [details] [diff] [review]
patch

Turns out Igor was right. The original patch showed some measurable slowdowns on the attached benchmark. I whipped out Shark to see why, but I wasn't really able to see anything in the code I had just touched because the rest of the profile was screaming bloody murder. For some GC loads, 50% of the time is spent in the prolog and epilog of js_CallGCMarker. Marking large arrays is particularly bad. I rearranged the code a bit and added a MarkValues() function, which inlines CallGCMarker into the loop.

Shark next pointed to the structure of CallGCMarker and MarkValueIfGCThing. Both have the same diamond-like (<>) structure we used to have w.r.t. thingSize: first we lose precision by going from the jsval type to a kind, and then we have to dispatch again over that kind. I pulled the code apart and refactored everything using MarkObject, MarkString, MarkDouble, MarkAtom, etc. I also added MarkObjects() and such, to directly benefit from inlining and specialization. This also makes the code more readable IMHO, not just faster.

With this stuff out of the way I could pretty easily pinpoint the cause of the slowdown: gcc generated terrible code for two subsequent accesses to a->info.privateMarkBitmap. Pulling that into a local variable made it go away.
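The specialization the comment describes can be illustrated in miniature. This is not the real jsgc.cpp code — `Tracer`, `ThingKind`, and the Mark* helpers here are toy stand-ins — but it shows the shape of the refactoring: type-specific entry points keep the thing kind known statically (no jsval-to-kind-to-dispatch diamond), and array variants inline the per-element call into the loop instead of paying a function-call prolog/epilog per element:

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-ins for the refactored marking API.
enum ThingKind { KIND_OBJECT, KIND_STRING, KIND_DOUBLE };

struct Tracer { size_t marked[3]; };

// Specialized, inlinable paths: the kind is a compile-time fact, so no
// runtime re-dispatch over kind is needed inside the call.
inline void MarkObject(Tracer &trc) { ++trc.marked[KIND_OBJECT]; }
inline void MarkString(Tracer &trc) { ++trc.marked[KIND_STRING]; }
inline void MarkDouble(Tracer &trc) { ++trc.marked[KIND_DOUBLE]; }

// Array variant in the spirit of MarkValues()/MarkObjects(): the marker
// call is inlined into the loop body rather than paying a generic
// CallGCMarker-style prolog/epilog once per element.
inline void MarkDoubles(Tracer &trc, size_t n) {
    for (size_t i = 0; i != n; ++i)
        MarkDouble(trc);
}
```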

Some statistics:

Without the patch:

objects: 29330ms
doubles: 6342ms
strings: 9488ms
mix: 9614ms

With the patch:

objects: 29360ms
doubles: 3996ms
strings: 8394ms
mix: 8464ms

Yes, ladies and gentlemen, that's a 40% or so speedup in garbage collection if the heap contains a ton of doubles. Seriously, we have to Shark the GC a bit. I never did that back then; I was focusing on the allocator path. The profile still looks like a red carpet; I only took care of the worst offenders.
(Assignee)

Comment 7

8 years ago
Comment on attachment 430595 [details] [diff] [review]
patch

Igor, this touches a ton of stuff and changes some invariants (i.e. CallValueTracerIfGCThing). Please be extra picky when reviewing.
Attachment #430595 - Flags: review?(igor)
(Assignee)

Comment 8

8 years ago
I would also like to try clearing all bitmaps immediately prior to GC instead of using the indirection. I'll do that next.
(Assignee)

Comment 9

8 years ago
Testcase:

for (var i = 0; i < 4; ++i) {
    var a = [];
    for (n = 0; n < 512 * 1024; ++n) {
        var j = (i == 3) ? Math.floor(Math.random() * 4) : i;
        a.push((i == 0) ? ({}) : ((i == 1) ? (n + 0.5) : ("x" + i)));
    }
    var t = Date.now();
    for (n = 0; n < 1000; ++n)
        gc();
    print(Date.now() - t);
}

Comment 10

8 years ago
Just as a general note, we hope to stop heap-allocating doubles at some point this spring, so an optimization that helps with GCing doubles will probably only be useful for so long.
(Assignee)

Comment 11

8 years ago
Yeah, I won't be sad if that happens. It's a 10% speedup overall, too; it just happened to be particularly inefficient for doubles before.
(Assignee)

Comment 12

8 years ago
Created attachment 430707 [details] [diff] [review]
patch

A bit more fine tuning with shark. This brings down the score for the object part of the test case to 28186.
Attachment #430595 - Attachment is obsolete: true
Attachment #430595 - Flags: review?(igor)
(Assignee)

Updated

8 years ago
Attachment #430707 - Flags: review?
(Assignee)

Comment 13

8 years ago
Created attachment 430772 [details] [diff] [review]
patch

This squeezes another couple of percent out of GC performance by removing calls to the generalized macro paths (CALL_*_TRACER) and replacing them with their equivalent specialized implementations. This is mostly mechanical renaming on top of the previous patch; sorry for the size. I promise I am done now :)
Attachment #430707 - Attachment is obsolete: true
Attachment #430707 - Flags: review?
(Assignee)

Updated

8 years ago
Attachment #430772 - Flags: review?(igor)
(Assignee)

Comment 14

8 years ago
Created attachment 430783 [details] [diff] [review]
patch

Fixed a bogus assert.
Attachment #430772 - Attachment is obsolete: true
Attachment #430783 - Flags: review?(igor)
Attachment #430772 - Flags: review?(igor)

Comment 15

8 years ago
I will review the patch in detail over Sunday (I am traveling back to Europe now).

Comment 16

8 years ago
Igor: review ping?

/be

Comment 17

8 years ago
Comment on attachment 430783 [details] [diff] [review]
patch

Sorry for forgetting to review this in time...

>diff --git a/js/src/jsbit.h b/js/src/jsbit.h
>--- a/js/src/jsbit.h
>+++ b/js/src/jsbit.h
>@@ -55,16 +55,28 @@ typedef jsbitmap_t  jsbitmap;       /* J
> 
> #define JS_TEST_BIT(_map,_bit)  ((_map)[(_bit)>>JS_BITS_PER_WORD_LOG2] &      \
>                                  ((jsbitmap)1<<((_bit)&(JS_BITS_PER_WORD-1))))
> #define JS_SET_BIT(_map,_bit)   ((_map)[(_bit)>>JS_BITS_PER_WORD_LOG2] |=     \
>                                  ((jsbitmap)1<<((_bit)&(JS_BITS_PER_WORD-1))))
> #define JS_CLEAR_BIT(_map,_bit) ((_map)[(_bit)>>JS_BITS_PER_WORD_LOG2] &=     \
>                                  ~((jsbitmap)1<<((_bit)&(JS_BITS_PER_WORD-1))))
> 
>+static inline bool
>+JS_TEST_AND_SET_BIT(jsbitmap *map, size_t bit)
>+{

Nit: if some C code still includes the header, static inline should become static JS_INLINE. If this is C++ only, drop the inline.
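The diff above cuts off before the body of JS_TEST_AND_SET_BIT. A plausible implementation, consistent with the JS_TEST_BIT/JS_SET_BIT macros quoted above (word-size constants here are illustrative stand-ins for jsbit.h's JS_BITS_PER_WORD machinery), would return the previous state of the bit and then set it:

```cpp
#include <cassert>
#include <cstddef>

typedef unsigned long jsbitmap;
const size_t JS_BITS_PER_WORD = sizeof(jsbitmap) * 8;

// Sketch of the truncated function: test the bit, set it, and report
// whether it was already set — the combined form of JS_TEST_BIT followed
// by JS_SET_BIT.
static inline bool
JS_TEST_AND_SET_BIT(jsbitmap *map, size_t bit)
{
    jsbitmap mask = (jsbitmap)1 << (bit & (JS_BITS_PER_WORD - 1));
    jsbitmap *word = &map[bit / JS_BITS_PER_WORD];
    bool wasSet = (*word & mask) != 0;
    *word |= mask;
    return wasSet;
}
```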

>diff --git a/js/src/jsdbgapi.cpp b/js/src/jsdbgapi.cpp
>--- a/js/src/jsdbgapi.cpp
>+++ b/js/src/jsdbgapi.cpp
>@@ -289,20 +289,17 @@ JS_ClearAllTraps(JSContext *cx)
> void
> js_MarkTraps(JSTracer *trc)
> {
>     JSRuntime *rt = trc->context->runtime;
> 
>     for (JSTrap *trap = (JSTrap *) rt->trapList.next;
>          &trap->links != &rt->trapList;
>          trap = (JSTrap *) trap->links.next) {
>-        if (trap->closure) {
>-            JS_SET_TRACING_NAME(trc, "trap->closure");
>-            js_CallValueTracerIfGCThing(trc, (jsval) trap->closure);
>-        }
>+        js_MarkValue(trc, (jsval) trap->closure, "trap->closure");

That should be js_MarkGCThing to preserve the js_CallValueTracerIfGCThing semantics, as the type of trap->closure is not known and theoretically a caller can stuff a JSString there. Moreover, it could even be possible for a JSString to be stuffed in as a jsval, however crazy that is. If we do not want to support that, let's do it in a separate bug. As such, I suggest keeping the current js_CallValueTracerIfGCThing semantics as-is, with special treatment of JSVAL_OBJECT values. In particular, js_MarkGCThing is not a 100% compatible replacement, as it does not allow for tagged GC things.

>@@ -501,21 +498,19 @@ js_TraceWatchPoints(JSTracer *trc, JSObj
>-            JS_SET_TRACING_NAME(trc, "wp->closure");
>-            js_CallValueTracerIfGCThing(trc, OBJECT_TO_JSVAL(wp->closure));
>+            js_MarkObject(trc, wp->closure, "wp->closure");

The same comment here: keep the semantics of js_CallValueTracerIfGCThing.

>diff --git a/js/src/jsgc.cpp b/js/src/jsgc.cpp
>--- a/js/src/jsgc.cpp
>+++ b/js/src/jsgc.cpp
>@@ -83,16 +83,17 @@
> #include "jsxml.h"
> #endif
> 
> #ifdef INCLUDE_MOZILLA_DTRACE
> #include "jsdtracef.h"
> #endif
> 
> #include "jsobjinlines.h"
>+#include "jsscopeinlines.h"
> 
> /*
>  * Include the headers for mmap.
>  */
> #if defined(XP_WIN)
> # include <windows.h>
> #endif
> #if defined(XP_UNIX) || defined(XP_BEOS)
>@@ -272,25 +273,28 @@ struct JSGCArenaInfo {
>      */
>     jsuword         arenaIndex :        GC_ARENA_SHIFT - 1;
> 
>     /* Flag indicating if the arena is the first in the chunk. */
>     jsuword         firstArena :        1;
> 
>     JSGCThing       *freeList;
> 
>-    union {
>-        /* See comments before DelayMarkingChildren. */
>-        JSGCArena   *prevUnmarked;
>+    /*
>+     * Pointer to the mark bitmap, which is initially the empty bitmap, or
>+     * once we set the first bit, the private bitmap of the arena.
>+     */
>+    jsbitmap        *markBitmap;

Let's separate the awesome inline changes from this change that flags arenas with live things into a separate bug, so we have a clear picture of the performance impact of each. Also, the markBitmap here means that an extra cache line per arena will be read for any arena with marked things. If an arena contains just a few live things, this can be visible.

>+JS_STATIC_ASSERT(!(sizeof(JSGCArenaInfo) % sizeof(double)));

Nit: comment here why the assert is necessary.

>@@ -884,16 +875,18 @@ struct JSGCLockHashEntry : public JSDHas
> JSBool
> js_InitGC(JSRuntime *rt, uint32 maxbytes)
> {
>+    memset(emptyMarkBitmap, 0, sizeof emptyMarkBitmap);

This should not be necessary as emptyMarkBitmap should be const and zero-initialized.

> JS_PUBLIC_API(void)
> JS_TraceChildren(JSTracer *trc, void *thing, uint32 kind)
> {
>-        obj->map->ops->trace(trc, obj);
>+        TraceChildren(trc, (JSObject *)thing);

Nit: space after the cast operator.

> #if JS_HAS_XML_SUPPORT
>       case JSTRACE_XML:
>-        js_TraceXML(trc, (JSXML *)thing);
>+        TraceChildren(trc, (JSXML *)thing);

Pre-exisitng nit: space after the cast.

>+class ClearTracingInfoOnExit {
>+#ifdef DEBUG
...
>+
>+void
>+js_MarkDouble(JSTracer *trc, jsdouble *dp)
>+{
>+    ClearTracingInfoOnExit _(trc);
>+

Nit: _(trc) looks ugly. IMO having a macro that expands into ClearTracingInfoOnExit _(trc) or into ((void) 0) would look better.

>+void
>+js_MarkString(JSTracer *trc, JSString *str)
>+{
>+    ClearTracingInfoOnExit _(trc);
>+
>+    if (JS_UNLIKELY(HandleNonMarkingTracer(trc, str)))
>+        return;

Nit: JS_UNLIKELY should go into HandleNonMarkingTracer

>+template <typename T>
>+void
>+js_MarkParent(JSTracer *trc, T *thing)

Nit: the name does not sound good. js_MarkObjectOrXml looks more precise.

>+void
>+js_MarkXMLs(JSTracer *trc, JSXML **array, size_t n)
>+{

#if JS_HAS_XML_SUPPORT before the function is missing.
Attachment #430783 - Flags: review?(igor) → review-

Comment 18

8 years ago
I have looked at the performance benefits of hasMarkedDoubles and its generalization proposed by the patch. For that I measured, via rdtsc(), the number of cycles spent in the double-sweeping loop in js_GC():

    while (JSGCArena *a = *ap) {
        if (!a->info.hasMarkedDoubles) {
            /* No marked double values in the arena. */
            *ap = a->info.prev;
            a->info.prev = *emptyArenas;
            *emptyArenas = a;
        } else {
            a->info.hasMarkedDoubles = false;
            ap = &a->info.prev;
        }
    }

and considered a benchmark like:

function test() {
    var N = 4e6;
    var array = [];
    for (var i = 0; i != N; ++i) {
        array.push(i + 0.1);
    }
    array = null;
    gc();
}

test();

If I remove hasMarkedDoubles and replace the test (!a->info.hasMarkedDoubles) with:

        uint64 *array = a->markBitmap;
        uint64 has1 = array[0] | array[1] | array[2] | array[3];
        uint64 has2 = array[4] | array[5] | array[6] | array[7];
        if (!(has1 | has2)) {
 
Then the number of cycles increases by 20% on a 1.6 GHz Atom CPU and by 10% on a 2.6 GHz 4-core Xeon. The reason for the increase is that dereferencing the bitmap reads an extra cache line, while accessing hasMarkedDoubles touches just the JSGCArenaInfo.

On the other hand, during marking the situation is the opposite: there, accessing hasMarkedDoubles means reading the cache line with JSGCArenaInfo, which slows things down.

Now, given these stats, I also realized that just the following loop:

    while ((a = *ap) != NULL) {
        ap = &a->info.prev;
    }

takes about 250 CPU cycles per iteration. This is the cost of TLB and L2 misses, and it should be fixed. The simplest way to do that is to follow jemalloc and use aligned big chunks. I will file a bug about that.
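The jemalloc-style trick Igor alludes to can be sketched as follows. The constants here are illustrative (not the actual GC chunk size): if every chunk is allocated at a CHUNK_SIZE-aligned address, the chunk base for any GC thing can be computed by masking its pointer, so walking arenas need not chase `a->info.prev` pointers through scattered cache lines and TLB entries:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants: 1 MiB chunks aligned on a 1 MiB boundary.
const uintptr_t CHUNK_SHIFT = 20;
const uintptr_t CHUNK_SIZE  = uintptr_t(1) << CHUNK_SHIFT;
const uintptr_t CHUNK_MASK  = CHUNK_SIZE - 1;

// With aligned chunks, per-chunk metadata lives at a fixed offset from the
// masked pointer — a pure arithmetic lookup, no pointer chasing.
inline uintptr_t ChunkBase(const void *thing) {
    return reinterpret_cast<uintptr_t>(thing) & ~CHUNK_MASK;
}

inline uintptr_t OffsetInChunk(const void *thing) {
    return reinterpret_cast<uintptr_t>(thing) & CHUNK_MASK;
}
```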
(Assignee)

Updated

8 years ago
Attachment #430783 - Attachment is obsolete: true
(Assignee)

Updated

8 years ago
Summary: make mark bits persistent until next GC, simplify delayed marking → make mark bits persistent until next GC
(Assignee)

Comment 19

8 years ago
Actually, I am just going to invalidate this bug. It's badly out of focus.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → WONTFIX