make mark bits persistent until next GC

RESOLVED WONTFIX

Status


Core
JavaScript Engine
RESOLVED WONTFIX
8 years ago
8 years ago

People

(Reporter: gal, Assigned: gal)

Tracking

Trunk
x86
Mac OS X
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(5 obsolete attachments)

(Assignee)

Description

8 years ago
This patch does a couple things:

- Introduce a per-arena flag that indicates whether any bits are set in the mark bitmap or not.
- This allows us to "clear" the mark bitmap with a single `a->info.marked = false` assignment, speeding up finalization. The bitmap itself also becomes persistent until the next GC run.
- To make room for the extra bit I threw out the fine-grained re-marking logic for MarkDelayedChildren. That path is never hit in our embedding, and causes all sorts of pain and code complexity. It still works, it just goes from very slow to super slow.
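The per-arena flag scheme described above can be sketched as follows. This is a minimal illustration, not the actual SpiderMonkey layout — `ArenaInfo`, `MarkThing`, `IsMarked`, and `ClearMarks` are hypothetical names chosen for clarity, and the lazy-reset-on-first-mark detail is one plausible way to make the stale bits harmless:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: each arena keeps a mark bitmap plus one "marked"
// flag. Setting the first mark bit in a GC cycle lazily resets the stale
// bitmap; "clearing" all marks afterwards is a single one-bit store.
struct ArenaInfo {
    uint64_t markBitmap[4];  // one bit per GC thing in the arena
    bool     marked;         // true iff any bit in markBitmap is valid
};

inline void MarkThing(ArenaInfo &a, size_t index) {
    if (!a.marked) {
        // First mark this cycle: the old bits are stale, reset them.
        std::memset(a.markBitmap, 0, sizeof a.markBitmap);
        a.marked = true;
    }
    a.markBitmap[index / 64] |= uint64_t(1) << (index % 64);
}

inline bool IsMarked(const ArenaInfo &a, size_t index) {
    return a.marked &&
           (a.markBitmap[index / 64] & (uint64_t(1) << (index % 64))) != 0;
}

// After finalization, "clearing" the whole bitmap is one store, and the
// bits themselves persist until the next GC marks the arena again.
inline void ClearMarks(ArenaInfo &a) { a.marked = false; }
```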
(Assignee)

Comment 1

8 years ago
Created attachment 430488 [details] [diff] [review]
patch
Assignee: general → gal
(Assignee)

Updated

8 years ago
Blocks: 549806
(Assignee)

Updated

8 years ago
Attachment #430488 - Flags: review?(igor)

Comment 2

8 years ago
(In reply to comment #0)
> This patch does a couple things:
> 
> - Introduce a per-arena flag that indicates whether any bits are set in the
> mark bitmap or not.

Have you measured the impact of this on the marking phase? When I added the flag for doubles it showed a noticeable slowdown in marking. Still, since the vast majority of doubles are garbage, it was an overall win. For objects and even strings the situation is not the same, and the overhead of the extra check could outweigh the savings. In addition, I have a patch in the works to inline the marking, so the fast inlined path would check if the thing is marked and skip the call if so. The extra check for a marked arena (which could be branch-mispredicted) would surely harm that.

Comment 3

8 years ago
(In reply to comment #0)
> - To make room for the extra bit I threw out the fine-grained re-marking logic
> for MarkDelayedChildren. That path is never hit in our embedding,

This is simply not true. IIRC the original reports about stack overflow during the GC came from web developers who accidentally created infinite lists or similar structures. By no means do I imply that this is a frequent scenario, but we should not forget about that case.

Now, that aside, the last patch for the bug 516832 uses the MarkDelayedChildren and friends to delay marking of objects for suspended threads.
(Assignee)

Comment 4

8 years ago
Oh, absolutely, overly deep lists are definitely a possible DoS scenario we have to handle, but it's not something where efficiency matters enough to justify the code complexity and storage needs of the extra delayed-marking bitmap.
(Assignee)

Comment 5

8 years ago
I will split out the delayed marking part into a separate patch and benchmark it.
(Assignee)

Updated

8 years ago
Attachment #430488 - Attachment is obsolete: true
Attachment #430488 - Flags: review?(igor)
(Assignee)

Updated

8 years ago
Depends on: 550413
(Assignee)

Comment 6

8 years ago
Created attachment 430595 [details] [diff] [review]
patch

Turns out Igor was right. The original patch showed some measurable slowdowns on the attached benchmark. I whipped out Shark to see why, but I wasn't really able to see anything in the code I had just touched because the rest of the profile was screaming bloody murder. For some GC loads, 50% of the time is spent in the prolog and epilog of js_CallGCMarker. Marking large arrays is particularly bad. I rearranged the code a bit and added a MarkValues() function, which inlines CallGCMarker into the loop.

Shark next pointed to the structure of CallGCMarker and MarkValueIfGCThing. Both have the same diamond-like (<>) structure we used to have w.r.t. thingSize: first we lose precision by going from the jsval type to a kind, and then we have to dispatch again over that kind. I pulled the code apart and refactored everything using MarkObject, MarkString, MarkDouble, MarkAtom, etc. I also added MarkObjects() and such, to directly benefit from inlining and specialization. This also makes the code more readable IMHO, not just faster.

With this stuff out of the way I could pretty easily pinpoint the cause of the slowdown: gcc generated terrible code for two subsequent accesses to a->info.privateMarkBitmap. Pulling that into a local variable made it go away.
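The specialization the comment describes can be illustrated in miniature. This is not the real jsgc.cpp code — `Tracer`, `ThingKind`, and the Mark* helpers here are toy stand-ins — but it shows the shape of the refactoring: type-specific entry points keep the thing kind known statically (no jsval-to-kind-to-dispatch diamond), and array variants inline the per-element call into the loop instead of paying a function-call prolog/epilog per element:

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-ins for the refactored marking API.
enum ThingKind { KIND_OBJECT, KIND_STRING, KIND_DOUBLE };

struct Tracer { size_t marked[3]; };

// Specialized, inlinable paths: the kind is a compile-time fact, so no
// runtime re-dispatch over kind is needed inside the call.
inline void MarkObject(Tracer &trc) { ++trc.marked[KIND_OBJECT]; }
inline void MarkString(Tracer &trc) { ++trc.marked[KIND_STRING]; }
inline void MarkDouble(Tracer &trc) { ++trc.marked[KIND_DOUBLE]; }

// Array variant in the spirit of MarkValues()/MarkObjects(): the marker
// call is inlined into the loop body rather than paying a generic
// CallGCMarker-style prolog/epilog once per element.
inline void MarkDoubles(Tracer &trc, size_t n) {
    for (size_t i = 0; i != n; ++i)
        MarkDouble(trc);
}
```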

Some statistics:

Without the patch:

objects: 29330ms
doubles: 6342ms
strings: 9488ms
mix: 9614ms

With the patch:

objects: 29360ms
doubles: 3996ms
strings: 8394ms
mix: 8464ms

Yes, ladies and gentlemen, that's a 40% or so speedup in garbage collection if the heap contains a ton of doubles. Seriously, we have to Shark the GC a bit. I never did that back then; I was focusing on the allocator path. The profile still looks like a red carpet; I only took care of the worst offenders.
(Assignee)

Comment 7

8 years ago
Comment on attachment 430595 [details] [diff] [review]
patch

Igor, this touches a ton of stuff and changes some invariants (i.e. CallValueTracerIfGCThing). Please be extra picky when reviewing.
Attachment #430595 - Flags: review?(igor)
(Assignee)

Comment 8

8 years ago
I would also like to try clearing all bitmaps immediately prior to GC instead of using the indirection. I'll do that next.
(Assignee)

Comment 9

8 years ago
Testcase:

for (var i = 0; i < 4; ++i) {
    var a = [];
    for (n = 0; n < 512 * 1024; ++n) {
        var j = (i == 3) ? Math.floor(Math.random() * 4) : i;
        a.push((i == 0) ? ({}) : ((i == 1) ? (n + 0.5) : ("x" + i)));
    }
    var t = Date.now();
    for (n = 0; n < 1000; ++n)
        gc();
    print(Date.now() - t);
}

Comment 10

8 years ago
Just as a general note, we hope to stop heap-allocating doubles at some point this spring, so an optimization that helps with GCing doubles will probably only be useful for so long.
(Assignee)

Comment 11

8 years ago
Yeah, I won't be sad if that happens. It's a 10% speedup overall, too; it just happened to be particularly inefficient for doubles before.
(Assignee)

Comment 12

8 years ago
Created attachment 430707 [details] [diff] [review]
patch

A bit more fine tuning with shark. This brings down the score for the object part of the test case to 28186.
Attachment #430595 - Attachment is obsolete: true
Attachment #430595 - Flags: review?(igor)
(Assignee)

Updated

8 years ago
Attachment #430707 - Flags: review?
(Assignee)

Comment 13

8 years ago
Created attachment 430772 [details] [diff] [review]
patch

This squeezes another couple of percent out of GC performance by removing calls to the generalized macro paths (CALL_*_TRACER) and replacing them with their equivalent specialized implementations. This is mostly mechanical renaming on top of the previous patch; sorry for the size. I promise I am done now :)
Attachment #430707 - Attachment is obsolete: true
Attachment #430707 - Flags: review?
(Assignee)

Updated

8 years ago
Attachment #430772 - Flags: review?(igor)
(Assignee)

Comment 14

8 years ago
Created attachment 430783 [details] [diff] [review]
patch

Fixed a bogus assert.
Attachment #430772 - Attachment is obsolete: true
Attachment #430783 - Flags: review?(igor)
Attachment #430772 - Flags: review?(igor)

Comment 15

8 years ago
I will review the patch in detail over Sunday (I am traveling back to Europe now).

Comment 16

8 years ago
Igor: review ping?

/be

Comment 17

8 years ago
Comment on attachment 430783 [details] [diff] [review]
patch

Sorry for forgetting to review this in time...

>diff --git a/js/src/jsbit.h b/js/src/jsbit.h
>--- a/js/src/jsbit.h
>+++ b/js/src/jsbit.h
>@@ -55,16 +55,28 @@ typedef jsbitmap_t  jsbitmap;       /* J
> 
> #define JS_TEST_BIT(_map,_bit)  ((_map)[(_bit)>>JS_BITS_PER_WORD_LOG2] &      \
>                                  ((jsbitmap)1<<((_bit)&(JS_BITS_PER_WORD-1))))
> #define JS_SET_BIT(_map,_bit)   ((_map)[(_bit)>>JS_BITS_PER_WORD_LOG2] |=     \
>                                  ((jsbitmap)1<<((_bit)&(JS_BITS_PER_WORD-1))))
> #define JS_CLEAR_BIT(_map,_bit) ((_map)[(_bit)>>JS_BITS_PER_WORD_LOG2] &=     \
>                                  ~((jsbitmap)1<<((_bit)&(JS_BITS_PER_WORD-1))))
> 
>+static inline bool
>+JS_TEST_AND_SET_BIT(jsbitmap *map, size_t bit)
>+{

Nit: if some C code still includes the header, static inline should become static JS_INLINE. If this is C++ only, drop the inline.
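The diff above cuts off before the body of JS_TEST_AND_SET_BIT. A plausible implementation, consistent with the JS_TEST_BIT/JS_SET_BIT macros quoted above (word-size constants here are illustrative stand-ins for jsbit.h's JS_BITS_PER_WORD machinery), would return the previous state of the bit and then set it:

```cpp
#include <cassert>
#include <cstddef>

typedef unsigned long jsbitmap;
const size_t JS_BITS_PER_WORD = sizeof(jsbitmap) * 8;

// Sketch of the truncated function: test the bit, set it, and report
// whether it was already set — the combined form of JS_TEST_BIT followed
// by JS_SET_BIT.
static inline bool
JS_TEST_AND_SET_BIT(jsbitmap *map, size_t bit)
{
    jsbitmap mask = (jsbitmap)1 << (bit & (JS_BITS_PER_WORD - 1));
    jsbitmap *word = &map[bit / JS_BITS_PER_WORD];
    bool wasSet = (*word & mask) != 0;
    *word |= mask;
    return wasSet;
}
```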

>diff --git a/js/src/jsdbgapi.cpp b/js/src/jsdbgapi.cpp
>--- a/js/src/jsdbgapi.cpp
>+++ b/js/src/jsdbgapi.cpp
>@@ -289,20 +289,17 @@ JS_ClearAllTraps(JSContext *cx)
> void
> js_MarkTraps(JSTracer *trc)
> {
>     JSRuntime *rt = trc->context->runtime;
> 
>     for (JSTrap *trap = (JSTrap *) rt->trapList.next;
>          &trap->links != &rt->trapList;
>          trap = (JSTrap *) trap->links.next) {
>-        if (trap->closure) {
>-            JS_SET_TRACING_NAME(trc, "trap->closure");
>-            js_CallValueTracerIfGCThing(trc, (jsval) trap->closure);
>-        }
>+        js_MarkValue(trc, (jsval) trap->closure, "trap->closure");

That should be js_MarkGCThing to preserve the js_CallValueTracerIfGCThing semantics, as the type of trap->closure is not known and theoretically a caller can stuff a JSString there. Moreover, it could even be possible for a JSString to be stuffed in as a jsval, however crazy that is. If we do not want to support that, let's do it in a separate bug. As such, I suggest keeping the current js_CallValueTracerIfGCThing semantics as-is, with special treatment of JSVAL_OBJECT values. In particular, js_MarkGCThing is not a 100% compatible replacement, as it does not allow for tagged GC things.

>@@ -501,21 +498,19 @@ js_TraceWatchPoints(JSTracer *trc, JSObj
>-            JS_SET_TRACING_NAME(trc, "wp->closure");
>-            js_CallValueTracerIfGCThing(trc, OBJECT_TO_JSVAL(wp->closure));
>+            js_MarkObject(trc, wp->closure, "wp->closure");

The same comment here: keep the semantics of js_CallValueTracerIfGCThing.

>diff --git a/js/src/jsgc.cpp b/js/src/jsgc.cpp
>--- a/js/src/jsgc.cpp
>+++ b/js/src/jsgc.cpp
>@@ -83,16 +83,17 @@
> #include "jsxml.h"
> #endif
> 
> #ifdef INCLUDE_MOZILLA_DTRACE
> #include "jsdtracef.h"
> #endif
> 
> #include "jsobjinlines.h"
>+#include "jsscopeinlines.h"
> 
> /*
>  * Include the headers for mmap.
>  */
> #if defined(XP_WIN)
> # include <windows.h>
> #endif
> #if defined(XP_UNIX) || defined(XP_BEOS)
>@@ -272,25 +273,28 @@ struct JSGCArenaInfo {
>      */
>     jsuword         arenaIndex :        GC_ARENA_SHIFT - 1;
> 
>     /* Flag indicating if the arena is the first in the chunk. */
>     jsuword         firstArena :        1;
> 
>     JSGCThing       *freeList;
> 
>-    union {
>-        /* See comments before DelayMarkingChildren. */
>-        JSGCArena   *prevUnmarked;
>+    /*
>+     * Pointer to the mark bitmap, which is initially the empty bitmap, or
>+     * once we set the first bit, the private bitmap of the arena.
>+     */
>+    jsbitmap        *markBitmap;

Let's separate the awesome inline changes from this change that flags arenas with live things into a separate bug, so we have a clear picture of the performance impact of each. Also, the markBitmap here means that an extra cache line per arena will be read for any arena with marked things. If an arena contains just a few live things, this can be visible.

>+JS_STATIC_ASSERT(!(sizeof(JSGCArenaInfo) % sizeof(double)));

Nit: comment here why the assert is necessary.

>@@ -884,16 +875,18 @@ struct JSGCLockHashEntry : public JSDHas
> JSBool
> js_InitGC(JSRuntime *rt, uint32 maxbytes)
> {
>+    memset(emptyMarkBitmap, 0, sizeof emptyMarkBitmap);

This should not be necessary as emptyMarkBitmap should be const and zero-initialized.

> JS_PUBLIC_API(void)
> JS_TraceChildren(JSTracer *trc, void *thing, uint32 kind)
> {
>-        obj->map->ops->trace(trc, obj);
>+        TraceChildren(trc, (JSObject *)thing);

Nit: space after the cast operator.

> #if JS_HAS_XML_SUPPORT
>       case JSTRACE_XML:
>-        js_TraceXML(trc, (JSXML *)thing);
>+        TraceChildren(trc, (JSXML *)thing);

Pre-exisitng nit: space after the cast.

>+class ClearTracingInfoOnExit {
>+#ifdef DEBUG
...
>+
>+void
>+js_MarkDouble(JSTracer *trc, jsdouble *dp)
>+{
>+    ClearTracingInfoOnExit _(trc);
>+

Nit: _(trc) looks ugly. IMO having a macro that expands into ClearTracingInfoOnExit _(trc) or into ((void) 0) would look better.

>+void
>+js_MarkString(JSTracer *trc, JSString *str)
>+{
>+    ClearTracingInfoOnExit _(trc);
>+
>+    if (JS_UNLIKELY(HandleNonMarkingTracer(trc, str)))
>+        return;

Nit: JS_UNLIKELY should go into HandleNonMarkingTracer

>+template <typename T>
>+void
>+js_MarkParent(JSTracer *trc, T *thing)

Nit: the name does not sound good. js_MarkObjectOrXml looks more precise.

>+void
>+js_MarkXMLs(JSTracer *trc, JSXML **array, size_t n)
>+{

#if JS_HAS_XML_SUPPORT before the function is missing.
Attachment #430783 - Flags: review?(igor) → review-

Comment 18

8 years ago
I have looked at the performance benefits of hasMarkedDoubles and its generalization proposed by the patch. For that I measured, via rdtsc(), the number of cycles spent in the double-sweeping loop in js_GC():

    while (JSGCArena *a = *ap) {
        if (!a->info.hasMarkedDoubles) {
            /* No marked double values in the arena. */
            *ap = a->info.prev;
            a->info.prev = *emptyArenas;
            *emptyArenas = a;
        } else {
            a->info.hasMarkedDoubles = false;
            ap = &a->info.prev;
        }
    }

and considered a benchmark like:

function test() {
    var N = 4e6;
    var array = [];
    for (var i = 0; i != N; ++i) {
        array.push(i + 0.1);
    }
    array = null;
    gc();
}

test();

If I remove hasMarkedDoubles and replace the test (!a->info.hasMarkedDoubles) with:

        uint64 *array = a->markBitmap;
        uint64 has1 = array[0] | array[1] | array[2] | array[3];
        uint64 has2 = array[4] | array[5] | array[6] | array[7];
        if (!(has1 | has2)) {
 
Then the number of cycles increases by 20% on a 1.6 GHz Atom CPU and by 10% on a 2.6 GHz 4-core Xeon. The reason for the increase is that dereferencing the bitmap reads an extra cache line, while accessing hasMarkedDoubles touches just the JSGCArenaInfo.

On the other hand, during marking the situation is the opposite: there, accessing hasMarkedDoubles means reading the cache line with JSGCArenaInfo, which slows things down.

Now, given these stats, I also realized that just the following loop:

    while ((a = *ap) != NULL) {
        ap = &a->info.prev;
    }

takes about 250 CPU cycles per iteration. This is the cost of TLB and L2 misses, and it should be fixed. The simplest way to do that is to follow jemalloc and use aligned big chunks. I will file a bug about that.
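The jemalloc-style trick Igor alludes to can be sketched as follows. The constants here are illustrative (not the actual GC chunk size): if every chunk is allocated at a CHUNK_SIZE-aligned address, the chunk base for any GC thing can be computed by masking its pointer, so walking arenas need not chase `a->info.prev` pointers through scattered cache lines and TLB entries:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants: 1 MiB chunks aligned on a 1 MiB boundary.
const uintptr_t CHUNK_SHIFT = 20;
const uintptr_t CHUNK_SIZE  = uintptr_t(1) << CHUNK_SHIFT;
const uintptr_t CHUNK_MASK  = CHUNK_SIZE - 1;

// With aligned chunks, per-chunk metadata lives at a fixed offset from the
// masked pointer — a pure arithmetic lookup, no pointer chasing.
inline uintptr_t ChunkBase(const void *thing) {
    return reinterpret_cast<uintptr_t>(thing) & ~CHUNK_MASK;
}

inline uintptr_t OffsetInChunk(const void *thing) {
    return reinterpret_cast<uintptr_t>(thing) & CHUNK_MASK;
}
```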
(Assignee)

Updated

8 years ago
Attachment #430783 - Attachment is obsolete: true
(Assignee)

Updated

8 years ago
Summary: make mark bits persistent until next GC, simplify delayed marking → make mark bits persistent until next GC
(Assignee)

Comment 19

8 years ago
Actually, I am just going to invalidate this bug. It's badly out of focus.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → WONTFIX