Last Comment Bug 600648 - Inefficient calculations of the mark bitmap
: Inefficient calculations of the mark bitmap
Status: RESOLVED FIXED
fixed-in-tracemonkey
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: All All
: -- normal (vote)
: ---
Assigned To: Igor Bukanov
:
Mentors:
Depends on: 605029
Blocks:
  Show dependency treegraph
 
Reported: 2010-09-29 12:14 PDT by Igor Bukanov
Modified: 2011-05-10 15:12 PDT (History)
5 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
v1 (13.98 KB, patch)
2010-09-30 07:36 PDT, Igor Bukanov
no flags Details | Diff | Review
v2 (13.81 KB, patch)
2010-10-01 01:19 PDT, Igor Bukanov
no flags Details | Diff | Review
v3 (13.12 KB, patch)
2010-10-22 06:35 PDT, Igor Bukanov
no flags Details | Diff | Review
v4 (14.63 KB, patch)
2011-04-18 07:29 PDT, Igor Bukanov
no flags Details | Diff | Review
v5 (14.54 KB, patch)
2011-05-01 03:32 PDT, Igor Bukanov
wmccloskey: review+
Details | Diff | Review
v6 (18.97 KB, patch)
2011-05-03 15:30 PDT, Igor Bukanov
igor: review+
Details | Diff | Review

Description Igor Bukanov 2010-09-29 12:14:15 PDT
Since the bug 558861 the mark bitmap is represented as an array of bitmaps. But this is inefficient since the marking code first calculates chunk and arena to get the index into the bitmap array and then it calculates the cell index within the arena and uses it for the bitmap lookup. Instead the code should directly calculates the cell index within the chunk and use that as an index into the global arena avoiding the intermediate arena representation.
Comment 1 Igor Bukanov 2010-09-30 07:36:33 PDT
Created attachment 479773 [details] [diff] [review]
v1

The patch that passes just "make check" and jsapi-tests. 

To test it I use the following synthetic benchmark hat exposes the cost of mark bitmap calculation running the GC against the heap full of live short strings. For such heap the bitmap is queried twice, once during marking and once during finalization with no altering of any other data structures.

function test()
{
    var array = [];
    for (var i = 0; i != 1e6; ++i)
	array.push(String.fromCharCode(100, 101, 102 + i % 20));
    var t = Date.now();
    gc();
    t = Date.now() - t;
    print(t);
}

test();

With the patch applied callgrind shows the reduction of js_GC() timing by 10% when running optimized threadsafe js shell. The data also shows that both marking and finalization receives about the same 10% speedup.

On the other hand 10% sounds way too much for an optimization that just eliminates few shifts and additions per mark bit operation. So a try server run is in due.
Comment 2 Igor Bukanov 2010-10-01 01:19:37 PDT
Created attachment 480040 [details] [diff] [review]
v2

Here is a rebased patch (it is applied on top of the patch from the bug 600234 comment 25). It passes the tryserver. 

Besides the wins above compared with the the situation without the patch callgrind also shows 1.5% win for v8-splay.
Comment 3 Igor Bukanov 2010-10-22 06:35:39 PDT
Created attachment 485274 [details] [diff] [review]
v3

Here is an updated patch based on the patch from bug 605029. It have no effects on SS. This is expected as the patch only changes the mark bit calculations during the GC. I also see no changes with V8 as some observed wins there are within the noise.

On the other hand the following benchmark shows a 10% reduction in GC timing for a heap full of live strings.

function test()
{
    var array = [];
    for (var i = 0; i != 5e6; ++i)
        array.push("some_long_string_that_does_not_short_string".substring(1));
    var t = Date.now();
    gc();
    t = Date.now() - t;
    print(t);
}

test();
Comment 4 Igor Bukanov 2011-04-18 07:29:05 PDT
Created attachment 526721 [details] [diff] [review]
v4

Here is an unrotten patch. It shows similar results as with the previous versions. Benchmarks where the GC contributes to the timing clearly benefits from it.In particular, v8-splay runs faster by 0.5% as measured by sunspuder and valgrind.
Comment 5 Nicholas Nethercote [:njn] 2011-04-19 19:59:42 PDT
0.5% doesn't meet my threshold for "clearly benefits"...
Comment 6 Igor Bukanov 2011-04-19 22:00:14 PDT
(In reply to comment #5)
> 0.5% doesn't meet my threshold for "clearly benefits"...

See comment 1 - this optimization reduces the GC pause time by up to 10%.
Comment 7 Igor Bukanov 2011-04-19 23:14:28 PDT
(In reply to comment #6)
> (In reply to comment #5)
> > 0.5% doesn't meet my threshold for "clearly benefits"...
> 
> See comment 1 - this optimization reduces the GC pause time by up to 10%.

On 32 bit that test shows on the latest tip 12% speedup. On 64 bit the gain is about 4%.
Comment 8 Bill McCloskey (:billm) 2011-04-19 23:33:35 PDT
This sounds great. If you have time, could you measure cache misses before/after the patch? That might explain why it's doing so well.
Comment 9 Igor Bukanov 2011-04-20 11:57:28 PDT
(In reply to comment #8)
> This sounds great. If you have time, could you measure cache misses
> before/after the patch? That might explain why it's doing so well.

This is unrelated to caches and shows that at least GCC optimizer is not that powerful. Without the patch the offset of the bitmap word where the mark bit will be set for the thing is calculated using:

arenaIndex = (thingPtr & CHUNK_MASK) >> 12; // 12 == ArenaShift
thingIndex = (thingPtr & ArenaMask) >> 3; // 3 == Cell::CellShift
offset = arenaIndex * 64 + (thingIndex >> JS_BITS_PER_WORD_LOG2) * sizeof(word); // 64 == sizeof(ArenaBitmap)

or

offset = ((thingPtr & CHUNK_MASK) >> 12) * 64 + ((thingPtr & ArenaMask) >> (3 + JS_BITS_PER_WORD_LOG2)) * sizeof(word);

or (on 32 bit CPU)

offset = (((thingPtr & CHUNK_MASK) >> 12) << 6) + ((thingPtr & ArenaMask) >> 8) << 2;

or 

offset = (((thingPtr & CHUNK_MASK) >> 6) & ~63) + (((thingPtr & ArenaMask) >> 6) & ~3)

or, assuming that the compiler see that thingPtr is at least word-aligned,

offset = (((thingPtr & CHUNK_MASK) >> 6) & ~63) + ((thingPtr & ArenaMask) >> 6)

The patch turns that into:

thingIndex = (thingPtr & CHUNK_MASK) >> 3; // 3 == Cell::CellShift
offset = (thingIndex >> JS_BITS_PER_WORD_LOG2) * sizeof(word);

or (on 32 bit CPU)

offset = ((thingPtr & CHUNK_MASK) >> 6;

This two are equivalent, but the GCC 4.5 optimizer do not see it even at -O4. I.e. for the following C source:

static unsigned CHUNK_MASK = (1 << 20) - 1;

unsigned case1(unsigned x)
{
    return (((x & CHUNK_MASK) >> 6) & ~63) + ((x & 4095) >> 6);
}

unsigned case2(unsigned x)
{
    return (x & CHUNK_MASK) >> 6;
}

The compiler generates for case 1:

	movl	%eax, %edx
	andl	$4095, %eax
	shrl	$6, %edx
	andl	$16320, %edx
	shrl	$6, %eax
	leal	(%edx,%eax), %eax

and for the case 2:

	andl	$1048575, %eax
	shrl	$6, %eax
Comment 10 Igor Bukanov 2011-05-01 03:32:25 PDT
Created attachment 529340 [details] [diff] [review]
v5

Here is a rebased patch. It shows no differences in SS benchmark and in V8 for the splay benchmark it shows the 1.5% win. This is expected as patch affects mostly the marking phase timing. For the wins in synthetic benchmarks that exercise the marking phase see then comment 7.
Comment 11 Bill McCloskey (:billm) 2011-05-03 13:06:52 PDT
Comment on attachment 529340 [details] [diff] [review]
v5

Review of attachment 529340 [details] [diff] [review]:

Nice patch. Thanks for keeping it small and self-contained.

::: js/src/jsgc.h
@@ +128,5 @@
     inline Chunk *chunk() const;
+
+    uintptr_t arenaAddress() const {
+        return address();
+    }

What's the purpose of this method? You just want a better name?

@@ +300,5 @@
     PRLock          *chunkLock;
 #endif
 };
 
+const size_t BytesPerArena = ArenaSize + ArenaBitmapBits / 8 + sizeof(MarkingDelay);

Could you make a new constant, ArenaBitmapBytes? This is used below as well.

@@ +312,5 @@
+
+/* A chunk bitmap contains enough mark bits for all the cells in a chunk. */
+struct ChunkBitmap {
+    static const uintptr_t BitCount = ArenaBitmapBits * ArenasPerChunk;
+    static const uintptr_t WordCount = (BitCount + JS_BITS_PER_WORD - 1) / JS_BITS_PER_WORD;

Shouldn't the static assertion above imply that you don't need to round up?

@@ +316,5 @@
+    static const uintptr_t WordCount = (BitCount + JS_BITS_PER_WORD - 1) / JS_BITS_PER_WORD;
+
+    uintptr_t bitmap[WordCount];
+
+    JS_ALWAYS_INLINE uintptr_t *getWordAndMask(const Cell *cell, uint32 color, uintptr_t *maskp);

I would kinda prefer getMarkWordAndMask, but it's up to you.

@@ +374,5 @@
+#endif
+    
+};
+
+JS_STATIC_ASSERT(ArenaBitmapBits / 8 * ArenasPerChunk== sizeof(ChunkBitmap));

Need a space here.
Comment 12 Igor Bukanov 2011-05-03 15:30:00 PDT
Created attachment 529856 [details] [diff] [review]
v6

The new version addresses the nits.
Comment 14 Chris Leary [:cdleary] (not checking bugmail) 2011-05-10 15:12:44 PDT
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/f1492c3cf81d

Note You need to log in before you can comment on or make changes to this bug.