The default bug view has changed. See this FAQ.

Inefficient calculations of the mark bitmap

RESOLVED FIXED

Status

()

Core
JavaScript Engine
RESOLVED FIXED
7 years ago
6 years ago

People

(Reporter: Igor Bukanov, Assigned: Igor Bukanov)

Tracking

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 attachment, 5 obsolete attachments)

(Assignee)

Description

7 years ago
Since the bug 558861 the mark bitmap is represented as an array of bitmaps. But this is inefficient since the marking code first calculates chunk and arena to get the index into the bitmap array and then it calculates the cell index within the arena and uses it for the bitmap lookup. Instead the code should directly calculates the cell index within the chunk and use that as an index into the global arena avoiding the intermediate arena representation.
(Assignee)

Comment 1

7 years ago
Created attachment 479773 [details] [diff] [review]
v1

The patch that passes just "make check" and jsapi-tests. 

To test it I use the following synthetic benchmark hat exposes the cost of mark bitmap calculation running the GC against the heap full of live short strings. For such heap the bitmap is queried twice, once during marking and once during finalization with no altering of any other data structures.

function test()
{
    var array = [];
    for (var i = 0; i != 1e6; ++i)
	array.push(String.fromCharCode(100, 101, 102 + i % 20));
    var t = Date.now();
    gc();
    t = Date.now() - t;
    print(t);
}

test();

With the patch applied callgrind shows the reduction of js_GC() timing by 10% when running optimized threadsafe js shell. The data also shows that both marking and finalization receives about the same 10% speedup.

On the other hand 10% sounds way too much for an optimization that just eliminates few shifts and additions per mark bit operation. So a try server run is in due.
(Assignee)

Comment 2

7 years ago
Created attachment 480040 [details] [diff] [review]
v2

Here is a rebased patch (it is applied on top of the patch from the bug 600234 comment 25). It passes the tryserver. 

Besides the wins above compared with the the situation without the patch callgrind also shows 1.5% win for v8-splay.
Attachment #479773 - Attachment is obsolete: true
Attachment #480040 - Flags: review?(anygregor)
(Assignee)

Updated

7 years ago
Depends on: 605029
(Assignee)

Comment 3

7 years ago
Created attachment 485274 [details] [diff] [review]
v3

Here is an updated patch based on the patch from bug 605029. It have no effects on SS. This is expected as the patch only changes the mark bit calculations during the GC. I also see no changes with V8 as some observed wins there are within the noise.

On the other hand the following benchmark shows a 10% reduction in GC timing for a heap full of live strings.

function test()
{
    var array = [];
    for (var i = 0; i != 5e6; ++i)
        array.push("some_long_string_that_does_not_short_string".substring(1));
    var t = Date.now();
    gc();
    t = Date.now() - t;
    print(t);
}

test();
Attachment #480040 - Attachment is obsolete: true
Attachment #485274 - Flags: review?(gal)
Attachment #480040 - Flags: review?(anygregor)
(Assignee)

Comment 4

6 years ago
Created attachment 526721 [details] [diff] [review]
v4

Here is an unrotten patch. It shows similar results as with the previous versions. Benchmarks where the GC contributes to the timing clearly benefits from it.In particular, v8-splay runs faster by 0.5% as measured by sunspuder and valgrind.
Attachment #485274 - Attachment is obsolete: true
Attachment #526721 - Flags: review?(gal)
Attachment #485274 - Flags: review?(gal)
0.5% doesn't meet my threshold for "clearly benefits"...
(Assignee)

Comment 6

6 years ago
(In reply to comment #5)
> 0.5% doesn't meet my threshold for "clearly benefits"...

See comment 1 - this optimization reduces the GC pause time by up to 10%.
(Assignee)

Comment 7

6 years ago
(In reply to comment #6)
> (In reply to comment #5)
> > 0.5% doesn't meet my threshold for "clearly benefits"...
> 
> See comment 1 - this optimization reduces the GC pause time by up to 10%.

On 32 bit that test shows on the latest tip 12% speedup. On 64 bit the gain is about 4%.
This sounds great. If you have time, could you measure cache misses before/after the patch? That might explain why it's doing so well.
(Assignee)

Comment 9

6 years ago
(In reply to comment #8)
> This sounds great. If you have time, could you measure cache misses
> before/after the patch? That might explain why it's doing so well.

This is unrelated to caches and shows that at least GCC optimizer is not that powerful. Without the patch the offset of the bitmap word where the mark bit will be set for the thing is calculated using:

arenaIndex = (thingPtr & CHUNK_MASK) >> 12; // 12 == ArenaShift
thingIndex = (thingPtr & ArenaMask) >> 3; // 3 == Cell::CellShift
offset = arenaIndex * 64 + (thingIndex >> JS_BITS_PER_WORD_LOG2) * sizeof(word); // 64 == sizeof(ArenaBitmap)

or

offset = ((thingPtr & CHUNK_MASK) >> 12) * 64 + ((thingPtr & ArenaMask) >> (3 + JS_BITS_PER_WORD_LOG2)) * sizeof(word);

or (on 32 bit CPU)

offset = (((thingPtr & CHUNK_MASK) >> 12) << 6) + ((thingPtr & ArenaMask) >> 8) << 2;

or 

offset = (((thingPtr & CHUNK_MASK) >> 6) & ~63) + (((thingPtr & ArenaMask) >> 6) & ~3)

or, assuming that the compiler see that thingPtr is at least word-aligned,

offset = (((thingPtr & CHUNK_MASK) >> 6) & ~63) + ((thingPtr & ArenaMask) >> 6)

The patch turns that into:

thingIndex = (thingPtr & CHUNK_MASK) >> 3; // 3 == Cell::CellShift
offset = (thingIndex >> JS_BITS_PER_WORD_LOG2) * sizeof(word);

or (on 32 bit CPU)

offset = ((thingPtr & CHUNK_MASK) >> 6;

This two are equivalent, but the GCC 4.5 optimizer do not see it even at -O4. I.e. for the following C source:

static unsigned CHUNK_MASK = (1 << 20) - 1;

unsigned case1(unsigned x)
{
    return (((x & CHUNK_MASK) >> 6) & ~63) + ((x & 4095) >> 6);
}

unsigned case2(unsigned x)
{
    return (x & CHUNK_MASK) >> 6;
}

The compiler generates for case 1:

	movl	%eax, %edx
	andl	$4095, %eax
	shrl	$6, %edx
	andl	$16320, %edx
	shrl	$6, %eax
	leal	(%edx,%eax), %eax

and for the case 2:

	andl	$1048575, %eax
	shrl	$6, %eax
Summary: Inefficient representation of the mark bitmap → Inefficient calculations of the mark bitmap
(Assignee)

Comment 10

6 years ago
Created attachment 529340 [details] [diff] [review]
v5

Here is a rebased patch. It shows no differences in SS benchmark and in V8 for the splay benchmark it shows the 1.5% win. This is expected as patch affects mostly the marking phase timing. For the wins in synthetic benchmarks that exercise the marking phase see then comment 7.
Attachment #526721 - Attachment is obsolete: true
Attachment #529340 - Flags: review?(wmccloskey)
Attachment #526721 - Flags: review?(gal)
Comment on attachment 529340 [details] [diff] [review]
v5

Review of attachment 529340 [details] [diff] [review]:

Nice patch. Thanks for keeping it small and self-contained.

::: js/src/jsgc.h
@@ +128,5 @@
     inline Chunk *chunk() const;
+
+    uintptr_t arenaAddress() const {
+        return address();
+    }

What's the purpose of this method? You just want a better name?

@@ +300,5 @@
     PRLock          *chunkLock;
 #endif
 };
 
+const size_t BytesPerArena = ArenaSize + ArenaBitmapBits / 8 + sizeof(MarkingDelay);

Could you make a new constant, ArenaBitmapBytes? This is used below as well.

@@ +312,5 @@
+
+/* A chunk bitmap contains enough mark bits for all the cells in a chunk. */
+struct ChunkBitmap {
+    static const uintptr_t BitCount = ArenaBitmapBits * ArenasPerChunk;
+    static const uintptr_t WordCount = (BitCount + JS_BITS_PER_WORD - 1) / JS_BITS_PER_WORD;

Shouldn't the static assertion above imply that you don't need to round up?

@@ +316,5 @@
+    static const uintptr_t WordCount = (BitCount + JS_BITS_PER_WORD - 1) / JS_BITS_PER_WORD;
+
+    uintptr_t bitmap[WordCount];
+
+    JS_ALWAYS_INLINE uintptr_t *getWordAndMask(const Cell *cell, uint32 color, uintptr_t *maskp);

I would kinda prefer getMarkWordAndMask, but it's up to you.

@@ +374,5 @@
+#endif
+    
+};
+
+JS_STATIC_ASSERT(ArenaBitmapBits / 8 * ArenasPerChunk== sizeof(ChunkBitmap));

Need a space here.
Attachment #529340 - Flags: review?(wmccloskey) → review+
(Assignee)

Comment 12

6 years ago
Created attachment 529856 [details] [diff] [review]
v6

The new version addresses the nits.
Attachment #529340 - Attachment is obsolete: true
Attachment #529856 - Flags: review+
(Assignee)

Comment 13

6 years ago
http://hg.mozilla.org/tracemonkey/rev/f1492c3cf81d
Whiteboard: fixed-in-tracemonkey
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/f1492c3cf81d
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.