Closed Bug 600648 Opened 14 years ago Closed 14 years ago

Inefficient calculations of the mark bitmap

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: igor, Assigned: igor)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 5 obsolete files)

Since the bug 558861 the mark bitmap is represented as an array of bitmaps. But this is inefficient since the marking code first calculates chunk and arena to get the index into the bitmap array and then it calculates the cell index within the arena and uses it for the bitmap lookup. Instead the code should directly calculates the cell index within the chunk and use that as an index into the global arena avoiding the intermediate arena representation.
Attached patch v1 (obsolete) — Splinter Review
The patch that passes just "make check" and jsapi-tests. To test it I use the following synthetic benchmark hat exposes the cost of mark bitmap calculation running the GC against the heap full of live short strings. For such heap the bitmap is queried twice, once during marking and once during finalization with no altering of any other data structures. function test() { var array = []; for (var i = 0; i != 1e6; ++i) array.push(String.fromCharCode(100, 101, 102 + i % 20)); var t = Date.now(); gc(); t = Date.now() - t; print(t); } test(); With the patch applied callgrind shows the reduction of js_GC() timing by 10% when running optimized threadsafe js shell. The data also shows that both marking and finalization receives about the same 10% speedup. On the other hand 10% sounds way too much for an optimization that just eliminates few shifts and additions per mark bit operation. So a try server run is in due.
Attached patch v2 (obsolete) — Splinter Review
Here is a rebased patch (it is applied on top of the patch from the bug 600234 comment 25). It passes the tryserver. Besides the wins above compared with the the situation without the patch callgrind also shows 1.5% win for v8-splay.
Attachment #479773 - Attachment is obsolete: true
Attachment #480040 - Flags: review?(anygregor)
Depends on: 605029
Attached patch v3 (obsolete) — Splinter Review
Here is an updated patch based on the patch from bug 605029. It have no effects on SS. This is expected as the patch only changes the mark bit calculations during the GC. I also see no changes with V8 as some observed wins there are within the noise. On the other hand the following benchmark shows a 10% reduction in GC timing for a heap full of live strings. function test() { var array = []; for (var i = 0; i != 5e6; ++i) array.push("some_long_string_that_does_not_short_string".substring(1)); var t = Date.now(); gc(); t = Date.now() - t; print(t); } test();
Attachment #480040 - Attachment is obsolete: true
Attachment #485274 - Flags: review?(gal)
Attachment #480040 - Flags: review?(anygregor)
Attached patch v4 (obsolete) — Splinter Review
Here is an unrotten patch. It shows similar results as with the previous versions. Benchmarks where the GC contributes to the timing clearly benefits from it.In particular, v8-splay runs faster by 0.5% as measured by sunspuder and valgrind.
Attachment #485274 - Attachment is obsolete: true
Attachment #526721 - Flags: review?(gal)
Attachment #485274 - Flags: review?(gal)
0.5% doesn't meet my threshold for "clearly benefits"...
(In reply to comment #5) > 0.5% doesn't meet my threshold for "clearly benefits"... See comment 1 - this optimization reduces the GC pause time by up to 10%.
(In reply to comment #6) > (In reply to comment #5) > > 0.5% doesn't meet my threshold for "clearly benefits"... > > See comment 1 - this optimization reduces the GC pause time by up to 10%. On 32 bit that test shows on the latest tip 12% speedup. On 64 bit the gain is about 4%.
This sounds great. If you have time, could you measure cache misses before/after the patch? That might explain why it's doing so well.
(In reply to comment #8) > This sounds great. If you have time, could you measure cache misses > before/after the patch? That might explain why it's doing so well. This is unrelated to caches and shows that at least GCC optimizer is not that powerful. Without the patch the offset of the bitmap word where the mark bit will be set for the thing is calculated using: arenaIndex = (thingPtr & CHUNK_MASK) >> 12; // 12 == ArenaShift thingIndex = (thingPtr & ArenaMask) >> 3; // 3 == Cell::CellShift offset = arenaIndex * 64 + (thingIndex >> JS_BITS_PER_WORD_LOG2) * sizeof(word); // 64 == sizeof(ArenaBitmap) or offset = ((thingPtr & CHUNK_MASK) >> 12) * 64 + ((thingPtr & ArenaMask) >> (3 + JS_BITS_PER_WORD_LOG2)) * sizeof(word); or (on 32 bit CPU) offset = (((thingPtr & CHUNK_MASK) >> 12) << 6) + ((thingPtr & ArenaMask) >> 8) << 2; or offset = (((thingPtr & CHUNK_MASK) >> 6) & ~63) + (((thingPtr & ArenaMask) >> 6) & ~3) or, assuming that the compiler see that thingPtr is at least word-aligned, offset = (((thingPtr & CHUNK_MASK) >> 6) & ~63) + ((thingPtr & ArenaMask) >> 6) The patch turns that into: thingIndex = (thingPtr & CHUNK_MASK) >> 3; // 3 == Cell::CellShift offset = (thingIndex >> JS_BITS_PER_WORD_LOG2) * sizeof(word); or (on 32 bit CPU) offset = ((thingPtr & CHUNK_MASK) >> 6; This two are equivalent, but the GCC 4.5 optimizer do not see it even at -O4. I.e. for the following C source: static unsigned CHUNK_MASK = (1 << 20) - 1; unsigned case1(unsigned x) { return (((x & CHUNK_MASK) >> 6) & ~63) + ((x & 4095) >> 6); } unsigned case2(unsigned x) { return (x & CHUNK_MASK) >> 6; } The compiler generates for case 1: movl %eax, %edx andl $4095, %eax shrl $6, %edx andl $16320, %edx shrl $6, %eax leal (%edx,%eax), %eax and for the case 2: andl $1048575, %eax shrl $6, %eax
Summary: Inefficient representation of the mark bitmap → Inefficient calculations of the mark bitmap
Attached patch v5 (obsolete) — Splinter Review
Here is a rebased patch. It shows no differences in SS benchmark and in V8 for the splay benchmark it shows the 1.5% win. This is expected as patch affects mostly the marking phase timing. For the wins in synthetic benchmarks that exercise the marking phase see then comment 7.
Attachment #526721 - Attachment is obsolete: true
Attachment #529340 - Flags: review?(wmccloskey)
Attachment #526721 - Flags: review?(gal)
Comment on attachment 529340 [details] [diff] [review] v5 Review of attachment 529340 [details] [diff] [review]: Nice patch. Thanks for keeping it small and self-contained. ::: js/src/jsgc.h @@ +128,5 @@ inline Chunk *chunk() const; + + uintptr_t arenaAddress() const { + return address(); + } What's the purpose of this method? You just want a better name? @@ +300,5 @@ PRLock *chunkLock; #endif }; +const size_t BytesPerArena = ArenaSize + ArenaBitmapBits / 8 + sizeof(MarkingDelay); Could you make a new constant, ArenaBitmapBytes? This is used below as well. @@ +312,5 @@ + +/* A chunk bitmap contains enough mark bits for all the cells in a chunk. */ +struct ChunkBitmap { + static const uintptr_t BitCount = ArenaBitmapBits * ArenasPerChunk; + static const uintptr_t WordCount = (BitCount + JS_BITS_PER_WORD - 1) / JS_BITS_PER_WORD; Shouldn't the static assertion above imply that you don't need to round up? @@ +316,5 @@ + static const uintptr_t WordCount = (BitCount + JS_BITS_PER_WORD - 1) / JS_BITS_PER_WORD; + + uintptr_t bitmap[WordCount]; + + JS_ALWAYS_INLINE uintptr_t *getWordAndMask(const Cell *cell, uint32 color, uintptr_t *maskp); I would kinda prefer getMarkWordAndMask, but it's up to you. @@ +374,5 @@ +#endif + +}; + +JS_STATIC_ASSERT(ArenaBitmapBits / 8 * ArenasPerChunk== sizeof(ChunkBitmap)); Need a space here.
Attachment #529340 - Flags: review?(wmccloskey) → review+
Attached patch v6Splinter Review
The new version addresses the nits.
Attachment #529340 - Attachment is obsolete: true
Attachment #529856 - Flags: review+
Whiteboard: fixed-in-tracemonkey
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: