Closed Bug 600648 Opened 15 years ago Closed 14 years ago

Inefficient calculations of the mark bitmap

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: igor, Assigned: igor)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 5 obsolete files)

v1 15 years ago Igor Bukanov 13.98 KB, patch		Details \| Diff \| Splinter Review
v2 15 years ago Igor Bukanov 13.81 KB, patch		Details \| Diff \| Splinter Review
v3 15 years ago Igor Bukanov 13.12 KB, patch		Details \| Diff \| Splinter Review
v4 14 years ago Igor Bukanov 14.63 KB, patch		Details \| Diff \| Splinter Review
v5 14 years ago Igor Bukanov 14.54 KB, patch	billm : review+	Details \| Diff \| Splinter Review
v6 14 years ago Igor Bukanov 18.97 KB, patch	igor : review+	Details \| Diff \| Splinter Review

Igor Bukanov

Assignee

Description

•

15 years ago

Since the bug 558861 the mark bitmap is represented as an array of bitmaps. But this is inefficient since the marking code first calculates chunk and arena to get the index into the bitmap array and then it calculates the cell index within the arena and uses it for the bitmap lookup. Instead the code should directly calculates the cell index within the chunk and use that as an index into the global arena avoiding the intermediate arena representation.

Igor Bukanov

Assignee

Comment 1

•

15 years ago

Attached patch v1 (obsolete) — Details — Splinter Review

The patch that passes just "make check" and jsapi-tests. To test it I use the following synthetic benchmark hat exposes the cost of mark bitmap calculation running the GC against the heap full of live short strings. For such heap the bitmap is queried twice, once during marking and once during finalization with no altering of any other data structures. function test() { var array = []; for (var i = 0; i != 1e6; ++i) array.push(String.fromCharCode(100, 101, 102 + i % 20)); var t = Date.now(); gc(); t = Date.now() - t; print(t); } test(); With the patch applied callgrind shows the reduction of js_GC() timing by 10% when running optimized threadsafe js shell. The data also shows that both marking and finalization receives about the same 10% speedup. On the other hand 10% sounds way too much for an optimization that just eliminates few shifts and additions per mark bit operation. So a try server run is in due.

Igor Bukanov

Assignee

Comment 2

•

15 years ago

Attached patch v2 (obsolete) — Details — Splinter Review

Here is a rebased patch (it is applied on top of the patch from the bug 600234 comment 25). It passes the tryserver. Besides the wins above compared with the the situation without the patch callgrind also shows 1.5% win for v8-splay.

Attachment #479773 - Attachment is obsolete: true

Attachment #480040 - Flags: review?(anygregor)

Igor Bukanov

Assignee

Updated

•

15 years ago

Depends on: 605029

Igor Bukanov

Assignee

Comment 3

•

15 years ago

Attached patch v3 (obsolete) — Details — Splinter Review

Here is an updated patch based on the patch from bug 605029. It have no effects on SS. This is expected as the patch only changes the mark bit calculations during the GC. I also see no changes with V8 as some observed wins there are within the noise. On the other hand the following benchmark shows a 10% reduction in GC timing for a heap full of live strings. function test() { var array = []; for (var i = 0; i != 5e6; ++i) array.push("some_long_string_that_does_not_short_string".substring(1)); var t = Date.now(); gc(); t = Date.now() - t; print(t); } test();

Attachment #480040 - Attachment is obsolete: true

Attachment #485274 - Flags: review?(gal)

Attachment #480040 - Flags: review?(anygregor)

Igor Bukanov

Assignee

Comment 4

•

14 years ago

Attached patch v4 (obsolete) — Details — Splinter Review

Here is an unrotten patch. It shows similar results as with the previous versions. Benchmarks where the GC contributes to the timing clearly benefits from it.In particular, v8-splay runs faster by 0.5% as measured by sunspuder and valgrind.

Attachment #485274 - Attachment is obsolete: true

Attachment #526721 - Flags: review?(gal)

Attachment #485274 - Flags: review?(gal)

Nicholas Nethercote [inactive]

Comment 5

•

14 years ago

0.5% doesn't meet my threshold for "clearly benefits"...

Igor Bukanov

Assignee

Comment 6

•

14 years ago

(In reply to comment #5) > 0.5% doesn't meet my threshold for "clearly benefits"... See comment 1 - this optimization reduces the GC pause time by up to 10%.

Igor Bukanov

Assignee

Comment 7

•

14 years ago

(In reply to comment #6) > (In reply to comment #5) > > 0.5% doesn't meet my threshold for "clearly benefits"... > > See comment 1 - this optimization reduces the GC pause time by up to 10%. On 32 bit that test shows on the latest tip 12% speedup. On 64 bit the gain is about 4%.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 8

•

14 years ago

This sounds great. If you have time, could you measure cache misses before/after the patch? That might explain why it's doing so well.

Igor Bukanov

Assignee

Comment 9

•

14 years ago

(In reply to comment #8) > This sounds great. If you have time, could you measure cache misses > before/after the patch? That might explain why it's doing so well. This is unrelated to caches and shows that at least GCC optimizer is not that powerful. Without the patch the offset of the bitmap word where the mark bit will be set for the thing is calculated using: arenaIndex = (thingPtr & CHUNK_MASK) >> 12; // 12 == ArenaShift thingIndex = (thingPtr & ArenaMask) >> 3; // 3 == Cell::CellShift offset = arenaIndex * 64 + (thingIndex >> JS_BITS_PER_WORD_LOG2) * sizeof(word); // 64 == sizeof(ArenaBitmap) or offset = ((thingPtr & CHUNK_MASK) >> 12) * 64 + ((thingPtr & ArenaMask) >> (3 + JS_BITS_PER_WORD_LOG2)) * sizeof(word); or (on 32 bit CPU) offset = (((thingPtr & CHUNK_MASK) >> 12) << 6) + ((thingPtr & ArenaMask) >> 8) << 2; or offset = (((thingPtr & CHUNK_MASK) >> 6) & ~63) + (((thingPtr & ArenaMask) >> 6) & ~3) or, assuming that the compiler see that thingPtr is at least word-aligned, offset = (((thingPtr & CHUNK_MASK) >> 6) & ~63) + ((thingPtr & ArenaMask) >> 6) The patch turns that into: thingIndex = (thingPtr & CHUNK_MASK) >> 3; // 3 == Cell::CellShift offset = (thingIndex >> JS_BITS_PER_WORD_LOG2) * sizeof(word); or (on 32 bit CPU) offset = ((thingPtr & CHUNK_MASK) >> 6; This two are equivalent, but the GCC 4.5 optimizer do not see it even at -O4. I.e. for the following C source: static unsigned CHUNK_MASK = (1 << 20) - 1; unsigned case1(unsigned x) { return (((x & CHUNK_MASK) >> 6) & ~63) + ((x & 4095) >> 6); } unsigned case2(unsigned x) { return (x & CHUNK_MASK) >> 6; } The compiler generates for case 1: movl %eax, %edx andl $4095, %eax shrl $6, %edx andl $16320, %edx shrl $6, %eax leal (%edx,%eax), %eax and for the case 2: andl $1048575, %eax shrl $6, %eax

Summary: Inefficient representation of the mark bitmap → Inefficient calculations of the mark bitmap

Igor Bukanov

Assignee

Comment 10

•

14 years ago

Attached patch v5 (obsolete) — Details — Splinter Review

Here is a rebased patch. It shows no differences in SS benchmark and in V8 for the splay benchmark it shows the 1.5% win. This is expected as patch affects mostly the marking phase timing. For the wins in synthetic benchmarks that exercise the marking phase see then comment 7.

Attachment #526721 - Attachment is obsolete: true

Attachment #529340 - Flags: review?(wmccloskey)

Attachment #526721 - Flags: review?(gal)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 11

•

14 years ago

Comment on attachment 529340 [details] [diff] [review] v5 Review of attachment 529340 [details] [diff] [review]: Nice patch. Thanks for keeping it small and self-contained. ::: js/src/jsgc.h @@ +128,5 @@ inline Chunk *chunk() const; + + uintptr_t arenaAddress() const { + return address(); + } What's the purpose of this method? You just want a better name? @@ +300,5 @@ PRLock *chunkLock; #endif }; +const size_t BytesPerArena = ArenaSize + ArenaBitmapBits / 8 + sizeof(MarkingDelay); Could you make a new constant, ArenaBitmapBytes? This is used below as well. @@ +312,5 @@ + +/* A chunk bitmap contains enough mark bits for all the cells in a chunk. */ +struct ChunkBitmap { + static const uintptr_t BitCount = ArenaBitmapBits * ArenasPerChunk; + static const uintptr_t WordCount = (BitCount + JS_BITS_PER_WORD - 1) / JS_BITS_PER_WORD; Shouldn't the static assertion above imply that you don't need to round up? @@ +316,5 @@ + static const uintptr_t WordCount = (BitCount + JS_BITS_PER_WORD - 1) / JS_BITS_PER_WORD; + + uintptr_t bitmap[WordCount]; + + JS_ALWAYS_INLINE uintptr_t *getWordAndMask(const Cell *cell, uint32 color, uintptr_t *maskp); I would kinda prefer getMarkWordAndMask, but it's up to you. @@ +374,5 @@ +#endif + +}; + +JS_STATIC_ASSERT(ArenaBitmapBits / 8 * ArenasPerChunk== sizeof(ChunkBitmap)); Need a space here.

Attachment #529340 - Flags: review?(wmccloskey) → review+

Igor Bukanov

Assignee

Comment 12

•

14 years ago

Attached patch v6 — Details — Splinter Review

The new version addresses the nits.

Attachment #529340 - Attachment is obsolete: true

Attachment #529856 - Flags: review+

Igor Bukanov

Assignee

Comment 13

•

14 years ago

http://hg.mozilla.org/tracemonkey/rev/f1492c3cf81d

Whiteboard: fixed-in-tracemonkey

Chris Leary [:cdleary] (not checking bugmail)

Comment 14

•

14 years ago

cdleary-bot mozilla-central merge info: http://hg.mozilla.org/mozilla-central/rev/f1492c3cf81d

Chris Leary [:cdleary] (not checking bugmail)

Updated

•

14 years ago

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Inefficient calculations of the mark bitmap

Categories

(Core :: JavaScript Engine, defect)

Tracking

()

People

(Reporter: igor, Assigned: igor)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Crash Data

Security

(public)

User Story

Attachments

(1 file, 5 obsolete files)

Description

Comment 1

Comment 2

Updated

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Updated

Attachment

General

Description

File Name

Content Type