553812 - separate GC arena info and mark bits from the arena

Assignee

Description

15 years ago

+++ This bug was initially created as a clone of Bug #550373 comment 18 +++

As the data from bug 550373 comment 18 show, the sweeping of doubles is dominated by TLB and L2 cache misses, because the CPU has to access the arena info and mark bitmap stored together with dead double values. If we separate the info and mark bitmap storage from the data itself, we can avoid most TLB misses, as the info would be packed into a dense array. That should also benefit the marking phase, since for already marked objects there would be no need to populate the TLB.

To implement that, I suggest following jemalloc and using big (2MB by default) chunks aligned on the chunk size. The arena info and the mark bitmap will be stored at the beginning of the chunk, with the arenas coming after that.
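A minimal sketch of the proposed layout may make the address arithmetic clearer. It is not taken from any of the attached patches: the 4 KB arena size, the 8-byte cell size, the `ArenaInfo` fields, and every name below are assumptions for illustration. The point it shows is that, because a chunk is aligned on its own size, masking any GC-thing address yields the chunk base, and the arena info and mark bit can then be found in the dense header without touching the arena's own page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sizes and names, chosen for illustration only.
constexpr std::size_t kChunkSize = std::size_t(2) << 20;  // 2 MB chunk, aligned on its own size
constexpr std::size_t kArenaSize = 4096;                  // assumed arena size
constexpr std::size_t kCellSize = 8;                      // one mark bit per 8-byte cell (e.g. a double)
constexpr std::uintptr_t kChunkMask = kChunkSize - 1;

struct ArenaInfo {
    std::uint32_t liveCells;  // placeholder for per-arena bookkeeping
};

constexpr std::size_t kMarkBytesPerArena = kArenaSize / kCellSize / 8;
constexpr std::size_t kArenasPerChunk =
    kChunkSize / (kArenaSize + sizeof(ArenaInfo) + kMarkBytesPerArena);

// Dense metadata stored at the start of every chunk.
struct ChunkHeader {
    ArenaInfo info[kArenasPerChunk];
    std::uint8_t markBits[kArenasPerChunk][kMarkBytesPerArena];
};

// Arenas start at the first arena-aligned offset after the header.
constexpr std::size_t kFirstArenaOffset =
    (sizeof(ChunkHeader) + kArenaSize - 1) & ~(kArenaSize - 1);
static_assert(kFirstArenaOffset + kArenasPerChunk * kArenaSize <= kChunkSize,
              "header plus arenas must fit in one chunk");

// Chunk base: clear the low bits of any address inside the chunk.
inline std::uintptr_t chunkBase(std::uintptr_t addr) { return addr & ~kChunkMask; }

// Index of the arena that contains addr.
inline std::size_t arenaIndex(std::uintptr_t addr) {
    std::size_t offset = addr & kChunkMask;
    assert(offset >= kFirstArenaOffset);  // addr must lie in the arena area
    std::size_t index = (offset - kFirstArenaOffset) / kArenaSize;
    assert(index < kArenasPerChunk);
    return index;
}

inline ChunkHeader* headerOf(std::uintptr_t addr) {
    return reinterpret_cast<ChunkHeader*>(chunkBase(addr));
}

inline ArenaInfo& arenaInfoOf(std::uintptr_t addr) {
    return headerOf(addr)->info[arenaIndex(addr)];
}

// Sets the mark bit for the cell at addr; returns true if it was already set.
// Only the chunk header is read or written, never the arena page itself.
inline bool testAndSetMark(std::uintptr_t addr) {
    std::size_t cell = (addr & (kArenaSize - 1)) / kCellSize;
    std::uint8_t& byte = headerOf(addr)->markBits[arenaIndex(addr)][cell / 8];
    std::uint8_t bit = static_cast<std::uint8_t>(1u << (cell % 8));
    bool wasMarked = (byte & bit) != 0;
    byte = static_cast<std::uint8_t>(byte | bit);
    return wasMarked;
}
```

With these assumed sizes, the header for all arenas of a chunk occupies a few contiguous pages, so a sweep that only consults mark bits walks a small dense region instead of one page per arena.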

Andreas Gal :gal

Comment 1

15 years ago

If we can figure out how to obtain a 2MB Jumbo page from the OS, that should significantly reduce TLB pressure. I assume most OSes run with PAE enabled these days.

Mike Shaver (:shaver emeritus)

Comment 2

15 years ago

None of the 32-bit desktop Windows OSes (XP, Vista32, Win7 32) run in PAE mode, unless you hack the server version of a kernel module into them and edit your boot config. True story!

Mike Shaver (:shaver emeritus)

Comment 3

15 years ago

I lied. PAE is supported, though the maximum RAM is still restricted to 4GB in those cases, no doubt for very important consumer protection and security reasons! http://en.wikipedia.org/wiki/Physical_Address_Extension#Microsoft_Windows

Andreas Gal :gal

Comment 4

15 years ago

NX is only available in PAE mode, so everyone is or should be using it. Not sure whether everyone exposes jumbo pages to user space though, hence comment #1.

Mike Shaver (:shaver emeritus)

Comment 5

15 years ago

Windows exposes large pages to user space, but the memory is non-pageable and I think it requires some privilege elevation, so perhaps not what we want. We can get the alignment via _aligned_malloc, I believe, on Windows. Their sample code shows 2^16 alignment, but it would be easy to test if we could do better. That wouldn't help with the TLB issue, of course. :-/
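For reference, a hedged sketch of what a chunk-aligned allocation could look like. The helper names (`allocAligned`, `freeAligned`, `isAligned`) are hypothetical, and the non-Windows branch using `posix_memalign` is an addition for portability, not something proposed in this bug. Whether `_aligned_malloc` behaves well with a 2 MB alignment would still need to be tested, as noted above, and none of this changes the TLB behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#ifdef _WIN32
#include <malloc.h>
#endif

// Hypothetical helper: allocate `size` bytes aligned on `alignment`,
// which must be a power of two. Returns nullptr on failure.
inline void* allocAligned(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
#ifdef _WIN32
    return _aligned_malloc(size, alignment);
#else
    void* p = nullptr;
    if (posix_memalign(&p, alignment, size) != 0)
        return nullptr;
    return p;
#endif
}

// Memory from _aligned_malloc must be released with _aligned_free.
inline void freeAligned(void* p) {
#ifdef _WIN32
    _aligned_free(p);
#else
    std::free(p);
#endif
}

// True if p is a multiple of `alignment` (a power of two).
inline bool isAligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

A caller would request `allocAligned(chunkSize, chunkSize)` with a 2 MB chunk size and could then recover the chunk base from any interior pointer by masking off the low 21 bits.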

Andreas Gal :gal

Comment 6

15 years ago

Igor, could we get some data comparing TLB stalls to L2 stalls? It should be pretty easy to get with Shark. If L2 stalls dominate, we don't have to worry about the jumbo page business too much; we just need to provide better spatial locality.

wip (patch, 45.45 KB), Igor Bukanov, 15 years ago
v1 (patch, 74.20 KB), Igor Bukanov, 15 years ago
v2 (patch, 84.85 KB), Igor Bukanov, 15 years ago
v3 (patch, 88.37 KB), Igor Bukanov, 15 years ago
v4 (patch, 89.27 KB), Igor Bukanov, 15 years ago
v5 (patch, 91.19 KB), Igor Bukanov, 15 years ago, gwagner: review+
v6 (patch, 91.19 KB), Igor Bukanov, 15 years ago, gal: review+
v7 (patch, 92.09 KB), Igor Bukanov, 15 years ago, igor: review+
v8 (patch, 92.32 KB), Igor Bukanov, 15 years ago, igor: review+