Closed Bug 553812 Opened 13 years ago Closed 12 years ago

separate GC arena info and mark bits from the arena

Categories

(Core :: JavaScript Engine, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: igor, Assigned: igor)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 8 obsolete files)

+++ This bug was initially created as a clone of Bug #550373 comment 18 +++

As the data from the bug 550373 comment 18 the sweeping of doubles is dominated by the TLB and L2 cache misses as the CPU has to access arena info and mark bitmap stored together with dead double values. If we separate the info and mark bitmap storage from the data itself, we can avoid most TLB misses as the info would be packed into dense array. That should also benefit the marking phase as for already marked objects the would be no need to populate the TLB cache.

To implement that I suggest to follow jemalloc and use big (2MB by default) chunks aligned on the chunk size. The arena info and the mark bitmap will be stored at the beginning of the chunk with arena coming after that.
If we can figure out how to obtain a 2MB Jumbo page from the OS, that should significantly reduce TLB pressure. I assume most OSes run with PAE enabled these days.
None of the 32-bit desktop Windows OSes (XP, Vista32, Win7 32) run in PAE mode, unless you hack the server version of a kernel module into them and edit your boot config.  True story!
I lied.  PAE is supported, though the maximum RAM is still restricted to 4GB in those cases, no doubt for very important consumer protection and security reasons!

http://en.wikipedia.org/wiki/Physical_Address_Extension#Microsoft_Windows
NX is only available in PAE mode, so everyone is or should be using it. Not sure whether everyone exposes jumbo pages to user space though, hence comment #1.
Windows exposes large pages to user space, but the memory is non-pageable and I think it requires some privilege elevation, so perhaps not what we want.

We can get the alignment via _aligned_malloc, I believe, on Windows.  Their sample code shows 2^16 alignment, but it would be easy to test if we could do better.  That wouldn't help with the TLB issue, of course. :-/
Igor, could we get some data comparing TLB stalls to L2 stalls? Should be pretty easy to get with shark. If L2 stalls dominate, we don't have to worry about the jumbo page business too much, just provide better spacial locality.
(In reply to comment #5)
> We can get the alignment via _aligned_malloc, I believe, on Windows.  Their
> sample code shows 2^16 alignment, but it would be easy to test if we could do
> better.

_aligned_malloc(size+alignment) on Windows is just a wrapper around malloc that calls malloc(size+alignment+1) and then does the alignment.

For big aligned chunks aligned on their size it is necessary to do what jemalloc uses. That is, it does repeated mmap/munmup to get the proper alignment, see http://hg.mozilla.org/tracemonkey/file/1fd3836fc778/memory/jemalloc/jemalloc.c#l2365 .
(In reply to comment #6)
> Igor, could we get some data comparing TLB stalls to L2 stalls? 

On Linux cachegrind does not collect TLB stats AFAICS. It shows only L1 and L2 misses. Still even with that it show that  for a benchmark like

function test() {
    var N = 4e6;
    var array = [];
    for (var i = 0; i != N; ++i) {
        array.push(i + 0.1);
    }
    array = null;
    gc();
}

test();

there are 2 sources of L1/L2 misses. First, a->info.hasInfoDoubles misses the cache when it checks for free doubles. Second, the arena release loop in DestroyGCArenas provides almost the same penalty. Apparently JSGCArenaInfo does not stay even in L2 between the double sweeping loop and arena release loop. 


> If L2 stalls dominate, we don't have to worry
> about the jumbo page business too much, just provide better spacial locality.

Providing better space locality improves both TLB and L2. Consider both
Attached patch wip (obsolete) — Splinter Review
The patch just borrows the code from jemalloc to allocate aligned big chunks and switches the GC to use the big chunks. Separating arena info and mark bits from the arena itself is coming next.
Blocks: 541140
Attached patch v1 (obsolete) — Splinter Review
The patch implements separation of the GC thing from the mark bitmap and JSGCArenaInfo. It seems to pass the tests, but it completely lucks the comments on the new layout. With benchmarks that test GC performance with the heap consisting of mostly garbage double values I observed with the patch up to 30% reduction in GC time.
Attachment #433794 - Attachment is obsolete: true
Attached patch v2 (obsolete) — Splinter Review
The new version updates the comments and uses tighter layout for chunk info. I will submit it for review after some testing.
Attachment #434381 - Attachment is obsolete: true
The patch improves sunspider timing on Linux by 0.8% on with Atom N270 1.6GHz and by 1.1% on 4-core Xenon X3353 2.6 GHz CPU. For v8 the numbers are zero (no changes on Atom) and 1.8% (measured as the best score among 10 runs).
nice work!
Attached patch v3 (obsolete) — Splinter Review
The new version moves the fields related to delay marking from JSGCArenaInfo into another structure that is allocated separately in the chunk. This makes the size of JSGCArenaInfo a power of 2 and avoids multiplication when locating the info from GC things.
Attachment #435322 - Attachment is obsolete: true
Attachment #435410 - Flags: review?(brendan)
Comment on attachment 435410 [details] [diff] [review]
v3

Will review tomorrow (Monday 29-March-2010).

/be
Comment on attachment 435410 [details] [diff] [review]
v3

I'm dealing with movers today -- could Andreas and Gregor do this review? My only nit is the license comment needs to cite The Mozilla Foundation as "Initial Deveoper", and Andreas under "Contributor(s)".

/be
Attachment #435410 - Flags: review?(gal)
Attachment #435410 - Flags: review?(brendan)
Attachment #435410 - Flags: review?(anygregor)
Comment on attachment 435410 [details] [diff] [review]
v3

>+static void	*pages_map(void *addr, size_t size);
>+static void	pages_unmap(void *addr, size_t size);

Nit: Remove extra whitespace please.

>+
>+#ifdef XP_WIN
>+# ifdef JS_GC_USES_MAP_ALIGN
>+
>+#define ALIGN_ADDR2OFFSET(al, ad) \
>+	((uintptr_t)ad & (al - 1))
>+
>+static void *
>+pages_map_align(size_t size, size_t alignment)
>+{
>+	JS_ASSERT(size % alignment == 0);
>+	JS_ASSERT(size >= alignment);

Nit: These are tabs. Please de-tab-ify the entire file if you don't mind. I think we usually try to stick to space (I might be wrong).

>+	void *ret = pages_map(NULL, size);
>+	int offset = ALIGN_ADDR2OFFSET(alignment, ret);
>+	if (offset) {
>+		/* try to over allocate by the ammount we're offset */
>+		void *tmp;
>+		pages_unmap(ret, size);
>+		tmp = VirtualAlloc(NULL, size + alignment - offset,
>+					 MEM_RESERVE, PAGE_NOACCESS);
>+		if (offset == ALIGN_ADDR2OFFSET(alignment, tmp))
>+			ret = VirtualAlloc((void*)((intptr_t)tmp + alignment
>+						   - offset), size, MEM_COMMIT,
>+					   PAGE_READWRITE);
>+		else
>+			VirtualFree(tmp, 0, MEM_RELEASE);
>+		offset = ALIGN_ADDR2OFFSET(alignment, ret);
>+
>+		if (offset) {
>+			/* over allocate to ensure we have an aligned region */
>+			ret = VirtualAlloc(NULL, size + alignment, MEM_RESERVE,
>+					   PAGE_NOACCESS);
>+			offset = ALIGN_ADDR2OFFSET(alignment, ret);
>+			ret = VirtualAlloc((void*)((intptr_t)ret +
>+                                       alignment - offset),
>+                               size, MEM_COMMIT, PAGE_READWRITE);
>+		}
>+	}
>+	return (ret);
>+}
>+# endif /* JS_GC_USES_MAP_ALIGN */
>+
>+static void *
>+pages_map(void *addr, size_t size)
>+{
>+	void *ret = NULL;
>+#if defined(WINCE) && !defined(MOZ_MEMORY_WINCE6)
>+	void *va_ret;
>+	JS_ASSERT(addr == NULL);
>+	va_ret = VirtualAlloc(addr, size, MEM_RESERVE, PAGE_NOACCESS);
>+	if (va_ret)
>+		ret = VirtualAlloc(va_ret, size, MEM_COMMIT, PAGE_READWRITE);
>+	JS_ASSERT(va_ret == ret);
>+#else
>+	ret = VirtualAlloc(addr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
>+#endif
>+	return (ret);
>+}
>+
>+static void
>+pages_unmap(void *addr, size_t size)
>+{
>+	if (VirtualFree(addr, 0, MEM_RELEASE) == 0) {
>+#if defined(WINCE) && !defined(MOZ_MEMORY_WINCE6)
>+		if (GetLastError() == ERROR_INVALID_PARAMETER) {
>+			MEMORY_BASIC_INFORMATION info;
>+			VirtualQuery(addr, &info, sizeof(info));
>+			if (VirtualFree(info.AllocationBase, 0, MEM_RELEASE))
>+				return;
>+		}
>+#endif
>+        abort();

Are we sure we want an abort here? The SM-way seems to be to report an error back to the caller and deal with it there.

>+	kern_return_t err = vm_deallocate((vm_map_t)mach_task_self(),
>+                                      (vm_address_t)addr,
>+                                      (vm_size_t)size);
>+	if (err != KERN_SUCCESS)
>+        abort();

Dito. If this is a totally unexpected error, I would just assert on it, or report error flag back.


>+#if defined(MOZ_MEMORY_WINCE) && !defined(MOZ_MEMORY_WINCE6)
>+const size_t GC_CHUNK_SHIFT = 21;
>+#else
>+const size_t GC_CHUNK_SHIFT = 20;
>+#endif

Why the distinction? (not objecting, just curious).

>+ * All chunks that have at least one free arena are put on the doubly-linked
>+ * list with the head stored in JSRuntime.gcChunkList. JSGCChunkInfo contains
>+ * the head of the chunk's free arena list together with the link fields for
>+ * gcChunkList.
>+ * arena must be divisible by GC_CELL_SIZE, the minima allocation unit, and

minimal.

>+ * The number of arenas in the chunk is given by GC_ARENAS_PER_CHUNK. We find
>+ * that number as the following. Suppose chunk contains n arenas. Together

as follows.

>+ * with the word-aligned free arena bitmap and JSGCChunkInfo they should fit

fit into the

>-JS_STATIC_ASSERT(sizeof(jsbitmap) == sizeof(jsuword));
>+/* See comments before ThingsPerUnmarkedBit below. */
>+struct JSGCMarkingDelay {
>+    JSGCArena       *link;
>+    jsuword         unmarkedChildren;
>+};

I am going to have to rebase my patch to remove this.

>+/* Can only be applicable to arena where a->list is not null. */

Can only be called if thing is on an arena where a->list is not null.

>+/* Can only be applicable to arena where a->list is not null. */

>+ * with some delayed things. This should be fine as any GC graph
>+ * marking/traversing hooks must allow repeated calls during the same GC cycle. In particular, xpcom cycle collector relies on

Overlong line.
(In reply to comment #17)
> Dito. If this is a totally unexpected error, I would just assert on it, or
> report error flag back.

The error is unexpected (currently the GC assumes that munmap always succeeds). I will update the patch with asserts.

> 
> >+#if defined(MOZ_MEMORY_WINCE) && !defined(MOZ_MEMORY_WINCE6)
> >+const size_t GC_CHUNK_SHIFT = 21;
> >+#else
> >+const size_t GC_CHUNK_SHIFT = 20;
> >+#endif
> 
> Why the distinction? (not objecting, just curious).

This is what jemalloc does and comes from bug 478044. I am not sure why it was necessary.
Attached patch v4 (obsolete) — Splinter Review
Attachment #435410 - Attachment is obsolete: true
Attachment #435955 - Flags: review?
Attachment #435410 - Flags: review?(gal)
Attachment #435410 - Flags: review?(anygregor)
Comment on attachment 435955 [details] [diff] [review]
v4

The new version fixes above nits
Attachment #435955 - Flags: review?(anygregor)
Comment on attachment 435955 [details] [diff] [review]
v4

>+
>+#ifdef XP_WIN
>+# include <windows.h>
>+
>+# ifdef _MSC_VER
>+#  pragma warning( disable: 4267 4996 4146 )
>+# endif
>+
>+# if defined(WINCE) && !defined(MOZ_MEMORY_WINCE6)
>+/* Required for Windows CE < 6 */
>+#  define JS_GC_USES_MAP_ALIGN 1
>+#endif

nit: indentation?

>+
>+#elif defined(XP_MACOSX) || defined(DARWIN)
>+
>+# include <libkern/OSAtomic.h>
>+# include <mach/mach_error.h>
>+# include <mach/mach_init.h>
>+# include <mach/vm_map.h>
>+# include <malloc/malloc.h>
>+
>+#elif defined(XP_UNIX) || defined(XP_BEOS)
>+
>+# include <unistd.h>
>+# include <sys/mman.h>
>+
>+# ifndef MAP_NOSYNC
>+#  define MAP_NOSYNC    0
>+# endif
>+
>+/* Required on Solaris 10. Might improve performance elsewhere. */
>+# if defined(SOLARIS) && defined(MAP_ALIGN)
>+#  define JS_GC_USES_MAP_ALIGN 1
>+# endif
>+
>+#endif
>+
>+#ifndef JS_GC_USES_MAP_ALIGN
>+# define JS_GC_USES_MAP_ALIGN 0
>+#endif
>+
>+static void *pages_map(void *addr, size_t size);
>+static void pages_unmap(void *addr, size_t size);
>+
>+#ifdef XP_WIN
>+# ifdef JS_GC_USES_MAP_ALIGN
>+
>+#define ALIGN_ADDR2OFFSET(al, ad) \
>+    ((uintptr_t)ad & (al - 1))

indentation?

>+
>+static void *
>+pages_map_align(size_t size, size_t alignment)
>+{
>+    JS_ASSERT(size % alignment == 0);
>+    JS_ASSERT(size >= alignment);
>+    void *ret = pages_map(NULL, size);
>+    int offset = ALIGN_ADDR2OFFSET(alignment, ret);
>+    if (offset) {
>+        /* try to over allocate by the ammount we're offset */

nit: amount

>+        void *tmp;
>+        pages_unmap(ret, size);

please add a comment about the difference of the 2 Virtual Alloc calls
the reserve and commit is hard to see.
What is the problem with using _aligned_malloc?

>+        tmp = VirtualAlloc(NULL, size + alignment - offset,
>+                     MEM_RESERVE, PAGE_NOACCESS);
>+        if (offset == ALIGN_ADDR2OFFSET(alignment, tmp))
>+            ret = VirtualAlloc((void*)((intptr_t)tmp + alignment
>+                           - offset), size, MEM_COMMIT,
>+                       PAGE_READWRITE);
>+        else
>+            VirtualFree(tmp, 0, MEM_RELEASE);
>+        offset = ALIGN_ADDR2OFFSET(alignment, ret);
>+
>+        if (offset) {
>+            /* over allocate to ensure we have an aligned region */
>+            ret = VirtualAlloc(NULL, size + alignment, MEM_RESERVE,
>+                       PAGE_NOACCESS);
>+            offset = ALIGN_ADDR2OFFSET(alignment, ret);
>+            ret = VirtualAlloc((void*)((intptr_t)ret +
>+                                       alignment - offset),
>+                               size, MEM_COMMIT, PAGE_READWRITE);
>+        }
>+    }
>+    return (ret);
>+}
>+# endif /* JS_GC_USES_MAP_ALIGN */
>+
>+static void *
>+pages_map(void *addr, size_t size)
>+{
>+    void *ret = NULL;
>+#if defined(WINCE) && !defined(MOZ_MEMORY_WINCE6)
>+    void *va_ret;
>+    JS_ASSERT(addr == NULL);
>+    va_ret = VirtualAlloc(addr, size, MEM_RESERVE, PAGE_NOACCESS);
>+    if (va_ret)
>+        ret = VirtualAlloc(va_ret, size, MEM_COMMIT, PAGE_READWRITE);
>+    JS_ASSERT(va_ret == ret);
>+#else
>+    ret = VirtualAlloc(addr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
>+#endif
>+    return ret;
>+}
>+
>+static void
>+pages_unmap(void *addr, size_t size)
>+{
>+#if (defined(WINCE) && !defined(MOZ_MEMORY_WINCE6))
>+    if (VirtualFree(addr, 0, MEM_RELEASE))
>+        return;
>+
>+    JS_ASSERT(GetLastError() == ERROR_INVALID_PARAMETER);
>+    MEMORY_BASIC_INFORMATION info;
>+    VirtualQuery(addr, &info, sizeof(info));
>+    addr = info.AllocationBase;
>+#endif
>+
>+#ifdef DEBUG
>+    BOOL status =
>+#endif
>+        VirtualFree(addr, 0, MEM_RELEASE);
>+    JS_ASSERT(status);
>+}
>+
>+#elif defined(XP_MACOSX) || defined(DARWIN)
>+
>+static void *
>+pages_map(void *addr, size_t size)
>+{
>+    void *ret;
>+    kern_return_t err;
>+    int flags;
>+
>+    if (addr != NULL) {
>+        ret = addr;
>+        flags = 0;
>+    } else {
>+        flags = VM_FLAGS_ANYWHERE;
>+    }
>+
>+    err = vm_allocate((vm_map_t)mach_task_self(), (vm_address_t *)&ret,
>+                      (vm_size_t)size, flags);
>+    if (err != KERN_SUCCESS)
>+        return NULL;
>+
>+    JS_ASSERT((!addr && ret != addr) || (addr && ret == addr));
>+    return ret;
>+}
>+
>+static void
>+pages_unmap(void *addr, size_t size)
>+{
>+#ifdef DEBUG
>+    kern_return_t status =
>+#endif
>+        vm_deallocate((vm_map_t) mach_task_self(),
>+                      (vm_address_t) addr,
>+                      (vm_size_t) size);
>+    JS_ASSERT(status == KERN_SUCCESS);
>+}
>+
>+#elif defined(XP_UNIX) || defined(XP_BEOS)
>+
>+# if JS_GC_USES_MAP_ALIGN
>+static void *
>+pages_map_align(size_t size, size_t alignment)
>+{
>+    /*
>+     * We don't use MAP_FIXED here, because it can cause the *replacement*
>+     * of existing mappings, and we only want to create new mappings.
>+     */
>+    void *ret = mmap((void *)alignment, size, PROT_READ | PROT_WRITE,
>+                     MAP_PRIVATE | MAP_NOSYNC | MAP_ALIGN | MAP_ANON, -1, 0);
>+    if (ret == MAP_FAILED)
>+        return NULL;
>+    return ret;
>+}
>+# endif /* JS_GC_USES_MAP_ALIGN */
>+
>+static void *
>+pages_map(void *addr, size_t size)
>+{
>+    /*
>+     * We don't use MAP_FIXED here, because it can cause the *replacement*
>+     * of existing mappings, and we only want to create new mappings.
>+     */
>+    void *ret = mmap(addr, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON,
>+                     -1, 0);
>+
>+    if (ret == MAP_FAILED)
>+        return NULL;
>+    if (addr && ret != addr) {
>+        /*
>+         * We succeeded in mapping memory, but not in the right place.
>+         */
>+#ifdef DEBUG
>+        int status =
>+#endif
>+            munmap(ret, size);
>+        JS_ASSERT(status == 0);
>+        return NULL;
>+    }
>+
>+    return ret;
>+}
>+
>+static void
>+pages_unmap(void *addr, size_t size)
>+{
>+#ifdef DEBUG
>+    int ret =
>+#endif
>+        munmap(addr, size);
>+    JS_ASSERT(ret == 0);
>+}
>+
>+#endif

nit: linebreak

>+
>+
>+static void *
>+chunk_alloc_mmap(size_t size)
>+{
>+    JS_ASSERT((size & (size - 1)) == 0);
>+
>+    /*
>+     * Windows requires that there be a 1:1 mapping between VM

nit: "there is" or just "requires a 1:1"

>+     * allocation/deallocation operations.  Therefore, take care here to
>+     * acquire the final result via one mapping operation.  This means
>+     * unmapping any preliminary result that is not correctly aligned.
>+     */
>+
>+#if JS_GC_USES_MAP_ALIGN
>+    return pages_map_align(size, size);
>+#else
>+    void* ret = pages_map(NULL, size);
>+    if (ret == NULL)
>+        return NULL;
>+
>+    size_t mask = size - 1;
>+    size_t offset = reinterpret_cast<jsuword>(ret) & mask;
>+    if (offset != 0) {
>+        /* Deallocate, then try to allocate at (ret + size - offset). */
>+        pages_unmap(ret, size);
>+        ret = pages_map((void *)((uintptr_t)ret + size - offset), size);
>+        while (!ret) {
>+            /*
>+             * Over-allocate in order to map a memory region that
>+             * is definitely large enough.
>+             */
>+            ret = pages_map(NULL, size + size);
>+            if (!ret)
>+                return NULL;
>+            /*
>+             * Deallocate, then allocate the correct size, within
>+             * the over-sized mapping.
>+             */
>+            pages_unmap(ret, size + size);
>+            offset = reinterpret_cast<jsuword>(ret) & mask;
>+            if (offset == 0) {
>+                ret = pages_map(ret, size);
>+            } else {
>+                jsuword addr = reinterpret_cast<jsuword>(ret) + size - offset;
>+                ret = pages_map(reinterpret_cast<void *>(addr), size);
>+            }
>+            /*
>+             * Failure here indicates a race with another thread, so
>+             * try again.
>+             */
>+        }
>+    }
>+    return ret;
>+#endif
>+}
>+
>+namespace js {
>+
>+jsuword
>+NewGCChunk()
>+{
>+    void *p = chunk_alloc_mmap(GC_CHUNK_SIZE);
>+    if (!p)
>+        return 0;
>+
>+    jsuword chunk = reinterpret_cast<jsuword>(p);
>+    JS_ASSERT(!(chunk & GC_CHUNK_MASK));
>+    return chunk;
>+}
>+
>+void
>+DestroyGCChunk(jsuword chunk)
>+{
>+    JS_ASSERT(chunk);
>+    JS_ASSERT(!(chunk & GC_CHUNK_MASK));
>+
>+    pages_unmap(reinterpret_cast<void *>(chunk), GC_CHUNK_SIZE);
>+}
>+
>+}
>+
>Index: tm/js/src/jsgcchunk.h
>- *
>- *  +------------------------------+---------------+-------------+
>- *  | allocation area for GC thing | JSGCArenaInfo | mark bitmap |
>- *  +------------------------------+---------------+-------------+
>- *
>- * The allocation area contains GC_CELLS_PER_ARENA. We find that number as the
>- * following. Let n be the number of cells in the arena. Together with the
>- * word-aligned mark bitmap and JSGCArenaInfo they should fit the arena. Hence
>- * GC_CELLS_PER_ARENA or n_max is the maximum value of n for which the
>- * following holds:
>+ * GC memory is allocated in chunks. The size of each chunk is GC_CHUNK_SIZE.
>+ * The chunk contains an array of GC arena holding GC things, an array of

nit: arenas

>+ * the mark bitmaps for each arena, an array of JSGCArenaInfo arena
>+ * descriptors, an array of JSGCMarkingDelay descriptors, the JSGCChunkInfo
>+ * chunk descriptor and a bitmap indicating free arenas in the chunk. The
>+ * following picture demonstrates the layout:
>+ *
>+ *  +--------+--------------+-------------+----------------+------------+-------------------+
>+ *  | arenas | mark bitmaps | arena infos | marking delays | chunk info | free arena bitmap |
>+ *  +--------+--------------+-------------+----------------+------------+-------------------+

more spaces in the picture or is this just in the diff?

>+
>+/* Can only be applicable to arena where a->list is not null. */
>+inline uint8 *
>+GCArenaIndexToThing(JSGCArena *a, JSGCArenaInfo *ainfo, size_t index)
>+{
>+    JS_ASSERT(a->getInfo() == ainfo);
>+
>+    /*
>+     * We use "<=" and not "<" in the assert so index can mean the limit.
>+     * For the same reason we use "+", not "|" when finding the thing address
>+     * as the limit address can start at the next arena.
>+     */
>+    JS_ASSERT(index <= ThingsPerArena(ainfo->list->thingSize));
>+    jsuword offsetInArena = index * ainfo->list->thingSize;
>+    return reinterpret_cast<uint8 *>(a->toPageStart() + offsetInArena);
>+}
>+
>+inline JSGCThing *
>+NextThing(JSGCThing *thing, size_t thingSize)

nit: The name of the function is not very precise and the function could be used in the wrong way.
Another assert ! isStatic?

I will continue tomorrow.
(In reply to comment #21)
> What is the problem with using _aligned_malloc?

On Windows _aligned_malloc(size, alignment) is implemented via malloc(size + alignment + sizeof(void *)). Thus _aligned_malloc(alignment, alignment) would waste more than twice the memory.
Does it really? IT uses a bit extra virtual memory, but that's basically free. Only the pages we actually use get mapped, which is exactly the aligned part of that allocated block.
(In reply to comment #23)
> Does it really? 

Well, somebody having MSVC installed should look at \Program Files\Microsoft Visual Studio 8\VC\crt\src\align.c but my googling shows that is still the case.
What I meant is "does it really waste memory?". I think malloc(size + alignment + sizeof(void *)) is a reasonable way to implement this from a physical memory consumption perspective.
Attached patch v5 (obsolete) — Splinter Review
Attachment #435955 - Attachment is obsolete: true
Attachment #436205 - Flags: review?
Attachment #435955 - Flags: review?(anygregor)
Attachment #435955 - Flags: review?
Attachment #436205 - Flags: review?(gal)
Attachment #436205 - Flags: review?(anygregor)
Attachment #436205 - Flags: review?
(In reply to comment #25)
> What I meant is "does it really waste memory?". I think malloc(size + alignment
> + sizeof(void *)) is a reasonable way to implement this from a physical memory
> consumption perspective.

It depends on the values of size and alignment. What are they here?

/be
Of course only measurements will tell, but any sane malloc will not touch the mapped pages needlessly, so I would say the values is "nothing", and it will become 0 the moment the process first touches the page (which for the alignment overflow part will be "never").
(In reply to comment #27)
> (In reply to comment #25)
> > What I meant is "does it really waste memory?". I think malloc(size + alignment
> > + sizeof(void *)) is a reasonable way to implement this from a physical memory
> > consumption perspective.
> 
> It depends on the values of size and alignment. What are they here?

size == alignment == 1MB, the same values that jemalloc uses.
Comment on attachment 436205 [details] [diff] [review]
v5


>-    JSGCArenaInfo   info;
>-    jsbitmap        markBitmap[GC_ARENA_MARK_BITMAP_WORDS];
>+    uint8 data[GC_ARENA_SIZE];

I was always wondering what the indentation rules are for local vars are.
Should data be aligned to the old markBitmap?


> 
> inline void
>-SetChunkInfoIndex(jsuword chunk, unsigned index)
>+CheckValidThingPtr(void *thing)

Maybe rename this function to CheckValidGCThingPtr since static strings are valid "things"?

> 
>+inline size_t
>+ThingsPerArena(size_t thingSize)
>+{
>+    JS_ASSERT(!(thingSize & GC_CELL_MASK));
>+    JS_ASSERT(thingSize <= GC_ARENA_SIZE);
>+    return GC_ARENA_SIZE / thingSize;
>+}
>+
>+/* Can only be called if thing is on an arena where a->list is not null. */

nit "is in an arena"

> 
>         while ((a = arenaList->cursor) != NULL) {
>-            arenaList->cursor = a->info.prev;
>-            JSGCThing *freeList = a->info.freeList;
>+            JSGCArenaInfo * ainfo = a->getInfo();

nit: remove Whitespace


>-        while (a->info.unmarkedChildren != 0) {
>-            unmarkedBitIndex = JS_FLOOR_LOG2W(a->info.unmarkedChildren);
>-            a->info.unmarkedChildren &= ~((jsuword)1 << unmarkedBitIndex);
>+        while (markingDelay->unmarkedChildren != 0) {
>+            unmarkedBitIndex = JS_FLOOR_LOG2W(markingDelay->unmarkedChildren);
>+            markingDelay->unmarkedChildren &= ~((jsuword)1 << unmarkedBitIndex);

shouldn't it be jsuword(1) ?
Attachment #436205 - Flags: review?(anygregor) → review+
(In reply to comment #30)
> (From update of attachment 436205 [details] [diff] [review])
> 
> >-    JSGCArenaInfo   info;
> >-    jsbitmap        markBitmap[GC_ARENA_MARK_BITMAP_WORDS];
> >+    uint8 data[GC_ARENA_SIZE];
> 
> I was always wondering what the indentation rules are for local vars are.
> Should data be aligned to the old markBitmap?

Not sure, but C++ allows us to move declarations to just before needed, which means fewer piled up in a file (possibly with same-typed ones along ranks) at the top of the function -- which means less desire to line up declarators.

Does that help here?

/be
(In reply to comment #30)
> (From update of attachment 436205 [details] [diff] [review])
> 
> >-    JSGCArenaInfo   info;
> >-    jsbitmap        markBitmap[GC_ARENA_MARK_BITMAP_WORDS];
> >+    uint8 data[GC_ARENA_SIZE];
> 
> I was always wondering what the indentation rules are for local vars are.
> Should data be aligned to the old markBitmap?

The rule is to align all field names on a tabstop. But it is not applicable here as the patch makes data a single field in JSGCArena.
Attached patch v6 (obsolete) — Splinter Review
The new version fixes the latest nits.
Attachment #436205 - Attachment is obsolete: true
Attachment #436542 - Flags: review?
Attachment #436205 - Flags: review?(gal)
Blocks: 556589
Attachment #436542 - Flags: review? → review?(gal)
Blocks: 557538
gal: review ping
This is probably a terrible idea, for other reasons, but for what it's worth:

If one has separated the mark bits from the objects, then that means that only the allocator and GC touch them.  Thus, once the roots have been placed in a queue somewhere, the rest of the traversal can easily be done in a separate thread, if we "allocate black": new objects are marked as "visited" by the ongoing GC phase.
... oh, right.  The algorithm I had in mind also requires a write barrier, which we don't have.
(In reply to comment #36)
> ... oh, right.  The algorithm I had in mind also requires a write barrier,
> which we don't have.

We do, of course -- at a slightly higher level. JS requires read and write barriers for getters and setters. We have a write barrier for branding and a read barrier for lazy method cloning as well.

/be
Comment on attachment 436542 [details] [diff] [review]
v6

>+    jsuword getChunk() { return toPageStart() & ~GC_CHUNK_MASK; }

Nit: No same-line function implementation.

>+    void addToList(JSRuntime *rt) {
>+        prevp = &rt->gcChunkList;
>+        next = rt->gcChunkList;
>+        if (rt->gcChunkList) {
>+            JS_ASSERT(rt->gcChunkList->prevp == &rt->gcChunkList);
>+            rt->gcChunkList->prevp = &next;
>+        }
>+        rt->gcChunkList = this;

Do we need our own list implementation here or could we use js::Vector?

>-    if (a->info.unmarkedChildren != 0) {
>+    if (markingDelay->unmarkedChildren != 0) {
>         JS_ASSERT(rt->gcUnmarkedArenaStackTop);
>-        if (a->info.unmarkedChildren & bit) {
>+        if (markingDelay->unmarkedChildren & bit) {
>             /* bit already covers things with children to mark later. */
>             return;
>         }
>-        a->info.unmarkedChildren |= bit;
>+        markingDelay->unmarkedChildren |= bit;

I still really dislike this code. Another patch to remove it I guess.

Looks good. Sorry for the slow review.

r=me if you can prove to me that on windows your allocation code is faster than the builtin aligned malloc that over-allocates, or that it uses significantly less _physical_ memory.
Attachment #436542 - Flags: review?(gal) → review+
One-line inline method definition for very short functions (like one-line macros) is fine in C++ house style.

/be
(In reply to comment #37)
> (In reply to comment #36)
> > ... oh, right.  The algorithm I had in mind also requires a write barrier,
> > which we don't have.
> 
> We do, of course -- at a slightly higher level. JS requires read and write
> barriers for getters and setters. We have a write barrier for branding and a
> read barrier for lazy method cloning as well.

What parallel GC would need is a barrier protecting every reference that can keep an object alive, not just object slots. We have huge amounts of miscellanea.
Duplicate of this bug: 556589
(In reply to comment #40)
> (In reply to comment #37)
> > (In reply to comment #36)
> > > ... oh, right.  The algorithm I had in mind also requires a write barrier,
> > > which we don't have.
> > 
> > We do, of course -- at a slightly higher level. JS requires read and write
> > barriers for getters and setters. We have a write barrier for branding and a
> > read barrier for lazy method cloning as well.
> 
> What parallel GC would need is a barrier protecting every reference that can
> keep an object alive, not just object slots. We have huge amounts of
> miscellanea.

Yes, indeed. I didn't mean to make it sound like we were "done" -- all the "native" writes are the problem.

/be
(In reply to comment #38)
> >+    void addToList(JSRuntime *rt) {
> 
> Do we need our own list implementation here or could we use js::Vector?

A data structure that allows fast insertion and deletion is necessary here. Vector does not fit that pattern so the patch continue to use a custom double linked list implementation.

> r=me if you can prove to me that on windows your allocation code is faster than
> the builtin aligned malloc that over-allocates, or that it uses significantly
> less _physical_ memory.

The chunk allocation code comes directly from jemalloc. If this can be done better, we need a common bug to improve it there as well. Also consider that with jemalloc and GC allocating the same 1MB chunks the initial allocation of getting 1MB most likely will get the chunk aligned with a sane OS.
(In reply to comment #43)
> (In reply to comment #38)
> > >+    void addToList(JSRuntime *rt) {
> > 
> > Do we need our own list implementation here or could we use js::Vector?
> 
> A data structure that allows fast insertion and deletion is necessary here.

This is not true. We need a structure that allows for fast insertion and *bulk* release. So vector can do the job and would result for simpler code. I will do it in another bug.
Attached patch v7 (obsolete) — Splinter Review
The new version fixes the nits and adds chunks stats to the GC stats.
Attachment #436542 - Attachment is obsolete: true
Attachment #438490 - Flags: review+
looks like there's a crash or assertion in xpcshell testing:

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1271097483.1271099064.14971.gz&fulltext=1

find stacks there.
 0  ld-2.5.so + 0x7f2
    eip = 0x0097f7f2   esp = 0xbfb8c23c   ebp = 0xbfb8c248   ebx = 0x00002171
    esi = 0xffffffff   edi = 0x00b22ff4   eax = 0x00000000   ecx = 0x00002171
    edx = 0x00000006   efl = 0x00000206
    Found by: given as instruction pointer in context
 1  libmozjs.so!JS_Assert [jsutil.cpp:7c910d181330 : 78 + 0xb]
    eip = 0x0035314b   esp = 0xbfb8c250   ebp = 0xbfb8c268
    Found by: previous frame's frame pointer
 2  libmozjs.so!js::GCArenaReleaser::release [jsgc.cpp:7c910d181330 : 754 + 0x50]
    eip = 0x0029063b   esp = 0xbfb8c270   ebp = 0xbfb8c298   ebx = 0x0043bb74
    Found by: call frame info
 3  libmozjs.so!js::GCArenaReleaser::releaseList [jsgc.cpp:7c910d181330 : 771 + 0x18]
    eip = 0x00290707   esp = 0xbfb8c2a0   ebp = 0xbfb8c2b8   ebx = 0x0043bb74
    esi = 0x09ad7ac0   edi = 0x00000014
    Found by: call frame info
 4  libmozjs.so!FinishGCArenaLists [jsgc.cpp:7c910d181330 : 879 + 0x1a]
    eip = 0x00290025   esp = 0xbfb8c2c0   ebp = 0xbfb8c2e8   ebx = 0x0043bb74
    esi = 0x09ad7ac0   edi = 0x00000014
    Found by: call frame info
 5  libmozjs.so!js_FinishGC [jsgc.cpp:7c910d181330 : 1198 + 0xa]
    eip = 0x002900d8   esp = 0xbfb8c2f0   ebp = 0xbfb8c2f8   ebx = 0x0043bb74
    esi = 0x09ad7ac0   edi = 0x00000014
    Found by: call frame info
 6  libmozjs.so!JSRuntime::~JSRuntime [jsapi.cpp:7c910d181330 : 630 + 0xa]
Attached patch v8Splinter Review
My attempt to land this revealed a debug-only bug where the arena release method would change the arena link field in debug-only builds before the arena list enumerator read the list. The new patch fixes this.
Attachment #438490 - Attachment is obsolete: true
Attachment #438560 - Flags: review+
http://hg.mozilla.org/tracemonkey/rev/c6b51d87e959
Whiteboard: fixed-in-tracemonkey
Blocks: 559141
http://hg.mozilla.org/mozilla-central/rev/47532d9153cb
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Blocks: 560818
You need to log in before you can comment on or make changes to this bug.