Closed Bug 396007 Opened 17 years ago Closed 16 years ago

Using posix_memalign for GC arenas when mmap is not available

Categories

(Core :: JavaScript Engine, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla1.9

People

(Reporter: igor, Assigned: igor)

References

Details

Attachments

(2 files, 8 obsolete files)

Bug 392263 uses mmap/VirtualAlloc to implement the allocation of aligned GC arenas using non-executable CPU pages. When the paged allocation is not available or can not be used due to an unsuitable size of the CPU page, the code switches to over sized malloc calls to implement aligned allocation.

It would be nice to use posix_memalign instead when available and avoid malloc overhead.
With jemalloc Windows we support posix_memalign.  Linux has it.  Doesn't seem to be on Mac, but would be nice to use it where possible.
I will test/document this tomorrow.
fwiw this patch seems to work OK on Windows with jemalloc (with some forced #ifdefs)
(In reply to comment #3)
> fwiw this patch seems to work OK on Windows with jemalloc (with some forced
> #ifdefs)
> 

what is the overhead of memalign in jemalloc? In particular, does it need any headers/trailers? if so, calling the function several times like in

    memalign(&p, 2^n, 2^n);

would not be able to allocate the blocks in a sequence but would make inssted a holes of size 2^n at least between blocks. A proper way to use the function would be to decrease the allocation size:

     memalign(&p, 2^n, 2^n - implementation_overhead);
jemalloc does not use object headers, so the objects would all be adjacent, assuming no pre-existing memory fragmentation.
I think we're going to want to do larger allocations than 1k for these.  With your patch I see 328200 NewGCChunk()s vs 1500 with the defaults on Windows
(In reply to comment #6)
> I think we're going to want to do larger allocations than 1k for these.  With
> your patch I see 328200 NewGCChunk()s vs 1500 with the defaults on Windows
> 

why is it necessary bad? The amount of allocated memory for GC heap is the same in both cases. But in the memalign case the system can use for its purposes the space that is otherwise wasted due to internal fragmentation.
(In reply to comment #5)
> jemalloc does not use object headers, so the objects would all be adjacent,
> assuming no pre-existing memory fragmentation.

This is nice. Still, since posix_memalign from GLIBC does use the header for posix_memalign, I would like to support that in the patch automatically detecting header's size during compilations.  

Does a build where posix_memalign comes from jemalloc defines any special preprocessor macros?
(In reply to comment #3)
> fwiw this patch seems to work OK on Windows with jemalloc (with some forced
> #ifdefs)

But how MSVC can find a definition of posix_memalign??? The patch does not try to include any extra headers assuming that posix_memalign comes from stdlib.c. Or does MSVC comes with posix_memalign defined in that header?
Attached patch v2 (obsolete) — Splinter Review
The new version passes the tests when compiled with -DJS_GC_USE_MMAP=0 on Linux. To enable posix_memalign on a Windows build with jemalloc, make sure that the following defines are passed to the compiler when it compiles jsgc.c:

 -DJS_GC_USE_MMAP=0 -DJS_HAS_POSIX_MEMALIGN=1 -DJS_GC_AREANA_PAD=0

To finish the patch I need to set JS_HAS_POSIX_MEMALIGN and JS_GC_AREANA_PAD automatically on Windows when jemalloc and its posix_memalign is available. Does jsemalloc provides any preprocessor macros for that?
Attachment #304305 - Attachment is obsolete: true
Attached patch v3 (obsolete) — Splinter Review
The new patch detects jemalloc build via checking MOZ_MEMORY and enables posix_memalign support if so. Since a default posix_memalign from GLIBC suffers from the problem in comment 4, the patch introduces JS_GC_ARENA_PAD to ask for less memory than the alignment requirements. With jemalloc it is set to 0, otherwise JS_GC_ARENA_PAD it is 2 words, the size that GLIBC needs.

The patch also uses 4K arenas with all the allocation modes. Small arenas apparently hurt the performance with posix_memalign from jemalloc.

To enable posix_memalign allocations with the patch, make sure that JS_GC_USE_MMAP is set to 0 when compiling SpiderMonkey. I.e. like passing -DJS_GC_USE_MMAP=0 to the compiler.
Attachment #304446 - Attachment is obsolete: true
Attachment #304559 - Flags: review?(brendan)
If testing shows that jemalloc is better for the browser, then code can be changed to favor posix_memalign over mmap when MOZ_MEMORY is defined.
Attached patch v4 (obsolete) — Splinter Review
The new version optimizes more for the case when GC arena == GC chunk and excludes unnecessary code in this case.
Attachment #304559 - Attachment is obsolete: true
Attachment #304565 - Flags: review?(brendan)
Attachment #304559 - Flags: review?(brendan)
With a default allocator from GLIBC enabling posix_memalign on Linux leads to 2.5% improvement in SunSpider benchmark with optimized thread-safe build of js shell. This is for Fedora-9 alpha on Pentium-M 1.1GHz laptop.
Attached patch v5 (obsolete) — Splinter Review
The new version prioritize posix_memalign over mmap when both are available. This is based on the benchmark results from Linux. It would be interesting to know  what happens on Windows.
Attachment #304565 - Attachment is obsolete: true
Attachment #304742 - Flags: review?(brendan)
Attachment #304565 - Flags: review?(brendan)
Attached patch v6 (obsolete) — Splinter Review
The previous version was over-engineered. The new version always uses posix_memalign when it is available and the user does not insist on mmap with explicit JS_GC_USE_MMAP set to 1.
Attachment #304742 - Attachment is obsolete: true
Attachment #304758 - Flags: review?(brendan)
Attachment #304742 - Flags: review?(brendan)
Flags: blocking1.9+
Target Milestone: --- → mozilla1.9beta4
What is the win (SunSpider or any other test we use) on Windows with jemalloc and this patch?

/be
Comment on attachment 304758 [details] [diff] [review]
v6

>+ * When it is is true, a platform-dependant function like mmap is used to get
>+ * memory aligned on CPU page boundary. If the macro is false or not defined,
>+ * posix_memalign is used when available. Otherwise the code uses over-
>+ * sized malloc to allocate a chunk with js_gcArenasPerChunk aligned arenas.
>+ * The approximate space overhead of this is 1/js_gcArenasPerChunk. For
>+ * details, see NewGCChunk/DestroyGCChunk below.

Please rewrap to avoid ragged right, oversized (one word) fits on one line.

>+     * Implement the chunk allocation using over sized malloc if mmap or

s/over sized/oversized/ and any others like these.

r/a=me, looks good -- anticipate Windows win too (may have heard about it, didn't see it reported here).

/be
Attachment #304758 - Flags: review?(brendan)
Attachment #304758 - Flags: review+
Attachment #304758 - Flags: approval1.9+
(In reply to comment #17)
> What is the win (SunSpider or any other test we use) on Windows with jemalloc
> and this patch?

Robert has mentioned on IRC that with the initial version of the patch that has used 1 K GC arenas things were slower by couple percent, with bigger arenas the speedup was 2%. According to this the last version of the patch uses 4K arenas in all arena allocation modes.

   
Attached patch v6b (obsolete) — Splinter Review
The new version addresses the nits from comment 18.
Attachment #304758 - Attachment is obsolete: true
Attachment #305345 - Flags: review+
Attachment #305345 - Flags: approval1.9+
Attached patch v7 (obsolete) — Splinter Review
The landed patch from bug 400902 added expected heavy conflicts with this patch. While resolving them I took opportunity to improve the patch when it is known at the compile time that GC arena == chunk.  Now the code uses bitfields in JSGCArenaInfo only if a chunked allocation is necessary.
Attachment #305345 - Attachment is obsolete: true
Comment on attachment 305852 [details] [diff] [review]
v7

The last version has passed both the test suite and mochi tests.
Attachment #305852 - Flags: review?(brendan)
The last patch compiles on Mac and probably on Windows - the try server http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaTry would have already stopped in the case of compilation problems on Windows as SpiderMonkey compiles early during the compilation stage.
Comment on attachment 305852 [details] [diff] [review]
v7

>+int
>+posix_memalign(void **memptr, size_t alignment, size_t size);

Minimal declaration to get Windows happy, but house style favors extern.

>+ * Declare the headers for mmap unless we have posix_memalign and do not insist

s/Declare/Include/

>+ * memory aligned on CPU page boundary. If the macro is false or not defined,

s/boundary/boundaries/
s/not defined/undefined/

This should still fit in the given tw=78.

>+ * posix_memalign is used when available. Otherwise the code uses oversized
>+ * malloc to allocate a chunk with js_gcArenasPerChunk aligned arenas.

Suggest "... uses malloc to over-allocate a chunk ...".

>+ * The code also allocates arenas in chunks when JS_GC_USE_MMAP is true to

s/true/1/

>+ * the runtime.

Strike "the".

>+ * JS_GC_ARENA_PAD defines the number of bytes to pad JSGCArenaInfo structure.
>+ * It is used to improve allocation efficiency when using posix_memalign. If
>+ * its implementation uses internal headers, then calling

s/its/malloc's/

>+ * due to the need to fit headers. JS_GC_ARENA_PAD allows to mitigate that

s/allows to mitigate/allows us to mitigate/

or (better) just s/allows to mitigate/mitigates/

>+ * The structure is stored in one of chunk's free arenas. GET_CHUNK_INFO_INDEX

"Structure stored in one of a chunk's free arenas."

>+ * macro gives the index of this arena. When all arenas in the chunk are used,

strike "macro" as the lack of definite article "The" before "GET_CHUNK_INFO_INDEX" makes it awkward, and it's arguably deadwood in any case -- the rest of the para should rewrap well.

>+     * Implement the chunk allocation using oversized malloc if mmap or
>+     * posix_memalign are not available.

s/the chunk/chunk/
s/or/and/

>      *
>      * Since malloc allocates pointers aligned on the word boundary, to get
>      * js_gcArenasPerChunk aligned arenas, we need to malloc only
>      *   ((js_gcArenasPerChunk + 1) << GC_ARENA_SHIFT) - sizeof(size_t)
>      * bytes. But since we stores the gap between the malloced pointer and the
>      * first arena in the chunk after the chunk, we need to ask for
>      *   ((js_gcArenasPerChunk + 1) << GC_ARENA_SHIFT)
>      * bytes to ensure that we always have room to store the gap.
>      */

Again blank comment lines on either side of excerpted code, relations, and formulae would be nice.

>     p = malloc((js_gcArenasPerChunk + 1) << GC_ARENA_SHIFT);
>-    if (!p)
>-        return 0;
>-    chunk = ((jsuword) p + GC_ARENA_MASK) & ~GC_ARENA_MASK;
>-    *GetMallocedChunkGapPtr(chunk) = (uint32) (chunk - (jsuword) p);
>-    return chunk;
>+    if (p) {
>+        jsuword chunk;
>+
>+        chunk = ((jsuword) p + GC_ARENA_MASK) & ~GC_ARENA_MASK;
>+        *GetMallocedChunkGapPtr(chunk) = (uint32) (chunk - (jsuword) p);
>+        return chunk;
>+    }

Previous code followed prevailing style by casting out (unlikely) error with early return and well-predicted (IIRC) branch around then clause. Just wondering why this change. Is it better for prefetching?

Looks good, r/a+ with nits picked as appropriate.

/be
Attachment #305852 - Flags: review?(brendan)
Attachment #305852 - Flags: review+
Latest patch looks good on Windows.  I tested it and don't see any real perf change on Windows but should be a little better long term.
Attached patch v8Splinter Review
The new version of the patch addresses the nits from the comment 24.

> Previous code followed prevailing style by casting out (unlikely) error with
early return and well-predicted (IIRC) branch around then clause.

I wanted to avoid declaring "chunk" at the top of the function as it would require an ifdef around it as only oversized malloc needs it. In the new version I have changed the code to follow the early-return-on-error pattern and used an explicit block to declare "chunk":

     p = malloc((js_gcArenasPerChunk + 1) << GC_ARENA_SHIFT);
-    if (p) {
+    if (!p)
+        return 0;
+
+    {
         jsuword chunk;
 
         chunk = ((jsuword) p + GC_ARENA_MASK) & ~GC_ARENA_MASK;
         *GetMallocedChunkGapPtr(chunk) = (uint32) (chunk - (jsuword) p);
         return chunk;
     }
 #endif
-    return 0;
Attachment #305852 - Attachment is obsolete: true
Attachment #305957 - Flags: review+
Attachment #305957 - Flags: approval1.9+
Moving bugs that aren't beta 4 blockers to target final release.
Target Milestone: mozilla1.9beta4 → mozilla1.9
Attachment #305957 - Flags: approval1.9b4?
Please land this and close it.  Requesting approval for b4.
Flags: tracking1.9+ → blocking1.9+
Priority: -- → P2
Attachment #305957 - Flags: approval1.9b4? → approval1.9b4+
I checked in the patch from comment 26 to the trunk:

http://bonsai.mozilla.org/cvsquery.cgi?module=PhoenixTinderbox&branch=HEAD&cvsroot=%2Fcvsroot&date=explicit&mindate=1204474380&maxdate=1204474414&who=igor%25mir2.org
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Version: unspecified → Trunk
(In reply to comment #18)
> (From update of attachment 304758 [details] [diff] [review])
> >+ * When it is is true, a platform-dependant function like mmap is used to get

"it is true", "dependent". I'm shocked that brendan didn't complain.
(In reply to comment #30)
> (In reply to comment #18)
> > (From update of attachment 304758 [details] [diff] [review] [details])
> > >+ * When it is is true, a platform-dependant function like mmap is used to get
> 
> "it is true", "dependent". I'm shocked that brendan didn't complain.

Thanks for spotting this, I will file a "regression" bug about this".
Blocks: 420593
No longer blocks: 420593
Depends on: 420593
Depends on: 421236
Flags: in-testsuite-
Flags: in-litmus-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: