Closed Bug 696162 Opened 13 years ago Closed 12 years ago

Fix jsgcchunk's AllocGCChunk to be more efficient and to avoid potential problems on Mac 10.7

Categories

(Core :: JavaScript Engine, defect)

Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla12
Tracking Status
firefox12 --- fixed

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

References

Details

(Whiteboard: [qa-])

Attachments

(1 file, 4 obsolete files)

In bug 694896, we're fixing jemalloc's chunk allocator so that it works on 10.7.

The js engine uses basically the same code to allocate chunks, so whatever fixes we take there are likely to apply to js as well.  These fixes only apply to Linux and Mac; they don't apply to Windows.

It's not clear whether the hang in bug 694896 applies to the js engine, since js uses vm_allocate, while jemalloc uses mmap.  But unless the changes slow jemalloc down, we should probably make them in js too, to try to avoid the hang and to try to speed up chunk allocation.
I'm curious if you guys know why we use vm_allocate rather than mmap on mac.  If it's better, we could switch jemalloc to vm_allocate.
Assignee: general → justin.lebar+bug
Attached patch WIP v1 (obsolete) — Splinter Review
I was hoping to see some motion on our benchmarks, but I didn't.  Anyway we should probably still do this, because it avoids the potential infinite loop in AllocChunk on Linux and Mac (recall that the equivalent loop in jemalloc was causing us to spin on 10.7).
(In reply to Justin Lebar [:jlebar] from comment #4)
> I was hoping to see some motion on our benchmarks, but I didn't.  Anyway we
> should probably still do this, because it avoids the potential infinite loop
> in AllocChunk on Linux and Mac (recall that the equivalent loop in jemalloc
> was causing us to spin on 10.7).

If the benchmark shows a win in bug 697970, the GC would not need to use aligned chunks and could use straightforward mmap calls without alignment loops.
Pushed a new patch to try which implements a Windows optimization, per Terrence's suggestion in another bug:

On Windows (previously, everywhere), we do the following:

 1. map 2mb.
 2. unmap the whole 2mb.
 3. map 1mb inside where the 2mb was, aligned to a chunk boundary.
 4. If (3) didn't succeed, presumably because another thread jumped in between (2) and (3) and mapped into the space we were planning to use, then start over from (1).

Terrence's insight was that in step (1), we don't need to commit any memory, since we're going to immediately unmap it anyway.  We'll see if this has any effect on benchmark scores, but I doubt it will.
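For reference, here's roughly what that dance looks like in terms of the raw Win32 calls.  This is just an illustrative sketch (AllocAlignedChunk and the constants are placeholder names, not the actual jsgcchunk code):

#include <windows.h>
#include <stdint.h>

static const size_t ChunkSize = 1 << 20;   /* 1mb GC chunks */
static const size_t ChunkMask = ChunkSize - 1;

static void *
AllocAlignedChunk()
{
    while (true) {
        /* (1) Reserve, but don't commit, 2mb; we only want an address. */
        void *over = VirtualAlloc(NULL, ChunkSize * 2, MEM_RESERVE, PAGE_NOACCESS);
        if (!over)
            return NULL;

        /* (2) Unmap the whole reservation. */
        VirtualFree(over, 0, MEM_RELEASE);

        /* (3) Map 1mb at the first chunk boundary inside the old reservation. */
        uintptr_t addr = (reinterpret_cast<uintptr_t>(over) + ChunkMask) & ~ChunkMask;
        void *p = VirtualAlloc(reinterpret_cast<void *>(addr), ChunkSize,
                               MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        if (p)
            return p;

        /* (4) Another thread grabbed the range between (2) and (3); retry. */
    }
}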

It may also be worth experimenting with an optimistic/pessimistic scheme on Windows similar to what this patch implements for *nix.  If we're feeling optimistic (because optimistic was the right choice last time), allocate a 1mb chunk.  If it's aligned, great, use it.  If not, fall back to the slow algorithm above.

https://tbpl.mozilla.org/?tree=Try&rev=ceb3d91fec2d
Attached patch Untested WIP v2 (obsolete) — Splinter Review
Attachment #573075 - Attachment is obsolete: true
Try run for 038665fdaad7 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=038665fdaad7
Results (out of 2 total builds):
    failure: 2
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-038665fdaad7
Comment on attachment 573221 [details] [diff] [review]
Untested WIP v2

Not a review, but a few quick comments to tag some bugs I noticed that might impact the test results.

>- * vim: set ts=4 sw=4 et tw=99 ft=cpp:
>+ * vim: set ts=4 sw=4 sts=4 et tw=70 ft=cpp:

Was this intentional?

>+/* Last time we tried to allocate a chunk, did mmap immediately return
>+ * a pointer with the correct alignment? */
>+static bool sLastMmapWasUnaligned = false;

If you are going to use a global variable, common to multiple ifdef sections, just put it at the top of the file.  BUT, as it is just used to select a fast path, it's not clear that passing a value between cores is going to be a net win over just doing 3 syscalls in all cases.  We will need to see how this stacks up on a many-threaded allocation micro-benchmark.  Also note that even though reads and writes to bool will be atomic, you still need to ensure that all possible R/W orderings make sense in the case you use it in.  (I didn't check this.)

>-    p = MapPages(NULL, ChunkSize);
>+
>+    if (ChunkAddrToOffset(p) == 0) {
>+        /* Optimistic case worked! */
>+        return p;
>+    }

Update sLastMmapWasUnaligned? (and elsewhere in this function)

>+    sLastMmapWasUnlaligned = true;
>+    UnmapPages(p, ChunkSize);
>+    p = MapPages(FindChunkStart(p), ChunkSize);

I think you want FindChunkStart(p + ChunkSize) here, if you are looking for the next aligned chunk.

>+    if (sLastMmapWasUnlaligned) {

Please at least compile test before posting patches?
> Please at least compile test before posting patches?

It's sometimes useful for me to post work-in-progress patches to Bugzilla so that I can pass the patches around to my various VMs for testing.  I try to be explicit when patches are a work in progress so that people aren't surprised when they don't work, but I'm sorry if I wasn't clear enough in this case.

> >+ * vim: set ts=4 sw=4 sts=4 et tw=70 ft=cpp:
> Was this intentional?

Yes; the file doesn't have width-99 comments.

> If you are going to use a global variable, common to multiple ifdef sections, just put 
> it at the top of the file.  BUT, as it is just used to select a fast path, it's not 
> clear that passing a value between cores is going to be a net win over just doing 3 
> syscalls in all cases.

Note that although this global variable is common to two ifdef sections, it's not common to all of them.  But let's hold off on nits until I actually finish this patch, if you don't mind.

jemalloc uses thread-local storage for this value, but before I ported that here, I wanted to ask about whether locks are held while allocating gc chunks, etc.  It sounds like there's no locking at all, so we should just use TLS here.

> Update sLastMmapWasUnaligned? (and elsewhere in this function)

In the chunk you cited, the code is run only if sLastMmapWasUnaligned is false, so I don't think I need to update the variable.  I'm not sure where else you're thinking sLastMmapWasUnaligned should be updated.

> >+    p = MapPages(FindChunkStart(p), ChunkSize);
> I think you want FindChunkStart(p + ChunkSize) here, if you are looking for the next aligned 
> chunk.

FindChunkStart rounds up, right?

static inline void *
FindChunkStart(void *p)
{
    jsuword addr = reinterpret_cast<jsuword>(p);
    addr = (addr + ChunkMask) & ~ChunkMask;
    return reinterpret_cast<void *>(addr);
}
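(So with the 1mb chunks discussed above, ChunkMask is 0xFFFFF, and e.g. FindChunkStart(0x1234567) gives 0x1300000 -- the next chunk boundary at or above p, not the one below it.)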
Attachment #573221 - Attachment description: WIP v2 → Untested WIP v2
(In reply to Justin Lebar [:jlebar] from comment #10)
> > Please at least compile test before posting patches?
> 
> It's sometimes useful for me to post work-in-progress patches to Bugzilla so
> that I can pass the patches around to my various VMs for testing.  I try to
> be explicit when patches are a work in progress so that people aren't
> surprised when they don't work, but I'm sorry if I wasn't clear enough in
> this case.

Interesting; I think most people use WIP patches to communicate what they are working on earlier than a full review, so that they can get unofficial feedback on their basic direction.  That said, even if you are using it for remote storage, I would still recommend compile testing so that you don't have to manually fix things up on all your VMs.

> > >+ * vim: set ts=4 sw=4 sts=4 et tw=70 ft=cpp:
> > Was this intentional?
> 
> Yes; the file doesn't have width-99 comments.

SM style is <100 chars per line.

> > If you are going to use a global variable, common to multiple ifdef sections, just put 
> > it at the top of the file.  BUT, as it is just used to select a fast path, it's not 
> > clear that passing a value between cores is going to be a net win over just doing 3 
> > syscalls in all cases.
> 
> Note that although this global variable is common to two ifdef sections,
> it's not common to all of them.  But let's hold off on nits until I actually
> finish this patch, if you don't mind.
> 
> jemalloc uses thread-local storage for this value, but before I ported that
> here, I wanted to ask about whether locks are held while allocating gc
> chunks, etc.  It sounds like there's no locking at all, so we should just
> use TLS here.
> 
> > Update sLastMmapWasUnaligned? (and elsewhere in this function)
> 
> In the chunk you cited, the code is run only if sLastMmapWasUnaligned is
> false, so I don't think I need to update the variable.  I'm not sure where
> else you're thinking sLastMmapWasUnaligned should be updated.

Ah yes, you are right.  Generally, booleans should express the positive case, so that setting them to false is not a double negative.  E.g. "aligned" is easier to parse than "not unaligned".
 
> > >+    p = MapPages(FindChunkStart(p), ChunkSize);
> > I think you want FindChunkStart(p + ChunkSize) here, if you are looking for the next aligned 
> > chunk.
> 
> FindChunkStart rounds up, right?

I assumed you had recycled Chunk::fromAddress here, sorry.
 
> static inline void *
> FindChunkStart(void *p)
> {
>     jsuword addr = reinterpret_cast<jsuword>(p);
>     addr = (addr + ChunkMask) & ~ChunkMask;
>     return reinterpret_cast<void *>(addr);
> }
> SM style is <100 chars per line.

Is SM style to have 100-character-long lines in comments?  I don't see that in jsapi.h.

> E.g. "aligned" is easier to parse than "not unaligned".

Agreed; I'll change it.
Spidermonkey style says "Comments should fit within 80 columns".  Code is 99 columns.

https://wiki.mozilla.org/JavaScript:SpiderMonkey:C%2B%2B_Coding_Style
Do we have TLS plumbing hiding somewhere in the JS engine?
Try run for ceb3d91fec2d is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=ceb3d91fec2d
Results (out of 62 total builds):
    success: 61
    warnings: 1
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-ceb3d91fec2d
(In reply to Justin Lebar [:jlebar] from comment #6)
> It may also be worth experimenting with an optimistic/pessimistic scheme on
> Windows similar to what this patch implements for *nix.  If we're feeling
> optimistic (because optimistic was the right choice last time), allocate a
> 1mb chunk.  If it's aligned, great, use it.  If not, fall back to the slow
> algorithm above.
Could you instead allocate 2MiB, check if it's aligned, and if it is return two allocated chunks? That would cut out the 'unmap the unaligned 1MiB, map 2MiB' fallback step, but I have no idea if it's possible in the current code. It would also be worse if 2MiB is significantly less likely to be aligned than 1MiB, I guess.
> Could you instead allocate 2MiB, check if it's aligned, and if it is return two allocated chunks? 

This would be easy to do in the chunk allocation code.  I can't speak to whether this would be easy for the JS engine to use.

> It would also be worse if 2MiB is significantly less likely to be aligned than 1MiB, I guess.

I'm not sure what you mean.  But in order to do this, you'd have to be able to return either 1MB or 2MB from the function.
Try run for bceea3535d74 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=bceea3535d74
Results (out of 62 total builds):
    success: 58
    warnings: 4
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-bceea3535d74
Attached patch WIP v3 (obsolete) — Splinter Review
Still need to add thread-local storage.
Attachment #573221 - Attachment is obsolete: true
(In reply to Justin Lebar [:jlebar] from comment #17)
> > It would also be worse if 2MiB is significantly less likely to be aligned than 1MiB, I guess.
> 
> I'm not sure what you mean.
Well, I'm not sure how mapping works on the OS level. But if RAM (address space?) is fragmented such that there's lots of unaligned 2MiB holes, then asking for 1MiB might return an aligned section but asking for 2MiB will never do so.
Ah, I think I understand what you're saying.  You're comparing the following two algorithms?

Algorithm A:
  A.1 Map 1mb.  If that's aligned, return it.  Otherwise, continue to step A.2.
  A.2 Map 2mb.  Unmap it, then try to map 1mb inside.

Algorithm B:
  B.1 Map 2mb.  If it's aligned, return the full 2mb.  Otherwise, continue to step B.2.
  B.2 Unmap it, then try to map 1mb inside.

Now that I write it out, a difficulty with algorithm B is that JS kind of has to treat this as one chunk, not two, since when it unmaps the 2mb chunk, it has to do so all at once.  (Windows requires that there be a 1:1 mapping between map and unmap operations.)
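For concreteness, here's roughly what Algorithm A looks like with plain mmap/munmap on *nix.  This is an illustrative sketch with made-up names, not the actual patch:

#include <sys/mman.h>
#include <stdint.h>

static const size_t ChunkSize = 1 << 20;
static const size_t ChunkMask = ChunkSize - 1;

static void *
AllocChunkAlgorithmA()
{
    /* A.1: optimistically map one chunk and hope it comes back aligned. */
    void *p = mmap(NULL, ChunkSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);    /* MAP_ANONYMOUS on some systems */
    if (p == MAP_FAILED)
        return NULL;
    if ((reinterpret_cast<uintptr_t>(p) & ChunkMask) == 0)
        return p;

    /* A.2: unaligned; map two chunks' worth and trim it down to one aligned
     * chunk.  No retry loop is needed, because any 2mb mapping contains an
     * aligned 1mb chunk. */
    munmap(p, ChunkSize);
    void *big = mmap(NULL, ChunkSize * 2, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    if (big == MAP_FAILED)
        return NULL;

    char *base = static_cast<char *>(big);
    char *aligned = reinterpret_cast<char *>(
        (reinterpret_cast<uintptr_t>(base) + ChunkMask) & ~static_cast<uintptr_t>(ChunkMask));

    /* Unmap the slop before and after the chunk we're keeping. */
    if (aligned != base)
        munmap(base, aligned - base);
    size_t tail = (base + ChunkSize * 2) - (aligned + ChunkSize);
    if (tail)
        munmap(aligned + ChunkSize, tail);
    return aligned;
}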
Aah, fair enough then (annoying limitation though).
Try run for 37d374ec9db0 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=37d374ec9db0
Results (out of 278 total builds):
    success: 232
    warnings: 44
    failure: 2
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-37d374ec9db0
Try run for fe46d3bea0cf is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=fe46d3bea0cf
Results (out of 62 total builds):
    success: 33
    warnings: 27
    failure: 2
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-fe46d3bea0cf
Try run for d736254a4fc0 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=d736254a4fc0
Results (out of 47 total builds):
    exception: 1
    success: 24
    warnings: 20
    failure: 2
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-d736254a4fc0
Oh, it's failing on Windows XP because __declspec(thread) doesn't work there with DLLs.  Awesome.

I'll see if the Windows API is any better here.

http://msdn.microsoft.com/en-us/library/9w1sdazb%28v=vs.80%29.aspx (see community content at the bottom.  I also debugged the crash locally.)
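(For the record, the explicit TLS API -- TlsAlloc/TlsGetValue/TlsSetValue -- should be usable from DLLs on XP.  A minimal sketch of storing the hint that way, with names and error handling simplified for illustration:)

#include <windows.h>

static DWORD sAlignedHintTlsIndex = TLS_OUT_OF_INDEXES;

/* Call once at startup, e.g. from the module init path. */
static bool
InitAlignedHintTLS()
{
    sAlignedHintTlsIndex = TlsAlloc();
    return sAlignedHintTlsIndex != TLS_OUT_OF_INDEXES;
}

static bool
LastAllocWasAligned()
{
    /* TLS slots start out as NULL, so NULL doubles as "assume aligned". */
    return TlsGetValue(sAlignedHintTlsIndex) == NULL;
}

static void
SetLastAllocWasAligned(bool aligned)
{
    TlsSetValue(sAlignedHintTlsIndex, aligned ? NULL : reinterpret_cast<LPVOID>(1));
}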
Try run for 46c8c24deef0 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=46c8c24deef0
Results (out of 62 total builds):
    success: 59
    warnings: 1
    failure: 2
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-46c8c24deef0
Attached patch Patch v1 (obsolete) — Splinter Review
Attachment #576249 - Flags: review?(igor)
Attachment #573323 - Attachment is obsolete: true
(In reply to Justin Lebar [:jlebar] from comment #28)
> Created attachment 576249 [details] [diff] [review] [diff] [details] [review]
> Patch v1

I do not see why you need a thread-local to store the flag if the last chunk was aligned. For example, do you have data that show that if another thread gets an aligned chunk, then this would not influence the chance to get an aligned chunk on this thread?
(In reply to Igor Bukanov from comment #29)
> (In reply to Justin Lebar [:jlebar] from comment #28)
> > Created attachment 576249 [details] [diff] [review] [diff] [details] [review] [diff] [details] [review]
> > Patch v1
> 
> I do not see why you need a thread-local to store the flag if the last
> chunk was aligned. For example, do you have data that show that if another
> thread gets an aligned chunk, then this would not influence the chance to
> get an aligned chunk on this thread?

Another observation is that, without cooperation with jemalloc and other mmap users in the browser, trying to predict the result of the mmap may backfire. Also, a partial unmap on Linux/Mac may in the longer run harm the kernel page allocator.

So I would prefer to see a patch that always tries to allocate 1 MB chunks first, as it was before. Then in another bug we should measure how often that fails and what heuristics can help improve on that.
> I do not see why you need a thread-local to store the flag if the last chunk was 
> aligned.

It's not clear to me that tls is necessary here.  I'm just matching the jemalloc code.

The three options for storing this boolean are:

* Global boolean synchronized with locks or atomic ops.  With no evidence, this seems like overkill to me.

* Global boolean without locks.  This seems like a reasonable thing to do, since it's just a hint.  Terrence suggested earlier that he's concerned about the overhead of extra cache coherency traffic of this compared to TLS, but I suspect that, in a situation where we're allocating so many blocks, this would easily be lost in the noise.

* TLS.  Like you suggested, it seems to me that the OS probably serializes mmap calls, so TLS doesn't make a lot of sense here except as a performance optimization.


> Another observation is that, without cooperation with jemalloc and other mmap users in the 
> browser, trying to predict the result of the mmap may backfire.

Agreed.  Although I'm not sure whether we'd hit these pathological cases in practice.  The case that really matters is when we're in a tight loop allocating lots of blocks, and in that case, the hint should be reasonably accurate.

None of the changes I've made here have shown up in any benchmark, leading me to believe that this code isn't so critical to performance, at least for our common benchmarks.  My main concern is to get rid of the loop on *nix, where it's been known to be a problem.  So if you think we should diverge from jemalloc and get rid of the optimistic/pessimistic prediction code, I'm happy to do that.
(In reply to Justin Lebar [:jlebar] from comment #31)
> > I do not see why you need a thread-local to store the flag if the last chunk was 
> > aligned.
> 
> It's not clear to me that tls is necessary here.  I'm just matching the
> jemalloc code.

Let's not copy unclear parts of jemalloc's heuristics, which are tuned for malloc rather than GC allocation patterns, into the GC. Besides, if we really need to match jemalloc, we should call the jemalloc chunk allocator code directly when compiled with MOZ_MEMORY.

> * Global boolean without locks.  This seems like a reasonable thing to do,
> since it's just a hint.  Terrence suggested earlier that he's concerned
> about the overhead of extra cache coherency traffic of this compared to TLS,
> but I suspect that, in a situation where we're allocating so many blocks,
> this would easily be lost in the noise.

The idea of that heuristic is to minimize the number of system calls. Using a thread-local or global flag is not about optimizing access to the flag.  Rather, it is about using different heuristics that could save a few rather expensive system calls. Compared with that, the cache coherency implications should be noise.

> * TLS.  Like you suggested, it seems to me that the OS probably serializes
> mmap calls, so TLS doesn't make a lot of sense here except as a performance
> optimization.

The chunk allocation pattern for a particular thread is very different from the pattern seen from the global point of view. I want to emphasize again that this is not about a thread-local optimization.

> So if you think we should diverge
> from jemalloc and get rid of the optimistic/pessimistic prediction code, I'm
> happy to do that.

This is exactly what I suggest. I.e. let's copy fixes for the bugs (like avoiding that loop on mmap-equipped platforms) and copy clear improvements (like not committing memory on Windows when trying the oversized allocation), but leave out the unclear optimizations.
Try run for a96c51d86b4b is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=a96c51d86b4b
Results (out of 249 total builds):
    exception: 4
    success: 222
    warnings: 20
    failure: 3
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jlebar@mozilla.com-a96c51d86b4b
Comment on attachment 576249 [details] [diff] [review]
Patch v1

The proposed changes have landed as a part of another bug.
Attachment #576249 - Attachment is obsolete: true
Attachment #576249 - Flags: review?(igor)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Could you please link me to the bug?  I don't see relevant changes in inbound.
Comment on attachment 576249 [details] [diff] [review]
Patch v1

Re-opening and re-requesting review, because I can't find any evidence that this was fixed.  There's been only one change to jsgcchunk.cpp since October, and it was cosmetic.

Igor, maybe you closed the wrong bug?
Attachment #576249 - Flags: review?(igor)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch Patch v2Splinter Review
Attachment #590982 - Flags: review?(igor)
Patch v2 gets rid of TLS, per comment 32.
Comment on attachment 590982 [details] [diff] [review]
Patch v2

Review of attachment 590982 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, I thought this was landed some time ago. r+ with the comments below addressed.

::: js/src/jsgcchunk.cpp
@@ +318,2 @@
>      void *p;
> +    bool wasAligned;

This variable is no longer used.

@@ +376,5 @@
> +     * method.
> +     *
> +     * Notice that we have to unmap before we remap, due to Windows's
> +     * restriction that there be a 1:1 mapping between VM alloc and dealloc
> +     * operations. */

Nit: style for multi-line comment is always to put opening /* and closing */ on their own lines and stay within 78 char per line in the comment text.
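I.e., reflowed to something like this (illustrative; not necessarily the exact final wording):

    /*
     * Notice that we have to unmap before we remap, due to Windows's
     * restriction that there be a 1:1 mapping between VM alloc and dealloc
     * operations.
     */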

@@ +395,5 @@
> +inline static void *
> +AllocChunkSlow()
> +{
> +    /* Map space for two chunks, then unmap around the result so we're left with
> +     * space for one chunk. */

Nit: fix comment style per above

@@ +397,5 @@
> +{
> +    /* Map space for two chunks, then unmap around the result so we're left with
> +     * space for one chunk. */
> +
> +    char *p = reinterpret_cast<char*>(MapPages(NULL, ChunkSize * 2));

If MapPages returns void *, the cast here should be static_cast.

@@ +431,5 @@
> +     * we called AllocChunk() on this thread, the fast path would have failed.
> +     * But here in the js engine, we always try the fast path before falling
> +     * back to the slow path, because it's not clear that jemalloc's heuristic
> +     * is helpful to us.
> +     */

Fix comment style

@@ +447,5 @@
> +
> +    /* We allocated a chunk, but not at the correct alignment.  Try to extend
> +     * the tail end of the chunk and then unmap the beginning so that we have an
> +     * aligned chunk.  If that fails, do the slow version of AllocChunk. */
> +

Fix comment style

@@ +457,5 @@
> +        return p;
> +    }
> +
> +    /* Extension failed.  Clean up, then revert to the reliable-but-expensive
> +     * method. */

Fix comment style
Attachment #590982 - Flags: review?(igor) → review+
(In reply to Justin Lebar [:jlebar] from comment #40)
> Thanks, Igor.
> 
> https://hg.mozilla.org/integration/mozilla-inbound/rev/e7fa7c10803e

This still contains "* */" as a comment terminator at https://hg.mozilla.org/integration/mozilla-inbound/rev/e7fa7c10803e#l1.157. Could you land a quick fix for that?
Backed out on inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fd472718b66e

because of a build error on Win64:
https://tbpl.mozilla.org/php/getParsedLog.php?id=8790765&tree=Mozilla-Inbound
e:/builds/moz2_slave/m-in-w64/build/js/src/jsgcchunk.cpp(74) : error C2061: syntax error : identifier 'uint32'
That should be uint32_t these days.  I'm a bit surprised that built anywhere, actually.  I'd be interested in knowing what header was providing a definition of that type, because nothing should be, and having a definition is a bit dangerous for underlying-type confusion.
https://hg.mozilla.org/integration/mozilla-inbound/rev/04b4225269a4, but I failed to change the reinterpret_casts to static_casts.  :-/  I'll make the change once I confirm that this push stuck.
static_cast followup (hopefully the last followup for this bug!)
https://hg.mozilla.org/integration/mozilla-inbound/rev/0a4bc12a0d98
https://hg.mozilla.org/mozilla-central/rev/04b4225269a4
https://hg.mozilla.org/mozilla-central/rev/0a4bc12a0d98
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla12
Attachment #576249 - Flags: review?(igor)
Is this something QA can verify?
Whiteboard: [qa?]
No, not really.  This code has been rewritten anyway.
Whiteboard: [qa?]
qa- as per comment 50 -- we won't be verifying this one.
Whiteboard: [qa-]