Closed Bug 1073662 Opened 10 years ago Closed 10 years ago

Clean up mozjemalloc allocation logic and import jemalloc3 chunk recycling logic

Categories

(Core :: Memory Allocator, defect)
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla36
People

(Reporter: ehoogeveen, Assigned: ehoogeveen)

References

(Blocks 1 open bug)

Details

Attachments

(7 files, 14 obsolete files)

19.82 KB, patch - ehoogeveen: review+
5.05 KB, patch - ehoogeveen: review+
11.11 KB, patch - ehoogeveen: review+
9.81 KB, patch - ehoogeveen: review+
8.84 KB, patch - ehoogeveen: review+
7.63 KB, patch - ehoogeveen: review+
997 bytes, patch - ehoogeveen: review+
I'm moving the patches from bug 1005844 here as I'm not convinced they'll fix that bug and I'd rather not split landings across branches.
Carrying forward r=glandium from bug 1005844.
Attachment #8496199 - Flags: review+
Carrying forward r=glandium from bug 1005844. Rebased across bug 1057754, and I think I fixed a bug where I failed to call chunk_alloc correctly (but somehow it compiled fine).
Attachment #8496201 - Flags: review+
I'm much happier with this version. Compared to the last one, this:

1) Uses the chunk recording and recycling only for chunksized allocations, so that it cannot lead to unbalanced VirtualAlloc/VirtualFree calls.
2) Re-enables the address space coalescing that the recycling code does, since it's harmless now.
3) Uses commit/decommit rather than resetting in pages_purge when decommitting is enabled. This ensures that regardless of the current state of pages in a chunk, after recycling they will all be committed.
4) Adds and enforces a MALLOC_RECYCLE_LIMIT so that we don't keep unbounded amounts of address space allocated forever. I arbitrarily set this limit to 100 MiB so the cache can still be useful; smaller values might still be good enough, but note that it's only address space. The only way this value could be a problem is if we're very low on address space and some other allocator (the GC or OS thread creation, let's say) needs it. I'm open to suggestions on how to make this value less arbitrary though.
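In rough outline, the new path looks like this (a sketch of the idea only, not the patch itself; the tree and helper names mirror the imported jemalloc code, and the bookkeeping counter is hypothetical):

    /* chunk_alloc(): try the recycled-chunk trees before mapping new memory. */
    if (size == chunksize) {
        ret = chunk_recycle(&chunks_szad, &chunks_ad, size,
            alignment, base, &zero);
        if (ret != NULL)
            goto RETURN;
    }
    ret = chunk_alloc_mmap(size, alignment);

    /* chunk_dealloc(): keep chunksized regions alive up to the cap,
     * otherwise really unmap them. */
    if (size == chunksize && recycled_bytes < MALLOC_RECYCLE_LIMIT)
        chunk_record(&chunks_szad, &chunks_ad, chunk, size);
    else
        pages_unmap(chunk, size);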

I sent this to try at https://tbpl.mozilla.org/?tree=Try&rev=fc8d139f9578 and confirmed locally that it works; it survived a run of Membench just fine.
Attachment #8496212 - Flags: review?(mh+mozilla)
Blocks: 1005844
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #4)
> 1) Uses the chunk recording and recycling only for chunksized allocations,

Note that this limitation only applies on Windows, although I doubt huge allocations are so frequent that they'll make a big difference.
Could you split the patch into an "import chunk recycle code" part (which would come from the other bug) and a "cap the amount of memory" part?
I'll do you one better :) This is just the imported code, changed only enough to compile but not hooked up to anything (as a baseline).
Attachment #8496212 - Attachment is obsolete: true
Attachment #8496212 - Flags: review?(mh+mozilla)
Attachment #8496342 - Flags: review?(mh+mozilla)
This hooks up the code and makes it work with MALLOC_DECOMMIT (chunk_record is now allowed to fail).
Attachment #8496343 - Flags: review?(mh+mozilla)
And finally, this adds the limit.
Attachment #8496345 - Flags: review?(mh+mozilla)
Comment on attachment 8496342 [details] [diff] [review]
Part 4: Import chunk recycle code from jemalloc3.

Review of attachment 8496342 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/mozjemalloc/jemalloc.c
@@ +1184,5 @@
> + * ordering).  These are used when allocating chunks, in an attempt to re-use
> + * address space.  Depending on function, different tree orderings are needed,
> + * which is why there are two trees with the same contents.
> + */
> +static extent_tree_t	chunks_szad;

Just for consistency with jemalloc3, can you name this chunks_szad_mmap and the following chunks_ad_mmap? (also avoids ambiguity with the function argument names)
Attachment #8496342 - Flags: review?(mh+mozilla) → review+
Comment on attachment 8496343 [details] [diff] [review]
Part 5: Hook the chunk recycle code up and make it safe for use with MALLOC_DECOMMIT.

Review of attachment 8496343 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/mozjemalloc/jemalloc.c
@@ +2721,5 @@
> +		 * matched, making it awkward to recycle allocations of varying
> +		 * sizes. Therefore we only allow recycling when the requested
> +		 * size equals the chunksize.
> +		 */
> +		return (NULL);

I think it would be better to do this check at the call site.
Please also file a followup bug about huge allocs not being recycled, and another bug for the issue with this feature on Windows in jemalloc3 (it should be filed upstream too; a GitHub issue would probably be best; Cc me (glandium) when you file there)

@@ +2863,5 @@
> +		 * sizes. Therefore we only allow recycling when the size equals
> +		 * the chunksize.
> +		 */
> +		return false;
> +	}

You should just move this check out of chunk_record; that would avoid having to change the function signature.

@@ +2899,5 @@
>  		if (xnode == NULL) {
> +			/* base_node_alloc() failed. */
> +			malloc_mutex_unlock(&chunks_mtx);
> +
> +			return false;

For future-proofing, I'd rather use a temporary variable with the return value and return it at the end of the function.
Attachment #8496343 - Flags: review?(mh+mozilla) → feedback+
Comment on attachment 8496345 [details] [diff] [review]
Part 6: Limit the maximum amount of memory that the chunk recycle code may keep alive.

Review of attachment 8496345 [details] [diff] [review]:
-----------------------------------------------------------------

Please file this upstream too.

::: memory/mozjemalloc/jemalloc.c
@@ +287,5 @@
>  /* Enable chunk recycling on all platforms. */
>  #define MALLOC_RECYCLE
>  
> +/* The maximum amount of memory to keep alive for later use, in bytes. */
> +#define MALLOC_RECYCLE_LIMIT (100 * 1024 * 1024)

This should probably be defined in terms of a number of chunks.

@@ +2902,5 @@
> +		 */
> +		if (xnode != NULL)
> +			base_node_dealloc(xnode);
> +
> +		return false;

cf. remark on previous patch, keep the goto label and use it.
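For instance, something along these lines (illustrative shape only, not the exact code; the label name follows jemalloc3):

    static bool
    chunk_record(extent_tree_t *chunks_szad, extent_tree_t *chunks_ad,
        void *chunk, size_t size)
    {
        bool ret = true;
        extent_node_t *xnode = base_node_alloc();

        malloc_mutex_lock(&chunks_mtx);
        if (xnode == NULL) {
            /* base_node_alloc() failed. */
            ret = false;
            goto label_return;
        }
        /* ... insert the node into chunks_szad / chunks_ad ... */
    label_return:
        malloc_mutex_unlock(&chunks_mtx);
        return (ret);
    }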
Attachment #8496345 - Flags: review?(mh+mozilla) → feedback+
BTW, could you run this through AWSY?
(In reply to Mike Hommey [:glandium] from comment #13)
> BTW, could you run this through AWSY?

Instructions are here:
https://areweslimyet.com/faq.htm#how-can-i-request-additional-awsy-tests-on-specific-changesets-or-try-pushes

You probably don't have VPN access, so could you please do the following.

- Do a try server build without your patches applied.

- Do a try server build with your patches applied.

- Post the links to each of those try jobs here.

- Ask someone on #memshrink to queue the builds on AWSY.

Thank you.
Mike: I made a version of this for jemalloc tip and submitted a pull request, but I have no idea how to CC you on it (consider this needinfo a heads up). The pull request is here: https://github.com/jemalloc/jemalloc/pull/153

Nicholas: I'd like to see how this pull request goes first, as I had to change some things to make it work on jemalloc tip. Hopefully it won't take too long (if it does, I'll just update the patches here at some point).
Flags: needinfo?(mh+mozilla)
There's a syntax for mentioning names on GitHub. I don't remember if it's + or @.
Flags: needinfo?(mh+mozilla)
Emanuel, any progress here? We'd *love* to get this landed :) Thank you.
No comments on the pull request other than my back and forth with Mike. I'll probably split out the unrelated fix for Windows and see if I can just attach patches to an issue instead, but I guess I'll update the patches here first.

PS: Thanks for the reminder that this is wanted, the lack of response from upstream has been very frustrating to me.
Do we genuinely need responses from upstream in order to land this?
(In reply to Nicholas Nethercote [:njn] from comment #19)
> Do we genuinely need responses from upstream in order to land this?

Well, it would be nice to know that they're on board with the approach at least. Still, at least the revised patches should be a good match for jemalloc's dev branch now. I had to rewrite much of parts 4-6; sending them through try now (got them in just before it closed, hopefully they'll finish).
The only real change here is the addition of atomic_add_z and atomic_sub_z. I basically cherry-picked the implementations we care about from jemalloc3's atomic.h.

> Just for consistency with jemalloc3, can you name this chunks_szad_mmap and
> the following chunks_ad_mmap? (also avoids ambiguity with the function
> argument names)

Done.
Attachment #8496342 - Attachment is obsolete: true
Attachment #8510035 - Flags: review?(mh+mozilla)
I changed this to fully implement JEMALLOC_MUNMAP, since it was already 90% of the way there before and things make more sense this way. I fixed pages_purge() to work with non-chunksize allocations on Windows.

This version no longer makes any changes to chunk_record, since making it fallible wasn't compatible with the dev branch of jemalloc3 and we can do everything at the caller.
Attachment #8496343 - Attachment is obsolete: true
Attachment #8510039 - Flags: review?(mh+mozilla)
This version moves the decisions on whether to record or recycle into the callers, makes JEMALLOC_RECYCLE a separate concept from JEMALLOC_MUNMAP (which serves its own purpose) and does things in terms of compile-time constant variables instead of defines, as jemalloc3 does (the defines got *really* ugly).

The main reason I didn't want to make the decision about whether or not to record a chunk in the caller is that we can't take the chunk lock until we're in chunk_record - so the check can race with additions to the chunk cache. But I guess that's not really a big deal - at most we'll go over the limit by a few chunks in a heavily multithreaded situation. I did have to make the check atomic to compensate for the lack of a lock, though.
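Concretely, the check and the bookkeeping look roughly like this (a sketch of this version, not the exact diff):

    /* In chunk_dealloc(): the read of recycled_size happens without
     * chunks_mtx held, so it can race with other threads recording
     * chunks; at worst we overshoot the limit by a few chunks. */
    if (!config_munmap || (config_recycle && size == chunksize &&
        atomic_read_z(&recycled_size) < recycle_limit)) {
        chunk_record(&chunks_szad_mmap, &chunks_ad_mmap, chunk, size);
    } else {
        chunk_dealloc_mmap(chunk, size);
    }

    /* In chunk_record(), once the region has been inserted: */
    atomic_add_z(&recycled_size, size);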

(In reply to Mike Hommey [:glandium] from comment #12)
> > +/* The maximum amount of memory to keep alive for later use, in bytes. */
> > +#define MALLOC_RECYCLE_LIMIT (100 * 1024 * 1024)
> 
> This should probably defined in terms of a number of chunks.

Done.
Attachment #8496345 - Attachment is obsolete: true
Attachment #8510041 - Flags: review?(mh+mozilla)
:johns triggered runs on AWSY for me, which should show up here eventually:
https://areweslimyet.com/?series=bug1073662#evenspacing

I also did some extra try runs for other configurations.

This disables JEMALLOC_RECYCLE, which should match the pre-patch configuration:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=30b15a0934f6

This disables JEMALLOC_MUNMAP, completely disabling memory unmapping (which might be viable on 64-bit platforms and supposedly helps Linux):
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=23d31ac71d5b

It's good to see that these work, especially the former, as it means we can just opt out of the new code if needed.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #25)
> :johns triggered runs on AWSY for me, which should show up here eventually:
> https://areweslimyet.com/?series=bug1073662#evenspacing

It's good to trigger a baseline run as well to make comparisons easier. And doing 3 triggers of baseline and the modified version is even better.

Nonetheless, things look promising:

- The two low (start-up) lines look unchanged: 173 and 151, which are similar to current.

- The five high (after TP5, no waiting) lines look unchanged: 395--495, which are similar to current.

- The two middle (after TP5, +30s) lines look better: 237 and 254, vs. ~260 and ~270, though there is enough variation in these lines that it's hard to say it's a conclusive improvement.
I queued up a few more runs with these patches -- and included the ones from comment 25 for good measure -- so we can make sure we're not seeing noise:

https://areweslimyet.com/?series=bug1073662run2#evenspacing
https://areweslimyet.com/?series=bug1073662run3#evenspacing
https://areweslimyet.com/?series=bug1073662run4#evenspacing

(These won't show up for a few hours.)
Comment on attachment 8510035 [details] [diff] [review]
Part 4 v2: Import chunk recycling and some of the atomics code from jemalloc3.

Review of attachment 8510035 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/mozjemalloc/jemalloc.c
@@ +1518,5 @@
> +}
> +
> +#endif
> +
> +#define atomic_read_z(p) atomic_add_z(p, 0)

You don't need these functions (see upcoming review for part 6)
Attachment #8510035 - Flags: review?(mh+mozilla) → review+
Comment on attachment 8510039 [details] [diff] [review]
Part 5 v2: Hook the chunk recycle code up to JEMALLOC_MUNMAP and make it safe for use with MALLOC_DECOMMIT.

Review of attachment 8510039 [details] [diff] [review]:
-----------------------------------------------------------------

This looks nicer than previous iteration. Please reflag me for the final round.

::: memory/mozjemalloc/jemalloc.c
@@ +2732,5 @@
> +	* region in one go may not be valid. However, since we allocate at least
> +	* a chunk at a time, we may touch any region in chunksized increments.
> +	*/
> +#  ifdef MALLOC_DECOMMIT
> +	for (; length >= chunksize; length -= chunksize) {

> 0

@@ +2733,5 @@
> +	* a chunk at a time, we may touch any region in chunksized increments.
> +	*/
> +#  ifdef MALLOC_DECOMMIT
> +	for (; length >= chunksize; length -= chunksize) {
> +		pages_decommit(addr, chunksize);

pages_decommit(addr, min(chunksize, length));

@@ +2741,5 @@
> +		pages_decommit(addr, length);
> +	unzeroed = false;
> +#  else
> +	for (; length >= chunksize; length -= chunksize) {
> +		VirtualAlloc(addr, chunksize, MEM_RESET, PAGE_READWRITE);

same here

@@ +2851,5 @@
> +	 * commit them again. As the original region may have been allocated in
> +	 * multiple calls to VirtualAlloc, we recommit in chunksized increments.
> +	 */
> +	{
> +		size_t size2 = size;

Replace size2 with, I don't know, remaining?

@@ +2852,5 @@
> +	 * multiple calls to VirtualAlloc, we recommit in chunksized increments.
> +	 */
> +	{
> +		size_t size2 = size;
> +		void *ret2 = ret;

Replace ret2 with something more expressive, like chunk_ptr;

@@ +2853,5 @@
> +	 */
> +	{
> +		size_t size2 = size;
> +		void *ret2 = ret;
> +		for (; size2 >= chunksize; size2 -= chunksize) {

Same as the decommit case, you can do >0 here

@@ +2854,5 @@
> +	{
> +		size_t size2 = size;
> +		void *ret2 = ret;
> +		for (; size2 >= chunksize; size2 -= chunksize) {
> +			pages_commit(ret2, chunksize);

and pages_commit(chunk_ptr, min(chunksize, remaining));

@@ +2892,5 @@
> +		ret = chunk_recycle(&chunks_szad_mmap, &chunks_ad_mmap,
> +			size, alignment, base, &zero);
> +		if (ret != NULL)
> +			goto RETURN;
> +	}

Interestingly, jemalloc3 doesn't care to check for config_munmap before calling chunk_recycle. Seems to me you should send a patch upstream for that.

@@ +3025,5 @@
>  
> +	if (!config_munmap)
> +		chunk_record(&chunks_szad_mmap, &chunks_ad_mmap, chunk, size);
> +	else
> +		chunk_dealloc_mmap(chunk, size);

Why the subtle difference with jemalloc3? I'd just copy chunk_dealloc_mmap from upstream and do
if (chunk_dalloc_mmap(chunk, size))
	chunk_record(&chunks_szad_mmap, &chunks_ad_mmap, chunk, size);
Attachment #8510039 - Flags: review?(mh+mozilla) → feedback+
Comment on attachment 8510041 [details] [diff] [review]
Part 6 v2: Add JEMALLOC_RECYCLE, a mode that keeps a limited amount of chunks alive.

Review of attachment 8510041 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/mozjemalloc/jemalloc.c
@@ +1179,5 @@
> + */
> +#if SIZEOF_PTR_2POW == 2
> +#define CHUNK_RECYCLE_LIMIT 128
> +#else
> +#define CHUNK_RECYCLE_LIMIT 128 * 1024

128GB? That seems like a whole lot of decommitted address space to hold on to. Why not simply twice as much as what we'd hold on to on 32-bits?

That said, I can see how non-Firefox cases would want this to be easily configurable when this is upstream, so maybe that should go in jemalloc_config there (after all, the chunk size is there too).

@@ +2918,5 @@
>  	assert((size & chunksize_mask) == 0);
>  	assert(alignment != 0);
>  	assert((alignment & chunksize_mask) == 0);
>  
> +#ifndef MOZ_MEMORY_WINDOWS

It would probably read better as:
#ifdef MOZ_MEMORY_WINDOWS
/* comment */
#define CAN_RECYCLE(size) (size == chunksize)
#else
#define CAN_RECYCLE(size) true
#endif

     if (!config_munmap || (config_recycle && CAN_RECYCLE(size))) {

@@ +3032,5 @@
>  		xprev = prev;
>  	}
>  
> +	if (config_munmap && config_recycle)
> +		atomic_add_z(&recycled_size, size);

Note you're under the chunks_mtx lock here, so you're safe as far as other threads running this code is concerned.

@@ +3075,5 @@
> +	 * matched, making it awkward to recycle allocations of varying
> +	 * sizes. Therefore we only allow recycling when the size equals
> +	 * the chunksize unless deallocation is entirely disabled.
> +	 */
> +	if (!config_munmap || (config_recycle && size == chunksize &&

if (!config_munmap || (config_recycle && CAN_RECYCLE(size) && recycled_size < recycle_limit)) {

what's the worst that can happen from not using an atomic read here? a couple of chunks sneaking into the recycle list. I don't think we need to care about the limit being absolutely accurate.
Attachment #8510041 - Flags: review?(mh+mozilla) → feedback+
(In reply to Mike Hommey [:glandium] from comment #29)
> > +#  ifdef MALLOC_DECOMMIT
> > +	for (; length >= chunksize; length -= chunksize) {
> 
> > 0

length is a size_t, so that could underflow, right? I could store the size in an intptr_t (since I doubt >= 2 GiB allocations are going to work in a 32-bit process anyway), declared in the loop header. It makes me uncomfortable without an assertion though.

> Why the subtle difference with jemalloc3? I'd just copy chunk_dealloc_mmap
> from upstream and do
> if (chunk_dalloc_mmap(chunk, size))
>   
> chunk_record(&chunks_szad_mmap, &chunks_ad_mmap, chunk, size);

Oh, this just occurred to me - do you mean I should put the whole 'should this chunk be recycled' logic in chunk_dalloc_mmap? I couldn't do that with the previous approach, but that actually makes sense now.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #31)
> (In reply to Mike Hommey [:glandium] from comment #29)
> > > +#  ifdef MALLOC_DECOMMIT
> > > +	for (; length >= chunksize; length -= chunksize) {
> > 
> > > 0
> 
> length is a size_t, so that could underflow right? I could store the size in
> an intptr_t (since I doubt >= 2 GiB allocations are going to work in a
> 32-bit process anyway), declared in the loop header. It makes me
> uncomfortable without an assertion though.

decrementing with min(chunksize, length) would work too.

> > Why the subtle difference with jemalloc3? I'd just copy chunk_dealloc_mmap
> > from upstream and do
> > if (chunk_dalloc_mmap(chunk, size))
> >   
> > chunk_record(&chunks_szad_mmap, &chunks_ad_mmap, chunk, size);
> 
> Oh, this just occurred to me - do you mean I should put the whole 'should
> this chunk be recycled' logic in chunk_dalloc_mmap? I couldn't do that with
> the previous approach, but that actually makes sense now.

yes
(In reply to Mike Hommey [:glandium] from comment #32)
> decrementing with min(chunksize, length) would work too.

Hmm, true, I guess I'll do that.
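So the decommit loop would become something like this (the MEM_RESET branch takes the same shape, just with VirtualAlloc):

    /* Underflow-safe: never subtract more than what is left of a size_t. */
    for (; length > 0; length -= min(chunksize, length)) {
        pages_decommit(addr, min(chunksize, length));
        addr = (void *)((uintptr_t)addr + chunksize);
    }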

(In reply to Mike Hommey [:glandium] from comment #30)
> 128GB? That seems like a whole lot of decommitted address space to hold on
> to. Why not simply twice as much as what we'd hold on to on 32-bits?

I figured there's plenty of address space to go around on 64-bit [1], so there's no real need to limit it - and to actually make it to 128GB we'd have to actually allocate that much first (then deallocate all of it), which seems impossible. However there's actually another limit that I forgot about: max_map_count on Linux, which appears to generally be 65530. Since we allocate in chunksize increments, we should make sure to stay well shy of that limit. How about something like 4096 * chunksize? It's more that I see no reason *not* to make it a high limit than anything else.

[1] The limit is 7 TiB on 64-bit Windows with an Itanium-based processor and 8 TiB on others, 128 TiB on 64-bit Linux and something ridiculous on OSX.

> It would probably read better as:
> #ifdef MOZ_MEMORY_WINDOWS
> /* comment */
> #define CAN_RECYCLE(size) (size == chunksize)
> #else
> #define CAN_RECYCLE(size) true
> #endif
> 
>      if (!config_munmap || (config_recycle && CAN_RECYCLE(size)) {

I like that, will do.

> what's the worst that can happen from not using an atomic read here? a
> couple chunks sneaking in the recycle list. I don't think we need to care
> about the limit being absolutely accurate.

Hmm, this is kind of a tough question. It's true we don't need atomicity, but it's worth thinking about the memory ordering.

(1) If |recycled_size < recycle_limit|, we go to record the chunk and take the lock, inserting a memory fence and essentially giving us release-acquire memory ordering (not quite sequentially consistent, because there's no fence before the load).

(2) If |recycled_size >= recycle_limit|, we never take any locks, and so this has relaxed memory ordering. If allocation always happens on one thread, and deallocation always happens on another, this could theoretically never sync up, and we wouldn't record any more chunks even though they're being recycled. In practice I expect we'd get memory fences and serializing instructions every now and then regardless, but I don't really like leaving it up to chance. It seems odd to make only the load here atomic though :\
This removes the atomic functions again, but see part 6. Carrying forward r=glandium
Attachment #8510035 - Attachment is obsolete: true
Attachment #8515511 - Flags: review+
(In reply to Mike Hommey [:glandium] from comment #29)
> > +	for (; length >= chunksize; length -= chunksize) {
> 
> > 0

> > +		pages_decommit(addr, chunksize);
> 
> pages_decommit(addr, min(chunksize, length));

Done. I switched to using VirtualAlloc/VirtualFree directly here since (1) This is Windows-specific code, (2) The MEM_RESET case already used it and (3) jemalloc3 doesn't have pages_commit/pages_decommit.

> > +		size_t size2 = size;
> 
> Replace size2 with, I don't know, remaining?

> > +		void *ret2 = ret;
> 
> Replace ret2 with something more expressive, like chunk_ptr;

Done.

> > +	if (!config_munmap)
> > +		chunk_record(&chunks_szad_mmap, &chunks_ad_mmap, chunk, size);
> > +	else
> > +		chunk_dealloc_mmap(chunk, size);
> 
> Why the subtle difference with jemalloc3? I'd just copy chunk_dealloc_mmap
> from upstream

Done, although I wrote it slightly differently to make the change in behavior in part 6 more obvious (part 6 pretty much needs it to be written this way).
Attachment #8510039 - Attachment is obsolete: true
Attachment #8515512 - Flags: review?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #30)
> > +#define CHUNK_RECYCLE_LIMIT 128 * 1024
> 
> 128GB? That seems like a whole lot of decommitted address space to hold on
> to. Why not simply twice as much as what we'd hold on to on 32-bits?

I changed this to 4096 and added a comment (4096 and 128 actually have some interesting parity in this case). Let me know if you disagree with the reasoning and think it should be less.

> #define CAN_RECYCLE(size) (size == chunksize)

Done. This is a nice little cleanup!

> > +	if (config_munmap && config_recycle)
> > +		atomic_add_z(&recycled_size, size);
> 
> Note you're under the chunks_mtx lock here, so you're safe as far as other
> threads running this code is concerned.

Done.

> if (!config_munmap || (config_recycle && CAN_RECYCLE(size) && recycled_size
> < recycle_limit)) {
> 
> what's the worst that can happen from not using an atomic read here? a
> couple chunks sneaking in the recycle list. I don't think we need to care
> about the limit being absolutely accurate.

See also comment #33. I talked this over with Waldo; he advised me to do the cautious thing, and I agree. Without any memory barriers, there's no guarantee that this load will ever sync up with the store in chunk_recycle().

I don't think we need sequential consistency here; release-acquire memory ordering is enough. To do that efficiently and explicitly, I wrote a small function that inserts a memory barrier after the load, called load_acquire_z to differentiate it from the jemalloc atomic functions.

I realize this may not be the right answer for upstream (which can just use the existing atomic_read_z), but I think this is the best solution for our fork (since we can't import all the jemalloc atomics functions verbatim anyway). Let me know if you'd prefer something different. The use of InterlockedExchange with a dummy variable on Windows is fairly well known - MinGW uses it, for instance, and it has been verified to generate the right instruction.
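For reference, the function looks roughly like this (the Windows branch is what the patch does; the fallback shown here is a stand-in, not necessarily the exact code):

    static inline size_t
    load_acquire_z(size_t *p)
    {
        /* Volatile load so the compiler can't reorder or elide it. */
        size_t result = *(volatile size_t *)p;
    #  ifdef MOZ_MEMORY_WINDOWS
        /* InterlockedExchange acts as a full memory barrier. */
        long dummy = 0;
        InterlockedExchange(&dummy, 1);
    #  else
        __sync_synchronize();
    #  endif
        return result;
    }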
Attachment #8510041 - Attachment is obsolete: true
Attachment #8515516 - Flags: review?(mh+mozilla)
Comment on attachment 8515512 [details] [diff] [review]
Part 5 v3: Hook the chunk recycle code up to JEMALLOC_MUNMAP and make it safe for use with MALLOC_DECOMMIT.

Review of attachment 8515512 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/mozjemalloc/jemalloc.c
@@ +2694,5 @@
> +	for (; length > 0; length -= min(chunksize, length)) {
> +		VirtualAlloc(addr, min(chunksize, length), MEM_RESET,
> +			PAGE_READWRITE);
> +		addr = (void *)((uintptr_t)addr + chunksize);
> +	}

Note, it may or may not be clearer this way:

        for (; length > 0; length -= min(chunksize, length)) {
# ifdef MALLOC_DECOMMIT
            VirtualFree(addr, min(chunksize, length), MEM_DECOMMIT);
#   define UNZEROED false
# else
            VirtualAlloc(addr, min(chunksize, length), MEM_RESET, PAGE_READWRITE);
#   define UNZEROED true
# endif
            addr = (void *)((uintptr_t)addr + chunksize);
        }
        unzeroed = UNZEROED;

@@ +2839,5 @@
> +	if (!config_munmap) {
> +		ret = chunk_recycle(&chunks_szad_mmap, &chunks_ad_mmap,
> +			size, alignment, base, &zero);
> +		if (ret != NULL)
> +			goto RETURN;

Did you report the missing config_munmap check upstream?

@@ +2951,5 @@
>  		base_node_dealloc(xprev);
>  }
>  
> +static bool
> +chunk_dalloc_mmap(void *chunk, size_t size)

Why the function name change?
Attachment #8515512 - Flags: review?(mh+mozilla) → review+
Comment on attachment 8515516 [details] [diff] [review]
Part 6 v3: Add JEMALLOC_RECYCLE, a mode that keeps a limited amount of chunks alive.

Review of attachment 8515516 [details] [diff] [review]:
-----------------------------------------------------------------

Nick, what do you think about the following?

::: memory/mozjemalloc/jemalloc.c
@@ +1176,5 @@
> +#if SIZEOF_PTR_2POW == 2
> +/* On a 32-bit OS, use at most 6.25% of the process address space. */
> +#define CHUNK_RECYCLE_LIMIT 128
> +#else
> +/* On a 64-bit OS, use at most 6.25% of the max_map_count (65536). */

It's worth mentioning that max_map_count is a Linux kernel thing that defaults to 65536, but can be changed with sysctl.
That said, it's still weird to make a cross-platform jemalloc policy decision based on an implementation detail of one of the target OSes.
We don't change e.g. chunk sizes between 32-bits and 64-bits builds, I don't really see why we'd change the number of recycled chunks.

@@ +1512,5 @@
> +{
> +	size_t result = *p;
> +#  ifdef MOZ_MEMORY_WINDOWS
> +	long dummy = 0;
> +	InterlockedExchange(&dummy, 1);

If we really go the fence way, please add a comment that this does a memory fence as a side effect and that's what this is expected to do.
That said, I'm still not convinced we need a fence.
Attachment #8515516 - Flags: review?(n.nethercote)
Attachment #8515516 - Flags: review?(mh+mozilla)
Attachment #8515516 - Flags: review+
> Nick, what do you think about the following?
> 
> ::: memory/mozjemalloc/jemalloc.c
> @@ +1176,5 @@
> > +#if SIZEOF_PTR_2POW == 2
> > +/* On a 32-bit OS, use at most 6.25% of the process address space. */
> > +#define CHUNK_RECYCLE_LIMIT 128
> > +#else
> > +/* On a 64-bit OS, use at most 6.25% of the max_map_count (65536). */

Are you asking just about these few lines? I don't have any particular opinion one way or the other, so I'm happy with whatever you and Emanuel agree on.
Attachment #8515516 - Flags: review?(n.nethercote)
(In reply to Nicholas Nethercote [:njn] from comment #40)
> > Nick, what do you think about the following?
> > 
> > ::: memory/mozjemalloc/jemalloc.c
> > @@ +1176,5 @@
> > > +#if SIZEOF_PTR_2POW == 2
> > > +/* On a 32-bit OS, use at most 6.25% of the process address space. */
> > > +#define CHUNK_RECYCLE_LIMIT 128
> > > +#else
> > > +/* On a 64-bit OS, use at most 6.25% of the max_map_count (65536). */
> 
> Are you asking just about these few lines?

About all the comments.
I realized there might be a problem on Mac, which uses MALLOC_DOUBLE_PURGE. pages_purge() does |madvise(..., MADV_FREE)|, but IIUC that doesn't make Mac actually decommit anything until the memory is needed by other processes. MALLOC_DOUBLE_PURGE forces it to decommit, which I believe we use when we measure in about:memory, but this isn't hooked up to the chunk recycling code.

Nicholas, how important is this to fix? Is this a reason to go with the same limit (128 chunks) on 32-bit and 64-bit? I'm not sure off-hand how to hook the recycling code up to the MALLOC_DOUBLE_PURGE stuff, but I could probably figure it out if needed.
Flags: needinfo?(n.nethercote)
Comment on attachment 8515516 [details] [diff] [review]
Part 6 v3: Add JEMALLOC_RECYCLE, a mode that keeps a limited amount of chunks alive.

Review of attachment 8515516 [details] [diff] [review]:
-----------------------------------------------------------------

I guess having the same values on 32-bit and 64-bit is slightly preferable if it makes sense, just because it makes things simpler. But I really am no expert on jemalloc's internals.
jlebar wrote all the Mac purging code; I know almost nothing about it, sorry.
(In reply to Mike Hommey [:glandium] from comment #38)
> Note, it may or may not be clearer this way:
> 
>         for (; length > 0; length -= min(chunksize, length)) {
> # ifdef MALLOC_DECOMMIT
>             VirtualFree(addr, min(chunksize, length), MEM_DECOMMIT);
> #   define UNZEROED false
> # else
>             VirtualAlloc(addr, min(chunksize, length), MEM_RESET,
> PAGE_READWRITE);
> #   define UNZEROED true
> # endif
>             addr = (void *)((uintptr_t)addr + chunksize);
>         }
>         unzeroed = UNZEROED;

I find that the indentation changes make it much harder to read for me - it makes a simple loop look very hairy. So I'd rather leave this the way it is.

> Did you report the missing config_munmap check upstream?

Not yet, but I will. I'd like to get things finished up here first.

> > +chunk_dalloc_mmap(void *chunk, size_t size)
> 
> Why the function name change?

For some reason upstream dropped the 'e' from dealloc. It made the diff look a little nicer and it matches jemalloc3 so I figured what the hell.
Carrying forward r=glandium.

(In reply to Mike Hommey [:glandium] from comment #39)
> That said, it's still weird to make a cross-platform jemalloc policy
> decision based on an implementation detail of one of the target OSes.
> We don't change e.g. chunk sizes between 32-bits and 64-bits builds, I don't
> really see why we'd change the number of recycled chunks.

I'm going to stop pushing on this; I could honestly go either way on it so let's just do the simple thing. This version makes all platforms use 128 chunks and extends the comment a little.

> > +	long dummy = 0;
> > +	InterlockedExchange(&dummy, 1);
> 
> If we really go the fence way, please add a comment that this does a memory
> fence as a side effect and that's what this is expected to do.
> That said, I'm still not convinced we need a fence.

Comment added. I also made the actual load volatile so the compiler can't move it. As for using a fence at all - I just don't feel comfortable relying on chance when it comes to multithreading.
Attachment #8515516 - Attachment is obsolete: true
Attachment #8518605 - Flags: review+
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #45)
> (In reply to Mike Hommey [:glandium] from comment #38)
> > Note, it may or may not be clearer this way:
> > 
> >         for (; length > 0; length -= min(chunksize, length)) {
> > # ifdef MALLOC_DECOMMIT
> >             VirtualFree(addr, min(chunksize, length), MEM_DECOMMIT);
> > #   define UNZEROED false
> > # else
> >             VirtualAlloc(addr, min(chunksize, length), MEM_RESET,
> > PAGE_READWRITE);
> > #   define UNZEROED true
> > # endif
> >             addr = (void *)((uintptr_t)addr + chunksize);
> >         }
> >         unzeroed = UNZEROED;
> 
> I find that the indentation changes make it much harder to read for me - it
> makes a simple loop look very hairy. So I'd rather leave this the way it is.

I for one don't like the loop repetition. How about

#ifdef MALLOC_DECOMMIT
# define PURGE(addr, size) VirtualFree(addr, size, MEM_DECOMMIT)
# define UNZEROED false
#else
# define PURGE(addr, size) VirtualAlloc(addr, size, MEM_RESET, PAGE_READWRITE);
# define UNZEROED true
#endif

  for (; length > 0; length -= min(chunksize, length)) {
    PURGE(addr, min(chunksize, length));
    addr = (void *)((uintptr_t)addr + chunksize);
  }
  unzeroed = UNZEROED;


> > > +chunk_dalloc_mmap(void *chunk, size_t size)
> > 
> > Why the function name change?
> 
> For some reason upstream dropped the 'e' from dealloc. It made the diff look
> a little nicer and it matches jemalloc3 so I figured what the hell.

Interesting, that must be something fairly recent because the version we have in-tree still has the e. Either way is fine I guess.
(In reply to Mike Hommey [:glandium] from comment #47)
> I for one don't like the loop repetition. How about [...]

That's not a bad idea. However, let me offer this patch as a counter-proposal: by fixing pages_commit and pages_decommit to work on chunksized regions on Windows (as they should, in the presence of recycling), we can simply call them instead.

This adds the same loop to pages_commit and pages_decommit, but removes it from chunk_recycle. In addition, this makes pages_purge work with MALLOC_DECOMMIT on any platform (although we only have it enabled on Windows), and removes the same limitation in chunk_recycle.

I also realized something while working on this: the regions that pages_commit and pages_decommit work on are not, in general, chunk-aligned. pages_purge is only used for chunk-aligned regions currently, but that doesn't have to remain the case.

So this patch modifies the loop slightly to first commit/decommit/purge everything up to the nearest chunk boundary, then proceed in chunksized increments. I actually think it looks better this way, though it does involve an additional variable.
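Concretely, the Windows decommit loop ends up shaped like this (error handling omitted; the commit and MEM_RESET variants follow the same pattern):

    /* First take care of any non-chunk-aligned prefix, then continue
     * in chunksized increments. */
    size_t decommit_size = min(size, chunksize -
        CHUNK_ADDR2OFFSET((uintptr_t)addr));
    for (; size > 0; decommit_size = min(size, chunksize)) {
        VirtualFree(addr, decommit_size, MEM_DECOMMIT);
        addr = (void *)((uintptr_t)addr + decommit_size);
        size -= decommit_size;
    }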

Finally, this patch should fix the problem I noticed regarding MALLOC_DOUBLE_PURGE. The solution is simply to walk the tree of recycled chunks in jemalloc_purge_freed_pages_impl, and decommit-then-commit each region as hard_purge_chunk does. I can't actually test this locally since I don't have a Mac, so hopefully Try will exercise this path (though it's simple enough that if it compiles, it probably works).
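In sketch form, the Mac fix amounts to this (the rb-tree iteration helpers are named from memory and may not match the generated names exactly):

    /* In jemalloc_purge_freed_pages_impl(): hard-purge recycled chunks
     * the same way hard_purge_chunk() does for arena chunks. */
    extent_node_t *node;

    malloc_mutex_lock(&chunks_mtx);
    for (node = extent_tree_ad_first(&chunks_ad_mmap); node != NULL;
         node = extent_tree_ad_next(&chunks_ad_mmap, node)) {
        /* Decommit and recommit, so madvised pages are really returned
         * to the OS instead of being reclaimed lazily. */
        pages_decommit(node->addr, node->size);
        pages_commit(node->addr, node->size);
    }
    malloc_mutex_unlock(&chunks_mtx);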

I'm asking for another round of review since this version changed a fair few things.
Attachment #8515512 - Attachment is obsolete: true
Attachment #8518622 - Flags: review?(mh+mozilla)
Rebased on top of part 5 v4.

Oh, and I think the needinfo can be cleared now (see my solution to the problem in part 5).
Attachment #8518605 - Attachment is obsolete: true
Flags: needinfo?(n.nethercote)
Attachment #8518623 - Flags: review+
(In reply to Mike Hommey [:glandium] from comment #38)
> Did you report the missing config_munmap check upstream?

I filed https://github.com/jemalloc/jemalloc/issues/164
Silly compilation bustage fixes. Try on Linux looks good, but I'm having trouble getting any results out of Windows for some reason:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=878ff01b9a3c
and (after compile fix)
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=2a7b9295b44c

Windows compiles fine locally, so I think this is some sort of weird infrastructure problem. It shouldn't be affected by the OSX compilation fix, since that code is ifdeffed out on Windows.
Attachment #8518622 - Attachment is obsolete: true
Attachment #8518622 - Flags: review?(mh+mozilla)
Attachment #8518698 - Flags: review?(mh+mozilla)
Something about part 5 completely breaks the Windows build on try:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=2796f5b98540
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=9d22d2fb88b7

The output is a completely unhelpful mess, unfortunately. I'll go over the code again and bisect it on try, I guess.
This fixes a silly copy/paste error (which made pages_commit try to decommit instead of committing). I don't know why this caused compilation to fail on try for Windows, but it was probably the cause. Sending it through try now.
Attachment #8518698 - Attachment is obsolete: true
Attachment #8518698 - Flags: review?(mh+mozilla)
Attachment #8519567 - Flags: review?(mh+mozilla)
Comment on attachment 8519567 [details] [diff] [review]
Part 5 v6: Hook the chunk recycle code up to JEMALLOC_MUNMAP and make it safe for use with MALLOC_DECOMMIT and MALLOC_DOUBLE_PURGE.

Review of attachment 8519567 [details] [diff] [review]:
-----------------------------------------------------------------

::: memory/mozjemalloc/jemalloc.c
@@ +284,5 @@
>  #define	MALLOC_DECOMMIT
>  #endif
>  
> +/* Enable unmapping pages on all platforms. */
> +#define JEMALLOC_MUNMAP

Can you move that to a separate patch? In case we want to back this out, it's easier to just backout the one-liner enabler than the whole thing.

@@ +1976,5 @@
> +	* time, we may touch any region in chunksized increments.
> +	*/
> +	size_t decommit_size = min(size, chunksize -
> +		CHUNK_ADDR2OFFSET((uintptr_t)addr));
> +	for (; size > 0; decommit_size = min(size, chunksize)) {

It would be clearer imho if this was a while loop with the size updated at the end. Likewise for the other loops.
Attachment #8519567 - Flags: review?(mh+mozilla) → review+
(In reply to Mike Hommey [:glandium] from comment #55)
> > +/* Enable unmapping pages on all platforms. */
> > +#define JEMALLOC_MUNMAP
> 
> Can you move that to a separate patch? In case we want to back this out,
> it's easier to just backout the one-liner enabler than the whole thing.

This doesn't actually change the behavior - the cache is only used when unmapping is *disabled*. jemalloc3 disables munmap by default on Linux, but we do not.

However, I can certainly split out the bit of part 6 that enables the capped recycling.

> > +	for (; size > 0; decommit_size = min(size, chunksize)) {
> 
> It would be clearer imho if this was a while loop with the size updated at
> the end. Likewise for the other loops.

Will do.
Carrying forward r=glandium. In addition to addressing the review comments, this clarifies the comment above #define JEMALLOC_MUNMAP a little and removes a reference to config_recycle (added back in part 6) in the MALLOC_DOUBLE_PURGE code.
Attachment #8519567 - Attachment is obsolete: true
Attachment #8522730 - Flags: review+
Carrying forward r=glandium. Rebased slightly, enabling JEMALLOC_RECYCLE split out into its own part.
Attachment #8518623 - Attachment is obsolete: true
Attachment #8522731 - Flags: review+
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #56)
> (In reply to Mike Hommey [:glandium] from comment #55)
> > > +/* Enable unmapping pages on all platforms. */
> > > +#define JEMALLOC_MUNMAP
> > 
> > Can you move that to a separate patch? In case we want to back this out,
> > it's easier to just backout the one-liner enabler than the whole thing.
> 
> This doesn't actually change the behavior - the cache is only used when
> unmapping is *disabled*. jemalloc3 disables munmap by default on Linux, but
> we do not.

Gah, backwards flags :)
This caused a 5.64% regression in TP5 (Private Bytes) on Linux. RSS stayed the same, and Windows is unaffected. I guess this isn't surprising since we're holding on to address space that we weren't before.

Windows might be unaffected because we're decommitting there, whereas Linux just uses madvise(..., MADV_DONTNEED), but I'm not really sure how Private Bytes are measured (i.e. what they are).
What does "private bytes" mean on Linux?
about:memory says about 'private': "Memory that cannot be shared with other processes, including memory that is committed and marked MEM_PRIVATE, data that is not mapped, and executable pages that have been written to."

Since we only MADV_DONTNEED the pages, I don't think that will reduce our private bytes unless the OS decommits the pages due to memory pressure.
Note that pages_decommit sets the pages to PROT_NONE instead (which apparently makes the OS *actually* get rid of them), but MALLOC_DECOMMIT is not enabled on Linux.
By the way, the patches also apparently caused this improvement:

http://graphs.mozilla.org/graph.html#tests=%5B%5B254,63,21%5D%5D&sel=1415894932000,1416067732000&displayrange=7&datatype=running

Perhaps there's a lot of allocation churn during resizing? Since we don't unmap unless the chunk cache is empty now, this might have gotten faster.
IIRC MADV_DONTNEED makes the kernel drop the pages immediately, and the recorded chunks are purged, which does MADV_DONTNEED on them right? So an increase in private bytes theoretically can't come from that.
As far as I can tell, 'Private Bytes' is a (hard to understand) Microsoft term, so I'm not sure what it even really means on Linux. It would be nice to know how Talos and/or about:memory determine this figure.

Joel, could you point me in the right direction?
Flags: needinfo?(jmaher)
Here is where we calculate 'private bytes' for Linux:
http://hg.mozilla.org/build/talos/file/2bdbc49134c4/talos/cmanager_linux.py#l74

I assume this is related to the alert that came out from graph server/talos ?
Flags: needinfo?(jmaher)
Thanks! So that just tallies up all the mapped ranges that have PROT_WRITE set on them.

MADV_DONTNEED doesn't change the memory protection of the range, so I don't see any reason why it would reduce private bytes. It also explains why pages_decommit *would* reduce private bytes, since it explicitly sets the range to PROT_NONE.
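To make the difference concrete (a standalone sketch, not the mozjemalloc code itself):

    #include <sys/mman.h>

    /* Both approaches let the OS reclaim the physical pages, but only
     * the second drops PROT_WRITE from the mapping, which is what the
     * Talos tally looks at. */
    static void
    purge_keep_mapping(void *addr, size_t size)
    {
        /* What pages_purge() does on Linux: the range stays
         * readable/writable, so it still counts as private bytes. */
        madvise(addr, size, MADV_DONTNEED);
    }

    static void
    purge_decommit(void *addr, size_t size)
    {
        /* What pages_decommit() does on non-Windows, as I understand it:
         * overmap the range with PROT_NONE, so it no longer counts. */
        mmap(addr, size, PROT_NONE,
            MAP_FIXED | MAP_PRIVATE | MAP_ANON, -1, 0);
    }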
Seems to me it's an uninteresting metric to care about here. Yes, this bug makes us "leak" "private" address space on purpose. The important part is that RSS doesn't change because we MADV_DONTNEED it.
:glandium, would you recommend that we stop collecting private bytes for Linux since we do collect RSS? Would that apply to other platforms as well? I would rather fix what we are collecting by adjusting/removing it so we can track what is useful and meaningful.
Flags: needinfo?(mh+mozilla)
Linux aside, I would definitely suggest not collecting private bytes for Mac - the number seems to hover around 3.8 GiB for 10.6 (which is 32-bit, I think?) and about the same with more noise for 10.8.

I don't know if we could do anything to make it report a useful number (for instance, maybe the MALLOC_DOUBLE_PURGE stuff isn't kicking in because Talos doesn't call the relevant function), but it doesn't look useful now.
(In reply to Joel Maher (:jmaher) from comment #75)
> :glandium, would you recommend that we stop collecting private bytes for
> linux since we do collect rss?

yes.

> Would that apply to other platforms as well?

Depends what that would be measuring, since the code linked in comment 72 is linux specific. But if that's equivalent, I would tend to say yes, as long as we have an equivalent measurement like RSS.
Flags: needinfo?(mh+mozilla)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #76)
> Linux aside, I would definitely suggest not collecting private bytes for Mac
> - the number seems to hover around 3.8 GiB for 10.6 (which is 32-bit, I
> think?) and about the same with more noise for 10.8.

FWIW, this sounds similar to vsize, which on Mac is always 3 GiB or more.
This is causing a frequent crash when running Talos on Mac OS X 10.6 (bug 1100485). We see about 1 failure every 20 runs, so a lot of retriggers made it pretty easy to spot the culprit:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&fromchange=ea818fdbd81c&tochange=9b77a97a378b&jobname=Rev4%20MacOSX%20Snow%20Leopard%2010.6%20mozilla-inbound%20talos%20tp5o
Reopening this as it is causing crashes on OS X 10.6, as described in bug 1100485. Please take a look at fixing these crashes or consider backing this out. As we are uplifting in <2 weeks, we should have this resolved prior to that.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
No longer depends on: 1100457
This might help you debug. The patch at https://bugzilla.mozilla.org/show_bug.cgi?id=552020#c43 makes the Talos test crash a bit more often, which might help, though again only on 10.6. There is also a crash dump that has a few more symbols. You can also request a try machine that you can VNC into to debug.
This wasn't actually backed out, and I'm fixing the crashes in bug 1100485.

Mason, thanks for the tip! That definitely helped expose the problem on try.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED