Closed Bug 669245 Opened 8 years ago Closed 8 years ago

GC chunk choice is suboptimal

Categories

(Core :: JavaScript Engine, defect)


RESOLVED WONTFIX

People

(Reporter: khuey, Assigned: igor)

References

Details

(Keywords: memory-footprint, Whiteboard: [MemShrink:P2])

Attachments

(4 files)

The current algorithm for choosing the GC chunk for an allocation in a compartment is:

1. If the compartment has allocated since the last GC and the chunk it allocated in has free space, use this chunk.
2. Search existing chunks for space, and use the first chunk that has available space.
3. Allocate a new chunk if all existing chunks are full.

I think that the more compartments that have allocations in a given chunk, the less likely it is that that chunk can be reclaimed (this is the "I closed a bunch of tabs but my memory usage didn't drop" problem).  If you accept this premise, we would benefit from preferring to allocate from chunks we've already allocated from.  I propose that we do the following:

1. Keep a list of all of the GC chunks that a given compartment has allocated in.
2. Search this list (in order from newest to oldest) and use the most recently touched chunk that has available space.
3. Search existing chunks for space, and use the first chunk that has available space.
4. Allocate a new chunk if all existing chunks are full.
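The proposed steps could be sketched roughly as follows. This is a toy model for illustration only; `Chunk`, `Compartment`, `pickChunk`, and the free-arena counts are invented names, not the real SpiderMonkey structures:

```cpp
#include <cstddef>
#include <vector>

// Toy model of the proposal; names and sizes are made up for illustration.
struct Chunk {
    size_t freeArenas;                       // arenas still available
    bool hasSpace() const { return freeArenas > 0; }
};

struct Compartment {
    // Step 1: chunks this compartment has allocated from, newest first.
    std::vector<Chunk*> recentChunks;
};

Chunk* pickChunk(Compartment& comp, std::vector<Chunk*>& allChunks) {
    for (Chunk* c : comp.recentChunks)       // step 2: newest to oldest
        if (c->hasSpace())
            return c;
    for (Chunk* c : allChunks)               // step 3: first chunk with space
        if (c->hasSpace())
            return c;
    allChunks.push_back(new Chunk{16});      // step 4: allocate a new chunk
    return allChunks.back();
}
```

(The sketch leaks the new chunk; a real implementation would of course track ownership.)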

The only bit that seems tricky here is ensuring that GC chunks that are reclaimed get dropped from the list in (1).  I assume JSCompartment::sweep clears the chunk cache ptr precisely to avoid dealing with this.
Keywords: footprint
Whiteboard: [MemShrink]
Whiteboard: [MemShrink] → [MemShrink:P1]
AIUI the video in comment 1, jemalloc avoids fragmentation by always allocating into the chunk with the lowest address.  That is, it imposes a stable ordering on the chunks, and sticks to that.

I think that heuristic might be a good place to start, instead of allocating into the last-used chunk.  You'd use this ordering in the search steps in (2) and (3).

On the other hand, jemalloc's heuristic doesn't entirely fit here, because we have extra information in the form of the compartment associated with the allocation.  We could prefer to allocate into the chunk which houses objects from the fewest other compartments, falling back to a stable ordering (e.g. lowest-address) in case of a tie.  It's hard to say if this would help.
(In reply to comment #2)
> 
> On the other hand, jemalloc's heuristic doesn't entirely fit here, because
> we have extra information in the form of the compartment associated with the
> allocation.

Exactly.  Our situation is quite different.

> We could prefer to allocate into the chunk which houses objects
> from the fewest other compartments, falling back to a stable ordering (e.g.
> lowest-address) in case of a tie.  It's hard to say if this would help.

The goal is clear.  You want:
(a) Each compartment's objects should span as few chunks as possible.
(b) Each chunk should hold objects from as few compartments as possible.
(c) As few chunks as possible.

Any algorithm that ends with "Allocate a new chunk if all existing chunks are full" satisfies (c).  

Your suggestion directly targets (b), assuming you mean "houses objects from the fewest other compartments (once this object is added)".  But imagine you have a chunk C containing objects only from compartment X, and an empty chunk D lower in the address space than C.  If you allocate a new object from X, it'll be put in D rather than C, which is bad.  So you'd need some tweak to target (a) somewhat as well.

Kyle's algorithm targets (a).  It doesn't directly target (b), but the LRU aspect means that if allocations exhibit temporal locality (ie. a compartment that allocates will likely allocate again soon) I think it'll do fairly well on (b).
> (a) Each compartment's objects should span as few chunks as possible.
> (b) Each chunk should hold objects from as few compartments as possible.
> (c) As few chunks as possible.

To expand on that... (b) is desirable because it maximizes the chance that a chunk becomes empty (and so can be reused without constraint, or freed) when a single compartment closes.  That's the primary goal.  (a) is also desirable because it minimizes chunk churn, and because if you don't pay attention to it and have lots of empty chunks not yet freed, you could end up spreading objects thinly among them.

And we don't actually aim for (c), because we hold onto chunks for three GCs in order to minimize churn.  Hmm, maybe (a) and (c) are the same.
I agree that we shouldn't dirty an empty chunk if we have a perfectly good not-empty chunk which already contains allocations from the same compartment.  That seems obviously good (and fwiw I think is what jemalloc does).

Maybe we should only go into an empty chunk as a last resort.  Certainly not as a first resort.
But if you use MRT (most-recently touched) instead of lowest-address, you can end up in a situation where you allocate into many chunks because you're freeing out of many chunks, when instead you should be packing your allocations into a single chunk and letting the chunks whose objects are being freed become empty.
True.  This needs some careful thought.
> And we don't actually aim for (c), because we hold onto chunks for three GCs
> in order to minimize churn.  Hmm, maybe (a) and (c) are the same.

I don't think (a) and (c) are the same.  Given compartments A, B, C, D:

(c) minimize the number of chunks
AAAD
BBBD
CCCD

vs.

(a) minimize the number of chunks a compartment is in
AAA_
BBB_
CCC_
DDD_
(In reply to comment #6)
> But if you use MRT (most-recently touched) instead of lowest-address, you
> can end up in a situation where you allocate into many chunks because you're
> freeing out of many chunks, when instead you should be packing your
> allocations into a single chunk and letting the chunks whose objects are
> being freed become empty.

This is only true if you're looking at substantial churn *within* a compartment.  I don't think optimizing for that case is as important as optimizing for reclaiming as much memory as possible once a compartment is destroyed.
(In reply to comment #4)
> 
> And we don't actually aim for (c), because we hold onto chunks for three GCs
> in order to minimize churn.  Hmm, maybe (a) and (c) are the same.

That's not true any more. For random surfing we use the new GC_SHRINK trigger fairly often and return chunks right away. Our model changed from preferring to allocate new chunks rather than GC to the opposite.
(In reply to comment #9)
> This is only true if you're looking at substantial churn *within* a
> compartment.  I don't think optimizing for that case is as important as
> optimizing for reclaiming as much memory as possible once a compartment is
> destroyed.

I can't immediately think of a better strategy for this than preferring to allocate first into chunks which already contain an object from your compartment and second into chunks which contain the fewest other compartments.

The question is, how do you break ties in here?  MRT, or a stable ordering?
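One way to encode that preference order (own compartment first, then fewest other compartments, then a stable lowest-address tie-break) is a simple comparison. The bookkeeping fields here are hypothetical; the GC does not track per-chunk compartment counts today:

```cpp
#include <cstdint>

// Illustrative per-chunk bookkeeping; invented for this sketch.
struct ChunkInfo {
    uintptr_t addr;        // chunk base address (stable ordering)
    int compartments;      // distinct compartments with objects in the chunk
    bool holdsOurs;        // already holds objects from the allocating compartment?
};

// Returns true if x is a better allocation target than y.
bool betterChoice(const ChunkInfo& x, const ChunkInfo& y) {
    if (x.holdsOurs != y.holdsOurs)
        return x.holdsOurs;                  // prefer our own chunks
    int xOthers = x.compartments - (x.holdsOurs ? 1 : 0);
    int yOthers = y.compartments - (y.holdsOurs ? 1 : 0);
    if (xOthers != yOthers)
        return xOthers < yOthers;            // then fewest other compartments
    return x.addr < y.addr;                  // then stable lowest-address order
}
```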
(In reply to comment #3)
> The goal is clear.  You want:
> (a) Each compartment's objects should span as few chunks as possible.
> (b) Each chunk should hold objects from as few compartments as possible.
> (c) As few chunks as possible.

It isn't entirely clear to me that compartments are the best signifier of what memory is going to be released together, though they might be the most convenient. techcrunch's 70 compartments will all be released together when the tab is closed, so spreading them across chunks to minimize compartments per chunk will be counterproductive.

Looking forward, if we do a compartment per global, this problem will become even more prevalent.

You really want to minimize "lifetime categories" per chunk. Compartments are an even weaker indicator of a common lifetime, because there are so many short-lived objects. In GC terms, you don't want your nursery to share space with tenured objects. That's what jemalloc relies on with its lowest address first heuristic. We have additional information available, but overemphasizing it will lose jemalloc's advantages. (Compartment is a pretty powerful indicator of "even though this object seems to be long-lived, I know it will be freed together with these other objects", so it seems like we ought to be able to make use of it.)

Also, what is the benefit from each of njn's listed goals, excluding what you get from the others? Is it

(a) minimize fragmentation?
(b) maximize probability of freeing chunks when a compartment is destroyed
(c) minimize overall memory use

Mainly, (a) is unclear to me. As mccr8 points out, it's not the same as (c). But it's sort of (b)+(c). Perhaps it should just be discarded?
Before we spend efforts in optimizing arena allocation from chunks, perhaps we should consider if we need them at all? There were several reasons for their existence:

1. There was no portable way for fast allocation of 4K arenas so the GC needed to pool the arenas somehow.

2. Separating the mark bitmap from the arena made it possible to minimize the number of TLB cache misses.

Using big 1MB-aligned chunks made it possible to solve both of these with rather simple code for arena allocation. Such chunks were also helpful for the conservative GC work, as it was easy to put all the chunks into a small hashtable and quickly check it.

Now, this was done before the compartment work. As this and related bugs show, the drawback of chunks is increased fragmentation which we need to manage with increasing code complexity.

One way to avoid the fragmentation is to replace chunks with an arena pool and tolerate the extra TLB hit during marking. Another option is to use small (say 64K) per-compartment chunks. The third way is to do nothing and focus on the copying GC work that should nicely solve the fragmentation problem.
Why is the jemalloc memalign implementation too slow for point 1?
(In reply to comment #14)
> Why is the jemalloc memalign implementation too slow for point 1?

If I read bug 414946 correctly, we do not yet have jemalloc available on Mac.
Sorry, I didn't mean that to sound confrontational. I meant "is the jemalloc memalign implementation too slow for point 1"?

(jemalloc on mac should land soon, but if jemalloc is fast enough on windows/linux/android, we should use it. I know that mozalloc as implemented doesn't provide memalign, but we can trivially port your implementation into memory/mozalloc, and unconditionally provide memalign on all platforms).
(In reply to comment #16)

> but if jemalloc is fast enough on windows/linux/android, we should use it.

That has been tried, but it led to terrible ifdefs in the source with very different performance characteristics.

> but we can trivially port your implementation into
> memory/mozalloc, and unconditionally provide memalign on all platforms).

There is no "my implementation". The arena alignment requirement is satisfied automatically as arenas are allocated out of 1MB-aligned chunks.

If we had a universal, memory-efficient aligned arena allocator, I would be in favor of using it, perhaps with some simple per-container arena pooling to ensure optimal efficiency. But we do not have that option yet.

Irrespective of the fate of bug 414946, why can't we make jemalloc always available via custom names like jemalloc_malloc, jemalloc_memalign, etc.?
(In reply to comment #13)
> Yet the third way is do
> nothing and focus on the coping GC work that should nicely solve the
> fragmentation problem.

I am not that optimistic. Moving GCs can be very expensive. We should try to minimize the copying-workload as much as possible and we shouldn't expect that it will solve all our problems.

Let's compare with Google Chrome. They have a compacting GC. Entering gmail causes several mark-sweep GCs of about 60 ms and one mark-compact GC (25.5 MB -> 19.5 MB, 108 ms). The numbers are from my high-end MBP.

Our memory footprint is bigger and it will result in even bigger pauses. I think we have to use as much information as possible during allocation and consider the compacting GC as a backup mechanism. I think we have to fix fragmentation problems now in order to make a compacting GC even possible.
> Irrespective of the fate of the bug 414946 why cannot we make jemalloc always 
> available via custom names like jemalloc_malloc, jemalloc_memalign etc?

If we thought there was a good chance that bug 414946 wouldn't ever land and stick, then I might agree we should look into this.  But it seems that Paul is very close, so I'm not sure it's worth the effort.  (There's also a benefit to not hardcoding in jemalloc, even if we were to use it everywhere in production -- it lets people build with --disable-jemalloc [necessary for Windows setups], lets people make debug builds with --enable-trace-malloc, etc.)

Can we spin a patch which puts all the allocations through jemalloc's memalign and get an idea if this is fast on Windows and Linux?  One could even build with Paul's patch and test on Mac.
(In reply to comment #19)
> Can we spin a patch which puts all the allocations through jemalloc's
> memalign and get an idea if this is fast on Windows and Linux?  One could
> even build with Paul's patch and test on Mac.

It would certainly be trivial to test this (though I'd use Windows or Linux so we don't conflate this with what's going on with jemalloc on Mac). Replace the current arena allocation with a call to posix_memalign and see what happens.
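The experiment is essentially a one-liner per arena. A minimal sketch (assuming a POSIX platform and the 4K arena size from this discussion; `allocArena` is an invented name):

```cpp
#include <cstdlib>
#include <cstdint>

// GC arenas are 4K; posix_memalign hands back memory aligned to that size,
// so an arena's base can still be computed by masking an interior pointer.
constexpr size_t ArenaSize = 4096;

void* allocArena() {
    void* p = nullptr;
    // posix_memalign returns 0 on success; the alignment must be a power of
    // two and a multiple of sizeof(void*), which 4096 is.
    if (posix_memalign(&p, ArenaSize, ArenaSize) != 0)
        return nullptr;
    return p;
}
```

The open question in this thread is not correctness but whether the allocator backing posix_memalign is fast and memory-efficient enough for this allocation pattern.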
(In reply to comment #17)
> That has being tried but that lead to terrible ifdefs in the source with
> very different performance characteristics.

I know there are some nasty constraints here, but I'm pretty sure they can be defeated. The ifdefs in jsgcchunk, for example, are largely not necessary. We could put a memalign implementation into memory/moz_malloc/, and then assume all platforms have memalign/posix_memalign, which we can use to allocate pages at 4k boundaries.

> There is no "my implementation". 

Apologies, I didn't mean to offend. I was talking about the jsgchunk alignment code.


> The arena alignment requirement is
> satisfied automatically as arenas are allocated out of 1MB-aligned chunks.

But isn't this the problem we're trying to remove?


> If we would have universal memory-efficient aligned arena allocator, I am in
> favor of using it with perhaps some simple per-container arena pooling to
> ensure optimal efficiency. But we do not have yet that option.

I'm sure we can have that option with a small amount of work, in the browser. (I'm not talking about landing jemalloc-on-mac, even if that doesn't land we get these improvements for 95% of our user base).

 

> Irrespective of the fate of the bug 414946 why cannot we make jemalloc
> always available via custom names like jemalloc_malloc, jemalloc_memalign
> etc?

While I think this wouldn't be my first choice, this is mostly doable. The trouble with jemalloc is integrating it with system allocators, which isn't actually required for what you suggest, so we could do this even if it didn't land.
(In reply to comment #20)
> It would certainly be trivial to test this (though I'd use windows or linux
> so we do't conflate this with what's going on with jemalloc on mac). Replace
> the current arena allocation with a call to posix_memalign. See what happens.

It takes more work than that as we would need to change the layout of the mark bitmap to place it somewhere in the arena.
(In reply to comment #19)
> Can we spin a patch which puts all the allocations through jemalloc's
> memalign and get an idea if this is fast on Windows and Linux?  One could
> even build with Paul's patch and test on Mac.

This was tried about two years ago, and the conclusion was that with jemalloc on Linux this would work with some extra pooling. But now we have a conservative GC that complicates the picture. Without the chunks it would be necessary to put each arena into some data structure to check for addresses of live GC things, and that may add extra complexity.

For now I am in favor of experimenting with decreasing the chunk size. For example, what would happen with 256K or 64K chunks? If 64K chunks do not show a significant regression, then we can try to use one chunk per compartment, eliminating the fragmentation problem.
(In reply to comment #23)
> For now I am in favor of experimenting with decreasing the chunk size. For
> example, what would happen with 256K or 64K chunks? And if 64K chunk would
> not show significant regression, then we can try to use one chunk per
> compartment eliminating the fragmentation problem.

The reason I mentioned 64K chunks is that 64K is the smallest size that can be allocated using VirtualAlloc on Windows. Thus, if we still want to allow building SM when a memory-efficient posix_memalign is not available, we need to pack arenas into chunks that are at least 64K in size.
Blocks: 671702
(In reply to comment #24)

> The reason I mentioned 64K chunks is that 64K is the smallest size that can
> be allocated using VirtualAlloc on Windows. Thus, if we would still allow to
> build SM when memory-efficient posix_memalign is not available, we need to
> pack arena in chunks that are at least 64K in size.

Why do we want to require VirtualAlloc? Is it because Windows doesn't have "memory-efficient posix_memalign"? (We do have that with jemalloc on Windows).
(In reply to comment #25)
> Why do we want to require VirtualAlloc? Is it because Windows doesn't have
> "memory-efficient posix_memalign"? (We do have that with jemalloc on
> Windows).

We do not know if we can avoid packing arenas into chunks. Using just posix_memalign to get a 4K-aligned arena could result in a performance hit too big to tolerate, so we may need to keep supporting packing arenas into chunks.

On the other hand, if 64K chunks work without a noticeable performance hit and address the fragmentation problem, then we can just use them, especially as that would allow building the browser and standalone SM without a memory-efficient memalign being available.
Bug 671702 comment 19 says that the new allocator chooses chunks in a FIFO-ish manner.  I think it'll definitely be worth experimenting with a stable ordering here.
(In reply to comment #27)
> Bug 671702 comment 19 says that the new allocator chooses chunks in a
> FIFO-ish manner. 

There are two chunk choosers there. First, empty chunks are picked in FIFO order; then chunks from a particular container with some empty arenas are picked in the same order.

> I think it'll definitely be worth experimenting with a
> stable ordering here.

Yes, it should be worthwhile to replace both structures with a heap, or to sort the arrays during the GC.
No longer blocks: 671702
Whiteboard: [MemShrink:P1] → [MemShrink:P2]
If bug 671702 is implemented, then each chunk will belong to a single compartment and this bug will become moot (the chunk choices should be straightforward).
Depends on: 671702
(In reply to comment #29)
> If bug 671702 is implemented, then each chunk will belong to a single
> compartment and this bug will become moot (the chunk choices should be
> straightforward).

That's not necessarily true.  Our chunk choice still matters, because packing arenas into the smallest number of chunks lets us free more chunks.  If our chunk choice is effectively random, we will probably do a bad job packing arenas tightly.
Ah, good point.  But the impact of the heuristic will be smaller, yeah?  And we don't have to worry about predicting lifetimes of compartments, arenas are all the same size, and things within arenas are all the same size.  So the heuristic should be simple.  Allocating into the closest-to-full chunks is probably best, AFAICT -- we want to avoid allocating into almost empty chunks where possible, because those are the most likely to become completely empty when the next GC runs.
Allocating into closest-to-full sounds right to me.  It's late for proofs, but I think if all objects are the same size, closest-to-full is optimal if every object has an equal chance of being freed at any given point in time.  If objects are different sizes, I think it's optimal if an object's lifetime is proportional to its size, which seems kind of reasonable.

It would be cool if "closest-to-full" meant "largest number of bytes used for active allocations" rather than "largest number of bytes used for active arenas".
er, inversely proportional to its size.  Larger objects live longer than smaller objects, is the assumption.
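A sketch of the closest-to-full pick, using live bytes as suggested above. The fields and function are hypothetical, and comparing raw byte counts assumes all chunks have equal capacity (which they do, at 1MB each):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical chunk bookkeeping: "closest to full" measured by bytes used
// for live allocations rather than by committed arenas.
struct Chunk {
    size_t liveBytes;
    size_t capacityBytes;
};

// Pick the non-full chunk with the most live bytes; nullptr means a new
// chunk is needed.
Chunk* closestToFull(const std::vector<Chunk*>& chunks) {
    Chunk* best = nullptr;
    for (Chunk* c : chunks) {
        if (c->liveBytes >= c->capacityBytes)
            continue;                         // completely full: skip
        if (!best || c->liveBytes > best->liveBytes)
            best = c;
    }
    return best;
}
```

The intuition is that mostly-empty chunks are the ones most likely to become completely empty at the next GC, so they should be starved of new allocations.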
(In reply to comment #30)
> Our chunk choice still matters, because
> packing arenas into the smallest number of chunks lets us free more chunks. 
> If our chunk choice is effectively random, we will probably do a bad job
> packing arenas tightly.

With the patches in bug 671702 and bug 673795, arenas are picked from the chunk that most recently became not-full. I suppose that is rather random even without background finalization.

(In reply to comment #32)
> Allocating into closest-to-full sounds right to me. 

In the presence of background finalization, the notion of closest-to-full is not deterministic. jemalloc's choice of always picking the chunk with the smallest address may work better.
> In presence of background finalization the notion of closest-to-full is not 
> deterministic. jemalloc choice of always picking something with smallest 
> address may work better.

Does nondeterminism matter here?  Indeed, no matter what we do, it'll be nondeterministic in the face of GC scheduling -- if the GC is scheduled before we allocate, we'll likely allocate into a different space than if the GC is scheduled after.
(In reply to comment #35)
> Does nondeterminism matter here?  Indeed, no matter what we do, it'll be
> nondeterministic in the face of GC scheduling 

closest-to-full may change whenever a single arena in the chunk is freed by the background GC, while chunk-with-smallest-address changes only when an arena in a previously full chunk with a smaller address is released. This happens less frequently. But I agree that this is pretty moot until we get experimental data.
I guess I don't understand why it matters that closest-to-full changes more frequently than smallest-address.  Do we care?

But if you're going to test both of them, then let's just see.  :)
(In reply to comment #37)
> I guess I don't understand why it matters that closest-to-full changes more
> frequently than smallest-address.  Do we care?

Performance-wise frequent updates in the chunk order could be harmful.

> 
> But if you're going to test both of them, then let's just see.  :)

yes :)
The patch sorts the available arena and chunk lists using merge sort. This way the allocation always happens from the arena and chunk with the smallest address. I will use it to test fragmentation.
Assignee: general → igor
This patch sorts the available chunks (the lists of chunks with at least one free arena) and the available arena lists according to their addresses, so GC things are first allocated from the arena with the lowest address and new arenas are allocated from the chunk with the lowest address. This corresponds to what jemalloc does.
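The address sort described here can be illustrated with a standard linked-list merge sort keyed on the node's address. The `Node` layout and function names are invented for this sketch, not taken from the patch:

```cpp
#include <cstdint>

// Minimal singly-linked list node standing in for a chunk/arena list entry.
struct Node {
    Node* next;
    uintptr_t addr() const { return reinterpret_cast<uintptr_t>(this); }
};

// Merge two address-sorted lists into one.
static Node* merge(Node* a, Node* b) {
    Node head{nullptr};
    Node* tail = &head;
    while (a && b) {
        Node*& lower = (a->addr() < b->addr()) ? a : b;
        tail->next = lower;
        tail = lower;
        lower = lower->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

// Merge sort: split with slow/fast pointers, sort halves, merge.
Node* sortByAddress(Node* list) {
    if (!list || !list->next)
        return list;
    Node* slow = list;
    Node* fast = list->next;
    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
    }
    Node* right = slow->next;
    slow->next = nullptr;
    return merge(sortByAddress(list), sortByAddress(right));
}
```

Merge sort suits these lists well: it needs no random access, no extra allocation, and runs in O(n log n) on singly-linked lists.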

To test the patch I did the following about:memory measurements:

1. After starting the browser and running gmail (gmail1)
2. After closing gmail1, opening all windows from gregor-wagner.com/tmp/mem and opening a new tab with gmail (gmail2).
3. After closing all windows except gmail2
4. After closing gmail2 and running gmail3

Ideally the numbers for cases 3 and 4 should match case 1.

Here are the numbers for the MC tip:

                              gmail      all+gmail2   gmail2      gmail3

heap-allocated               39.80 MB    792.34 MB    65.34 MB    64.58 MB
heap-committed               78.00 MB    878.00 MB   873.00 MB   871.00 MB
heap-dirty                    3.80 MB      2.49 MB     3.71 MB     3.81 MB
heap-unallocated             38.20 MB     85.66 MB   807.66 MB   806.42 MB
js-compartments-system              2            2           2           2
js-compartments-user                2          244           2           2
js-gc-heap                   17.00 MB    376.00 MB    60.00 MB    52.00 MB
js-gc-heap-arena-unused       1.85 MB     73.19 MB     5.92 MB     3.81 MB
js-gc-heap-chunk-clean-unuse  0.00 MB      2.00 MB     0.00 MB     0.00 MB
js-gc-heap-chunk-dirty-unuse  1.30 MB      4.42 MB    39.14 MB    34.04 MB
js-gc-heap-unused-fraction     18.58%       21.17%      75.09%      72.78%
page-faults-hard                    2            3           3           3
page-faults-soft               48,837    1,558,121   1,739,117   1,769,136
resident                    102.47 MB  1,337.43 MB   318.79 MB   266.40 MB
vsize                       338.70 MB  1,642.93 MB 1,220.86 MB 1,195.21 MB


Numbers for sorted arenas:

                              gmail      all+gmail2   gmail2      gmail3

heap-allocated               42.68 MB    781.94 MB    65.80 MB    66.32 MB
heap-committed               82.00 MB    880.00 MB   873.00 MB   869.00 MB
heap-dirty                    3.46 MB      2.45 MB     3.15 MB     3.36 MB
heap-unallocated             39.32 MB     98.06 MB   807.20 MB   802.68 MB
js-compartments-system              2            2           2           2
js-compartments-user                2          242           2           2
js-gc-heap                   17.00 MB    379.00 MB   128.00 MB    40.00 MB
js-gc-heap-arena-unused       1.78 MB     75.03 MB     6.92 MB     2.80 MB
js-gc-heap-chunk-clean-unuse  0.00 MB      0.00 MB     0.00 MB     0.00 MB
js-gc-heap-chunk-dirty-unuse  1.43 MB      7.29 MB   104.89 MB    22.48 MB
js-gc-heap-unused-fraction     18.91%       21.72%      87.35%      63.22%
page-faults-hard                    3            7           8           8
page-faults-soft               48,676    2,680,661   2,789,169   2,825,692
resident                    115.05 MB  1,347.10 MB   412.01 MB   291.93 MB
vsize                       341.71 MB  1,691.57 MB 1,331.16 MB 1,189.02 MB


It is interesting that the gmail2 case, when its tab was opened before all the other 150+ tabs were closed, shows that the JS heap increased from 60 to 128 MB. It was necessary to close this gmail and open a new one before the patch showed a win over the base. The patch also shows a noticeable increase in page faults; I suppose that is a consequence of not using the most recently added arena/chunk in the list, which is what happens without the patch. So for the case when one opens a new tab and closes many previously opened ones, the address sorting makes things worse.
Well, that's kind of disappointing!  I suppose that rather than guessing at what eviction strategy we should use, we'll actually have to look at the allocation/fragmentation pattern and figure it out.  What do you think, Igor?
(In reply to Justin Lebar [:jlebar] from comment #41)
> I suppose that rather than guessing at
> what eviction strategy we should use, we'll actually have to look at the
> allocation/fragmentation pattern and figure it out.

If one knows the pattern, then it is possible to tune the allocation strategy so it works perfectly for it. But that strategy will fail miserably for other allocation patterns. An allocation strategy should be sufficiently robust, yet simple enough to understand, to judge its worst-case scenario. But then there is not so much to choose from; if we consider the four interesting sorting strategies -- address sorting, the least/most-allocated arenas/chunks, and least recently used (effectively what we use currently) -- then we should be covered.
This patch comes with a templated MergeSort implementation and allows implementing different selection strategies. In this patch the code uses the least-allocated arena and chunk strategy. My initial thinking was that this should be very bad, but it is useful to have it for comparison.
Here are the test results for various allocation strategies. I repeat the about:memory measurement points here for reference:

1. After starting the browser, opening about:memory and gmail in another tab (gmail1)
2. After closing gmail, opening all windows from gregor-wagner.com/tmp/mem and opening a new tab with a new gmail instance (all+gmail2)
3. After closing all windows except about:memory and gmail (gmail2)
4. After closing gmail and opening gmail again (gmail3)

Ideally the numbers for cases 3 and 4 should match case 1.

The attachment contains the results for the TIP, sorting arenas and chunks according to their addresses, the most-allocated and least-allocated strategies, using 64K chunks (bug 671702), and using 64K chunks + arena/chunk address sorting (a version of this patch adapted for bug 671702).

As the patches here and in bug 671702 do not affect the malloc heap, and the amount of JS allocation varies little between test runs, the most relevant number is the size of the JS heap. Here is just that number, listed for convenience:

                          gmail       all+gmail2   gmail2        gmail3

MC tip                    17.00 MB    376.00 MB     60.00 MB     52.00 MB

address-sorted            17.00 MB    379.00 MB    128.00 MB     40.00 MB
arena and chunks

use most-allocated        17.00 MB    393.00 MB     67.00 MB     63.00 MB
arena/chunk first

use least-allocated       18.00 MB    381.00 MB    132.00 MB     40.00 MB
arena/chunk first

64K per-compartment       18.31 MB    439.25 MB     32.44 MB     31.81 MB
 chunks

64K per-compartment       17.13 MB    431.06 MB     32.38 MB     30.81 MB
chunks + address sorting

So it looks like the MC tip is the winner if we just allocate memory and add tabs, but when we start to close windows and tabs, 64K per-compartment chunks win significantly, and address sorting on top of that brings an extra improvement.
So what should we do here?
(In reply to Nicholas Nethercote [:njn] from comment #45)
> So what should we do here?

With the arena decommit, an excessive number of chunks is now significantly less harmful. However, I want this bug to stay open, as trying to allocate things for a compartment from the same chunk is useful from a performance point of view. Plus, if we really can shrink the number of chunks, it will also allow releasing the non-arena parts of the chunks and decrease the address space fragmentation.
There are only two things I see as important: minimizing the number of pages used, because allocated but unused pages will be flushed out to disk, and preparing for Electrolysis, in which, if I understand correctly, each compartment will run in its own process, so the concept of sharing chunks between compartments will be nonexistent. Is address space fragmentation really relevant? Sure, it looks good to be _allocating_ less memory, but what really matters is to be _using_ fewer pages. Each compartment should be responsible for its own chunks.
Most of the shortest-lived items shouldn't be in GC'd memory and should instead be allocated on the stack.
This bug isn't going to go anywhere, and generational GC is going to solve this problem much more comprehensively...  WONTFIX.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX