Open Bug 166701 Opened 17 years ago Updated 8 years ago

arena free list allocates memory inefficiently

Categories

(NSPR :: NSPR, defect, P2)

4.2.1
defect

Tracking

(Not tracked)

People

(Reporter: nelson, Assigned: wtc)

References

Details

(Keywords: memory-footprint, perf)

Attachments

(1 file, 1 obsolete file)

This bug is the compantion to bugzilla bug 160635, 160640, and 164501.

arena_freelist, the free list of PLArenas, in a singly linked list. 
Newly freed arenas are added to the head of the list by FreeArenaList().
When a new arena is needed, PLArenaAllocate takes the first arena on the 
freelist that is large enough to satisfy the request (a "first fit" algorithm).

There is no limit to the size of arenas put on this free list.
This means that a multi-megabyte arena may be put on the free list, and 
then used to satisfy a request for a typical 2KB arena.  

Programs that repeatedly use and then free arena's containing very large
(e.g. multi-megabyte) items tend to grow and grow because the huge arenas
get used up in arena pools where small arenas would do, necessitating the
allocation of more new huge arenas each time a huge arena is needed.

I have written two patches to NSPR to address this problem.
Both patches are attached to bug 164501.  They are
http://bugzilla.mozilla.org/attachment.cgi?id=96752&action=view
http://bugzilla.mozilla.org/attachment.cgi?id=96763&action=view

The first patch sets a 100KB limit on the size of arenas returned to the 
free list.  Arenas larger than that are always freed, rather than being
put on the free list.

The second patch is a superset of the first.  It also changes the 
arena_freelist into 4 arena free lists, one for each of the following ranges
of arena size:
    0k < x <= 2k
    2k < x <= 8k
    8k < x <= 32k
   32k < x 
And it also keeps the arenas on each list sorted in ascending order of size.
This causes the "first fit" allocation algorithm to also be a "best fit" 
algorithm.   

I believe the second of these two patches will do the best job of efficiently
allocating memory for arenas, but it may also cause a performance degradation.

The first of these two patches should not cause any performance degradation,
and will bound the size of an arena that can be allocated when a small arena
is requested. 

If the second patch doesn't hurt performance, IMO it should be used.
I think this is a good candidate for NSPR 4.2.2.
Priority: -- → P1
Target Milestone: --- → 4.2.2
reassigning to myself.  Adding Julien to cc list.
Assignee: wtc → nelsonb
This patch is a replacement for
http://bugzilla.mozilla.org/attachment.cgi?id=96763&action=view
which had a syntax error.  The old patch was to rev 3.6.2.1.
This patch is against rev 3.11, which is currently the tip.
This patch compiles without warning with MSVC 6, but is untested.
"Programs that repeatedly use and then free arena's containing very large
(e.g. multi-megabyte) items tend to grow and grow"

Does it affect Mozilla ?
Is this the cause for all bug reports like "free is broken !", "mozilla does not
release memory!" because people can see that Mozilla memory usage will never
stop increasing ?
No Bernard.  This is not the cause of reports that "free is broken".
Status: NEW → ASSIGNED
There is a bug in patch id 98189.
http://bugzilla.mozilla.org/attachment.cgi?id=98189&action=view
That patch contains a line that reads:

        while (f && f->limit - f->base > nb) {

It should read:

        while (f && f->limit - f->base < nb) {

The bug caused the free list to be sorted in descending order by size, 
which made for very poor memory utilization when allocation uses first fit.
The fix will put the list in ascending order as intended.
Comment on attachment 98189 [details] [diff] [review]
have 4 freelists, ordered by ascending size of free arenas

This patch is flawed in another way.  The value of pool->arenasize is typically
2048, which means that the size of a typical arena is 2k+55.  So, we never
allocate blocks <= 2kb, and the free list of blocks <= 2KB remains empty.  When
looking for another 2kb block, we don't search the list of blocks > 2KB.  
I'll work on an improved version.
Attachment #98189 - Attachment is obsolete: true
Attachment #98189 - Flags: needs-work+
Did you consider removing the global freelist ?
Does it make any difference ?
Comment on attachment 98189 [details] [diff] [review]
have 4 freelists, ordered by ascending size of free arenas

The following comments apply only to the first part
of this patch: do not recycle arenas whose size is
larger than PL_MAX_FREE_ARENA_SIZE (100K).

I am afraid that I can't take this change because
100K may be considered large for most applications,
it may be the normal arena size for some applications.

There are two possible alternatives.

One is to add a variant of the PL_FreeArenaPool function
that takes a 'max_free_arena_size' argument.  Arenas
larger than 'max_free_arena_size' will be freed instead
of being put to the arena freelist.

The second solution is to add a variant of the
PL_FreeArenaPool function that frees any arenas
larger than the default arena size of the arena
pool.
Here are some other ideas that occur to me.

1. When searching a free list to find a block that fits the request, 
determine the size of the arena that would be allocated from the heap if
we fail to find a suitable arena on the free list, and the search for a
free arena of that exact size.  This should completely avoid the behavior
that emulates a leak, where searches for a certain request size search
one list, and when the block allocated for that size is later freed, it 
is freed to a different list than the one searched for that size.

2. Have a separate free list for each unique "arenasize".  That is, 
each PLArenaPool has a minimum arena size called "arenasize".  In NSS,
there are just a few different values (all fixed at compile time) used
as the arenasize for different arena pools.  Instead of maintaining a 
fixed number of free lists for various ranges of arena sizes that may be
unrelated to the arena sizes used by the application, maintain a separate
arena free list for each unique "arenasize" value used by the application.
The number of these free lists might grow over time.  If the application 
uses just two different values of arenasize, 2k and 8k, then there will 
be just two arena free lists, one for 2k arenas, and one for 8k arenas.  
There are some variants to this scheme:
a) any arena whose size does not match the PLArenaPool's arenasize would 
always be freed (instead of put on the free list), regardless of the value 
of "reallyfree", or 
b) any arena whose size is more than 2x the PLArenaPool's arenasize would 
always be freed.  (or 5x or 10x)

3. IIRC, my previous patch made the code always zero the arenas before 
putting them on the free list.  A one-line change would determine if that
alone was having a detrimental performance effect.
Bernard, regarding your question in Comment #8 above, I think your question
is asking whether the free list is known to have any positive performance 
impact at all, or not.  Assuming that's the intent of your question...

The shortest answer is that, in the absense of NSPR's new "zone allocator",
and when very-large CRLs are not being used, the arena free list has a 
positive performance impact.  Its effect is only detrimental when certain 
atypically large data items get allocated in an arenapool.  We're trying
to reduce the detrimental impact of those very large data items.

I think NSPR's new zone allocator accomplishes the same task as the arena
free list, and does it better.  In fact, the arena free list may be 
detrimental when the zone allocator is being used.  But not all 
applications choose to use the zone allocator.  Those that don't still 
need some benefit from the arena free list, I think.
Wan-Teh,

How about this alternate definition of PL_MAX_FREE_ARENA_SIZE ?

#define PL_MAX_FREE_ARENA_SIZE (pool->arenasize * 10)
Nelson, that addresses my concern and doesn't require adding
a new function.
*** Bug 281866 has been marked as a duplicate of this bug. ***
QA Contact: wtchang → nspr
Assignee: nelson → nspr
Status: ASSIGNED → NEW
Priority: P1 → P3
Target Milestone: 4.2.2 → ---
I have encountered this bug in my analysis of PLArena use.  Here is an example of a PLArenaPool which has arenasize 4848, but somehow owns an arena of size 1K (at the bottom of the arenas list, below).

TokenPool
   Pool: 2cc964a4
      name: TokenPool
      arenaSize: 4848
      arenas:
         "34d7e00" (4871,5120)
         "34d6a00" (4871,5120)
         "34d5600" (4871,5120)
         "34d4200" (4871,5120)
         "34a3400" (4871,5120)
         "347e400" (1043,1536)

I am keen to find a solution to this bug, please let me know if I can help.  I have developed extensive PLArena-related analysis tools.  Maybe I can take the bug?
Keywords: footprint, perf
Robin,, please feel free (even invited) to write a patch to address this bug,
and attach it to this bug.  If it shows promise, we'll be happy to assign 
this bug to you.  :)  That goes for all the bugs related to PLArenaPools.
Here's some suggestions.  If you like one, let me know, and assign the bug to me.

Solution 1.

Make the global arena_freelist be a per-pool structure.  Then, pools become responsible for their consumption of memory, including unused arenas which are still in their free lists.

This has the disadvantage that pools cannot share unused arenas.

Solution 2.

Keep the global arena_freelist, but bound it by all three of:
1. # of arenas in it
2. the sum total of their sizes
3. new arenas going into the list must be "reasonably sized" (like, say 256 <= x <= 32K where x is the arena size).
Those three values should be configurable by users.

Then, change the PL_ArenaAllocate code to choose only appropriately sized arenas from that list.

This solution has the disadvantage that the arena_freelist may become "bullied" by an arena pool which is filling up the list with arenas which are never of the appropriate size for other arena pools.

Solution 3.

Associate with each pool a single "free arena".  When there is no current free arena, and the pool calls PL_FreeArenaPool, the first arena to be freed becomes the free arena.  When a pool calls PL_ArenaAllocate, it first checks its own free arena to see if it satisfies the allocation, which it very likely will.

Then also make the global arena_freelist like in Solution 2.

The single free arena per pool prevents a pool from "thrashing" - allocating and then deallocating an arena over and over.  It also introduces a "pool responsibility" for consumed memory.  In other words, one can blame a particular pool if its free arena is too big.  This solution also helps with the arena "bullying" problem of Solution 2, but only a little.

The disadvantage of all of these is that they complicate the arenas, which were meant to be a simple memory management structures.
Robin, Thanks for your interest in this bug. 
I have considered your 3 proposals of comment 17, in light of the patterns
of usage of arena pools that NSS is known to exhibit.  I don't know what 
(if any) other software uses NSPR's PLArenaPools, and I don't know what 
their usage patterns are like.  

(IIRC, JavaScript has its own separate implementation of arena pools, and 
so its usage patterns do not affect the design of PLArenaPools, nor do 
changes in PLArenaPools benefit it.  Please feel free to disabuse me of
that understanding if it is mistaken.)

NSS uses ArenaPools in two ways:
a) as "object allocators", keeping all the memory allocated to hold 
components of a complex object together in a single arena pool.  When the 
object is to be destroyed, the entire arenapool for the object is destroyed 
(with PL_FreeArenaPool).  Objects may have long lives or short ones.  ArenaPools for objects generally grow initially, until the object is 
completely allocated, then stabilize.  They *may* grow thereafter, but as a 
rule they never shrink, never give up any of the arenas in the pool, until 
the entire object pool is destroyed (all arenas freed to the free list).  

b) As an alloca-alternative.  A complex function that will call many subfunctions to do its job may allocate an arena pool and pass it along to 
the subordinate functions.  When the call stack finally returns completely,
the outermost function (which allocated the arena pool) destroys it to free
all its arenas.  This use is generally rather short lived (fraction of a 
second).  It may experience some shrinkage during its short lifetime, 
because subroutine functions may "mark" the pool at the beginning, and may
"release" memory back to their mark at the end.  But the cases where they
release memory are typically error cases, which tend to be followed very
shortly by the destruction of the entire arena pool.  

So, in short, I don't see NSS's usage patterns benefiting significantly
from any per-arena-pool free lists.  I think proposal #1 is a non-starter.
It's not clear if you intended proposal 3 to be a new per-pool free arena
_in addition to_ the global free list, or _instead of_ the global free list.
It might have some benefit for the alloca-alternative if it was used in 
addition to the global free list, but I would expect the result to be
very minor in NSS.  

Idea 2, bounding the global free list by total size and total number of
arenas, seem s to have potential.  I might add also adding a ceiling on 
the size of arenas placed into the free list, although there is 
disagreement about how the ceiling value should be determined (what its 
default should be).  There clearly needs to be one ceiling for the entire 
free list, rather than one ceiling for each arena pool (as has been 
suggested).

The benefits that NSS gets from using arena pools are thought to be:

a) faster allocation than malloc when the request can be satisfied from
unused space in an already-allocated arena in the pool.  Functions that 
do lots of small allocations will hit the heap less, and go faster. 
This is known to be a significant advantage.

b) faster and simpler deallocation of many small pieces  at the end, 
because only the large arenas need to be freed, not each of the 
individual small allocations from those arenas.  This is also thought
to reduce the occurrance of leaks.

c) allocation from and freeing arenas to the global arena free list is
thought to be quicker than using the heap.  However, some heap 
implementations are very efficient (they don't use a single heap-lock,
and so heap lock contention in multi-threqaded appliations is not such a 
serious problem with them.  In the presence of such heap implementations,
the single global lock on the PL Arena free list may become a hot lock.

I once did an experiment that caused the PL arena free list to be kept in
ascending order by arena size, so that the "first fit" allocation 
technique was also a "best fit" technique.  However, while this may have
reduced memory occupied by the entire set of arenas, it added greatly to 
the time required to allocate or free an an arena from/to the free list.
The resultant performance loos was deemed unacceptable.  That led to the 
subsequent work described above in this bug and other NSPR bugs as the 
"zone allocator".  The zone allocator was successful and minimizing time
spent in memory allocation code, but used more memory than the single
sorted free list.  

That's how NSS uses arena pools.  If you have info on the usage patterns
of other users, please share it with us.
I wrote:
> I once [...] caused the PL arena free list to be kept in ascending order 
> by arena size [...]  while this reduced [the total] memory occupied by 
> the entire set of arenas, it added greatly to the time required to allocate 
> or free an an arena from/to the free list.  The resultant performance loos 
> was deemed unacceptable.

I should point out that the performance loss was due to the time spent 
traversing the VERY LONG arena free list, skipping over the multitudes of
small arenas trying to find one of adequate size.  There are a number of 
ways that could be mitigated, including:
- bounding the total length of the arena free list,
- bounding the number of arenas of a given size range (zone) in the list, 
- having separate arena free lists for different size ranges (zones),
- using a true zone allocator.  (NSPR has a zone allocator now, seldom used.)

A willing person could experiment with some of the above ideas.  It should 
be a VERY simple experiment to use NSPR's zone allocator to see what effect 
that has on the various metrics we're observing (total memory footprint,
performance) on platforms that support it, because that requires no new code.  
On systems that use pthreads, it requires only setting the NSPR_USE_ZONE_ALLOCATOR environment variable to 1.  If those results are 
promising, it could be made to work on other platforms with rather little 
work.
(In reply to comment #18)
> That's how NSS uses arena pools.  If you have info on the usage patterns
> of other users, please share it with us.

Yes, I have tons of data on that.  I will post up-to-date stats later today.

Meanwhile, can you point me to a test or workload which exercises NSS's use of PLArenas?

The attached file is output from my MemStats program.  It shows some of the output after starting and stopping Firefox, at the high water mark of memory use (by PLArena code).

In other words, at the time that the PLArena code was using the most memory, the pools shown were in existence, and had the shown arenas associated with them.

The output shows the sizes of the arenas in the form (x,y).  Here, 'x' is the size requested for the arena, and 'y' (always >= x) is the actual size that malloc assigned to the allocation.  (y-x) bytes are wasted.

The pools are sorted by name.  There may be several pools with the same name.

I'm sure that I'm not explaining the output fully.  Please ask questions, if you like.
This bug is affecting my ability to fix bug 421747 without increasing memory footprint.  This bug deserves a priority increase.
Priority: P3 → P2
Blocks: 421747
Whiteboard: [MemShrink]
I think this bug still exists.  It's not clear how much memory it's causing us to waste today.

To summarize the issue as I understand it: In PL_ArenaAllocate, we use three strategies in succession to try to allocate an arena out of an arena pool:

 * First, we look through the pool's list of free arenas.  We return the first arena which fits the requested allocation, if one exists.
 * Next, we look through the global arena freelist.  Again, we return the first arena which fits.
 * Last, we malloc the arena.

The problem with using first-fit here is that we might snag a very large arena to satisfy a very small allocation request.

I'm not totally sure how this code works; it looks like arenas can be parceled out to satisfy more than one request?  But in any case, we should investigate whether this is causing us to waste memory.  If it is, my inclination would be to get rid of the global freelist, since the allocator keeps one anyway.
(In reply to Justin Lebar [:jlebar] from comment #23)
>  * First, we look through the pool's list of free arenas.  We return the
> first arena which fits the requested allocation, if one exists.
>  * Next, we look through the global arena freelist.  Again, we return the
> first arena which fits.
>  * Last, we malloc the arena.
> 
> The problem with using first-fit here is that we might snag a very large
> arena to satisfy a very small allocation request.
If it loops through all the arenas anyway, how hard would it be stop breaking out early and choose the arena with the least amount of free space that can fit the requested allocation? That also sounds like
1) it might work well with a generational GC on pages that go through a lot of similar objects, and
2) improve the probability of an almost free arena becoming entirely free more quickly (as allocations into it are avoided).
> If it loops through all the arenas anyway, how hard would it be stop breaking out early 
> and choose the arena with the least amount of free space that can fit the requested 
> allocation?

This is the best-fit strategy, whereas what we currently have is the first-fit strategy.  It might be better, but it might be worse.  For instance, if the free list is a bunch of 4KB arenas, and we're asking for a 3.99KB arena, we'll search through the whole list looking for the ideal match, when we should just settle for the first 4KB arena.  That's a waste of time.

It's a theorem that for *any* (deterministic) allocation strategy, there exists a workload which will cause it to suck.  The base allocator (malloc) already has to implement a bucketing and caching strategy, and so layering an allocator on top of malloc just increases the chances that our workload is going to cause us to suck.

That's why my first instinct is to see whether we can make plarena use malloc more.  The simpler plarena is, the less likely it is that a given workload will cause us to suck.
Hmm, fair enough. I have a few ideas, but (currently) no way to test them myself, and you're thinking in a different direction, so feel free to ignore them. 

These are not necessarily mutually exclusive.
1) Break early on a match that's 'good enough'. This won't help if the only nearly full arenas happen to be near the end of the iteration.
2) Keep track of the best match so far and break when an arena with enough free space has been found AND we've already iterated over a set limit of arenas (would benefit from randomization in the iteration order).
3) Keep a list of arenas sorted on available space, and update it with insertion sort (or some other sorting algorithm that works well on nearly-sorted lists) after a) a big allocation or b) n small allocations. Then use a binary search to quickly find an arena with enough space. If the size is out of date and it doesn't fit, try the next arena in the list and so on. This feels a little bit like refcounting though, which is always tricky.
Assignee: nspr → nnethercote
Whiteboard: [MemShrink] → [MemShrink:P2]
(In reply to Justin Lebar [:jlebar] from comment #23)
> 
> The problem with using first-fit here is that we might snag a very large
> arena to satisfy a very small allocation request.

Even if that happened, it wouldn't be a problem.  Subsequent allocation requests would be satisfied out of the snagged arena.
 
> I'm not totally sure how this code works; it looks like arenas can be
> parceled out to satisfy more than one request?

Yes.  An "arena pool" is a linked list of arenas.  An arena is just a contiguous chunk of memory with a header.  Allocations are satisfied out of arenas;  each arena can contain multiple allocations.

> But in any case, we should investigate whether this is causing us to waste 
> memory.

I don't think it is.  In Firefox, some ad hoc profiling showed me that the most common arena sizes are 1KB and 2KB, and the most common allocation request sizes are 24--336 bytes (this accounts for over 99% of allocations), and none were over 2KB.  This means that when we fall back to looking at the free-list, if it's non-empty the first arena in it is good enough almost always.

In other words, this would only be a problem if we ever had enormous arenas, and in Mozilla code we don't.  Since this is an NSPR bug and non-Mozilla apps that use it might have different patterns, we can leave it open.  But I'm going to remove the MemShrink annotation because I don't think it's relevant to Mozilla.
Whiteboard: [MemShrink:P2]
Assignee: nnethercote → wtc
Thanks for looking into this, Nick.
Nick, did you test while using https?
IIRC, libSSL allocates arenas that are bigger than 2kB.
Probably not.  I'm still not worried.
This bug is about a particular behaviour in the PLArena allocator. The problematic behaviour is described at the top of this page.

In comment#15, I describe a symptom of the behaviour. Namely, that arena pools wind up owning arenas that are smaller than their 'arenasize' property. See attachment#1 [details] [diff] [review]'s first section, CategoryManagerArena, for an example of an arena pool which owns arenas smaller than it ever should.

Attachment#1 [details] [diff] is a listing of all of the PLArenaPools at a time when Firefox memory usage is at a high water mark.

I mention this because subsequent discussions here make me think that some of us believe that there is no actual bug here, merely an inefficient allocation strategy. There is an inefficient allocation strategy at play, but that is not what this bug is about. There *is* a real bug here. It is causing memory to be wasted in Firefox. It is surely causing leakage in all users of NSPR. Really.

Apologies if everyone did understand that.
You need to log in before you can comment on or make changes to this bug.