Open Bug 487394 Opened 16 years ago Updated 2 years ago

investigate setting NSS_DISABLE_ARENA_FREE_LIST so that NSS doesn't hold on to memory it's not using

Categories

(Core :: Security: PSM, defect, P3)

People

(Reporter: rob, Unassigned)

Details

(Whiteboard: [psm-logic][psm-backlog])

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2a1pre) Gecko/20090407 Minefield/3.6a1pre
Build Identifier: trunk
From bug #485052... The committed patch allocates memory when PSM initializes NSS, and attempts to deallocate that memory when PSM deinitializes NSS. This causes a leak regression because currently neither NSS nor PSM calls PL_ArenaFinish() to properly free (i.e. return to the heap) the global PLArena free list.
Boris says (comment #32) that the rest of Gecko avoids bloat when it uses arenas by always calling PL_ArenaFinish() within a well-defined lifetime.
Nelson asserts (comment #25) that NSS should not call PL_ArenaFinish() itself. He suggests that PSM could work around the issue by setting the NSS_DISABLE_ARENA_FREE_LIST environment variable before initializing NSS.
On https://developer.mozilla.org/en/NSS_Memory_allocation, Nelson notes that disabling the arena free list "makes NSS slower". Which is better: speed or lack of bloat?
Reproducible: Always
Summary: PSM should define NSS_DISABLE_AREA_FREE_LIST before initializing NSS → PSM should define NSS_DISABLE_ARENA_FREE_LIST before initializing NSS
Version: unspecified → Trunk
To be clear: Bug 485052 caused an Lk regression (trace-malloc leaks at shutdown) of about 15kB for Firefox on all platforms; a similar regression is seen on the Thunderbird tinderboxes. Although the implementation of bug 485052 isn't at fault, it has exposed (by increasing leak figures) the lack of cleanup of the arenas. There is a general effort to reduce leaks to zero. IMHO, having some shutdown leaks "allowed" will cloud what is a leak and what isn't, especially when patches like the one in bug 485052 increase leaks in an "expected" area.
Flags: blocking1.9.1?
Summary: PSM should define NSS_DISABLE_ARENA_FREE_LIST before initializing NSS → PSM should define NSS_DISABLE_ARENA_FREE_LIST before initializing NSS / Lk regression on 7th April 2009
Status: UNCONFIRMED → NEW
Ever confirmed: true
If we just want to fix the shutdown leak, we need to add a PL_ArenaFinish call during shutdown. If the memory possibly used by the NSS arenas is unbounded, then we need to set the NSS_DISABLE_ARENA_FREE_LIST environment variable. If it's bounded, what is the bound?
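The second option above could be sketched roughly as follows. This is a minimal illustration, not PSM code: the helper name `disable_nss_arena_free_list` is hypothetical, the value "1" is an arbitrary choice made on the assumption that only the presence of the variable matters, and POSIX `setenv` stands in for whatever environment API PSM would actually use. The call would have to happen before NSS is initialized.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>

#define ARENA_ENV_NAME "NSS_DISABLE_ARENA_FREE_LIST"

/* Hypothetical helper, not PSM code: ask NSS to bypass the arena free
 * list. Intended to run before NSS is initialized.
 * Returns 0 on success, -1 on failure (as setenv does). */
int disable_nss_arena_free_list(void)
{
    /* overwrite = 0 keeps any value the user has already exported. */
    int rv = setenv(ARENA_ENV_NAME, "1", 0);

    /* If setenv succeeded, the variable must now be visible. */
    assert(rv != 0 || getenv(ARENA_ENV_NAME) != NULL);
    return rv;
}
```

Passing 0 as the overwrite argument is a design choice for the sketch: it leaves an existing user setting untouched.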
Not a regression, not a blocker, but we'd take a patch. Feel free to renominate if there are compelling reasons to block on it.
Flags: blocking1.9.1? → blocking1.9.1-
And just to make it clear, the question in comment 2 needs answering.
I want to clarify a number of points, some at Rob's request.
1. A PLArenaPool is a small structure that keeps track of a list of memory blocks known as PLArenas. PLArenas have a minimum size, which is often larger than the size of a typical small data structure. When you attempt to allocate memory from a PLArenaPool, the code attempts (in this order) to allocate it from unused space in one of the arenas already associated with that PLArenaPool, or it tries to find a "free" PLArena on the global PLArena free list, or it allocates a new PLArena from the heap and links that into the PLArenaPool.
Given a PLArenaPool holding one or more PLArenas, there is no public function to free just a single one of those arenas. Instead of freeing the individual PLArenas, the caller "frees" the entire PLArenaPool. There are two ways to do this. One method, PL_FinishArenaPool, frees all the PLArenas back to the heap. The other method, PL_FreeArenaPool, takes all the PLArenas away from the PLArenaPool and puts them on the global free list of PLArenas.
When it wants to destroy a PLArenaPool, NSS calls PL_FreeArenaPool, unless the NSS_DISABLE_ARENA_FREE_LIST environment variable is set, in which case NSS calls PL_FinishArenaPool. The environment variable may be set or cleared at any time. It only affects how PLArenaPools are destroyed, not how they are allocated. The allocation algorithm ALWAYS tries to allocate from the free list before allocating from the heap, even if the free list is empty. So, setting the environment variable in the middle of the running program will cause the free list of PLArenas to stop growing, and to shrink until it is empty. Once it is set, PLArenas that are already on the free list will be taken from it for new allocations, but they will be freed to the heap when the PLArenaPool is destroyed.
Function PL_ArenaFinish (not to be confused with PL_FinishArenaPool) flushes the PLArena free list, freeing all those PLArenas back to the heap, and destroying the lock that protects the free list. It is intended to be called only at the end of the process, or at such time as PLArenaPools will never be used thereafter in the remainder of the process lifetime. NSS never calls this function because NSS does not presume itself to be the only user of the PLArenaPool code.
2. In answer to Boris's question, NSPR's PLArenaPool code does not keep track of the amount of space, nor the number of PLArenas, on the PLArena free list. There is a high water mark but it is not recorded, tracked, or bounded by NSPR.
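The behavior described in this comment can be illustrated with a toy model. Everything in it (`Pool`, `Arena`, `pool_alloc`, `pool_free`, `pool_finish`, `arena_finish`, `live_arenas`) is a hypothetical stand-in written for this sketch, not the NSPR implementation: it has no locking, ignores alignment, and only mirrors the allocation order and the three ways of releasing arenas described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: hypothetical stand-ins, not the NSPR implementation. */
typedef struct Arena {
    struct Arena *next;
    size_t size;            /* usable bytes in data[] */
    size_t used;            /* bytes already handed out */
    unsigned char data[];   /* payload (alignment is ignored here) */
} Arena;

typedef struct Pool {
    Arena *first;           /* arenas currently owned by this pool */
    size_t min_size;        /* minimum payload size of a new arena */
} Pool;

static Arena *free_list = NULL;  /* global list shared by every pool */
static int live_arenas = 0;      /* arenas malloc'ed and not yet freed */

void pool_init(Pool *p, size_t min_size)
{
    p->first = NULL;
    p->min_size = min_size;
}

/* Unlink and return the first free-list arena that can hold nb bytes. */
Arena *free_list_take(size_t nb)
{
    for (Arena **link = &free_list; *link != NULL; link = &(*link)->next) {
        if ((*link)->size >= nb) {
            Arena *a = *link;
            *link = a->next;
            a->used = 0;
            return a;
        }
    }
    return NULL;
}

void *pool_alloc(Pool *p, size_t nb)
{
    Arena *a;
    /* 1. Unused space in an arena the pool already owns. */
    for (a = p->first; a != NULL; a = a->next) {
        if (a->size - a->used >= nb) break;
    }
    if (a == NULL) {
        /* 2. An arena from the global free list. */
        a = free_list_take(nb);
        if (a == NULL) {
            /* 3. A new arena from the heap. */
            size_t size = nb > p->min_size ? nb : p->min_size;
            a = malloc(sizeof(Arena) + size);
            if (a == NULL) return NULL;
            a->size = size;
            a->used = 0;
            live_arenas++;
        }
        a->next = p->first;
        p->first = a;
    }
    assert(a->used + nb <= a->size);
    void *mem = a->data + a->used;
    a->used += nb;
    return mem;
}

/* Like PL_FreeArenaPool: park the pool's arenas on the global free list. */
void pool_free(Pool *p)
{
    while (p->first != NULL) {
        Arena *a = p->first;
        p->first = a->next;
        a->next = free_list;
        free_list = a;
    }
}

/* Like PL_FinishArenaPool: return the pool's arenas to the heap. */
void pool_finish(Pool *p)
{
    while (p->first != NULL) {
        Arena *a = p->first;
        p->first = a->next;
        free(a);
        live_arenas--;
    }
}

/* Like PL_ArenaFinish: flush the global free list back to the heap. */
void arena_finish(void)
{
    while (free_list != NULL) {
        Arena *a = free_list;
        free_list = a->next;
        free(a);
        live_arenas--;
    }
}
```

In this model, a pool released with `pool_free` leaves `live_arenas` unchanged, which is the shutdown leak discussed in this bug unless `arena_finish` is eventually called. A pool released with `pool_finish` drains any arenas it had picked up from the free list, which matches the "stop growing, and shrink until it is empty" behavior described for the environment variable.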
Summary: PSM should define NSS_DISABLE_ARENA_FREE_LIST before initializing NSS / Lk regression on 7th April 2009 → PSM should define NSS_DISABLE_ARENA_FREE_LIST / Lk regression on 7th April 2009
That doesn't answer my question. My question is whether NSS's specific use of the arena APIs is bounded in terms of the number of PLArenas it will allocate over a process lifetime, at least as used via PSM. Or put another way, whether it's possible to cause the browser to allocate 500MB worth of PLArenas via NSS on visiting a web page, say.
(In reply to comment #6)
> That doesn't answer my question. My question is whether NSS's specific use of the arena APIs is bounded in terms of the number of PLArenas it will allocate over a process lifetime, at least as used via PSM.
NSS imposes no bound, but as a practical matter, it is bounded by Firefox's bound on the number of simultaneous SSL connections.
> Or put another way, whether it's possible to cause the browser to allocate 500MB worth of PLArenas via NSS on visiting a web page, say.
It would take thousands upon thousands of simultaneous TCP connections to reach such numbers. So, I would say the answer is no. This is all measurable. Given that all the PLArenas allocated through NSS are now leaked at shutdown, just total up the space of those leaked PLArenaPools. That's the high water mark for that run.
> it is bounded by Firefox's bound on the number of simultaneous SSL connections.
Is that really guaranteed? The patch that caused this bug to be filed doesn't do any SSL connections at all, but increased the number of arenas allocated...
> That's the high water mark for that run.
That doesn't answer my question either. My question is whether there is a high water mark limit over all possible runs.
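Since NSPR does not record the high water mark itself, a measurement like the one suggested here needs some external bookkeeping. The following is only a sketch of such a counter: `ArenaStats` and both functions are hypothetical instrumentation, not part of NSPR or NSS, and the sketch says nothing about where the hooks would be placed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instrumentation, not part of NSPR or NSS. */
typedef struct ArenaStats {
    size_t current;  /* bytes of arenas currently held */
    size_t peak;     /* largest value 'current' has reached so far */
} ArenaStats;

/* Record that an arena of 'bytes' bytes was obtained from the heap. */
void stats_arena_acquired(ArenaStats *s, size_t bytes)
{
    s->current += bytes;
    if (s->current > s->peak) {
        s->peak = s->current;
    }
}

/* Record that an arena of 'bytes' bytes was returned to the heap. */
void stats_arena_released(ArenaStats *s, size_t bytes)
{
    assert(bytes <= s->current);
    s->current -= bytes;
}
```

When arenas are only ever parked on the free list and never returned to the heap, `current` never decreases, so `peak` equals the total leaked at shutdown. That is the per-run figure described above; it does not bound the worst case over all runs.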
NSS allocates space from PLArenaPools for data objects that correspond to sockets, keys, certificates, etc., but not for bulk data. The actual application data (e.g. HTTP requests and responses) are not allocated from PLArenaPools. Some of the objects allocated in PLArenaPools are long-lived and contain what is essentially configuration information. So, the high water mark is a function of two classes of use: a) very long-lived objects, and b) shorter-lived objects whose numbers and total space correlate to the high water number of simultaneous connections, but not to the amount of data transferred on those connections. Please take my suggestion, and instead of imagining the worst, measure the amount of space in leaked PLArenas allocated by NSS. Your most recent question is answered in comment 7.
> and instead of imagining the worst, measure the amount of space in leaked PLArenas allocated by NSS
Since I'm precisely interested in the worst-case behavior, that won't do me much good.
> Your most recent question is answered in comment 7.
Meaning the answer is "no"? Note that certificates can be quite long-lived, in general. Gecko assumes that certificate objects (or rather nsIX509Cert objects) are small enough to attach one to every image on a web page, for example. I have no idea whether they're sharing the same underlying NSS object if all the images come from the same server, say.
> correlate to the high water number of simultaneous connections, but not to amount of data transferred on those connections.
I wasn't assuming it was anything like the amount of data transferred, and I'm glad it's not. But the "simultaneous connections" thing doesn't match what I know about the treatment of certificates in PSM. It's at the very least closer to "high-water-mark number of SSL sites that have all been loaded in the browser and not yet navigated away from", which is quite a bit larger than the number of SSL connections. In any case, the question I'm asking is not an NSS question, but a PSM question, since PSM is what mediates the browser's interaction with these objects and determines their lifetimes. I'm more or less waiting for Kai's answer here, unless someone else happens to know the details of that code.
> Meaning the answer is "no"?
Boris, As I wrote at the beginning of comment 7: "NSS imposes no bound" (on the memory allocated from PLArenaPools). The memory allocated by NSS can be divided into two categories: a) That which NSS allocates for its own purposes, in the course of doing SSL or S/MIME, and b) That which NSS allocates at the request of the application that calls it. I can characterize the behavior of the first category of memory allocation, and did so in comment 9. I cannot characterize the second category. Within Firefox, that is really a PSM question, as you've noted.
Regarding certificates, NSS itself is very miserly with the memory used to hold certificates. NSS has reference counted objects for those, and a hash table that keeps track of them all, to avoid duplication of certs in multiple objects. So, hopefully, all those PSM objects that hold a cert reference for every image (really!? I had no idea) are holding references to the same object for all the images that come from the same server.
Assignee: kaie → nobody
Whiteboard: [psm-logic]
Whiteboard: [psm-logic] → [psm-logic][psm-backlog]
Priority: -- → P2
Summary: PSM should define NSS_DISABLE_ARENA_FREE_LIST / Lk regression on 7th April 2009 → investigate setting NSS_DISABLE_ARENA_FREE_LIST so that NSS doesn't hold on to memory it's not using
Moving to p3 because no activity for at least 1 year(s). See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P2 → P3
Severity: normal → S3