Open Bug 427099 Opened 12 years ago Updated 4 years ago

Rework OOM handling for moz2

Categories

(Core :: General, defect)

Tracking

()

People

(Reporter: benjamin, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Keywords: sec-want, Whiteboard: [sg:want P1])

For mozilla 2, we want to rework some aspects of memory handling. This bug is to track everything we need to do for the "OOM track".
Component: Build Config → General
QA Contact: build-config → general
Depends on: 427107
Depends on: 427109
I have added a patch to bug #427109 that implements the necessary allocator infrastructure.  Subject to review feedback, the other portions of the OOM rework should be able to proceed now.
Depends on: 441324
Depends on: 414946
Duplicate of this bug: 441675
Damon has requested the current status of this work along with a list of TODOs.  Much of this was taken from a March dev-platform post of bsmedberg's (http://preview.tinyurl.com/ksnlc2) and other discussions, but edited for concision and to enumerate TODOs.

== OOM handling options ==
 (1) abort when malloc fails
 (2) try to reclaim memory from a reserve when malloc fails, using this API: https://wiki.mozilla.org/Mozilla_2/Memory/OOM_API
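
To make option (2) concrete, here is a minimal C++ sketch of reserve-based reclamation. It is not the API described on the wiki page: the names `InitReserve`, `AllocOrReclaim`, `gReserve`, and `gAlloc` are hypothetical, and the `gAlloc` hook exists only so that an allocation failure can be simulated. The sketch assumes a single thread and a single reserve block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// All names here are hypothetical; this is not the wiki's OOM API.
namespace oom_sketch {

// Emergency reserve, handed back to the allocator on the first failure.
static void* gReserve = nullptr;

// Allocation hook so that failure can be simulated; defaults to malloc.
using AllocFn = void* (*)(std::size_t);
static AllocFn gAlloc = std::malloc;

// Grab the reserve early, while memory is still plentiful.
bool InitReserve(std::size_t bytes) {
  assert(!gReserve);  // debug-only check: the reserve is set up once
  gReserve = std::malloc(bytes);
  return gReserve != nullptr;
}

// Option (2): on failure, release the reserve and retry once.
// Option (1): if there is no reserve left, or the retry fails, abort.
void* AllocOrReclaim(std::size_t bytes) {
  void* p = gAlloc(bytes);
  if (p) {
    return p;
  }
  if (gReserve) {
    std::free(gReserve);
    gReserve = nullptr;
    p = gAlloc(bytes);
    if (p) {
      return p;
    }
  }
  std::abort();
}

}  // namespace oom_sketch
```

A real implementation would presumably also notify the application so that it can shed load and refill the reserve; that part is omitted here.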

== Implementation options ==
  * extend jemalloc
    - code is checked in for both abort- and reclamation-OOM-handling
    - see bug 427109
  * wrap whatever underlying allocator we use (system or jemalloc); this code is NOT written

== Problems with jemalloc OOM extensions ==
  * on Linux
    - memory reclamation doesn't work reliably, as the kernel allocates more memory to userspace than the kernel can back, causing hard-to-predict OOM death
    - we have a userspace workaround, but performance impact is unacceptable
    - see bug 465127
  * on Mac
    - jemalloc isn't used at all
    - can't completely replace system malloc/free at dynamic link time, so we're probably stuck with multiple allocators
    - multiple allocators => potential for mismatched malloc/free
    - multiple allocators => potential increased fragmentation and memory usage
  * on Windows: appears to be in good shape, but requires a custom CRT with jemalloc malloc/free, and this CRT is not free software.

== Current plan: wrap system allocator ==
Due to the problems with jemalloc and its OOM handling, we prefer wrapping the underlying system allocator for OOM handling.  As a first step, we will abort when allocation fails.  Later, we can attempt memory reclamation.  It's not clear that trying to reclaim memory will work well, but that's a whole 'nother can of worms.

== TODOs for aborting on OOM ==
  (1) ensure that no Mozilla code directly uses libc "allocating functions," among them malloc/free/strdup/..., but instead uses NSPR or NS replacements.  static analysis tools can help with this.
  (2) implement wrappers for |operator new/new[]| and |operator delete/delete[]| (and probably the nothrow variants too), around |PR_Malloc()/PR_Free()|.  these need to be either fully inlined in all Mozilla code, or statically linked into each shared library we build.
  (3) add a |PR_TryMalloc()| function to NSPR that is understood to return NULL in OOM conditions
  (4) modify the NSPR and NS analogues of malloc/free/strdup/etc. to abort in OOM conditions.  enable this behavior with configure flags.

This work can be done (or tracked) in bug 441324.
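
TODOs (2) through (4) could look roughly like the following C++ sketch. `TryMalloc` and `InfallibleMalloc` are hypothetical stand-ins for the proposed |PR_TryMalloc()| and an aborting |PR_Malloc()|; they are built directly on malloc/free here so the example is self-contained. It assumes the replacement operators are linked into the binary that uses them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for PR_TryMalloc(): may return NULL on OOM,
// so callers must check the result.
void* TryMalloc(std::size_t bytes) noexcept {
  return std::malloc(bytes ? bytes : 1);
}

// Hypothetical stand-in for an aborting PR_Malloc(): never returns NULL.
void* InfallibleMalloc(std::size_t bytes) noexcept {
  void* p = std::malloc(bytes ? bytes : 1);
  if (!p) {
    std::fputs("out of memory\n", stderr);
    std::abort();  // fail at the allocation site, not at a later NULL deref
  }
  return p;
}

// Plain new/new[] become infallible.
void* operator new(std::size_t bytes) { return InfallibleMalloc(bytes); }
void* operator new[](std::size_t bytes) { return InfallibleMalloc(bytes); }

// The nothrow variants stay fallible and return NULL on OOM.
void* operator new(std::size_t bytes, const std::nothrow_t&) noexcept {
  return TryMalloc(bytes);
}
void* operator new[](std::size_t bytes, const std::nothrow_t&) noexcept {
  return TryMalloc(bytes);
}

// Every delete form releases through the same underlying free().
void operator delete(void* p) noexcept { std::free(p); }
void operator delete[](void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
void operator delete[](void* p, std::size_t) noexcept { std::free(p); }
```

With these in place, `new Foo` aborts on OOM, while `new (std::nothrow) Foo` remains available for the minority of call sites whose OOM checks are not futile.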
Some clarifications:

> == Implementation options ==
>   * extend jemalloc
>     - code is checked in for both abort- and reclamation-OOM-handling
>     - see bug 427109
>   * wrap whatever underlying allocator we use (system or jemalloc); this code
> is NOT written
> 

"Underlying allocator" just means whatever provides the symbols malloc/free.  So the allocator could be the libc allocator, or jemalloc, or ...

> == Problems with jemalloc OOM extensions ==
>   * on Linux
>     - memory reclamation doesn't work reliably, as the kernel allocates more
> memory to userspace than the kernel can back, causing hard-to-predict OOM death
>     - we have a userspace workaround, but performance impact is unacceptable
>     - see bug 465127

Sorry, this should have read "on Linux kernels without swap partitions."

> == Current plan: wrap system allocator ==
> Due to the problems with jemalloc and its OOM handling, we prefer wrapping the
> underlying system allocator for OOM handling.  As a first step, we will abort
> when allocation fails.

It's worth expanding on this.  The ultimate goal here is to remove the "futile" OOM checks we currently do in Gecko: the current consensus is that, in Gecko, OOM checks are mostly worthless or buggy, and are not worth the programmer effort.  (Some of these checks are not futile; you know who you are, but you are probably in a small minority.)  If we were to eliminate all futile OOM checks from our codebase *right now*, without making any changes to our memory allocator, one of two things would happen when we hit OOM:

  (1) malloc would return NULL, and we'd eventually dereference a NULL pointer.  Gecko would die an unexploitable death.
  (2) the kernel, and later malloc, would return a pointer to "overcommitted" VM, and when Gecko tried to write to that region, the kernel would kill it in an unexploitable way.

This situation is probably acceptable enough for us to work on removing futile OOM checks in parallel with work on better OOM handling at the allocator level.  The TODO list below is the first step to "better" OOM handling.  It is arguably "better" because when Gecko crashes on OOM in case (1), it will be immediately clear what caused the crash.  We would also "control" the failure in the sense that we could decide when and how to fail, rather than leaving it essentially to chance.  It's also the first step towards trying to recover memory and *not* crashing, which is stage 2 of the OOM handling rework.

Unfortunately, aborting when malloc returns NULL won't help with case (2).  It's unfortunate because we only see case (2) on mobile Linux, which is also where we're most likely to hit OOM.  C'est la vie ....

> == TODOs for aborting on OOM ==
>   (1) ensure that no Mozilla code directly uses libc "allocating functions,"
> among them malloc/free/strdup/..., but instead uses NSPR or NS replacements. 
> static analysis tools can help with this.
>   (2) implement wrappers for |operator new/new[]| and |operator
> delete/delete[]| (and probably the nothrow variants too), around
> |PR_Malloc()/PR_Free()|.  these need to be either fully inlined in all Mozilla
> code, or statically linked into each shared library we build.
>   (3) add a |PR_TryMalloc()| function to NSPR that is understood to return NULL
> in OOM conditions
>   (4) modify the NSPR and NS analogues of malloc/free/strdup/etc. to abort in
> OOM conditions.  enable this behavior with configure flags.
> 
> This work can be done (or tracked) in bug 441324.

This series of steps assumes we want to put the infallible malloc in NSPR.  It would arguably be better to put the malloc/free/strdup/... wrappers in their own library (libxmalloc?).  Among other things, it would be easier for SpiderMonkey to utilize infallible malloc if they so chose (though they probably won't want to).  This doesn't substantially change the above steps (s/NSPR/libxmalloc/), but it's worth pointing out that NSPR isn't our only choice.
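
As an illustration of the standalone-library idea, a libxmalloc-style wrapper set might look like the sketch below. The names `xm_malloc`, `xm_strdup`, and `xm_free` are hypothetical and are not taken from any existing library; the point is only that the wrappers depend on nothing but the underlying malloc/free, so a consumer would not need NSPR to use them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical infallible malloc: aborts instead of returning NULL.
void* xm_malloc(std::size_t bytes) {
  void* p = std::malloc(bytes ? bytes : 1);
  if (!p) {
    std::abort();
  }
  return p;
}

// Hypothetical infallible strdup built on xm_malloc, so the copy
// can never be NULL.
char* xm_strdup(const char* s) {
  assert(s);  // debug-only check: callers must pass a valid string
  std::size_t n = std::strlen(s) + 1;  // include the terminating NUL
  char* copy = static_cast<char*>(xm_malloc(n));
  std::memcpy(copy, s, n);
  return copy;
}

// Matching release function, so allocation and free stay paired
// within one library.
void xm_free(void* p) { std::free(p); }
```

Callers would use `xm_strdup(str)` without a NULL check and release the result with `xm_free`.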
Whiteboard: [sg:want P1]
(In reply to comment #3)
>  (2) try to reclaim memory from a reserve when malloc fails, using this API:
> https://wiki.mozilla.org/Mozilla_2/Memory/OOM_API

From http://mxr-test.konigsberg.mozilla.org/mozilla-central/source/memory/mozalloc/mozalloc_oom.cpp#49 , it seems that this part didn't get implemented.

I kinda assumed it would get implemented by the time we ship the HTML5 parser. In particular, bug 573078 would benefit from some mechanism outside the parser making room by tearing down other documents and potentially stopping the parser that hits OOM.

Any chance of the low memory "stop sinks" plan getting to implementation by Firefox 4?
(In reply to comment #5)
> I kinda assumed it would get implemented by the time we ship the HTML5 parser.
> In particular, bug 573078 benefit from some mechanism outside the parser making
> room by tearing down other documents and potentially stopping the parser that
> hits OOM.

See bug 590674 too: I reproducibly get OOM crashes under heavy use and on startup. I'm assuming this is because garbage collection, cache flushing, etc. happen in parallel with allocations instead of blocking them. If allocation happens too fast, before the parallel cleanup can free up enough memory, the application could OOM-crash even though enough memory would eventually have been freed.
Depends on: 610823
Depends on: 611123
Blocks: 557228
Depends on: 734703
Depends on: 733262
Depends on: 804492
Depends on: 606198
No longer depends on: 804492
Depends on: 737164
Depends on: 852501
Depends on: 862592