Lots of missing checks for out-of-memory.

RESOLVED DUPLICATE of bug 611123

Status

()

Core
Widget: Gtk
RESOLVED DUPLICATE of bug 611123
12 years ago
5 years ago

People

(Reporter: Denis Vlasenko, Unassigned)

Tracking

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

12 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20060320 Firefox/1.5
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20060320 Firefox/1.5

Just one example of many:

gfx/src/gtk/nsImageGTK.cpp:572

    aTmpImage = gdk_pixmap_new(nsnull,
                               endColumn-startColumn,
                               scaleEndY-scaleStartY,
                               aDepth);
#ifdef MOZ_WIDGET_GTK2
    if (aDepth != 1)
      gdk_drawable_set_colormap(GDK_DRAWABLE(aTmpImage),
                                gdk_rgb_get_colormap());
#endif
  }

What will happen if there is no memory for aTmpImage? Big BOOM.
Maybe it cannot return (will abort() instead?) Please look here into gtk source:

GdkPixmap*
gdk_pixmap_new (GdkDrawable *drawable,
                gint         width,
                gint         height,
                gint         depth)
{
...
  pixmap = g_object_new (gdk_pixmap_get_type (), NULL);
  draw_impl = GDK_DRAWABLE_IMPL_X11 (GDK_PIXMAP_OBJECT (pixmap)->impl);
  pix_impl = GDK_PIXMAP_IMPL_X11 (GDK_PIXMAP_OBJECT (pixmap)->impl);
...

Whoa. Looks like it thinks g_object_new never fails. Now lets take a look into glib source:

g_object_new (GType        object_type,
              const gchar *first_property_name,
              ...)
{
  GObject *object;
  va_list var_args;

  g_return_val_if_fail (G_TYPE_IS_OBJECT (object_type), NULL);

  va_start (var_args, first_property_name);
  object = g_object_new_valist (object_type, first_property_name, var_args);
  va_end (var_args);

  return object;
}

GObject*
g_object_new_valist (GType        object_type,
                     const gchar *first_property_name,
                     va_list      var_args)
{
...
  params = g_new (GParameter, n_alloced_params);
  name = first_property_name;
  while (name)
    {
....
      params[n_params].name = name;
      params[n_params].value.g_type = 0;

It, too, thinks that g_new never fails. g_new is #defined to be just a wrapper around g_malloc:

gpointer
g_malloc (gulong n_bytes)
{
  if (n_bytes)
    {
      gpointer mem;

      mem = glib_mem_vtable.malloc (n_bytes);
      if (mem)
        return mem;

      g_error ("%s: failed to allocate %lu bytes", G_STRLOC, n_bytes);
    }

  return NULL;
}

So it CAN fail.

And this is not an isolated bug. There are probably hundreds of such bugs. Many are in Mozilla itself, some are in libraries outside of Mozilla project.

Is there any reason, then, why Mozilla source is littered with
  GraphicsState *state = new GraphicsState();
  if (!state)
    return NS_ERROR_FAILURE;
and other similar code? Does it try to plug a boat with million holes in the bottom?

It is impractical to audit entire source for correct operation in low-on-memory case, and in normal operation it doesn't happen frequently -> these bugs will never be fixed en masse -> Mozilla will practically always crash badly when low on mem.

What about making malloc/new/any_other_alloc just print a message about it on stderr and exiting? We can at least save a bit of code by removing all existing checks for failed allocations.

If you think it is rude for GUI app to die without any GUI-sh error message, well, current situation (crashing) is rude too.

Something like:

void* malloc(int sz) {
...
  if(we_have_no_mem) {
    fprintf(stderr, "Failed to allocate %d bytes, exiting\n", needed);
    if(!may_use_emerg_pool_of_64k_we_allocated_earlier) {
      /* Try to be nice to GUI users */
      may_use_emerg_pool_of_64k_we_allocated_earlier = 1;
      show_msgbox_with_just_an_ok_button("Failed to allocate %d bytes, exiting\n", needed);
    }
    exit(1);
  }
}

will be a bit nicer.

Reproducible: Didn't try

Updated

12 years ago
Component: General → GFX: Gtk
Product: Firefox → Core
QA Contact: general → gtk
Version: unspecified → Trunk
We would like to move towards handling OOM rather than away from it. Clearly GTK has some work to do there too. Please file the issues you found in GTK as bugs in bugzilla.gnome.org if you haven't already...
(Reporter)

Comment 2

12 years ago
How do you realistically imagine proper OOM handling? A popup like "Sorry, I cannot show you this page, not enough memory" - and continuing? Nice, but not attainable in practice. There are multiture of reasons:

1. malloc() is used far too often in every corner, every library. You'll need some memory even for the "Out of memory!" message box itself.
2. In practice, when user hits OOM, browser first slows to a crawl because of excessive paging. malloc() does not fail yet. Frequently user just closes it (and maybe starts new browser).
3. However, if user won't close OOMimg browser, it will most probably crash from NULL dereference because if we'll suppose that 0.01% of malloc() sites lacks NULL check or check is buggy in some way, due to malloc() frequency some buggy callsite will be reached in a few seconds on average. This crash is equivalent in functionality to having malloc() which will never return NULL (will just exit(1) on failure) plus (a) we don't get a nice "Out of memory!" message on stderr and (b) we wasted a lot of code on those 99.99% malloc() callsites where we do check for NULL and react to it.
4. Some versions of libc implement large malloc() requests with mmap with on-demand supply of RAM pages. Thus malloc() succeeds, but if there is no free page to satisfy an access to seemingly "already allocated" buffer, application will get SIGBUS and normally will die. Any code which attempts to check for NULLs from malloc will not work in this case, it will be just it - lots of unreachable code.
5. On Linux, desktop boxes are usually configured to "overcommit" memory, i.e. app never sees NULL from malloc. Kernel just kills the "worst" process when it gets REALLY tight on memory. Again, code which attempts to check for NULLs from malloc will not work in this case.

I am not proposing that memory allocators should be dumb and assume that RAM is infinite. I'm fine with them detecting low-mem condition and taking some action (purging caches (fonts, images, whatever) alerting user: "Low on mem, please close unneeded windows", etc). I just saying: "If there is genuinely _no memory_ to satisfy this malloc() - just say so and exit immediately. Simple. Small piece f code. Maybe a bit rude, but hey, point me to another program which handles OOM better.
Opera.
Desktops tend to act as you say (although it's often configurable, and Firefox is the sort of thing you might want to sandbox with ulimit). Embedded platforms don't. We want our code to be usable on embedded platforms. Getting there will be hard, but it is attainable, and we want to move forwards not backwards.

Comment 5

12 years ago
simply put there are companies actively interested in making this work (especially gecko on small devices). and they've recently hired developers who have historically cared about such crashes in mozilla code.

that said, this looks like a series of gtk bugs which definitely need to be reported. reporter if you haven't already done so, please file against gtk. if you don't, it'll probably be one of the first things i get to do when i start my job.

for people who like using mxr, there's a slightly dated version of gtk available @ http://landfill.mozilla.org/mxr-test/gnome/ident?i=gdk_pixmap_new where you can follow this trail of tears (i'm sorry, the ident db seems slightly out of sync w/ the files, so the line numbers are a bit off, i'll try to correct that later).
(Reporter)

Comment 6

12 years ago
Ok. GTK bugs are GTK bugs, but there is also a Mozilla bug at gfx/src/gtk/nsImageGTK.cpp:572 (see the beginning of this bug). There is no check for gdk_pixmap_new() OOM failure.
(Reporter)

Comment 7

12 years ago
Look at bug 336119 for an example of how hopeless is it. Callers of txList::add() and txList::insert() make an assumption that it cannot fail, forgetting about OOM case. Even if bug in txPattern::getSimplePatterns() will be fixed, a further audit is needed for all callers of txPattern::getSimplePatterns().

I still think that it is much more practical to just make operator new() never return NULL.
(Reporter)

Comment 8

12 years ago
Bug 336145 is another example of subtle bugs which start to trigger only in OOM condition.
(Reporter)

Comment 9

12 years ago
Bug 336192 shows a pointer member initialized in the constructor with pointer to new'ed object. It can be NULL, but in C++ constructors cannot fail and report OOM, so it is not checked for NULL.

We need to add NULL checks to many member functions in order to cover for that. More to it, some of those previously-safe member functions (i.e. they could not fail) now _can fail_ - i.e. they can suddenly return "gee, our constructor had hit OOM" error. We need to add handling for that also. And we need to somehow debug all these changes.

Looks horrible.

Alternatively, we may add bool nsProfileAccess::isConstructedOk() function and call it on every nsProfileAccess instance after constructon but before actual usage.

Not nice either.
For that reason, many of our objects have an Init method which can fail. In that bug I would add an Init method to nsProfileAccess which gets called after construction; if it fails, the creator deletes the new object and fails itself. The other methods of nsProfileAccess don't need to change.
(Reporter)

Comment 11

12 years ago
Is it possible to handle OOM caused by stack growth?
No. We limit stack size by limiting the depth of recursion.
(Reporter)

Comment 13

12 years ago
Limiting recursion depth prevents _unlimited_ stack growth.

I am talking about different scenario. Stacks of userspace programs grow on demand. What will happen if there is no available memory for next stack page?

void f() { char v[4096]; do_something(v); }

void g() {
    f(); /* What if there is no free page for stack growth? */
}
(Reporter)

Comment 14

12 years ago
I wanted to check my assumtion that currently Mozilla doesn't handle OOM well.

I switched my Linux into "no overcommit" mode, started firefox and opened a zillion of tabs, each with locally saved start page of ccd.cosmotography.com:

echo 2 >/proc/sys/vm/overcommit_memory
echo 80 >/proc/sys/vm/overcommit_ratio
exec firefox "$PWD"

and then Ctrl-clicking on the ccd_cosmotography_com.htm. Most of the time firefox was just crashing. Once it said this to stderr (or stdout):

terminate called after throwing an instance of 'std::bad_alloc'
  what():  St9bad_alloc

but most of the time it says nothing.

Strace looks like this:

...
mprotect(0x3bb80000, 4096, PROT_READ|PROT_WRITE) = -1 ENOMEM (Cannot allocate memory)
mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3bc00000
munmap(0x3bc00000, 0)                   = -1 EINVAL (Invalid argument)
munmap(0x3bd00000, 1048576)             = 0
mprotect(0x3bc00000, 32768, PROT_READ|PROT_WRITE) = -1 ENOMEM (Cannot allocate memory)
munmap(0x3bc00000, 1048576)             = 0
mprotect(0x3bb80000, 8192, PROT_READ|PROT_WRITE) = -1 ENOMEM (Cannot allocate memory)
mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3bc00000
munmap(0x3bc00000, 0)                   = -1 EINVAL (Invalid argument)
munmap(0x3bd00000, 1048576)             = 0
mprotect(0x3bc00000, 32768, PROT_READ|PROT_WRITE) = -1 ENOMEM (Cannot allocate memory)
munmap(0x3bc00000, 1048576)             = 0
brk(0)                                  = 0xa4dc000
brk(0xa4dd000)                          = 0xa4dc000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
unlink("/root/.mozilla/firefox/vz0c60y9.default/lock") = 0
rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [SEGV], NULL, 8) = 0
kill(11014, SIGSEGV)                    = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 11014 detached
(Reporter)

Comment 15

12 years ago
I want to play with "gracefully failing" malloc(), something like the code below. Am I correct that something like this should be incorporated
into nsprpub/pr/src/malloc/prmem.c? I am not too familiar with Mozilla source tree.


static int fault_injection_ratio = -1;
static size_t guard_buffer_size = 64*1024;
static size_t guard_buffer_cursz = 64*1024;
static void* guard_buffer;

extern void malloc_failed(size_t size);
extern void malloc_low_on_mem(size_t size, size_t guard_buffer_cursz);

static void allocate_guard_buffer(void)
{
	free(guard_buffer);
	/* guard_buffer = NULL; */
	guard_buffer_cursz = guard_buffer_size;
	do {
		guard_buffer = malloc(guard_buffer_cursz);
		if(guard_buffer) {
			/* Guard against lazy malloc implementations */
			memset(guard_buffer, 0xfd, guard_buffer_cursz);
			break;
		}
		guard_buffer_cursz /= 2;
	} while(guard_buffer_cursz);
}

static void setup_malloc_from_env(void)
{
	char *str;

	fault_injection_ratio = 0;
	str = getenv("MALLOC_FAULT_RATIO");
	if(str) fault_injection_ratio = (strtoul(str,NULL,0) & INT_MAX);

	str = getenv("MALLOC_GUARD_BUFFER_SIZE");
	if(str) guard_buffer_size = (strtoul(str,NULL,0) & INT_MAX);

	allocate_guard_buffer();
}

static void* do_malloc(size_t size)
{
	void* p = malloc(size);
	if(p) return p;

	free(guard_buffer);
	guard_buffer = NULL;
	p = malloc(size);
	if(!p) {
		/* It's probably best to abort... */
		malloc_failed(size);
		return NULL;
	}
	/* Some memory may be still available, but not much */
	/* (less than guard_buffer_cursz). Warn user about it */
	malloc_low_on_mem(size, guard_buffer_cursz);

	allocate_guard_buffer();
	return p;
}

void* my_malloc(size_t size)
{
again:
	if(!fault_injection_ratio)
		return do_malloc(size);

	if(fault_injection_ratio == -1) {
		setup_malloc_from_env();
		goto again;
	}

	if((random() % fault_injection_ratio) == 0)
		return NULL;

	return do_malloc(size);
}
(In reply to comment #13)
> I am talking about different scenario. Stacks of userspace programs grow on
> demand. What will happen if there is no available memory for next stack page?

Crash, or at least a signal.

If you limit recursion, you can in theory ensure that you run in a fixed amount of stack space, say less than 1MB. No-one has really tried to check that systematically, yet, although we have fixed bugs where specially crafted testcase crashed the browser by overflowing the stack.
(In reply to comment #14)
> I wanted to check my assumtion that currently Mozilla doesn't handle OOM well.

We all agree with that assumption.

The question is whether we should try to make Mozilla handle OOM well, or just give up.

We use new and delete a lot, so for something meaningful I think you should hook into glibc malloc.
(Reporter)

Comment 18

12 years ago
I implemented a minimalistic solution which:

* can allocate an emergency buffer, will warn when malloc need to tap it
  (controlled by MALLOC_GUARD_SIZE env var, disabled by default)
* can inject malloc failures with 1/MALLOC_FAULT_RATIO probability
  (also controlled by env var, 0 - disabled (default))
* malloc will never fail. It will call malloc_low_on_mem()
  if emergency buffer needs to be used, and will call
  malloc_failed() if there is no mem at all.
* current version just prints message to stderr. Better one may show
  a (rate-limited) message box if malloc_low_on_mem() is called.

This patch will currently work on glibc systems only - it uses the fact that malloc and realloc are weak symbols and can be replaced. This would not work, for example, with uclibc.

With this patch, (1) lots of error checks for out-of-memory can be eliminated and (2) firefox exits cleanly instead of crashing on OOM.

Testing with ccd_cosmotography_com.htm as described earlier:

# ./firefox
malloc(86) failed
# MALLOC_GUARD_SIZE=10000000 ./firefox
The program 'firefox-bin' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadAlloc (insufficient resources for operation)'.
  (Details: serial 15797 error_code 11 request_code 53 minor_code 0)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)
# ./firefox
The program 'firefox-bin' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadAlloc (insufficient resources for operation)'.
  (Details: serial 12427 error_code 11 request_code 53 minor_code 0)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)
(Reporter)

Comment 19

12 years ago
Created attachment 220714 [details] [diff] [review]
replace malloc with one which never returns NULL
(Reporter)

Comment 20

12 years ago
Test scenario where only firefox process is forced to OOM (X server is not affected):

# MALLOC_GUARD_SIZE=1000000 softlimit -a 60000000 firefox /pub/space
...Ctrl-clicking....
Low on mem: less than 1000000 bytes are free
Low on mem: less than 1000000 bytes are free
Low on mem: less than 500000 bytes are free
Low on mem: less than 250000 bytes are free
Low on mem: less than 125000 bytes are free
Low on mem: less than 62500 bytes are free
Low on mem: less than 31250 bytes are free
Low on mem: less than 15625 bytes are free
Low on mem: less than 7812 bytes are free
Low on mem: less than 3906 bytes are free
Low on mem: less than 1953 bytes are free
Low on mem: less than 976 bytes are free
malloc(64) failed
Maybe a good strategy would be to detect nearly-OOM at a high-level and then try to have the browser recover by closing windows and freeing memory in other ways. But there's still the possibility that a single large allocation could fail, or that the reserve buffer could be exhausted accidentally while trying to recover, so I'm not sure that approach could be reliable.
This bug is starting to be too much of a meta bug.... I'd file a separate bug on gfx/src/gtk/nsImageGTK.cpp:572 so it actually gets fixed....
(Reporter)

Comment 23

12 years ago
Boris, on closer look I found out that GTK (actually, glib) already contains the code which make g_xxx_new allocations to terminate the program on failure, never return NULL. There is no bug at gfx/src/gtk/nsImageGTK.cpp:572...
He's not CCed anymore.

I'm going to start a thread on mozilla.dev.platform.
(In reply to comment #21)
> Maybe a good strategy would be to detect nearly-OOM at a high-level and then
> try to have the browser recover by closing windows and freeing memory in other
> ways. But there's still the possibility that a single large allocation could
> fail, or that the reserve buffer could be exhausted accidentally while trying
> to recover, so I'm not sure that approach could be reliable.
> 

As I recall, the theory behind the memory flushing stuff in the XPCOM allocator was something along these lines.
Here's the thread.
http://groups.google.com/group/mozilla.dev.platform/browse_frm/thread/afba0c925ba1a8f7/ca83e4b470987dd9#ca83e4b470987dd9

Nokia uses GTK on some of their devices, I wonder what they do about this.
(Reporter)

Comment 27

12 years ago
Alink to gtk archived mail:

http://mail.gnome.org/archives/gtk-app-devel-list/2000-April/msg00369.html

Updated

11 years ago
Status: UNCONFIRMED → NEW
Component: GFX: Gtk → Widget: Gtk
Ever confirmed: true
QA Contact: gtk → gtk

Comment 28

5 years ago
is this obsolete just as bug 210931 comment 85 ...
(In reply to Joe Drew (:JOEDREW! \o/) from comment #85)
> I imagine this was fixed way back when we started using Cairo. Please reopen
> if that's not the case.
(In reply to Denis Vlasenko from comment #0)
> gpointer
> g_malloc (gulong n_bytes)
> {
>   if (n_bytes)
>     {
>       gpointer mem;
> 
>       mem = glib_mem_vtable.malloc (n_bytes);
>       if (mem)
>         return mem;
> 
>       g_error ("%s: failed to allocate %lu bytes", G_STRLOC, n_bytes);
>     }
> 
>   return NULL;
> }
> 
> So it CAN fail.

g_error is fatal, aborting the program.

Bug 611123 is about the more general case for other callers of malloc.
Bug 709952 is about trying to recover memory when close to failure.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 611123
You need to log in before you can comment on or make changes to this bug.