335951 - Lots of missing checks for out-of-memory.

Reporter

Description

•

19 years ago

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20060320 Firefox/1.5 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20060320 Firefox/1.5 Just one example of many: gfx/src/gtk/nsImageGTK.cpp:572 aTmpImage = gdk_pixmap_new(nsnull, endColumn-startColumn, scaleEndY-scaleStartY, aDepth); #ifdef MOZ_WIDGET_GTK2 if (aDepth != 1) gdk_drawable_set_colormap(GDK_DRAWABLE(aTmpImage), gdk_rgb_get_colormap()); #endif } What will happen if there is no memory for aTmpImage? Big BOOM. Maybe it cannot return (will abort() instead?) Please look here into gtk source: GdkPixmap* gdk_pixmap_new (GdkDrawable *drawable, gint width, gint height, gint depth) { ... pixmap = g_object_new (gdk_pixmap_get_type (), NULL); draw_impl = GDK_DRAWABLE_IMPL_X11 (GDK_PIXMAP_OBJECT (pixmap)->impl); pix_impl = GDK_PIXMAP_IMPL_X11 (GDK_PIXMAP_OBJECT (pixmap)->impl); ... Whoa. Looks like it thinks g_object_new never fails. Now lets take a look into glib source: g_object_new (GType object_type, const gchar *first_property_name, ...) { GObject *object; va_list var_args; g_return_val_if_fail (G_TYPE_IS_OBJECT (object_type), NULL); va_start (var_args, first_property_name); object = g_object_new_valist (object_type, first_property_name, var_args); va_end (var_args); return object; } GObject* g_object_new_valist (GType object_type, const gchar *first_property_name, va_list var_args) { ... params = g_new (GParameter, n_alloced_params); name = first_property_name; while (name) { .... params[n_params].name = name; params[n_params].value.g_type = 0; It, too, thinks that g_new never fails. g_new is #defined to be just a wrapper around g_malloc: gpointer g_malloc (gulong n_bytes) { if (n_bytes) { gpointer mem; mem = glib_mem_vtable.malloc (n_bytes); if (mem) return mem; g_error ("%s: failed to allocate %lu bytes", G_STRLOC, n_bytes); } return NULL; } So it CAN fail. And this is not an isolated bug. There are probably hundreds of such bugs. Many are in Mozilla itself, some are in libraries outside of Mozilla project. Is there any reason, then, why Mozilla source is littered with GraphicsState *state = new GraphicsState(); if (!state) return NS_ERROR_FAILURE; and other similar code? Does it try to plug a boat with million holes in the bottom? It is impractical to audit entire source for correct operation in low-on-memory case, and in normal operation it doesn't happen frequently -> these bugs will never be fixed en masse -> Mozilla will practically always crash badly when low on mem. What about making malloc/new/any_other_alloc just print a message about it on stderr and exiting? We can at least save a bit of code by removing all existing checks for failed allocations. If you think it is rude for GUI app to die without any GUI-sh error message, well, current situation (crashing) is rude too. Something like: void* malloc(int sz) { ... if(we_have_no_mem) { fprintf(stderr, "Failed to allocate %d bytes, exiting\n", needed); if(!may_use_emerg_pool_of_64k_we_allocated_earlier) { /* Try to be nice to GUI users */ may_use_emerg_pool_of_64k_we_allocated_earlier = 1; show_msgbox_with_just_an_ok_button("Failed to allocate %d bytes, exiting\n", needed); } exit(1); } } will be a bit nicer. Reproducible: Didn't try

timeless

Updated

•

19 years ago

Component: General → GFX: Gtk

Product: Firefox → Core

QA Contact: general → gtk

Version: unspecified → Trunk

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 1

•

19 years ago

We would like to move towards handling OOM rather than away from it. Clearly GTK has some work to do there too. Please file the issues you found in GTK as bugs in bugzilla.gnome.org if you haven't already...

Denis Vlasenko

Reporter

Comment 2

•

19 years ago

How do you realistically imagine proper OOM handling? A popup like "Sorry, I cannot show you this page, not enough memory" - and continuing? Nice, but not attainable in practice. There are multiture of reasons: 1. malloc() is used far too often in every corner, every library. You'll need some memory even for the "Out of memory!" message box itself. 2. In practice, when user hits OOM, browser first slows to a crawl because of excessive paging. malloc() does not fail yet. Frequently user just closes it (and maybe starts new browser). 3. However, if user won't close OOMimg browser, it will most probably crash from NULL dereference because if we'll suppose that 0.01% of malloc() sites lacks NULL check or check is buggy in some way, due to malloc() frequency some buggy callsite will be reached in a few seconds on average. This crash is equivalent in functionality to having malloc() which will never return NULL (will just exit(1) on failure) plus (a) we don't get a nice "Out of memory!" message on stderr and (b) we wasted a lot of code on those 99.99% malloc() callsites where we do check for NULL and react to it. 4. Some versions of libc implement large malloc() requests with mmap with on-demand supply of RAM pages. Thus malloc() succeeds, but if there is no free page to satisfy an access to seemingly "already allocated" buffer, application will get SIGBUS and normally will die. Any code which attempts to check for NULLs from malloc will not work in this case, it will be just it - lots of unreachable code. 5. On Linux, desktop boxes are usually configured to "overcommit" memory, i.e. app never sees NULL from malloc. Kernel just kills the "worst" process when it gets REALLY tight on memory. Again, code which attempts to check for NULLs from malloc will not work in this case. I am not proposing that memory allocators should be dumb and assume that RAM is infinite. I'm fine with them detecting low-mem condition and taking some action (purging caches (fonts, images, whatever) alerting user: "Low on mem, please close unneeded windows", etc). I just saying: "If there is genuinely _no memory_ to satisfy this malloc() - just say so and exit immediately. Simple. Small piece f code. Maybe a bit rude, but hey, point me to another program which handles OOM better.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 3

•

19 years ago

Opera.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 4

•

19 years ago

Desktops tend to act as you say (although it's often configurable, and Firefox is the sort of thing you might want to sandbox with ulimit). Embedded platforms don't. We want our code to be usable on embedded platforms. Getting there will be hard, but it is attainable, and we want to move forwards not backwards.

timeless

Comment 5

•

19 years ago

simply put there are companies actively interested in making this work (especially gecko on small devices). and they've recently hired developers who have historically cared about such crashes in mozilla code. that said, this looks like a series of gtk bugs which definitely need to be reported. reporter if you haven't already done so, please file against gtk. if you don't, it'll probably be one of the first things i get to do when i start my job. for people who like using mxr, there's a slightly dated version of gtk available @ http://landfill.mozilla.org/mxr-test/gnome/ident?i=gdk_pixmap_new where you can follow this trail of tears (i'm sorry, the ident db seems slightly out of sync w/ the files, so the line numbers are a bit off, i'll try to correct that later).

Denis Vlasenko

Reporter

Comment 6

•

19 years ago

Ok. GTK bugs are GTK bugs, but there is also a Mozilla bug at gfx/src/gtk/nsImageGTK.cpp:572 (see the beginning of this bug). There is no check for gdk_pixmap_new() OOM failure.

Denis Vlasenko

Reporter

Comment 7

•

19 years ago

Look at bug 336119 for an example of how hopeless is it. Callers of txList::add() and txList::insert() make an assumption that it cannot fail, forgetting about OOM case. Even if bug in txPattern::getSimplePatterns() will be fixed, a further audit is needed for all callers of txPattern::getSimplePatterns(). I still think that it is much more practical to just make operator new() never return NULL.

Denis Vlasenko

Reporter

Comment 8

•

19 years ago

Bug 336145 is another example of subtle bugs which start to trigger only in OOM condition.

Denis Vlasenko

Reporter

Comment 9

•

19 years ago

Bug 336192 shows a pointer member initialized in the constructor with pointer to new'ed object. It can be NULL, but in C++ constructors cannot fail and report OOM, so it is not checked for NULL. We need to add NULL checks to many member functions in order to cover for that. More to it, some of those previously-safe member functions (i.e. they could not fail) now _can fail_ - i.e. they can suddenly return "gee, our constructor had hit OOM" error. We need to add handling for that also. And we need to somehow debug all these changes. Looks horrible. Alternatively, we may add bool nsProfileAccess::isConstructedOk() function and call it on every nsProfileAccess instance after constructon but before actual usage. Not nice either.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 10

•

19 years ago

For that reason, many of our objects have an Init method which can fail. In that bug I would add an Init method to nsProfileAccess which gets called after construction; if it fails, the creator deletes the new object and fails itself. The other methods of nsProfileAccess don't need to change.

Denis Vlasenko

Reporter

Comment 11

•

19 years ago

Is it possible to handle OOM caused by stack growth?

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 12

•

19 years ago

No. We limit stack size by limiting the depth of recursion.

Denis Vlasenko

Reporter

Comment 13

•

19 years ago

Limiting recursion depth prevents _unlimited_ stack growth. I am talking about different scenario. Stacks of userspace programs grow on demand. What will happen if there is no available memory for next stack page? void f() { char v[4096]; do_something(v); } void g() { f(); /* What if there is no free page for stack growth? */ }

Denis Vlasenko

Reporter

Comment 14

•

19 years ago

I wanted to check my assumtion that currently Mozilla doesn't handle OOM well. I switched my Linux into "no overcommit" mode, started firefox and opened a zillion of tabs, each with locally saved start page of ccd.cosmotography.com: echo 2 >/proc/sys/vm/overcommit_memory echo 80 >/proc/sys/vm/overcommit_ratio exec firefox "$PWD" and then Ctrl-clicking on the ccd_cosmotography_com.htm. Most of the time firefox was just crashing. Once it said this to stderr (or stdout): terminate called after throwing an instance of 'std::bad_alloc' what(): St9bad_alloc but most of the time it says nothing. Strace looks like this: ... mprotect(0x3bb80000, 4096, PROT_READ|PROT_WRITE) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3bc00000 munmap(0x3bc00000, 0) = -1 EINVAL (Invalid argument) munmap(0x3bd00000, 1048576) = 0 mprotect(0x3bc00000, 32768, PROT_READ|PROT_WRITE) = -1 ENOMEM (Cannot allocate memory) munmap(0x3bc00000, 1048576) = 0 mprotect(0x3bb80000, 8192, PROT_READ|PROT_WRITE) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3bc00000 munmap(0x3bc00000, 0) = -1 EINVAL (Invalid argument) munmap(0x3bd00000, 1048576) = 0 mprotect(0x3bc00000, 32768, PROT_READ|PROT_WRITE) = -1 ENOMEM (Cannot allocate memory) munmap(0x3bc00000, 1048576) = 0 brk(0) = 0xa4dc000 brk(0xa4dd000) = 0xa4dc000 mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) --- SIGSEGV (Segmentation fault) @ 0 (0) --- unlink("/root/.mozilla/firefox/vz0c60y9.default/lock") = 0 rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0 rt_sigprocmask(SIG_UNBLOCK, [SEGV], NULL, 8) = 0 kill(11014, SIGSEGV) = 0 --- SIGSEGV (Segmentation fault) @ 0 (0) --- Process 11014 detached

Denis Vlasenko

Reporter

Comment 15

•

19 years ago

I want to play with "gracefully failing" malloc(), something like the code below. Am I correct that something like this should be incorporated into nsprpub/pr/src/malloc/prmem.c? I am not too familiar with Mozilla source tree. static int fault_injection_ratio = -1; static size_t guard_buffer_size = 64*1024; static size_t guard_buffer_cursz = 64*1024; static void* guard_buffer; extern void malloc_failed(size_t size); extern void malloc_low_on_mem(size_t size, size_t guard_buffer_cursz); static void allocate_guard_buffer(void) { free(guard_buffer); /* guard_buffer = NULL; */ guard_buffer_cursz = guard_buffer_size; do { guard_buffer = malloc(guard_buffer_cursz); if(guard_buffer) { /* Guard against lazy malloc implementations */ memset(guard_buffer, 0xfd, guard_buffer_cursz); break; } guard_buffer_cursz /= 2; } while(guard_buffer_cursz); } static void setup_malloc_from_env(void) { char *str; fault_injection_ratio = 0; str = getenv("MALLOC_FAULT_RATIO"); if(str) fault_injection_ratio = (strtoul(str,NULL,0) & INT_MAX); str = getenv("MALLOC_GUARD_BUFFER_SIZE"); if(str) guard_buffer_size = (strtoul(str,NULL,0) & INT_MAX); allocate_guard_buffer(); } static void* do_malloc(size_t size) { void* p = malloc(size); if(p) return p; free(guard_buffer); guard_buffer = NULL; p = malloc(size); if(!p) { /* It's probably best to abort... */ malloc_failed(size); return NULL; } /* Some memory may be still available, but not much */ /* (less than guard_buffer_cursz). Warn user about it */ malloc_low_on_mem(size, guard_buffer_cursz); allocate_guard_buffer(); return p; } void* my_malloc(size_t size) { again: if(!fault_injection_ratio) return do_malloc(size); if(fault_injection_ratio == -1) { setup_malloc_from_env(); goto again; } if((random() % fault_injection_ratio) == 0) return NULL; return do_malloc(size); }

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 16

•

19 years ago

(In reply to comment #13) > I am talking about different scenario. Stacks of userspace programs grow on > demand. What will happen if there is no available memory for next stack page? Crash, or at least a signal. If you limit recursion, you can in theory ensure that you run in a fixed amount of stack space, say less than 1MB. No-one has really tried to check that systematically, yet, although we have fixed bugs where specially crafted testcase crashed the browser by overflowing the stack.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 17

•

19 years ago

(In reply to comment #14) > I wanted to check my assumtion that currently Mozilla doesn't handle OOM well. We all agree with that assumption. The question is whether we should try to make Mozilla handle OOM well, or just give up. We use new and delete a lot, so for something meaningful I think you should hook into glibc malloc.

Denis Vlasenko

Reporter

Comment 18

•

19 years ago

I implemented a minimalistic solution which: * can allocate an emergency buffer, will warn when malloc need to tap it (controlled by MALLOC_GUARD_SIZE env var, disabled by default) * can inject malloc failures with 1/MALLOC_FAULT_RATIO probability (also controlled by env var, 0 - disabled (default)) * malloc will never fail. It will call malloc_low_on_mem() if emergency buffer needs to be used, and will call malloc_failed() if there is no mem at all. * current version just prints message to stderr. Better one may show a (rate-limited) message box if malloc_low_on_mem() is called. This patch will currently work on glibc systems only - it uses the fact that malloc and realloc are weak symbols and can be replaced. This would not work, for example, with uclibc. With this patch, (1) lots of error checks for out-of-memory can be eliminated and (2) firefox exits cleanly instead of crashing on OOM. Testing with ccd_cosmotography_com.htm as described earlier: # ./firefox malloc(86) failed # MALLOC_GUARD_SIZE=10000000 ./firefox The program 'firefox-bin' received an X Window System error. This probably reflects a bug in the program. The error was 'BadAlloc (insufficient resources for operation)'. (Details: serial 15797 error_code 11 request_code 53 minor_code 0) (Note to programmers: normally, X errors are reported asynchronously; that is, you will receive the error a while after causing it. To debug your program, run it with the --sync command line option to change this behavior. You can then get a meaningful backtrace from your debugger if you break on the gdk_x_error() function.) # ./firefox The program 'firefox-bin' received an X Window System error. This probably reflects a bug in the program. The error was 'BadAlloc (insufficient resources for operation)'. (Details: serial 12427 error_code 11 request_code 53 minor_code 0) (Note to programmers: normally, X errors are reported asynchronously; that is, you will receive the error a while after causing it. To debug your program, run it with the --sync command line option to change this behavior. You can then get a meaningful backtrace from your debugger if you break on the gdk_x_error() function.)

Denis Vlasenko

Reporter

Comment 19

•

19 years ago

Attached patch replace malloc with one which never returns NULL — Details — Splinter Review

Denis Vlasenko

Reporter

Comment 20

•

19 years ago

Test scenario where only firefox process is forced to OOM (X server is not affected): # MALLOC_GUARD_SIZE=1000000 softlimit -a 60000000 firefox /pub/space ...Ctrl-clicking.... Low on mem: less than 1000000 bytes are free Low on mem: less than 1000000 bytes are free Low on mem: less than 500000 bytes are free Low on mem: less than 250000 bytes are free Low on mem: less than 125000 bytes are free Low on mem: less than 62500 bytes are free Low on mem: less than 31250 bytes are free Low on mem: less than 15625 bytes are free Low on mem: less than 7812 bytes are free Low on mem: less than 3906 bytes are free Low on mem: less than 1953 bytes are free Low on mem: less than 976 bytes are free malloc(64) failed

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 21

•

19 years ago

Maybe a good strategy would be to detect nearly-OOM at a high-level and then try to have the browser recover by closing windows and freeing memory in other ways. But there's still the possibility that a single large allocation could fail, or that the reserve buffer could be exhausted accidentally while trying to recover, so I'm not sure that approach could be reliable.

Boris Zbarsky [:bzbarsky]

Comment 22

•

19 years ago

This bug is starting to be too much of a meta bug.... I'd file a separate bug on gfx/src/gtk/nsImageGTK.cpp:572 so it actually gets fixed....

Denis Vlasenko

Reporter

Comment 23

•

19 years ago

Boris, on closer look I found out that GTK (actually, glib) already contains the code which make g_xxx_new allocations to terminate the program on failure, never return NULL. There is no bug at gfx/src/gtk/nsImageGTK.cpp:572...

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 24

•

19 years ago

He's not CCed anymore. I'm going to start a thread on mozilla.dev.platform.

Dan Mosedale (:dmosedale, :dmose)

Comment 25

•

19 years ago

(In reply to comment #21) > Maybe a good strategy would be to detect nearly-OOM at a high-level and then > try to have the browser recover by closing windows and freeing memory in other > ways. But there's still the possibility that a single large allocation could > fail, or that the reserve buffer could be exhausted accidentally while trying > to recover, so I'm not sure that approach could be reliable. > As I recall, the theory behind the memory flushing stuff in the XPCOM allocator was something along these lines.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 26

•

19 years ago

Here's the thread. http://groups.google.com/group/mozilla.dev.platform/browse_frm/thread/afba0c925ba1a8f7/ca83e4b470987dd9#ca83e4b470987dd9 Nokia uses GTK on some of their devices, I wonder what they do about this.

Denis Vlasenko

Reporter

Comment 27

•

19 years ago

Alink to gtk archived mail: http://mail.gnome.org/archives/gtk-app-devel-list/2000-April/msg00369.html

Mats Palmgren (inactive)

Updated

•

17 years ago

Status: UNCONFIRMED → NEW

Component: GFX: Gtk → Widget: Gtk

Ever confirmed: true

QA Contact: gtk → gtk

Wayne Mery (:wsmwk)

Comment 28

•

12 years ago

is this obsolete just as bug 210931 comment 85 ... (In reply to Joe Drew (:JOEDREW! \o/) from comment #85) > I imagine this was fixed way back when we started using Cairo. Please reopen > if that's not the case.

Karl Tomlinson (:karlt)

Comment 29

•

12 years ago

(In reply to Denis Vlasenko from comment #0) > gpointer > g_malloc (gulong n_bytes) > { > if (n_bytes) > { > gpointer mem; > > mem = glib_mem_vtable.malloc (n_bytes); > if (mem) > return mem; > > g_error ("%s: failed to allocate %lu bytes", G_STRLOC, n_bytes); > } > > return NULL; > } > > So it CAN fail. g_error is fatal, aborting the program. Bug 611123 is about the more general case for other callers of malloc. Bug 709952 is about trying to recover memory when close to failure.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → DUPLICATE