862592 - Assert when infallibly allocating large blocks of memory (>1mb?)

Reporter

Description

•

12 years ago

I don't know if we'd want to land this, but it might be interesting to assert when we attempt to infallibly allocate large blocks of memory, in an attempt to root out places that should be fallible more aggressively than waiting for them to show up on crash stats.

Andrew McCreight [:mccr8]

Reporter

Comment 1

•

12 years ago

See bug 829954 where users running out of virtual memory are crashing in some particular graphics patch.

Nicholas Nethercote [inactive]

Updated

•

12 years ago

Whiteboard: [MemShrink] → [MemShrink:P2]

Nicholas Nethercote [inactive]

Comment 2

•

12 years ago

Bug 507249 was about the same idea. Bug 507249 comment 7 said: > We do a good job of figuring this out through crash-stats, so I don't think > this bug is useful anymore. but I'm not convinced -- we really shouldn't be infallibly allocating large (e.g. 1 MiB+) chunks, and catching them ahead of time is much better and more reliable than catching them from OOM crashes.

Boris Zbarsky [:bzbarsky]

Comment 3

•

12 years ago

This sounds like a good idea to me, fwiw....

Nicholas Nethercote [inactive]

Comment 4

•

12 years ago

Judging by memory/mozalloc/mozalloc_macro_wrappers.h, malloc/calloc/realloc are still *fallible*, and it's only operator new/new[] that are infallible. But the infallible new/new[] operators end up going through moz_xmalloc (which is infallible) anyway -- see memory/mozalloc/mozalloc.h. So I *think* that checking within moz_xmalloc() is enough.

Mike Hommey [:glandium]

Comment 5

•

12 years ago

SGTM

Justin Lebar (not reading bugmail)

Comment 6

•

12 years ago

FWIW fallible malloc is not helpful on B2G; I've never seen malloc return null there. Instead, the low-memory killer shoots a process when we're running low on memory. I think fallible malloc is only helpful when either * you're on a system which does not over-commit (Windows), or * the amount of virtual memory available to your process is low. IOW, this might be helpful for Windows, but probably not for *nix (including B2G and Android).

Trevor Saunders (:tbsaunde)

Comment 7

•

12 years ago

(In reply to Justin Lebar [:jlebar] from comment #6) > FWIW fallible malloc is not helpful on B2G; I've never seen malloc return > null there. Instead, the low-memory killer shoots a process when we're > running low on memory. > > I think fallible malloc is only helpful when either > > * you're on a system which does not over-commit (Windows), or I believe you can make linux not over commit if you so wish with /proc/sys/vm/overcommit_{ratio,memory}

Justin Lebar (not reading bugmail)

Comment 8

•

12 years ago

Indeed. I doubt we could turn this on for B2G, though. One issue (out of many, likely) is that when you malloc(x) for x > 1mb, jemalloc rounds up to the next mb. This isn't a big deal when we have overcommit (and although it wastes some vmem, it avoids fragmenting the virtual memory space, which is important). Without overcommit, at the very least we'd need to explicitly decommit the extra memory. I suspect we'd need to change other mmap'ers too, and external code like drivers might become involved. Another issue is that a child process can make a big allocation when the system has plenty of memory available, then the main process can come along and allocate some memory and cause the child process to die. Fallible malloc can't save us there.

Robert Kaiser

Comment 9

•

12 years ago

Well, we know we have some problems on Windows with out-of-memory crashes, so even if this only helps us there (in terms of finding cases that could cause us problems), it can already be a great help. Also, you mention VM space fragmentation and we also know cases where this becomes a problem (see recent analysis that bsmedberg did), so if we can decrease that, this could also help (though we're not yet sure if those fragmentation issues are ours or created e.g. by graphics drivers). > Fallible malloc can't save us there. Might be interesting to find out what can save us, then - probably in yet another bug. ;-)

Jesse Ruderman

Comment 10

•

12 years ago

I am opposed to making this an assertion. I'd be okay with a warning. * There are many allocation sites that only allocate large amounts in unlikely fuzz edge cases. "Fixing" these is not only a waste of time but also adds attack surface. Attack surface that is very hard to test, by the way. * Even if we "fix" all the large allocations, Firefox is still going to misbehave and eventually crash when out of memory. Spreading OOM crashes over a larger number of signatures is not the same as reducing the number of OOM crashes. * It might fire in third-party code we cannot alter. Our last-ditch defense against OOM crashes has to be user-facing: refusing to load more pages or open more tabs when Firefox is using too much memory.

Blocks: 427099

Justin Lebar (not reading bugmail)

Comment 11

•

12 years ago

I don't disagree with anything Jesse said, but I would disagree with something which significantly increases our debug spew. If we did a warning, I'd probably want it to be off by default. > 1 sort operation has occurred for the SQL statement '0x1164053e0'. See > https://developer.mozilla.org/En/Storage/Warnings details. /me grumbles

Nicholas Nethercote [inactive]

Comment 12

•

12 years ago

mccr8 and I weren't necessarily thinking of landing the assertion. The idea (AIUI) was more a "try it locally and see what happens" kind of thing.

Nicholas Nethercote [inactive]

Comment 13

•

12 years ago

Attached patch Rough patch to detect large infallible allocations. — Details — Splinter Review

When using 1 MiB as the threshold: - While browsing around on Linux64 desktop I regularly hit infallible allocs in the 1--3 MiB range in safe-browsing code. We've seen OOMs in that code before in bug 702217. I filed bug 864236 for it. - GDB also stopped a few times in __libc_send which I didn't understand. Stack trace below. When using 256 KiB as the threshold: - I hit 256 KiB allocations at xpcom/base/nsCycleCollector.cpp:506 frequently. It checks for failure, but because it calls NS_Alloc, which is infallible, the failure case never happens. mccr8, any thoughts on whether that's worth modifying? - I also saw the AlphaBoxBlur::Blur one from bug 829954 a bunch of times. #0 0x00007ffff7bcc2cc in __libc_send (fd=<optimised out>, buf=<optimised out>, n=<optimised out>, flags=<optimised out>) at ../sysdeps/unix/sysv/linux/x86_64/send.c:33 #1 0x00007ffff7eb5951 in pt_Send (fd=0x7fffd7622370, buf=0x7fff5c302000, amount=27, flags=0, timeout=4294967295) at /home/njn/moz/mi7/nsprpub/pr/src/pthreads/ptio.c:1914 #2 0x00007ffff6db494f in ssl_DefSend (ss=0x7fff5c58f000, buf=0x7fff5c302000 "\025\003\001", len=27, flags=0) at ssldef.c:95 #3 0x00007ffff6d93ab3 in ssl3_SendRecord (ss=0x7fff5c58f000, epoch=0, type=content_alert, pIn=0x7fffe31fe7c0 "", nIn=0, flags=0) at ssl3con.c:2557 #4 0x00007ffff6d946ee in SSL3_SendAlert (ss=0x7fff5c58f000, level=alert_warning, desc=close_notify) at ssl3con.c:2848 #5 0x00007ffff6dbe9c9 in ssl_SecureClose (ss=0x7fff5c58f000) at sslsecur.c:1061 #6 0x00007ffff6dcaa08 in ssl_Close (fd=0x7fffdcfcda60) at sslsock.c:2095 #7 0x00007ffff29170d7 in nsNSSSocketInfo::CloseSocketAndDestroy ( this=0x7fff5cbfbf00) at /home/njn/moz/mi7/security/manager/ssl/src/nsNSSIOLayer.cpp:759 #8 0x00007ffff2918017 in nsSSLIOLayerClose (fd=0x7fffdcfcda60) at /home/njn/moz/mi7/security/manager/ssl/src/nsNSSIOLayer.cpp:737 #9 0x00007ffff7e878e8 in PR_Close (fd=0x7fffdcfcda60) at /home/njn/moz/mi7/nsprpub/pr/src/io/priometh.c:104 #10 0x00007ffff0d104fb in nsSocketTransport::ReleaseFD_Locked ( this=0x7fff5d8e3f40, fd=0x7fffdcfcda60) at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransport2.cpp:1452 #11 0x00007ffff0d15a30 in nsSocketTransport::OnSocketDetached ( this=0x7fff5d8e3f40, fd=0x7fffdcfcda60) at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransport2.cpp:1699 #12 0x00007ffff0d1a0c9 in nsSocketTransportService::DetachSocket ( this=0x7ffff6c181a0, listHead=0x7fff6ffa5000, sock=0x7fff6ffa5120) at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransportService2.cpp:181 #13 0x00007ffff0d1c692 in nsSocketTransportService::DoPollIteration ( this=0x7ffff6c181a0, wait=false) at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransportService2.cpp:760 #14 0x00007ffff0d1c107 in nsSocketTransportService::Run (this=0x7ffff6c181a0) at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransportService2.cpp:642 #15 0x00007ffff0d1cccc in non-virtual thunk to nsSocketTransportService::Run() (this=0x7ffff6c181b8) at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransportService2.cpp:686 #16 0x00007ffff373c6e5 in nsThread::ProcessNextEvent (this=0x7fffe3441420, mayWait=true, result=0x7fffe31fed5e) at /home/njn/moz/mi7/xpcom/threads/nsThread.cpp:627 #17 0x00007ffff36a58f9 in NS_ProcessNextEvent (thread=0x7fffe3441420, mayWait=true) at nsThreadUtils.cpp:238 #18 0x00007ffff373abc7 in nsThread::ThreadFunc (arg=0x7fffe3441420) at /home/njn/moz/mi7/xpcom/threads/nsThread.cpp:265 #19 0x00007ffff7ebb412 in _pt_root (arg=0x7ffff6c44d00) at /home/njn/moz/mi7/nsprpub/pr/src/pthreads/ptthread.c:191 #20 0x00007ffff7bc4e9a in start_thread (arg=0x7fffe31ff700) at pthread_create.c:308 #21 0x00007ffff6ed8cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

Nicholas Nethercote [inactive]

Updated

•

12 years ago

Assignee: nobody → n.nethercote

Andrew McCreight [:mccr8]

Reporter

Comment 14

•

12 years ago

Thanks for looking at this! > mccr8, any thoughts on whether that's worth modifying? Well, I should at least get rid of that gross code that isn't actually doing anything. ;) If the current code was modified to a fallible malloc, it could end up freeing things that are live, if you bail out early (though that would be one way to reduce memory usage...). The right way to handle OOM would be to throw out the graph and abandon the CC. If the CC can't free anything, you are probably in big trouble already, so I'm inclined to leave it alone.

Andrew McCreight [:mccr8]

Reporter

Comment 15

•

12 years ago

It may become more of an issue with incremental cycle collection, when then CC data structures will be alive for a second, intermixed with mutator operation, rather than in a single block of 5 to 200ms or so.

Andrew McCreight [:mccr8]

Reporter

Updated

•

12 years ago

Depends on: 864344

Nicholas Nethercote [inactive]

Updated

•

10 years ago

Assignee: n.nethercote → nobody

Andrew McCreight [:mccr8]

Reporter

Comment 16

•

10 years ago

I think this was enough of an investigation that we can close this.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → INCOMPLETE