Last Comment Bug 862592 - Assert when infallibly allocating large blocks of memory (>1mb?)
: Assert when infallibly allocating large blocks of memory (>1mb?)
Status: RESOLVED INCOMPLETE
[MemShrink:P2]
:
Product: Core
Classification: Components
Component: XPCOM (show other bugs)
: Trunk
: All All
: -- normal with 1 vote (vote)
: ---
Assigned To: Nobody; OK to take it and work on it
:
: Nathan Froyd [:froydnj]
Mentors:
Depends on: 864344
Blocks: 427099
  Show dependency treegraph
 
Reported: 2013-04-16 16:16 PDT by Andrew McCreight [:mccr8]
Modified: 2015-08-19 19:58 PDT (History)
21 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Rough patch to detect large infallible allocations. (913 bytes, patch)
2013-04-21 23:22 PDT, Nicholas Nethercote [:njn]
no flags Details | Diff | Splinter Review

Description Andrew McCreight [:mccr8] 2013-04-16 16:16:57 PDT
I don't know if we'd want to land this, but it might be interesting to assert when we attempt to infallibly allocate large blocks of memory, in an attempt to root out places that should be fallible more aggressively than waiting for them to show up on crash stats.
Comment 1 Andrew McCreight [:mccr8] 2013-04-16 16:18:07 PDT
See bug 829954 where users running out of virtual memory are crashing in some particular graphics patch.
Comment 2 Nicholas Nethercote [:njn] 2013-04-16 17:16:10 PDT
Bug 507249 was about the same idea.  Bug 507249 comment 7 said:

> We do a good job of figuring this out through crash-stats, so I don't think
> this bug is useful anymore.

but I'm not convinced -- we really shouldn't be infallibly allocating large (e.g. 1 MiB+) chunks, and catching them ahead of time is much better and more reliable than catching them from OOM crashes.
Comment 3 Boris Zbarsky [:bz] (still a bit busy) 2013-04-16 18:22:37 PDT
This sounds like a good idea to me, fwiw....
Comment 4 Nicholas Nethercote [:njn] 2013-04-16 22:49:02 PDT
Judging by memory/mozalloc/mozalloc_macro_wrappers.h, malloc/calloc/realloc are still *fallible*, and it's only operator new/new[] that are infallible.

But the infallible new/new[] operators end up going through moz_xmalloc (which is infallible) anyway -- see memory/mozalloc/mozalloc.h.  So I *think* that checking within moz_xmalloc() is enough.
Comment 5 Mike Hommey [:glandium] 2013-04-16 23:14:47 PDT
SGTM
Comment 6 Justin Lebar (not reading bugmail) 2013-04-17 02:03:07 PDT
FWIW fallible malloc is not helpful on B2G; I've never seen malloc return null there.  Instead, the low-memory killer shoots a process when we're running low on memory.

I think fallible malloc is only helpful when either

 * you're on a system which does not over-commit (Windows), or
 * the amount of virtual memory available to your process is low.

IOW, this might be helpful for Windows, but probably not for *nix (including B2G and Android).
Comment 7 Trevor Saunders (:tbsaunde) 2013-04-17 02:14:11 PDT
(In reply to Justin Lebar [:jlebar] from comment #6)
> FWIW fallible malloc is not helpful on B2G; I've never seen malloc return
> null there.  Instead, the low-memory killer shoots a process when we're
> running low on memory.
> 
> I think fallible malloc is only helpful when either
> 
>  * you're on a system which does not over-commit (Windows), or

I believe you can make linux not over commit if you so wish with /proc/sys/vm/overcommit_{ratio,memory}
Comment 8 Justin Lebar (not reading bugmail) 2013-04-17 02:20:45 PDT
Indeed.  I doubt we could turn this on for B2G, though.  One issue (out of many, likely) is that when you malloc(x) for x > 1mb, jemalloc rounds up to the next mb.  This isn't a big deal when we have overcommit (and although it wastes some vmem, it avoids fragmenting the virtual memory space, which is important).  Without overcommit, at the very least we'd need to explicitly decommit the extra memory.  I suspect we'd need to change other mmap'ers too, and external code like drivers might become involved.

Another issue is that a child process can make a big allocation when the system has plenty of memory available, then the main process can come along and allocate some memory and cause the child process to die.  Fallible malloc can't save us there.
Comment 9 Robert Kaiser 2013-04-17 05:05:47 PDT
Well, we know we have some problems on Windows with out-of-memory crashes, so even if this only helps us there (in terms of finding cases that could cause us problems), it can already be a great help.
Also, you mention VM space fragmentation and we also know cases where this becomes a problem (see recent analysis that bsmedberg did), so if we can decrease that, this could also help (though we're not yet sure if those fragmentation issues are ours or created e.g. by graphics drivers).

> Fallible malloc can't save us there.

Might be interesting to find out what can save us, then - probably in yet another bug. ;-)
Comment 10 Jesse Ruderman 2013-04-17 08:36:25 PDT
I am opposed to making this an assertion.  I'd be okay with a warning.

* There are many allocation sites that only allocate large amounts in unlikely fuzz edge cases.  "Fixing" these is not only a waste of time but also adds attack surface.  Attack surface that is very hard to test, by the way.

* Even if we "fix" all the large allocations, Firefox is still going to misbehave and eventually crash when out of memory.  Spreading OOM crashes over a larger number of signatures is not the same as reducing the number of OOM crashes.

* It might fire in third-party code we cannot alter.

Our last-ditch defense against OOM crashes has to be user-facing: refusing to load more pages or open more tabs when Firefox is using too much memory.
Comment 11 Justin Lebar (not reading bugmail) 2013-04-17 08:48:23 PDT
I don't disagree with anything Jesse said, but I would disagree with something which significantly increases our debug spew.  If we did a warning, I'd probably want it to be off by default.

> 1 sort operation has occurred for the SQL statement '0x1164053e0'.  See 
> https://developer.mozilla.org/En/Storage/Warnings details.

/me grumbles
Comment 12 Nicholas Nethercote [:njn] 2013-04-17 15:56:19 PDT
mccr8 and I weren't necessarily thinking of landing the assertion.  The idea (AIUI) was more a "try it locally and see what happens" kind of thing.
Comment 13 Nicholas Nethercote [:njn] 2013-04-21 23:22:09 PDT
Created attachment 740177 [details] [diff] [review]
Rough patch to detect large infallible allocations.

When using 1 MiB as the threshold: 

- While browsing around on Linux64 desktop I regularly hit infallible allocs in
  the 1--3 MiB range in safe-browsing code.  We've seen OOMs in that code
  before in bug 702217.  I filed bug 864236 for it.

- GDB also stopped a few times in __libc_send which I didn't understand.  Stack
  trace below.

When using 256 KiB as the threshold:

- I hit 256 KiB allocations at xpcom/base/nsCycleCollector.cpp:506 frequently.
  It checks for failure, but because it calls NS_Alloc, which is infallible,
  the failure case never happens.  mccr8, any thoughts on whether that's worth
  modifying?

- I also saw the AlphaBoxBlur::Blur one from bug 829954 a bunch of times.


#0  0x00007ffff7bcc2cc in __libc_send (fd=<optimised out>, 
    buf=<optimised out>, n=<optimised out>, flags=<optimised out>)
    at ../sysdeps/unix/sysv/linux/x86_64/send.c:33
#1  0x00007ffff7eb5951 in pt_Send (fd=0x7fffd7622370, buf=0x7fff5c302000, 
    amount=27, flags=0, timeout=4294967295)
    at /home/njn/moz/mi7/nsprpub/pr/src/pthreads/ptio.c:1914
#2  0x00007ffff6db494f in ssl_DefSend (ss=0x7fff5c58f000, 
    buf=0x7fff5c302000 "\025\003\001", len=27, flags=0) at ssldef.c:95
#3  0x00007ffff6d93ab3 in ssl3_SendRecord (ss=0x7fff5c58f000, epoch=0, 
    type=content_alert, pIn=0x7fffe31fe7c0 "", nIn=0, flags=0)
    at ssl3con.c:2557
#4  0x00007ffff6d946ee in SSL3_SendAlert (ss=0x7fff5c58f000, 
    level=alert_warning, desc=close_notify) at ssl3con.c:2848
#5  0x00007ffff6dbe9c9 in ssl_SecureClose (ss=0x7fff5c58f000)
    at sslsecur.c:1061
#6  0x00007ffff6dcaa08 in ssl_Close (fd=0x7fffdcfcda60) at sslsock.c:2095
#7  0x00007ffff29170d7 in nsNSSSocketInfo::CloseSocketAndDestroy (
    this=0x7fff5cbfbf00)
    at /home/njn/moz/mi7/security/manager/ssl/src/nsNSSIOLayer.cpp:759
#8  0x00007ffff2918017 in nsSSLIOLayerClose (fd=0x7fffdcfcda60)
    at /home/njn/moz/mi7/security/manager/ssl/src/nsNSSIOLayer.cpp:737
#9  0x00007ffff7e878e8 in PR_Close (fd=0x7fffdcfcda60)
    at /home/njn/moz/mi7/nsprpub/pr/src/io/priometh.c:104
#10 0x00007ffff0d104fb in nsSocketTransport::ReleaseFD_Locked (
    this=0x7fff5d8e3f40, fd=0x7fffdcfcda60)
    at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransport2.cpp:1452
#11 0x00007ffff0d15a30 in nsSocketTransport::OnSocketDetached (
    this=0x7fff5d8e3f40, fd=0x7fffdcfcda60)
    at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransport2.cpp:1699
#12 0x00007ffff0d1a0c9 in nsSocketTransportService::DetachSocket (
    this=0x7ffff6c181a0, listHead=0x7fff6ffa5000, sock=0x7fff6ffa5120)
    at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransportService2.cpp:181
#13 0x00007ffff0d1c692 in nsSocketTransportService::DoPollIteration (
    this=0x7ffff6c181a0, wait=false)
    at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransportService2.cpp:760
#14 0x00007ffff0d1c107 in nsSocketTransportService::Run (this=0x7ffff6c181a0)
    at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransportService2.cpp:642
#15 0x00007ffff0d1cccc in non-virtual thunk to nsSocketTransportService::Run()
    (this=0x7ffff6c181b8)
    at /home/njn/moz/mi7/netwerk/base/src/nsSocketTransportService2.cpp:686
#16 0x00007ffff373c6e5 in nsThread::ProcessNextEvent (this=0x7fffe3441420, 
    mayWait=true, result=0x7fffe31fed5e)
    at /home/njn/moz/mi7/xpcom/threads/nsThread.cpp:627
#17 0x00007ffff36a58f9 in NS_ProcessNextEvent (thread=0x7fffe3441420, 
    mayWait=true) at nsThreadUtils.cpp:238
#18 0x00007ffff373abc7 in nsThread::ThreadFunc (arg=0x7fffe3441420)
    at /home/njn/moz/mi7/xpcom/threads/nsThread.cpp:265
#19 0x00007ffff7ebb412 in _pt_root (arg=0x7ffff6c44d00)
    at /home/njn/moz/mi7/nsprpub/pr/src/pthreads/ptthread.c:191
#20 0x00007ffff7bc4e9a in start_thread (arg=0x7fffe31ff700)
    at pthread_create.c:308
#21 0x00007ffff6ed8cbd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
Comment 14 Andrew McCreight [:mccr8] 2013-04-22 08:34:50 PDT
Thanks for looking at this!

> mccr8, any thoughts on whether that's worth modifying?

Well, I should at least get rid of that gross code that isn't actually doing anything. ;)

If the current code was modified to a fallible malloc, it could end up freeing things that are live, if you bail out early (though that would be one way to reduce memory usage...).  The right way to handle OOM would be to throw out the graph and abandon the CC.  If the CC can't free anything, you are probably in big trouble already, so I'm inclined to leave it alone.
Comment 15 Andrew McCreight [:mccr8] 2013-04-22 08:39:26 PDT
It may become more of an issue with incremental cycle collection, when then CC data structures will be alive for a second, intermixed with mutator operation, rather than in a single block of 5 to 200ms or so.
Comment 16 Andrew McCreight [:mccr8] 2014-08-23 09:56:15 PDT
I think this was enough of an investigation that we can close this.

Note You need to log in before you can comment on or make changes to this bug.