Bug 709952 (Closed) — Opened 13 years ago, Closed 1 year ago

Infallible malloc OOM killer

Categories: Core :: DOM: Core & HTML, defect, P5

Status: RESOLVED INCOMPLETE

People: Reporter: cdleary, Unassigned

I was talking with bz about the possibility of cleaving off and deallocating a big chunk of application memory when we detect an OOM in infallible malloc.

Presumably, you start off by doing the easy things: flushing any unnecessary caches, and so on. But eventually you may bump up against a hard limit for these fixed-size allocations, and your options are limited.

From email:

<<EOF
bz> Right now that's handled entirely inside malloc, which doesn't know what
bz> page you're on.  So we just abort.

Could we teach it? I guess I'm thinking that, before the abort path,
but after the cache flush stuff that doesn't yet exist, we could have
it chop out and delete the current content page from the browser
without using any extra memory. Or, you could go the OOM killer route
and just kill the page we think would be most profitable to kill.

bz> We could try doing that, yes.  The latter would be simpler in some ways,
bz> because it wouldn't require keeping pervasive track of the "current page".

Yeah, I was hoping that we wouldn't have to keep track, but that we
could somehow infer it from infallible-malloc-accessible state.
EOF
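The escalation ladder discussed in the email above (flush caches first, tear down a page second, abort only as a last resort) could be sketched roughly as follows. The hook names and the recovery steps are invented for illustration and are not actual Gecko code; the steps are stubs that release nothing, so on a real failure this still aborts, just like today:

```cpp
#include <cstddef>
#include <cstdlib>

// A recovery step returns true if it managed to release memory.
using RecoveryStep = bool (*)();

// Stubs standing in for the ideas above; a real implementation would
// drop caches or tear down the most expensive content page.
static bool FlushCaches()      { return false; }
static bool KillHeaviestPage() { return false; }

// Tried in order: cheap and reversible first, destructive later.
static RecoveryStep gSteps[] = { FlushCaches, KillHeaviestPage };

void* InfallibleMalloc(size_t aSize) {
  for (;;) {
    if (void* p = malloc(aSize)) {
      return p;
    }
    bool recovered = false;
    for (RecoveryStep step : gSteps) {
      if (step()) {
        recovered = true;  // something was freed; retry the allocation
        break;
      }
    }
    if (!recovered) {
      abort();  // out of options: crash, as the current code does
    }
  }
}
```

Note that, per the later discussion in this bug, each step would itself have to run without allocating, which is the hard part.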

How difficult would it be to sever the ties from a single "content instance" (I'm not familiar with the terminology in this domain) to the rest of the browser?
I don't really understand what you want to do here ... do we want to start killing pages on OOM or something?
That's the proposal, yes.
That sounds ... tricky.  Are we planning to reserve some memory so that this kicks in before we hit "real" OOM and we still have some memory to work with?  Trying to do anything complex when we are actually OOM sounds like a recipe for bad things.

Also, we'll need memory around to run the cycle collector, GC, etc.  I think we should be able to kill the layout side of things pretty easily though.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #3)
> Trying to do anything complex when we are actually OOM sounds like a
> recipe for bad things.

I'm not at all familiar with the details, so I'm going to ask dumb questions: would we definitely need extra memory to separate the DOM for a content page from the rest of the browser and deallocate it? Can we keep a ballast, as you suggest, because the extra memory required to do this is easily bounded?

> Also, we'll need memory around to run the cycle collector, GC, etc.  I think
> we should be able to kill the layout side of things pretty easily though.

Can we do these *after* the other resources have been deallocated?
(In reply to Chris Leary [:cdleary] from comment #4)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #3)
> > Trying to do anything complex when we are actually OOM sounds like a
> > recipe for bad things.
> 
> I'm not at all familiar with the details, so I'm going to ask dumb
> questions: would we definitely need extra memory to separate the DOM for a
> content page from the rest of the browser and deallocate it?

The DOM, yes.  Layout data, probably not; we should just be able to drop the relevant arenas.

> Can we keep a
> ballast, as you suggest, because the extra memory required to do this is
> easily bounded?

We can keep a ballast, but I'm not sure that the memory is easily bounded in the pathological cases.  We can probably keep enough around for most sane cases though.
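A minimal sketch of the ballast idea: reserve a fixed block up front and release it when allocation first fails, buying headroom for teardown. The size and policy here are invented for illustration, and, as noted above, a fixed reserve cannot cover the pathological cases:

```cpp
#include <cstddef>
#include <cstdlib>

static void* gBallast = nullptr;
static const size_t kBallastSize = 4 * 1024 * 1024;  // illustrative 4 MiB reserve

// Call once at startup, while memory is still plentiful.
void InitBallast() {
  gBallast = malloc(kBallastSize);
}

// On OOM, give the reserve back to the heap. Returns true if headroom
// was actually released; false if the ballast was already spent.
bool ReleaseBallastOnOOM() {
  if (!gBallast) {
    return false;  // already spent: nothing left to give back
  }
  free(gBallast);
  gBallast = nullptr;
  return true;
}
```

The one-shot nature is the weakness khuey is pointing at: once the ballast is spent, the next OOM is back to square one unless the reserve can be safely re-acquired later.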

> > Also, we'll need memory around to run the cycle collector, GC, etc.  I think
> > we should be able to kill the layout side of things pretty easily though.
> 
> Can we do these *after* the other resources have been deallocated?

We'll need to run the cycle collector to destroy any substantive amount of DOM stuff.  Our experience with MemShrink is that most of the memory usage of webpages is generally in JS, so we'll probably have to run the GC too (and breaking DOM->JS cycles will require running both, of course).
Olli has been thinking about how to make the cycle collector deal with the DOM better, so looping him in.
In my approach to tearing down some parts of the DOM without the CC there is still a need for allocation, since
we need to keep things alive during the process.

But I guess that could be implemented in a somewhat different way where no allocation is needed.
However, that works only for DOM trees which aren't kept alive by anything outside them,
"outside" being scripts etc.
Here's an interesting question.  For a typical web page, if we nuke the presshell and then BlastSubtreeToPieces on the root, what fraction of nodes go away?  Obviously anything that has js refs to it won't (needs GC/CC), but on many pages this may be a minority of nodes.
And yeah, the actual memory usage for the DOM tree that's not reachable from JS is usually not that high.  Mine is at most 57MB right now out of a total heap of 1.5GB spread across 88 tabs.  So even if we can drop it, it might not get us the breathing room to GC.  :(
Adding Bill, since I think delayed marking should be a viable way of running GC in JS when we're in an OOM state (since it doesn't require additional memory), but I'm not confident. Would a JS-internal heap collection be enough to free up the references you're talking about in comment 8, bz?
I don't know.  I'm guessing we need cc to clean up js-wrapped nodes, but I could be wrong...
We're not supposed to require memory allocation for GC to succeed. If we do, it's a bug.
Yes, the CC is the only thing that will kill JS-wrapped nodes.  It also requires a ton of memory.  Olli's special-purpose DOM freer idea could work, too, and wouldn't require as much memory.
It sounds like this overlaps with some discussions we've been having in MemShrink about memory pressure events.
Whiteboard: [MemShrink]
(In reply to Andrew McCreight [:mccr8] from comment #14)
> It sounds like this overlaps with some discussions we've been having in
> MemShrink about memory pressure events.

Yeah, probably some overlap -- the reason I started asking about this is that I'm specifically concerned about handling "real OOMs" without aborting the application -- ideally we would display a comprehensible error message to the user when we start nuking their tabs.
(In reply to Olli Pettay [:smaug] from comment #7)
> However, that works only for DOM trees which aren't kept alive anything
> outside it.
> "Outside" being scripts etc.

Is that the common case? If it is the common case, can we turn those external references into "remembered" sets of references that we can NULL out from the referee-side due to OOM killer activity?
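A toy illustration of such a "remembered set", assuming external references get registered ahead of time, while allocation still works, so that severing them at OOM time needs no heap tracing and no new memory. The names and types are hypothetical, and real code would also have to make the referring side tolerate the nulled slot:

```cpp
#include <vector>

struct Node {};  // stand-in for a DOM node

// Registered slots that point into a killable DOM tree. Registration
// allocates (vector growth), so it must happen before any OOM, not during.
static std::vector<Node**> gRememberedRefs;

void RememberExternalRef(Node** aSlot) {
  gRememberedRefs.push_back(aSlot);
}

// On OOM-kill of the tree, sever every external reference in place.
// clear() only resets the size, so this path performs no allocation.
void SeverRememberedRefs() {
  for (Node** slot : gRememberedRefs) {
    *slot = nullptr;
  }
  gRememberedRefs.clear();
}
```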
> Is that the common case?

For cases where we can drop lots of memory, probably yes.  See my numbers from comment 9.  If we had reporting working for orphaned DOM subtrees (soon, hopefully) I could get you better numbers on those, which must be owned by JS.
This is exactly what I was talking about in the last paragraph of bug 710501 comment 0.
(In reply to Boris Zbarsky (:bz) from comment #17)
> For cases where we can drop lots of memory, probably yes.

So given that it's probably yes, are the pointers that would refer to these DOM trees from the content that we'd be OOM killing (in which case we don't care) or from elsewhere?

I would assume that "elsewhere" would be chrome, in which case we'd have wrappers anyway, right?
We can, in theory, anyway, catch OOM arbitrarily early with something like bug 670967.

Once malloc() is failing, it's very hard to recover...
(In reply to Justin Lebar [:jlebar] from comment #20)
> Once malloc() is failing, it's very hard to recover...

That's the question that I'm bothering people about in this bug -- how hard is it really to do something more sensible than abort? In the same sense that Chrome displays something sensible when a tab group OOMs (due to catching the intentional crash that they cause), we should evaluate whether we can provide a similar experience without process isolation (and ideally not just on Windows).
To be clear, we can extend bug 670967 to more than Windows, if that approach is successful.

But it's really hard to write code which does something sensible without ever calling malloc.  Consider that, when malloc fails, you pretty much have to return to the event loop before you can do anything serious.  Code will break very hard if malloc() can trigger a GC.  (I speak from experience.  :)  So now you have to audit the whole path down from wherever you happen to be into and through the event loop and make sure there are no allocations.

If we can detect low memory before we're entirely out of virtual memory, that seems to make our life a lot easier.
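A toy version of that low-memory detection: fire a pressure signal at a watermark, while malloc() still succeeds, instead of reacting once it fails. The byte counter and the watermark value are illustrative assumptions (the threshold is deliberately tiny here; a real browser would track virtual-memory headroom, as bug 670967 does on Windows):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>

static std::atomic<size_t> gHeapBytes{0};
static const size_t kLowMemoryWatermark = 1024;  // tiny, for illustration only
static bool gPressureSignaled = false;

void* TrackedMalloc(size_t aSize) {
  void* p = malloc(aSize);
  if (p && gHeapBytes.fetch_add(aSize) + aSize > kLowMemoryWatermark) {
    // Stand-in for firing a memory-pressure notification: at this point
    // allocation still works, so handlers are free to allocate.
    gPressureSignaled = true;
  }
  return p;
}
```

The point of the early trigger is exactly the one made above: handlers that run here can still call malloc(), so none of the no-allocation auditing is needed.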
> are the pointers that would refer to these DOM trees from the content that we'd be OOM
> killing (in which case we don't care)

The pointers are probably from the content JS heap.
(In reply to Justin Lebar [:jlebar] from comment #22)
> Consider that, when malloc fails, you pretty much have
> to return to the event loop before you can do anything serious.  Code will
> break very hard if malloc() can trigger a GC.  (I speak from experience.  :)
> So now you have to audit the whole path down from wherever you happen to be
> into and through the event loop and make sure there are no allocations.

Okay, that's a really good point. The whole benefit of infallible malloc is to avoid having to bubble failure back up to the event loop, so anything that we could do in this OOM scenario would have to happen from under (within) the malloc call itself.

Do you have a bug / comment on how code breaks when malloc can trigger GC? After I read about that I'd be happy to close this as INVALID and start reading up on the low-memory approach.
> Do you have a bug / comment on how code breaks when malloc can trigger GC?

It was a while ago, but as I recall, everything went to hell when code called malloc() while holding the GC lock.  One could trivially not trigger a GC in this case, but then do you just give up and crash?
Do you agree this is INVALID/WONTFIX, Chris?
Whiteboard: [MemShrink]
(In reply to Justin Lebar [:jlebar] from comment #26)
> Do you agree this is INVALID/WONTFIX, Chris?

No, if the evidence is just that inserting a GC call under malloc caused some things to fail. Are there systematic rooting issues that could only be solved with a complex static analysis, or is it just having the GC lock held? Like you say, you could abort() in the GC lock held case if it only happens a small percentage of the time and still get most of the gain.
If you want to do an experiment, AvailableMemoryTracker.cpp lets you hook in really deep (when VirtualAlloc is called on Windows).  You could easily hook in there and synchronously fire a memory pressure event, or a GC, or whatever, when rand() <= 0.01 or something.
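A toy version of that experiment: hook a deep allocation path and, with small probability, synchronously simulate memory pressure so the recovery code gets exercised long before a real OOM. The hook and event names here are placeholders, not the actual AvailableMemoryTracker API:

```cpp
#include <cstddef>
#include <cstdlib>

static int gPressureEvents = 0;

// Stand-in for synchronously firing a memory-pressure event (or a GC).
static void SimulateMemoryPressure() {
  ++gPressureEvents;
}

void* HookedAlloc(size_t aSize) {
  // Roughly 1% of calls take the pressure path, so bugs in it show up
  // during ordinary testing instead of only at genuine OOM.
  if (rand() % 100 == 0) {
    SimulateMemoryPressure();
  }
  return malloc(aSize);
}
```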
(In reply to Justin Lebar [:jlebar] from comment #28)

Yeah, I think it's definitely experimentation time. When I get back from holiday I'm going to chat with folks and see what I can hack up.
(In reply to Justin Lebar [:jlebar] from comment #22)
> But it's really hard to write code which does something sensible without
> ever calling malloc.  Consider that, when malloc fails, you pretty much have
> to return to the event loop before you can do anything serious.

How about asking other threads to free *their* memory?  Should be safe regardless of what the OOMing thread is in the middle of doing.
Depending on what "asking other threads to free their memory" means, that could easily lead to deadlocks.  It also has the problem that the main thread owns most of the memory.
Priority: -- → P5
Component: DOM → DOM: Core & HTML
Severity: normal → S3

We do wait for a bit now when we run out of memory to give the OS a chance to kill a process. That's probably more effective than whatever was being discussed here.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → INCOMPLETE