Closed Bug 744515 Opened 12 years ago Closed 2 years ago

Sometimes, Android forgets to give onLowMemory notification

Categories: GeckoView :: General (defect)
Platform: ARM / Android
Severity: normal
Priority: Not set

Status: RESOLVED INVALID
Tracking: blocking-fennec1.0 -

Reporter: bjacob (Unassigned)

This is different from bug 619670: here I'm talking about what looks like a real Android bug.

Test case:
  http://crazybugs.ivank.net/    (see bug 736436)
Actual results:
  OOM crash
Expected results:
  Once bug 736481 lands, we should see Firefox not crashing even if some Android system processes crash and restart. When Android runs out of background processes to kill, it should give us an onLowMemory notification, which we handle (once bug 736481 lands) by losing WebGL contexts; that successfully averts the OOM crash on that bug.

Apparently we're not the only ones to have this problem:

https://www.google.com/#hl=en&q=android+onlowmemory+not+getting+called

It seems that onLowMemory is only called when Android runs out of background processes to kill. That may be too late: by that point we're already pretty much at the top of the list of processes to kill. I hope we can get a notification that we're running out of memory earlier than that.
blocking-fennec1.0: --- → beta+
Who should this be assigned to?
Someone who knows the Android OS well could possibly implement a more reliable onLowMemory() using system metrics (amount of free RAM, etc).
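
For illustration, a minimal sketch of such a check using ActivityManager.getMemoryInfo() (availMem, threshold and lowMemory are real fields; the class name and the factor of 2 are just placeholders, not an actual proposal):

    import android.app.ActivityManager;
    import android.content.Context;

    class MemoryPressureChecker {
        // Hypothetical helper: true when the system says we are close to the
        // point where it starts killing processes.  Note that we have to poll
        // this ourselves; Android will not call us back early.
        static boolean isMemoryLow(Context context) {
            ActivityManager am =
                    (ActivityManager) context.getSystemService(Context.ACTIVITY_SERVICE);
            ActivityManager.MemoryInfo info = new ActivityManager.MemoryInfo();
            am.getMemoryInfo(info);
            // info.threshold is the availMem level at which the system starts
            // killing background processes; info.lowMemory is set once the
            // system already considers itself low on memory.
            return info.lowMemory || info.availMem < 2 * info.threshold;
        }
    }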

Alternatively, it may be worth talking with any contact we might have at Google, to see if there already is a solution to this problem.
This seems like a good starting place to look at:
http://stackoverflow.com/questions/2298208/how-to-discover-memory-usage-of-my-application-in-android

A first step would be to make a patch that periodically dumps that information, and then see what the resulting log looks like when we run into one of those OOM crashes where onLowMemory() didn't fire.
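
Not an actual patch, but the periodic dump could look roughly like this, logging our own PSS from android.os.Debug next to the system-wide MemoryInfo (log tag and interval are arbitrary):

    import android.app.ActivityManager;
    import android.content.Context;
    import android.os.Debug;
    import android.os.Handler;
    import android.util.Log;

    class MemoryLogger {
        private static final String LOGTAG = "GeckoMemLog";  // arbitrary tag
        private static final long INTERVAL_MS = 5000;         // arbitrary period

        static void start(final Context context, final Handler handler) {
            handler.postDelayed(new Runnable() {
                @Override
                public void run() {
                    // Per-process memory counters for this process.
                    Debug.MemoryInfo procInfo = new Debug.MemoryInfo();
                    Debug.getMemoryInfo(procInfo);
                    // System-wide view, including the lowmem threshold.
                    ActivityManager am = (ActivityManager)
                            context.getSystemService(Context.ACTIVITY_SERVICE);
                    ActivityManager.MemoryInfo sysInfo = new ActivityManager.MemoryInfo();
                    am.getMemoryInfo(sysInfo);
                    Log.i(LOGTAG, "pss=" + procInfo.getTotalPss() + "kB"
                            + " availMem=" + sysInfo.availMem
                            + " threshold=" + sysInfo.threshold
                            + " lowMemory=" + sysInfo.lowMemory);
                    handler.postDelayed(this, INTERVAL_MS);
                }
            }, INTERVAL_MS);
        }
    }
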
This would be hard to get "right".  There are two fundamental problems.  But first, a digression into how the low-mem killer works on android.

Processes are grouped into classes according to visibility/importance to the user.  The system server is highest importance, "core servers" next, the foreground app after that, on down to hidden and empty apps at lowest importance.  Each "importance class" has a low-memory threshold that has to be reached before they start being killed off.  (Low memory as measured by free pages in the kernel.)  For example, by default, "hidden" apps start getting machine-gunned when free memory drops below 28MB.  The foreground app is killed when freemem drops below 8MB (by default).  This all happens within a module in the linux kernel; the android userspace configures the kernel module by telling it processes' "importance classes" using the linux oom_adj mechanism.  (Processes within the same class are killed in order of their OOM-badness score.)
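
For reference, on devices of that era the module's thresholds and our own class assignment can be read directly; a sketch, assuming the classic lowmemorykiller sysfs paths, which vary by device/kernel and aren't guaranteed to be readable from an app:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    class LowMemKillerConfig {
        // Free-page thresholds, one per importance class, below which that
        // class starts getting killed.  Path is the classic lowmemorykiller
        // module; not guaranteed to exist on every device.
        static String readMinFreePages() throws IOException {
            return readLine("/sys/module/lowmemorykiller/parameters/minfree");
        }

        // Our current oom_adj, i.e. which importance class the kernel has us in.
        static String readOwnOomAdj() throws IOException {
            return readLine("/proc/self/oom_adj");
        }

        private static String readLine(String path) throws IOException {
            BufferedReader reader = new BufferedReader(new FileReader(path));
            try {
                return reader.readLine();
            } finally {
                reader.close();
            }
        }
    }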

The android userspace only knows that lowmem killing is happening by evidence of linux mowing down processes unexpectedly.  That's why the low-mem notification is not all that reliable/useful: it's fired off by the android userspace in reaction to the kernel murdering userspace processes, not by any early-warning system.  And further, the notification is only sent after all background apps have been killed.  (Corollary: launching several apps and putting them in the background before running the test here should result in more reliable low-mem notifications.)

So what else can we do?  Inside Firefox, we can pretty easily deduce what our importance is.  And the parameters used to configure the linux low-mem killer module are just properties in the propdb, which we can easily read.  So if we could efficiently track the number of free pages in the kernel, we could write our own early warning system.

Here's where the problems arise.  First, it's not possible to efficiently track the number of free pages in the kernel, to my knowledge.  What we really want is a notification when free pages drop below a certain threshold.  The kernel knows this and can track it efficiently.  But way out in userspace, all we can do is poll procfs, inefficiently.
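
Concretely, about the best a userspace poller can do is re-read /proc/meminfo on every check; a crude sketch:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    class FreeMemPoller {
        // Returns MemFree in kB as reported by the kernel, or -1 on failure.
        // Every call re-reads and re-parses procfs, which is exactly the
        // inefficiency described above.
        static long freeKb() {
            try {
                BufferedReader reader = new BufferedReader(new FileReader("/proc/meminfo"));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (line.startsWith("MemFree:")) {
                            // Format: "MemFree:     123456 kB"
                            String[] parts = line.split("\\s+");
                            return Long.parseLong(parts[1]);
                        }
                    }
                } finally {
                    reader.close();
                }
            } catch (IOException e) {
                // fall through
            }
            return -1;
        }
    }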

Second, the lowmem killing is a global property, based on the memory usage of the entire OS.  We can efficiently make a pretty good guess at the memory usage of fennec (except for allocations that go through system malloc, and VRAM allocations, and lies told by overcommit).  But that doesn't help us know when we're about to get lowmem killed, because we don't know what other processes or the kernel are doing.

So my opinion is that we could spend some time building a free-pages polling mechanism into fennec along with importance deduction, but in the end it would be hard to stay ahead of the lowmem killer while not draining battery.

I think the more important question is, is severely degraded functionality really what we want in this case?  Fennec would keep on living, but the game the user wanted to play would turn into a non-interactive black box, with no indication of why it broke.  In this case the runaway memory usage seems to be caused by WebGL resources, but if instead the page had runaway allocation of JS objects, the experience would be even worse; fennec would just get killed because we couldn't drop the JS heap.

I think we could probably build a reasonable system in some timeframe that
 - moderately efficiently polled for low free pages in the kernel, triggered by large-ish allocations in gecko.  That is, poll free pages every K pages allocated by gecko, and hope for the best (a rough sketch follows this list).
 - ruthlessly killed *tabs* that were overallocating.  All resources: webgl, JS heap, gfx, etc.
 - indicated to the user why their tab went away.  Something akin to the chrome frowny-face for a crashed renderer process.
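
A rough sketch of the first bullet, reusing the /proc/meminfo reader sketched above; K and the low-free threshold are placeholders, and notifyMemoryPressure() is a hypothetical hook:

    import java.util.concurrent.atomic.AtomicLong;

    class AllocationTriggeredPoller {
        private static final long K_BYTES = 8 * 1024 * 1024;  // placeholder: re-check every 8MB allocated
        private static final long LOW_FREE_KB = 16 * 1024;    // placeholder threshold (16MB free)

        private final AtomicLong allocatedSinceCheck = new AtomicLong();

        // Called from the allocation path with the size of each large-ish allocation.
        void onAllocation(long bytes) {
            if (allocatedSinceCheck.addAndGet(bytes) < K_BYTES) {
                return;
            }
            allocatedSinceCheck.set(0);
            long freeKb = FreeMemPoller.freeKb();  // the /proc/meminfo sketch above
            if (freeKb >= 0 && freeKb < LOW_FREE_KB) {
                notifyMemoryPressure();
            }
        }

        private void notifyMemoryPressure() {
            // hypothetical: forward a "memory-pressure" notification into Gecko
        }
    }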

But I don't think the timeframe for getting this working well is this beta.
Re-noming, not sure this is worth it for this release as per #4.
blocking-fennec1.0: beta+ → ?
Thanks Chris for the explanation.

Naive question: could we implement something useful by comparing our RSS to the physical RAM size? Say, if our RSS is > 90% of the physical RAM size, or is within 100 MB of the RAM size, send a memory pressure event? (repetition criteria TBD)
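
Something like this, presumably; a sketch of that RSS-vs-physical-RAM heuristic reading /proc/self/status and /proc/meminfo (the 90% factor is just the number from above, not a tuned value):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    class RssWatcher {
        // Rough heuristic: report memory pressure once our RSS exceeds 90% of
        // physical RAM.  Both values are in kB as reported by the kernel.
        static boolean rssNearPhysicalLimit() throws IOException {
            long rssKb = readKbField("/proc/self/status", "VmRSS:");
            long totalKb = readKbField("/proc/meminfo", "MemTotal:");
            return rssKb > 0 && totalKb > 0 && rssKb > 0.9 * totalKb;
        }

        private static long readKbField(String path, String field) throws IOException {
            BufferedReader reader = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.startsWith(field)) {
                        return Long.parseLong(line.split("\\s+")[1]);
                    }
                }
            } finally {
                reader.close();
            }
            return -1;
        }
    }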

WebGL contexts, unlike the JS heap, are inherently discardable. I.e., the graphics driver / OS could lose our GL context at any time anyway, for example if the phone enters a sleep mode where the GPU is powered off. So losing WebGL contexts is a lot milder than losing other resources; in fact, I've been wanting to do this 'lose WebGL contexts on memory pressure' for a while, on all platforms, including desktop.

Also, this allowed us to avoid a crash in one demo (Quake 3 in bug 736481) and still seems likely to fix another crash in bug 736436.
(In reply to Benoit Jacob [:bjacob] from comment #6)
> Thanks Chris for the explanation.
> 
> Naive question: could we implement something useful by comparing our RSS to
> the physical RAM size? Say, if our RSS is > 90% of the physical RAM size, or
> is within 100 M of RAM size, send a memory pressure event? (repetition
> criteria tbd)?
> 

One could implement something like that, but if it's not accurately tracking the lowmem-killing heuristics, then we have to walk a tightrope between
 - breaking functionality unnecessarily, like killing webgl contexts when memory isn't actually low
 - guessing wrong and getting lowmem killed anyway

> In the case of WebGL contexts, contrary to JS heap, they are inherently
> discardable. I.e., the graphics driver / OS could lose our GL context at any
> time anyway, for example if the phone enters a sleep mode where the GPU is
> powered off. So losing WebGL contexts is a lot milder than other resources;
> in fact, I've been wanting to do this 'lose WebGL contexts on memory
> pressure' for a while, on all platforms, including desktop.
> 

If a webgl context is killed because of lowmem, by what heuristics would we let it be recreated?

> Also, this allowed to avoid a crash in one demo (Quake 3 in bug 736481) and
> still seems likely to fix another crash in bug 736436.

The code as-is works pretty much as well as it can based on the notifications that the android OS gives us.  Without a clearer UX plan for what happens on lowmem, it's hard to say whether investing a lot in going above&beyond android is worth the cost.  IMHO.
the bug we actually want to block on is bug 747445, minus'ing this
blocking-fennec1.0: ? → -
I posted this on bug 736481, but it's more relevant here:

> Something else I thought of while looking at our java code: it's possible
> that dalvik will clear SoftReference instances in java before it terminates
> an activity due to memory pressure. See
> http://developer.android.com/reference/java/lang/ref/SoftReference.html - in
> particular, "all SoftReferences pointing to softly reachable objects are
> guaranteed to be cleared before the VM will throw an OutOfMemoryError"
> 
> You can try to use this as a memory-pressure detector by having some dummy
> object in Java pointed to by a SoftReference, and overriding the dummy
> object's finalize() method to trigger the memory-pressure detector. No idea
> if it'll work, but it might be worth a shot.
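
For reference, the canary described in that quote might look like this; a sketch only, and whether dalvik clears the reference early enough to be useful is exactly the open question:

    import java.lang.ref.SoftReference;

    class SoftReferenceCanary {
        // Dummy object whose finalize() runs only after the VM has cleared the
        // SoftReference under memory pressure and then collected the object.
        private static class Canary {
            @Override
            protected void finalize() throws Throwable {
                try {
                    onMemoryPressureSuspected();  // hypothetical hook into our memory-pressure path
                    plant();                      // re-arm for the next low-memory episode
                } finally {
                    super.finalize();
                }
            }
        }

        private static volatile SoftReference<Canary> sCanary;

        static void plant() {
            sCanary = new SoftReference<Canary>(new Canary());
        }

        static void onMemoryPressureSuspected() {
            // hypothetical: dispatch Gecko's "memory-pressure" notification
        }
    }
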
Depends on: 713032
See Also: 713032

Moving all open Core::Widget: Android bugs to GeckoView::General (then the triage owner of GeckoView will decide which ones are valuable and which ones should be closed).

Component: Widget: Android → General
Product: Core → GeckoView

Not relevant to GeckoView

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID