Closed Bug 834059 Opened 7 years ago Closed 7 years ago

LMK incorrectly selecting apps that are in the process of starting

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set

Tracking

(blocking-b2g:tef+, firefox19 wontfix, firefox20 wontfix, firefox21 fixed, b2g18 fixed, b2g18-v1.0.0 fixed, b2g18-v1.0.1 fixed)

VERIFIED FIXED
B2G C4 (2jan on)
blocking-b2g tef+
Tracking Status
firefox19 --- wontfix
firefox20 --- wontfix
firefox21 --- fixed
b2g18 --- fixed
b2g18-v1.0.0 --- fixed
b2g18-v1.0.1 --- fixed

People

(Reporter: m1, Assigned: justin.lebar+bug)

References

Details

(Whiteboard: [cr 442171])

Attachments

(1 file)

Looks like apps that are being launched in a fully loaded system may be selected by the LMK before they make it into the foreground.

Some use cases we've seen this occur in:
1) Incoming call notification never arrives (communications app killed).
2) Random apps die unexpectedly as monkey try to start them.

One way to reproduce:
1) Put every app in background (other than communications)
2) Make a call to the device
Every so often should see the communication app getting killed by the LMK.

Logging from an independent monkey run attempted to launch the Calendar application showed that the preallocated process remained at OOM_ADJ=6 as it was transformed into the Calendar application, at which point the system ran out of memory and:
  <4>[16239.978646] send sigkill to 28393 (Calendar), adj 6, size 4550
jlebar, do you recall where we flip prelaunch into foreground?
Flags: needinfo?(justin.lebar+bug)
(In reply to Chris Jones [:cjones] [:warhammer] from comment #1)
> jlebar, do you recall where we flip prelaunch into foreground?

Looks like it's only when a top-level window becomes active, which happens when the parent tells the BEC that it's active, which is quite late in the cycle.

I can take this if you'd like.
Flags: needinfo?(justin.lebar+bug)
Sure, sounds good.

We need to ensure that the service-y things like cost control and calendar don't regress and stick in the foreground again.
> We need to ensure that the service-y things like cost control and calendar don't regress 
> and stick in the foreground again.

Can I just give it a 5s timeout and stick it in the background if it's not foreground by then?
I'm not 100% sure how foreground/background state is managed for those currently.  For the purposes of this bug, I only meant we need to verify they're not relying on the current ~buggy TabChild behavior.  (They shouldn't be.)
Assignee: nobody → justin.lebar+bug
blocking-b2g: tef? → tef+
I found a reliable way to reproduce this.  Writing a test.
This should most likely have a righteous impact on startup time too.  Will see if I can get an estimate.
Alas, nothing significant by the stopwatch.  We must be getting lucky and setting ourselves to foreground before doing significant work (or not competing for CPU commonly).
To ensure this bug is updated for those playing along at home, Justin told me that he's working on it but progress has been hampered by build errors.  It's highly unlikely he'll have a fix today but he's still working hard on it.
Issues building aside, this has turned out to be much more complicated than I expected, largely because we're trying to do stuff with the ProcessPriorityManager before XPCOM has finished starting up.  Which means, for example, we can't read prefs...
Attached patch Patch, v1Splinter Review
This passes the acceptance test for me.  Sorry it took me so long to spin this
patch.

A few notes:

* If it's OK with you, I may go back through and clean up the logging.  I
  skipped on this because I wanted to get the patch up tonight.

* I basically rewrote ProcessPriorityManager; it might be easier to apply this
  patch and just look at the resultant file.

* One downside to the way this is structured is that all processes get five
  seconds of foreground time when they "start" (either when the preallocated
  process becomes an app, or when a new process is started due to the
  preallocated process not being there).

  So for example, if you load a tab in the background, it has 5s of foreground
  time, during which it might OOM the foreground tab.  Or if you load an app
  and then hit the home button, the app stays in the fg for 5s.

  We could change this for browser tabs specifically.  It's hard to see how to
  change it overall, though.

  I originally tried saying that if the app comes into the fg "on its own" and
  then goes back into the background, then we give it bg priority.  That seemed
  to be causing problems with the phone app, because it seemed to be "going
  back into the bg" before it showed any UI.  As soon as it went back into the
  bg, we killed it.  But I may have made an error in my testing, and I don't
  have time tonight to go back and try it.

* I think we should give the phone app a priority in-between regular fg and the
  master process.  That's the only way I'll be confident we won't be able to
  run into problems here.

  For example, the phone app loads merely with foreground priority.  That means
  it's competing with the existing fg app, which could OOM it.  The current
  setup does not protect against this.

* I have no idea if this fixes bug 834466, but I kind of doubt it.
Attachment #706702 - Flags: review?(jones.chris.g)
Comment on attachment 706702 [details] [diff] [review]
Patch, v1


>diff --git a/dom/ipc/ContentChild.cpp b/dom/ipc/ContentChild.cpp

> PBrowserChild*
> ContentChild::AllocPBrowser(const IPCTabContext& aContext,
>                             const uint32_t& aChromeFlags)
> {

>+        // If we are the preallocated process transforming into an app process,
>+        // we'll have background priority at this point.  Give ourselves a
>+        // priority boost for a few seconds, so we don't get killed while we're
>+        // loading our first TabChild.
>+        dom::ipc::TemporarilySetProcessPriorityToForeground();

Nit: |using mozilla::dom::ipc| plz.

>diff --git a/dom/ipc/ContentParent.cpp b/dom/ipc/ContentParent.cpp

>@@ -865,34 +866,44 @@ ContentParent::ContentParent(const nsAString& aAppManifestURL,

>+    // Set the subprocess's priority (bg if we're a preallocated process, fg
>+    // otherwise).  We do this first because we're likely /lowering/ its CPU and
>+    // memory priority, which it has inherited from this process.
>+    if (Preferences::GetBool("dom.ipc.processPriorityManager.enabled")) {
>+        ProcessPriority priority;
>+        if (aAppManifestURL == MAGIC_PREALLOCATED_APP_MANIFEST_URL) {
>+            priority = PROCESS_PRIORITY_BACKGROUND;
>+        } else {
>+            priority = PROCESS_PRIORITY_FOREGROUND;
>+        }

I prefer the ?: operator, but up to you.

>diff --git a/dom/ipc/ProcessPriorityManager.cpp b/dom/ipc/ProcessPriorityManager.cpp

> // Enable logging by setting
> //
> //   NSPR_LOG_MODULES=ProcessPriorityManager:5
> //
> // in your environment.
> 
>-#ifdef PR_LOGGING
>+#ifdef ANDROID
>+#include <android/log.h>
>+#define LOG(fmt, ...) \
>+  __android_log_print(ANDROID_LOG_INFO, \
>+      "Gecko:ProcessPriorityManager", \
>+      fmt, ## __VA_ARGS__)

Will want this off for land.

This sort of contradicts the comment above.  Was nspr logging
insufficient for your uses or just inconvenient?

> void
> ProcessPriorityManager::Init()

Add a comment somewhere around here noting that we rely on our process
launcher to initialize the priority appropriately.

Overall looks OK and makes sense.  But what happened to the audio
channel handling?  Is that subsumed by something else or did it get
lost in the refactoring shuffle?
Attachment #706702 - Flags: review?(jones.chris.g)
> Will want this off for land.

> This sort of contradicts the comment above.  Was nspr logging
> insufficient for your uses or just inconvenient?

NSPR is quite inconvenient on the phone.  I don't know if it even goes to logcat if you turn it on, but the easiest way I know of to turn it on is to push a new b2g.sh, and that's more inconvenient than I'd like.

Our logging story on B2G is very, very, bad.  Most people use only __android_log_print, and everyone rolls their own macros (some better than others).  Just be thankful I didn't do it like nsIdleService:

http://mxr.mozilla.org/mozilla-central/source/widget/xpwidgets/nsIdleService.cpp#246  :)

Anyway, I'll turn it off if you don't want it on.  (Personally, I think having some indication of when processes get set to which priorities in the logcat is quite useful for understanding various behaviors, so I intended to leave it on.)

> But what happened to the audio channel handling?

We now call ResetPriority() when the audio channels change.  We do a bit of extra work, because we iterate over all top-level windows and then iterate over the audio channels, when we don't in fact need to iterate over the windows.  But I expect there will be very few top-level windows in a process, and it's much simpler this way:

> +  } else if (!strcmp(aTopic, "inner-window-destroyed") ||
> +             !strcmp(aTopic, "audio-channel-agent-changed")) {
> +    ResetPriority();
(In reply to Justin Lebar [:jlebar] from comment #14)
> > Will want this off for land.
> 
> > This sort of contradicts the comment above.  Was nspr logging
> > insufficient for your uses or just inconvenient?
> 
> NSPR is quite inconvenient on the phone.  I don't know if it even goes to
> logcat if you turn it on

It does.

$ adb shell
# stop b2g
# NSPR_LOG_MODULES="Foo:5" /system/bin/b2g.sh

will get it going.

> Our logging story on B2G is very, very, bad.

s/B2G/Gecko/, my good man! ;)

> Anyway, I'll turn it off if you don't want it on.  (Personally, I think
> having some indication of when processes get set to which priorities in the
> logcat is quite useful for understanding various behaviors, so I intended to
> leave it on.)

I'd love to have it on but always-on "info"-level logging bites in the butt pretty often.  I guess it makes sense to leave this on for a while in case we see new issues with CAF testing.

> > But what happened to the audio channel handling?
> 
> We now call ResetPriority() when the audio channels change.

Whups, I overlooked that!  Carry on.
Thanks guys!  Can this be landed on b2g18 directly in case we don't get an automated uplift before Sunday noon PST
For sure, but I'd like this to cycle on inbound before I land on b2g18, so let's give it a few hours.
https://hg.mozilla.org/mozilla-central/rev/42ba5b4cf32a
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → B2G C4 (2jan on)
Depends on: 835563
> # NSPR_LOG_MODULES="Foo:5" /system/bin/b2g.sh

Except we don't define PR_LOGGING in our B2G release builds, so we're right back where we started.  If I have to modify the code in order to get logging, I'd rather not have to then go into adb shell and reboot gecko every time I reboot the phone.

I guess next time I'll leave the logcat stuff in and ifdef it out.  :(
...actually, maybe that's wrong; I was building with a different srcdir!
Yes, that's general suckage with NSPR logging.  I don't like it either.
(In reply to Justin Lebar [:jlebar] from comment #23)
> > # NSPR_LOG_MODULES="Foo:5" /system/bin/b2g.sh
> 
> Except we don't define PR_LOGGING in our B2G release builds, so we're right
> back where we started.  If I have to modify the code in order to get
> logging, I'd rather not have to then go into adb shell and reboot gecko
> every time I reboot the phone.

You could edit b2g.sh on the phone so then you don't need to restart gecko.
Alternatively, you can edit gonk-misc/b2g.sh so that it's right from the time you flash.

Here are the steps required for changing b2g.sh on the phone:
https://developer.mozilla.org/en-US/docs/Mozilla/Firefox_OS/Debugging/Customizing_the_b2g.sh_script
Depends on: 835825
Depends on: 835795
Depends on: 835829
No longer depends on: 835795
No longer depends on: 835563
No longer depends on: 835829
No need for TC in Moztrap for this issue.
Flags: in-moztrap-
Michael Vines if you can Verify this bug out, it would be much appreciated.
Flags: needinfo?(mvines)
Keywords: verifyme
Status: RESOLVED → VERIFIED
Flags: needinfo?(mvines)
Per comment 29,I clear "Verifyme".
You need to log in before you can comment on or make changes to this bug.