Closed Bug 873284 Opened 7 years ago Closed 6 years ago

Ignore memory-pressure events in the main process and in the child process if the child has FOREGROUND_HIGH priority

Categories

(Firefox OS Graveyard :: General, defect, P3)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:-)

RESOLVED FIXED
1.4 S4 (28mar)
blocking-b2g -

People

(Reporter: justin.lebar+bug, Assigned: gsvelto)

References

Details

(Keywords: perf, Whiteboard: [c=memory p= s=2014.03.28 u=])

Attachments

(1 file, 5 obsolete files)

I noticed us firing memory-pressure events when panning/zooming on nfl.com and loading the communications app.

I didn't catch this one in a profile, but in this case, it's probably better to let a bg process die than it is to fire memory-pressure events, potentially slowing everything down.

I'll write a patch and we can see if it helps.
Note that this means that a FOREGROUND_HIGH process can OOM itself with images.  We just have to be careful and not let this happen.  It shouldn't be a problem for the communications app.
This bug blocks bug 847592. Since bug 847592 is tef+, this one might need to be tef+ as well. Mark this one as tef?.
blocking-b2g: --- → tef?
If you don't mind, I'd like to verify that this is actually helpful before requesting blocking status.  If I get tef+ too early, I'll just encourage managers to ask for my status here more often, and that will slow me down.
tef- as per comment 3, Justin, please renominate when you believe it is convenient
blocking-b2g: tef? → ---
Since I'm fairly familiar with this code I whipped up together a patch for this. Unfortunately I'm having a hard time testing it: I can't seem to be able trigger low-memory notifications anymore, processes get killed because of OOM conditions before the memory pressure watcher triggers.
Attachment #752244 - Flags: feedback?(justin.lebar+bug)
Blocks: 874441
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Comment on attachment 752244 [details] [diff] [review]
[WIP PATCH] Do not send memory-pressure events to the main and child process if the child is runing at high priority

>diff --git a/dom/ipc/ProcessPriorityManager.h b/dom/ipc/ProcessPriorityManager.h
>--- a/dom/ipc/ProcessPriorityManager.h
>+++ b/dom/ipc/ProcessPriorityManager.h

>@@ -58,24 +58,35 @@
>    *
>    * Eventually whatever priority you set here can and probably will be
>    * overwritten by the process priority manager.
>    */
>   static void SetProcessPriority(dom::ContentParent* aContentParent,
>                                  hal::ProcessPriority aPriority);
> 
>   /**
>-   * Returns true iff this process's priority is FOREGROUND*.
>+   * Returns true if this process's priority is FOREGROUND*.

This isn't a typo.  :)

http://en.wikipedia.org/wiki/Iff

>    * Note that because process priorities are set in the main process, it's
>    * possible for this method to return a stale value.  So be careful about
>    * what you use this for.
>    */
>   static bool CurrentProcessIsForeground();
> 
>+  /**
>+   * When called in a child process returns true if the current process has
>+   * FOREGROUND_HIGH priority; when called in the main process returns true
>+   * if at least one child process with FOREGROUND_HIGH priority is present.

I'd prefer to split the notions of "CurrentProcessIsHighPriority" and "a child
process process is high-priority" into two separate methods; I think that will
be a cleaner API.

>diff --git a/dom/ipc/ProcessPriorityManager.cpp b/dom/ipc/ProcessPriorityManager.cpp
>--- a/dom/ipc/ProcessPriorityManager.cpp
>+++ b/dom/ipc/ProcessPriorityManager.cpp

>@@ -496,17 +499,18 @@
>     }
>   }
> }
> 
> bool
> ProcessPriorityManagerImpl::OtherProcessHasHighPriority(
>   ParticularProcessPriorityManager* aParticularManager)
> {
>-  if (mHighPriorityChildIDs.Contains(aParticularManager->ChildID())) {
>+  if ((aParticularManager != nullptr) &&
>+      mHighPriorityChildIDs.Contains(aParticularManager->ChildID())) {

I think it would be clearer if we didn't pass nullptr to this method; instead,
we can just add a new method that returns |mHighPriorityChildIDs.Count() > 0|.

But also, we never compare pointers to nullptr in an if statement; we use
operator! instead.

>diff --git a/widget/gonk/GonkMemoryPressureMonitoring.cpp b/widget/gonk/GonkMemoryPressureMonitoring.cpp
>--- a/widget/gonk/GonkMemoryPressureMonitoring.cpp
>+++ b/widget/gonk/GonkMemoryPressureMonitoring.cpp

>@@ -32,16 +33,21 @@
>   MemoryPressureRunnable(const char *aTopic, const PRUnichar *aData) :
>     mTopic(aTopic), mData(aData)
>   {
>   }
> 
>   NS_IMETHOD Run()
>   {
>     MOZ_ASSERT(NS_IsMainThread());
>+
>+    if (ProcessPriorityManager::CurrentProcessIsHighPriority()) {
>+      return NS_OK;
>+    }

This will need a comment.

The real question is: Does this help?  If not, we shouldn't bother.
Attachment #752244 - Flags: feedback?(justin.lebar+bug) → feedback+
(In reply to Justin Lebar [:jlebar] from comment #6)
> This isn't a typo.  :)
> 
> http://en.wikipedia.org/wiki/Iff

Thanks for the link, I had no idea :)

> I'd prefer to split the notions of "CurrentProcessIsHighPriority" and "a
> child
> process process is high-priority" into two separate methods; I think that
> will
> be a cleaner API.
> [...]
> I think it would be clearer if we didn't pass nullptr to this method;
> instead,
> we can just add a new method that returns |mHighPriorityChildIDs.Count() >
> 0|.

OK, I'll do that.

> The real question is: Does this help?  If not, we shouldn't bother.

I will be doing the tests again but this should have a measurable impact: memory minimization events can take 100's of milliseconds per app so we might save a meaningful amount of time.
Depends on: 876610
Depends on: 877222
Here's an updated patch. I will defer asking for review until I'll have fixed the related issues and proved with testing that this is beneficial when receiving calls.
Attachment #752244 - Attachment is obsolete: true
I've finally had some time to do some serious testing. TL;DR what I found suggests that stopping memory-pressure events from being dispatched does indeed help but we might want to stop the events from being fired entirely.

This is a profile of the Communications app receiving a call in a low-memory situation:

http://people.mozilla.com/~bgirard/cleopatra/#report=63f1cc5e3883136a6e6879d28d56ab4b48a423b9

The low-memory notification is kicking in and causing 730 samples to be wasted during start-up (more on the actual delay later). The b2g process profile shows the same behavior but with even longer pauses (look towards the end of the trace):

http://people.mozilla.com/~bgirard/cleopatra/#report=fa95368a54b86d6b8609774ba068896e04e883f9

What only shows up tangentially in the profile is that the low-memory notification itself is causing a significant increase in CPU load as background apps (homescreen, usage, ...) wake up and start collecting. If you look at the communications profile the average sample time is 45ms (vs the expected 10ms) due to this; overall the process of responding to the call is greatly hampered.

In the light of this behavior I would be keen on changing my patch to turn off low-memory notifications entirely when a FOREGROUND_HIGH process is present. This would remove the additional CPU load and increase the chance of bg apps getting killed which under these conditions is IMHO a desired side-effect as it helps getting things out of the way faster.
Here's an alternative patch that stops low-memory notifications altogether when a high priority process is present. Ironically enough this matches the behavior I was seeing on the Unagi while testing bug 812059.

I did a few tests on the Otoro and it does seem to help: having low-memory pressure notifications being delivered almost always causes the incoming call to be dropped altogether and the dialer does not show up at all. Reproducing the same situation with this patch applied often helps: the dialer tends to show up more reliably albeit with the typical delay seen when under heavy load.
Attachment #760414 - Flags: feedback?(justin.lebar+bug)
> If you look at the communications profile the average sample time is 45ms (vs the 
> expected 10ms) due to this; overall the process of responding to the call is greatly 
> hampered.

Why do we think that the communications process's 45ms sample interval is due to bg processes waking up and collecting?  What's the difference in the interval when we allow bg processes to collect and when we don't?  (That is, what's the difference in this value with the two patches in this bug?)

I ask because if bg processes' gc'ing slows us down significantly as is claimed here, then either something is wrong with our process prioritization or we need to revisit how we think about CPU priorities.  If the claim here is correct, maybe we should never run low-mem notifications in bg processes; we already ran this code when they went into the bg, so there likely isn't much to collect.

A perf record -a trace would settle this once and for all.
Attachment #760414 - Flags: feedback?(justin.lebar+bug)
(In reply to Justin Lebar [:jlebar] from comment #11)
> Why do we think that the communications process's 45ms sample interval is
> due to bg processes waking up and collecting?  What's the difference in the
> interval when we allow bg processes to collect and when we don't?  (That is,
> what's the difference in this value with the two patches in this bug?)

I'll try it today, it takes some time though because reproducing these conditions usually takes me a couple of tries.

> I ask because if bg processes' gc'ing slows us down significantly as is
> claimed here, then either something is wrong with our process prioritization
> or we need to revisit how we think about CPU priorities.  If the claim here
> is correct, maybe we should never run low-mem notifications in bg processes;
> we already ran this code when they went into the bg, so there likely isn't
> much to collect.

Yeah, that's something I was also thinking about especially if we want to go ahead and freeze bg applications (almost) completely as per bug 872177.

> A perf record -a trace would settle this once and for all.

I'll do that; I haven't done yet because of the time required to get the necessary infrastructure in place but it's about time now :-)
Slow progress on this one last week because I've been busy with RIL-related stuff. I haven't got the perf build done yet but I'm getting there. In the meantime I did a couple of extra tests and I came up with two theories regarding the very slow behavior experienced under low-memory conditions (and little apparent CPU activity). perf should prove or dismiss them with certainty once I get it working:

- Under low-memory conditions it is likely that the dialer process is not started from the pre-launched process but instead is being cold-launched. If that's the case then it's likely that it will run with background priority until it manages to catch the CPU wake lock which essentially means that the entire startup will be slowed down a lot. I could be wrong on this one but it's worth verifying it. Anyway if this is really a problem fixing it might be tricky.

- The second thing I want to inspect is related to a surprising amount of CPU time which often goes in the 'mmcqd' and 'kswapd' kernel threads when under low memory conditions. When running out of memory the kernel will flush its FS caches and my theory is that it's caching a fairly significant amount of writes which will be flushed out at what will be the worst possible time for us. This won't consume CPU time but might significantly slow down any other process fighting over I/O (the b2g process comes to mind).

Anyway perf should shed some light on both theories as soon as I get it running :)
> - Under low-memory conditions it is likely that the dialer process is not started from 
> the pre-launched process but instead is being cold-launched.

Quite possible.

> If that's the case then it's likely that it will run with background priority until it 
> manages to catch the CPU wake lock

I don't think this is the case.

The dialer app should get high CPU priority immediately after being forked, if it's cold-started.  If it's started from the preallocated process, upping its CPU priority is also one of the first things we do.

This happens because the dialer app is a high-priority frame, and we take the CPU wake lock on behalf of that process immediately after forking it / converting the preallocated process into it.

> When running out of memory the kernel will flush its FS caches and my theory is that it's 
> caching a fairly significant amount of writes which will be flushed out at what will be 
> the worst possible time for us. 

Could be!
(In reply to Justin Lebar [:jlebar] from comment #14)
> The dialer app should get high CPU priority immediately after being forked,
> if it's cold-started. [...]
> This happens because the dialer app is a high-priority frame, and we take
> the CPU wake lock on behalf of that process immediately after forking it /
> converting the preallocated process into it.

This is exactly the piece of information I was missing, thanks :)

> Could be!

Time to test it!
Blocks: 812059
I finally had some more time to come back to this bug and re-test this scenario which I deemed important after the fix for bug 887192 landed. Before doing a perf-enabled build I just tested the current master branch and found that it behaves surprisingly well now.

In particular I don't see the behavior I've described in comment 10; pretty much every call comes in a 3-4s range (measured as I had described in bug 861441 comment 15) and I didn't get any dropped calls out of 10 tries.

While testing I've double-checked that when the call hit memory-pressure notifications were triggered and apparently in this scenario the dialer doesn't seem to to be slowed down much.

Additionally I noticed a couple of things happening consistently that might be the effect of fixing bug 887192: the homescreen was always killed when receiving the incoming call (probably to make room for the dialer) and if the membuster app had already allocated too much memory it was also killed with the dialer showing up right after it.

I couldn't test the process buster stress scenario because the button from the page at http://bit.ly/membuster doesn't seem to work anymore (at least on master). It pops up a slider that asks what program to use instead of opening multiple browser windows and even clicking on the browser app doesn't seem to yield any effect.
Gabriele, any update on this issue?

FxOS Perf Triage
Flags: needinfo?(gsvelto)
Keywords: perf
Whiteboard: [c=memory p= s= u=] [MemShrink]
(In reply to Mike Lee [:mlee] from comment #17)
> Gabriele, any update on this issue?

I haven't tested anymore after comment 16. It seems to me that after bug 887192 the issue we had with delayed incoming calls is gone. I would still be keen on landing attachment 760414 [details] [diff] [review] because the rationale behind it still makes sense (if we have a high-priority process we prefer killing background apps rather than putting additional strain on the system with garbage collections). Who could help reviewing this now that Justin is gone?
Flags: needinfo?(gsvelto)
Whiteboard: [c=memory p= s= u=] [MemShrink] → [c=memory p= s= u=]
Gabriele,
Will you please set the review? flags on your patches?  Kyle and/or Olli will review them.
Flags: needinfo?(gsvelto)
(In reply to Dave Huseby [:huseby] from comment #20)
> Will you please set the review? flags on your patches?  Kyle and/or Olli
> will review them.

The patch I want to finish (attachment 760414 [details] [diff] [review]) has bit-rotted and besides cleaning it up I would like to write some unit-tests for it too before asking for review. Unfortunately this will have to wait another week at least as I've got a handful of koi+ bugs to finish and there's only two weeks left to land those.
Flags: needinfo?(gsvelto)
I finally managed to free enough time to get back at this and un-bitrot the patch. I would like to add unit-tests for it but I'm not familiar with the other process priority manager tests yet so it might take a while (I'm also unsure if it's easy/possible to simulate low-memory situations).
Attachment #755448 - Attachment is obsolete: true
Attachment #760414 - Attachment is obsolete: true
Attachment #8339052 - Flags: review?(khuey)
Priority: -- → P3
Un-bitrotted patch, not asking for review just yet as I'm re-testing this.
Attachment #8339052 - Attachment is obsolete: true
Attachment #8339052 - Flags: review?(khuey)
Refreshed again after bug 970895 landed; I've tested this under the most common scenario for which this change is relevant (receiving an incoming call while memory pressure is building up) and it seems to work well.
Attachment #8377191 - Attachment is obsolete: true
Attachment #8377540 - Flags: review?(khuey)
Comment on attachment 8377540 [details] [diff] [review]
[PATCH v4] When a content process is running at high priority do not send memory-pressure events

And bug 973824 just got backed out, reverting to the previous version of this patch.
Attachment #8377540 - Attachment is obsolete: true
Attachment #8377540 - Flags: review?(khuey)
Attachment #8377191 - Attachment is obsolete: false
Attachment #8377191 - Flags: review?(khuey)
blocking-b2g: --- → 1.3T?
per https://bugzilla.mozilla.org/show_bug.cgi?id=973596#c39, this bug does not seem to help tarako scenarios, minus
blocking-b2g: 1.3T? → -
Comment on attachment 8377191 [details] [diff] [review]
[PATCH v3] When a content process is running at high priority do not send memory-pressure events

Review of attachment 8377191 [details] [diff] [review]:
-----------------------------------------------------------------

It took me a while to convince myself that this code is correct. r=me

Reasoning about this stuff is hard :/
Attachment #8377191 - Flags: review?(khuey) → review+
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #27)
> It took me a while to convince myself that this code is correct. r=me

Thanks for the review! The try run is here: https://tbpl.mozilla.org/?tree=Try&rev=ff58bfe1b584

> Reasoning about this stuff is hard :/

Yeah, the process priority manager may have become more convoluted than it needs to be though changing it is quite risky so I don't think anybody would consider refactoring it in the near future.
Try was green, time for check-in.
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/2c636b3c79c1
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → 1.4 S4 (28mar)
OS: Linux → Gonk (Firefox OS)
Hardware: x86_64 → ARM
Whiteboard: [c=memory p= s= u=] → [c=memory p= s=2014.03.28 u=]
You need to log in before you can comment on or make changes to this bug.