Last Comment Bug 893172 - ContentParent doesn't run ShutDownProcess when its process is SIGKIL'ed, leading to leaks in B2G
: ContentParent doesn't run ShutDownProcess when its process is SIGKIL'ed, lead...
Status: RESOLVED FIXED
[MemShrink] [LeoVB+]
:
Product: Core
Classification: Components
Component: IPC (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla25
Assigned To: Justin Lebar (not reading bugmail)
:
Mentors:
: 870588 (view as bug list)
Depends on:
Blocks: 880940 886217 889261 893012 894191
  Show dependency treegraph
 
Reported: 2013-07-12 13:33 PDT by Justin Lebar (not reading bugmail)
Modified: 2013-08-18 03:30 PDT (History)
15 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
leo+
wontfix
wontfix
fixed
fixed
wontfix
wontfix
fixed


Attachments
Patch, v1 (5.11 KB, patch)
2013-07-12 14:23 PDT, Justin Lebar (not reading bugmail)
bent.mozilla: review+
Details | Diff | Review
test result with Attachment 774909 (18.15 KB, image/png)
2013-07-14 21:22 PDT, jeremy.kim.leo [:jeremykimleo]
no flags Details
the dmd report of devices1 (1.78 MB, application/x-compressed-tar)
2013-07-14 21:28 PDT, jeremy.kim.leo [:jeremykimleo]
no flags Details
the dmd report of devices2 (2.54 MB, application/x-compressed-tar)
2013-07-14 21:28 PDT, jeremy.kim.leo [:jeremykimleo]
no flags Details
Patch, v2 (for checkin-needed) (5.28 KB, patch)
2013-07-14 22:27 PDT, Justin Lebar (not reading bugmail)
no flags Details | Diff | Review
top 5 heap-unclassified (4.08 KB, text/plain)
2013-07-15 09:15 PDT, Andrew McCreight [:mccr8]
no flags Details
Call ShutDownProcess() when appropriate. (1023 bytes, patch)
2013-07-15 18:16 PDT, Jinho Hwang [:jeffhwang]
no flags Details | Diff | Review

Description Justin Lebar (not reading bugmail) 2013-07-12 13:33:15 PDT
As a result, ContentParent leaves its channel open.  As a result of that, we allow messages to pile up in this channel arbitrarily.  This causes leaks which manifest in many ways; see the dep list.
Comment 1 Justin Lebar (not reading bugmail) 2013-07-12 14:23:54 PDT
Created attachment 774909 [details] [diff] [review]
Patch, v1

If we don't do this, a ContentParent's channel stays open after its
process is SIGKIL'ed.  This causes us to continue to send it broadcast
messages.  These messages build up arbitrarily and manifest as leaks
under a variety of endurance-test scenarios.
Comment 2 Justin Lebar (not reading bugmail) 2013-07-12 14:25:04 PDT
Comment on attachment 774909 [details] [diff] [review]
Patch, v1

Benjamin, are you willing and able to review this?

This is a critical bug for Leo; we need to get it in asap.
Comment 3 Justin Lebar (not reading bugmail) 2013-07-12 14:55:48 PDT
I tried regular Close() before I tried CloseWithError(), and it seems to work OK, but not quite as well.  In particular, I get a bunch of assertions while tearing down the IPDL objects under ContentParent, which I don't get with CloseWithError().
Comment 4 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-07-12 16:04:49 PDT
Comment on attachment 774909 [details] [diff] [review]
Patch, v1

Review of attachment 774909 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for tracking this down! I think this is the right thing to do with one caveat below (and one silly name thing):

::: dom/ipc/ContentParent.cpp
@@ +729,5 @@
>      return SendSetProcessPrivileges(aPrivs);
>  }
>  
>  void
> +ContentParent::ShutDownProcess(bool aForce)

Nit: "force" is the wrong thing here. We're not forcing anything. How about just "aFromActorDestroy"?

@@ +751,1 @@
>      mIsDestroyed = true;

Move this before calling Close() to ensure we don't reenter.
Comment 5 Justin Lebar (not reading bugmail) 2013-07-12 16:52:52 PDT
Comment on attachment 774909 [details] [diff] [review]
Patch, v1

Jeremy, I'd like Leo to test this before we check it in.  Is that something you're able to do?

This will hopefully erase the pickle problem.  We do have one known remaining memory leak, bug 893012, but the leak should be much smaller.
Comment 6 jeremy.kim.leo [:jeremykimleo] 2013-07-12 17:40:03 PDT
ok. i will notify result, ASAP 
thanks~
Comment 7 jeremy.kim.leo [:jeremykimleo] 2013-07-14 21:22:50 PDT
Created attachment 775458 [details]
test result with Attachment 774909 [details] [diff]
Comment 8 jeremy.kim.leo [:jeremykimleo] 2013-07-14 21:28:24 PDT
Created attachment 775459 [details]
the dmd report of devices1
Comment 9 jeremy.kim.leo [:jeremykimleo] 2013-07-14 21:28:53 PDT
Created attachment 775460 [details]
the dmd report of devices2
Comment 10 jeremy.kim.leo [:jeremykimleo] 2013-07-14 22:02:55 PDT
all devices is tested with same way Bug 889261#C34

"device 1" & "device 2" is applied Attachment 774909 [details] [diff]. 
"nothing" isn't applied anythings.

the result is good better than "nothing".
we checked and logged the count of output_queue_.size(). b2g has some channel with unhandling ipc message.
I/Channel (  145): channel id : 0x4fdae000  pipe : 269 queue size : 30
I/Channel (  145): channel id : 0x5064c000  pipe : 145 queue size : 54
I/Channel (  145): channel id : 0x5106c000  pipe : 259 queue size : 70
I/Channel (  145): channel id : 0x50646000  pipe : 123 queue size : 73

this is patch to log output_queue_.size()

diff --git a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
index 6f3d43a..2a1e548 100644
--- a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
+++ b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ -31,6 +31,10 @@
 #include "chrome/common/file_descriptor_set_posix.h"
 #include "chrome/common/ipc_logging.h"
 #include "chrome/common/ipc_message_utils.h"
+#include "android/log.h"
+
+#define LOGA(args...)                                            \
+    __android_log_print(ANDROID_LOG_INFO, "Channel" , ## args)
 
 namespace IPC {
 
@@ -720,6 +724,9 @@ bool Channel::ChannelImpl::Send(Message* message) {
     }
   }
 
+  if(output_queue_.size()) {
+    LOGA("channel id : %p  pipe : %d queue size : %d", this, pipe_, output_queue_.size());
+  }
   return true;
 }
Comment 11 Justin Lebar (not reading bugmail) 2013-07-14 22:19:27 PDT
Thanks a lot for testing, Jeremy.

we should definitely check this in; based on comment 7, it's a big improvement over where we were.

It's not clear to me from comment 10 that we're actually leaking these messages.  It's possible that every time you read the channel size, there are some messages in the queue, but those messages eventually get sent.

It's also not clear that, if we are in fact leaking messages, those are the main cause of the gradual memory increase in the graph.  Indeed, looking at the DMD report in device 2, I see only ~130kb of pickle data:

> Unreported: ~34 blocks from ~5 stack trace records in stack frame record 512 of 3,495
>  ~139,162 bytes (~139,162 requested / ~0 slop)
>  0.22% of the heap;  0.00% of unreported
>  PC is
>    Pickle::Resize(unsigned int) /home001/jeremy.kim/workspace/d300/b2g/gecko/ipc/chromium/src/base/pickle.cc:531 (0x418e737e libxul.so+0xac937e)

That might still be a memory leak, but it doesn't look like that's what's causing the large gradual increase in memory usage from comment 7.
Comment 12 Justin Lebar (not reading bugmail) 2013-07-14 22:21:00 PDT
In bug 893242 I'm adding a memory reporter that should tell us how many outstanding messages each IPC channel has; that patch should do something similar to the code in comment 10.
Comment 13 Justin Lebar (not reading bugmail) 2013-07-14 22:27:53 PDT
Created attachment 775465 [details] [diff] [review]
Patch, v2 (for checkin-needed)

Trees are closed now.  I can check this in tomorrow, but feel free to check in if you get to this first.
Comment 14 Jinho Hwang [:jeffhwang] 2013-07-14 23:09:32 PDT
(In reply to Justin Lebar [:jlebar] from comment #13)
> Created attachment 775465 [details] [diff] [review]
> Patch, v2 (for checkin-needed)

Hi Justin, Since LMK runs to kill an application, the child process of b2g doesn't have a chance to send back __delete__ message to TabParent, see http://dxr.mozilla.org/mozilla-central/source/dom/ipc/TabChild.cpp?from=RecvDestroy#l2099.

That's why the Patch, V1 doesn't help to solve memory leak problem in b2g.
I think the Patch V2 will work fine, because it calls ShutDownProcess in MakeAsDead(). 
I only have one question for Patch V2, do you think it is okay we don't change the mNumDestroyingTabs, count of tabs, before call the ShutDownProcess()? 

ContentParent check the value at http://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#l975 and if it is not same with numLiveTabs, it just stop to processing NOtifyTabDestorying().
Comment 15 Justin Lebar (not reading bugmail) 2013-07-14 23:27:52 PDT
I checked the interdiff between the v1 and v2 patches in this bug, and they do not seem to be different in the way I understand you are describing.  Indeed, patches v1 and v2 are meant to be identical in behavior.

I understood comment 7 to say that patch v1 fixed a large portion of the leak you guys had observed.  Is that not correct?  Do you expect v2 to behave differently?

Maybe you mean something different by "patch v1" and "patch v2" than I understand here?

$ interdiff <(curl -L https://bug893172.bugzilla.mozilla.org/attachment.cgi?id=774909) <(curl -L https://bug893172.bugzilla.mozilla.org/attachment.cgi?id=775465)

> do you think it is okay we don't change the mNumDestroyingTabs, count of tabs, before 
> call the ShutDownProcess()? 

I don't think this matters.

The only reason we care about what happens there is to make sure that we call ContentParent::KillHard when appropriate.

In the case when the child process was SIGKIL'ed, we obviously don't need to run ContentParent::KillHard.

In the case when the child process was not SIGKIL'ed, our behavior should not be changed by either patch.

If you think this matters, could you be more explicit about what you think could go wrong?
Comment 16 Jinho Hwang [:jeffhwang] 2013-07-15 02:25:52 PDT
(In reply to Justin Lebar [:jlebar] from comment #15)
Sorry for my misunderstanding.

> I checked the interdiff between the v1 and v2 patches in this bug, and they
> do not seem to be different in the way I understand you are describing. 
> Indeed, patches v1 and v2 are meant to be identical in behavior.

Sorry for my misunderstanding on Patch V2. Yes, you are right, they don't seem to be different. 

> I understood comment 7 to say that patch v1 fixed a large portion of the
> leak you guys had observed.  Is that not correct?  
  I guess It helps to fix the memory leak in b2g process in this case.

> Do you expect v2 to behave differently? 
  No, I don't. I misunderstood ..

> If you think this matters, could you be more explicit about what you think
> could go wrong?

Okay , let me clear all of these things so you & I both will have better understanding on this problem.

What I thought is ..
As I know so far, When LMK kills one of child processes, ContentParent doesn't run the ShutDownProcess() because the child killed by LMK doesn't have any chance to send __delete__ massage to TabParent which trigger the NotifyTabDestoryed to run ShutDownProcess(). 

Hence, we needed to call ShutDownProcess() when LMK kills an application. 
That's in Patch V1 and V2 in ActorDestroy(). 
It's good to have chance to clean up. 

What I'm thinking is that ...
 when LMK Kills an application, we get "ChannelError" and it triggers DestorySubtree() to call ActorDestroy() and Patch V1,2 is working on this.
 is it correct?
 if then, when we don't get any OnChannelError from Link, we have no chance to run ActorDestory() so what I'd like to suggest is to Call ShutDownProcess in NotifyTabDestroying() with a timer. So, If we don't get the NotifyTabDestoryed() in a given time, we can clean up the ContentParent as well as the channel.

Here is what the patch looks like...
I'm just imaging this.. didn't tried.

void ContentParent::NotifyTabDestroying(PBrowserParent* aTab) {
....
	// recycled during its shutdown procedure.
	MarkAsDead();

+    MessageLoop::current()->PostDelayedTask(
+            FROM_HERE,
+            NewRunnableMethod(this, &ContentParent::ShutDownProcess),
+            TIMEOUTSECS * 1000);

	int32_t timeoutSecs = Preferences::GetInt(
			"dom.ipc.tabs.shutdownTimeoutSecs", 5);

As you know I don't have much experience on B2G.
So, I'd like to have your feedback.
Comment 17 Ryan VanderMeulen [:RyanVM] 2013-07-15 06:00:19 PDT
https://hg.mozilla.org/projects/birch/rev/fb0206edf864
Comment 18 Andrew McCreight [:mccr8] 2013-07-15 09:15:34 PDT
Created attachment 775691 [details]
top 5 heap-unclassified

Here's the top 5 heap-unclassified.

1, 2 and 4 look like they are associated with ContentParent, which is consistent with the theory that we're still leaking some kind of ContentParent-related stuff.  In total, those three account for about 9.5% of the heap.

Number 3 is 1.3mb of DNS-related stuff, which might be worth looking into as well.  Number 5 is 667k in a single block, for some kind of OpenGL thing, which is probably a legitimate buffer.  It might be nice to hook that up to a memory reporter.
Comment 19 Justin Lebar (not reading bugmail) 2013-07-15 09:27:20 PDT
> when LMK Kills an application, we get "ChannelError" and it triggers DestorySubtree() to 
> call ActorDestroy() and Patch V1,2 is working on this.
> is it correct?

Yes, I think so.

> if then, when we don't get any OnChannelError from Link, we have no chance to run 
> ActorDestory()

Do you mean, in the normal shutdown case?  Are you saying that, in the normal shutdown case (e.g. when you close a tab in the browser), we never run ContentParent::ActorDestroy?  That's an easy thing for you to test, if you think that's true; just attach gdb to the main process and put a breakpoint there.  If it's true, that's definitely a bug.
Comment 20 Andrew McCreight [:mccr8] 2013-07-15 09:51:51 PDT
I filed bug 893857 and bug 893865 for the other 2 problems I saw in the top 5 heap-unclassified, but they don't block this bug per se.  I'll try to figure out the appropriate meta-bug to make them block.
Comment 21 Ryan VanderMeulen [:RyanVM] 2013-07-15 14:15:39 PDT
https://hg.mozilla.org/mozilla-central/rev/fb0206edf864
Comment 22 Ryan VanderMeulen [:RyanVM] 2013-07-15 14:44:33 PDT
https://hg.mozilla.org/releases/mozilla-b2g18/rev/a0fce6b5bdb2
Comment 23 Jinho Hwang [:jeffhwang] 2013-07-15 18:16:54 PDT
Created attachment 776090 [details] [diff] [review]
Call ShutDownProcess() when appropriate.
Comment 24 Justin Lebar (not reading bugmail) 2013-07-15 18:23:15 PDT
> Call ShutDownProcess() when appropriate.

Can we please do this in a different bug?
Comment 25 Jinho Hwang [:jeffhwang] 2013-07-15 18:37:53 PDT
Oh, sure.
Sorry about this.
I will open a new bug.
Comment 27 Andrew McCreight [:mccr8] 2013-07-19 12:00:30 PDT
*** Bug 870588 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.