Closed Bug 893172 Opened 6 years ago Closed 6 years ago

ContentParent doesn't run ShutDownProcess when its process is SIGKIL'ed, leading to leaks in B2G

Categories

(Core :: IPC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla25
blocking-b2g leo+
Tracking Status
firefox23 --- wontfix
firefox24 --- wontfix
firefox25 --- fixed
b2g18 --- fixed
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- wontfix
b2g-v1.1hd --- fixed

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

References

Details

(Whiteboard: [MemShrink] [LeoVB+])

Attachments

(5 files, 2 obsolete files)

As a result, ContentParent leaves its channel open.  As a result of that, we allow messages to pile up in this channel arbitrarily.  This causes leaks which manifest in many ways; see the dep list.
Blocks: 893012
Attached patch Patch, v1 (obsolete) — Splinter Review
If we don't do this, a ContentParent's channel stays open after its
process is SIGKIL'ed.  This causes us to continue to send it broadcast
messages.  These messages build up arbitrarily and manifest as leaks
under a variety of endurance-test scenarios.
Comment on attachment 774909 [details] [diff] [review]
Patch, v1

Benjamin, are you willing and able to review this?

This is a critical bug for Leo; we need to get it in asap.
Attachment #774909 - Attachment description: Call ShutDownProcess in ContentParent::ActorDestroy. → Patch, v1
Attachment #774909 - Flags: review?(benjamin)
Assignee: nobody → justin.lebar+bug
blocking-b2g: --- → leo+
I tried regular Close() before I tried CloseWithError(), and it seems to work OK, but not quite as well.  In particular, I get a bunch of assertions while tearing down the IPDL objects under ContentParent, which I don't get with CloseWithError().
Comment on attachment 774909 [details] [diff] [review]
Patch, v1

Review of attachment 774909 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for tracking this down! I think this is the right thing to do with one caveat below (and one silly name thing):

::: dom/ipc/ContentParent.cpp
@@ +729,5 @@
>      return SendSetProcessPrivileges(aPrivs);
>  }
>  
>  void
> +ContentParent::ShutDownProcess(bool aForce)

Nit: "force" is the wrong thing here. We're not forcing anything. How about just "aFromActorDestroy"?

@@ +751,1 @@
>      mIsDestroyed = true;

Move this before calling Close() to ensure we don't reenter.
Attachment #774909 - Flags: review?(benjamin) → review+
Comment on attachment 774909 [details] [diff] [review]
Patch, v1

Jeremy, I'd like Leo to test this before we check it in.  Is that something you're able to do?

This will hopefully erase the pickle problem.  We do have one known remaining memory leak, bug 893012, but the leak should be much smaller.
Attachment #774909 - Flags: feedback?(jeremy.kim.leo)
ok. i will notify result, ASAP 
thanks~
all devices is tested with same way Bug 889261#C34

"device 1" & "device 2" is applied Attachment 774909 [details] [diff]. 
"nothing" isn't applied anythings.

the result is good better than "nothing".
we checked and logged the count of output_queue_.size(). b2g has some channel with unhandling ipc message.
I/Channel (  145): channel id : 0x4fdae000  pipe : 269 queue size : 30
I/Channel (  145): channel id : 0x5064c000  pipe : 145 queue size : 54
I/Channel (  145): channel id : 0x5106c000  pipe : 259 queue size : 70
I/Channel (  145): channel id : 0x50646000  pipe : 123 queue size : 73

this is patch to log output_queue_.size()

diff --git a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
index 6f3d43a..2a1e548 100644
--- a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
+++ b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ -31,6 +31,10 @@
 #include "chrome/common/file_descriptor_set_posix.h"
 #include "chrome/common/ipc_logging.h"
 #include "chrome/common/ipc_message_utils.h"
+#include "android/log.h"
+
+#define LOGA(args...)                                            \
+    __android_log_print(ANDROID_LOG_INFO, "Channel" , ## args)
 
 namespace IPC {
 
@@ -720,6 +724,9 @@ bool Channel::ChannelImpl::Send(Message* message) {
     }
   }
 
+  if(output_queue_.size()) {
+    LOGA("channel id : %p  pipe : %d queue size : %d", this, pipe_, output_queue_.size());
+  }
   return true;
 }
Thanks a lot for testing, Jeremy.

we should definitely check this in; based on comment 7, it's a big improvement over where we were.

It's not clear to me from comment 10 that we're actually leaking these messages.  It's possible that every time you read the channel size, there are some messages in the queue, but those messages eventually get sent.

It's also not clear that, if we are in fact leaking messages, those are the main cause of the gradual memory increase in the graph.  Indeed, looking at the DMD report in device 2, I see only ~130kb of pickle data:

> Unreported: ~34 blocks from ~5 stack trace records in stack frame record 512 of 3,495
>  ~139,162 bytes (~139,162 requested / ~0 slop)
>  0.22% of the heap;  0.00% of unreported
>  PC is
>    Pickle::Resize(unsigned int) /home001/jeremy.kim/workspace/d300/b2g/gecko/ipc/chromium/src/base/pickle.cc:531 (0x418e737e libxul.so+0xac937e)

That might still be a memory leak, but it doesn't look like that's what's causing the large gradual increase in memory usage from comment 7.
Attachment #774909 - Flags: feedback?(jeremy.kim.leo)
In bug 893242 I'm adding a memory reporter that should tell us how many outstanding messages each IPC channel has; that patch should do something similar to the code in comment 10.
Trees are closed now.  I can check this in tomorrow, but feel free to check in if you get to this first.
Attachment #774909 - Attachment is obsolete: true
Summary: ContentParent doesn't run ShutDownProcess when its process is SIGKIL'ed → ContentParent doesn't run ShutDownProcess when its process is SIGKIL'ed, leading to leaks in B2G
Whiteboard: [MemShrink]
(In reply to Justin Lebar [:jlebar] from comment #13)
> Created attachment 775465 [details] [diff] [review]
> Patch, v2 (for checkin-needed)

Hi Justin, Since LMK runs to kill an application, the child process of b2g doesn't have a chance to send back __delete__ message to TabParent, see http://dxr.mozilla.org/mozilla-central/source/dom/ipc/TabChild.cpp?from=RecvDestroy#l2099.

That's why the Patch, V1 doesn't help to solve memory leak problem in b2g.
I think the Patch V2 will work fine, because it calls ShutDownProcess in MakeAsDead(). 
I only have one question for Patch V2, do you think it is okay we don't change the mNumDestroyingTabs, count of tabs, before call the ShutDownProcess()? 

ContentParent check the value at http://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#l975 and if it is not same with numLiveTabs, it just stop to processing NOtifyTabDestorying().
Flags: needinfo?(justin.lebar+bug)
I checked the interdiff between the v1 and v2 patches in this bug, and they do not seem to be different in the way I understand you are describing.  Indeed, patches v1 and v2 are meant to be identical in behavior.

I understood comment 7 to say that patch v1 fixed a large portion of the leak you guys had observed.  Is that not correct?  Do you expect v2 to behave differently?

Maybe you mean something different by "patch v1" and "patch v2" than I understand here?

$ interdiff <(curl -L https://bug893172.bugzilla.mozilla.org/attachment.cgi?id=774909) <(curl -L https://bug893172.bugzilla.mozilla.org/attachment.cgi?id=775465)

> do you think it is okay we don't change the mNumDestroyingTabs, count of tabs, before 
> call the ShutDownProcess()? 

I don't think this matters.

The only reason we care about what happens there is to make sure that we call ContentParent::KillHard when appropriate.

In the case when the child process was SIGKIL'ed, we obviously don't need to run ContentParent::KillHard.

In the case when the child process was not SIGKIL'ed, our behavior should not be changed by either patch.

If you think this matters, could you be more explicit about what you think could go wrong?
Flags: needinfo?(justin.lebar+bug)
(In reply to Justin Lebar [:jlebar] from comment #15)
Sorry for my misunderstanding.

> I checked the interdiff between the v1 and v2 patches in this bug, and they
> do not seem to be different in the way I understand you are describing. 
> Indeed, patches v1 and v2 are meant to be identical in behavior.

Sorry for my misunderstanding on Patch V2. Yes, you are right, they don't seem to be different. 

> I understood comment 7 to say that patch v1 fixed a large portion of the
> leak you guys had observed.  Is that not correct?  
  I guess It helps to fix the memory leak in b2g process in this case.

> Do you expect v2 to behave differently? 
  No, I don't. I misunderstood ..

> If you think this matters, could you be more explicit about what you think
> could go wrong?

Okay , let me clear all of these things so you & I both will have better understanding on this problem.

What I thought is ..
As I know so far, When LMK kills one of child processes, ContentParent doesn't run the ShutDownProcess() because the child killed by LMK doesn't have any chance to send __delete__ massage to TabParent which trigger the NotifyTabDestoryed to run ShutDownProcess(). 

Hence, we needed to call ShutDownProcess() when LMK kills an application. 
That's in Patch V1 and V2 in ActorDestroy(). 
It's good to have chance to clean up. 

What I'm thinking is that ...
 when LMK Kills an application, we get "ChannelError" and it triggers DestorySubtree() to call ActorDestroy() and Patch V1,2 is working on this.
 is it correct?
 if then, when we don't get any OnChannelError from Link, we have no chance to run ActorDestory() so what I'd like to suggest is to Call ShutDownProcess in NotifyTabDestroying() with a timer. So, If we don't get the NotifyTabDestoryed() in a given time, we can clean up the ContentParent as well as the channel.

Here is what the patch looks like...
I'm just imaging this.. didn't tried.

void ContentParent::NotifyTabDestroying(PBrowserParent* aTab) {
....
	// recycled during its shutdown procedure.
	MarkAsDead();

+    MessageLoop::current()->PostDelayedTask(
+            FROM_HERE,
+            NewRunnableMethod(this, &ContentParent::ShutDownProcess),
+            TIMEOUTSECS * 1000);

	int32_t timeoutSecs = Preferences::GetInt(
			"dom.ipc.tabs.shutdownTimeoutSecs", 5);

As you know I don't have much experience on B2G.
So, I'd like to have your feedback.
Flags: needinfo?(justin.lebar+bug)
Here's the top 5 heap-unclassified.

1, 2 and 4 look like they are associated with ContentParent, which is consistent with the theory that we're still leaking some kind of ContentParent-related stuff.  In total, those three account for about 9.5% of the heap.

Number 3 is 1.3mb of DNS-related stuff, which might be worth looking into as well.  Number 5 is 667k in a single block, for some kind of OpenGL thing, which is probably a legitimate buffer.  It might be nice to hook that up to a memory reporter.
> when LMK Kills an application, we get "ChannelError" and it triggers DestorySubtree() to 
> call ActorDestroy() and Patch V1,2 is working on this.
> is it correct?

Yes, I think so.

> if then, when we don't get any OnChannelError from Link, we have no chance to run 
> ActorDestory()

Do you mean, in the normal shutdown case?  Are you saying that, in the normal shutdown case (e.g. when you close a tab in the browser), we never run ContentParent::ActorDestroy?  That's an easy thing for you to test, if you think that's true; just attach gdb to the main process and put a breakpoint there.  If it's true, that's definitely a bug.
Depends on: 893857
Depends on: 893865
No longer depends on: 893857, 893865
I filed bug 893857 and bug 893865 for the other 2 problems I saw in the top 5 heap-unclassified, but they don't block this bug per se.  I'll try to figure out the appropriate meta-bug to make them block.
Flags: needinfo?(justin.lebar+bug)
https://hg.mozilla.org/mozilla-central/rev/fb0206edf864
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25
Attachment #776090 - Flags: feedback?(justin.lebar+bug)
> Call ShutDownProcess() when appropriate.

Can we please do this in a different bug?
Oh, sure.
Sorry about this.
I will open a new bug.
Attachment #776090 - Attachment is obsolete: true
Attachment #776090 - Flags: feedback?(justin.lebar+bug)
Blocks: 894191
No longer blocks: 870588
Duplicate of this bug: 870588
Whiteboard: [MemShrink] → [MemShrink] [LeoVB+]
You need to log in before you can comment on or make changes to this bug.