The default bug view has changed. See this FAQ.

ContentParent doesn't run ShutDownProcess when its process is SIGKIL'ed, leading to leaks in B2G

RESOLVED FIXED in Firefox 25, Firefox OS v1.1hd

Status

()

Core
IPC
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: Justin Lebar (not reading bugmail), Assigned: Justin Lebar (not reading bugmail))

Tracking

Trunk
mozilla25
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:leo+, firefox23 wontfix, firefox24 wontfix, firefox25 fixed, b2g18 fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix, b2g-v1.1hd fixed)

Details

(Whiteboard: [MemShrink] [LeoVB+])

Attachments

(5 attachments, 2 obsolete attachments)

(Assignee)

Description

4 years ago
As a result, ContentParent leaves its channel open.  As a result of that, we allow messages to pile up in this channel arbitrarily.  This causes leaks which manifest in many ways; see the dep list.
(Assignee)

Updated

4 years ago
Blocks: 893012
(Assignee)

Comment 1

4 years ago
Created attachment 774909 [details] [diff] [review]
Patch, v1

If we don't do this, a ContentParent's channel stays open after its
process is SIGKIL'ed.  This causes us to continue to send it broadcast
messages.  These messages build up arbitrarily and manifest as leaks
under a variety of endurance-test scenarios.
(Assignee)

Comment 2

4 years ago
Comment on attachment 774909 [details] [diff] [review]
Patch, v1

Benjamin, are you willing and able to review this?

This is a critical bug for Leo; we need to get it in asap.
Attachment #774909 - Attachment description: Call ShutDownProcess in ContentParent::ActorDestroy. → Patch, v1
Attachment #774909 - Flags: review?(benjamin)
(Assignee)

Updated

4 years ago
Assignee: nobody → justin.lebar+bug
(Assignee)

Updated

4 years ago
blocking-b2g: --- → leo+
(Assignee)

Comment 3

4 years ago
I tried regular Close() before I tried CloseWithError(), and it seems to work OK, but not quite as well.  In particular, I get a bunch of assertions while tearing down the IPDL objects under ContentParent, which I don't get with CloseWithError().
Comment on attachment 774909 [details] [diff] [review]
Patch, v1

Review of attachment 774909 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for tracking this down! I think this is the right thing to do with one caveat below (and one silly name thing):

::: dom/ipc/ContentParent.cpp
@@ +729,5 @@
>      return SendSetProcessPrivileges(aPrivs);
>  }
>  
>  void
> +ContentParent::ShutDownProcess(bool aForce)

Nit: "force" is the wrong thing here. We're not forcing anything. How about just "aFromActorDestroy"?

@@ +751,1 @@
>      mIsDestroyed = true;

Move this before calling Close() to ensure we don't reenter.
Attachment #774909 - Flags: review?(benjamin) → review+
(Assignee)

Comment 5

4 years ago
Comment on attachment 774909 [details] [diff] [review]
Patch, v1

Jeremy, I'd like Leo to test this before we check it in.  Is that something you're able to do?

This will hopefully erase the pickle problem.  We do have one known remaining memory leak, bug 893012, but the leak should be much smaller.
Attachment #774909 - Flags: feedback?(jeremy.kim.leo)
ok. i will notify result, ASAP 
thanks~
Created attachment 775458 [details]
test result with Attachment 774909 [details] [diff]
Created attachment 775459 [details]
the dmd report of devices1
Created attachment 775460 [details]
the dmd report of devices2
all devices is tested with same way Bug 889261#C34

"device 1" & "device 2" is applied Attachment 774909 [details] [diff]. 
"nothing" isn't applied anythings.

the result is good better than "nothing".
we checked and logged the count of output_queue_.size(). b2g has some channel with unhandling ipc message.
I/Channel (  145): channel id : 0x4fdae000  pipe : 269 queue size : 30
I/Channel (  145): channel id : 0x5064c000  pipe : 145 queue size : 54
I/Channel (  145): channel id : 0x5106c000  pipe : 259 queue size : 70
I/Channel (  145): channel id : 0x50646000  pipe : 123 queue size : 73

this is patch to log output_queue_.size()

diff --git a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
index 6f3d43a..2a1e548 100644
--- a/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
+++ b/ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ -31,6 +31,10 @@
 #include "chrome/common/file_descriptor_set_posix.h"
 #include "chrome/common/ipc_logging.h"
 #include "chrome/common/ipc_message_utils.h"
+#include "android/log.h"
+
+#define LOGA(args...)                                            \
+    __android_log_print(ANDROID_LOG_INFO, "Channel" , ## args)
 
 namespace IPC {
 
@@ -720,6 +724,9 @@ bool Channel::ChannelImpl::Send(Message* message) {
     }
   }
 
+  if(output_queue_.size()) {
+    LOGA("channel id : %p  pipe : %d queue size : %d", this, pipe_, output_queue_.size());
+  }
   return true;
 }
(Assignee)

Comment 11

4 years ago
Thanks a lot for testing, Jeremy.

we should definitely check this in; based on comment 7, it's a big improvement over where we were.

It's not clear to me from comment 10 that we're actually leaking these messages.  It's possible that every time you read the channel size, there are some messages in the queue, but those messages eventually get sent.

It's also not clear that, if we are in fact leaking messages, those are the main cause of the gradual memory increase in the graph.  Indeed, looking at the DMD report in device 2, I see only ~130kb of pickle data:

> Unreported: ~34 blocks from ~5 stack trace records in stack frame record 512 of 3,495
>  ~139,162 bytes (~139,162 requested / ~0 slop)
>  0.22% of the heap;  0.00% of unreported
>  PC is
>    Pickle::Resize(unsigned int) /home001/jeremy.kim/workspace/d300/b2g/gecko/ipc/chromium/src/base/pickle.cc:531 (0x418e737e libxul.so+0xac937e)

That might still be a memory leak, but it doesn't look like that's what's causing the large gradual increase in memory usage from comment 7.
(Assignee)

Updated

4 years ago
Attachment #774909 - Flags: feedback?(jeremy.kim.leo)
(Assignee)

Comment 12

4 years ago
In bug 893242 I'm adding a memory reporter that should tell us how many outstanding messages each IPC channel has; that patch should do something similar to the code in comment 10.
(Assignee)

Comment 13

4 years ago
Created attachment 775465 [details] [diff] [review]
Patch, v2 (for checkin-needed)

Trees are closed now.  I can check this in tomorrow, but feel free to check in if you get to this first.
Attachment #774909 - Attachment is obsolete: true
(Assignee)

Updated

4 years ago
Keywords: checkin-needed
(Assignee)

Updated

4 years ago
Summary: ContentParent doesn't run ShutDownProcess when its process is SIGKIL'ed → ContentParent doesn't run ShutDownProcess when its process is SIGKIL'ed, leading to leaks in B2G
(Assignee)

Updated

4 years ago
Whiteboard: [MemShrink]
(In reply to Justin Lebar [:jlebar] from comment #13)
> Created attachment 775465 [details] [diff] [review]
> Patch, v2 (for checkin-needed)

Hi Justin, Since LMK runs to kill an application, the child process of b2g doesn't have a chance to send back __delete__ message to TabParent, see http://dxr.mozilla.org/mozilla-central/source/dom/ipc/TabChild.cpp?from=RecvDestroy#l2099.

That's why the Patch, V1 doesn't help to solve memory leak problem in b2g.
I think the Patch V2 will work fine, because it calls ShutDownProcess in MakeAsDead(). 
I only have one question for Patch V2, do you think it is okay we don't change the mNumDestroyingTabs, count of tabs, before call the ShutDownProcess()? 

ContentParent check the value at http://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#l975 and if it is not same with numLiveTabs, it just stop to processing NOtifyTabDestorying().
Flags: needinfo?(justin.lebar+bug)
(Assignee)

Comment 15

4 years ago
I checked the interdiff between the v1 and v2 patches in this bug, and they do not seem to be different in the way I understand you are describing.  Indeed, patches v1 and v2 are meant to be identical in behavior.

I understood comment 7 to say that patch v1 fixed a large portion of the leak you guys had observed.  Is that not correct?  Do you expect v2 to behave differently?

Maybe you mean something different by "patch v1" and "patch v2" than I understand here?

$ interdiff <(curl -L https://bug893172.bugzilla.mozilla.org/attachment.cgi?id=774909) <(curl -L https://bug893172.bugzilla.mozilla.org/attachment.cgi?id=775465)

> do you think it is okay we don't change the mNumDestroyingTabs, count of tabs, before 
> call the ShutDownProcess()? 

I don't think this matters.

The only reason we care about what happens there is to make sure that we call ContentParent::KillHard when appropriate.

In the case when the child process was SIGKIL'ed, we obviously don't need to run ContentParent::KillHard.

In the case when the child process was not SIGKIL'ed, our behavior should not be changed by either patch.

If you think this matters, could you be more explicit about what you think could go wrong?
Flags: needinfo?(justin.lebar+bug)
(In reply to Justin Lebar [:jlebar] from comment #15)
Sorry for my misunderstanding.

> I checked the interdiff between the v1 and v2 patches in this bug, and they
> do not seem to be different in the way I understand you are describing. 
> Indeed, patches v1 and v2 are meant to be identical in behavior.

Sorry for my misunderstanding on Patch V2. Yes, you are right, they don't seem to be different. 

> I understood comment 7 to say that patch v1 fixed a large portion of the
> leak you guys had observed.  Is that not correct?  
  I guess It helps to fix the memory leak in b2g process in this case.

> Do you expect v2 to behave differently? 
  No, I don't. I misunderstood ..

> If you think this matters, could you be more explicit about what you think
> could go wrong?

Okay , let me clear all of these things so you & I both will have better understanding on this problem.

What I thought is ..
As I know so far, When LMK kills one of child processes, ContentParent doesn't run the ShutDownProcess() because the child killed by LMK doesn't have any chance to send __delete__ massage to TabParent which trigger the NotifyTabDestoryed to run ShutDownProcess(). 

Hence, we needed to call ShutDownProcess() when LMK kills an application. 
That's in Patch V1 and V2 in ActorDestroy(). 
It's good to have chance to clean up. 

What I'm thinking is that ...
 when LMK Kills an application, we get "ChannelError" and it triggers DestorySubtree() to call ActorDestroy() and Patch V1,2 is working on this.
 is it correct?
 if then, when we don't get any OnChannelError from Link, we have no chance to run ActorDestory() so what I'd like to suggest is to Call ShutDownProcess in NotifyTabDestroying() with a timer. So, If we don't get the NotifyTabDestoryed() in a given time, we can clean up the ContentParent as well as the channel.

Here is what the patch looks like...
I'm just imaging this.. didn't tried.

void ContentParent::NotifyTabDestroying(PBrowserParent* aTab) {
....
	// recycled during its shutdown procedure.
	MarkAsDead();

+    MessageLoop::current()->PostDelayedTask(
+            FROM_HERE,
+            NewRunnableMethod(this, &ContentParent::ShutDownProcess),
+            TIMEOUTSECS * 1000);

	int32_t timeoutSecs = Preferences::GetInt(
			"dom.ipc.tabs.shutdownTimeoutSecs", 5);

As you know I don't have much experience on B2G.
So, I'd like to have your feedback.
Flags: needinfo?(justin.lebar+bug)
https://hg.mozilla.org/projects/birch/rev/fb0206edf864
Keywords: checkin-needed
Created attachment 775691 [details]
top 5 heap-unclassified

Here's the top 5 heap-unclassified.

1, 2 and 4 look like they are associated with ContentParent, which is consistent with the theory that we're still leaking some kind of ContentParent-related stuff.  In total, those three account for about 9.5% of the heap.

Number 3 is 1.3mb of DNS-related stuff, which might be worth looking into as well.  Number 5 is 667k in a single block, for some kind of OpenGL thing, which is probably a legitimate buffer.  It might be nice to hook that up to a memory reporter.
(Assignee)

Comment 19

4 years ago
> when LMK Kills an application, we get "ChannelError" and it triggers DestorySubtree() to 
> call ActorDestroy() and Patch V1,2 is working on this.
> is it correct?

Yes, I think so.

> if then, when we don't get any OnChannelError from Link, we have no chance to run 
> ActorDestory()

Do you mean, in the normal shutdown case?  Are you saying that, in the normal shutdown case (e.g. when you close a tab in the browser), we never run ContentParent::ActorDestroy?  That's an easy thing for you to test, if you think that's true; just attach gdb to the main process and put a breakpoint there.  If it's true, that's definitely a bug.
Depends on: 893857
Depends on: 893865
No longer depends on: 893857, 893865
I filed bug 893857 and bug 893865 for the other 2 problems I saw in the top 5 heap-unclassified, but they don't block this bug per se.  I'll try to figure out the appropriate meta-bug to make them block.
(Assignee)

Updated

4 years ago
Flags: needinfo?(justin.lebar+bug)
https://hg.mozilla.org/mozilla-central/rev/fb0206edf864
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25
https://hg.mozilla.org/releases/mozilla-b2g18/rev/a0fce6b5bdb2
status-b2g18: --- → fixed
status-b2g18-v1.0.0: --- → wontfix
status-b2g18-v1.0.1: --- → wontfix
status-b2g-v1.1hd: --- → affected
status-firefox23: --- → wontfix
status-firefox24: --- → wontfix
status-firefox25: --- → fixed
Created attachment 776090 [details] [diff] [review]
Call ShutDownProcess() when appropriate.
Attachment #776090 - Flags: feedback?(justin.lebar+bug)
(Assignee)

Comment 24

4 years ago
> Call ShutDownProcess() when appropriate.

Can we please do this in a different bug?
Oh, sure.
Sorry about this.
I will open a new bug.
https://hg.mozilla.org/releases/mozilla-b2g18_v1_1_0_hd/rev/a0fce6b5bdb2
status-b2g-v1.1hd: affected → fixed
Attachment #776090 - Attachment is obsolete: true
Attachment #776090 - Flags: feedback?(justin.lebar+bug)
(Assignee)

Updated

4 years ago
Blocks: 894191
No longer blocks: 870588
Duplicate of this bug: 870588

Updated

4 years ago
Whiteboard: [MemShrink] → [MemShrink] [LeoVB+]
You need to log in before you can comment on or make changes to this bug.