Leak CompositorThread when Nuwa is enabled on debug builds

RESOLVED FIXED in mozilla36



DOM: Content Processes
4 years ago
4 years ago


(Reporter: ting, Assigned: ting)


Gonk (Firefox OS)

Firefox Tracking Flags

(Not tracked)



(3 attachments, 3 obsolete attachments)



4 years ago
+++ This bug was initially created as a clone of Bug #1050122 +++

The content parent of Nuwa opens PCompositor and PImageBridge which the corresponding CrossProcessCompositorParent and ImageBridgeParent both holds a reference to CompositorParent::sCompositorThreadHolder. This makes mochitest time out when it is finished since the refcnt of sCompositorThreadHolder does not decrease to 0 and CompositorParent::ShutDown() never ends.

1. Remove "#ifndef DEBUG" wraps "dom.ipc.processPrelaunch.enabled" in b2g.js to enable it always.
2. Build with B2G_DEBUG=1 for emulator.
3. Run any mochitest, e.g., |$ ./mach -v mochitest-remote dom/inputmethod/|
4. The test will time out after a while, the message is something like:
   DeviceRunner TEST-UNEXPECTED-FAIL | ... | application timed out after 450.0 seconds with no output


4 years ago
Assignee: nobody → tchou

Comment 1

4 years ago
Cervantes and Partick suggest use KillHard() to kill Nuwa process and destroy its CompositorParent and ImageBridgeParent. I tried, and the mochitest then can finish, but the leakcheck receives nothing due to crash:

  PROCESS-CRASH | mozrunner-startup | application crashed [@ libxul.so + 0x361760]
  Crash dump filename: /tmp/tmpBEvMO0/15711654-47ed-56ac-1614d5aa-56ddbfad.dmp
  TEST-INFO | leakcheck | threshold set at 5180 bytes
  WARNING | leakcheck | missing output line for total leaks!

The partial stack from minidump:

   0  libxul.so + 0x361760
        nsCycleCollector::Suspect(void*, nsCycleCollectionParticipant*, nsCycleCollectingAutoRefCnt*)

        MOZ_ASSERT(HasParticipant(aPtr, aParti),
                   "Suspected nsISupports pointer must QI to nsXPCOMCycleCollectionParticipant");

   1  libxul.so + 0x1fe3015
   2  libxul.so + 0xf196a5
        nsCycleCollectingAutoRefCnt::decr(void*, nsCycleCollectionParticipant*, bool*)
   3  libxul.so + 0xf19739
   4  libxul.so + 0xf146af
   5  libxul.so + 0xf146d7
   6  libxul.so + 0x575ef3
   7  libxul.so + 0x6f3df3
   8  libxul.so + 0x1fd82c3
   9  libxul.so + 0x5776c1
  10  0x7ffe
It looks like that is too late to destroy ContentParent.  The cycle collector is probably shut down already.

Comment 3

4 years ago
Yes, you're right, thank you for saving my day. I am seeking other proper places for KillHard(), but seems "xpcom-shutdown" is the earliest event for the ContentParent...
Can you figure out why we are putting IPC events into the event queue when we call KillHard?

Comment 5

4 years ago
Which IPC events did you mean?

I put KillHard() after Close() in ContentParent::ShutDownProcess() for comment 1, which is triggered by "xpcom-shutdown".

Comment 6

4 years ago
I am not sure I understand correctly, but if you meant MessageLoop::DeletePendingTasks() in the stack, in KillHard() there's a delayed task ContentParent::ShutDownProcess post to MessageLoop::current().
Try removing that delay.  Justin and I never understood why it was there.  See Bug 894191 comment 4, etc.

Comment 8

4 years ago
It now shutdowns nicely, I will double check bug 894191.

Comment 9

4 years ago
From the code, the delayed ShutDownProcess seems redundant as ActorDestroy always calls it. I have asked Jeff on IRC why it was added, seems it's because ShutDownProcess didn't always get called, but as it was 1 year ago, he couldn't remember it clearly. He mentioned the issue happened while running pressure test, which launch apps repeatedly for 1 to 5 days and trigger a lot of OOM killing.

I checked with with 2.2 on my Falme, when an app is killed by SIGKILL, KillHard does not get called, but ContentParent::OnChannelError, ActorDestroy and ShutDownProcess.

My guess is when KillHard gets called, somehow channel error is not triggered. Do we have such kind of test to make sure removing the delay won't repro bug 894191?

Comment 10

4 years ago
Created attachment 8495093 [details] [diff] [review]
wip v1

KillHard Nuwa process and removed the delayed ShutDownProcess.


Comment 11

4 years ago
Another try with the patch of bug 1050122: https://tbpl.mozilla.org/?tree=Try&rev=e312f82202f9, the timeout goes away, and comes new crashes in xpcshell tests.

Comment 12

4 years ago
Created attachment 8498769 [details] [diff] [review]
patch v1
Attachment #8495093 - Attachment is obsolete: true
Attachment #8498769 - Flags: review?(khuey)
I'm surprised that you're getting rid of the call to ShutdownProcess entirely.  Why is that ok?

Comment 14

4 years ago
You're right, I should test before removing it.

The delay ShutdownProcess looks like a workaround, and I think it's a chance to replace it with a proper fix which why OnChannelError is not always happened when child process gets killed.

I am thinking either 1) simply add a condition checking before sending delay ShutdownProcess, or 2) remove the delay ShutdownProcess and use Marionette to test like how LG did test before, see if bug 894191 can be reproduced, if yes, fix it.

How do you think?
Flags: needinfo?(tchou)
#2 is clearly the correct path, but I'll accept #1 if you don't have time to do that.

Comment 16

4 years ago
Comment on attachment 8498769 [details] [diff] [review]
patch v1

I see. Will do #2 later when I have time.
Attachment #8498769 - Flags: review?(khuey)

Comment 17

4 years ago
Created attachment 8502397 [details]

I will test this on my Flame, and see whether the number of ContentParent::OnChannelError() invocations equals to the number of "send sigkill" from dmesg.

Comment 18

4 years ago
I left the device (Flame 319MB) running attachment 8502397 [details] during the weekend, the script ran ~20 hours and stopped for some exception from Marionette but the device still works.

There were 2993 ContentParent::OnChannelError invocations, and 2842 "send sigkill to [pid]" from dmesg. I am not sure why it's not matched, but bug 894191 was not reproduced, though ~20 hours is not long enough compare to what LG tested.

Comment 19

4 years ago
(In reply to Ting-Yu Chou [:ting] from comment #18)
> There were 2993 ContentParent::OnChannelError invocations, and 2842 "send
> sigkill to [pid]" from dmesg. I am not sure why it's not matched

It's because I missed OOM killer, I'll try more runs.

Comment 20

4 years ago
Created attachment 8506774 [details]

Remove the sleep after back to homescreen to launch app more often.
Attachment #8502397 - Attachment is obsolete: true

Comment 21

4 years ago
Created attachment 8506783 [details]

I had 3 test runs this week, and does not repro bug 894191 with the patch (attachment 8498769 [details] [diff] [review]):

1. ~24h10m: 5603 kill, 5601 ContentParent::OnChannelError
            2 kills are not content process (2 * Binder_2)
2. ~23h15m: 4558 kill, 4554 ContentParent::OnChannelError
            4 kills are not content process (2 * Binder_1, 1 * Analysis Helper, 1 * Binder_2)
3. ~15h25m: 3060 kill, 3058 ContentParent::OnChannelError
            2 kills are not content process (2 * Analysis Helper)

Besides, I learnt from Thinker & Shelly that there was a bug 811636 which parent is not notified when child process dies, and the patch was landed on mozilla_b2g18 one week after bug 894191. I doubt bug 894191 is a dupe of bug 811636.

Comment 22

4 years ago
Kyle, how do you think for comment 21?
Flags: needinfo?(khuey)
Let's land it!
Flags: needinfo?(khuey)


4 years ago
Attachment #8498769 - Flags: review?(khuey)

Comment 24

4 years ago
Try is not green this time: https://tbpl.mozilla.org/?tree=Try&rev=58089e267227, I'll check it.

Comment 25

4 years ago
For some reasons, the Close() call in ShutdownProcess(0) now goes to ActorDestroy() synchronously and clear mSubprocess.

Comment 26

4 years ago
Created attachment 8509377 [details] [diff] [review]
patch v2

Added a protector before accessing mSubprocess in KillHard().

Try: https://tbpl.mozilla.org/?tree=Try&rev=174b091b15d9. There're a bunch of ScriptTimeoutException in marionette-webapi tests, I am not quite sure are they caused by the patch, the exception is from marionette.py. Test xpcshell-2 is also failed with timeout, I'll check it.
Attachment #8498769 - Attachment is obsolete: true

Comment 27

4 years ago
With the latest code today, I couldn't reprouce the fail (b2g_emulator_vm try debug test xpcshell-2) of comment 26 locally or on Try.

Comment 28

4 years ago
Try looks good: https://tbpl.mozilla.org/?tree=Try&rev=82327d4451fc

I didn't ask for another review for patch v2 since I added only a null check and comment.
Keywords: checkin-needed
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
You need to log in before you can comment on or make changes to this bug.