Closed Bug 1070912 Opened 10 years ago Closed 10 years ago

Leak CompositorThread when Nuwa is enabled on debug builds

Categories

(Core :: DOM: Content Processes, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla36

People

(Reporter: ting, Assigned: ting)

References

Details

Attachments

(3 files, 3 obsolete files)

+++ This bug was initially created as a clone of Bug #1050122 +++

The content parent of Nuwa opens PCompositor and PImageBridge which the corresponding CrossProcessCompositorParent and ImageBridgeParent both holds a reference to CompositorParent::sCompositorThreadHolder. This makes mochitest time out when it is finished since the refcnt of sCompositorThreadHolder does not decrease to 0 and CompositorParent::ShutDown() never ends.

STR:
1. Remove "#ifndef DEBUG" wraps "dom.ipc.processPrelaunch.enabled" in b2g.js to enable it always.
2. Build with B2G_DEBUG=1 for emulator.
3. Run any mochitest, e.g., |$ ./mach -v mochitest-remote dom/inputmethod/|
4. The test will time out after a while, the message is something like:
   DeviceRunner TEST-UNEXPECTED-FAIL | ... | application timed out after 450.0 seconds with no output
Assignee: nobody → tchou
Status: NEW → ASSIGNED
Cervantes and Partick suggest use KillHard() to kill Nuwa process and destroy its CompositorParent and ImageBridgeParent. I tried, and the mochitest then can finish, but the leakcheck receives nothing due to crash:

  PROCESS-CRASH | mozrunner-startup | application crashed [@ libxul.so + 0x361760]
  Crash dump filename: /tmp/tmpBEvMO0/15711654-47ed-56ac-1614d5aa-56ddbfad.dmp
  TEST-INFO | leakcheck | threshold set at 5180 bytes
  WARNING | leakcheck | missing output line for total leaks!

The partial stack from minidump:

   0  libxul.so + 0x361760
        nsCycleCollector::Suspect(void*, nsCycleCollectionParticipant*, nsCycleCollectingAutoRefCnt*)
        /home/ting/w/fx/os/emulator/gecko/xpcom/base/nsCycleCollector.cpp:3443

        MOZ_ASSERT(HasParticipant(aPtr, aParti),
                   "Suspected nsISupports pointer must QI to nsXPCOMCycleCollectionParticipant");

   1  libxul.so + 0x1fe3015
   2  libxul.so + 0xf196a5
        nsCycleCollectingAutoRefCnt::decr(void*, nsCycleCollectionParticipant*, bool*)
        /home/ting/w/fx/os/emulator/objdir-gecko/dom/ipc/../../dist/include/nsISupportsImpl.h:282
   3  libxul.so + 0xf19739
        mozilla::dom::ContentParent::Release()
        /home/ting/w/fx/os/emulator/gecko/dom/ipc/ContentParent.cpp:2543
   4  libxul.so + 0xf146af
        RunnableMethodTraits<mozilla::dom::ContentParent>::ReleaseCallee(mozilla::dom::ContentParent*)
        /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/task.h:263
   5  libxul.so + 0xf146d7
        ~RunnableMethod
        /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/task.h:303
   6  libxul.so + 0x575ef3
        MessageLoop::DeletePendingTasks()
        /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/message_loop.cc:422
   7  libxul.so + 0x6f3df3
        mozJSComponentLoader::UnloadModules()
        /home/ting/w/fx/os/emulator/gecko/js/xpconnect/loader/mozJSComponentLoader.cpp:1075
   8  libxul.so + 0x1fd82c3
   9  libxul.so + 0x5776c1
        ~MessageLoop
        /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/message_loop.cc:176
  10  0x7ffe
It looks like that is too late to destroy ContentParent.  The cycle collector is probably shut down already.
Yes, you're right, thank you for saving my day. I am seeking other proper places for KillHard(), but seems "xpcom-shutdown" is the earliest event for the ContentParent...
Can you figure out why we are putting IPC events into the event queue when we call KillHard?
Which IPC events did you mean?

I put KillHard() after Close() in ContentParent::ShutDownProcess() for comment 1, which is triggered by "xpcom-shutdown".
I am not sure I understand correctly, but if you meant MessageLoop::DeletePendingTasks() in the stack, in KillHard() there's a delayed task ContentParent::ShutDownProcess post to MessageLoop::current().
Try removing that delay.  Justin and I never understood why it was there.  See Bug 894191 comment 4, etc.
It now shutdowns nicely, I will double check bug 894191.
From the code, the delayed ShutDownProcess seems redundant as ActorDestroy always calls it. I have asked Jeff on IRC why it was added, seems it's because ShutDownProcess didn't always get called, but as it was 1 year ago, he couldn't remember it clearly. He mentioned the issue happened while running pressure test, which launch apps repeatedly for 1 to 5 days and trigger a lot of OOM killing.

I checked with with 2.2 on my Falme, when an app is killed by SIGKILL, KillHard does not get called, but ContentParent::OnChannelError, ActorDestroy and ShutDownProcess.

My guess is when KillHard gets called, somehow channel error is not triggered. Do we have such kind of test to make sure removing the delay won't repro bug 894191?
Attached patch wip v1 (obsolete) — Splinter Review
KillHard Nuwa process and removed the delayed ShutDownProcess.

https://tbpl.mozilla.org/?tree=Try&rev=687282a7c5bc
Another try with the patch of bug 1050122: https://tbpl.mozilla.org/?tree=Try&rev=e312f82202f9, the timeout goes away, and comes new crashes in xpcshell tests.
Attached patch patch v1 (obsolete) — Splinter Review
Attachment #8495093 - Attachment is obsolete: true
Attachment #8498769 - Flags: review?(khuey)
I'm surprised that you're getting rid of the call to ShutdownProcess entirely.  Why is that ok?
You're right, I should test before removing it.

The delay ShutdownProcess looks like a workaround, and I think it's a chance to replace it with a proper fix which why OnChannelError is not always happened when child process gets killed.

I am thinking either 1) simply add a condition checking before sending delay ShutdownProcess, or 2) remove the delay ShutdownProcess and use Marionette to test like how LG did test before, see if bug 894191 can be reproduced, if yes, fix it.

How do you think?
Flags: needinfo?(tchou)
#2 is clearly the correct path, but I'll accept #1 if you don't have time to do that.
Comment on attachment 8498769 [details] [diff] [review]
patch v1

I see. Will do #2 later when I have time.
Attachment #8498769 - Flags: review?(khuey)
Attached file test.py (obsolete) —
I will test this on my Flame, and see whether the number of ContentParent::OnChannelError() invocations equals to the number of "send sigkill" from dmesg.
I left the device (Flame 319MB) running attachment 8502397 [details] during the weekend, the script ran ~20 hours and stopped for some exception from Marionette but the device still works.

There were 2993 ContentParent::OnChannelError invocations, and 2842 "send sigkill to [pid]" from dmesg. I am not sure why it's not matched, but bug 894191 was not reproduced, though ~20 hours is not long enough compare to what LG tested.
(In reply to Ting-Yu Chou [:ting] from comment #18)
> There were 2993 ContentParent::OnChannelError invocations, and 2842 "send
> sigkill to [pid]" from dmesg. I am not sure why it's not matched

It's because I missed OOM killer, I'll try more runs.
Attached file test.py
Remove the sleep after back to homescreen to launch app more often.
Attachment #8502397 - Attachment is obsolete: true
Attached file test_logs.zip
I had 3 test runs this week, and does not repro bug 894191 with the patch (attachment 8498769 [details] [diff] [review]):

1. ~24h10m: 5603 kill, 5601 ContentParent::OnChannelError
            2 kills are not content process (2 * Binder_2)
2. ~23h15m: 4558 kill, 4554 ContentParent::OnChannelError
            4 kills are not content process (2 * Binder_1, 1 * Analysis Helper, 1 * Binder_2)
3. ~15h25m: 3060 kill, 3058 ContentParent::OnChannelError
            2 kills are not content process (2 * Analysis Helper)

Besides, I learnt from Thinker & Shelly that there was a bug 811636 which parent is not notified when child process dies, and the patch was landed on mozilla_b2g18 one week after bug 894191. I doubt bug 894191 is a dupe of bug 811636.
Kyle, how do you think for comment 21?
Flags: needinfo?(khuey)
Let's land it!
Flags: needinfo?(khuey)
Attachment #8498769 - Flags: review?(khuey)
Try is not green this time: https://tbpl.mozilla.org/?tree=Try&rev=58089e267227, I'll check it.
For some reasons, the Close() call in ShutdownProcess(0) now goes to ActorDestroy() synchronously and clear mSubprocess.
Attached patch patch v2Splinter Review
Added a protector before accessing mSubprocess in KillHard().

Try: https://tbpl.mozilla.org/?tree=Try&rev=174b091b15d9. There're a bunch of ScriptTimeoutException in marionette-webapi tests, I am not quite sure are they caused by the patch, the exception is from marionette.py. Test xpcshell-2 is also failed with timeout, I'll check it.
Attachment #8498769 - Attachment is obsolete: true
With the latest code today, I couldn't reprouce the fail (b2g_emulator_vm try debug test xpcshell-2) of comment 26 locally or on Try.
Try looks good: https://tbpl.mozilla.org/?tree=Try&rev=82327d4451fc

I didn't ask for another review for patch v2 since I added only a null check and comment.
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/c48992ed7bfc
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: