Closed
Bug 1070912
Opened 10 years ago
Closed 10 years ago
Leak CompositorThread when Nuwa is enabled on debug builds
Categories
(Core :: DOM: Content Processes, defect)
Tracking
()
RESOLVED
FIXED
mozilla36
People
(Reporter: ting, Assigned: ting)
References
Details
Attachments
(3 files, 3 obsolete files)
855 bytes,
text/x-python
|
Details | |
471.25 KB,
application/zip
|
Details | |
2.34 KB,
patch
|
Details | Diff | Splinter Review |
+++ This bug was initially created as a clone of Bug #1050122 +++ The content parent of Nuwa opens PCompositor and PImageBridge which the corresponding CrossProcessCompositorParent and ImageBridgeParent both holds a reference to CompositorParent::sCompositorThreadHolder. This makes mochitest time out when it is finished since the refcnt of sCompositorThreadHolder does not decrease to 0 and CompositorParent::ShutDown() never ends. STR: 1. Remove "#ifndef DEBUG" wraps "dom.ipc.processPrelaunch.enabled" in b2g.js to enable it always. 2. Build with B2G_DEBUG=1 for emulator. 3. Run any mochitest, e.g., |$ ./mach -v mochitest-remote dom/inputmethod/| 4. The test will time out after a while, the message is something like: DeviceRunner TEST-UNEXPECTED-FAIL | ... | application timed out after 450.0 seconds with no output
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → tchou
Status: NEW → ASSIGNED
Assignee | ||
Comment 1•10 years ago
|
||
Cervantes and Partick suggest use KillHard() to kill Nuwa process and destroy its CompositorParent and ImageBridgeParent. I tried, and the mochitest then can finish, but the leakcheck receives nothing due to crash: PROCESS-CRASH | mozrunner-startup | application crashed [@ libxul.so + 0x361760] Crash dump filename: /tmp/tmpBEvMO0/15711654-47ed-56ac-1614d5aa-56ddbfad.dmp TEST-INFO | leakcheck | threshold set at 5180 bytes WARNING | leakcheck | missing output line for total leaks! The partial stack from minidump: 0 libxul.so + 0x361760 nsCycleCollector::Suspect(void*, nsCycleCollectionParticipant*, nsCycleCollectingAutoRefCnt*) /home/ting/w/fx/os/emulator/gecko/xpcom/base/nsCycleCollector.cpp:3443 MOZ_ASSERT(HasParticipant(aPtr, aParti), "Suspected nsISupports pointer must QI to nsXPCOMCycleCollectionParticipant"); 1 libxul.so + 0x1fe3015 2 libxul.so + 0xf196a5 nsCycleCollectingAutoRefCnt::decr(void*, nsCycleCollectionParticipant*, bool*) /home/ting/w/fx/os/emulator/objdir-gecko/dom/ipc/../../dist/include/nsISupportsImpl.h:282 3 libxul.so + 0xf19739 mozilla::dom::ContentParent::Release() /home/ting/w/fx/os/emulator/gecko/dom/ipc/ContentParent.cpp:2543 4 libxul.so + 0xf146af RunnableMethodTraits<mozilla::dom::ContentParent>::ReleaseCallee(mozilla::dom::ContentParent*) /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/task.h:263 5 libxul.so + 0xf146d7 ~RunnableMethod /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/task.h:303 6 libxul.so + 0x575ef3 MessageLoop::DeletePendingTasks() /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/message_loop.cc:422 7 libxul.so + 0x6f3df3 mozJSComponentLoader::UnloadModules() /home/ting/w/fx/os/emulator/gecko/js/xpconnect/loader/mozJSComponentLoader.cpp:1075 8 libxul.so + 0x1fd82c3 9 libxul.so + 0x5776c1 ~MessageLoop /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/message_loop.cc:176 10 0x7ffe
It looks like that is too late to destroy ContentParent. The cycle collector is probably shut down already.
Assignee | ||
Comment 3•10 years ago
|
||
Yes, you're right, thank you for saving my day. I am seeking other proper places for KillHard(), but seems "xpcom-shutdown" is the earliest event for the ContentParent...
Can you figure out why we are putting IPC events into the event queue when we call KillHard?
Assignee | ||
Comment 5•10 years ago
|
||
Which IPC events did you mean? I put KillHard() after Close() in ContentParent::ShutDownProcess() for comment 1, which is triggered by "xpcom-shutdown".
Assignee | ||
Comment 6•10 years ago
|
||
I am not sure I understand correctly, but if you meant MessageLoop::DeletePendingTasks() in the stack, in KillHard() there's a delayed task ContentParent::ShutDownProcess post to MessageLoop::current().
Try removing that delay. Justin and I never understood why it was there. See Bug 894191 comment 4, etc.
Assignee | ||
Comment 8•10 years ago
|
||
It now shutdowns nicely, I will double check bug 894191.
Assignee | ||
Comment 9•10 years ago
|
||
From the code, the delayed ShutDownProcess seems redundant as ActorDestroy always calls it. I have asked Jeff on IRC why it was added, seems it's because ShutDownProcess didn't always get called, but as it was 1 year ago, he couldn't remember it clearly. He mentioned the issue happened while running pressure test, which launch apps repeatedly for 1 to 5 days and trigger a lot of OOM killing. I checked with with 2.2 on my Falme, when an app is killed by SIGKILL, KillHard does not get called, but ContentParent::OnChannelError, ActorDestroy and ShutDownProcess. My guess is when KillHard gets called, somehow channel error is not triggered. Do we have such kind of test to make sure removing the delay won't repro bug 894191?
Assignee | ||
Comment 10•10 years ago
|
||
KillHard Nuwa process and removed the delayed ShutDownProcess. https://tbpl.mozilla.org/?tree=Try&rev=687282a7c5bc
Assignee | ||
Comment 11•10 years ago
|
||
Another try with the patch of bug 1050122: https://tbpl.mozilla.org/?tree=Try&rev=e312f82202f9, the timeout goes away, and comes new crashes in xpcshell tests.
Assignee | ||
Comment 12•10 years ago
|
||
Attachment #8495093 -
Attachment is obsolete: true
Attachment #8498769 -
Flags: review?(khuey)
I'm surprised that you're getting rid of the call to ShutdownProcess entirely. Why is that ok?
Flags: needinfo?(tchou)
Assignee | ||
Comment 14•10 years ago
|
||
You're right, I should test before removing it. The delay ShutdownProcess looks like a workaround, and I think it's a chance to replace it with a proper fix which why OnChannelError is not always happened when child process gets killed. I am thinking either 1) simply add a condition checking before sending delay ShutdownProcess, or 2) remove the delay ShutdownProcess and use Marionette to test like how LG did test before, see if bug 894191 can be reproduced, if yes, fix it. How do you think?
Flags: needinfo?(tchou)
#2 is clearly the correct path, but I'll accept #1 if you don't have time to do that.
Assignee | ||
Comment 16•10 years ago
|
||
Comment on attachment 8498769 [details] [diff] [review] patch v1 I see. Will do #2 later when I have time.
Attachment #8498769 -
Flags: review?(khuey)
Assignee | ||
Comment 17•10 years ago
|
||
I will test this on my Flame, and see whether the number of ContentParent::OnChannelError() invocations equals to the number of "send sigkill" from dmesg.
Assignee | ||
Comment 18•10 years ago
|
||
I left the device (Flame 319MB) running attachment 8502397 [details] during the weekend, the script ran ~20 hours and stopped for some exception from Marionette but the device still works. There were 2993 ContentParent::OnChannelError invocations, and 2842 "send sigkill to [pid]" from dmesg. I am not sure why it's not matched, but bug 894191 was not reproduced, though ~20 hours is not long enough compare to what LG tested.
Assignee | ||
Comment 19•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #18) > There were 2993 ContentParent::OnChannelError invocations, and 2842 "send > sigkill to [pid]" from dmesg. I am not sure why it's not matched It's because I missed OOM killer, I'll try more runs.
Assignee | ||
Comment 20•10 years ago
|
||
Remove the sleep after back to homescreen to launch app more often.
Attachment #8502397 -
Attachment is obsolete: true
Assignee | ||
Comment 21•10 years ago
|
||
I had 3 test runs this week, and does not repro bug 894191 with the patch (attachment 8498769 [details] [diff] [review]): 1. ~24h10m: 5603 kill, 5601 ContentParent::OnChannelError 2 kills are not content process (2 * Binder_2) 2. ~23h15m: 4558 kill, 4554 ContentParent::OnChannelError 4 kills are not content process (2 * Binder_1, 1 * Analysis Helper, 1 * Binder_2) 3. ~15h25m: 3060 kill, 3058 ContentParent::OnChannelError 2 kills are not content process (2 * Analysis Helper) Besides, I learnt from Thinker & Shelly that there was a bug 811636 which parent is not notified when child process dies, and the patch was landed on mozilla_b2g18 one week after bug 894191. I doubt bug 894191 is a dupe of bug 811636.
Let's land it!
Flags: needinfo?(khuey)
Assignee | ||
Updated•10 years ago
|
Attachment #8498769 -
Flags: review?(khuey)
Attachment #8498769 -
Flags: review?(khuey) → review+
Assignee | ||
Comment 24•10 years ago
|
||
Try is not green this time: https://tbpl.mozilla.org/?tree=Try&rev=58089e267227, I'll check it.
Assignee | ||
Comment 25•10 years ago
|
||
For some reasons, the Close() call in ShutdownProcess(0) now goes to ActorDestroy() synchronously and clear mSubprocess.
Assignee | ||
Comment 26•10 years ago
|
||
Added a protector before accessing mSubprocess in KillHard(). Try: https://tbpl.mozilla.org/?tree=Try&rev=174b091b15d9. There're a bunch of ScriptTimeoutException in marionette-webapi tests, I am not quite sure are they caused by the patch, the exception is from marionette.py. Test xpcshell-2 is also failed with timeout, I'll check it.
Attachment #8498769 -
Attachment is obsolete: true
Assignee | ||
Comment 27•10 years ago
|
||
With the latest code today, I couldn't reprouce the fail (b2g_emulator_vm try debug test xpcshell-2) of comment 26 locally or on Try.
Assignee | ||
Comment 28•10 years ago
|
||
Try looks good: https://tbpl.mozilla.org/?tree=Try&rev=82327d4451fc I didn't ask for another review for patch v2 since I added only a null check and comment.
Keywords: checkin-needed
Comment 29•10 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/c48992ed7bfc
Keywords: checkin-needed
Comment 30•10 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/c48992ed7bfc
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
You need to log in
before you can comment on or make changes to this bug.
Description
•