Closed Bug 1070912 Opened 10 years ago Closed 10 years ago

Leak CompositorThread when Nuwa is enabled on debug builds

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla36

People

(Reporter: ting, Assigned: ting)

References

Details

Attachments

(3 files, 3 obsolete files)

wip v1 10 years ago Ting-Yu Chou [:ting] (away) 2.29 KB, patch		Details \| Diff \| Splinter Review
patch v1 10 years ago Ting-Yu Chou [:ting] (away) 1.91 KB, patch	khuey : review+	Details \| Diff \| Splinter Review
test.py 10 years ago Ting-Yu Chou [:ting] (away) 871 bytes, text/x-python		Details
test.py 10 years ago Ting-Yu Chou [:ting] (away) 855 bytes, text/x-python		Details
test_logs.zip 10 years ago Ting-Yu Chou [:ting] (away) 471.25 KB, application/zip		Details
patch v2 10 years ago Ting-Yu Chou [:ting] (away) 2.34 KB, patch		Details \| Diff \| Splinter Review

Ting-Yu Chou [:ting] (away)

Assignee

Description

•

10 years ago

+++ This bug was initially created as a clone of Bug #1050122 +++

The content parent of Nuwa opens PCompositor and PImageBridge which the corresponding CrossProcessCompositorParent and ImageBridgeParent both holds a reference to CompositorParent::sCompositorThreadHolder. This makes mochitest time out when it is finished since the refcnt of sCompositorThreadHolder does not decrease to 0 and CompositorParent::ShutDown() never ends.

STR:
1. Remove "#ifndef DEBUG" wraps "dom.ipc.processPrelaunch.enabled" in b2g.js to enable it always.
2. Build with B2G_DEBUG=1 for emulator.
3. Run any mochitest, e.g., |$ ./mach -v mochitest-remote dom/inputmethod/|
4. The test will time out after a while, the message is something like:
   DeviceRunner TEST-UNEXPECTED-FAIL | ... | application timed out after 450.0 seconds with no output

Ting-Yu Chou [:ting] (away)

Assignee

Updated

•

10 years ago

Assignee: nobody → tchou

Status: NEW → ASSIGNED

Ting-Yu Chou [:ting] (away)

Assignee

Comment 1

•

10 years ago

Cervantes and Partick suggest use KillHard() to kill Nuwa process and destroy its CompositorParent and ImageBridgeParent. I tried, and the mochitest then can finish, but the leakcheck receives nothing due to crash:

  PROCESS-CRASH | mozrunner-startup | application crashed [@ libxul.so + 0x361760]
  Crash dump filename: /tmp/tmpBEvMO0/15711654-47ed-56ac-1614d5aa-56ddbfad.dmp
  TEST-INFO | leakcheck | threshold set at 5180 bytes
  WARNING | leakcheck | missing output line for total leaks!

The partial stack from minidump:

   0  libxul.so + 0x361760
        nsCycleCollector::Suspect(void*, nsCycleCollectionParticipant*, nsCycleCollectingAutoRefCnt*)
        /home/ting/w/fx/os/emulator/gecko/xpcom/base/nsCycleCollector.cpp:3443

        MOZ_ASSERT(HasParticipant(aPtr, aParti),
                   "Suspected nsISupports pointer must QI to nsXPCOMCycleCollectionParticipant");

   1  libxul.so + 0x1fe3015
   2  libxul.so + 0xf196a5
        nsCycleCollectingAutoRefCnt::decr(void*, nsCycleCollectionParticipant*, bool*)
        /home/ting/w/fx/os/emulator/objdir-gecko/dom/ipc/../../dist/include/nsISupportsImpl.h:282
   3  libxul.so + 0xf19739
        mozilla::dom::ContentParent::Release()
        /home/ting/w/fx/os/emulator/gecko/dom/ipc/ContentParent.cpp:2543
   4  libxul.so + 0xf146af
        RunnableMethodTraits<mozilla::dom::ContentParent>::ReleaseCallee(mozilla::dom::ContentParent*)
        /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/task.h:263
   5  libxul.so + 0xf146d7
        ~RunnableMethod
        /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/task.h:303
   6  libxul.so + 0x575ef3
        MessageLoop::DeletePendingTasks()
        /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/message_loop.cc:422
   7  libxul.so + 0x6f3df3
        mozJSComponentLoader::UnloadModules()
        /home/ting/w/fx/os/emulator/gecko/js/xpconnect/loader/mozJSComponentLoader.cpp:1075
   8  libxul.so + 0x1fd82c3
   9  libxul.so + 0x5776c1
        ~MessageLoop
        /home/ting/w/fx/os/emulator/gecko/ipc/chromium/src/base/message_loop.cc:176
  10  0x7ffe

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 2

•

10 years ago

It looks like that is too late to destroy ContentParent.  The cycle collector is probably shut down already.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 3

•

10 years ago

Yes, you're right, thank you for saving my day. I am seeking other proper places for KillHard(), but seems "xpcom-shutdown" is the earliest event for the ContentParent...

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 4

•

10 years ago

Can you figure out why we are putting IPC events into the event queue when we call KillHard?

Ting-Yu Chou [:ting] (away)

Assignee

Comment 5

•

10 years ago

Which IPC events did you mean?

I put KillHard() after Close() in ContentParent::ShutDownProcess() for comment 1, which is triggered by "xpcom-shutdown".

Ting-Yu Chou [:ting] (away)

Assignee

Comment 6

•

10 years ago

I am not sure I understand correctly, but if you meant MessageLoop::DeletePendingTasks() in the stack, in KillHard() there's a delayed task ContentParent::ShutDownProcess post to MessageLoop::current().

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 7

•

10 years ago

Try removing that delay.  Justin and I never understood why it was there.  See Bug 894191 comment 4, etc.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 8

•

10 years ago

It now shutdowns nicely, I will double check bug 894191.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 9

•

10 years ago

From the code, the delayed ShutDownProcess seems redundant as ActorDestroy always calls it. I have asked Jeff on IRC why it was added, seems it's because ShutDownProcess didn't always get called, but as it was 1 year ago, he couldn't remember it clearly. He mentioned the issue happened while running pressure test, which launch apps repeatedly for 1 to 5 days and trigger a lot of OOM killing.

I checked with with 2.2 on my Falme, when an app is killed by SIGKILL, KillHard does not get called, but ContentParent::OnChannelError, ActorDestroy and ShutDownProcess.

My guess is when KillHard gets called, somehow channel error is not triggered. Do we have such kind of test to make sure removing the delay won't repro bug 894191?

Ting-Yu Chou [:ting] (away)

Assignee

Comment 10

•

10 years ago

Attached patch wip v1 (obsolete) — Details — Splinter Review

KillHard Nuwa process and removed the delayed ShutDownProcess.

https://tbpl.mozilla.org/?tree=Try&rev=687282a7c5bc

Ting-Yu Chou [:ting] (away)

Assignee

Comment 11

•

10 years ago

Another try with the patch of bug 1050122: https://tbpl.mozilla.org/?tree=Try&rev=e312f82202f9, the timeout goes away, and comes new crashes in xpcshell tests.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 12

•

10 years ago

Attached patch patch v1 (obsolete) — Details — Splinter Review

Attachment #8495093 - Attachment is obsolete: true

Attachment #8498769 - Flags: review?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 13

•

10 years ago

I'm surprised that you're getting rid of the call to ShutdownProcess entirely.  Why is that ok?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Updated

•

10 years ago

Flags: needinfo?(tchou)

Ting-Yu Chou [:ting] (away)

Assignee

Comment 14

•

10 years ago

You're right, I should test before removing it.

The delay ShutdownProcess looks like a workaround, and I think it's a chance to replace it with a proper fix which why OnChannelError is not always happened when child process gets killed.

I am thinking either 1) simply add a condition checking before sending delay ShutdownProcess, or 2) remove the delay ShutdownProcess and use Marionette to test like how LG did test before, see if bug 894191 can be reproduced, if yes, fix it.

How do you think?

Flags: needinfo?(tchou)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 15

•

10 years ago

#2 is clearly the correct path, but I'll accept #1 if you don't have time to do that.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 16

•

10 years ago

Comment on attachment 8498769 [details] [diff] [review]
patch v1

I see. Will do #2 later when I have time.

Attachment #8498769 - Flags: review?(khuey)

Ting-Yu Chou [:ting] (away)

Assignee

Comment 17

•

10 years ago

Attached file test.py (obsolete) — Details

I will test this on my Flame, and see whether the number of ContentParent::OnChannelError() invocations equals to the number of "send sigkill" from dmesg.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 18

•

10 years ago

I left the device (Flame 319MB) running attachment 8502397 [details] during the weekend, the script ran ~20 hours and stopped for some exception from Marionette but the device still works.

There were 2993 ContentParent::OnChannelError invocations, and 2842 "send sigkill to [pid]" from dmesg. I am not sure why it's not matched, but bug 894191 was not reproduced, though ~20 hours is not long enough compare to what LG tested.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 19

•

10 years ago

(In reply to Ting-Yu Chou [:ting] from comment #18)
> There were 2993 ContentParent::OnChannelError invocations, and 2842 "send
> sigkill to [pid]" from dmesg. I am not sure why it's not matched

It's because I missed OOM killer, I'll try more runs.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 20

•

10 years ago

Attached file test.py — Details

Remove the sleep after back to homescreen to launch app more often.

Attachment #8502397 - Attachment is obsolete: true

Ting-Yu Chou [:ting] (away)

Assignee

Comment 21

•

10 years ago

Attached file test_logs.zip — Details

I had 3 test runs this week, and does not repro bug 894191 with the patch (attachment 8498769 [details] [diff] [review]):

1. ~24h10m: 5603 kill, 5601 ContentParent::OnChannelError
            2 kills are not content process (2 * Binder_2)
2. ~23h15m: 4558 kill, 4554 ContentParent::OnChannelError
            4 kills are not content process (2 * Binder_1, 1 * Analysis Helper, 1 * Binder_2)
3. ~15h25m: 3060 kill, 3058 ContentParent::OnChannelError
            2 kills are not content process (2 * Analysis Helper)

Besides, I learnt from Thinker & Shelly that there was a bug 811636 which parent is not notified when child process dies, and the patch was landed on mozilla_b2g18 one week after bug 894191. I doubt bug 894191 is a dupe of bug 811636.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 22

•

10 years ago

Kyle, how do you think for comment 21?

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 23

•

10 years ago

Let's land it!

Flags: needinfo?(khuey)

Ting-Yu Chou [:ting] (away)

Assignee

Updated

•

10 years ago

Attachment #8498769 - Flags: review?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Updated

•

10 years ago

Attachment #8498769 - Flags: review?(khuey) → review+

Ting-Yu Chou [:ting] (away)

Assignee

Comment 24

•

10 years ago

Try is not green this time: https://tbpl.mozilla.org/?tree=Try&rev=58089e267227, I'll check it.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 25

•

10 years ago

For some reasons, the Close() call in ShutdownProcess(0) now goes to ActorDestroy() synchronously and clear mSubprocess.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 26

•

10 years ago

Attached patch patch v2 — Details — Splinter Review

Added a protector before accessing mSubprocess in KillHard().

Try: https://tbpl.mozilla.org/?tree=Try&rev=174b091b15d9. There're a bunch of ScriptTimeoutException in marionette-webapi tests, I am not quite sure are they caused by the patch, the exception is from marionette.py. Test xpcshell-2 is also failed with timeout, I'll check it.

Attachment #8498769 - Attachment is obsolete: true

Ting-Yu Chou [:ting] (away)

Assignee

Comment 27

•

10 years ago

With the latest code today, I couldn't reprouce the fail (b2g_emulator_vm try debug test xpcshell-2) of comment 26 locally or on Try.

Ting-Yu Chou [:ting] (away)

Assignee

Comment 28

•

10 years ago

Try looks good: https://tbpl.mozilla.org/?tree=Try&rev=82327d4451fc

I didn't ask for another review for patch v2 since I added only a null check and comment.

Keywords: checkin-needed

Carsten Book [:Tomcat]

Comment 29

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/c48992ed7bfc

Keywords: checkin-needed

Carsten Book [:Tomcat]

Comment 30

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/c48992ed7bfc

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla36

You need to log in before you can comment on or make changes to this bug.