Closed Bug 945992 Opened 7 years ago Closed 6 years ago

Crash in CompositorChild::ActorDestroy while receiving calls and SMS continuously

Categories

(Core :: IPC, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

()

RESOLVED INVALID
blocking-b2g 1.3+

People

(Reporter: ggrisco, Assigned: jerry)

References

Details

(Keywords: crash, Whiteboard: [caf-crash 48][caf priority: p2][CR 584125][b2g-crash])

Attachments

(5 files)

1. Enable auto answer.
2. Receive MT call and MT SMS continuously.
3. After few hours of run, mini dumps are generated in the phone.

[@ mozalloc_abort(char const*) | NS_DebugBreak | mozilla::layers::CompositorChild::ActorDestroy(mozilla::ipc::IProtocolManager::ActorDestroyReason) | mozilla::layers::PCompositorChild::DestroySubtree(mozilla::ipc::IProtocolManager::ActorDestroyReason) ]
from the .extra file:

Notes=xpcom_runtime_abort([Child 3865] ###!!! ABORT: ActorDestroy by IPC channel failure at CompositorChild: file /local/mnt/workspace/lnxbuild/project/release_dev_msm8610_2044278/checkout/gecko/gfx/layers/ipc/CompositorChild.cpp, line 120)
blocking-b2g: --- → 1.3?
Component: General → Graphics: Layers
Keywords: crash
Product: Firefox OS → Core
Whiteboard: [CR 584125] → [CR 584125][b2g-crash]
blocking-b2g: 1.3? → 1.3+
Ken, do you know who could help this crash issue?
Flags: needinfo?(kchang)
CJ, although this bug is happened in graphics module, it seems not a graphic issue. But is it possible to have someone from your team to analyse this bug first?
Flags: needinfo?(kchang) → needinfo?(cku)
Jerry, please do a favor on this crash
Flags: needinfo?(cku) → needinfo?(hshih)
Hi Greg,

According to bug 925324 and bug 860892, we get crash when ipc broken.

Can you get the backtrace when we meet PostErrorNotifyTask()?
http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#1466
http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#1364

We need to know why the ipc broken.
Flags: needinfo?(hshih) → needinfo?(ggrisco)
There are some use cases ipc causes error. Bug 866483 is one example.
Hi Greg,

Can you also provide the logcat/dmesg log?
We can check if it is the same problem as bug 866483.
Hi Jerry,

we're waiting to see if this crash reproduces again; we haven't seen it in last couple test runs.  I'll keep the ni? and check back in if we start seeing it again.
We also want to know which process abort.
Is it sms app or other app(ex. homescreen)?
Each process has their CompositorChild.
Thanks!
In a submitted crash report, or in the .extra file mentioned in comment #2, there should be a URL which points to the app manifest, and that should be able to tell us which app process crashed.
Notes=xpcom_runtime_abort([Child 3865] ###!!! ABORT: ActorDestroy by IPC channel failure at CompositorChild: file /local/mnt/workspace/lnxbuild/project/release_dev_msm8610_2044278/checkout/gecko/gfx/layers/ipc/CompositorChild.cpp, line 120)
URL={{template}}
Jerry,

Any updates on next steps?
Flags: needinfo?(hshih)
According to the mini dump, comment 11 and comment 12, I still can't know why the ipc brocken and what process abort.

In comment 12, we see the URL is "template."
Does this "template" app relate to SMS?

In mini dump, we just see that b2g might post the ipc error handling task at:
http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#1466
http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#1364

Greg, what's your device? Can we have more detail step to reproduce this issue?
Flags: needinfo?(hshih)
Still only seen once on QRD.  The build was:

gaia:  f23aebebcd28c4d19347def77c4836c8baebc2ce
gecko: 0e3362fb5625eb6d98c7617b1b3019a2cc553d47

I don't have any other STR details beyond what's in the Description.  These are stability scripts that are run over the weekend, doing multiple things in combination.
Can we have a logcat/dmesg log? attachment 8342015 [details] just says IPC error happened by some reason. It have no information about the cause of the IPC error. We can not reproduce the creash. So, without the logcat/dmesg, we can not analyze the bug more.
Flags: needinfo?
Flags: needinfo?
Greg, can you provide logcat/dmesg log as in Comment 16?
Flags: needinfo?(ggrisco)
Hi Greg,

Can you provide the logcat and dmesg log when this issue happened?
If you can use gdb, please also add the break point at:
http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#1466
http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#1364
then dump the call stack when we stop at these break points. 

The ipc system also provides debug message.
If we build a debug version b2g, we can export MOZ_IPC_MESSAGE_LOG env variable to enable the debug function.

adb shell
root@android:/ # stop b2g                         //stop the b2g first
root@android:/ # export MOZ_IPC_MESSAGE_LOG=1
root@android:/ # b2g.sh

You will see some message like
[time:1365410665694116][916][PContentChild] Sending Msg_AsyncMessage([TODO])
[time:1365410665701592][916][PContentChild] Sending Msg_SyncMessage([TODO])
[time:1365410665702599][780][PContentParent] Received Msg_AsyncMessage([TODO])
[time:1365410665710229][780][PContentParent] Received Msg_SyncMessage([TODO])
[time:1365410665716943][780][PContentParent] Sending reply Reply_SyncMessage([TODO])
[time:1365410665718224][916][PContentChild] Received reply Reply_SyncMessage([TODO])
[time:1365410665739037][916][PIndexedDBDatabaseChild] Sending Msg_PIndexedDBTransactionConstructor([TODO])
ABORT is intentionally called at CompositorChild::ActorDestroy(ActorDestroyReason aWhy)
- http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorChild.cpp#119

When CompositorChild is shutdown by IPC error, this abort is always called. It is added by Bug 860892. It is added in a half year ago. At the time, b2g graphic layers are very unstable . IPC errors happned by some reasons. After IPC error happens, all IPC layer tree is deleted by DestroySubtree(), but layers code continue to use gfx layers IPC code, and then causes the crash in almost random point in gfx codec. Since last spring, gfx layer code was refactored, IPC related code is protected more and finally, Bug 944703 fixed the remaining part. It might be better to remove the abort in the above.

NB. But the removing the abort does not fix the cause of the IPC error.
Unfortunately, I can't provide what you're asking for if we're not able to reproduce the problem.  I'm waiting to see if this happens more regularly then we can add the logging that you're looking for.  If we don't see this again within the next week then we can probably close this bug.
Flags: needinfo?(ggrisco)
Closing for now until since it is not reproducing
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(ggrisco)
Resolution: --- → WORKSFORME
(In reply to Greg Grisco from comment #21)
> Closing for now until since it is not reproducing

Please provide the log if it occurs again.
Issue reproduced in stability testing.
Greg, can you please add the logs from our internal issue report.
Status: RESOLVED → REOPENED
Flags: needinfo?(ggrisco)
Resolution: WORKSFORME → ---
Attached file adb logfile (zipped)
Flags: needinfo?(ggrisco)
Attached file dmesg (zipped)
Jerry, any update here? Did you get a chance to look into the log messages?
Flags: needinfo?(hshih)
Assigning to Jerry as he was looking at it anyway.
Assignee: nobody → hshih
Jerry,

Any update here??
I am checking the log now.

Greg, does the test device base on android jelly beans or upon?
Bug 940851 has fixed one ipc error problem on JB platform.
The fix is committed after you test environment.
(according the comment 15, gecko: 0e3362fb5625eb6d98c7617b1b3019a2cc553d47)
It perhaps relates to this problem.
Flags: needinfo?(hshih) → needinfo?(ggrisco)
Hi Jerry,

yes, 1.3 is based on JB.

The history of this crash is that we've seen it once on a build built on 12/02 and then once again on a build made on 12/25.  The fix for Bug 940851 came on 12/03 it looks like, so this crash was still seen after that fix.
Flags: needinfo?(ggrisco)
Greg, what version of gaia and gecko does the test use now?
I need to sync my test env as yours.
Flags: needinfo?(ggrisco)
In logcat log(attachment 8356354 [details]), we can see there is a "pipe error".

12-30 16:19:07.117  5593  5603 I Gecko   : [Child 5593] WARNING: pipe error (3): Connection reset by peer: file /local/mnt/workspace/lnxbuild/project/release_dev_msm8610_2199240/checkout/gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 446
.......
12-30 16:19:07.117  5593  5593 I Gecko   : ###!!! [Child][MessageChannel] Error: Channel closing: too late to send/recv, messages will be lost
12-30 16:19:07.147   300   300 E GeckoConsole: [JavaScript Error: "AbortError"]


I have tried to break the parent side ipc channel, and I can see the child side abort() if I try to send message via ipc.

I/Gecko   ( 2351): ###!!! [Child][MessageChannel::SendAndWait] Error: Channel error: cannot send/recv
I/Gecko   ( 2351): ###!!! [Child][MessageChannel] Error: Channel error: cannot send/recv
I/Gecko   ( 2351): [Child 2351] ###!!! ABORT: ActorDestroy by IPC channel failure at LayerTransactionChild: file ../../../gfx/layers/ipc/LayerTransactionChild.cpp, line 94


So I think that the crash flow is:
1. pipe error occur.
http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#446
2. call OnChannelError() in OnFileCanReadWithoutBlocking().
http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#829
3. process OnChannelError(). Finally call CompositorChild::ActorDestroy() with "AbnormalShutdown" status and abort.
http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorChild.cpp#132


I can't get more info from logcat and dmesg log. I just know that the ipc error triggers this abort.
Maybe parent side process dies or closes the ipc channel or other reason.
I need more time to think how to narrow down this problem.

--
Change component category to ipc. I think this is not a graphic layer problem.
Component: Graphics: Layers → IPC
Hi Greg, I can reproduce the same error message "pipe error (3): Connection reset by peer" by closing the parent side socket fd.

I/Gecko   ( 3444): [Parent 3444] WARNING: pipe error (95): Bad file number: file /tmp_data/mozilla/gecko/git/mozilla_central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 489    <=close parent side socket
I/Gecko   ( 3663): [Child 3663] WARNING: pipe error (3): Connection reset by peer: file /tmp_data/mozilla/gecko/git/mozilla_central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 489    <=get "pipe error (3): Connection reset by peer" message


Maybe this crash was made by someone close the socket fd at parent side.
What's your gecko and gaia git commit id(use "git log") for your test environment?
Do you use hardware composer?
Is bug 940851 applied?
jerry, I think there's some mixing of cause and effect here. A pipe error typically means that the other process crashed. So if a child process (presumably a content process) is seeing a pipe error, that would normally mean that the main b2g process had crashed.

Is that what's actually happening in this log? Or is the main process still alive?
Flags: needinfo?(hshih)
Hi Benjamin,

Here are the logcat log(attachment 8356354 [details]).

                    pid   tid
12-30 16:19:06.907   300   300 I Gonk    : Changed nice for pid 5593 from 18 to 18.
12-30 16:19:06.987  5593  5593 I Gecko   : [SharedSurface_Gralloc 0xb3a8cd40] destroyed
12-30 16:19:07.117  5593  5603 I Gecko   : [Child 5593] WARNING: pipe error (3): Connection reset by peer: file /local/mnt/workspace/lnxbuild/project/release_dev_msm8610_2199240/checkout/gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 446
12-30 16:19:07.117   300   300 I Gecko   : -*- QCMessageManager_QC_B2G: receiveMessage: child-process-shutdown arrived from content process
12-30 16:19:07.117   300   300 I Gecko   : -*- QCMessageManager_QC_B2G: Unregistered mobileconnection target: [object ChromeMessageSender]
12-30 16:19:07.117   300   300 I Gecko   : -*- QCMessageManager_QC_B2G: Unregistered icc target: [object ChromeMessageSender]
12-30 16:19:07.117  5593  5593 I Gecko   : 
12-30 16:19:07.117  5593  5593 I Gecko   : ###!!! [Child][MessageChannel] Error: Channel closing: too late to send/recv, messages will be lost
12-30 16:19:07.117  5593  5593 I Gecko   : 
12-30 16:19:07.147   300   300 E GeckoConsole: [JavaScript Error: "AbortError"]
12-30 16:19:07.487   300   300 I Gonk    : Setting nice for pid 1220 to 1
12-30 16:19:07.487   300   300 I Gonk    : Changed nice for pid 1220 from 0 to 1.
12-30 16:19:07.487   300   300 I Gonk    : Setting nice for pid 979 to 18

I guess the b2g process pid is 300 because it initials some system-like function.
We can see that the pid 300 is still alive.

In comment 33, I have tried to close a b2g parent side ipc socket which connects to a content process.
If I close the b2g parent side socket before reading data, the child side will get a pipe error message and abort at some checking condition. In my test, b2g process is also alive.
Flags: needinfo?(hshih)
When you say "I have tried to close a b2g parent side ipc socket" what does that mean? You manually called close() on the pipe fd?

Most content process messages are managed through the PContent IPDL hierarchy. If the parent process dies or the socket is closed for some other reason, the process should crash (we do it intentionally in ContentChild::ActorDestroy).

PCompositor is another IPDL hierarchy, so the question is really why the channel is closing unexpectedly, and whether this is ever an expected condition.

Normally the toplevel actor (PCompositorParent) is closed by calling its Close() method. Perhaps somebody is destroying a CompositorParent without closing it first? That would trigger this sequence of events.

If this is ever expected, you should remove the abort and make sure that we clean up all the graphics resources properly: this means that basically all of the PCompositorChild subtree implementations would need an ActorDestroy method to perform cleanup.

If closing is never expected, then aborting is correct, and you really need to figure out why the parent process is closing the pipe without first sending the correct closing message. Perhaps adding an assert to CompositorParent that it has already been closed would be a place to start debugging. It's a little hard because MessageChannel doesn't appear to expose its open/close state (stored on mChannelState).
Benjamin

How do we proceed here? What next steps are needed to move on?
Flags: needinfo?(benjamin)
Well, you need somebody from graphics to tell us whether deleting a CompositorParent without closing it is normal. Then you can either add asserts to CompositorParent and debug this to figure out why we're deleting it without closing it, *or* fix the abort in CompositorChild.
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #36)
> When you say "I have tried to close a b2g parent side ipc socket" what does
> that mean? You manually called close() on the pipe fd?
> 

We have a bug 940851 which closes an unexpected fd in HAL.
I guess that we might have the same problem here, so I try to close the pipe manually here.
Hi Greg and Inder,

Can you check the comment 33?
What frequency does this crash occur?
Can we reproduce this problem at mozilla's develop phone(ex. unagi, nexus4)?
Flags: needinfo?(ikumar)
(In reply to Jerry Shih[:jerry] from comment #40)
> Hi Greg and Inder,
> 
> Can you check the comment 33?
> What frequency does this crash occur?
comment 30, twice.
> Can we reproduce this problem at mozilla's develop phone(ex. unagi, nexus4)?
> Can we reproduce this problem at mozilla's develop phone(ex. unagi, nexus4)?
I guess you should be able to but may require multiple attempts considering its happening so rarely. The test steps are in the bug Description. Not sure if it matters but the issue was first reported on JB based reference device.
Flags: needinfo?(ikumar)
Flags: needinfo?(nical.bugzilla)
The interesting buts of PCompositor's shutdown that I know of are mostly here:

http://dxr.mozilla.org/mozilla-central/source/widget/xpwidgets/nsBaseWidget.cpp?from=nsBaseWidget.cpp#162
I am not sure whether B2G uses that or if it does it's own thing, I need to dig a bit further, but so far, grepping around, I haven't found other places where we destroy CompositorParent.
Interestingly the Close() call here is commented out with a reference to a fixed bug so maybe it should be uncommented back. So it looks like we don't Close the channel (on the non-B2G case at least).

> this means that basically all of the PCompositorChild subtree implementations would need an ActorDestroy method 
> to perform cleanup.

All or most actors need to implement ActorDestroy anyway (at least if is is managing shared resources) because an app can be killed at any moment so we need to be sure that the compositor process survives any brutal IPC shutdown.
Flags: needinfo?(nical.bugzilla)
I may be confused here, but the "normal" rules for IPDL actor shutdown is that *Parent actors should always have an ActorDestroy method and perform cleanup. But normally *Child actors don't, because if a method fails the process aborts anyway.

Compositor might be special, since I wasn't aware that there was a special compositor *process* (I thought it was part of the main/b2g process).
Removing my ni (comment 31) since this was already repro'd.
Flags: needinfo?(ggrisco)
Jerry, if I understand correctly, we can't reproduce this on one of the devices we have access to? Is there enough information in this bug to keep going, or are we stuck?
Flags: needinfo?(hshih)
Flags: needinfo?(cku)
(In reply to Jerry Shih[:jerry] from comment #40)
> Hi Greg and Inder,
> 
> Can you check the comment 33?
> What frequency does this crash occur?
> Can we reproduce this problem at mozilla's develop phone(ex. unagi, nexus4)?

From Jerry's investigation, we are suspicious this issue was related to bug 940851 that the HWComposer tried to close the fd which may be used for IPC.

Greg, could you help to update the gecko commit you used or check below patch is included in your testing image?

https://bug940851.bugzilla.mozilla.org/attachment.cgi?id=8341461


(In reply to Milan Sreckovic [:milan] from comment #46)
> Jerry, if I understand correctly, we can't reproduce this on one of the
> devices we have access to? Is there enough information in this bug to keep
> going, or are we stuck?

From comment 0, it was reproduced by QCom's test script. I think it is hard to reproduce in Mozilla side if we didn't have the same testing environment.
Flags: needinfo?(ggrisco)
The fix for bug 940851 is in our tree
Flags: needinfo?(ggrisco)
(In reply to Milan Sreckovic [:milan] from comment #46)
> Jerry, if I understand correctly, we can't reproduce this on one of the
> devices we have access to? Is there enough information in this bug to keep
> going, or are we stuck?

I can't reproduce this on my devices and I don't have the test script.
My next step is to search whether we close the socket fd or delete the ipc actor for other reason in b2g code.
Maybe we can hook the close() function.

Actually, I have no idea about this problem.


It happens so rarely(comment 42). Should we block 1.3+?
Flags: needinfo?(milan)
Flags: needinfo?(hshih)
Flags: needinfo?(cku)
(In reply to Michael Vines [:m1] [:evilmachines] from comment #48)
> The fix for bug 940851 is in our tree

Hi Michael,

Could you provide the gaia and gecko part commit id?
I need to sync the test environment.
Flags: needinfo?(mvines)
We have still seen this crash only twice, once on 12/02 build and another time on 12/25 build.
(In reply to Jerry Shih[:jerry] from comment #50)
> Could you provide the gaia and gecko part commit id?
> I need to sync the test environment.

https://www.codeaurora.org/cgit/quic/lf/b2g/manifest/commit/?h=release&id=b0853f9cc1ea018c23371204f3fa557dd210314e
Flags: needinfo?(mvines)
QC has this blocking their CS milestone, so, yes, we have to stay blocking for 1.3 until and unless they remove that dependency.
Flags: needinfo?(milan)
Let's just close this one for now, will re-open if we see it more consistently.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → WORKSFORME
FYI, attachment 8362156 [details] [diff] [review] in Bug 946720 can be used to add ipc log to logcat.

And from the following log in attachment 8356354 [details] and Bug 946720 Comment 37. There is a one possibility that b2g's compositor thread is crashed or not responding. But it is just a one possibility.

[Child 5593] WARNING: pipe error (3): Connection reset by peer: file /local/mnt/workspace/lnxbuild/project/release_dev_msm8610_2199240/checkout/gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 446
We also meet this crash when video playback.
Status: RESOLVED → REOPENED
Flags: needinfo?(ttsai)
Resolution: WORKSFORME → ---
(In reply to James Zhang from comment #56)
> We also meet this crash when video playback.

James, can you open a new bug for it? CompositorChild::ActorDestroy() just says that IPC was shutdown because of IPC failure. The crash log say nothing about the actual cause of failure. To debug the problem, need to get logcat with IPC log enabled.
Flags: needinfo?(james.zhang)
This is a already closed bug. Close a again. New problem should be handled by a new bug.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(ttsai)
(In reply to Sotaro Ikeda [:sotaro] from comment #58)
> This is a already closed bug. Close a again. New problem should be handled
> by a new bug.

Ying, please file a new bug to track, thanks.
Flags: needinfo?(james.zhang) → needinfo?(ying.xu)
Flags: needinfo?(ying.xu)
Whiteboard: [CR 584125][b2g-crash] → [caf priority: p2][CR 584125][b2g-crash]
We are seeing this crash again (3 times on AU 60).  Comment 60 and comment 61 contain the latest minidump and extra file (with logs)
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Greg - This is an old bug from 1.3 - Can you open a new bug for this issue?
Flags: needinfo?(ggrisco)
Whiteboard: [caf priority: p2][CR 584125][b2g-crash] → [caf-crash 48][caf priority: p2][CR 584125][b2g-crash]
Observed on: 

Device: 
Gonk Version: AU_LINUX_GECKO_B2G_JB_3.2.01.03.00.112.255
Moz BuildID: 20140226004002
B2G Version: 1.3
Gecko Version: 28.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=8039a5cb7519adfa81677df577f494c6a4de6140
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=7599b5ccc906f556076798fc062a9e51e2c0eece
(In reply to Jason Smith [:jsmith] from comment #63)
> Greg - This is an old bug from 1.3 - Can you open a new bug for this issue?

Ok, closing this one first.
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Flags: needinfo?(ggrisco)
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.