Closed
Bug 1174243
Opened 9 years ago
Closed 9 years ago
Crash in telephony while running stability tests
Categories
(Firefox OS Graveyard :: Vendcom, defect, P1)
Tracking
(blocking-b2g:2.2+, b2g-v2.2 fixed, b2g-master fixed)
RESOLVED
FIXED
blocking-b2g | 2.2+ |
People
(Reporter: ggrisco, Assigned: pgravel)
References
Details
(Keywords: crash, Whiteboard: [POVB])
Crash Data
Attachments
(16 files, 1 obsolete file)
140.01 KB,
text/plain
|
Details | |
128.24 KB,
text/plain
|
Details | |
138.32 KB,
text/plain
|
Details | |
160.88 KB,
text/plain
|
Details | |
141.77 KB,
text/plain
|
Details | |
161.31 KB,
text/plain
|
Details | |
141.51 KB,
text/plain
|
Details | |
135.80 KB,
text/plain
|
Details | |
150.92 KB,
text/plain
|
Details | |
127.49 KB,
text/plain
|
Details | |
5.50 MB,
text/plain
|
Details | |
141.79 KB,
text/plain
|
Details | |
135.80 KB,
text/plain
|
Details | |
7.27 KB,
patch
|
Details | Diff | Splinter Review | |
141.34 KB,
text/plain
|
Details | |
152.71 KB,
text/plain
|
Details |
Seeing a frequent crash (SIGSEGV 0x0) with the following crash signature:
[@ mozilla::dom::telephony::TelephonyRequestChild::DoResponse | mozilla::dom::telephony::PTelephonyRequestChild::OnMessageReceived | mozilla::dom::PContentChild::OnMessageReceived | mozilla::ipc::MessageChannel::DispatchAsyncMessage ]
while running stability tests. We have seen this 20 times in only 100 hours of testing so far, so we definitely need a fix for this.
cafbot will upload logs in a bit.
Comment 1•9 years ago
|
||
Comment 2•9 years ago
|
||
Updated•9 years ago
|
Whiteboard: [CR 853223] → [caf priority: p1][CR 853223]
Updated•9 years ago
|
Whiteboard: [caf priority: p1][CR 853223] → [b2g-crash][caf-crash 645][caf priority: p1][CR 853223]
Comment 3•9 years ago
|
||
Observed on:
Device: msm8909
Gonk Version: AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.180
Moz BuildID: 20150606002503
Manifest: https://www.codeaurora.org/cgit/quic/lf/b2g/manifest/tree/caf_AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.180.xml?h=release
B2G Version: v2.2
Gecko Version: 37.0
Gaia: http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=8fc797527a3eca7665bc1d1828848f2fb77ca99f
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e0045f9c8b7e84fc52ba628141688c8ecb4b7a52
Patches: bug 1133147, bug 1167799, bug 1162663
Comment 4•9 years ago
|
||
Comment 5•9 years ago
|
||
Comment 6•9 years ago
|
||
Hsinyi, could you find someone on this? thanks.
blocking-b2g: 2.2? → 2.2+
Flags: needinfo?(htsai)
Priority: -- → P1
Comment 7•9 years ago
|
||
(In reply to Bobby Chien [:bchien] from comment #6)
> Hsinyi, could you find someone on this? thanks.
Sure. This is already on our team's radar. Ben will help this.
Flags: needinfo?(htsai)
Comment 8•9 years ago
|
||
Observed on:
Device: msm8909
Gonk Version: AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.181
Moz BuildID: 20150606002503
Manifest: https://www.codeaurora.org/cgit/quic/lf/b2g/manifest/tree/caf_AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.181.xml?h=release
B2G Version: v2.2
Gecko Version: 37.0
Gaia: http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=8fc797527a3eca7665bc1d1828848f2fb77ca99f
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e0045f9c8b7e84fc52ba628141688c8ecb4b7a52
Patches: bug 1133147, bug 1167799, bug 1162663
Comment 9•9 years ago
|
||
Comment 10•9 years ago
|
||
Comment 11•9 years ago
|
||
Hi HsinYi,
I change the owner to Ben and please feel free to change the owner if someone else can help. Thanks!
Assignee: nobody → bhsu
Comment 12•9 years ago
|
||
This is causing major instability in our builds. Can you please provide an ETA for the fix?
Comment 13•9 years ago
|
||
Observed on:
Device: msm8909
Gonk Version: AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.183
Moz BuildID: 20150606002503
Manifest: https://www.codeaurora.org/cgit/quic/lf/b2g/manifest/tree/caf_AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.183.xml?h=release
B2G Version: v2.2
Gecko Version: 37.0
Gaia: http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=8fc797527a3eca7665bc1d1828848f2fb77ca99f
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e0045f9c8b7e84fc52ba628141688c8ecb4b7a52
Patches: bug 1133147, bug 1167799, bug 1162663
Comment 14•9 years ago
|
||
Comment 15•9 years ago
|
||
Comment 16•9 years ago
|
||
Hi Inder,
I cannot provide an ETA for this issue, since we are still finding its root cause. Now, we are suspecting that the crash is caused by a failure when |do_QueryInterface(...)| [1], and it further induces a segmentation fault in [2].
In fact, there are two types of callbacks (nsITelephonyCallback and nsITelephonyDialCallback), we should be use the two callbacks carefully, especially when involing their methods. After crawling into the code, I didn't find any unintentionally |NotifyDialMMIError| in |TelephonyService,js| so far. Therefore, I suggest you to check there is any in vender's RIL.
On the other hand, the logs and the mini-dumps don't provide enough information to solve this issue, so it would be great if you apply this patch and provid |objdir-gecko/dist/lib/libxul.so| and |libxul.so.sym|. Besides, a longer logcat is also appriciated.
[1] http://lxr.mozilla.org/mozilla-b2g37_v2_2/source/dom/telephony/ipc/TelephonyChild.cpp#265
[2] http://lxr.mozilla.org/mozilla-b2g37_v2_2/source/dom/telephony/ipc/TelephonyChild.cpp#272
Flags: needinfo?(ikumar)
Comment 17•9 years ago
|
||
Nick, can you please apply this patch in our build.
Flags: needinfo?(ikumar) → needinfo?(ntroast)
Comment 18•9 years ago
|
||
Comment 19•9 years ago
|
||
Updated•9 years ago
|
Attachment #8624821 -
Attachment description: EXTRA file attachment - → EXTRA file attachment - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.186
Flags: needinfo?(ntroast)
Updated•9 years ago
|
Attachment #8624822 -
Attachment description: decoded minidump - → decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.186
Comment 20•9 years ago
|
||
Observed on:
Device: msm8909
Gonk Version: AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.186
Moz BuildID: 20150606002503
Manifest: https://www.codeaurora.org/cgit/quic/lf/b2g/manifest/tree/caf_AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.186.xml?h=release
B2G Version: v2.2
Gecko Version: 37.0
Gaia: http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=8fc797527a3eca7665bc1d1828848f2fb77ca99f
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e0045f9c8b7e84fc52ba628141688c8ecb4b7a52
Patches: bug 1133147, bug 1167799, bug 1162663, bug 1174243
Comment 21•9 years ago
|
||
Comment 22•9 years ago
|
||
Ben - we have the complete logs for the crash with your debug patch. Please look into it and provide a fix.
Thanks,
-Inder
PS: not sure we understand the need for the libxul.so with the debug patch applied. I am assuming you should be able to build it too, right?
Flags: needinfo?(bhsu)
Comment 23•9 years ago
|
||
(In reply to Ben Hsu [:bhsu] from comment #16)
> Created attachment 8623602 [details] [diff] [review]
> Enhance Debug messages
>
> In fact, there are two types of callbacks (nsITelephonyCallback and
> nsITelephonyDialCallback), we should be use the two callbacks carefully,
> especially when involing their methods. After crawling into the code, I
> didn't find any unintentionally |NotifyDialMMIError| in
> |TelephonyService,js| so far. Therefore, I suggest you to check there is any
> in vender's RIL.
Hi Inder,
Have you also checked your RIL per Ben's suggestion?
Thanks!
Flags: needinfo?(ikumar)
Comment 24•9 years ago
|
||
Yes, Anshul and Phil already took a look.
Adding ni on them to provide more details.
Flags: needinfo?(pgravel)
Flags: needinfo?(ikumar)
Flags: needinfo?(anshulj)
Comment 25•9 years ago
|
||
(In reply to Inder from comment #24)
> Yes, Anshul and Phil already took a look.
> Adding ni on them to provide more details.
I read through attachment 8623602 [details] [diff] [review] and saw the lines below, which seemed point that CAF telephony service somehow unintentionally called |NotifyDialMMIError| as Ben reasoned. According to the log, I suspected that the error might happen in the handler of RIL_REQUEST_SWITCH_WAITING_OR_HOLDING_AND_ACTIVE. Any more idea, Anshul?
=== log snippet ===
32243 06-19 18:59:59.604 22963 22963 I Gecko : Bug 1174243: Invoked, SN(1839118962)^M
32244 06-19 18:59:59.604 22963 22963 I Gecko : Bug 1174243: do_QueryInterface failed.
======
Assignee | ||
Comment 26•9 years ago
|
||
In theory the logic on our side should protect us from sending things on the wrong interface. In order to determine the type of callback, we simply do (pseudo-code):
> nsCOMPtr<nsITelephonyDialCallback> dialCb(do_QueryInterface(callback));
> if (dialCb) {
> // do nsITelephonyDialCallback stuff
> } else {
> // do nsITelephonyCallback stuff
> }
All our callbacks go through this, so this crash *should* never occur. Obviously it is occurring though, which is somewhat perplexing.
This crash occurs in a test is basically dial->hangup->kill(dialer)->dial->hangup->kill(dialer)->etc..., and only seems to crash after several iterations of that test. Is there any chance callback object might get freed or invalid as a result of killing the dialer app?
In all the logs I see the following happen when it crashes
> ->Dial(callA)
> ->Dial(callB)
> ->Hangup(callB) -> callback SN(120564633)
> ->Resume(callA) -> callback SN(1782489557) -> RIL_REQUEST_SWITCH_WAITING_OR_HOLDING_AND_ACTIVE
> --GaiaApps.kill(dialer)
> ->Resume(callA) -> callback SN(1839118962) -> 2nd resume request while there is a pending resume will cause telephony to call callback.NotifyError()
> <-nsITelephonyCallback.NotifyError(generic failure) on 1839118962
> <-RIL_REQUEST_SWITCH_WAITING_OR_HOLDING_AND_ACTIVE {success}
> -- Bug 1174243: Invoked, SN(1839118962) - do_QueryInterface failed
> *crash*
Thanks to the SN# we can tell from the logs the crash doesn't occur due to the first Resume() request, but rather due to the instantly failed 2nd Resume() request that comes right after GaiaApps.kill() is invoked.
I verified that the error notification due to the 2nd Resume() goes through the same interface check I mentioned above, so it should be routed to the correct interface... unless something wonky is happening with the given callback pointer.
I can say that in all my tests I can see the NotifyError() being called on nsITelephonyCallback, but with the caveat I also have yet to reproduce the crash.
I will keep on trying to find a way to locally reproduce this crash and try to catch it in gdb, but in the meantime I'd welcome any insight on the matter.
Flags: needinfo?(pgravel) → needinfo?(htsai)
Comment 27•9 years ago
|
||
Thanks for all the details, Phil. Good to learn the basic test flow. I will review the code again and see what I could find. Let's keep posted on bugzilla, thank you :)
Comment 28•9 years ago
|
||
Observed on:
Device: msm8909
Gonk Version: AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.188
Moz BuildID: 20150606002503
Manifest: https://www.codeaurora.org/cgit/quic/lf/b2g/manifest/tree/caf_AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.188.xml?h=release
B2G Version: v2.2
Gecko Version: 37.0
Gaia: http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=8fc797527a3eca7665bc1d1828848f2fb77ca99f
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e0045f9c8b7e84fc52ba628141688c8ecb4b7a52
Patches: bug 1133147, bug 1167799, bug 1162663, bug 1174243
Comment 29•9 years ago
|
||
Comment 30•9 years ago
|
||
Comment 31•9 years ago
|
||
(In reply to Hsin-Yi Tsai [:hsinyi] from comment #27)
> Thanks for all the details, Phil. Good to learn the basic test flow. I will
> review the code again and see what I could find. Let's keep posted on
> bugzilla, thank you :)
sorry that I haven't got any ideas yet, but still want to share my current thoughts and questions so far... any input is welcome!
1) If the crash is caused by "killing the app" then according to the current protection in ActorDestroy, TelephonyRequestParent avoids to send a message and TelephonRequestChild will set mCallback to nullptr. However, per the log CAF provided, |static_cast<TelephonyCallback*>(mCallback.get())->CALLBACK_ID| was still valid that did't look like ActorDestroy had been called. Any possibility that leads to a non-null mCallback even an app is killed?
2) How could that happen that TelephonyParent sends "ErrorResponse" in [1] but TelephonyChild receives "DialResponseMMIError" in [2], even in the app being killed scenario?
[1] https://dxr.mozilla.org/mozilla-central/source/dom/telephony/ipc/TelephonyParent.cpp#451
[2] https://dxr.mozilla.org/mozilla-central/source/dom/telephony/ipc/TelephonyChild.cpp?from=TelephonyChild.cpp#132
Comment 32•9 years ago
|
||
Sorry for such a late response. First, I have to make one thing clear since it might been misunderstood. The Serial Number(SN) does not grow increamentally, so it *doesn't* show nor gaurantee the sequence. If you look into the patch, then you'll soon find out it's just a random number.
However, SN is still pretty usesful. Line 848 of attachment 8624821 [details] shows the SN is |1839118962|, and from the SN we can findout the callback which failed in queryInterface is generated in line 709. Here, we can see the backtrace below Line 709. Since the backtrace is just PCes in the log file, so we need an object files with Symbols to find out which WebAPI induces the generation of the callback, and further identify the root cause of this issue.
(In reply to Inder from comment #22)
> PS: not sure we understand the need for the libxul.so with the debug patch
> applied. I am assuming you should be able to build it too, right?
I don't think I can build a binary identical libxul.so as the one you used in your test platform. As mentioned above, I need libxul.so to find out who generates the callback which later fails to |queryInterface(...)| into |nsITelephonyDialCallback|. However, If you cannot provide the libxul.so you used, then can you please retrive the function names from the PCes in the log by |addr2line| and |c++filt| which you can find in {B2G}/prebuilts/{your platform}/{ISA}/{architecture}/bin? Then, I believe we can have a progress in this issue. Thanks.
Flags: needinfo?(bhsu)
Comment 33•9 years ago
|
||
Sorry for my misunderstanding Phil's comment. It seems that Phil already provides where the callback is generated.
Comment 34•9 years ago
|
||
Attachment #8623602 -
Attachment is obsolete: true
Updated•9 years ago
|
Attachment #8625719 -
Attachment description: Enhance Debug messages → Enhance Debug messages (V2)
Comment 35•9 years ago
|
||
Hi Phil,
After discussing with my colleagues, we think there might be two possibilities causing this issue, where the first one is an IPC error as Hsinyi mentioned in comment 31, and the other is still an unintentional invoking |NotifyDialMMIError|. Sadly, each of them seems hard to happen. To identify whether the first possibility is the root cause, I slightly modified the debug message enhencement patch (attachment 8625719 [details] [diff] [review]) which further logs which method is invoked in the IPC parent side which includes |TelephonyService.js|. Do you mind trying this patch?
On the other hand, after tracking more related code, I found out that at the IPC parent side, there is only one type of actor which implements |nsITelephonyDialCallback|. The reason there is only one actor here is |nsITelephonyDialCallback| inherits |nsITelephonyCallback|, and thus the actor can serve both |nsITelephonyDialCallback| and |nsITelephonyCallback|.
(In reply to pgravel from comment #26)
> In theory the logic on our side should protect us from sending things on the
> wrong interface. In order to determine the type of callback, we simply do
> (pseudo-code):
>
> > nsCOMPtr<nsITelephonyDialCallback> dialCb(do_QueryInterface(callback));
> > if (dialCb) {
> > // do nsITelephonyDialCallback stuff
> > } else {
> > // do nsITelephonyCallback stuff
> > }
>
As a result, the check you implement here might be useless in the IPC parent side, since the callback here can be always queried to |nsITelephonyDialCallback|, and the first branch is always entered. In addition, the falsy |queryInterface(...)| is induce by a |resume(...)|, which uses |nsITelephonyCallback| as its parameter, so I suggest you check the vender RIL again with this two clues.
Comment 36•9 years ago
|
||
Hi Phil,
The actor of IPC parent side is defined here:
http://lxr.mozilla.org/mozilla-b2g37_v2_2/source/dom/telephony/ipc/TelephonyParent.h#86
Assignee | ||
Comment 37•9 years ago
|
||
Thanks Ben,
After our discussion and with the help of your analysis, I can now reproduce the crash consistently.
You are correct that the cause of the crash is that the Resume() call comes through TelephonyIPCService, which causes the ril to get a TelephonyRequestParent that supports both callback types, even though the caller only supports nsITelephonyCallback.
As discussed on the phone, ideally we would want to fix behavior such that we can can rely on QueryInterface to tell us which type of callback we have in both IPC and non-IPC cases, but for the sake of closing this bug we will be implementing a change in our ril to not depend on QI in this manner.
There is still an open question as to why and how this test results in a Resume() coming through TelephonyIPCService instead of the normal non-ipc direct nsITelephonyService::Resume() call (via TelephonyCall::Resume).
Comment 38•9 years ago
|
||
Clearing the NI flag on me as folks seemed got great progress in Whistler, thank you all!
Flags: needinfo?(htsai)
Comment 39•9 years ago
|
||
(In reply to pgravel from comment #37)
> Thanks Ben,
>
> After our discussion and with the help of your analysis, I can now reproduce
> the crash consistently.
>
> You are correct that the cause of the crash is that the Resume() call comes
> through TelephonyIPCService, which causes the ril to get a
> TelephonyRequestParent that supports both callback types, even though the
> caller only supports nsITelephonyCallback.
>
> As discussed on the phone, ideally we would want to fix behavior such that
> we can can rely on QueryInterface to tell us which type of callback we have
> in both IPC and non-IPC cases, but for the sake of closing this bug we will
> be implementing a change in our ril to not depend on QI in this manner.
>
>
> There is still an open question as to why and how this test results in a
> Resume() coming through TelephonyIPCService instead of the normal non-ipc
> direct nsITelephonyService::Resume() call (via TelephonyCall::Resume).
From the log I saw the 2nd failed TelephonyCall::Resume request was called in Content, while the 1st was in Chrome, told by log PID. Probably should starting from checking your test script to grab more clues first.
Comment 40•9 years ago
|
||
Observed on:
Device: msm8909
Gonk Version: AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.190
Moz BuildID: 20150606002503
Manifest: https://www.codeaurora.org/cgit/quic/lf/b2g/manifest/tree/caf_AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.190.xml?h=release
B2G Version: v2.2
Gecko Version: 37.0
Gaia: http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=8fc797527a3eca7665bc1d1828848f2fb77ca99f
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e0045f9c8b7e84fc52ba628141688c8ecb4b7a52
Patches: bug 1133147, bug 1167799, bug 1162663, bug 1174243
Comment 41•9 years ago
|
||
Comment 42•9 years ago
|
||
Whiteboard: [b2g-crash][caf-crash 645][caf priority: p1][CR 853223]
Assignee | ||
Comment 43•9 years ago
|
||
Hi Ben,
Before we forget, could you please file a bugzilla for modifying the IPC layer to not always report both interfaces as supported, as discussed in our call? Even if it's not for v2.2, it'll be nice to tracking purposes.
Thanks!
Flags: needinfo?(bhsu)
Comment 44•9 years ago
|
||
(In reply to pgravel from comment #43)
> Hi Ben,
>
> Before we forget, could you please file a bugzilla for modifying the IPC
> layer to not always report both interfaces as supported, as discussed in our
> call? Even if it's not for v2.2, it'll be nice to tracking purposes.
>
> Thanks!
No problem, in fact, I'm thinking this question these days, and almost come out with a plan. Thanks for your kind remind.
Flags: needinfo?(bhsu)
Comment 45•9 years ago
|
||
Per comment 37, changing the bug component to Vendcom and assign to Phil for the comRIL.
Assignee: bhsu → pgravel
Component: RIL → Vendcom
Comment 46•9 years ago
|
||
Fixed in QC builds.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•9 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•