Closed Bug 1174243 Opened 9 years ago Closed 9 years ago

Crash in telephony while running stability tests

Categories

(Firefox OS Graveyard :: Vendcom, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:2.2+, b2g-v2.2 fixed, b2g-master fixed)

RESOLVED FIXED
blocking-b2g 2.2+
Tracking Status
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: ggrisco, Assigned: pgravel)

References

Details

(Keywords: crash, Whiteboard: [POVB])

Crash Data

Attachments

(16 files, 1 obsolete file)

140.01 KB, text/plain
Details
128.24 KB, text/plain
Details
138.32 KB, text/plain
Details
160.88 KB, text/plain
Details
141.77 KB, text/plain
Details
161.31 KB, text/plain
Details
141.51 KB, text/plain
Details
135.80 KB, text/plain
Details
150.92 KB, text/plain
Details
127.49 KB, text/plain
Details
5.50 MB, text/plain
Details
141.79 KB, text/plain
Details
135.80 KB, text/plain
Details
7.27 KB, patch
Details | Diff | Splinter Review
141.34 KB, text/plain
Details
152.71 KB, text/plain
Details
Seeing a frequent crash (SIGSEGV 0x0) with the following crash signature:

[@ mozilla::dom::telephony::TelephonyRequestChild::DoResponse | mozilla::dom::telephony::PTelephonyRequestChild::OnMessageReceived | mozilla::dom::PContentChild::OnMessageReceived | mozilla::ipc::MessageChannel::DispatchAsyncMessage ]

while running stability tests.  We have seen this 20 times in only 100 hours of testing so far, so we definitely need a fix for this.

cafbot will upload logs in a bit.
Whiteboard: [CR 853223] → [caf priority: p1][CR 853223]
Whiteboard: [caf priority: p1][CR 853223] → [b2g-crash][caf-crash 645][caf priority: p1][CR 853223]
Keywords: crash
Hsinyi, could you find someone on this? thanks.
blocking-b2g: 2.2? → 2.2+
Flags: needinfo?(htsai)
Priority: -- → P1
(In reply to Bobby Chien [:bchien] from comment #6)
> Hsinyi, could you find someone on this? thanks.

Sure. This is already on our team's radar. Ben will help this.
Flags: needinfo?(htsai)
Hi HsinYi,
I change the owner to Ben and please feel free to change the owner if someone else can help. Thanks!
Assignee: nobody → bhsu
This is causing major instability in our builds. Can you please provide an ETA for the fix?
Attached patch Enhance Debug messages (obsolete) — Splinter Review
Hi Inder,

I cannot provide an ETA for this issue, since we are still finding its root cause. Now, we are suspecting that the crash is caused by a failure when |do_QueryInterface(...)| [1], and it further induces a segmentation fault in [2].

In fact, there are two types of callbacks (nsITelephonyCallback and nsITelephonyDialCallback), we should be use the two callbacks carefully, especially when involing their methods. After crawling into the code, I didn't find any unintentionally |NotifyDialMMIError| in |TelephonyService,js| so far. Therefore, I suggest you to check there is any in vender's RIL.

On the other hand, the logs and the mini-dumps don't provide enough information to solve this issue, so it would be great if you apply this patch and provid |objdir-gecko/dist/lib/libxul.so| and |libxul.so.sym|. Besides, a longer logcat is also appriciated.

[1] http://lxr.mozilla.org/mozilla-b2g37_v2_2/source/dom/telephony/ipc/TelephonyChild.cpp#265
[2] http://lxr.mozilla.org/mozilla-b2g37_v2_2/source/dom/telephony/ipc/TelephonyChild.cpp#272
Flags: needinfo?(ikumar)
Nick, can you please apply this patch in our build.
Flags: needinfo?(ikumar) → needinfo?(ntroast)
Attachment #8624821 - Attachment description: EXTRA file attachment - → EXTRA file attachment - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.186
Flags: needinfo?(ntroast)
Attachment #8624822 - Attachment description: decoded minidump - → decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.186
Ben - we have the complete logs for the crash with your debug patch. Please look into it and provide a fix.

Thanks,
-Inder
PS: not sure we understand the need for the libxul.so with the debug patch applied. I am assuming you should be able to build it too, right?
Flags: needinfo?(bhsu)
(In reply to Ben Hsu [:bhsu] from comment #16)
> Created attachment 8623602 [details] [diff] [review]
> Enhance Debug messages
> 

> In fact, there are two types of callbacks (nsITelephonyCallback and
> nsITelephonyDialCallback), we should be use the two callbacks carefully,
> especially when involing their methods. After crawling into the code, I
> didn't find any unintentionally |NotifyDialMMIError| in
> |TelephonyService,js| so far. Therefore, I suggest you to check there is any
> in vender's RIL.

Hi Inder,
Have you also checked your RIL per Ben's suggestion? 

Thanks!
Flags: needinfo?(ikumar)
Yes, Anshul and Phil already took a look.
Adding ni on them to provide more details.
Flags: needinfo?(pgravel)
Flags: needinfo?(ikumar)
Flags: needinfo?(anshulj)
(In reply to Inder from comment #24)
> Yes, Anshul and Phil already took a look.
> Adding ni on them to provide more details.

I read through attachment 8623602 [details] [diff] [review] and saw the lines below, which seemed point that CAF telephony service somehow unintentionally called |NotifyDialMMIError| as Ben reasoned. According to the log, I suspected that the error might happen in the handler of RIL_REQUEST_SWITCH_WAITING_OR_HOLDING_AND_ACTIVE. Any more idea, Anshul?

=== log snippet ===
32243 06-19 18:59:59.604 22963 22963 I Gecko   : Bug 1174243: Invoked, SN(1839118962)^M
32244 06-19 18:59:59.604 22963 22963 I Gecko   : Bug 1174243: do_QueryInterface failed.
======
In theory the logic on our side should protect us from sending things on the wrong interface. In order to determine the type of callback, we simply do (pseudo-code):

> nsCOMPtr<nsITelephonyDialCallback> dialCb(do_QueryInterface(callback));
> if (dialCb) {
>   // do nsITelephonyDialCallback stuff
> } else {
>   // do nsITelephonyCallback stuff
> }

All our callbacks go through this, so this crash *should* never occur. Obviously it is occurring though, which is somewhat perplexing.


This crash occurs in a test is basically dial->hangup->kill(dialer)->dial->hangup->kill(dialer)->etc..., and only seems to crash after several iterations of that test. Is there any chance callback object might get freed or invalid as a result of killing the dialer app?

In all the logs I see the following happen when it crashes

> ->Dial(callA)
> ->Dial(callB)
> ->Hangup(callB) -> callback SN(120564633)
> ->Resume(callA) -> callback SN(1782489557) -> RIL_REQUEST_SWITCH_WAITING_OR_HOLDING_AND_ACTIVE
> --GaiaApps.kill(dialer)
> ->Resume(callA) -> callback SN(1839118962) -> 2nd resume request while there is a pending resume will cause telephony to call callback.NotifyError()
> <-nsITelephonyCallback.NotifyError(generic failure) on 1839118962
> <-RIL_REQUEST_SWITCH_WAITING_OR_HOLDING_AND_ACTIVE {success}
> -- Bug 1174243: Invoked, SN(1839118962) - do_QueryInterface failed
> *crash*

Thanks to the SN# we can tell from the logs the crash  doesn't occur due to the first Resume() request, but rather due to the instantly failed 2nd Resume() request that comes right after GaiaApps.kill() is invoked.

I verified that the error notification due to the 2nd Resume() goes through the same interface check I mentioned above, so it should be routed to the correct interface... unless something wonky is happening with the given callback pointer.
I can say that in all my tests I can see the NotifyError() being called on nsITelephonyCallback, but with the caveat I also have yet to reproduce the crash.


I will keep on trying to find a way to locally reproduce this crash and try to catch it in gdb, but in the meantime I'd welcome any insight on the matter.
Flags: needinfo?(pgravel) → needinfo?(htsai)
Thanks for all the details, Phil. Good to learn the basic test flow. I will review the code again and see what I could find. Let's keep posted on bugzilla, thank you :)
(In reply to Hsin-Yi Tsai [:hsinyi] from comment #27)
> Thanks for all the details, Phil. Good to learn the basic test flow. I will
> review the code again and see what I could find. Let's keep posted on
> bugzilla, thank you :)

sorry that I haven't got any ideas yet, but still want to share my current thoughts and questions so far... any input is welcome!

1) If the crash is caused by "killing the app" then according to the current protection in ActorDestroy, TelephonyRequestParent avoids to send a message and TelephonRequestChild will set mCallback to nullptr. However, per the log CAF provided, |static_cast<TelephonyCallback*>(mCallback.get())->CALLBACK_ID| was still valid that did't look like ActorDestroy had been called. Any possibility that leads to a non-null mCallback even an app is killed?

2) How could that happen that TelephonyParent sends "ErrorResponse" in [1] but TelephonyChild receives "DialResponseMMIError" in [2], even in the app being killed scenario? 


[1] https://dxr.mozilla.org/mozilla-central/source/dom/telephony/ipc/TelephonyParent.cpp#451
[2] https://dxr.mozilla.org/mozilla-central/source/dom/telephony/ipc/TelephonyChild.cpp?from=TelephonyChild.cpp#132
Sorry for such a late response. First, I have to make one thing clear since it might been misunderstood. The Serial Number(SN) does not grow increamentally, so it *doesn't* show nor gaurantee the sequence. If you look into the patch, then you'll soon find out it's just a random number. 

However, SN is still pretty usesful. Line 848 of attachment 8624821 [details] shows the SN is |1839118962|, and from the SN we can findout the callback which failed in queryInterface is generated in line 709. Here, we can see the backtrace below Line 709. Since the backtrace is just PCes in the log file, so we need an object files with Symbols to find out which WebAPI induces the generation of the callback, and further identify the root cause of this issue.

(In reply to Inder from comment #22)
> PS: not sure we understand the need for the libxul.so with the debug patch
> applied. I am assuming you should be able to build it too, right?

I don't think I can build a binary identical libxul.so as the one you used in your test platform. As mentioned above, I need libxul.so to find out who generates the callback which later fails to |queryInterface(...)| into |nsITelephonyDialCallback|. However, If you cannot provide the libxul.so you used, then can you please retrive the function names from the PCes in the log by |addr2line| and |c++filt| which you can find in {B2G}/prebuilts/{your platform}/{ISA}/{architecture}/bin? Then, I believe we can have a progress in this issue. Thanks.
Flags: needinfo?(bhsu)
Sorry for my misunderstanding Phil's comment. It seems that Phil already provides where the callback is generated.
Attachment #8623602 - Attachment is obsolete: true
Attachment #8625719 - Attachment description: Enhance Debug messages → Enhance Debug messages (V2)
Hi Phil, 

After discussing with my colleagues, we think there might be two possibilities causing this issue, where the first one is an IPC error as Hsinyi mentioned in comment 31, and the other is still an unintentional invoking |NotifyDialMMIError|. Sadly, each of them seems hard to happen. To identify whether the first possibility is the root cause, I slightly modified the debug message enhencement patch (attachment 8625719 [details] [diff] [review]) which further logs which method is invoked in the IPC parent side which includes |TelephonyService.js|. Do you mind trying this patch?

On the other hand, after tracking more related code, I found out that at the IPC parent side, there is only one type of actor which implements |nsITelephonyDialCallback|. The reason there is only one actor here is  |nsITelephonyDialCallback| inherits |nsITelephonyCallback|, and thus the actor can serve both |nsITelephonyDialCallback| and |nsITelephonyCallback|.

(In reply to pgravel from comment #26)
> In theory the logic on our side should protect us from sending things on the
> wrong interface. In order to determine the type of callback, we simply do
> (pseudo-code):
> 
> > nsCOMPtr<nsITelephonyDialCallback> dialCb(do_QueryInterface(callback));
> > if (dialCb) {
> >   // do nsITelephonyDialCallback stuff
> > } else {
> >   // do nsITelephonyCallback stuff
> > }
> 

As a result, the check you implement here might be useless in the IPC parent side, since the callback here can be always queried to |nsITelephonyDialCallback|, and the first branch is always entered. In addition, the falsy |queryInterface(...)| is induce by a |resume(...)|, which uses |nsITelephonyCallback| as its parameter, so I suggest you check the vender RIL again with this two clues.
Thanks Ben,

After our discussion and with the help of your analysis, I can now reproduce the crash consistently.

You are correct that the cause of the crash is that the Resume() call comes through TelephonyIPCService, which causes the ril to get a TelephonyRequestParent that supports both callback types, even though the caller only supports nsITelephonyCallback.

As discussed on the phone, ideally we would want to fix behavior such that we can can rely on QueryInterface to tell us which type of callback we have in both IPC and non-IPC cases, but for the sake of closing this bug we will be implementing a change in our ril to not depend on QI in this manner.


There is still an open question as to why and how this test results in a Resume() coming through TelephonyIPCService instead of the normal non-ipc direct nsITelephonyService::Resume() call (via TelephonyCall::Resume).
Clearing the NI flag on me as folks seemed got great progress in Whistler, thank you all!
Flags: needinfo?(htsai)
(In reply to pgravel from comment #37)
> Thanks Ben,
> 
> After our discussion and with the help of your analysis, I can now reproduce
> the crash consistently.
> 
> You are correct that the cause of the crash is that the Resume() call comes
> through TelephonyIPCService, which causes the ril to get a
> TelephonyRequestParent that supports both callback types, even though the
> caller only supports nsITelephonyCallback.
> 
> As discussed on the phone, ideally we would want to fix behavior such that
> we can can rely on QueryInterface to tell us which type of callback we have
> in both IPC and non-IPC cases, but for the sake of closing this bug we will
> be implementing a change in our ril to not depend on QI in this manner.
> 
> 
> There is still an open question as to why and how this test results in a
> Resume() coming through TelephonyIPCService instead of the normal non-ipc
> direct nsITelephonyService::Resume() call (via TelephonyCall::Resume).

From the log I saw the 2nd failed TelephonyCall::Resume request was called in Content, while the 1st was in Chrome, told by log PID. Probably should starting from checking your test script to grab more clues first.
Flags: needinfo?(anshulj)
Whiteboard: [b2g-crash][caf-crash 645][caf priority: p1][CR 853223]
Hi Ben,

Before we forget, could you please file a bugzilla for modifying the IPC layer to not always report both interfaces as supported, as discussed in our call? Even if it's not for v2.2, it'll be nice to tracking purposes.

Thanks!
Flags: needinfo?(bhsu)
(In reply to pgravel from comment #43)
> Hi Ben,
> 
> Before we forget, could you please file a bugzilla for modifying the IPC
> layer to not always report both interfaces as supported, as discussed in our
> call? Even if it's not for v2.2, it'll be nice to tracking purposes.
> 
> Thanks!

No problem, in fact, I'm thinking this question these days, and almost come out with a plan. Thanks for your kind remind.
Flags: needinfo?(bhsu)
See Also: → 1179089
Per comment 37, changing the bug component to Vendcom and assign to Phil for the comRIL.
Assignee: bhsu → pgravel
Component: RIL → Vendcom
Fixed in QC builds.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [POVB]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: