Closed Bug 832109 Opened 12 years ago Closed 12 years ago

crash in mozilla::dom::telephony::Telephony::RILTelephonyCallback::CallStateChanged

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(blocking-b2g:tef+, firefox19 wontfix, firefox20 wontfix, firefox21 fixed, b2g18 verified, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 fixed)

VERIFIED FIXED
B2G C4 (2jan on)
blocking-b2g tef+
Tracking Status
firefox19 --- wontfix
firefox20 --- wontfix
firefox21 --- fixed
b2g18 --- verified
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- fixed

People

(Reporter: m1, Assigned: bent.mozilla)

Details

(Keywords: crash, Whiteboard: [b2g-crash][BTG-1000][LOE:M])

Crash Data

Attachments

(5 files)

See multiple times during stability test. Test Steps: 1. Start Airplane mode, SMS, Call script. Top frames: Crash reason: SIGSEGV Crash address: 0x414566c0 Thread 0 (crashed) 0 0x414566c0 r4 = 0x414566c0 r5 = 0xbeeaa108 r6 = 0xbeeaa108 r7 = 0xbeeaa028 r8 = 0x00000005 r9 = 0x411aab78 r10 = 0xbeeaa330 fp = 0x43ffe8e0 sp = 0xbeea9fe0 lr = 0x408300cf pc = 0x414566c0 Found by: given as instruction pointer in context 1 libxul.so!mozilla::dom::telephony::Telephony::RILTelephonyCallback::CallStateChanged [Telephony.h : 109 + 0x11] sp = 0xbeea9ff0 pc = 0x408300b9 Found by: stack scanning 2 0xbeeaa31e r4 = 0xbeeaa108 r5 = 0x00000001 sp = 0xbeeaa008 pc = 0xbeeaa320 Found by: call frame info 3 libxul.so!XPCWrappedNative::CallMethod [XPCWrappedNative.cpp : 3083 + 0xd] sp = 0xbeeaa058 pc = 0x408f17fb Found by: stack scanning
(minidump is from AU 177)
(In reply to Michael Vines [:m1] from comment #0) > Created attachment 703696 [details] > decoded minidump of crash > > See multiple times during stability test. > > Test Steps: > 1. Start Airplane mode, SMS, Call script. > Hi Michael, Would you please provide more dump details, such as 'r0' address? That would be great help for us. It would be also very helpful that you could clarify the detailed test steps here. Thanks!
(In reply to Michael Vines [:m1] from comment #1) > (minidump is from AU 177) Hi Michael, Is it possible we can also get minidump and simple instruction to operate this tool? Thanks.
(In reply to Hsin-Yi Tsai [:hsinyi] from comment #2) > Would you please provide more dump details, such as 'r0' address? That would > be great help for us. It would be also very helpful that you could clarify > the detailed test steps here. Thanks! Decoded minidump is attached to this bug. I think you will need to examine the minidump and corresponding source in the backtrace to attempt to determine why we could crash here.
(In reply to Michael Vines [:m1] from comment #4) > (In reply to Hsin-Yi Tsai [:hsinyi] from comment #2) > > Would you please provide more dump details, such as 'r0' address? That would > > be great help for us. It would be also very helpful that you could clarify > > the detailed test steps here. Thanks! > > Decoded minidump is attached to this bug. I think you will need to examine > the minidump and corresponding source in the backtrace to attempt to > determine why we could crash here. Yes, we already examined the minidump and we still think it would be helpful to have info about r0 but sadly, the info is missing in the minidump.
Bug in the minidump code? Sorry, I'm certainly not withholding r0! :)
Severity: normal → critical
Crash Signature: [@ mozilla::dom::telephony::Telephony::RILTelephonyCallback::CallStateChanged]
Keywords: crash
Whiteboard: [BTG-1000] → [b2g-crash][BTG-1000]
Hsin-Yi, I'm assigning to you since it seems like you're investigating here.
Assignee: nobody → htsai
blocking-b2g: tef? → tef+
Whiteboard: [b2g-crash][BTG-1000] → [b2g-crash][BTG-1000][LOE:M]
Triage with Moz Taipei team: minidump code without r0 is not helping the team to debug further. Can mvines either provide STR so that Taipei can reproduce locally? or provide another minidump with r0? Thanks
Flags: needinfo?(mvines)
Let's leave this open for another couple days to see if we can catch this again in automation with hopefully more helpful logs
Flags: needinfo?(mvines)
Hi, Michael, If you reproduce this issue again, can you please also provide the RIL binary code of the test image to us?
Haven't seen this come up in a couple days of testing so optimistically withdrawing
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Crash came back in AU 185. App that's crashing is communications. Top frames (will attach full minidump too...sorry no r0 still) ----------- Crash reason: SIGSEGV Crash address: 0x0 Thread 0 (crashed) 0 0x0 r4 = 0x00000000 r5 = 0xbe8df108 r6 = 0xbe8df108 r7 = 0xbe8df028 r8 = 0x00000005 r9 = 0x41219d98 r10 = 0xbe8df330 fp = 0x44767450 sp = 0xbe8defe0 lr = 0x40870877 pc = 0x00000000 Found by: given as instruction pointer in context 1 libxul.so!mozilla::dom::telephony::Telephony::RILTelephonyCallback::CallStateChanged [Telephony.h : 109 + 0x11] sp = 0xbe8deff0 pc = 0x40870861 Found by: stack scanning 2 libxul.so!js::PropertyTree::insertChild [Utility.h : 153 + 0x3] r4 = 0xbe8df108 r5 = 0x432e5dd8 sp = 0xbe8df008 pc = 0x40e8bae9 Found by: call frame info
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
This might be an issue with commercial RIL. Will reopen if we see the issue again.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → INVALID
Crash came back in AU 189. rash reason: SIGSEGV Crash address: 0x0 Thread 0 (crashed) 0 0x0 r4 = 0x00000000 r5 = 0xbeb07100 r6 = 0xbeb07100 r7 = 0xbeb07020 r8 = 0x00000005 r9 = 0x412834c8 r10 = 0xbeb07328 fp = 0x445d9f60 sp = 0xbeb06fd8 lr = 0x408d8b67 pc = 0x00000000 Found by: given as instruction pointer in context 1 libxul.so!mozilla::dom::telephony::Telephony::RILTelephonyCallback::CallStateChanged [Telephony.h : 109 + 0x11] sp = 0xbeb06fe8 pc = 0x408d8b51 Found by: stack scanning 2 0xbeb07316 r4 = 0xbeb07100 r5 = 0x00000001 sp = 0xbeb07000 pc = 0xbeb07318 Found by: call frame info 3 libxul.so!XPCWrappedNative::CallMethod [XPCWrappedNative.cpp : 3083 + 0xd] sp = 0xbeb07050 pc = 0x4099ab87 Found by: stack scanning 4 libxul.so!XPC_WN_CallMethod [XPCWrappedNativeJSOps.cpp : 1469 + 0x7] r4 = 0x4260bf10 r5 = 0xbeb07400 r6 = 0x4441d700 r7 = 0x00000001 r8 = 0x42948108 r9 = 0x00000004 r10 = 0x414fbb58 fp = 0xbeb078d8 sp = 0xbeb073e0 pc = 0x4099f3e5 Found by: call frame info
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Anshul, Sadly this happened again. To further investigate this issue, we need more information from you. Would you please help provide us the following? 1) complete logcat of the test run 2) RIL binary code of the test image 3) code version you were testing with 4) better with the detailed test steps We do need them for debugging and resolving this issue. Thank you.
Flags: needinfo?(anshulj)
Hsin-Yi, please find the information you requested. - Log file attached - You can get the RIL binary from AU 189 - Steps to reproduce - Start Camera, Call, Camcorder, Call - Start music at background - Repeat above steps The crash happened after the above test case was run over night by our stability team.
Flags: needinfo?(anshulj)
(In reply to Anshul from comment #18) > Hsin-Yi, please find the information you requested. > > - Log file attached > - You can get the RIL binary from AU 189 > - Steps to reproduce > - Start Camera, Call, Camcorder, Call > - Start music at background > - Repeat above steps > > The crash happened after the above test case was run over night by our > stability team. Hi Anshul, Thanks for providing what we requested. We investigated the full log, however we didn't see more clues about communications app crash there. Our QA is helping reproduce this issue with QC RIL, and hopefully to gather more hints. We've been following your steps over dozens of iterations but still not made it yet. We are wondering whether your team is testing this issue by a test script. If so, would it be possible to provide us your test script? That would be very much appreciated and a great help. Thanks again :)
Flags: needinfo?(anshulj)
Anshul, Me again ;-) Not sure whether it's possible you could help us to get more info when you run the next stability test. In case it is, here are the steps to get more debug messages. 1) edit |B2G/.userconfig|, add this line "export B2G_DEBUG=1" Then delete the obj directory & re-build. 2) In B2G/, run this line |./run-gdb.sh attach [pid communications app]|. Once the seg. fault appears again, type 'bt' to get stack.
QA Contact: atsai
Anshul, Though having 3 people to do this test during these days, we still can not reproduce this bug...:( Could you also help to reproduce this bug and get logs of what Hsinyi said to us? Thanks.
According to the minidump, the crash resulted from the null r4. After analyzing the assembly code, we, some RIL team members, conjectured something wrong when RIL gets RILTelephonyCallback that tried calling mTelephony->CallStateChanged() [1]. According to the assembly, RILTelephonyCallback looked fine but mTelephony [2] seemd null. I said 'seemed' because the minidump didn't have enough information and that's we reasoned from the assembly and null r4. We are discussing whether it's possible that mTelephony got freed due to cycle collection even when RILTelephonyCallback is valid, and whether we should make mTelephony reference counted. May I have your feedback of this, Ben? Thanks! [1] http://mxr.mozilla.org/mozilla-central/source/dom/telephony/Telephony.h#113 [2] http://mxr.mozilla.org/mozilla-central/source/dom/telephony/Telephony.h#109
Flags: needinfo?(bent.mozilla)
I don't see any place where we set RILTelephonyCallback::mTelephony could become null, even during cycle collection. However, it seems entirely possible that if any component (RIL, something else?) holds RILTelephonyCallback alive after we call UnregisterTelephonyCallback that we would end up crashing later once the Telephony object has been deleted. The Telephony code is doing the right thing here so we would want to fix the RIL. That may not be possible, though, so we could be safer by guarding against this case.
Flags: needinfo?(bent.mozilla)
Attached patch Bandaid patchSplinter Review
Here is a patch that would guard against a buggy RIL (or some other component) holding our callback alive longer than it should. Please give this a try and see if it fixes the crash?
Oh, and we definitely do not want to make RILTelephonyCallback::mTelephony into a strong reference since it would create a cycle between the RIL and the Telephony objects. The only way to fix that would be to make the RIL cycle-collected, which is a) hard, and b) maybe impossible if we are using custom RIL implementations.
Is this potential cycle wholly contained within a content process? It looks like it but I only skimmed the code. Known custom RIL implementations are only running code in the main process.
I actually don't know, it depends on how the remoting for the telephony stuff was done. The cycle I'm talking about would be in the content process for sure and maybe also in the parent process.
Anshul, can you please take a closer look at this possible cycle from the parent process POV?
Well, to be clear, the cycle I'm talking about is the reason why we can't make a strong ref from RIL -> RILTelephonyCallback -> Telephony. We don't have that now so this isn't an actual problem. The thing I would check for in custom RIL implementations is any place where nsIRILTelephonyCallback methods are called after the callback has been unregistered. That's my best guess for what's causing the current crash.
Ben, I will make a change in custom RIL to make sure the callstatecange notification is not sent after the telephony callback is unregistered. For now I am marking this change invalid and will reopen if I see it again. Thanks for helping out getting to the bottom of the issue.
Flags: needinfo?(anshulj)
(In reply to Anshul from comment #31) > Ben, I will make a change in custom RIL to make sure the callstatecange > notification is not sent after the telephony callback is unregistered. For > now I am marking this change invalid and will reopen if I see it again. Did you mean to resolve this bug as INVALID?
Flags: needinfo?(anshulj)
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Flags: needinfo?(anshulj)
Resolution: --- → INVALID
(In reply to ben turner [:bent] from comment #25) > Created attachment 711836 [details] [diff] [review] > Bandaid patch > > Here is a patch that would guard against a buggy RIL (or some other > component) holding our callback alive longer than it should. Please give > this a try and see if it fixes the crash? Ben, thanks for your insights and the patch!
Batch edit: Bugs still affected on b2g18 after 2/13 merge to v1.0.1 branch are affected on v1.0.1 branch.
(In reply to Anshul from comment #31) > Ben, I will make a change in custom RIL to make sure the callstatecange > notification is not sent after the telephony callback is unregistered. Well, I'd still probably like to check this patch in since it guards against this kind of problem. It won't be high priority if your custom RIL is changed in the meantime, of course, but some day this could come back without further protection. Andrew, can you set whatever flags that should require? :)
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Comment on attachment 711836 [details] [diff] [review] Bandaid patch Justin, we chatted about this last week, feel comfortable reviewing it?
Attachment #711836 - Flags: review?(justin.lebar+bug)
> Justin, we chatted about this last week, feel comfortable reviewing it? I have zero recollection of this conversation, but your patch looks good to me. :)
Attachment #711836 - Flags: review?(justin.lebar+bug) → review+
(In reply to Justin Lebar [:jlebar] from comment #37) > > Justin, we chatted about this last week, feel comfortable reviewing it? > > I have zero recollection of this conversation, but your patch looks good to > me. :) Hi Ben, is it ready for m-c?
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
Changing ownership to Bent since he gave the last push! Thanks again for helping this out :)
Assignee: htsai → bent.mozilla
does not make sense to create a regression issue.
Flags: in-moztrap-
verified with 1.1 build of 20130322070202
(In reply to Tracy Walker [:tracy] from comment #44) > verified with 1.1 build of 20130322070202 Btw, if you end up verifying on b2g18, I'd mark the general workflow flag as verified as well.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: