Closed Bug 832109 Opened 8 years ago Closed 8 years ago

crash in mozilla::dom::telephony::Telephony::RILTelephonyCallback::CallStateChanged

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(blocking-b2g:tef+, firefox19 wontfix, firefox20 wontfix, firefox21 fixed, b2g18 verified, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 fixed)

VERIFIED FIXED
B2G C4 (2jan on)
blocking-b2g tef+
Tracking Status
firefox19 --- wontfix
firefox20 --- wontfix
firefox21 --- fixed
b2g18 --- verified
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- fixed

People

(Reporter: m1, Assigned: bent.mozilla)

Details

(Keywords: crash, Whiteboard: [b2g-crash][BTG-1000][LOE:M])

Crash Data

Attachments

(5 files)

See multiple times during stability test. 

Test Steps:
1. Start Airplane mode, SMS, Call script.


Top frames:
Crash reason:  SIGSEGV
Crash address: 0x414566c0

Thread 0 (crashed)
 0  0x414566c0
     r4 = 0x414566c0    r5 = 0xbeeaa108    r6 = 0xbeeaa108    r7 = 0xbeeaa028
     r8 = 0x00000005    r9 = 0x411aab78   r10 = 0xbeeaa330    fp = 0x43ffe8e0
     sp = 0xbeea9fe0    lr = 0x408300cf    pc = 0x414566c0
    Found by: given as instruction pointer in context
 1  libxul.so!mozilla::dom::telephony::Telephony::RILTelephonyCallback::CallStateChanged [Telephony.h : 109 + 0x11]
     sp = 0xbeea9ff0    pc = 0x408300b9
    Found by: stack scanning
 2  0xbeeaa31e
     r4 = 0xbeeaa108    r5 = 0x00000001    sp = 0xbeeaa008    pc = 0xbeeaa320
    Found by: call frame info
 3  libxul.so!XPCWrappedNative::CallMethod [XPCWrappedNative.cpp : 3083 + 0xd]
     sp = 0xbeeaa058    pc = 0x408f17fb
    Found by: stack scanning
(minidump is from AU 177)
(In reply to Michael Vines [:m1] from comment #0)
> Created attachment 703696 [details]
> decoded minidump of crash
> 
> See multiple times during stability test. 
> 
> Test Steps:
> 1. Start Airplane mode, SMS, Call script.
> 
Hi Michael,

Would you please provide more dump details, such as 'r0' address? That would be great help for us. It would be also very helpful that you could clarify the detailed test steps here. Thanks!
(In reply to Michael Vines [:m1] from comment #1)
> (minidump is from AU 177)

Hi Michael,
Is it possible we can also get minidump and simple instruction to operate this tool? Thanks.
(In reply to Hsin-Yi Tsai [:hsinyi] from comment #2)
> Would you please provide more dump details, such as 'r0' address? That would
> be great help for us. It would be also very helpful that you could clarify
> the detailed test steps here. Thanks!

Decoded minidump is attached to this bug.  I think you will need to examine the minidump and corresponding source in the backtrace to attempt to determine why we could crash here.
(In reply to Michael Vines [:m1] from comment #4)
> (In reply to Hsin-Yi Tsai [:hsinyi] from comment #2)
> > Would you please provide more dump details, such as 'r0' address? That would
> > be great help for us. It would be also very helpful that you could clarify
> > the detailed test steps here. Thanks!
> 
> Decoded minidump is attached to this bug.  I think you will need to examine
> the minidump and corresponding source in the backtrace to attempt to
> determine why we could crash here.

Yes, we already examined the minidump and we still think it would be helpful to have info about r0 but sadly, the info is missing in the minidump.
Bug in the minidump code?  Sorry, I'm certainly not withholding r0! :)
Severity: normal → critical
Crash Signature: [@ mozilla::dom::telephony::Telephony::RILTelephonyCallback::CallStateChanged]
Keywords: crash
Whiteboard: [BTG-1000] → [b2g-crash][BTG-1000]
Hsin-Yi, I'm assigning to you since it seems like you're investigating here.
Assignee: nobody → htsai
blocking-b2g: tef? → tef+
Whiteboard: [b2g-crash][BTG-1000] → [b2g-crash][BTG-1000][LOE:M]
Triage with Moz Taipei team: 
minidump code without r0 is not helping the team to debug further.
Can mvines either provide STR so that Taipei can reproduce locally? or provide another minidump with r0? Thanks
Flags: needinfo?(mvines)
Let's leave this open for another couple days to see if we can catch this again in automation with hopefully more helpful logs
Flags: needinfo?(mvines)
Hi, Michael,
   If you reproduce this issue again, can you please also provide the RIL binary code of the test image to us?
Haven't seen this come up in a couple days of testing so optimistically withdrawing
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Crash came back in AU 185.  App that's crashing is communications.

Top frames (will attach full minidump too...sorry no r0 still)
-----------
Crash reason:  SIGSEGV
Crash address: 0x0

Thread 0 (crashed)
 0  0x0
     r4 = 0x00000000    r5 = 0xbe8df108    r6 = 0xbe8df108    r7 = 0xbe8df028
     r8 = 0x00000005    r9 = 0x41219d98   r10 = 0xbe8df330    fp = 0x44767450
     sp = 0xbe8defe0    lr = 0x40870877    pc = 0x00000000
    Found by: given as instruction pointer in context
 1  libxul.so!mozilla::dom::telephony::Telephony::RILTelephonyCallback::CallStateChanged [Telephony.h : 109 + 0x11]
     sp = 0xbe8deff0    pc = 0x40870861
    Found by: stack scanning
 2  libxul.so!js::PropertyTree::insertChild [Utility.h : 153 + 0x3]
     r4 = 0xbe8df108    r5 = 0x432e5dd8    sp = 0xbe8df008    pc = 0x40e8bae9
    Found by: call frame info
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
This might be an issue with commercial RIL. Will reopen if we see the issue again.
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → INVALID
Crash came back in AU 189.

rash reason:  SIGSEGV
Crash address: 0x0

Thread 0 (crashed)
 0  0x0
     r4 = 0x00000000    r5 = 0xbeb07100    r6 = 0xbeb07100    r7 = 0xbeb07020
     r8 = 0x00000005    r9 = 0x412834c8   r10 = 0xbeb07328    fp = 0x445d9f60
     sp = 0xbeb06fd8    lr = 0x408d8b67    pc = 0x00000000
    Found by: given as instruction pointer in context
 1  libxul.so!mozilla::dom::telephony::Telephony::RILTelephonyCallback::CallStateChanged [Telephony.h : 109 + 0x11]
     sp = 0xbeb06fe8    pc = 0x408d8b51
    Found by: stack scanning
 2  0xbeb07316
     r4 = 0xbeb07100    r5 = 0x00000001    sp = 0xbeb07000    pc = 0xbeb07318
    Found by: call frame info
 3  libxul.so!XPCWrappedNative::CallMethod [XPCWrappedNative.cpp : 3083 + 0xd]
     sp = 0xbeb07050    pc = 0x4099ab87
    Found by: stack scanning
 4  libxul.so!XPC_WN_CallMethod [XPCWrappedNativeJSOps.cpp : 1469 + 0x7]
     r4 = 0x4260bf10    r5 = 0xbeb07400    r6 = 0x4441d700    r7 = 0x00000001
     r8 = 0x42948108    r9 = 0x00000004   r10 = 0x414fbb58    fp = 0xbeb078d8
     sp = 0xbeb073e0    pc = 0x4099f3e5
    Found by: call frame info
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Anshul,

Sadly this happened again. To further investigate this issue, we need more information from you. Would you please help provide us the following?
1) complete logcat of the test run
2) RIL binary code of the test image
3) code version you were testing with
4) better with the detailed test steps

We do need them for debugging and resolving this issue. Thank you.
Flags: needinfo?(anshulj)
Hsin-Yi, please find the information you requested.

- Log file attached
- You can get the RIL binary from AU 189
- Steps to reproduce
  - Start Camera, Call, Camcorder, Call
  - Start music at background
  - Repeat above steps

The crash happened after the above test case was run over night by our stability team.
Flags: needinfo?(anshulj)
(In reply to Anshul from comment #18)
> Hsin-Yi, please find the information you requested.
> 
> - Log file attached
> - You can get the RIL binary from AU 189
> - Steps to reproduce
>   - Start Camera, Call, Camcorder, Call
>   - Start music at background
>   - Repeat above steps
> 
> The crash happened after the above test case was run over night by our
> stability team.
Hi Anshul,
Thanks for providing what we requested. We investigated the full log, however we didn't see more clues about communications app crash there. Our QA is helping reproduce this issue with QC RIL, and hopefully to gather more hints. We've been following your steps over dozens of iterations but still not made it yet. We are wondering whether your team is testing this issue by a test script. If so, would it be possible to provide us your test script? That would be very much appreciated and a great help. Thanks again :)
Flags: needinfo?(anshulj)
Anshul, 
Me again ;-) Not sure whether it's possible you could help us to get more info when you run the next stability test. In case it is, here are the steps to get more debug messages.
1) edit |B2G/.userconfig|, add this line "export B2G_DEBUG=1" Then delete the obj directory & re-build.
2) In B2G/, run this line |./run-gdb.sh attach [pid communications app]|. Once the seg. fault appears again, type 'bt' to get stack.
QA Contact: atsai
Anshul,
  Though having 3 people to do this test during these days, we still can not reproduce this bug...:( 
Could you also help to reproduce this bug and get logs of what Hsinyi said to us? Thanks.
According to the minidump, the crash resulted from the null r4. After analyzing the assembly code, we, some RIL team members, conjectured something wrong when RIL gets RILTelephonyCallback that tried calling mTelephony->CallStateChanged() [1]. According to the assembly, RILTelephonyCallback looked fine but mTelephony [2] seemd null. I said 'seemed' because the minidump didn't have enough information and that's we reasoned from the assembly and null r4.

We are discussing whether it's possible that mTelephony got freed due to cycle collection even when RILTelephonyCallback is valid, and whether we should make mTelephony reference counted. May I have your feedback of this, Ben? Thanks!

[1] http://mxr.mozilla.org/mozilla-central/source/dom/telephony/Telephony.h#113
[2] http://mxr.mozilla.org/mozilla-central/source/dom/telephony/Telephony.h#109
Flags: needinfo?(bent.mozilla)
I don't see any place where we set RILTelephonyCallback::mTelephony could become null, even during cycle collection.

However, it seems entirely possible that if any component (RIL, something else?) holds RILTelephonyCallback alive after we call UnregisterTelephonyCallback that we would end up crashing later once the Telephony object has been deleted. The Telephony code is doing the right thing here so we would want to fix the RIL. That may not be possible, though, so we could be safer by guarding against this case.
Flags: needinfo?(bent.mozilla)
Attached patch Bandaid patchSplinter Review
Here is a patch that would guard against a buggy RIL (or some other component) holding our callback alive longer than it should. Please give this a try and see if it fixes the crash?
Oh, and we definitely do not want to make RILTelephonyCallback::mTelephony into a strong reference since it would create a cycle between the RIL and the Telephony objects. The only way to fix that would be to make the RIL cycle-collected, which is a) hard, and b) maybe impossible if we are using custom RIL implementations.
Is this potential cycle wholly contained within a content process?   It looks like it but I only skimmed the code.   Known custom RIL implementations are only running code in the main process.
I actually don't know, it depends on how the remoting for the telephony stuff was done. The cycle I'm talking about would be in the content process for sure and maybe also in the parent process.
Anshul, can you please take a closer look at this possible cycle from the parent process POV?
Well, to be clear, the cycle I'm talking about is the reason why we can't make a strong ref from RIL -> RILTelephonyCallback -> Telephony. We don't have that now so this isn't an actual problem.

The thing I would check for in custom RIL implementations is any place where nsIRILTelephonyCallback methods are called after the callback has been unregistered. That's my best guess for what's causing the current crash.
Ben, I will make a change in custom RIL to make sure the callstatecange notification is not sent after the telephony callback is unregistered. For now I am marking this change invalid and will reopen if I see it again.

Thanks for helping out getting to the bottom of the issue.
Flags: needinfo?(anshulj)
(In reply to Anshul from comment #31)
> Ben, I will make a change in custom RIL to make sure the callstatecange
> notification is not sent after the telephony callback is unregistered. For
> now I am marking this change invalid and will reopen if I see it again.

Did you mean to resolve this bug as INVALID?
Flags: needinfo?(anshulj)
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Flags: needinfo?(anshulj)
Resolution: --- → INVALID
(In reply to ben turner [:bent] from comment #25)
> Created attachment 711836 [details] [diff] [review]
> Bandaid patch
> 
> Here is a patch that would guard against a buggy RIL (or some other
> component) holding our callback alive longer than it should. Please give
> this a try and see if it fixes the crash?

Ben, thanks for your insights and the patch!
Batch edit: Bugs still affected on b2g18 after 2/13 merge to v1.0.1 branch are affected on v1.0.1 branch.
(In reply to Anshul from comment #31)
> Ben, I will make a change in custom RIL to make sure the callstatecange
> notification is not sent after the telephony callback is unregistered.

Well, I'd still probably like to check this patch in since it guards against this kind of problem. It won't be high priority if your custom RIL is changed in the meantime, of course, but some day this could come back without further protection.

Andrew, can you set whatever flags that should require? :)
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Comment on attachment 711836 [details] [diff] [review]
Bandaid patch

Justin, we chatted about this last week, feel comfortable reviewing it?
Attachment #711836 - Flags: review?(justin.lebar+bug)
> Justin, we chatted about this last week, feel comfortable reviewing it?

I have zero recollection of this conversation, but your patch looks good to me.  :)
Attachment #711836 - Flags: review?(justin.lebar+bug) → review+
(In reply to Justin Lebar [:jlebar] from comment #37)
> > Justin, we chatted about this last week, feel comfortable reviewing it?
> 
> I have zero recollection of this conversation, but your patch looks good to
> me.  :)

Hi Ben, is it ready for m-c?
https://hg.mozilla.org/mozilla-central/rev/5dec45f30b77
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Changing ownership to Bent since he gave the last push! Thanks again for helping this out :)
Assignee: htsai → bent.mozilla
does not make sense to create a regression issue.
Flags: in-moztrap-
verified with 1.1 build of 20130322070202
(In reply to Tracy Walker [:tracy] from comment #44)
> verified with 1.1 build of 20130322070202

Btw, if you end up verifying on b2g18, I'd mark the general workflow flag as verified as well.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.