dom::bluetooth::BluetoothChild::RecvNotify crashes during monkey testing

RESOLVED FIXED in Firefox 32

Status

defect
--
critical
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: tkundu, Assigned: shawnjohnjr)

Tracking

({crash})

unspecified
2.0 S1 (9may)
ARM
Gonk (Firefox OS)
Bug Flags:
in-moztrap -

Firefox Tracking Flags

(blocking-b2g:1.4+, firefox30 wontfix, firefox31 wontfix, firefox32 fixed, b2g-v1.4 fixed, b2g-v2.0 fixed)

Details

(Whiteboard: [b2g-crash][qa-])

Attachments

(3 attachments, 3 obsolete attachments)

No description provided.
Blocks: 984663
blocking-b2g: --- → 1.4?
Any log or back trace? And on which device?
Flags: needinfo?(tkundu)
Posted file stack trace
stack trace for settings app crash.

STR:

1) Run monkey testing on device.
Flags: needinfo?(tkundu)
back trace pattern is very similar with bug 795458.
Remove v1.4? blocking tag for now as this issue was seen only on msm8610 .
No longer blocks: 984663
blocking-b2g: 1.4? → ---
Keywords: crash
Whiteboard: [b2g-crash]
Assignee: nobody → shuang
I suspect that sBluetoothService had been deleted during Bluetooth on/off which leads SIGSEGV.

 1  libxul.so!mozilla::dom::bluetooth::BluetoothChild::RecvNotify(mozilla::dom::bluetooth::BluetoothSignal const&) [BluetoothChild.cpp : 78 + 0x9]
     sp = 0xbee4aa98    pc = 0xb5bf368d
    Found by: stack scanning
 2  libxul.so!mozilla::dom::bluetooth::PBluetoothChild::OnMessageReceived(IPC::Message const&) [PBluetoothChild.cpp : 364 + 0x9]
     sp = 0xbee4aaa0    pc = 0xb5724811
    Found by: call frame info
blocking-b2g: --- → 1.4?
Sorry, that was my re-nom for 1.4.  8610 applies for 1.4 as well.
(In reply to Greg Grisco from comment #7)
> Sorry, that was my re-nom for 1.4.  8610 applies for 1.4 as well.

Can you clarify what you mean by 8610? Is this supposed to be CS blocking?
Flags: needinfo?(ggrisco)
Jason that's referring to the chipset in question. Per triage yes it blocks QC.
Flags: needinfo?(ggrisco)
blocking-b2g: 1.4? → 1.4+
Target Milestone: --- → 1.4 S6 (25apr)
Based on the stack, Setting app crashed in BluetoothChild, and probably access nullptr (sBluetoothService). 

#0  mozilla::dom::bluetooth::BluetoothChild::RecvNotify (this=0x496bdc70, aSignal=...) at /home/shawnjohnjr/b2g-latest/B2G/gecko/dom/bluetooth/ipc/BluetoothChild.cpp:78
#1  0x4084f886 in mozilla::dom::bluetooth::PBluetoothChild::OnMessageReceived (this=0x496bdc70, __msg=...) at /home/shawnjohnjr/b2g-latest/B2G/objdir-gecko/ipc/ipdl/PBluetoothChild.cpp:364
#2  0x40870fb4 in mozilla::dom::PContentChild::OnMessageReceived (this=0x4033f018, __msg=...) at /home/shawnjohnjr/b2g-latest/B2G/objdir-gecko/ipc/ipdl/PContentChild.cpp:3417
#3  0x40833be2 in mozilla::ipc::MessageChannel::DispatchAsyncMessage (this=0x4033f048, aMsg=...) at /home/shawnjohnjr/b2g-latest/B2G/gecko/ipc/glue/MessageChannel.cpp:1142
Posted patch bug993954.patch (obsolete) — Splinter Review
Hi Thomas,
What do you think if we can add protection here?
Attachment #8410742 - Flags: feedback?(tzimmermann)
The frame 0 seems occurpt based on attachment 0. The PC is 0x00000000.
0  0x0
     r4 = 0x00000002    r5 = 0xbee4aab8    r6 = 0xb2337c9c    r7 = 0x00000000
     r8 = 0x00000002    r9 = 0x00000001   r10 = 0x00000000    fp = 0xbee4b79c
     sp = 0xbee4aa80    lr = 0xb5bf2fe9    pc = 0x00000000
    Found by: given as instruction pointer in context
Maybe PC=0 causes by function pointer is null, and maybe sBluetoothService is not null.
I think this bug only crashes Settings application instead of the whole b2g process, since it's at child process side.
I guess when |BluetoothServiceChildProcess::NoteDeadActor| gets called, BluetoothChild destructor called, and |BluetoothServiceChildProcess* gBluetoothService| can be nullptr, at the same time |PBluetoothChild::OnMessageReceived| gets message from MessageLoop, which calls |RecvNotify()|. But I cannot explain why PC=0 in frame 0.
IRC log:

>>>>

<shawnjohnjr> hello
<tzimmermann> hi
<shawnjohnjr> busy now?
<tzimmermann> kind of, but go on
<tzimmermann> are you asking about bug 993954?
<shawnjohnjr> Any chance to feedback bug 993954 today?
<shawnjohnjr> yap
<tzimmermann> i'll take a look immediately
<shawnjohnjr> it's monkey testing, so i cannot reproduce it easily.
<tzimmermann> there are already assertions for sBluetoothService
<tzimmermann> does that happen on bluez or bluedroid?
<shawnjohnjr> bluez
<shawnjohnjr> Because frame 0 PC = 0, I can only guess gBluetoothService->DistributeSignal  is null
<shawnjohnjr> But it shouldn't
<tzimmermann> one more second...
<shawnjohnjr> ok
<tzimmermann> (pc == 0) happened when calling BluetoothChild::RecvNotify, right?
<tzimmermann> that method is virtual, if BluetoothChild has been free'd, you might call access and invalid vtable and jump to address 0
<shawnjohnjr> Around gBluetoothService->DistributeSignal
<tzimmermann> oh, ok
<shawnjohnjr> ok
<tzimmermann> and sBluetoothService has probably been free'd but the variable has  not been cleared to nullptr
<tzimmermann> that happens in the destructor of BluetoothService
<tzimmermann> sorry, that happens in the desctuctor of BluetoothChild!
<shawnjohnjr> yap
<tzimmermann> but i couldn't find any call to the destructor of BluetoothChild
<shawnjohnjr> I wrote in Comment 14...
<shawnjohnjr> NoteDeadActor
<tzimmermann> but NoteDeadActor only clears sBluetoothChild!
<tzimmermann> it does not call the destructor afaics
<tzimmermann> sBluetoothChild is not ref-counted
<shawnjohnjr> ok
<tzimmermann> or auto_ptr'ed, or something similar
<shawnjohnjr> you are right
<shawnjohnjr> it's not
<tzimmermann> so i'd guess that the destructor of BluetoothChild never runs and never clears sBluetoothService
<shawnjohnjr> Then my assumption is not correct :(
<tzimmermann> can you reproduce the bug?
<shawnjohnjr> i cannot reproduce this bug.
<tzimmermann> :/
<shawnjohnjr> In theory, monkey testing just randomly click
<tzimmermann> i'd try to delete sBluetoothChild, and give it back to monkey testers for verification.
<shawnjohnjr> ok
<shawnjohnjr> this is something helpful
<tzimmermann> glad to help :) i'll add the conversion to the bug report
<shawnjohnjr> i don't have better idea currently.
<shawnjohnjr> Thanks
Attachment #8410742 - Attachment is obsolete: true
Attachment #8410742 - Flags: feedback?(tzimmermann)
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #15)
> <tzimmermann> there are already assertions for sBluetoothService
> <tzimmermann> does that happen on bluez or bluedroid?
> <shawnjohnjr> bluez
> <shawnjohnjr> Because frame 0 PC = 0, I can only guess

Please note we have switched over bluedroid stack
(In reply to bhargavg1 from comment #16)
> (In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #15)
> > <tzimmermann> there are already assertions for sBluetoothService
> > <tzimmermann> does that happen on bluez or bluedroid?
> > <shawnjohnjr> bluez
> > <shawnjohnjr> Because frame 0 PC = 0, I can only guess
> 
> Please note we have switched over bluedroid stack

What we've found is unrelated to the BT driver. And at some point, BT will probably be supported on the desktop. And then we'll need BlueZ.
There is any chance I can download symbol file of libxul.so?
Flags: needinfo?(tkundu)
Flags: needinfo?(tkundu)
I still think sBluetoothService could be nullptr, and at the same time, called RecvNotify() from MessageLoop, which leads content process (Settings app) crashes. I don't have other explanation to fit the back trace.
 
And I found |BluetoothRequestChild::~BluetoothRequestChild()| will be called if we power off the device manually on Nexus4.
Comment on attachment 8412539 [details] [diff] [review]
Bug 993954 - Check sBluetoothChild is null when closing Settings application

Review of attachment 8412539 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think it's correct to delete sBluetoothChild and sBluetoothService manually. First, without settings these two pointers to nullptr, it might cause a crash when RecvNotify() is called fater delete, which also means the additional if-check would become useless. In my opinion, you could declare nsRefPtr<BluetoothService> as a member variable of BluetoothChild so you don't need to worry about memory free/allocate. Second, please add protection for RecvEnabled() as well.

There is a bug which is very similar to our case and Ben Turner left a comment which might be helpful. Please see https://bugzilla.mozilla.org/show_bug.cgi?id=767082 for more details. To be honest I'm not sure how this happened even after investigating related code. I'd like to propose that let's add if-protection and not change the null out behavior to delete. Leak is better than crash to me.
Attachment #8412539 - Flags: review?(echou) → review-
Attachment #8414264 - Attachment is obsolete: true
Attachment #8414264 - Flags: review?(echou)
Comment on attachment 8414267 [details] [diff] [review]
Bug 993954 - Check sBluetoothChild is null when closing Settings application

Review of attachment 8414267 [details] [diff] [review]:
-----------------------------------------------------------------

Shawn, sorry but I have to r- this. In this patch, you add the protection for sBluetoothService, which may be good, but definitely will not solve the original issue.
Attachment #8414267 - Flags: review?(echou) → review-
Target Milestone: 1.4 S6 (25apr) → 2.0 S1 (9may)
(In reply to Eric Chou [:ericchou] [:echou] from comment #24)
> Comment on attachment 8414267 [details] [diff] [review]
> Bug 993954 - Check sBluetoothChild is null when closing Settings application
> 
> Review of attachment 8414267 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Shawn, sorry but I have to r- this. In this patch, you add the protection
> for sBluetoothService, which may be good, but definitely will not solve the
> original issue.
ok. I extend my ETA a bit since currently I don't have better solution.
(In reply to Preeti Raghunath(:Preeti) from comment #9)
> Jason that's referring to the chipset in question. Per triage yes it blocks
> QC.

Hi Preeti, 
Can you elaborate on the triage decision on why this blocks?  There's no clear repro on this, and the settings app should be recoverable in this case.
Flags: needinfo?(praghunath)
Group: qualcomm-confidential
Comment on attachment 8414267 [details] [diff] [review]
Bug 993954 - Check sBluetoothChild is null when closing Settings application

ok. After discussed with Shawn, I decided to accept his protection patch. Let's see if QC side could reproduce this after this patch applied.

Shawn, please keep investigating, and please open a followup for this once the patch is proven to work by QC's QA.
Attachment #8414267 - Flags: review- → review+
Hi Tapas,

Please apply the r+ patch and see if this can still be reproduced. Thanks.
Flags: needinfo?(tkundu)
(In reply to Michael Vines [:m1] [:evilmachines] from comment #31)
> (In reply to Eric Chou [:ericchou] [:echou] from comment #29)
> > ... QC ...
> > 
> > ... QC's ...
> 
> We are CAF, not QC :)

ok, sure.
Marking 2 comments as private, rather than the whole bug; we should aim to have as few closed bugs as possible.

Landed as:
remote:   https://hg.mozilla.org/integration/b2g-inbound/rev/b908160de1ac
Group: qualcomm-confidential
Keywords: checkin-needed
(In reply to Ed Morley [:edmorley UTC+0] from comment #33)
> Marking 2 comments as private, rather than the whole bug; we should aim to
> have as few closed bugs as possible.
> 
> Landed as:
> remote:   https://hg.mozilla.org/integration/b2g-inbound/rev/b908160de1ac

I will ask our team to test with this patch and update here soon
https://hg.mozilla.org/mozilla-central/rev/b908160de1ac
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Whiteboard: [b2g-crash] → [b2g-crash][qa-]
Flags: needinfo?(praghunath)
When I try to reproduce this bug, I don't ever see |BluetoothServiceChildProcess::NoteDeadActor| being executed.

How do I make this happen?
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #37)
> When I try to reproduce this bug, I don't ever see
> |BluetoothServiceChildProcess::NoteDeadActor| being executed.
> 
> How do I make this happen?
Comment 15 mentioned NoteDeadActor (will be executed) might not correct. But on Nexus 4, I added logs in |~BluetoothChild()|, the destructor will be called via killing app (long press Home button and kill).
I don't have any luck reproducing this.
Shawn

Next steps here. Do we need anything from partners to get ahead?
Flags: needinfo?(shawnjohnjr)
Recently partners run auto tests, so I think let's wait for their feedback if they can hit the problem again.
Flags: needinfo?(shawnjohnjr)
We didn't see it on v1.4 for long time in our stability run. It seems like it is fixed in v1.4
Flags: needinfo?(tkundu)
(In reply to Tapas Kumar Kundu from comment #42)
> We didn't see it on v1.4 for long time in our stability run. It seems like
> it is fixed in v1.4

Thanks!
Not enough information with current STR to write test case to address bug.
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(rmead)
See Also: → 1013748
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(rmead)
Flags: in-moztrap-
You need to log in before you can comment on or make changes to this bug.