Closed Bug 1013748 Opened 5 years ago Closed 5 years ago

dom::bluetooth::BluetoothChild::RecvNotify crashes during monkey testing

Categories

(Firefox OS Graveyard :: Bluetooth, defect, critical)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(blocking-b2g:2.0+, firefox32 wontfix, firefox33 wontfix, firefox34 fixed, b2g-v2.0 fixed, b2g-v2.1 fixed)

RESOLVED FIXED
2.1 S3 (29aug)
blocking-b2g 2.0+
Tracking Status
firefox32 --- wontfix
firefox33 --- wontfix
firefox34 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: tkundu, Assigned: echou)

References

Details

(Keywords: crash, regression, Whiteboard: [caf-crash 109][caf priority: p1][CR 708513][b2g-crash])

Crash Data

Attachments

(8 files, 1 obsolete file)

Attached file stack trace
STR:

1) Run monkey testing on device.
blocking-b2g: --- → 2.0?
Crash Signature: [@ ? | mozilla::dom::bluetooth::BluetoothChild::RecvNotify(mozilla::dom::bluetooth::BluetoothSignal const&) | mozilla::dom::bluetooth::PBluetoothChild::OnMessageReceived(IPC::Message const&) | mozilla::dom::PContentChild::OnMessageReceived(IPC::Message co…
blocking-b2g: 2.0? → 2.0+
Keywords: regression
Severity: normal → critical
Keywords: crash
Whiteboard: [ETA:5/30] → [ETA:5/30][b2g-crash]
Crash in Observer.h function Broadcast, while BluetoothService is doing Broadcast
  void Broadcast(const T& aParam) {
   ....
      mObservers[i]->Notify(aParam);
  }

BluetoothService::DistributeSignal(const BluetoothSignal& aSignal)
ol->Broadcast(aSignal);
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #3)
> https://bugzilla.mozilla.org/show_bug.cgi?id=860075#c7
> It looks like similar crash pattern.
But in bug 860075, crash address is 0x0, but now crash address is 0x5a5a5a62.
https://bugzilla.mozilla.org/show_bug.cgi?id=860075#c18
"For this specific case, the root cause is that Settings/Bluetooth app registers for same Bluetooth signal twice but when it's killed only one record is removed, and the dangling one causes SIGSEGV later."

I tried the same STR that bug 860075 used but I cannot reproduce it.
Whiteboard: [ETA:5/30][b2g-crash] → [ETA:6/10][b2g-crash]
Target Milestone: --- → 2.0 S4 (20june)
Whiteboard: [ETA:6/10][b2g-crash] → [ETA:6/20][b2g-crash]
Whiteboard: [ETA:6/20][b2g-crash] → [b2g-crash]
Target Milestone: 2.0 S4 (20june) → 2.0 S5 (4july)
update: Shawn keeps tracking this one. But the issue can't be reproduced and the root cause is hard to identified.
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #5)
> https://bugzilla.mozilla.org/show_bug.cgi?id=860075#c18
> "For this specific case, the root cause is that Settings/Bluetooth app
> registers for same Bluetooth signal twice but when it's killed only one
> record is removed, and the dangling one causes SIGSEGV later."
> 
> I tried the same STR that bug 860075 used but I cannot reproduce it.
I was wrong. It's different from bug 860075. The crash here happened on Settings application.
Shawn, can we add a debug patch around this area of code that may help gather with additional logging/asserts etc which might be helfulp with investigation ? Tapas, can try to apply that locally and see if we get any more information that can be helpful here.

Tapas, anyway you can help with more information here? How often do you hit this? Is there a particular pattern that we can try etc ?
Flags: needinfo?(tkundu)
Flags: needinfo?(shuang)
Sorry, I still need your help to pin out the root cause. I added extra logs, can you help to get logcat log while the problem occurs?
Flags: needinfo?(shuang)
shuang, can you confirm the exact device  spec that you are using? (may be flame + tuned to what memory ?)
Flags: needinfo?(shuang)
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #9)
> Created attachment 8449367 [details] [diff] [review]
> bug1013748-extra-log.patch
> 
> Sorry, I still need your help to pin out the root cause. I added extra logs,
> can you help to get logcat log while the problem occurs?

(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #9)
> Created attachment 8449367 [details] [diff] [review]
> bug1013748-extra-log.patch
> 
> Sorry, I still need your help to pin out the root cause. I added extra logs,
> can you help to get logcat log while the problem occurs?

This bug can be closed now. We haven't seen this in a couple weeks and if it reproduces then we'll re-open this bug and apply this patch to collect more info.

Thanks a lot for all debugging effort on this.
Flags: needinfo?(tkundu)
Based on comment 11, closing it for now.
Lets revisit when it reproduces again.
Thanks Shawn for your help.
Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(shuang)
Resolution: --- → WORKSFORME
We just saw this in again in AU 58 (cafbot should give build details later).
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Whiteboard: [b2g-crash] → [CR 708513][b2g-crash]
Whiteboard: [CR 708513][b2g-crash] → [caf priority: p1][CR 708513][b2g-crash]
Whiteboard: [caf priority: p1][CR 708513][b2g-crash] → [caf-crash 109][caf priority: p1][CR 708513][b2g-crash]
Observed on: 

Device: msm8610
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.058
Moz BuildID: 20140807000201
B2G Version: 2.0
Gecko Version: 32.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=8cc28fd31905a0ea2b2e15d13e80a0eab2feb1ba
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=f7bd772b1e42774708a4ede13b149a1706a59b25
Blocks: CAF-v2.0-CC-metabug
No longer blocks: CAF-v2.0-FC-metabug
Shawn, can you help with this ? Looks like CAF needs to apply the extra logging patch you have here for you to analyze further ? Can you please give an updated patch that applies to our latest gecko/gaia or is the one you have here good enough ?
Flags: needinfo?(shuang)
The original crash is from mozilla::dom::bluetooth::BluetoothService::DistributeSignal(mozilla::dom::bluetooth::BluetoothSignal const&) [Observer.h : 67 + 0x4] (https://bugzilla.mozilla.org/attachment.cgi?id=8426026), 
but now crash in 
libxul.so!mozilla::dom::bluetooth::BluetoothChild::RecvNotify(mozilla::dom::bluetooth::BluetoothSignal const&) [BluetoothChild.cpp : 79 + 0x3]

It seems to be different to me, and I found bug 1050114 is the same.
Flags: needinfo?(shuang)
Tapas,
Can you share the device information regarding total memory? If something is related to OOM and we probably reproduce this bug via simulating OOM.
And do you have logcat log for this crash? Based on EXTRA file attachment, crash time is 1407820967(Tue, 12 Aug 2014 05:22:47 GMT), and I don't see it in that EXTRA file.
Flags: needinfo?(tkundu)
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #19)
> Tapas,
> Can you share the device information regarding total memory?

This is a KK based QRD  FFOS 2.0 256MB msm8610 device
> If something is
> related to OOM and we probably reproduce this bug via simulating OOM.
> And do you have logcat log for this crash? Based on EXTRA file attachment,
> crash time is 1407820967(Tue, 12 Aug 2014 05:22:47 GMT), and I don't see it
> in that EXTRA file.
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #19)
> Tapas,
> Can you share the device information regarding total memory? If something is
> related to OOM and we probably reproduce this bug via simulating OOM.
> And do you have logcat log for this crash? Based on EXTRA file attachment,
> crash time is 1407820967(Tue, 12 Aug 2014 05:22:47 GMT), and I don't see it
> in that EXTRA file.

The full log can be found here:  https://drive.google.com/file/d/0B08jm6FUqZzkcnQ4NWtBcU5RaG8/edit?usp=sharing
Flags: needinfo?(tkundu)
Crash time is 1407820967. I guess this logcat log captured in India timezone (GMT+5). Crash time shall be (Tue, 12 Aug 2014 10:22:47). Can you confirm the timezone?
Flags: needinfo?(ggrisco)
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #22)
> Crash time is 1407820967. I guess this logcat log captured in India timezone
> (GMT+5). Crash time shall be (Tue, 12 Aug 2014 10:22:47). Can you confirm
> the timezone?

I checked the logcat, from 08-12 10:22:41.910 to 08-12 10:23:07.530, only saw the following related logs.


08-12 10:22:44.680 16899 16899 E GeckoConsole: Content JS WARN at app://settings.gaiamobile.org/shared/js/fxa_iac_client.js:89 in onMessage: No callbacks for method getAccounts
08-12 10:22:44.850 32548 32548 F libc    : heap corruption detected by dlfree
08-12 10:22:44.850 32548 32548 F libc    : Fatal signal 6 (SIGABRT) at 0x00007f24 (code=-6), thread 32548 (trag)
08-12 10:22:44.960   224   224 I DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
08-12 10:22:44.960   224   224 I DEBUG   : Build fingerprint: 'qcom/msm8610/msm8610:4.4.2/KVT49L/eng.lnxbuild.20140808.184405:userdebug/test-keys'
08-12 10:22:44.960   224   224 I DEBUG   : Revision: '0'
08-12 10:22:44.960   224   224 I DEBUG   : pid: 32548, tid: 32548, name: trag  >>> trag <<<
Flags: in-moztrap?(bzumwalt)
Current STR not usable to create a test case. Unnecessary to write test case at this time.
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(bzumwalt)
Flags: in-moztrap-
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage+][2.0-signoff-need+]
See Also: → 1050114, 993954
After checking the logcat log, I can only suspect Settings crashes during Bluetooth Discovery.
Can you apply this patch to print out more logs? Thanks.
Hi Shawn, sure I'll apply the patch, thanks.
Flags: needinfo?(ggrisco)
Flags: needinfo?(ggrisco)
(In reply to Greg Grisco from comment #27)
> Hi Shawn, sure I'll apply the patch, thanks.

This patch is built into AU66 and later.
Flags: needinfo?(ggrisco)
QA Whiteboard: [QAnalyst-Triage+][2.0-signoff-need+] → [QAnalyst-Triage+][lead-review+][2.0-signoff-need+]
Taking the issue since Shawn is working on other stuff.

I checked the log and walked through the related code. First of all, we know that Settings app crashed, which means it's not a b2g-crash since Settings app runs on its own process. There are two stack trace file attached on this issue (comment 0 & comment 15). Both of those pointed that the process crashed at

  bool
  BluetoothChild::RecvNotify(const BluetoothSignal& aSignal)
  {
    MOZ_ASSERT(sBluetoothService);

    if (sBluetoothService) {
      sBluetoothService->DistributeSignal(aSignal);     // Crashed here both time
    }
    return true;
  }

Shawn has already added the if-clause to prevent from the case which sBluetoothService has no value. Moreover, although frame 0 of these two stack trace are slightly different, "0x5a5a5a62" and "0x0" actually mean quite the same to me in both cases -- invalid memory. (In bug 1050114, the memory address becomes 0x2). So I think sBluetoothService had been deleted before sBluetoothService->DistributeSignal() got called.

Then I took a look at how the BluetoothServiceChildProcess object, which is created in BluetoothServiceChildProcess::Create(), be held in our code. There are two pointers holding this memory block:

  (1) StaticRefPtr<BluetoothService> sBluetoothService;    // in BluetoothService.cpp
  (2) BluetoothServiceChildProcess* sBluetoothService;     // in BluetoothChild.cpp

They both point to the same BluetoothServiceChildProcess object, therefore I guess the problem was that (2) may still get used when (1) is set to a new memory block or nullptr (so that the original one would be released). This patch modifies (2) to use StaticRefPtr instead. 

Thomas and Shawn, would you mind taking a look at my patch? Actually I don't know if this is *the* solution, but it seems to be the right way to go.

Thanks!
Assignee: shuang → echou
Attachment #8476596 - Flags: review?(tzimmermann)
Attachment #8476596 - Flags: feedback?(shuang)
Observed on: 

Device: msm8610
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.068
Moz BuildID: 20140810160202
B2G Version: 2.0
Gecko Version: 32.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=de28796a8956a48bb98ca67df6a33e0622d642d1
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=2b27becae85092d46bfadcd4fb5605e82e1e1093
Hmm, if this is what the problem results from, I guess I will try to do some experiment tomorrow.
Comment on attachment 8476596 [details] [diff] [review]
patch 1: v1: Use StaticRefPtr to hold BluetoothService objects

Review of attachment 8476596 [details] [diff] [review]:
-----------------------------------------------------------------

The patch itself is OK, but I'm worried that we fix something we don't really understand.
Attachment #8476596 - Flags: review?(tzimmermann) → review+
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #32)
> Comment on attachment 8476596 [details] [diff] [review]
> patch 1: v1: Use StaticRefPtr to hold BluetoothService objects
> 
> Review of attachment 8476596 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> The patch itself is OK, but I'm worried that we fix something we don't
> really understand.

Well, I've tried quite hard to figure out the root cause by reproducing it but I had no luck. To be honest, I tend to not to spend more time on it if the patch can actually fix the problem. I think I'll still check in this patch and see if this happens again in later tests. :)

Thanks!
Hi Greg,

I would like to know if my patch actually works. I've heard that you're going to start another run of test. Would you mind taking my patch (see comment 29) before running the test?
Flags: needinfo?(ggrisco)
The StaticRefPtr needs to be cleared on shutdown.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #35)
> The StaticRefPtr needs to be cleared on shutdown.
Hi Kyle,
I guess you're referring to bug 1047757, we shall also uplift that bug to v2.0?
Flags: needinfo?(khuey)
You don't need to uplift that bug.  Just don't add new leaks in your fix for this one ;)
Flags: needinfo?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #37)
> You don't need to uplift that bug.  Just don't add new leaks in your fix for
> this one ;)

It seems like shwan debugging patch has some memleak issue We are waiting to see new patch which is memleak free. Please suggest.
Flags: needinfo?(ggrisco) → needinfo?(shuang)
* Updated by Kyle's comment. Ensure it'll be cleaned up on shutdown.
Attachment #8476596 - Attachment is obsolete: true
Attachment #8476596 - Flags: feedback?(shuang)
Attachment #8478063 - Flags: review?(khuey)
Flags: needinfo?(shuang)
Hi Tapas,

Please use the patch attached on comment 39. The patch includes the original fix and the cleanup.
Flags: needinfo?(tkundu)
Comment on attachment 8478063 [details] [diff] [review]
(v2.0) patch 1: v2: Use StaticRefPtr to hold BluetoothService objects & ensure it'll be cleaned up on shutdown, r=tzimmermann, r=khuey

Review of attachment 8478063 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/bluetooth/BluetoothService.cpp
@@ +742,4 @@
>    }
>  
>    sBluetoothService = service;
> +  ClearOnShutdown(&sBluetoothService);

We don't really need this on the branch, but it won't hurt.
Attachment #8478063 - Flags: review?(khuey) → review+
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #41)
> Comment on attachment 8478063 [details] [diff] [review]
> patch 1: v2: Use StaticRefPtr to hold BluetoothService objects & ensure
> it'll be cleaned up on shutdown
> 
> Review of attachment 8478063 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: dom/bluetooth/BluetoothService.cpp
> @@ +742,4 @@
> >    }
> >  
> >    sBluetoothService = service;
> > +  ClearOnShutdown(&sBluetoothService);
> 
> We don't really need this on the branch, but it won't hurt.

Got it. Thanks!
Comment on attachment 8478063 [details] [diff] [review]
(v2.0) patch 1: v2: Use StaticRefPtr to hold BluetoothService objects & ensure it'll be cleaned up on shutdown, r=tzimmermann, r=khuey

* This patch is for 2.0
Attachment #8478063 - Attachment description: patch 1: v2: Use StaticRefPtr to hold BluetoothService objects & ensure it'll be cleaned up on shutdown → (v2.0) patch 1: v2: Use StaticRefPtr to hold BluetoothService objects & ensure it'll be cleaned up on shutdown
Comment on attachment 8478063 [details] [diff] [review]
(v2.0) patch 1: v2: Use StaticRefPtr to hold BluetoothService objects & ensure it'll be cleaned up on shutdown, r=tzimmermann, r=khuey

>From 2c030d9d6e63d050561f09dab0f1b3aecf8e196b Mon Sep 17 00:00:00 2001
>From: Eric Chou <echou@mozilla.com>
>Date: Thu, 21 Aug 2014 19:31:39 +0800
>Subject: [PATCH] Bug 1013748 - Use StaticRefPtr to hold BluetoothService
> objects
>
>---
> dom/bluetooth/BluetoothService.cpp   | 3 +++
> dom/bluetooth/ipc/BluetoothChild.cpp | 6 +++++-
> 2 files changed, 8 insertions(+), 1 deletion(-)
>
>diff --git a/dom/bluetooth/BluetoothService.cpp b/dom/bluetooth/BluetoothService.cpp
>index 6155de6..69812b6 100644
>--- a/dom/bluetooth/BluetoothService.cpp
>+++ b/dom/bluetooth/BluetoothService.cpp
>@@ -20,6 +20,7 @@
> #include "BluetoothUtils.h"
> 
> #include "jsapi.h"
>+#include "mozilla/ClearOnShutdown.h"
> #include "mozilla/Services.h"
> #include "mozilla/StaticPtr.h"
> #include "mozilla/unused.h"
>@@ -741,6 +742,8 @@ BluetoothService::Get()
>   }
> 
>   sBluetoothService = service;
>+  ClearOnShutdown(&sBluetoothService);
>+
>   return sBluetoothService;
> }
> 
>diff --git a/dom/bluetooth/ipc/BluetoothChild.cpp b/dom/bluetooth/ipc/BluetoothChild.cpp
>index 079984f..3a1e2ad 100644
>--- a/dom/bluetooth/ipc/BluetoothChild.cpp
>+++ b/dom/bluetooth/ipc/BluetoothChild.cpp
>@@ -9,6 +9,8 @@
> #include "BluetoothChild.h"
> 
> #include "mozilla/Assertions.h"
>+#include "mozilla/ClearOnShutdown.h"
>+#include "mozilla/StaticPtr.h"
> #include "nsDebug.h"
> #include "nsISupportsImpl.h"
> #include "nsThreadUtils.h"
>@@ -17,11 +19,12 @@
> #include "BluetoothService.h"
> #include "BluetoothServiceChildProcess.h"
> 
>+using namespace mozilla;
> USING_BLUETOOTH_NAMESPACE
> 
> namespace {
> 
>-BluetoothServiceChildProcess* sBluetoothService;
>+StaticRefPtr<BluetoothServiceChildProcess> sBluetoothService;
> 
> } // anonymous namespace
> 
>@@ -37,6 +40,7 @@ BluetoothChild::BluetoothChild(BluetoothServiceChildProcess* aBluetoothService)
>   MOZ_ASSERT(aBluetoothService);
> 
>   sBluetoothService = aBluetoothService;
>+  ClearOnShutdown(&sBluetoothService);
> }
> 
> BluetoothChild::~BluetoothChild()
>-- 
>1.8.3.2
>
Attachment #8478063 - Attachment description: (v2.0) patch 1: v2: Use StaticRefPtr to hold BluetoothService objects & ensure it'll be cleaned up on shutdown → (v2.0) patch 1: v2: Use StaticRefPtr to hold BluetoothService objects & ensure it'll be cleaned up on shutdown, r=tzimmermann, r=khuey
Attachment #8478063 - Flags: review+ → feedback?(shuang)
Comment on attachment 8478063 [details] [diff] [review]
(v2.0) patch 1: v2: Use StaticRefPtr to hold BluetoothService objects & ensure it'll be cleaned up on shutdown, r=tzimmermann, r=khuey

Review of attachment 8478063 [details] [diff] [review]:
-----------------------------------------------------------------

The patch itself is correct but honestly the real root cause is unclear for me. :-| Let's see the test results.
Attachment #8478063 - Flags: feedback?(shuang) → feedback+
(In reply to Eric Chou [:ericchou] [:echou] from comment #40)
> Hi Tapas,
> 
> Please use the patch attached on comment 39. The patch includes the original
> fix and the cleanup.

We are running same test with your patch from comment 39 . Thanks .
https://hg.mozilla.org/integration/b2g-inbound/rev/ecb763372316

Thanks to your review effort, Thomas, Kyle and Shawn.
https://hg.mozilla.org/mozilla-central/rev/ecb763372316
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: 2.0 S5 (4july) → 2.1 S3 (29aug)
We are not seeing this issue after landing Fix from Comment 47. We will create new bug if we see this again !
Flags: needinfo?(tkundu)
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #51)
> We are not seeing this issue after landing Fix from Comment 47. We will
> create new bug if we see this again !

Thank you, Tapas!
You need to log in before you can comment on or make changes to this bug.