Closed Bug 1000961 Opened 10 years ago Closed 10 years ago

Make DBusReplyHandler use thread-safe ref-counting

Categories

(Firefox OS Graveyard :: Bluetooth, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:2.0+, firefox29 wontfix, firefox30 fixed, firefox31 fixed, firefox32+ verified, b2g-v1.2 fixed, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 fixed, b2g-v2.0 fixed)

RESOLVED FIXED
2.0 S1 (9may)
blocking-b2g 2.0+
Tracking Status
firefox29 --- wontfix
firefox30 --- fixed
firefox31 --- fixed
firefox32 + verified
b2g-v1.2 --- fixed
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: ekramer, Assigned: shawnjohnjr)

References

()

Details

(Keywords: regression, sec-high, smoketest, Whiteboard: [adv-main30-])

Attachments

(5 files, 5 obsolete files)

467.54 KB, text/plain
Details
132.48 KB, text/plain
Details
237.52 KB, application/vnd.tcpdump.pcap
Details
886 bytes, patch
abillings
: sec-approval+
Details | Diff | Splinter Review
863 bytes, patch
Details | Diff | Splinter Review
Description:
Making your phone "visible to all" through Settings > Bluetooth will cause the phone to become locked into the animated fox screen. 

User was unable to recover, resetting the phone through normal means (powering off or power cycling) did not work. Removing the battery and restarting the device places the user back in an infinite animated fox state.  

Repro Steps:
1) Update an Open_C to BuildID: 20140424040201
2) Open "Settings".
3) Open "Bluetooth" setting.
4) Turn on Bluetooth.
5) Activate the "Visible to all" option.

Actual:
The device crashes / resets to the animated fox screen. User cannot turn off, or power cycle, their phone. 

Expected:
Users are able to activate the Bluetooth options on their phone. 

Environmental Variables:
Device: Open_C Master MOZ
BuildID: 20140424040201
Gaia: d25852a189c9707b144eb5f82d08384eb066c0fd
Gecko: c8055a00235d
Version: 31.0a1
Firmware Version: P821A10-ENG_20140410


Keywords:
Bluetooth, Settings, Animated Fox

Notes:
We reproduced this issue on both an Open_C and a Buri, on Master Builds.


Repro frequency: 3/3 device, 100%
See attached: Will attach video shortly.
This issue does not reproduce on Aurora build on Open_C.

1.4 Environmental Variables:
Device: Open_C 1.4 MOZ
BuildID: 20140424000203
Gaia: b6a41ee9e2934fe9692da6fb1cb310cb759e9870
Gecko: 2e9abf89c087
Version: 30.0a2
Firmware Version: P821A10-ENG_20140410
Summary: [B2G][Open_C] Device resets to animated fox after making device "visible to all" in Bluetooth settings. → [B2G][Open_C] Device resets to animated fox after turning on "Bluetooth" in Bluetooth settings.
blocking-b2g: --- → 2.0?
Component: Runtime → Bluetooth
We need to get a logcat attached to this bug and/or a crash report. 

This could also be related the recent bug that was fixed - Bug 994411.
Attached file logcat
I'm on the same gecko and gaia version (possibly different base image) and I'm not seeing the issue:

Gaia      d25852a189c9707b144eb5f82d08384eb066c0fd
Gecko     https://hg.mozilla.org/mozilla-central/rev/c8055a00235d
BuildID   20140424040201
Version   31.0a1
ro.build.version.incremental=eng.cuizhaohui.20140410.144352
ro.build.date=Thu Apr 10 14:45:37 CST 2014
QA Contact: jzimbrick
Added URL for video.
Removed [Open_C] from the title due to this issue affecting Buri as well.
Summary: [B2G][Open_C] Device resets to animated fox after turning on "Bluetooth" in Bluetooth settings. → [B2G] Device resets to animated fox after turning on "Bluetooth" in Bluetooth settings.
Note: This window is using non-engineering builds, as we are blocked from finding a regression window with engineering builds by bug 1000617, so the pushlog is larger than normal windows. 

b2g-inbound Regression Window:

Last Working Environmental Variables:
Device: Buri
BuildID: 20140424023002
Gaia: 4a4a078768423426c7c0fc580c8642bc735033a5
Gecko: 91946180b1e7
Version: 31.0a1
Base Image: v1.2-device.cfg

First Broken Environmental Variables:
Device: Buri
BuildID: 20140424053001
Gaia: 4b6c9d4929c57005ab142e8eb3dc6783d55c427c
Gecko: c30df002e626
Version: 31.0a1
Base Image: v1.2-device.cfg

Last Working  Gaia / First Broken Gecko: Issue DOES occur.
Gaia: 4a4a078768423426c7c0fc580c8642bc735033a5
Gecko: c30df002e626

First Broken Gaia / Last Working Gecko: Issue does NOT occur.
Gaia: 4b6c9d4929c57005ab142e8eb3dc6783d55c427c
Gecko: 91946180b1e7

Gecko Pushlog: http://hg.mozilla.org/integration/b2g-inbound/pushloghtml?fromchange=91946180b1e7&tochange=c30df002e626
bug 993286 looks very suspicious for causing this regression.

Ben - Can you confirm? If so, can you backout?
Blocks: 993286
Flags: needinfo?(btian)
Bug 993286 only touches files for bluedroid, which are not built on Buri (BlueZ). Let me check this problem first and bring more info later.
Crash reason:  SIGSEGV
Crash address: 0x0

Thread 4 (crashed)
 0  libxul.so!mozilla::ipc::DBusReplyHandler::Release() [DBusUtils.h : 64 + 0x4]
     r4 = 0x4408f380    r5 = 0x40331080    r6 = 0x437ffd60    r7 = 0x437ffdf0
     r8 = 0x437ffd58    r9 = 0x437ffd88   r10 = 0x00000000    fp = 0x00000000
     sp = 0x437ffd10    lr = 0x4088beb3    pc = 0x40f98de0
    Found by: given as instruction pointer in context
 1  libxul.so!ProcessRemainingDeviceAddressesTask::~ProcessRemainingDeviceAddressesTask [nsAutoPtr.h : 894 + 0x3]
     r4 = 0x44018e90    r5 = 0x44018e90    r6 = 0x437ffd60    r7 = 0x437ffdf0
     r8 = 0x437ffd58    r9 = 0x437ffd88   r10 = 0x00000000    fp = 0x00000000
     sp = 0x437ffd28    pc = 0x412ccb6f
    Found by: call frame info
 2  libxul.so!ProcessRemainingDeviceAddressesTask::~ProcessRemainingDeviceAddressesTask [BluetoothDBusService.cpp:3b166b8add93 : 2722 + 0x5]
     r4 = 0x44018e90    r5 = 0x44018e90    r6 = 0x437ffd60    r7 = 0x437ffdf0
     r8 = 0x437ffd58    r9 = 0x437ffd88   r10 = 0x00000000    fp = 0x00000000
     sp = 0x437ffd30    pc = 0x412ccb91
    Found by: call frame info
 3  libxul.so!MessageLoop::RunTask(Task*) [message_loop.cc:3b166b8add93 : 358 + 0x7]
     r4 = 0x437ffde4    r5 = 0x44018e90    r6 = 0x437ffd60    r7 = 0x437ffdf0
     r8 = 0x437ffd58    r9 = 0x437ffd88   r10 = 0x00000000    fp = 0x00000000
     sp = 0x437ffd38    pc = 0x40d5e16d
    Found by: call frame info
 4  libxul.so!MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) [message_loop.cc:3b166b8add93 : 365 + 0x5]
     r4 = 0x00000001    r5 = 0x437ffd50    r6 = 0x437ffd60    r7 = 0x437ffdf0
     r8 = 0x437ffd58    r9 = 0x437ffd88   r10 = 0x00000000    fp = 0x00000000
     sp = 0x437ffd48    pc = 0x40d5eec7
    Found by: call frame info
gaia="73b5d6a8773aa048054119bf5b3ca0d005b5494e"
gecko="154dafc7f594b4845466d3c68f3b116536949eef"
Add Thomas for checking ipc part.
Hello Thomas,
Can you help to check in parallel? This bug seems to be consistently happened.
Flags: needinfo?(tzimmermann)
Oh, recent change on DBusReplyHandler is Bug 999884 - Make DBusReplyHandler not use mozilla::RefCounted; r=smaug.
Flags: needinfo?(tzimmermann)
Attachment #8412354 - Attachment description: dmp_result.txt → stack trace
Flags: needinfo?(btian)
No longer blocks: 993286
Blocks: 999884
Keywords: crash, reproducible
Hi Shawn

(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #14)
> Add Thomas for checking ipc part.
> Hello Thomas,
> Can you help to check in parallel? This bug seems to be consistently
> happened.

I think this might again be caused by non-threadsafe ref-counting. The DBus reply handler gets allocated on the main thread near BluetoothDBusService.cpp:2775 and free'd on the I/O thread near BluetoothDBusService.cpp:2731.

Can you replace NS_INLINE_DECL_REFCOUNTING at DBusUtils.h:64 by NS_INLINE_DECL_THREADSAFE_REFCOUNTING and test again?
Blocks: 993286
No longer blocks: 999884
Keywords: crash, reproducible
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #17)
> Hi Shawn
> 
> (In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #14)
> > Add Thomas for checking ipc part.
> > Hello Thomas,
> > Can you help to check in parallel? This bug seems to be consistently
> > happened.
> 
> I think this might again be caused by non-threadsafe ref-counting. The DBus
> reply handler gets allocated on the main thread near
> BluetoothDBusService.cpp:2775 and free'd on the I/O thread near
> BluetoothDBusService.cpp:2731.
> 
> Can you replace NS_INLINE_DECL_REFCOUNTING at DBusUtils.h:64 by
> NS_INLINE_DECL_THREADSAFE_REFCOUNTING and test again?

You are right. Changing to threadsafe version can fix it.
Assignee: nobody → shuang
Summary: [B2G] Device resets to animated fox after turning on "Bluetooth" in Bluetooth settings. → Make DBusReplyHandler use thread-safe ref-counting
How odd. Bug 999884 didn't change refcounting from threadsafe to non-threadsafe.
Sounds like we need some fix to all the branches using DBusReplyHandler.
Comment on attachment 8412459 [details] [diff] [review]
Bug 1000961 - Make DBusReplyHandler use thread-safe ref-counting

Review of attachment 8412459 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks.
Attachment #8412459 - Flags: feedback?(tzimmermann) → feedback+
Attachment #8412459 - Flags: review?(echou) → review+
Non-threadsafe refcounting bugs are potentially security sensitive, please request sec-approval before landing.
Group: core-security
Keywords: checkin-neededsec-high
(In reply to Olli Pettay [:smaug] from comment #20)
> How odd. Bug 999884 didn't change refcounting from threadsafe to
> non-threadsafe.
> Sounds like we need some fix to all the branches using DBusReplyHandler.

It isn't that odd, one of the main reasons Ehsan is converting things away from RefCounted is that it has no threadsafety assertions.
Blocks: 999884
blocking-b2g: 2.0? → 2.0+
(In reply to Olli Pettay [:smaug] from comment #20)
> How odd. Bug 999884 didn't change refcounting from threadsafe to
> non-threadsafe.

Yeah this bug was there before my patch, it was just masked.

> Sounds like we need some fix to all the branches using DBusReplyHandler.

Indeed.
(In reply to Andrew McCreight [:mccr8] from comment #24)
> (In reply to Olli Pettay [:smaug] from comment #20)
> > How odd. Bug 999884 didn't change refcounting from threadsafe to
> > non-threadsafe.
> > Sounds like we need some fix to all the branches using DBusReplyHandler.
> 
> It isn't that odd,
It is still a bit odd. We don't crash when using non-threadsafe RefCounted but do crash with
non-threadsafe xpcom refcounting.
Object layout ends up being a bit different so perhaps that is enough to mask the issue
in RefCounted case.
(In reply to Olli Pettay [:smaug] from comment #26)
> (In reply to Andrew McCreight [:mccr8] from comment #24)
> > (In reply to Olli Pettay [:smaug] from comment #20)
> > > How odd. Bug 999884 didn't change refcounting from threadsafe to
> > > non-threadsafe.
> > > Sounds like we need some fix to all the branches using DBusReplyHandler.
> > 
> > It isn't that odd,
> It is still a bit odd. We don't crash when using non-threadsafe RefCounted
> but do crash with
> non-threadsafe xpcom refcounting.

The "crash" is just us hitting the MOZ_CRASH which checks whether you're on the right thread when Release-ing the object.  Note that thread-safety assertions are on on non-debug builds on Nightly: <http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsISupportsImpl.h#46>

> Object layout ends up being a bit different so perhaps that is enough to
> mask the issue
> in RefCounted case.

No, the thing that masks the issue on RefCounted is that it lacks any thread safety assertions whatsoever, so you would race with other threads while manipulating the refcount without anyone catching you, and if you lose the race you will end up with a use after free.
Shawn, have you verified that this object is thread-safe in places other than the refcounting?
Flags: needinfo?(shuang)
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #27)
> The "crash" is just us hitting the MOZ_CRASH which checks whether you're on
> the right thread when Release-ing the object.  Note that thread-safety
> assertions are on on non-debug builds on Nightly:
> <http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsISupportsImpl.
> h#46>
Aha! we have that MOZ_CRASH enabled in b2g nighties, but not in Firefox nighties because the check is just
too slow and disturbs profiling FF nightlies. B2G stuff needs special build anyway for proper profiling.
Flags: needinfo?(shuang)
I cannot add :echou, :tzimmermann in cc list. Can anyone help?
It looks to me like they are already in the CC list for this bug.
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #28)
> Shawn, have you verified that this object is thread-safe in places other
> than the refcounting?

I have checked the use of all objects which are instances of DBusReplyHandler or its subclass such as BluetoothArrayOfDevicePropertiesReplyHandler. All of those are not used outside IO thread.

To be more careful, ni? Thomas, who has revised this part of code for many times, to confirm.
Flags: needinfo?(tzimmermann)
Comment on attachment 8412477 [details] [diff] [review]
Bug 1000961 - Make DBusReplyHandler use thread-safe ref-counting, r=echou, f=tzimmermann

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

Hit use-after-delete when turning on bluetooth due to non-thread safe ref counting.

Which older supported branches are affected by this flaw?
Only master branch.
If not all supported branches, which bug introduced the flaw?
Bug 999884 reveals the potential non-thread safe ref counting problem.
Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?
This only happens on master branch.
How likely is this patch to cause regressions; how much testing does it need?
No regression, just change to xpcom thread-safe ref-counting.
Attachment #8412477 - Flags: sec-approval?
> To be more careful, ni? Thomas, who has revised this part of code for many
> times, to confirm.

The overall code should be thread safe. The reason why this bug exists is that I somehow assumed ref-counting to be thread-safe as well.

Almost all calls to libdbus are on the I/O thread and calls to libdbus acquire an internal lock, so you cannot run two of them concurrently.

For our data structures, I spend a lot of time to make sure that objects are only accessed on a single thread, or are owned by only a single thread at a time. The exception are places like the one we just fixed, where ownership changes. All data structures are build upon DBusReplyHandler, so they should all get fixed by the attached patch.
Flags: needinfo?(tzimmermann)
> [Security approval request comment]
> How easily could an exploit be constructed based on the patch?
> 
> Hit use-after-delete when turning on bluetooth due to non-thread safe ref
> counting.
> 
> Which older supported branches are affected by this flaw?

We only see this on master, but the bug was introduced by bug 884840, which landed ~9 month ago. All active branches should be affected.

> Do you have backports for the affected branches? If not, how different, hard
> to create, and risky will they be?

Trivial.
Group: b2g-core-security
Comment on attachment 8412477 [details] [diff] [review]
Bug 1000961 - Make DBusReplyHandler use thread-safe ref-counting, r=echou, f=tzimmermann

sec-approval+ for trunk.
Attachment #8412477 - Flags: sec-approval? → sec-approval+
https://hg.mozilla.org/mozilla-central/rev/c19c41feef86
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Please request Aurora/Beta approval on this patch.
Flags: needinfo?(shuang)
Target Milestone: --- → 2.0 S1 (9may)
Verified Fixed on the latest trunk - enabling BT does not cause the crash.

Open_C master m-c
BuildID: 20140430040206
Gaia: badc73ee7f108fa631150bded0cc57e92aad810e
Gecko: e19812f56952
Version: 32.0a1
P821A10-ENG_20140410
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #35)
> > [Security approval request comment]
> > How easily could an exploit be constructed based on the patch?
> > 
> > Hit use-after-delete when turning on bluetooth due to non-thread safe ref
> > counting.
> > 
> > Which older supported branches are affected by this flaw?
> 
> We only see this on master, but the bug was introduced by bug 884840, which
> landed ~9 month ago. All active branches should be affected.
> 
Hi Thomas,
Based on Comment 27, Refcounted is not thread-safe. I think we need to land patch of bug 999884 first and land the patch (attachment 8412477 [details] [diff] [review]) of bug 1000961. Because DBusReplyHandler still uses Refcounted class on other branches, right?
Flags: needinfo?(shuang) → needinfo?(tzimmermann)
Otherwise, can we just roll that change into a branch patch here?
> Based on Comment 27, Refcounted is not thread-safe. I think we need to land
> patch of bug 999884 first and land the patch (attachment 8412477 [details] [diff] [review]
> [diff] [review]) of bug 1000961. Because DBusReplyHandler still uses
> Refcounted class on other branches, right?

Inheriting from |AtomicRefCounted| instead of |RefCounted| should work.
Flags: needinfo?(tzimmermann)
So IIUC, we can land the trunk patch on Aurora (since bug 999884 landed there already), and we'll want a Beta patch with comment 43 addressed. And barring any unforeseen issues, the Beta patch should uplift to the other B2G release branches per the usual security uplift policy.

Shawn, you'll still need to do the Aurora/Beta approvals when ready.
Attachment #8417790 - Attachment description: Bug 1000961 - Make DBusReplyHandler use thread-safe ref-counting → Bug 1000961 - Make DBusReplyHandler use thread-safe ref-counting (v1.3)
Attachment #8417790 - Flags: review?(echou)
I've verified with v.1.3 beta patch (Attachment #8417790 [details] [diff]) on debug build, and did some basic tests, looks fine.
Attachment #8417790 - Flags: review?(echou) → review+
Attachment #8417790 - Attachment description: Bug 1000961 - Make DBusReplyHandler use thread-safe ref-counting (v1.3) → Bug 1000961 - Make DBusReplyHandler use thread-safe ref-counting (v1.3/1.4)
Comment on attachment 8418535 [details] [diff] [review]
Bug 1000961 - Make DBusReplyHandler use thread-safe ref-counting, r=echou

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 884840
User impact if declined:
Non-thread safe ref counting could cause crash
Testing completed (on m-c, etc.): on 1.3
Risk to taking this patch (and alternatives if risky): No risk
String or IDL/UUID changes made by this patch: Nothing changed
Attachment #8418535 - Flags: approval-mozilla-aurora?
Comment on attachment 8418535 [details] [diff] [review]
Bug 1000961 - Make DBusReplyHandler use thread-safe ref-counting, r=echou

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 884840
User impact if declined:
Non-thread safe ref counting could cause crash
Testing completed (on m-c, etc.): on 1.3
Risk to taking this patch (and alternatives if risky): No risk
String or IDL/UUID changes made by this patch: Nothing changed
Attachment #8418535 - Flags: approval-mozilla-beta?
Attachment #8418535 - Flags: approval-mozilla-beta?
Attachment #8418535 - Flags: approval-mozilla-beta+
Attachment #8418535 - Flags: approval-mozilla-aurora?
Attachment #8418535 - Flags: approval-mozilla-aurora+
Group: b2g-core-security
Whiteboard: [adv-main30-]
Group: core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: