Closed
Bug 1005934
Opened 11 years ago
Closed 10 years ago
Move bluedroid to a separate process
Categories
(Firefox OS Graveyard :: Bluetooth, defect)
Tracking
(feature-b2g:2.2+)
People
(Reporter: bkelly, Assigned: tzimmermann)
References
Details
(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167])
+++ This bug was initially created as a clone of Bug #1002917 +++
Spinning this off based on bug 1002917 comment 34. It seems on android bluedroid is run from a separate process with reduced privileges. We should probably do the same both for improved security and to avoid different behavior as in bug 1002917.
Flags: needinfo?(echou)
Reporter | ||
Comment 1•11 years ago
|
||
I should mention I kind of just jumped to the conclusion about moving to a separate process based on Michael's comment. If there are other ways to address this, lets just rename this bug.
Assignee | ||
Comment 2•11 years ago
|
||
Hi
(In reply to Ben Kelly [:bkelly] from comment #1)
> I should mention I kind of just jumped to the conclusion about moving to a
> separate process based on Michael's comment. If there are other ways to
> address this, lets just rename this bug.
I also have been thinking about moving Bluedroid to a separate process. It's a matter of security, but also of fault isolation and overall code quality. Bluedroid's internals are quite complex and it seems better to not run the code in an isolated process.
If we do that, we'll need an IPC mechanism for talking to the process. One of my ideas was to use the BlueZ backend to talk to the Bluedroid daemon. Doing that could save us some maintenance overhead. The BlueZ backend uses DBus to talk to the BlueZ daemon, and most of it's IPC primitives seem to have an equivalent in the Bluedroid API. So the Bluedroid daemon would implement a very similar DBus interface, but use libbluedroid internally.
One of the major issues is that Bluetooth sockets with BlueZ are created by calling |socket|, while Bluedroid provides a separate API for that. I think we'd need to open the socket in the Bluedroid process and transfer the socket file descriptor to the B2G process. That should be possible with DBus AFAIK, but adds some extra complexity on the Gecko side.
Comment 3•11 years ago
|
||
More ammunition for why we need to do this sooner rather than later:
* https://www.codeaurora.org/cgit/quic/la/platform/external/bluetooth/bluedroid/tree/stack/btu/btu_hcif.c?h=b2g_kk_3.5#n1667
* https://www.codeaurora.org/cgit/quic/la/platform/external/bluetooth/bluedroid/tree/btif/src/btif_dm.c?h=b2g_kk_3.5#n1813
For those who don't want to follow the link, in certain error conditions Bluedroid will |kill(getpid(), SIGKILL);| to /recover/. That would probably be fine if it wasn't running in the main b2g process. We don't even get a crash report in this case, the device UI just "reboots" mysteriously.
Blocks: CAF-v2.0-FC-metabug
(In reply to Michael Vines [:m1] [:evilmachines] from comment #3)
> More ammunition for why we need to do this sooner rather than later:
>
> *
> https://www.codeaurora.org/cgit/quic/la/platform/external/bluetooth/
> bluedroid/tree/stack/btu/btu_hcif.c?h=b2g_kk_3.5#n1667
> *
> https://www.codeaurora.org/cgit/quic/la/platform/external/bluetooth/
> bluedroid/tree/btif/src/btif_dm.c?h=b2g_kk_3.5#n1813
>
> For those who don't want to follow the link, in certain error conditions
> Bluedroid will |kill(getpid(), SIGKILL);| to /recover/. That would probably
> be fine if it wasn't running in the main b2g process. We don't even get a
> crash report in this case, the device UI just "reboots" mysteriously.
Hi Michael,
Thanks for raising up with bluedroid auto-recovery topic. It looks like this kind of error happens when HCI command timeout due to mysterious Controller firmware problem or maybe UART or SMD interface got stuck? So HCI commands never have chances to be acked from Bluetooth Controller. How can bluedroid library directly send SIGKILL to recover this situation if it can be considered as a HW Error?
Moreover, in Android, SIGKILL is sent to Bluetooth app, it only restarts to re-allocated HAL again, register all callbacks again, will that really fix original HW error problem? Or it will also automatically try to turn on/off bluetooth again to reset Controller?
If the whole point is to reset Bluetooth Controller, why can we just simply turn off and on again?
Flags: needinfo?(mvines)
https://www.codeaurora.org/cgit/quic/la/platform/external/bluetooth/bluedroid/tree/main/bte_main.c?h=b2g_kk_3.5#n325
Ok, so if we enable bluetooth again bluedroid will reset bluetooth power first.
Can we get a new callback when bluedroid internal event BTA_DM_HW_ERROR_EVT received without sending SIGKILL?
In this way, we can cleanup and re-init bluedroid stack from Gecko, and after everything bluedroid internal restarts, we can call enable API from Gecko to let bluedroid itself clean reset-BT chipset (as mentioned in bte_main.c). In the long run, we need to move bluedroid to a separate process.
Comment 6•11 years ago
|
||
If BlueZ's architecture is better than BlueDroid's, would it make sense to instead turn into the direction of BlueZ for Android? I recently read https://lwn.net/Articles/597293/ on it and it seems pretty active.
Assignee | ||
Comment 7•11 years ago
|
||
HA! This article has come up several time now.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #6)
> If BlueZ's architecture is better than BlueDroid's, would it make sense to
> instead turn into the direction of BlueZ for Android? I recently read
> https://lwn.net/Articles/597293/ on it and it seems pretty active.
This is something we're studying recently.
Maybe we can do something similar, like this: https://kernel.googlesource.com/pub/scm/bluetooth/bluez/+/5.10/android/hal-ipc-api.txt
Then we can write standalone bluedroid daemon which calls original bluedroid library to talk with Android HAL library via Android HAL protocol. So Gecko side does not need to change much, and probably we can use the same bluedroid bluetooth.default library to use bluedroid and BlueZ 5.18 at the same time.
Comment 9•11 years ago
|
||
We also run into the kill() from btif_dm.c during Wifi subsystem restart with BT active. This shows up as a 'HW error' from the Bluedroid POV.
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #5)
> Can we get a new callback when bluedroid internal event BTA_DM_HW_ERROR_EVT
> received without sending SIGKILL?
> In this way, we can cleanup and re-init bluedroid stack from Gecko, and
> after everything bluedroid internal restarts, we can call enable API from
> Gecko to let bluedroid itself clean reset-BT chipset (as mentioned in
> bte_main.c).
That might work as a bandaid assuming bluedroid can cleanly reinit itself.
> In the long run, we need to move bluedroid to a separate process.
Yep
Flags: needinfo?(mvines)
Whiteboard: [cr 667167]
(In reply to Michael Vines [:m1] [:evilmachines] from comment #9)
> We also run into the kill() from btif_dm.c during Wifi subsystem restart
> with BT active. This shows up as a 'HW error' from the Bluedroid POV.
>
When will Wifi subsystem restart BT active and trigger 'HW error'?
> (In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #5)
> > Can we get a new callback when bluedroid internal event BTA_DM_HW_ERROR_EVT
> > received without sending SIGKILL?
> > In this way, we can cleanup and re-init bluedroid stack from Gecko, and
> > after everything bluedroid internal restarts, we can call enable API from
> > Gecko to let bluedroid itself clean reset-BT chipset (as mentioned in
> > bte_main.c).
>
> That might work as a bandaid assuming bluedroid can cleanly reinit itself.
Actually, this is as the same as Android, on Android it sends SIGKILL to Bluetooth application and restart Bluetooth application. Everything will be re-initialized and since settings database sets BT=on, Bluetooth will turn on again to reset Controller and reload firmware.
>
> > In the long run, we need to move bluedroid to a separate process.
>
> Yep
Comment 11•11 years ago
|
||
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #10)
> (In reply to Michael Vines [:m1] [:evilmachines] from comment #9)
> > We also run into the kill() from btif_dm.c during Wifi subsystem restart
> > with BT active. This shows up as a 'HW error' from the Bluedroid POV.
> >
> When will Wifi subsystem restart BT active and trigger 'HW error'?
In a couple different cases. Not that v1.3 with BlueZ doesn't have this issue of course :-/
> Actually, this is as the same as Android, on Android it sends SIGKILL to
> Bluetooth application and restart Bluetooth application.
Well in our case b2g is the "bluetooth application" that gets restarted, so it's not really the same.
(In reply to Michael Vines [:m1] [:evilmachines] from comment #11)
> (In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #10)
> > (In reply to Michael Vines [:m1] [:evilmachines] from comment #9)
> > > We also run into the kill() from btif_dm.c during Wifi subsystem restart
> > > with BT active. This shows up as a 'HW error' from the Bluedroid POV.
> > >
> > When will Wifi subsystem restart BT active and trigger 'HW error'?
>
> In a couple different cases. Not that v1.3 with BlueZ doesn't have this
> issue of course :-/
>
> > Actually, this is as the same as Android, on Android it sends SIGKILL to
> > Bluetooth application and restart Bluetooth application.
>
> Well in our case b2g is the "bluetooth application" that gets restarted, so
> it's not really the same.
Maybe my statement is confused, what I'm trying to say that if we can get an 'extra callback' from bluedroid and we can achieve the same recovery mechanism, and sending SIGKILL is not necessary for v1.4. So I want to know if this is feasible to do in order to avoid killing b2g process and restore bt.
Comment 13•11 years ago
|
||
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #12)
> Maybe my statement is confused, what I'm trying to say that if we can get an
> 'extra callback' from bluedroid and we can achieve the same recovery
> mechanism, and sending SIGKILL is not necessary for v1.4. So I want to know
> if this is feasible to do in order to avoid killing b2g process and restore
> bt.
Maybe. The point is that this would again deviate from the manner in which Bluedroid is used in android and potentially expose us to different bugs that don't manifest in android.
Comment 14•10 years ago
|
||
Moving out of 2.0 as this is a pretty big amount of work. I think 2.2 is a safer bet based on the current planned BT work.
blocking-b2g: 2.0? → backlog
Flags: needinfo?(echou)
Updated•10 years ago
|
Whiteboard: [cr 667167] → [caf priority: p3][cr 667167]
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → tzimmermann
Status: NEW → ASSIGNED
Assignee | ||
Comment 16•10 years ago
|
||
(In reply to Inder from comment #15)
> As per discussions with Bhavana, lets try to do it for 2.1
How about discussing this in public? I'm working on this issue almost full-time. It would have been nice to ask me before coming up with dates and version numbers.
I have already written a good part of the daemon, but it's not just about writing a separate daemon. We also need to refactor the whole Bluedroid backend. See the blocking bugs for progress on that.
There are also other major changes to the BT module underway. We shouldn't replace everything at once.
Dear Triagers,
As mentioned in comment 14, 2.2 is a safer bet. Please *don't* 2.1+ for now. This needs more discussion.
No longer blocks: CAF-v2.1-FC-metabug
Comment 17•10 years ago
|
||
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #16)
> (In reply to Inder from comment #15)
> > As per discussions with Bhavana, lets try to do it for 2.1
>
> How about discussing this in public? I'm working on this issue almost
> full-time. It would have been nice to ask me before coming up with dates and
> version numbers.
>
> I have already written a good part of the daemon, but it's not just about
> writing a separate daemon. We also need to refactor the whole Bluedroid
> backend. See the blocking bugs for progress on that.
>
> There are also other major changes to the BT module underway. We shouldn't
> replace everything at once.
>
> Dear Triagers,
>
> As mentioned in comment 14, 2.2 is a safer bet. Please *don't* 2.1+ for now.
> This needs more discussion.
Agree. First of all, this is not a bug but a feature, which means it doesn't make sense to use blocking-b2g to track this item. Second, the 'moving to 2.2' decision described in comment 14 was made by Bruce and me. I measured Bluetooth team's resource for 2.0/2.1 releases and told Bruce that we won't be able to deliver this item in 2.1. (Thomas also gave a heads up in comment 16 which explains how much work we need to do. It's a lot of work.)
Please reconsider the nomination of 2.1 blocking. Welcome to contact with Thomas, Shawn and me to know more about this feature. Thanks.
Assignee | ||
Comment 18•10 years ago
|
||
Hi Inder
(In reply to Inder from comment #15)
> As per discussions with Bhavana, lets try to do it for 2.1
After a bit of a harsh answer from my side, I think you deserve a quick overview of what we're doing.
There is some prototype code for bluetoothd at [1]. It's incomplete and untested, but complete enough to give an idea of the final daemon.
For the communication between Gecko and bluetoothd, we intent to use the protocol defined by the BlueZ devs and used in BlueZ 5. [2] It maps all relevant interfaces of Bluedroid to a network protocol. A nice side effect is that we'll be able to replace Bluedroid by BlueZ 5 without having to change a single line of code. So vendors have more options to choose from.
The really hard part is the refactoring of our Bluedroid backend into an asynchronous design, and cleaning up the Bluedroid data types. By now we could simply call any Bluedroid interface and use the returned value directly. This won't work any longer with a separate bluetoothd, because we cannot block until we receive a result from the daemon. Refactoring our Gecko code is a major task and will take weeks to months, even without testing and bug fixing. And all the patches have to be ported to bluetooth2 as well. You should look at the blocking bugs to get an idea of the current status.
[1] https://github.com/tdz/bluetoothd
[2] https://git.kernel.org/cgit/bluetooth/bluez.git/tree/android/hal-ipc-api.txt
Comment 19•10 years ago
|
||
blocking-b2g: 2.1? → ---
Updated•10 years ago
|
Whiteboard: [caf priority: p3][cr 667167] → [caf priority: p1][cr 667167]
Updated•10 years ago
|
Whiteboard: [caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Comment 21•10 years ago
|
||
Observed on:
Device: msm8610
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.063
Moz BuildID: 20140810160202
B2G Version: ---
Gecko Version: 32.0
Gaia: http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=de28796a8956a48bb98ca67df6a33e0622d642d1
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=2b27becae85092d46bfadcd4fb5605e82e1e1093
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Updated•10 years ago
|
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Comment 22•10 years ago
|
||
(Wow, cafbot gone wild. This should be fixed up here now. Sry!)
Comment 23•10 years ago
|
||
Bug 1058366 is another example of why we should all want to get Bluedroid out of the b2g process ASAP.
See Also: → 1058366
Updated•10 years ago
|
Blocks: CAF-v2.1-FC-metabug
Assignee | ||
Updated•10 years ago
|
Assignee | ||
Updated•10 years ago
|
Updated•10 years ago
|
feature-b2g: --- → 2.2+
Updated•10 years ago
|
Summary: move bluedroid to a separate process → Move bluedroid to a separate process
Updated•10 years ago
|
QA Whiteboard: [COM=Bluetooth]
Updated•10 years ago
|
Blocks: FxOS-v2.2-features
Updated•10 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Comment 24•10 years ago
|
||
"Closing issue which has not been seen since 11/20/14 19:14"
Comment 25•10 years ago
|
||
cafbot gone wild?
Status: RESOLVED → REOPENED
Flags: needinfo?(ggrisco)
Resolution: WORKSFORME → ---
Comment 26•10 years ago
|
||
(In reply to Michael Vines [:m1] [:evilmachines] from comment #25)
> cafbot gone wild?
cafbot was just doing what it was told to do :) Anyway, fixed this in our local crash database.
Flags: needinfo?(ggrisco)
Updated•10 years ago
|
QA Whiteboard: [COM=Bluetooth] → [COM=Bluetooth] [2.2-feature-qa+]
Updated•10 years ago
|
Flags: in-moztrap?(echang)
Updated•10 years ago
|
Flags: in-moztrap?(echang) → in-moztrap?(twen)
Updated•10 years ago
|
Target Milestone: --- → 2.2 S3 (9jan)
Updated•10 years ago
|
Target Milestone: 2.2 S3 (9jan) → 2.2 S4 (23jan)
Comment 27•10 years ago
|
||
Any update on this bug?
Comment 28•10 years ago
|
||
Resolved fixed as all dependent bugs are closed, thanks for everyone working so hard on this!
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Comment 29•10 years ago
|
||
I will run regression on this, please advise on the device to run on.
Assignee | ||
Comment 30•10 years ago
|
||
Hi Teri,
Flame-kk is what we use during development, but any other device with Bluedroid will do as well.
Comment 31•10 years ago
|
||
Thomas,
Thanks, I will run on flame-kk first then on L.
Update with status soon.
Comment 32•10 years ago
|
||
2 regression suites + PTS.
[Sys PF]Settings - Bluetooth
https://moztrap.mozilla.org/manage/suites/?pagenumber=1&pagesize=20&sortfield=created_on&sortdirection=desc&filter-name=bluetooth&filter-name=settings+-+bluetooth
[BT]Bluetooth
https://moztrap.mozilla.org/manage/suites/?pagenumber=1&pagesize=20&sortfield=created_on&sortdirection=desc&filter-name=%5Bbt%5Dbluetooth
Flags: in-moztrap?(twen) → in-moztrap+
Blocks: CAF-v3.0-FL-metabug
No longer blocks: CAF-v3.0-FL-metabug
You need to log in
before you can comment on or make changes to this bug.
Description
•