Closed Bug 1005934 Opened 10 years ago Closed 9 years ago

Move bluedroid to a separate process

Categories

(Firefox OS Graveyard :: Bluetooth, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(feature-b2g:2.2+)

RESOLVED FIXED
2.2 S4 (23jan)
feature-b2g 2.2+

People

(Reporter: bkelly, Assigned: tzimmermann)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167])

+++ This bug was initially created as a clone of Bug #1002917 +++

Spinning this off based on bug 1002917 comment 34.  It seems on android bluedroid is run from a separate process with reduced privileges.  We should probably do the same both for improved security and to avoid different behavior as in bug 1002917.
See Also: → 1002917
I should mention I kind of just jumped to the conclusion about moving to a separate process based on Michael's comment.  If there are other ways to address this, lets just rename this bug.
Hi

(In reply to Ben Kelly [:bkelly] from comment #1)
> I should mention I kind of just jumped to the conclusion about moving to a
> separate process based on Michael's comment.  If there are other ways to
> address this, lets just rename this bug.

I also have been thinking about moving Bluedroid to a separate process. It's a matter of security, but also of fault isolation and overall code quality. Bluedroid's internals are quite complex and it seems better to not run the code in an isolated process. 

If we do that, we'll need an IPC mechanism for talking to the process. One of my ideas was to use the BlueZ backend to talk to the Bluedroid daemon. Doing that could save us some maintenance overhead. The BlueZ backend uses DBus to talk to the BlueZ daemon, and most of it's IPC primitives seem to have an equivalent in the Bluedroid API. So the Bluedroid daemon would implement a very similar DBus interface, but use libbluedroid internally.

One of the major issues is that Bluetooth sockets with BlueZ are created by calling |socket|, while Bluedroid provides a separate API for that. I think we'd need to open the socket in the Bluedroid process and transfer the socket file descriptor to the B2G process. That should be possible with DBus AFAIK, but adds some extra complexity on the Gecko side.
More ammunition for why we need to do this sooner rather than later:

* https://www.codeaurora.org/cgit/quic/la/platform/external/bluetooth/bluedroid/tree/stack/btu/btu_hcif.c?h=b2g_kk_3.5#n1667
* https://www.codeaurora.org/cgit/quic/la/platform/external/bluetooth/bluedroid/tree/btif/src/btif_dm.c?h=b2g_kk_3.5#n1813

For those who don't want to follow the link, in certain error conditions Bluedroid will |kill(getpid(), SIGKILL);| to /recover/.  That would probably be fine if it wasn't running in the main b2g process.   We don't even get a crash report in this case, the device UI just "reboots" mysteriously.
(In reply to Michael Vines [:m1] [:evilmachines] from comment #3)
> More ammunition for why we need to do this sooner rather than later:
> 
> *
> https://www.codeaurora.org/cgit/quic/la/platform/external/bluetooth/
> bluedroid/tree/stack/btu/btu_hcif.c?h=b2g_kk_3.5#n1667
> *
> https://www.codeaurora.org/cgit/quic/la/platform/external/bluetooth/
> bluedroid/tree/btif/src/btif_dm.c?h=b2g_kk_3.5#n1813
> 
> For those who don't want to follow the link, in certain error conditions
> Bluedroid will |kill(getpid(), SIGKILL);| to /recover/.  That would probably
> be fine if it wasn't running in the main b2g process.   We don't even get a
> crash report in this case, the device UI just "reboots" mysteriously.

Hi Michael,
Thanks for raising up with bluedroid auto-recovery topic. It looks like this kind of error happens when HCI command timeout due to mysterious Controller firmware problem or maybe UART or SMD interface got stuck? So HCI commands never have chances to be acked from Bluetooth Controller. How can bluedroid library directly send SIGKILL to recover this situation if it can be considered as a HW Error?

Moreover, in Android, SIGKILL is sent to Bluetooth app, it only restarts to re-allocated HAL again, register all callbacks again, will that really fix original HW error problem? Or it will also automatically try to turn on/off bluetooth again to reset Controller?
If the whole point is to reset Bluetooth Controller, why can we just simply turn off and on again?
Flags: needinfo?(mvines)
https://www.codeaurora.org/cgit/quic/la/platform/external/bluetooth/bluedroid/tree/main/bte_main.c?h=b2g_kk_3.5#n325

Ok, so if we enable bluetooth again bluedroid will reset bluetooth power first.
Can we get a new callback when bluedroid internal event BTA_DM_HW_ERROR_EVT received without sending SIGKILL?
In this way, we can cleanup and re-init bluedroid stack from Gecko, and after everything bluedroid internal restarts, we can call enable API from Gecko to let bluedroid itself clean reset-BT chipset (as mentioned in bte_main.c). In the long run, we need to move bluedroid to a separate process.
If BlueZ's architecture is better than BlueDroid's, would it make sense to instead turn into the direction of BlueZ for Android? I recently read https://lwn.net/Articles/597293/ on it and it seems pretty active.
HA! This article has come up several time now.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #6)
> If BlueZ's architecture is better than BlueDroid's, would it make sense to
> instead turn into the direction of BlueZ for Android? I recently read
> https://lwn.net/Articles/597293/ on it and it seems pretty active.
This is something we're studying recently.
Maybe we can do something similar, like this: https://kernel.googlesource.com/pub/scm/bluetooth/bluez/+/5.10/android/hal-ipc-api.txt

Then we can write standalone bluedroid daemon which calls original bluedroid library to talk with Android HAL library via Android HAL protocol. So Gecko side does not need to change much, and probably we can use the same bluedroid bluetooth.default library to use bluedroid and BlueZ 5.18 at the same time.
We also run into the kill() from btif_dm.c during Wifi subsystem restart with BT active.  This shows up as a 'HW error' from the Bluedroid POV.  

(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #5)
> Can we get a new callback when bluedroid internal event BTA_DM_HW_ERROR_EVT
> received without sending SIGKILL?
> In this way, we can cleanup and re-init bluedroid stack from Gecko, and
> after everything bluedroid internal restarts, we can call enable API from
> Gecko to let bluedroid itself clean reset-BT chipset (as mentioned in
> bte_main.c). 

That might work as a bandaid assuming bluedroid can cleanly reinit itself.

> In the long run, we need to move bluedroid to a separate process.
 
Yep
Flags: needinfo?(mvines)
Whiteboard: [cr 667167]
(In reply to Michael Vines [:m1] [:evilmachines] from comment #9)
> We also run into the kill() from btif_dm.c during Wifi subsystem restart
> with BT active.  This shows up as a 'HW error' from the Bluedroid POV.  
> 
When will Wifi subsystem restart BT active and trigger 'HW error'?

> (In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #5)
> > Can we get a new callback when bluedroid internal event BTA_DM_HW_ERROR_EVT
> > received without sending SIGKILL?
> > In this way, we can cleanup and re-init bluedroid stack from Gecko, and
> > after everything bluedroid internal restarts, we can call enable API from
> > Gecko to let bluedroid itself clean reset-BT chipset (as mentioned in
> > bte_main.c). 
> 
> That might work as a bandaid assuming bluedroid can cleanly reinit itself.
Actually, this is as the same as Android, on Android it sends SIGKILL to Bluetooth application and restart Bluetooth application. Everything will be re-initialized and since settings database sets BT=on, Bluetooth will turn on again to reset Controller and reload firmware.

> 
> > In the long run, we need to move bluedroid to a separate process.
>  
> Yep
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #10)
> (In reply to Michael Vines [:m1] [:evilmachines] from comment #9)
> > We also run into the kill() from btif_dm.c during Wifi subsystem restart
> > with BT active.  This shows up as a 'HW error' from the Bluedroid POV.  
> > 
> When will Wifi subsystem restart BT active and trigger 'HW error'?

In a couple different cases.  Not that v1.3 with BlueZ doesn't have this issue of course :-/

> Actually, this is as the same as Android, on Android it sends SIGKILL to
> Bluetooth application and restart Bluetooth application.

Well in our case b2g is the "bluetooth application" that gets restarted, so it's not really the same.
(In reply to Michael Vines [:m1] [:evilmachines] from comment #11)
> (In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #10)
> > (In reply to Michael Vines [:m1] [:evilmachines] from comment #9)
> > > We also run into the kill() from btif_dm.c during Wifi subsystem restart
> > > with BT active.  This shows up as a 'HW error' from the Bluedroid POV.  
> > > 
> > When will Wifi subsystem restart BT active and trigger 'HW error'?
> 
> In a couple different cases.  Not that v1.3 with BlueZ doesn't have this
> issue of course :-/
> 
> > Actually, this is as the same as Android, on Android it sends SIGKILL to
> > Bluetooth application and restart Bluetooth application.
> 
> Well in our case b2g is the "bluetooth application" that gets restarted, so
> it's not really the same.
Maybe my statement is confused, what I'm trying to say that if we can get an 'extra callback' from bluedroid and we can achieve the same recovery mechanism, and sending SIGKILL is not necessary for v1.4. So I want to know if this is feasible to do in order to avoid killing b2g process and restore bt.
(In reply to Shawn Huang [:shuang] [:shawnjohnjr] from comment #12)
> Maybe my statement is confused, what I'm trying to say that if we can get an
> 'extra callback' from bluedroid and we can achieve the same recovery
> mechanism, and sending SIGKILL is not necessary for v1.4. So I want to know
> if this is feasible to do in order to avoid killing b2g process and restore
> bt.

Maybe.  The point is that this would again deviate from the manner in which Bluedroid is used in android and potentially expose us to different bugs that don't manifest in android.
Moving out of 2.0 as this is a pretty big amount of work.  I think 2.2 is a safer bet based on the current planned BT work.
blocking-b2g: 2.0? → backlog
Flags: needinfo?(echou)
Whiteboard: [cr 667167] → [caf priority: p3][cr 667167]
Assignee: nobody → tzimmermann
Status: NEW → ASSIGNED
Depends on: 1027029
Depends on: 1031122
As per discussions with Bhavana, lets try to do it for 2.1
Blocks: CAF-v2.1-FC-metabug
No longer blocks: CAF-v2.0-FC-metabug
blocking-b2g: backlog → 2.1?
(In reply to Inder from comment #15)
> As per discussions with Bhavana, lets try to do it for 2.1

How about discussing this in public? I'm working on this issue almost full-time. It would have been nice to ask me before coming up with dates and version numbers.

I have already written a good part of the daemon, but it's not just about writing a separate daemon. We also need to refactor the whole Bluedroid backend. See the blocking bugs for progress on that.

There are also other major changes to the BT module underway. We shouldn't replace everything at once.

Dear Triagers,

As mentioned in comment 14, 2.2 is a safer bet. Please *don't* 2.1+ for now. This needs more discussion.
No longer blocks: CAF-v2.1-FC-metabug
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #16)
> (In reply to Inder from comment #15)
> > As per discussions with Bhavana, lets try to do it for 2.1
> 
> How about discussing this in public? I'm working on this issue almost
> full-time. It would have been nice to ask me before coming up with dates and
> version numbers.
> 
> I have already written a good part of the daemon, but it's not just about
> writing a separate daemon. We also need to refactor the whole Bluedroid
> backend. See the blocking bugs for progress on that.
> 
> There are also other major changes to the BT module underway. We shouldn't
> replace everything at once.
> 
> Dear Triagers,
> 
> As mentioned in comment 14, 2.2 is a safer bet. Please *don't* 2.1+ for now.
> This needs more discussion.

Agree. First of all, this is not a bug but a feature, which means it doesn't make sense to use blocking-b2g to track this item. Second, the 'moving to 2.2' decision described in comment 14 was made by Bruce and me. I measured Bluetooth team's resource for 2.0/2.1 releases and told Bruce that we won't be able to deliver this item in 2.1. (Thomas also gave a heads up in comment 16 which explains how much work we need to do. It's a lot of work.)

Please reconsider the nomination of 2.1 blocking. Welcome to contact with Thomas, Shawn and me to know more about this feature. Thanks.
Hi Inder

(In reply to Inder from comment #15)
> As per discussions with Bhavana, lets try to do it for 2.1

After a bit of a harsh answer from my side, I think you deserve a quick overview of what we're doing.

There is some prototype code for bluetoothd at [1]. It's incomplete and untested, but complete enough to give an idea of the final daemon.

For the communication between Gecko and bluetoothd, we intent to use the protocol defined by the BlueZ devs and used in BlueZ 5. [2] It maps all relevant interfaces of Bluedroid to a network protocol. A nice side effect is that we'll be able to replace Bluedroid by BlueZ 5 without having to change a single line of code. So vendors have more options to choose from.

The really hard part is the refactoring of our Bluedroid backend into an asynchronous design, and cleaning up the Bluedroid data types. By now we could simply call any Bluedroid interface and use the returned value directly. This won't work any longer with a separate bluetoothd, because we cannot block until we receive a result from the daemon. Refactoring our Gecko code is a major task and will take weeks to months, even without testing and bug fixing. And all the patches have to be ported to bluetooth2 as well. You should look at the blocking bugs to get an idea of the current status.

[1] https://github.com/tdz/bluetoothd
[2] https://git.kernel.org/cgit/bluetooth/bluez.git/tree/android/hal-ipc-api.txt
Depends on: 1038591
This does not block v2.1, see Comment 14, Comment 16, Comment 17.
blocking-b2g: 2.1? → ---
Depends on: 1048913
Whiteboard: [caf priority: p3][cr 667167] → [caf priority: p1][cr 667167]
Whiteboard: [caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Keywords: crash
Observed on: 

Device: msm8610
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.063
Moz BuildID: 20140810160202
B2G Version: ---
Gecko Version: 32.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=de28796a8956a48bb98ca67df6a33e0622d642d1
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=2b27becae85092d46bfadcd4fb5605e82e1e1093
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p3][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p1][cr 667167]
Whiteboard: [b2g-crash][caf-crash 269][caf priority: p1][cr 667167] → [b2g-crash][caf-crash 269][caf priority: p3][cr 667167]
(Wow, cafbot gone wild.  This should be fixed up here now.  Sry!)
Bug 1058366 is another example of why we should all want to get Bluedroid out of the b2g process ASAP.
See Also: → 1058366
Depends on: 1061126
Depends on: 1061489
Depends on: 1065335
Depends on: 1065336
Depends on: 1073548
Blocks: CAF-v2.2-metabug
No longer blocks: CAF-v2.1-FC-metabug
See Also: → 1076739
Depends on: 1084257
Depends on: 1087195
Depends on: 1087196
Depends on: 1087198
Depends on: 1091588
Depends on: 1095436
No longer depends on: 1091588
No longer depends on: 1031122
Depends on: 1095491
No longer depends on: 1084257
No longer depends on: 1087195, 1087196, 1087198
No longer depends on: 1095491
Depends on: 1095491
No longer depends on: 1095436
feature-b2g: --- → 2.2+
Summary: move bluedroid to a separate process → Move bluedroid to a separate process
QA Whiteboard: [COM=Bluetooth]
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
"Closing issue which has not been seen since 11/20/14 19:14"
cafbot gone wild?
Status: RESOLVED → REOPENED
Flags: needinfo?(ggrisco)
Resolution: WORKSFORME → ---
(In reply to Michael Vines [:m1] [:evilmachines] from comment #25)
> cafbot gone wild?

cafbot was just doing what it was told to do :)  Anyway, fixed this in our local crash database.
Flags: needinfo?(ggrisco)
QA Whiteboard: [COM=Bluetooth] → [COM=Bluetooth] [2.2-feature-qa+]
Flags: in-moztrap?(echang)
Flags: in-moztrap?(echang) → in-moztrap?(twen)
Target Milestone: --- → 2.2 S3 (9jan)
Target Milestone: 2.2 S3 (9jan) → 2.2 S4 (23jan)
Depends on: 1119746
Any update on this bug?
Resolved fixed as all dependent bugs are closed, thanks for everyone working so hard on this!
Status: REOPENED → RESOLVED
Closed: 10 years ago9 years ago
Resolution: --- → FIXED
I will run regression on this, please advise on the device to run on.
Hi Teri,

Flame-kk is what we use during development, but any other device with Bluedroid will do as well.
Thomas, 

Thanks, I will run on flame-kk first then on L. 
Update with status soon.
No longer blocks: CAF-v3.0-FL-metabug
You need to log in before you can comment on or make changes to this bug.