Closed Bug 1130271 Opened 5 years ago Closed 5 years ago

Crash in /system/bin/bluetoothd @ bt-core-io.c:485 (alarm_event)

Categories

(Firefox OS Graveyard :: Bluetooth, defect)

x86
Gonk (Firefox OS)
defect
Not set

Tracking

(blocking-b2g:2.2+, b2g-v2.2 fixed, b2g-master fixed)

RESOLVED FIXED
2.2 S6 (20feb)
blocking-b2g 2.2+
Tracking Status
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: m1, Assigned: tzimmermann)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 510][caf priority: p2][CR 792141])

Attachments

(6 files, 5 obsolete files)

464.36 KB, text/plain
Details
7.01 KB, patch
Details | Diff | Splinter Review
468.39 KB, text/plain
Details
7.36 KB, patch
Details | Diff | Splinter Review
61 bytes, text/x-github-pull-request
tzimmermann
: review+
Details | Review
65 bytes, text/x-github-pull-request
tzimmermann
: review+
Details | Review
Attached file bluetoothd tombstone
A crash has been observed in bluetoothd during monkey testing w/ no better STR at the moment.

This has only been observed on a L-based v2.2 builds.
First observed on B2G v2.2 SHA1s from Mozilla build ID 20150201002504.

backtrace:
    #00 pc 0000212c  <unknown> 
    #01 pc 00002677  /system/bin/bluetoothd!bt-core-io.c:485 (alarm_event)
    #02 pc 00000bad  /system/lib/libfdio.so!loop.c:198 (epoll_loop_iteration)
    #03 pc 00000d83  /system/lib/libfdio.so!loop.c:220 (epoll_loop)
    #04 pc 00001075  /system/bin/bluetoothd!main.c:178 (main)
    #05 pc 0000f079  /system/lib/libc.so!libc_init_dynamic.cpp:113 (__libc_init)
    #06 pc 00001128  /system/bin/bluetoothd!service.c:0 (_start)
Whiteboard: [CR 792141]
Whiteboard: [CR 792141] → [caf priority: p2][CR 792141]
It's related to a change on L's Bluedroid API.
I don't see anything that is wrong in the code. The stack trace and line numbers seem to be off from the actual code. Could you attach the bluetoothd binary?
Whiteboard: [caf priority: p2][CR 792141] → [b2g-crash][caf-crash 510][caf priority: p2][CR 792141]
Keywords: crash
Observed on: 

Device: 
Gonk Version: AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.063
Moz BuildID: 20150204002509
B2G Version: ---
Gecko Version: 37.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=a4c4cc86303a554facb8f45b7e764e5c4473c3de
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=5523db61e51c5263191d5df2f4b40f78c82e8746
Patches: bug 1091307, bug 1067629, bug 1106637, bug 1117862
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #2)
> I don't see anything that is wrong in the code. The stack trace and line
> numbers seem to be off from the actual code. Could you attach the bluetoothd
> binary?

Maybe you checked master instead of the v2.2 branch?  The source code can be found at:
* https://git.mozilla.org/?p=b2g/platform_system_bluetoothd.git;a=tree;h=refs/heads/v2.2
* https://git.mozilla.org/?p=b2g/platform_system_libfdio.git;a=tree;h=refs/heads/v2.2

I think the static alarm_event_* functions may be getting optimized, which is making the function names in the backtrace a little fuzzy.  The file names/line numbers seem good to me though when I read up the backtrace.

Perhaps alarm_state->cb is corrupt at https://git.mozilla.org/?p=b2g/platform_system_bluetoothd.git;a=blob;f=src/bt-core-io.c;h=1668763769ba39b1e36f97ad38b56566192cd49b;hb=refs/heads/v2.2#l485, somehow?
Flags: needinfo?(mvines)
Maybe we shall use this version stack and have a try.
<project name="platform/external/bluetooth/bluedroid" path="external/bluetooth/bluedroid" revision="e3d1b03b45b062c58cc6fd9999ab847ef698468b" upstream="LF.BR.1.2.1">
We've been seeing this crash roughly once a day, so if there's a debug patch that you'd like me to locally add to the build here that would extract more data the next time it reproduces please LMK.
Oh, I checked the master branch indeed. Sorry about that.

It's the call to Bluedroid's timeout handler. That's either |bt_alarm_cb| or |timer_callback|. The former doesn't use the data argument, so we either call a wrong address from bluetoothd or there's another bug within Bluedroid. The latter does use the argument to run it's own internal callback mechanism.

Line 6 of the attached file says

> signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), fault addr 0xb6f0b12c

According to [1], SEGV_ACCERR means that memory is accessed with incorrect permissions. The fault address looks like it's part of the heap. It looks like Bluedroid tries to read/write from incorrectly mapped memory.

Michael, can you find out which function contains the pc 0000212c?

[1] http://lxr.free-electrons.com/source/include/uapi/asm-generic/siginfo.h#L201
Flags: needinfo?(mvines)
Hi Michael,
I add some logs in the attached patch. Would you please help to use it to dump logs for me?
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #8)
> Michael, can you find out which function contains the pc 0000212c?

Looks like none, based on the "memory map: (fault address prefixed with --->)" section of attachment 8560266 [details]: 
    b6fab000-b6facfff r--     8192  /system/lib/libc.so
    b6fad000-b6faffff rw-    12288  /system/lib/libc.so
--->b6fb0000-b6fb8fff rw-    36864  
    b6fb9000-b6fb9fff rw-     4096  [anon:linker_alloc]
    b6fba000-b6fbafff r--     4096  [anon:linker_alloc]


(In reply to Bruce Sun [:brsun] from comment #9)
> I add some logs in the attached patch. Would you please help to use it to
> dump logs for me?

Thanks Bruce, I've added the patch in the tree here.  I'll share the new tombstone when this reproduces, hopefully in 24 hours from now.
Flags: needinfo?(mvines)
> > Michael, can you find out which function contains the pc 0000212c?
> 
> Looks like none, based on the "memory map: (fault address prefixed with
> --->)" section of attachment 8560266 [details]: 
>     b6fab000-b6facfff r--     8192  /system/lib/libc.so
>     b6fad000-b6faffff rw-    12288  /system/lib/libc.so
> --->b6fb0000-b6fb8fff rw-    36864  
>     b6fb9000-b6fb9fff rw-     4096  [anon:linker_alloc]
>     b6fba000-b6fbafff r--     4096  [anon:linker_alloc]
> 

Thanks. I just noticed that PC, R0 and R3 contain the same value. So we are either trying to execute heap memory or read/write instructions. From the memory map, I'd say the former.
Crash reproduced with the patch at attachment 8561308 [details] [diff] [review] applied.  From the end of the tombstone:
---
01-01 00:21:39.995  3821  3821 I bluetoothd: [xxx] alarm_event(fd:29, events:1, data:0xb898ebf8) enters
01-01 00:21:39.995  3821  3821 I bluetoothd: [xxx] alarm_event_in(fd:29, data:0xb898ebf8) enters
01-01 00:21:39.995  3821  3821 I bluetoothd: [xxx] alarm_event_in(alarm_state:0xb898ebf8, alarm_state->cb:0xb6cb468d, alarm_state->data:0xb6d739f0) enters
01-01 00:21:39.995  3821  3821 I bluetoothd: [xxx] alarm_event_in(fd:29, data:0xb898ebf8) before calling alarm callback(cb:0xb6cb468d, data:0xb6d739f0)
01-01 00:21:39.995  3821  3821 I bluetoothd: [xxx] alarm_event_in(fd:29, data:0xb898ebf8) after calling alarm callback(cb:0xb6cb468d, data:0xb6d739f0)
01-01 00:21:39.995  3821  3821 I bluetoothd: [xxx] alarm_event_in(fd:29, data:0xb898ebf8) leaves
01-01 00:21:39.996  3821  3821 I bluetoothd: [xxx] alarm_event(fd:29, events:1, data:0xb898ebf8) leaves (res:0)
01-01 00:21:39.996  3821  3821 I bluetoothd: [xxx] alarm_event(fd:29, events:1, data:0xb898ebf8) enters
01-01 00:21:39.996  3821  3821 I bluetoothd: [xxx] alarm_event_in(fd:29, data:0xb898ebf8) enters
01-01 00:21:39.996  3821  3821 I bluetoothd: [xxx] alarm_event_in(alarm_state:0xb898ebf8, alarm_state->cb:0xb6f4912c, alarm_state->data:0xb6f4912c) enters
01-01 00:21:39.996  3821  3821 I bluetoothd: [xxx] alarm_event_in(fd:29, data:0xb898ebf8) before calling alarm callback(cb:0xb6f4912c, data:0xb6f4912c)
01-01 00:21:40.007  3821  3834 I bluetoothd: [xxx] acquire_wake_lock_cb(lock_name:bluedroid_timer) enters
01-01 00:21:40.007  3821  3834 I bluetoothd: [xxx] acquire_wake_lock_cb(lock_name:bluedroid_timer) leaves
---
(In reply to Michael Vines [:m1] [:evilmachines] from comment #12)
> Created attachment 8561909 [details]
> tombstone-ad9576ac-87ee-45bc-9853-16631f3f9284.txt
> 
> Crash reproduced with the patch at attachment 8561308 [details] [diff] [review]
> [review] applied.  From the end of the tombstone:
> ---
Thanks Michael, this really helps a lot.
Hi Michael,

It seems our usage of |timerfd| might need to be improved. Logs show that one redundant |alarm_event_in()| has been called. I've added |read()| to consume 64-bits data after each |alarm_event_in()| has been triggered. Hope the redundant callback can be avoided by doing so.

Would you help to try this new patch to see if it helps or not?
Flags: needinfo?(mvines)
(In reply to Bruce Sun [:brsun] from comment #14)
> Created attachment 8561928 [details] [diff] [review]
> bug1130271_bluetoothd_alarm_state_crash_debug_log.v2.patch
> 
> Hi Michael,
> 
> It seems our usage of |timerfd| might need to be improved. Logs show that
> one redundant |alarm_event_in()| has been called. I've added |read()| to
> consume 64-bits data after each |alarm_event_in()| has been triggered. Hope
> the redundant callback can be avoided by doing so.
> 
> Would you help to try this new patch to see if it helps or not?

Oh, I think I know what's going on! in |alarm_event_in|, we remove the timer without reading it first, but return IO_OK. libfdio will not update it's internal epoll results. So we get the same timeout again (the fd is still readable). We need to return IO_POLL from that function to fix the bug.
We also need to consider that |remove_timer_cb()| is not executed immediately when |remove_timer()| returns.
Right, that's a problem as well.

Let us move the tasks out of the timer functions and into bluetoothd. The rest of libfdio works the same way and we don't actually need the task when removing a timer, because we're already within the I/O loop.
Shall I take that bug?
Feel free to take this bug. :)
OK.

Michael, could you please still try the 'read fix' while we're preparing a patch for the problem? Thanks!
I've replaced 8561308 with attachment 8561928 [details] [diff] [review] in the tree here, if the crash doesn't reproduce after a couple test cycles (~2 days from now) then we can probably be confident that the v2 patch helps.
Flags: needinfo?(mvines)
PSA: comment 22 aligns with comment 12.  That is, AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.069 does not contain attachment 8561928 [details] [diff] [review].
Call |add_fd_to_epoll_loop| directly in |add_timer|. Remove |remove_timer|.
Attachment #8562681 - Flags: feedback?(tzimmermann)
Remove |timerfd| directly after the timer callback has been triggered once.
Attachment #8562682 - Flags: feedback?(tzimmermann)
Comment on attachment 8562681 [details] [review]
https://github.com/mozilla-b2g/platform_system_libfdio/pull/4

The current code is good, but needs more cleanup.
Attachment #8562681 - Flags: feedback?(tzimmermann)
Comment on attachment 8562682 [details] [review]
https://github.com/mozilla-b2g/platform_system_bluetoothd/pull/25

This should do the job, but I added comments  for possible improvements.
Attachment #8562682 - Flags: feedback?(tzimmermann)
This already looked good.
Blocks: 1132229
blocking-b2g: --- → 2.2?
Triage: blocking
Assignee: nobody → brsun
blocking-b2g: 2.2? → 2.2+
Attachment #8562681 - Flags: review?(tzimmermann)
Attachment #8562682 - Flags: review?(tzimmermann)
Attachment #8562681 - Flags: review?(tzimmermann) → review+
Comment on attachment 8562682 [details] [review]
https://github.com/mozilla-b2g/platform_system_bluetoothd/pull/25

r- because of the two comments in the review. Should be trivial to fix though. The rest of the patches look really good.
Attachment #8562682 - Flags: review?(tzimmermann) → review-
Attachment #8562682 - Attachment is obsolete: true
Attachment #8563873 - Flags: review?(tzimmermann)
Comment on attachment 8563873 [details] [review]
https://github.com/mozilla-b2g/platform_system_bluetoothd/pull/25

Looks good.

Thanks for fixing this problem.
Attachment #8563873 - Flags: review?(tzimmermann) → review+
Keywords: checkin-needed
Comment on attachment 8562681 [details] [review]
https://github.com/mozilla-b2g/platform_system_libfdio/pull/4

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 1119176
User impact if declined: bluetoothd and b2g both crash when the underlying bluetooth backend trys to set a wake alarm.
Testing completed: Manual testing locally.
Risk to taking this patch (and alternatives if risky): Low. The interface of |libfdio.so| is changed, but currently the only client that uses it is |bluetoothd|.
String or UUID changes made by this patch: N/A
Attachment #8562681 - Flags: approval-mozilla-b2g37?
Comment on attachment 8563873 [details] [review]
https://github.com/mozilla-b2g/platform_system_bluetoothd/pull/25

Please kindly refer to comment 33 for the request. Thanks.
Attachment #8563873 - Flags: approval-mozilla-b2g37?
Reverted for Werror bustage.
https://treeherder.mozilla.org/logviewer.html#?job_id=1348947&repo=b2g-inbound
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Strange, I compiled libfdio for Nexus-5-l, we don't see this error, even though we do have flag '-Wall -Werror'.
Comment on attachment 8564871 [details] [review]
Bug 1130271: Call |add_fd_to_epoll_loop| directly in |add_timer|. Remove |remove_timer|

I think I don't have to go through the review process, I only remove unused variable. I tested it on flame-kk and now it can build successfully. And probably need to figure out why we can't catch this error with nexus-5-l configuration.
Attachment #8564871 - Flags: review?(tzimmermann) → review+
Comment on attachment 8564871 [details] [review]
Bug 1130271: Call |add_fd_to_epoll_loop| directly in |add_timer|. Remove |remove_timer|

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 1119176
User impact if declined: bluetoothd and b2g both crash when the underlying bluetooth backend trys to set a wake alarm.
Testing completed: Manual testing locally.
Risk to taking this patch (and alternatives if risky): Low. The interface of |libfdio.so| is changed, but currently the only client that uses it is |bluetoothd|.
String or UUID changes made by this patch: N/A
Attachment #8564871 - Flags: approval-mozilla-b2g37?
Thanks Shawn for handling this so quickly. I built for nexus-5-l as well, but never saw an error message. Don't know why.
I can't reopen the platform_system_bluetooth pull request, so can you please create a new one so I can merge it?
Flags: needinfo?(brsun)
Keywords: checkin-needed
Attached file Github pull request for bluetoothd (obsolete) —
Reopened pull request for bluetoothd.
Attachment #8563873 - Attachment is obsolete: true
Attachment #8563873 - Flags: approval-mozilla-b2g37?
Flags: needinfo?(brsun)
Attachment #8565879 - Flags: review+
Comment on attachment 8565879 [details]
Github pull request for bluetoothd

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 1119176
User impact if declined: bluetoothd and b2g both crash when the underlying bluetooth backend trys to set a wake alarm.
Testing completed: Manual testing locally.
Risk to taking this patch (and alternatives if risky): Low. The interface of |libfdio.so| is changed, but currently the only client that uses it is |bluetoothd|.
String or UUID changes made by this patch: N/A
Attachment #8565879 - Flags: approval-mozilla-b2g37?
We're still hitting Werror bustage from this.
https://treeherder.mozilla.org/logviewer.html#?job_id=1372222&repo=b2g-inbound

I can't find you on IRC, so I'll be reverting if I don't hear from you Really Soon Now.
Flags: needinfo?(tzimmermann)
And backed out again. Please find a way to reproduce these failures locally and verify successful builds before requesting checkin again.

Master: https://github.com/mozilla-b2g/platform_system_libfdio/commit/85df9b5036010d9aa057c72cdfc7c86fc51f7c10

Master: https://github.com/mozilla-b2g/platform_system_bluetoothd/commit/582acf3a92fa2cd62ddf4a7cd98068e9c1749485
Status: RESOLVED → REOPENED
Flags: needinfo?(tzimmermann)
Resolution: FIXED → ---
Target Milestone: 2.2 S6 (20feb) → ---
waiting for relanding on m-c before approving on branch.
I think there may be a problem with vcs-sync here. According to the link below, platform_system_bluetoothd hasn't picked up an update in 8 days.
http://git.mozilla.org/?p=b2g/platform_system_bluetoothd.git;a=shortlog;h=refs/heads/master

Even though Github clearly shows newer commits:
https://github.com/mozilla-b2g/platform_system_bluetoothd/commits/

So maybe we should try re-landing comment 45 once that's sorted out.
Flags: needinfo?(hwine)
In fact, the repo has not been syncing because a non-fastward change was landed. Once that is fixed in github.com/mozilla-b2g/platform_system_bluetoothd the sync we restart.

I'll file a bug to improve notification on non-fastforward situations for the future, but this needs to be fixed in the upstream repo.
Flags: needinfo?(hwine)
It seems like someone might have force pushed to platform_system_bluetoothd. Non-fast forward changes are not permitted in repos that mirror to git.mozilla.org. Thomas, can you sort this out? Might need to rebase things to the last commit that git.mozilla.org saw.
Flags: needinfo?(tzimmermann)
(In reply to Michael Wu [:mwu] from comment #51)
> It seems like someone might have force pushed to platform_system_bluetoothd.
> Non-fast forward changes are not permitted in repos that mirror to
> git.mozilla.org. Thomas, can you sort this out? Might need to rebase things
> to the last commit that git.mozilla.org saw.

Yeah, that was me. I'll revert that. Sorry about this.
Flags: needinfo?(tzimmermann)
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #52)
> (In reply to Michael Wu [:mwu] from comment #51)
> > It seems like someone might have force pushed to platform_system_bluetoothd.
> > Non-fast forward changes are not permitted in repos that mirror to
> > git.mozilla.org. Thomas, can you sort this out? Might need to rebase things
> > to the last commit that git.mozilla.org saw.
> 
> Yeah, that was me. I'll revert that. Sorry about this.

Set Github repo to state of git.mozilla.org + recent fixes. The commit history should be OK now.
Taking this bug.
Assignee: brsun → tzimmermann
git.mozilla.org contains the recent changes. I'm going to prepare new pull request.
Trying...

  https://treeherder.mozilla.org/#/jobs?repo=try&revision=acb41b920f1a

It looks like nexus-5-l cannot be selected on try.
No device images can. But yeah, I bet the reason for the errors was that we were getting the libfdio changes but not the bluetoothd changes. Might also explain why you couldn't reproduce locally :-)
Attachment #8564871 - Attachment is obsolete: true
Attachment #8564871 - Flags: approval-mozilla-b2g37?
Attachment #8566492 - Flags: review+
Reopened pull request on top of fixed upstream repository.
Attachment #8566494 - Flags: review+
Attachment #8565879 - Attachment is obsolete: true
Attachment #8565879 - Flags: approval-mozilla-b2g37?
Comment on attachment 8566492 [details] [review]
Github pull request for libfdio

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 1119176
User impact if declined: bluetoothd and b2g both crash when the underlying bluetooth backend trys to set a wake alarm.
Testing completed: Manual testing locally.
Risk to taking this patch (and alternatives if risky): Low. The interface of |libfdio.so| is changed, but currently the only client that uses it is |bluetoothd|.
String or UUID changes made by this patch: N/A
Attachment #8566492 - Flags: approval-mozilla-b2g37?
Comment on attachment 8566494 [details] [review]
Github pull request for bluetoothd

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 1119176
User impact if declined: bluetoothd and b2g both crash when the underlying bluetooth backend trys to set a wake alarm.
Testing completed: Manual testing locally.
Risk to taking this patch (and alternatives if risky): Low. The interface of |libfdio.so| is changed, but currently the only client that uses it is |bluetoothd|.
String or UUID changes made by this patch: N/A
Attachment #8566494 - Flags: approval-mozilla-b2g37?
Try looks good and the repository's history has been fixed. Let's see if it works this time.

Ryan, sorry for this mess and creating additional workload on your side.
Keywords: checkin-needed
Green :)
Attachment #8566492 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
Attachment #8566494 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
I am really sorry that I didn't handle patches and pull requests well and made so many people suffering. Thanks for all the efforts from Ryan, Thomas, Shawn, and all others involved.
Don't worry. The git.mozilla.org incident was actually my fault, as I force-pushed into the bluetoothd repo without realizing that this would cause trouble.
No longer blocks: CAF-v3.0-FL-metabug
You need to log in before you can comment on or make changes to this bug.