Closed
Bug 1007766
Opened 11 years ago
Closed 11 years ago
Is wifi HAL being called on multiple threads?
Categories
(Firefox OS Graveyard :: Wifi, defect)
Tracking
(blocking-b2g:1.4+)
People
(Reporter: ggrisco, Assigned: vchang)
Details
(Keywords: crash, Whiteboard: [b2g-crash][p=13][ETA: 7/18])
Crash Data
Attachments
(10 files)
Steps to reproduce not available, but we have seen several crashes in wifi, coming from wifi_wait_on_socket and wifi_connect_on_socket during stability testing.
This recent crash:
[@ wpa_ctrl_recv | wifi_ctrl_recv | wifi_wait_on_socket | ICSWpaSupplicantImpl::do_wifi_wait_for_event(char const*, char*, unsigned int) ]
points to wifi_ctrl_recv() passing in invalid monitor_conn when calling wpa_ctrl_recv(), but you can see monitor_conn accessed ok in previous call to wpa_ctrl_get_fd(). monitor_conn, being static, looks like it might have been overwritten on another thread.
Reporter | ||
Comment 1•11 years ago
|
||
Updated•11 years ago
|
Reporter | ||
Comment 2•11 years ago
|
||
Another instance of similar crash is bug 1001897
Assignee | ||
Updated•11 years ago
|
Assignee: nobody → vchang
Assignee | ||
Comment 3•11 years ago
|
||
We have event and control threads call wifi/hal library, which means that we could have control thread calls wifi_close_supplicant_connection() and event thread calls wifi_wait_for_event() to obtain message from supplicant. We have provided a patch in Bug 986365 to sync these two operations in main thread.
Comment 4•11 years ago
|
||
Did we miss a case then perhaps? Bug 986365 is already a part if the v1.4 tree yet we seem to see similar symptoms.
Assignee | ||
Comment 5•11 years ago
|
||
(In reply to Michael Vines [:m1] [:evilmachines] from comment #4)
> Did we miss a case then perhaps? Bug 986365 is already a part if the v1.4
> tree yet we seem to see similar symptoms.
It's possible but a little difficult to imagine without having more information.
Can you give it a shot for the patch in Bug 1005775 ? That patch guarantees to close connection and stop supplicant after receiving CTRL-EVENT-TERMINATING. It's possible to lost CTRL-EVENT-TERMINATING if we send stop supplicant command too early.
Assignee | ||
Comment 6•11 years ago
|
||
I don't see the message "I/wpa_supplicant(13400): wlan0: CTRL-EVENT-TERMINATING" in the log. I am wondering if the wpa_supplicant sends terminating event when disabling wifi ? Can you help to confirm this ?
Reporter | ||
Comment 7•11 years ago
|
||
(In reply to Vincent Chang[:vchang] from comment #5)
> Can you give it a shot for the patch in Bug 1005775 ? That patch guarantees
I tried applying this patch bug wifi scan becomes flaky, sometimes not producing results and sometimes not allowing me to toggle wifi. And then sometimes I can get the UI to say it's connected, but then not be able to browse any pages using wifi.
(In reply to Vincent Chang[:vchang] from comment #6)
> I don't see the message "I/wpa_supplicant(13400): wlan0:
> CTRL-EVENT-TERMINATING" in the log. I am wondering if the wpa_supplicant
> sends terminating event when disabling wifi ? Can you help to confirm this ?
Yes, I see this in the log when I turn wifi off:
7027:I/wpa_supplicant( 2018): wlan0: CTRL-EVENT-DISCONNECTED bssid=00:0b:86:7c:18:41 reason=3 locally_generated=1
7102:I/wpa_supplicant( 2018): wlan0: CTRL-EVENT-TERMINATING
Assignee | ||
Comment 8•11 years ago
|
||
(In reply to Greg Grisco from comment #7)
> (In reply to Vincent Chang[:vchang] from comment #5)
> > Can you give it a shot for the patch in Bug 1005775 ? That patch guarantees
>
> I tried applying this patch bug wifi scan becomes flaky, sometimes not
> producing results and sometimes not allowing me to toggle wifi. And then
> sometimes I can get the UI to say it's connected, but then not be able to
> browse any pages using wifi.
Can you provide the wifi log after applying the patch ? You can turn on wifi log in developer options in settings.
Reporter | ||
Comment 9•11 years ago
|
||
Reporter | ||
Comment 10•11 years ago
|
||
Reporter | ||
Comment 11•11 years ago
|
||
I just uploaded minidump/extra/logs for a similar crash we just saw with signature:
[@ ? | pthread_kill | ... | __stack_chk_fail | wifi_connect_on_socket_path | ICSWpaSupplicantImpl::do_wifi_connect_to_supplicant(char const*) ]
I added some logging so that we can see the threadid on each call into wifi.h. I can see that event and control threads are logging different threadid. I'm not sure if this extra logging helps, but it looks like the last call to wifi before crash was in wifi_stop_supplicant:
01-01 10:38:18.159 205 852 E WifiHW : function = wifi_stop_supplicant
01-01 10:38:18.159 205 852 E WifiHW : Wifi called with different threadid = 0x141A9A0, old = 0x141A758
Reporter | ||
Updated•11 years ago
|
Flags: needinfo?(vchang)
Comment 13•11 years ago
|
||
(In reply to Preeti Raghunath(:Preeti) from comment #12)
> Ken
>
> Please help review this
Note - we're looking to understand if this info given in the bug is enough to go off of to fix this or not.
Assignee | ||
Comment 14•11 years ago
|
||
I still feel that the patch in Bug 1005775 could help to fix the crash issue. Did you capture the second log after apply the patch ? I am wondering what kind of problem did you encounter while applying the patch in Bug 1005775. Could you also turn on the wifi debug message in Settings/Developer/Wi-Fi output in adb when doing the test ?
According to below two lines in debug message,
01-01 10:38:18.589 198 725 D QsoftapCmd: Got softap fwreload command we are passing on
01-01 10:38:18.589 198 725 D SoftapController: Softap fwReload - Ok
the crash is happened while turning on wifi.
http://dxr.mozilla.org/mozilla-central/source/dom/wifi/WifiWorker.js?from=wifiworker.js&case=true#945
It's unreasonable for me that crash is happened in wifi_connect_to_supplicant because we guarantee wifi_wait_for_event should be called in event thread after wifi_connect_to_supplicant returns OK which means two operations are sync.
The prerequisite is that we always turn on wifi after receive CTRL-EVENT-TERMINITING event and causes the event thread exit wifi_wait_for_event loop.
http://dxr.mozilla.org/mozilla-central/source/dom/wifi/WifiWorker.js?from=wifiworker.js&case=true#778
It would be very appreciative if you could also put some debug messages in following functions
wifi_start_supplicant
wifi_connect_to_supplicant
wifi_close_supplicant_connection
do_dhcp_request
wifi_load_driver
wifi_unload_driver
Flags: needinfo?(vchang)
Updated•11 years ago
|
Flags: needinfo?(ggrisco)
Updated•11 years ago
|
blocking-b2g: 1.4? → 1.4+
Reporter | ||
Comment 15•11 years ago
|
||
(In reply to Vincent Chang[:vchang] from comment #14)
> I still feel that the patch in Bug 1005775 could help to fix the crash
Hi Vincent. Ok, I'm going to try this patch again and if I find issues with it I will be sure to capture logs and post them here.
I also have a couple new logs with some additional logging when similar crashes happened in last night's run. I'm uploading both of those now. The additional log is on the entry/exit point of the wifi apis (most of the ones you listed in comment 14.
Reporter | ||
Comment 16•11 years ago
|
||
Flags: needinfo?(ggrisco)
Reporter | ||
Comment 17•11 years ago
|
||
Reporter | ||
Comment 18•11 years ago
|
||
Reporter | ||
Comment 19•11 years ago
|
||
Comment 20•11 years ago
|
||
Since the crash is due to the stack overflow, could you have the memory content around the sp (e.g. 0xafffe890 in attachment 8422720 [details]) for us to help find any hint? Thanks!
Assignee | ||
Comment 21•11 years ago
|
||
(In reply to Greg Grisco from comment #18)
> Created attachment 8422720 [details]
> decoded minidump of crash (#4)
Do you know when is the crash happened in the log ?
01-01 05:23:52.258 23980 23980 D wpa_supplicant: mixed random - hexdump(len=20): [REMOVED]
01-01 05:23:52.258 23980 23980 D wpa_supplicant: random: Updated entropy file /data/misc/wifi/entropy.bin (own_pool_ready=2)
Or
01-01 05:24:04.518 25200 25200 D wpa_supplicant: RTM_NEWLINK: operstate=0 ifi_flags=0x1003 ([UP])
01-01 05:24:04.518 25200 25200 D wpa_supplicant: RTM_NEWLINK, IFLA_IFNAME: Interface 'wlan0' added
01-01 05:24:04.518 25200 25200 D wpa_supplicant: nl80211: if_removed already cleared - ignore event
or both of them.
Assignee | ||
Comment 22•11 years ago
|
||
BTW, can you point out the source code of line 511 in wpa_ctr.c ?
0 libwpa_client.so!wpa_ctrl_attach_helper [wpa_ctrl.c : 511 + 0x4]
Not sure if you can attach to the gdb when doing the test.
Flags: needinfo?(ggrisco)
Comment 24•11 years ago
|
||
We reproduced the same crash on nexus 5 by toggling airplane mode every 0.01 second with the help of gaia-ui-test. By applying this patch: https://bugzilla.mozilla.org/attachment.cgi?id=8420021 , b2g could survive more than 30 minutes while b2g would crash within one minutes without the patch. We can also explain how this crash occurred:
The event thread is going to execute [1] while the command thread is going to execute [2] (called from [3]). At this moment, |monitor_conn| is valid. Then these thread are racing. When [2] is executed first, |monitor_conn| becomes NULL and the call of [4] generates a segmentation fault.
To solve this racing issue, we need to make sure |wifi_wait_for_event| (called by event thread) and |wifi_close_supplicant_connection| (called by command thread) are not executed interleavely. The patch is trying to guarantee |WifiWorker.js::handleEvent| to return |false| before calling |wifiCommand.closeSupplicantConnection|.
[1] http://androidxref.com/4.4.2_r2/xref/hardware/libhardware_legacy/wifi/wifi.c#685
[2] http://androidxref.com/4.4.2_r2/xref/hardware/libhardware_legacy/wifi/wifi.c#776
[3] http://hg.mozilla.org/mozilla-central/file/58c5a3427997/dom/wifi/WifiWorker.js#l973
[4] http://androidxref.com/4.4.2_r2/xref/hardware/libhardware_legacy/wifi/wifi.c#685
[5] http://hg.mozilla.org/mozilla-central/file/58c5a3427997/dom/wifi/WifiWorker.js#l778
Reporter | ||
Comment 25•11 years ago
|
||
(In reply to Vincent Chang[:vchang] from comment #22)
> BTW, can you point out the source code of line 511 in wpa_ctr.c ?
> 0 libwpa_client.so!wpa_ctrl_attach_helper [wpa_ctrl.c : 511 + 0x4]
I believe that line 512 is a call to wpa_ctrl_attach_helper from wpa_ctrl_detach.
> Not sure if you can attach to the gdb when doing the test.
Maybe this isn't necessary now based on comment 24.
Flags: needinfo?(ggrisco)
Reporter | ||
Comment 26•11 years ago
|
||
After applying the patch for bug 1005775 we are still seeing a similar crash. I am uploading minidump and extra files (with logging) now.
Reporter | ||
Comment 27•11 years ago
|
||
Reporter | ||
Comment 28•11 years ago
|
||
(In reply to Henry Chang [:henry] from comment #24)
> We reproduced the same crash on nexus 5 by toggling airplane mode every 0.01
> second with the help of gaia-ui-test. By applying this patch:
> https://bugzilla.mozilla.org/attachment.cgi?id=8420021 , b2g could survive
> more than 30 minutes while b2g would crash within one minutes without the
> patch. We can also explain how this crash occurred:
>
> The event thread is going to execute [1] while the command thread is going
> to execute [2] (called from [3]). At this moment, |monitor_conn| is valid.
> Then these thread are racing. When [2] is executed first, |monitor_conn|
> becomes NULL and the call of [4] generates a segmentation fault.
>
> To solve this racing issue, we need to make sure |wifi_wait_for_event|
> (called by event thread) and |wifi_close_supplicant_connection| (called by
> command thread) are not executed interleavely. The patch is trying to
> guarantee |WifiWorker.js::handleEvent| to return |false| before calling
> |wifiCommand.closeSupplicantConnection|.
>
> [1]
> http://androidxref.com/4.4.2_r2/xref/hardware/libhardware_legacy/wifi/wifi.
> c#685
> [2]
> http://androidxref.com/4.4.2_r2/xref/hardware/libhardware_legacy/wifi/wifi.
> c#776
> [3]
> http://hg.mozilla.org/mozilla-central/file/58c5a3427997/dom/wifi/WifiWorker.
> js#l973
> [4]
> http://androidxref.com/4.4.2_r2/xref/hardware/libhardware_legacy/wifi/wifi.
> c#685
> [5]
> http://hg.mozilla.org/mozilla-central/file/58c5a3427997/dom/wifi/WifiWorker.
> js#l778
Hi Henry,
do you recommend that I apply patch 8420021 and retest? Should I remove the patch for bug 1005775 when doing so? Or is there another fix in the works?
-Greg
Flags: needinfo?(hchang)
Comment 29•11 years ago
|
||
The patch for 1005775 is https://bugzilla.mozilla.org/attachment.cgi?id=8420021 :p Besides, the crash log after applying the patch seems closer to 1001897 (crash in wifi_connect_on_socket_path)
Flags: needinfo?(hchang)
Reporter | ||
Comment 30•11 years ago
|
||
Hi(In reply to Henry Chang [:henry] from comment #29)
> The patch for 1005775 is
> https://bugzilla.mozilla.org/attachment.cgi?id=8420021 :p Besides, the crash
> log after applying the patch seems closer to 1001897 (crash in
> wifi_connect_on_socket_path)
Ok, so you're saying that the patch does indeed fix a set of crashes, but not the one we just reproduced which is similar to bug 1001897? Just wanted to confirm.
Is there plan to land this patch then? And then we can continue discussion for this recent crash on bug 1001897?
Flags: needinfo?(hchang)
Comment 31•11 years ago
|
||
(In reply to Greg Grisco from comment #30)
> Hi(In reply to Henry Chang [:henry] from comment #29)
> > The patch for 1005775 is
> > https://bugzilla.mozilla.org/attachment.cgi?id=8420021 :p Besides, the crash
> > log after applying the patch seems closer to 1001897 (crash in
> > wifi_connect_on_socket_path)
>
> Ok, so you're saying that the patch does indeed fix a set of crashes, but
> not the one we just reproduced which is similar to bug 1001897? Just wanted
> to confirm.
>
> Is there plan to land this patch then? And then we can continue discussion
> for this recent crash on bug 1001897?
Totally! What I reproduced is the crash caused by |wpa_ctrl_recv|, which is the same as the first attached log in this bug. But unfortunately, I didn't reproduce the crash caused by |wifi_connect_on_socket_path| as described in Bug 1001897. (I was hoping they have the same root cause....)
Flags: needinfo?(hchang)
Assignee | ||
Updated•11 years ago
|
Target Milestone: --- → 2.0 S6 (18july)
Assignee | ||
Updated•11 years ago
|
Whiteboard: [b2g-crash] → [b2g-crash][p=13][ETA: 7/18]
Assignee | ||
Comment 32•11 years ago
|
||
Basically, we have found two types of crash for wifi,
1. crash is happened in wpa_ctrl_recv, this can be duplicated by toggle airplane mode mentioned in comment 24. This type of crash could be fixed by applying the patch in Bug 1005775. I would like to lift Bug 1005775 to 1.4+ and mark this as a duplicated.
2. crash is happened in wifi_connect_on_socket_path, according to the backtrace, this type of crash is happened because of stack overflow. We still can't duplicate this here and don't have good understanding about the root cause. But we could apply the patch in Bug 100575 first and focus the discussion in Bug 1001897.
Assignee | ||
Updated•11 years ago
|
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
You need to log in
before you can comment on or make changes to this bug.
Description
•