Closed Bug 986365 Opened 6 years ago Closed 6 years ago

Crash in supplicant because of closing supplicant while waiting for event

Categories

(Firefox OS Graveyard :: Wifi, defect, critical)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(blocking-b2g:1.3+, firefox29 wontfix, firefox30 fixed, firefox31 fixed, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 fixed, b2g-v2.0 fixed)

RESOLVED FIXED
1.4 S5 (11apr)
blocking-b2g 1.3+
Tracking Status
firefox29 --- wontfix
firefox30 --- fixed
firefox31 --- fixed
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: vasanth, Assigned: vchang)

References

Details

(Keywords: crash, Whiteboard: [caf-crash 47][caf priority: p3][CR 584054][b2g-crash][cert])

Attachments

(5 files, 5 obsolete files)

We see the attached crash once in a while, by running browser, wifi on/off test scripts together.

From the call stack
Only “ctrl->s” can cause crash in 
res = recv(ctrl->s, reply, *reply_len, 0); -> line 503, wpa_ctrl.c [1]

ctrl is same as monitor_conn[index] in “wifi.c, wifi_ctrl_recv()”
which can become null when “wifi.c wifi_close_sockets()” is called [2]

So my guess is FFOS framework closed the wifi socket when wpa_ctrl_recv() is accessing it?
wifi_wait_on_socket() already checks whether monitor_conn[index] is NULL [3], 
but still framework might have closed it after that check but before supplicant’s wpa_ctrl_recv()?

wpa_supplicant doesn't provide any locking so I guess it's frameworks job to handle that, else supplicant's open source folks will need some substantial reason to add null check at [1]
I see android kk has a change where WifiStateMachine.java communicates with supplicant only inside a monitor [4] (i.e. close and WaitForEvent cannot happen at the same time)

So is it possible to do make sure closeSupplicantConnection() does not happen when WaitForEvent() is going on in WifiWorker.js [5]?


[1] https://www.codeaurora.org/cgit/quic/la/platform/external/wpa_supplicant_8/tree/src/common/wpa_ctrl.c?h=jb_3.2#n503
[2] https://www.codeaurora.org/cgit/quic/la/platform/hardware/libhardware_legacy/tree/wifi/wifi.c?h=jb_3.2
[3] https://www.codeaurora.org/cgit/quic/la/platform/hardware/libhardware_legacy/tree/wifi/wifi.c?h=jb_3.2#n771
[4] https://www.codeaurora.org/cgit/quic/la/platform/frameworks/base/commit/?h=kk_3.5&id=c249c2cc6b601ed1ff063f1748ba4399b9270209
[5] http://dxr.mozilla.org/mozilla-central/source/dom/wifi/WifiWorker.js
This carsh is while testing 1.3 version
blocking-b2g: --- → 1.3?
Whiteboard: [CR 584054]
Keywords: crash
Whiteboard: [CR 584054] → [CR 584054][b2g-crash]
Ken

Please help review and reassign
Flags: needinfo?(kchang)
blocking-b2g: 1.3? → 1.3+
We should make a note that we are entering the shutdown state and then not do waitEvent.
Vincent, please help.
Assignee: nobody → vchang
Flags: needinfo?(kchang)
Attached patch Patch v1.0 (obsolete) — Splinter Review
Attachment #8396177 - Flags: review?(chulee)
Comment on attachment 8396177 [details] [diff] [review]
Patch v1.0

Review of attachment 8396177 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/wifi/WifiWorker.js
@@ +201,5 @@
>  
>    function waitForEvent(iface) {
> +    // Request for wifi event when wpa_supplicant is ready.
> +    if (manager.state === "DISABLING" ||
> +          manager.state === "UNINITIALIZED") {

nit: |manager.state| should be aligned.

A little concern here, we might lose a disconnect state change in [1] and a disconnect event in [2], because "CTRL-EVENT-TERMINATING" is received after we disabling wpa_supplicant.
If so, some information in DOM object is not cleared [3], and apps won't the disconnect event.

Can you check if this condition exists?


[1] http://dxr.mozilla.org/mozilla-central/source/dom/wifi/WifiWorker.js#727
[2] http://dxr.mozilla.org/mozilla-central/source/dom/wifi/WifiWorker.js#1895
[3] http://dxr.mozilla.org/mozilla-central/source/dom/wifi/DOMWifiManager.js#243
Attachment #8396177 - Flags: review?(chulee) → feedback+
Attached patch Patch v1.1 (obsolete) — Splinter Review
Address the review comment.
Attachment #8396177 - Attachment is obsolete: true
Attachment #8397623 - Flags: review?(chulee)
Comment on attachment 8397623 [details] [diff] [review]
Patch v1.1

Review of attachment 8397623 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks!
Attachment #8397623 - Flags: review?(chulee) → review+
Hi, is it possible for you to test this patch before we check in?
Flags: needinfo?(vasanth)
(In reply to Chuck Lee [:chucklee] from comment #9)
> Hi, is it possible for you to test this patch before we check in?

Sure. This crash comes like once/day. So will provide an update in 2 days
Flags: needinfo?(vasanth)
(In reply to Chuck Lee [:chucklee] from comment #9)
> Hi, is it possible for you to test this patch before we check in?

Could you please give a patch on top of V1.3 as we are seeing this issue in that?
I don't see doDisableWifi() function at [1] which is present in the patch 

[1] https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gecko/tree/dom/wifi/WifiWorker.js?h=mozilla/v1.3
Flags: needinfo?(chulee)
Attached patch Patch v.1.1 for 1.3 branch (obsolete) — Splinter Review
Can you try this ?
Flags: needinfo?(chulee)
(In reply to Vincent Chang[:vchang] from comment #12)
> Created attachment 8397718 [details] [diff] [review]
> Patch v.1.1 for 1.3 branch
> 
> Can you try this ?

Vasanth, are you trying this patch?
Flags: needinfo?(vasanth)
Received same crash again with provided fix.
Please find attached crashed call stack and logcat with enabled logs from  wifiworker.js and libhardware/legacy/wifi.c
Flags: needinfo?(vasanth)
Attached file Crash dump (obsolete) —
Vincent, Could you please look into this?
Flags: needinfo?(vchang)
Sure. I am working on this.  
BTW, may I have your test script so that I can verify the patch by myself ?
Flags: needinfo?(vchang)
Attached file Crash dump
Save debug info in bugzilla.
Attachment #8398930 - Attachment is obsolete: true
(In reply to Vincent Chang[:vchang] from comment #18)
> BTW, may I have your test script so that I can verify the patch by myself ?

Not sure whether I can share script. Simplest STR is like this.
1. Configure the device/AP, so that device can connect to that AP when wifi is turned on.
2. Once in 10 seconds toggle wifi state between ON/OFF
3. While repeating step 2 for 8 to 10 hours, this crash is observed atleast once

To toggle wifi state, it changes "wifi.enabled" property to true/false like this [1]
[1] https://github.com/mozilla-b2g/gaia/blob/master/apps/settings/js/wifi.js#L45
Attached patch Patch v1.2Splinter Review
Attachment #8397623 - Attachment is obsolete: true
Attachment #8397718 - Attachment is obsolete: true
Attachment #8400515 - Flags: review?(chulee)
Comment on attachment 8400515 [details] [diff] [review]
Patch v1.2

Review of attachment 8400515 [details] [diff] [review]:
-----------------------------------------------------------------

Bravo!

I think it's better add some comments explaining why we move |closeSupplicantConnection()| from |setWifiEnabled()| to |handleEvent()|.
Attachment #8400515 - Flags: review?(chulee) → review+
Hi vasanth, could please help to verify this one ? Many thanks :)
Flags: needinfo?(vasanth)
Sure Vincent. BTW were you able to reproduce the crash using STR I mentioned and is that resolved using the new patch?
Flags: needinfo?(vasanth)
(In reply to vasanth from comment #25)
> Sure Vincent. BTW were you able to reproduce the crash using STR I mentioned
> and is that resolved using the new patch?

I can't reproduce the crash issue by hand. I am trying to write the script to do that. But it's not done yet. Logically, the new patch should solve the crash. But we still need your help to verify it.
Crash is observed till now. But still testing is in progress. Will update the result again once complete testing will get over.
Sorry, typo in above comment.

Crash is "NOT" observed till now. But still testing is in progress. Will update the result again once complete testing will get over.
(In reply to Pooja from comment #28)
> Sorry, typo in above comment.
> 
> Crash is "NOT" observed till now. But still testing is in progress. Will
> update the result again once complete testing will get over.

Thanks a lot. I am also using my own script based on STR in comment 21. The test has continue for about 21 hours and still looks good :-)
Hi Pooja, may I know your test result ?
Flags: needinfo?(vasanth)
Flags: needinfo?(vasanth) → needinfo?(poojas)
Hi Vincent,
No crash is observed in 3-4 days long testing. Fix works :)
Flags: needinfo?(poojas)
Whiteboard: [CR 584054][b2g-crash] → [CR 584054][b2g-crash][cert]
Vincent, I guess this fix needs 1.4 uplift too?
status-b2g-v1.4: --- → ?
Yes, you are right. Thanks for your reminding.
https://hg.mozilla.org/mozilla-central/rev/29e804e6693f
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Depends on: 993821
Please request approval-mozilla-b2g28? on this patch when it is ready for v1.3 uplift.
Flags: needinfo?(vchang)
Target Milestone: --- → 1.5 S1 (9may)
I think the patch has landed to 1.4, do I need to request for approval-mozilla-b2g28 ?
Flags: needinfo?(vchang)
The bug is marked blocking 1.3+, so it was my assumption that you did...
Vincent, should this bug still be blocking 1.3? If so, please request b2g28 approval on the patch.
Flags: needinfo?(vchang)
Attachment #8400563 - Flags: approval-mozilla-b2g28?
:vchang, can you please fill fill the approval request form that pops up when you set approval-mozilla-b2g28 ? PLease follow the guidelines at https://wiki.mozilla.org/Release_Management/Uplift_rules#Guidelines_on_approval_comments
[Approval Request Comment]
Bug caused by (feature/regressing bug #): new bug found by the QC stability test. 
User impact if declined: Toggle wifi settings on/off for about 10 hours causes gecko crash.  
Testing completed (on m-c, etc.): on m-c
Risk to taking this patch (and alternatives if risky): No
String or IDL/UUID changes made by this patch: No
Flags: needinfo?(vchang)
We should uplift bug 993821 to 1.3 if this one is taken.
(In reply to Vincent Chang[:vchang] from comment #43)
> We should uplift bug 993821 to 1.3 if this one is taken.

Hi Vincent, Can you please explain why ? Based on QC testing this patch seems to work fine or is their a dependency related to  bug 993821 which we are missing to read here ? At this point for 1.3 we should be cautious on landing just what's needed and go with the lowest risk options without introducing further risk for follow-up regressions.
Flags: needinfo?(vchang)
(In reply to bhavana bajaj [:bajaj] from comment #44)
> (In reply to Vincent Chang[:vchang] from comment #43)
> > We should uplift bug 993821 to 1.3 if this one is taken.
> 
> Hi Vincent, Can you please explain why ? Based on QC testing this patch
> seems to work fine or is their a dependency related to  bug 993821 which we
> are missing to read here ? At this point for 1.3 we should be cautious on
> landing just what's needed and go with the lowest risk options without
> introducing further risk for follow-up regressions.

Ok, partially answering my own question here, looks like we need 993821 due to : https://bugzilla.mozilla.org/show_bug.cgi?id=993821#c6. Would the risk profile be same here ?
Attachment #8400563 - Flags: approval-mozilla-b2g28? → approval-mozilla-b2g28+
(In reply to bhavana bajaj [:bajaj] from comment #44)
> (In reply to Vincent Chang[:vchang] from comment #43)
> > We should uplift bug 993821 to 1.3 if this one is taken.
> 
> Hi Vincent, Can you please explain why ? Based on QC testing this patch
> seems to work fine or is their a dependency related to  bug 993821 which we
> are missing to read here ? At this point for 1.3 we should be cautious on
> landing just what's needed and go with the lowest risk options without
> introducing further risk for follow-up regressions.

I used Unagi to do stability test and found that wpa_supplicant may not reply the TERMINATE event. Not sure if QC used the same version of wpa_supplicant when performing their test. 
I think we should apply this to prevent any potential bugs.
Flags: needinfo?(vchang)
Flags: in-moztrap-
(In reply to Vincent Chang[:vchang] from comment #29)
> (In reply to Pooja from comment #28)
> > Sorry, typo in above comment.
> > 
> > Crash is "NOT" observed till now. But still testing is in progress. Will
> > update the result again once complete testing will get over.
> 
> Thanks a lot. I am also using my own script based on STR in comment 21. The
> test has continue for about 21 hours and still looks good :-)

Hi Vincent,

Can you provide the script being used to test this?

Thanks!
Flags: needinfo?(vchang)
Peter - we don't need a manual test case for this.
Flags: needinfo?(vchang)
Whiteboard: [CR 584054][b2g-crash][cert] → [caf priority: p3][CR 584054][b2g-crash][cert]
Observed on: 

Device: 
Gonk Version: AU_LINUX_GECKO_B2G_JB_3.2.01.03.00.112.284
Moz BuildID: 20140323004002
B2G Version: 1.3
Gecko Version: 28.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=2e2e44886439c25c57583f51496c39ea9a6bee79
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=6d7df6e4f3257ffa859d6576460f6acc25db7892
Whiteboard: [caf priority: p3][CR 584054][b2g-crash][cert] → [caf-crash 47][caf priority: p3][CR 584054][b2g-crash][cert]
You need to log in before you can comment on or make changes to this bug.