Closed
      
        Bug 986365
      
      
        Opened 11 years ago
          Closed 11 years ago
      
        
    
  
Crash in supplicant because of closing supplicant while waiting for event
Categories
(Firefox OS Graveyard :: Wifi, defect)
Tracking
(blocking-b2g:1.3+, firefox29 wontfix, firefox30 fixed, firefox31 fixed, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 fixed, b2g-v2.0 fixed)
People
(Reporter: vasanth, Assigned: vchang)
References
Details
(Keywords: crash, Whiteboard: [caf-crash 47][caf priority: p3][CR 584054][b2g-crash][cert])
Attachments
(5 files, 5 obsolete files)
| 268.42 KB,
          text/plain         | Details | |
| 228.65 KB,
          text/plain         | Details | |
| 3.51 MB,
          application/x-7z-compressed         | Details | |
| 3.19 KB,
          patch         | chucklee
:
              
              review+ | Details | Diff | Splinter Review | 
| 2.42 KB,
          patch         | gal
:
              
              approval-mozilla-b2g28+ | Details | Diff | Splinter Review | 
We see the attached crash once in a while, by running browser, wifi on/off test scripts together.
From the call stack
Only “ctrl->s” can cause crash in 
res = recv(ctrl->s, reply, *reply_len, 0); -> line 503, wpa_ctrl.c [1]
ctrl is same as monitor_conn[index] in “wifi.c, wifi_ctrl_recv()”
which can become null when “wifi.c wifi_close_sockets()” is called [2]
So my guess is FFOS framework closed the wifi socket when wpa_ctrl_recv() is accessing it?
wifi_wait_on_socket() already checks whether monitor_conn[index] is NULL [3], 
but still framework might have closed it after that check but before supplicant’s wpa_ctrl_recv()?
wpa_supplicant doesn't provide any locking so I guess it's frameworks job to handle that, else supplicant's open source folks will need some substantial reason to add null check at [1]
I see android kk has a change where WifiStateMachine.java communicates with supplicant only inside a monitor [4] (i.e. close and WaitForEvent cannot happen at the same time)
So is it possible to do make sure closeSupplicantConnection() does not happen when WaitForEvent() is going on in WifiWorker.js [5]?
[1] https://www.codeaurora.org/cgit/quic/la/platform/external/wpa_supplicant_8/tree/src/common/wpa_ctrl.c?h=jb_3.2#n503
[2] https://www.codeaurora.org/cgit/quic/la/platform/hardware/libhardware_legacy/tree/wifi/wifi.c?h=jb_3.2
[3] https://www.codeaurora.org/cgit/quic/la/platform/hardware/libhardware_legacy/tree/wifi/wifi.c?h=jb_3.2#n771
[4] https://www.codeaurora.org/cgit/quic/la/platform/frameworks/base/commit/?h=kk_3.5&id=c249c2cc6b601ed1ff063f1748ba4399b9270209
[5] http://dxr.mozilla.org/mozilla-central/source/dom/wifi/WifiWorker.js
This carsh is while testing 1.3 version
blocking-b2g: --- → 1.3?
|   | ||
| Updated•11 years ago
           | 
blocking-b2g: 1.3? → 1.3+
|   | ||
| Comment 3•11 years ago
           | ||
We should make a note that we are entering the shutdown state and then not do waitEvent.
| Assignee | ||
| Comment 5•11 years ago
           | ||
        Attachment #8396177 -
        Flags: review?(chulee)
Comment on attachment 8396177 [details] [diff] [review]
Patch v1.0
Review of attachment 8396177 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/wifi/WifiWorker.js
@@ +201,5 @@
>  
>    function waitForEvent(iface) {
> +    // Request for wifi event when wpa_supplicant is ready.
> +    if (manager.state === "DISABLING" ||
> +          manager.state === "UNINITIALIZED") {
nit: |manager.state| should be aligned.
A little concern here, we might lose a disconnect state change in [1] and a disconnect event in [2], because "CTRL-EVENT-TERMINATING" is received after we disabling wpa_supplicant.
If so, some information in DOM object is not cleared [3], and apps won't the disconnect event.
Can you check if this condition exists?
[1] http://dxr.mozilla.org/mozilla-central/source/dom/wifi/WifiWorker.js#727
[2] http://dxr.mozilla.org/mozilla-central/source/dom/wifi/WifiWorker.js#1895
[3] http://dxr.mozilla.org/mozilla-central/source/dom/wifi/DOMWifiManager.js#243
        Attachment #8396177 -
        Flags: review?(chulee) → feedback+
| Assignee | ||
| Comment 7•11 years ago
           | ||
Address the review comment.
        Attachment #8396177 -
        Attachment is obsolete: true
        Attachment #8397623 -
        Flags: review?(chulee)
Comment on attachment 8397623 [details] [diff] [review]
Patch v1.1
Review of attachment 8397623 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, thanks!
        Attachment #8397623 -
        Flags: review?(chulee) → review+
Hi, is it possible for you to test this patch before we check in?
Flags: needinfo?(vasanth)
|   | Reporter | |
| Comment 10•11 years ago
           | ||
(In reply to Chuck Lee [:chucklee] from comment #9)
> Hi, is it possible for you to test this patch before we check in?
Sure. This crash comes like once/day. So will provide an update in 2 days
Flags: needinfo?(vasanth)
|   | Reporter | |
| Comment 11•11 years ago
           | ||
(In reply to Chuck Lee [:chucklee] from comment #9)
> Hi, is it possible for you to test this patch before we check in?
Could you please give a patch on top of V1.3 as we are seeing this issue in that?
I don't see doDisableWifi() function at [1] which is present in the patch 
[1] https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gecko/tree/dom/wifi/WifiWorker.js?h=mozilla/v1.3
Flags: needinfo?(chulee)
|   | ||
| Comment 13•11 years ago
           | ||
(In reply to Vincent Chang[:vchang] from comment #12)
> Created attachment 8397718 [details] [diff] [review]
> Patch v.1.1 for 1.3 branch
> 
> Can you try this ?
Vasanth, are you trying this patch?
Flags: needinfo?(vasanth)
|   | ||
| Comment 14•11 years ago
           | ||
Received same crash again with provided fix.
Please find attached crashed call stack and logcat with enabled logs from  wifiworker.js and libhardware/legacy/wifi.c
Flags: needinfo?(vasanth)
|   | ||
| Comment 15•11 years ago
           | ||
|   | ||
| Comment 16•11 years ago
           | ||
|   | Reporter | |
| Comment 17•11 years ago
           | ||
Vincent, Could you please look into this?
Flags: needinfo?(vchang)
| Assignee | ||
| Comment 18•11 years ago
           | ||
Sure. I am working on this.  
BTW, may I have your test script so that I can verify the patch by myself ?
Flags: needinfo?(vchang)
Save debug info in bugzilla.
        Attachment #8398930 -
        Attachment is obsolete: true
Preserve debug log in bugzilla.
        Attachment #8398929 -
        Attachment is obsolete: true
|   | Reporter | |
| Comment 21•11 years ago
           | ||
(In reply to Vincent Chang[:vchang] from comment #18)
> BTW, may I have your test script so that I can verify the patch by myself ?
Not sure whether I can share script. Simplest STR is like this.
1. Configure the device/AP, so that device can connect to that AP when wifi is turned on.
2. Once in 10 seconds toggle wifi state between ON/OFF
3. While repeating step 2 for 8 to 10 hours, this crash is observed atleast once
To toggle wifi state, it changes "wifi.enabled" property to true/false like this [1]
[1] https://github.com/mozilla-b2g/gaia/blob/master/apps/settings/js/wifi.js#L45
| Assignee | ||
| Comment 22•11 years ago
           | ||
        Attachment #8397623 -
        Attachment is obsolete: true
        Attachment #8397718 -
        Attachment is obsolete: true
        Attachment #8400515 -
        Flags: review?(chulee)
Comment on attachment 8400515 [details] [diff] [review]
Patch v1.2
Review of attachment 8400515 [details] [diff] [review]:
-----------------------------------------------------------------
Bravo!
I think it's better add some comments explaining why we move |closeSupplicantConnection()| from |setWifiEnabled()| to |handleEvent()|.
        Attachment #8400515 -
        Flags: review?(chulee) → review+
| Assignee | ||
| Comment 24•11 years ago
           | ||
Hi vasanth, could please help to verify this one ? Many thanks :)
Flags: needinfo?(vasanth)
|   | Reporter | |
| Comment 25•11 years ago
           | ||
Sure Vincent. BTW were you able to reproduce the crash using STR I mentioned and is that resolved using the new patch?
Flags: needinfo?(vasanth)
| Assignee | ||
| Comment 26•11 years ago
           | ||
(In reply to vasanth from comment #25)
> Sure Vincent. BTW were you able to reproduce the crash using STR I mentioned
> and is that resolved using the new patch?
I can't reproduce the crash issue by hand. I am trying to write the script to do that. But it's not done yet. Logically, the new patch should solve the crash. But we still need your help to verify it.
|   | ||
| Comment 27•11 years ago
           | ||
Crash is observed till now. But still testing is in progress. Will update the result again once complete testing will get over.
|   | ||
| Comment 28•11 years ago
           | ||
Sorry, typo in above comment.
Crash is "NOT" observed till now. But still testing is in progress. Will update the result again once complete testing will get over.
| Assignee | ||
| Comment 29•11 years ago
           | ||
(In reply to Pooja from comment #28)
> Sorry, typo in above comment.
> 
> Crash is "NOT" observed till now. But still testing is in progress. Will
> update the result again once complete testing will get over.
Thanks a lot. I am also using my own script based on STR in comment 21. The test has continue for about 21 hours and still looks good :-)
|   | ||
| Comment 31•11 years ago
           | ||
Hi Vincent,
No crash is observed in 3-4 days long testing. Fix works :)
Flags: needinfo?(poojas)
|   | ||
| Updated•11 years ago
           | 
Whiteboard: [CR 584054][b2g-crash] → [CR 584054][b2g-crash][cert]
| Assignee | ||
| Comment 32•11 years ago
           | ||
| Assignee | ||
| Comment 34•11 years ago
           | ||
Yes, you are right. Thanks for your reminding.
|   | ||
| Comment 35•11 years ago
           | ||
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
| Comment 36•11 years ago
           | ||
Please request approval-mozilla-b2g28? on this patch when it is ready for v1.3 uplift.
          status-b2g-v1.3:
          --- → affected
          status-b2g-v2.0:
          --- → fixed
Flags: needinfo?(vchang)
Target Milestone: --- → 1.5 S1 (9may)
| Updated•11 years ago
           | 
| Comment 37•11 years ago
           | ||
Target Milestone: 1.5 S1 (9may) → 1.4 S5 (11apr)
| Assignee | ||
| Comment 38•11 years ago
           | ||
I think the patch has landed to 1.4, do I need to request for approval-mozilla-b2g28 ?
Flags: needinfo?(vchang)
| Comment 39•11 years ago
           | ||
The bug is marked blocking 1.3+, so it was my assumption that you did...
| Comment 40•11 years ago
           | ||
Vincent, should this bug still be blocking 1.3? If so, please request b2g28 approval on the patch.
Flags: needinfo?(vchang)
| Assignee | ||
| Updated•11 years ago
           | 
        Attachment #8400563 -
        Flags: approval-mozilla-b2g28?
| Comment 41•11 years ago
           | ||
:vchang, can you please fill fill the approval request form that pops up when you set approval-mozilla-b2g28 ? PLease follow the guidelines at https://wiki.mozilla.org/Release_Management/Uplift_rules#Guidelines_on_approval_comments
| Assignee | ||
| Comment 42•11 years ago
           | ||
[Approval Request Comment]
Bug caused by (feature/regressing bug #): new bug found by the QC stability test. 
User impact if declined: Toggle wifi settings on/off for about 10 hours causes gecko crash.  
Testing completed (on m-c, etc.): on m-c
Risk to taking this patch (and alternatives if risky): No
String or IDL/UUID changes made by this patch: No
Flags: needinfo?(vchang)
| Assignee | ||
| Comment 43•11 years ago
           | ||
We should uplift bug 993821 to 1.3 if this one is taken.
| Comment 44•11 years ago
           | ||
(In reply to Vincent Chang[:vchang] from comment #43)
> We should uplift bug 993821 to 1.3 if this one is taken.
Hi Vincent, Can you please explain why ? Based on QC testing this patch seems to work fine or is their a dependency related to  bug 993821 which we are missing to read here ? At this point for 1.3 we should be cautious on landing just what's needed and go with the lowest risk options without introducing further risk for follow-up regressions.
Flags: needinfo?(vchang)
| Comment 45•11 years ago
           | ||
(In reply to bhavana bajaj [:bajaj] from comment #44)
> (In reply to Vincent Chang[:vchang] from comment #43)
> > We should uplift bug 993821 to 1.3 if this one is taken.
> 
> Hi Vincent, Can you please explain why ? Based on QC testing this patch
> seems to work fine or is their a dependency related to  bug 993821 which we
> are missing to read here ? At this point for 1.3 we should be cautious on
> landing just what's needed and go with the lowest risk options without
> introducing further risk for follow-up regressions.
Ok, partially answering my own question here, looks like we need 993821 due to : https://bugzilla.mozilla.org/show_bug.cgi?id=993821#c6. Would the risk profile be same here ?
|   | ||
| Updated•11 years ago
           | 
        Attachment #8400563 -
        Flags: approval-mozilla-b2g28? → approval-mozilla-b2g28+
| Comment 46•11 years ago
           | ||
| Updated•11 years ago
           | 
          status-b2g-v1.3T:
          --- → fixed
| Assignee | ||
| Comment 47•11 years ago
           | ||
(In reply to bhavana bajaj [:bajaj] from comment #44)
> (In reply to Vincent Chang[:vchang] from comment #43)
> > We should uplift bug 993821 to 1.3 if this one is taken.
> 
> Hi Vincent, Can you please explain why ? Based on QC testing this patch
> seems to work fine or is their a dependency related to  bug 993821 which we
> are missing to read here ? At this point for 1.3 we should be cautious on
> landing just what's needed and go with the lowest risk options without
> introducing further risk for follow-up regressions.
I used Unagi to do stability test and found that wpa_supplicant may not reply the TERMINATE event. Not sure if QC used the same version of wpa_supplicant when performing their test. 
I think we should apply this to prevent any potential bugs.
Flags: needinfo?(vchang)
|   | ||
| Updated•11 years ago
           | 
Flags: in-moztrap-
|   | ||
| Comment 48•11 years ago
           | ||
(In reply to Vincent Chang[:vchang] from comment #29)
> (In reply to Pooja from comment #28)
> > Sorry, typo in above comment.
> > 
> > Crash is "NOT" observed till now. But still testing is in progress. Will
> > update the result again once complete testing will get over.
> 
> Thanks a lot. I am also using my own script based on STR in comment 21. The
> test has continue for about 21 hours and still looks good :-)
Hi Vincent,
Can you provide the script being used to test this?
Thanks!
Flags: needinfo?(vchang)
|   | ||
| Comment 49•11 years ago
           | ||
Peter - we don't need a manual test case for this.
Flags: needinfo?(vchang)
| Updated•11 years ago
           | 
Whiteboard: [CR 584054][b2g-crash][cert] → [caf priority: p3][CR 584054][b2g-crash][cert]
| Comment 50•11 years ago
           | ||
Observed on: 
Device: 
Gonk Version: AU_LINUX_GECKO_B2G_JB_3.2.01.03.00.112.284
Moz BuildID: 20140323004002
B2G Version: 1.3
Gecko Version: 28.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=2e2e44886439c25c57583f51496c39ea9a6bee79
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=6d7df6e4f3257ffa859d6576460f6acc25db7892
| Updated•11 years ago
           | 
Whiteboard: [caf priority: p3][CR 584054][b2g-crash][cert] → [caf-crash 47][caf priority: p3][CR 584054][b2g-crash][cert]
          You need to log in
          before you can comment on or make changes to this bug.
        
Description
•