Closed Bug 796640 Opened 7 years ago Closed 7 years ago

[wifi] after disconnection, wifi does not connect back to known networks automatically

Categories

(Firefox OS Graveyard :: Gaia, defect, P1, critical)

Tracking

(blocking-basecamp:+, firefox18- fixed, firefox19- fixed)

VERIFIED FIXED

People

(Reporter: ghtobz, Assigned: mrbkap)

References

Details

(Keywords: regression, Whiteboard: [label:networking])

Attachments

(2 files)

[GitHub issue by autonome on 2012-09-13T22:58:53Z, https://github.com/mozilla-b2g/gaia/issues/4707]
STR:

1. get to office, connect to wifi
2. leave office for lunch
3. return to office after lunch (terrible taco salad, never going back there)

Expected: phone is connected to wifi

Actual: phone is not connected, must connect manually
[GitHub comment by autonome on 2012-09-14T18:07:02Z]
/cc @fabi1cazenave @mrbkap
[GitHub comment by autonome on 2012-09-27T00:09:34Z]
@nhirata can you test this? i haven't seen it recently.
[GitHub comment by nhirata on 2012-09-27T00:14:49Z]
Marking QA wanted.  I'll try testing it out.
[GitHub comment by mrbkap on 2012-09-27T18:01:18Z]
I was able to reproduce this yesterday. It looked like wpa_supplicant was getting stuck trying to scan :/
[GitHub comment by nhirata on 2012-09-28T23:59:53Z]
I was able to reproduce this as well with today's build.

Otoro phone, build 2012-09-28 us
Taken from default.xml in b2g-distro: 
* "platform_build" revision= 795261940c8b11fb7dddd7a8e9dd8561fdc4fb64
* "gaia" revision= dbe752c2bc61835a92469cb0e35ad5d938a754d5 
* "releases-mozilla-central" revision= 5e9ba780a2f3db83364a87a7b976eafe4ae834b2
* "gonk-misc" revision= dbb03748465d4985a393a3a5c23de04e119567a2
[mass adding reproducible keyword for any open Gaia bug with the word "STR:" in comments]
Keywords: reproducible
Severity: normal → critical
Priority: -- → P1
mrbkap is on vacation - mwu, can you take a look at this critical issue ASAP?
Assignee: mrbkap → mwu
(cc vincent)
I can't reproduce this on otoro.

However, there is a bug on unagi with similar symptoms which is pretty severe and for which we have a simple fix which I've posted to bug 801935.
Assignee: mwu → nobody
Marcia - can you check whether or not you're able to reproduce?
QA Contact: mozillamarcia.knous
If we're able to reproduce, we should see if restarting the phone resolves the issue.
I'm able to reproduce this:

Build Unagi4 (10/15/12)
gaia:  589c7f8f7df88766f7a5fa944f6bb05eef04b8c3
gonk:  e6403c71e9eca8cb943739d5a0a192deac60fc51

STR:
1.  Connected to Mozilla guest network; connected successfully to a website
2.  Walked down Castro street 3 blocks without disconnecting
3.  Came back to office
4.  Looked at my settings again - wifi icon was still there, and settings showed that I was still connected to Mozilla guest
5. Tried to load a different (non cached) website

Expected:  Website loads

Actual:  I was required to toggle wifi off and back on again, and reconnect to Mozilla Guest network.  After this, loading the website worked.
Rebooting the device after I returned to office prompted wifi to scan and reconnect to the network (to which it was originally connected).
I tested this and it works for me.  I can reliably reconnect to the moz guest network as of yesterday.
I hit this just now when getting into the office, on Otoro. Connected to office wifi yesterday after flashing. Went home. Did OTA update this morning at home. Went back to office this morning, and it didn't auto-connect *until* I loaded the Settings app.
I think it has to do with connecting to 2 different networks; it should remember to auto connect when returning to a network.
I'll try when I go grab lunch :D
There might be overlap with bug 802418. For example, I have not moved from this couch since connecting to the wifi, and it has not disconnected itself (regression, didn't do this until quite recently).
The behavior of wpa_supplicant is something like this:
Suppose the otoro was connected to an access point and later moved out of the AP's range. wpa_supplicant received a DISASSOC event from the driver and reported CTRL-EVENT-DISCONNECTED to WifiManager. WifiManager sent a connection status changed event (status = disconnected) to the settings app. We can turn on autoscan and scan for networks continually when the settings app receives the disconnected event.

The implementation of bug 802418 helps to avoid this bug, because it turns wifi off when the screen is off and back on when the screen is on. Turning wifi off and on makes wpa_supplicant reconnect to a known AP automatically.
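
For illustration only, the "scan continually after a disconnected event" idea might look roughly like the sketch below. Everything named here (createAutoScanner, requestScan, the status strings, the injectable timers object) is hypothetical and is not a Gaia or Gecko API; it only shows the shape of the approach.

```javascript
// Hypothetical sketch: none of these names come from Gaia or Gecko.
// requestScan is assumed to be a function that asks the wifi stack for a scan.
function createAutoScanner(requestScan, intervalMs, timers) {
  // Timers are injectable so the logic can be exercised without waiting.
  timers = timers || {
    set: (fn, ms) => setInterval(fn, ms),
    clear: (handle) => clearInterval(handle),
  };
  let handle = null;

  return {
    // Call this with the status from each connection status change event.
    onStatusChange(status) {
      if (status === "disconnected" && handle === null) {
        requestScan();                               // scan right away
        handle = timers.set(requestScan, intervalMs); // then keep scanning
      } else if (status === "connected" && handle !== null) {
        timers.clear(handle);                        // stop once we are back on a network
        handle = null;
      }
    },
    isScanning() {
      return handle !== null;
    },
  };
}
```

The guard on handle keeps repeated disconnected events from stacking up several scan timers.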
(In reply to Dietrich Ayala (:dietrich) from comment #18)
> There might be overlap with bug 802418. For example, I have not moved from
> this couch since connecting to the wifi, and it has not disconnected itself
> (regression, didn't do this until quite recently).

s/has not/has/

oops.
I started looking at this before I went on vacation last week and here's what I started to find:

(In reply to Vincent Chang from comment #19)
> from the driver and reported CTRL-EVENT-DISCONNECTED to WifiManager.
> WifiManager sent a connection status changed event (status = disconnected) to the

This is all correct...

> settings app. We can turn on autoscan and scan for networks continually when
> the settings app receives the disconnected event.

...but this should be unnecessary. Unless we explicitly tell wpa_supplicant to DISCONNECT, when it is disconnected, it will automatically scan for known networks (in older versions, it seemed to do this based on an android property whereas in our current version it looks like it uses some sort of algorithm to figure out how often to scan). When I was testing this, I saw the supplicant realize the connection had been lost, disconnect, scan once (maybe), then start scanning a second time and get stuck. This also meant that further attempts to schedule scans failed, leading to an empty network list.

We can hack around this by detecting that this is happening and sending a disconnect/reassociate pair of commands, but I'd like to know why wpa_supplicant is losing its mind first.
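
A rough sketch of that workaround, for illustration only. sendCommand is a hypothetical synchronous stand-in for whatever passes a control command to wpa_supplicant and returns its reply string, and treating a FAIL-BUSY reply to SCAN as the "stuck" signal is an assumption borrowed from a later comment in this bug, not something the patch is known to do.

```javascript
// Hypothetical sketch: sendCommand(cmd) is assumed to return the supplicant's
// reply as a string. It is not a real Gecko function.
function recoverIfStuck(sendCommand) {
  const reply = sendCommand("SCAN");
  if (reply !== "FAIL-BUSY") {
    return false; // the scan was accepted (or failed some other way); do nothing
  }
  // Assumed heuristic: FAIL-BUSY means the supplicant is wedged mid-scan.
  sendCommand("DISCONNECT");
  sendCommand("REASSOCIATE");
  return true;
}
```

This only papers over the symptom; it says nothing about why the supplicant gets stuck in the first place.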
So, I flashed my otoro back to Android and found that they also get the same odd "stuck in scanning state" behavior from the supplicant. However, they *also* turn on background scans which seem to work independently of the main scanning stuff. I'm working on a patch that tries to do that (turning on background scans manually seemed to fix this bug for me) and hope to have it finished by tomorrow.
Smart move. ++mrbkap
Looks like steps came in on comment 15. Go ahead and re-add qawanted/steps-wanted if needed, please.
I disabled the sched_scan capability and used Android's private command to do sched_scan instead. It works very well for me on this issue. I no longer see the scan command fail after applying this.
Comment on attachment 674593 [details] [diff] [review]
use android private command to do sched_scan

This would almost certainly fix this, though my understanding is that we're trying pretty hard to avoid changing drivers on our side. I have to go catch a flight. I'll definitely have a patch for this on the Gecko end up tomorrow.
Attached patch: patch v1
This seems to work, though I haven't tested at all heavily yet. Vincent, what do you think? It was pretty tricky figuring out when to turn on/off the background scanning stuff, so I'd appreciate you looking pretty closely at that part.
Attachment #675359 - Flags: review?(vchang)
Comment on attachment 675359 [details] [diff] [review]
patch v1

Review of attachment 675359 [details] [diff] [review]:
-----------------------------------------------------------------

I checked the implementation of wpa_supplicant. A schedule scan is triggered when we have enabled networks, and it fires whenever we get scan results without a selected SSID. That is what wpa_supplicant does for schedule scan. This hack works very well in my testing on otoro. It helps to prevent wpa_supplicant from getting stuck.
Not sure what will happen if we apply this patch together with a wpa_supplicant in which schedule scan already works well?

::: dom/wifi/WifiWorker.js
@@ +241,5 @@
> +    var doEnable = (enable === "ON");
> +    if (doEnable === backgroundScanEnabled) {
> +      callback(false, true);
> +      return;
> +    }

Do we need to call callback and set reEnableBackgroundScan to false ?

@@ +607,1 @@
>      fields.prevState = manager.state;

Is it possible that we turn off background scan here, and fail to connect to AP ? We rely on scan-result event and disconnected event to turn on background scan. It seems fine when doing the test.

@@ +670,5 @@
> +      case "DISCONNECTED":
> +      case "INACTIVE":
> +      case "SCANNING":
> +        setBackgroundScan("ON", function(){});
> +

Turning on backgroundScan here prevents wpa_supplicant from getting stuck. I found that wpa_supplicant gets stuck when we boot the device for the first time with wifi settings enabled. In that case, the scan command returns FAIL-BUSY, which means that the wifi driver or wpa_supplicant is stuck. But turning wifi off and on helps it recover.
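
To make the quoted hunks easier to follow, here is a self-contained sketch of the same pattern. It is not the patch: createBackgroundScanController, sendSetPno, and the meaning given to the two callback arguments (changed, ok) are assumptions for illustration. Only the early-return guard and the three state cases mirror the quoted diff.

```javascript
// Hypothetical sketch modeled on the quoted hunks, not the actual WifiWorker.js code.
// sendSetPno(flag, done) stands in for issuing "SET pno 1" or "SET pno 0"
// and calling done(true) on success.
function createBackgroundScanController(sendSetPno) {
  var backgroundScanEnabled = false;

  // Assumed convention: callback(changed, ok).
  function setBackgroundScan(enable, callback) {
    var doEnable = (enable === "ON");
    if (doEnable === backgroundScanEnabled) {
      callback(false, true); // already in the requested state, nothing to send
      return;
    }
    sendSetPno(doEnable, function (ok) {
      if (ok) {
        backgroundScanEnabled = doEnable; // only record the change if it worked
      }
      callback(ok, ok);
    });
  }

  // Turn background scanning on whenever the supplicant is not associated.
  function onSupplicantState(state) {
    switch (state) {
      case "DISCONNECTED":
      case "INACTIVE":
      case "SCANNING":
        setBackgroundScan("ON", function () {});
        break;
    }
  }

  return {
    setBackgroundScan: setBackgroundScan,
    onSupplicantState: onSupplicantState,
    isEnabled: function () { return backgroundScanEnabled; },
  };
}
```

Because of the guard, a burst of SCANNING and DISCONNECTED state changes sends the enable command only once.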
Attachment #675359 - Flags: review?(vchang)
Keywords: qawanted
(In reply to Vincent Chang from comment #29)
> Not sure what will happen if we apply this patch together with a
> wpa_supplicant in which schedule scan already works well?

I don't actually have a phone that I can flash that I can test on. Android has an internal properties file that it uses to figure out whether or not it should turn on background scanning. One thing I was thinking of doing was to see if we ever went into the SCANNING state for more than 3 or so seconds without getting any results left. Given how badly this is affecting our dogfooders though, I'd like to do that in a followup (unless you have any better ideas).
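
The "stuck in SCANNING for more than a few seconds" check mentioned above could be sketched like this. It is only an illustration of the idea: createScanWatchdog, onStuck, and the injectable timers object are hypothetical names, and the timeout is whatever the caller passes in.

```javascript
// Hypothetical sketch of a SCANNING-state watchdog; not part of any patch here.
function createScanWatchdog(onStuck, timeoutMs, timers) {
  // Timers are injectable so the logic can be exercised without waiting.
  timers = timers || {
    set: (fn, ms) => setTimeout(fn, ms),
    clear: (handle) => clearTimeout(handle),
  };
  let handle = null;

  function cancel() {
    if (handle !== null) {
      timers.clear(handle);
      handle = null;
    }
  }

  return {
    // Feed every supplicant state change in here.
    onStateChange(state) {
      if (state !== "SCANNING") {
        cancel(); // left the SCANNING state, so it is not stuck
        return;
      }
      if (handle === null) {
        handle = timers.set(() => {
          handle = null;
          onStuck(); // still SCANNING with no results after timeoutMs
        }, timeoutMs);
      }
    },
    // Call when scan results arrive; results mean the scan did complete.
    onScanResults() {
      cancel();
    },
  };
}
```

What onStuck should do (turn on background scanning, or something else) is left open, as it is in the discussion.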

> ::: dom/wifi/WifiWorker.js
> Do we need to call callback and set reEnableBackgroundScan to false ?

Oops, yes.

> @@ +607,1 @@
> >      fields.prevState = manager.state;
> 
> Is it possible that we turn off background scan here, and fail to connect to
> AP ? We rely on scan-result event and disconnected event to turn on
> background scan. It seems fine when doing the test.

That's possible, but the state changing to DISCONNECTED should turn it back on.

> @@ +670,5 @@
> > +      case "DISCONNECTED":
> > +      case "INACTIVE":
> > +      case "SCANNING":
> > +        setBackgroundScan("ON", function(){});
> > +
> 
> Turning on backgroundScan here prevents wpa_supplicant from getting stuck. I
> found that wpa_supplicant gets stuck when we boot the device for the first
> time with wifi settings enabled. In that case, the scan command returns
> FAIL-BUSY, which means that the wifi driver or wpa_supplicant is stuck. But
> turning wifi off and on helps it recover.

I don't understand what you're saying. Are you saying that you still see the supplicant getting stuck on startup? I was hoping to avoid that by starting background scanning if the supplicant was in the SCANNING state (and in my limited testing it seemed to work)
Flags: needinfo?(vchang)
> One thing I was thinking of doing was to
> see if we ever went into the SCANNING state for more than 3 or so seconds
> without getting any results left. Given how badly this is affecting our
> dogfooders though, I'd like to do that in a followup (unless you have any
> better ideas).

Sure, let's do it in a follow up bug.  

> That's possible, but the state changing to DISCONNECTED should turn it back
> on.

Got it. 

> I don't understand what you're saying. Are you saying that you still see the
> supplicant getting stuck on startup? I was hoping to avoid that by starting
> background scanning if the supplicant was in the SCANNING state (and in my
> limited testing it seemed to work)

No, I don't see wpa_supplicant getting stuck after applying this patch. It magically avoids that. Not sure if there are any other cases which may make wpa_supplicant get stuck. We can file follow-up bugs if there are any. Let's land this patch. I believe it also fixes several other wifi-related bugs.
Flags: needinfo?(vchang)
Hi Blake, 

Should I mark it r+ ? Could you please guide me how ? 

Regards
Vincent
One finding: when I applied this patch and tested "Bug 803932 - Unagi does not switch from cellular network to wifi when known wifi network is available.", I found that the enable/disable background scan command (SET pno 1/0) may fail.
STR 
   (1) turn on 3G data in settings to establish 3G data connection 
   (2) connect to Wifi AP, the default route switches to Wifi interface 
   (3) disable AP(other device), the default route will switch to 3G interface
   (4) enable AP(other device), the default route will switch back to Wifi interface
Repeat (3) and (4) 

After that, wpa_supplicant seems to get stuck.
Sending the DISCONNECT / SET pno 1 / RECONNECT commands seems to help make it work again.
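
As a sketch of that recovery sequence, assuming a hypothetical synchronous sendCommand(cmd) helper that returns the supplicant's reply string (with "OK" meaning success). The helper and the stop-on-first-failure behavior are assumptions for illustration, not something taken from the patch.

```javascript
// Hypothetical sketch: sendCommand is not a real Gecko function.
// Returns null if every command was accepted, otherwise the command that failed.
function kickSupplicant(sendCommand) {
  const sequence = ["DISCONNECT", "SET pno 1", "RECONNECT"];
  for (const cmd of sequence) {
    if (sendCommand(cmd) !== "OK") {
      return cmd; // stop early so the caller knows where it went wrong
    }
  }
  return null;
}
```

Returning the failing command makes it possible to log which step was rejected, which matters here since SET pno itself was seen to fail.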
Duplicate of this bug: 804466
(In reply to Vincent Chang from comment #32)
> Should I mark it r+ ? Could you please guide me how ? 

So, I realized that I already set reEnableBackgroundScan in scanCommand, so I don't need to change it. I'll file a followup bug for comment 33 (I suspected that could happen but couldn't make it happen). To mark the review as + you can click on "Details" to the right of the patch and you can set review to "+".
https://hg.mozilla.org/mozilla-central/rev/17534a39a5a5
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
tracking-firefox is not used for B2G. Sounds like we need to just uplift to mozilla-aurora when ready.
QA Contact: mozillamarcia.knous → jhammink
Works as it should on 11/14/12 nightly.  Well done, :mrbkap!
Status: RESOLVED → VERIFIED
for reference:
gaia: 1c884f41292650615b04ca9f40cab981ea9e4d00
gecko: 76848b5c67e6c164b28f366534e5eeab8fcadc2a