Closed Bug 809663 Opened 12 years ago Closed 6 years ago

B2G Wifi: Make sure we successfully turn on background scanning (redux, again)

Categories

(Firefox OS Graveyard :: General, defect, P1)

Tracking

(blocking-b2g:-)

RESOLVED WONTFIX
B2G C4 (2jan on)
blocking-b2g -

People

(Reporter: mrbkap, Unassigned)

Details

(Whiteboard: [awaiting partner help] [label:networking])

Attachments

(2 files)

+++ This bug was initially created as a clone of Bug #807148 +++

This bug won't die. This tracks bug 807148 comment 8. I'm going to give it to Vincent since he found it :-)
Blake, are we blocking on this, or did that flag just get carried over from the clone?
Well, the flag was accidentally carried over; but I think this needs to block (at least until we understand it better) since this bug can result in wifi not working until the phone is rebooted.
blocking-basecamp: ? → +
Marking for C2, given this meets the criteria of known P1/P2 blocking-basecamp+ bugs at the end of C1.
Target Milestone: --- → B2G C2 (20nov-10dec)
This can't always be reproduced. We're keeping this case open for monitoring, since we're not yet sure how to reproduce the issue.
Keywords: qawanted
QA Contact: atsai
According to the log I found earlier, it seems that the wifi driver got stuck and a "Failed to initiate AP scan" event was sent to WifiWorker. Reloading the wifi driver seems to help recover the device. Not sure if we need this workaround. mrbkap, what do you think?

685:D/wpa_supplicant(647): nl80211: Scan trigger failed: ret=-16 (Device or resource busy)
686:W/wpa_supplicant(  647): wlan0: Failed to initiate AP scan
687:I/Gecko(109): -*- WifiWorker component: Event coming in: Failed to initiate AP scan
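
A minimal sketch of how one might spot this failure in a saved logcat capture. The helper names (parseScanFailure, logShowsBusyDriver) are hypothetical and are not part of WifiWorker.js or wpa_supplicant. Only the message format and the -16 return code are taken from the log above.

```javascript
// Hypothetical helpers for scanning a saved logcat capture.
// The pattern mirrors the "Scan trigger failed" line quoted in this bug.
const SCAN_TRIGGER_FAILED = /nl80211: Scan trigger failed: ret=(-?\d+)/;
const EBUSY = 16; // Linux errno for "Device or resource busy"

// Returns { ret, busy } for a scan-failure line, or null for any other line.
function parseScanFailure(line) {
  const match = SCAN_TRIGGER_FAILED.exec(line);
  if (!match) {
    return null;
  }
  const ret = Number(match[1]);
  return { ret, busy: ret === -EBUSY };
}

// Returns true if any line in the log shows a scan rejected with -EBUSY.
function logShowsBusyDriver(logText) {
  return logText.split("\n").some((line) => {
    const failure = parseScanFailure(line);
    return failure !== null && failure.busy;
  });
}
```

Running logShowsBusyDriver over a capture gives a quick yes/no answer on whether a device hit the busy-driver case, which may help when collecting reports from other people's phones.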
I tried re-enabling unloadDriver in WifiWorker.js so that it actually unloads the driver. It works fine with the current kernel and driver. I don't remember why we didn't unload the driver in the first place.
Blake, do you have any comment for comment 5? Thanks.
Flags: needinfo?(mrbkap)
Even though we tried hard to prevent the wifi scan from getting stuck in Bug 796640 and Bug 807148 by using the background scan command (SET pno 1/0), I found that we may still fail to do a wifi scan. Even worse, the wifi driver may return -16 (device or resource busy) to wpa_supplicant. In that case the wifi driver seems to be stuck, and only reloading the wifi driver recovers it.

"D/wpa_supplicant(  739): nl80211: Scan trigger failed: ret=-16 (Device or resource busy)"  
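
One way to picture the reload workaround is as a small policy object that counts consecutive busy failures and asks for a driver reload once a threshold is reached. This is only a sketch under assumptions: createScanRecovery, the reloadDriver callback and the threshold of 3 are all hypothetical and do not come from WifiWorker.js.

```javascript
// Sketch only. reloadDriver is a caller-supplied, hypothetical callback and
// maxBusyFailures is an assumed threshold, not a value taken from this bug.
function createScanRecovery(reloadDriver, maxBusyFailures = 3) {
  let consecutiveBusy = 0;
  return {
    // Call with the return code of each scan attempt (0 on success).
    // Returns true when a driver reload was requested.
    onScanResult(ret) {
      if (ret !== -16) {
        consecutiveBusy = 0; // any result other than -EBUSY resets the streak
        return false;
      }
      consecutiveBusy += 1;
      if (consecutiveBusy < maxBusyFailures) {
        return false;
      }
      consecutiveBusy = 0;
      reloadDriver();
      return true;
    },
  };
}
```

Requiring several busy results in a row is a design choice that avoids reloading the driver when a single scan request simply overlaps with one already in flight.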

The initial problem we tried to fix is that when the device moves out of the coverage of the connected AP, it is disconnected. After the device moves back into the coverage of a known AP, it should connect to that AP automatically.

When I was testing this, I saw the supplicant realize the connection had been lost, disconnect, scan once (maybe), then start scanning a second time and get stuck in schedule scan. This also meant that further attempts to schedule scans failed, leading to an empty network list, because we use the default setting ap_scan=1 and rely on wpa_supplicant to initiate scanning and AP selection.

I saw there is a flag "sched_scan_supported" defined in external/hostap/src/drivers/driver_nl80211.c

The value of this flag comes from the driver's capability query results and is set to 1 on unagi. Since the problem is related to schedule scan, I set it to 0 manually and disabled the background scan patch we did in Bug 807148. After that, I no longer observe the schedule scan bug. This makes me think that this bug should be fixed in wpa_supplicant or at the wifi driver level.

The wpa_supplicant repo seems to be linked to codeaurora. Not sure if our friends can help verify and fix it. mvines, may I have your comments?
Use android's private command to do sched_scan
Hey Mikes, it looks like our backs are to the wall here. We've tried really hard to work around this apparent driver bug, but the most resilient fix is to simply patch the supplicant (note: the best fix would be to make the driver play nicely with wpa_supplicant, but it isn't clear if that is a possible avenue before the release). How bad/hard is it for us to carry a patch to wpa_supplicant?
Flags: needinfo?(mrbkap) → needinfo?(mwu)
Flags: needinfo?(mvines)
There is a patching mechanism that lets us put in B2G specific patches. We don't use it currently but it's available if all else fails. However, we might be able to get this fixed on hamachi - I'll need to ask around.
Flags: needinfo?(mwu)
I'm going to give this to mwu since it sounds like he's the best-placed guy to get movement from the driver authors.
Assignee: vchang → mwu
:mwu, do you need help here?  what's the update on this?
Michael Vines said he and our other friends can help but not for a bit as they're busy with other things right now.
I'm honestly not sure this is b-b+, since it may not be resolved by our code ship if this is a driver issue. Do we have another way of tracking partner work?
Target Milestone: B2G C2 (20nov-10dec) → B2G C3 (12dec-1jan)
Flags: needinfo?(mvines)
m1, can we get https://bug809663.bugzilla.mozilla.org/attachment.cgi?id=688188 added to the patches list? I would like to get the driver investigated and fixed here, but I have neither access to the kernel source nor much time.
Flags: needinfo?(mvines)
Yep, will take this up in the new year once back in the office.
Flags: needinfo?(mvines)
m1, feel free to close this after you've enqueued the patch.
10-4.  Still in the queue.  You may flog me in person next week if it's still outstanding (50/50 chance at this point I'd say).
I'm assigning to m1 just so we're clear.
Assignee: mwu → mvines
Whiteboard: [label:networking] → [awaiting partner help] [label:networking]
Target Milestone: B2G C3 (12dec-1jan) → B2G C4 (2jan on)
/flog
blocking-b2g: --- → tef+
blocking-basecamp: + → -
I can't seem to reproduce this issue.... removing qawanted.
Keywords: qawanted
This issue has not come up in our WLAN test, so at this point I'm a little uncomfortable landing this patch in our tree.   Can I get some clear STR?
Vincent, can you provide a reliable STR here?
Flags: needinfo?(vchang)
(In reply to Michael Vines [:m1] from comment #27)
> This issue has not come up in our WLAN test, so at this point I'm a little
> uncomfortable landing this patch in our tree.   Can I get some clear STR?

Hey Mike,

This is a very intermittent problem: I've never seen it on my own devices, but I have seen it on other folks' devices. The symptom is that b2g won't automatically connect to other networks and in some cases, it'll refuse to scan at all.
Flags: needinfo?(ggrisco)
(greg, can you please try to reproduce this here)
(In reply to Andrew Overholt [:overholt] from comment #28)
> Vincent, can you provide a reliable STR here?

STR for the schedule scan getting stuck (reproducible every time):

0. Disable the schedule scan workaround by setting manager.schedScanRecovery = false in http://mxr.mozilla.org/mozilla-central/source/dom/wifi/WifiWorker.js#57. You need to recompile and flash the code.
1. Connect to an AP.
2. Leave the coverage of the connected AP.
=> sched_scan fails here. It tries to do a sched_scan but stops there. A normal scan command will be rejected because a sched_scan operation is in progress.
3. After about 12 seconds, move back into the coverage of the AP.

Expected: the phone should connect to the AP automatically.
Actual: the schedule scan gets stuck, so the phone doesn't reconnect to the AP automatically.
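
The stuck state in step 2 can be pictured with a toy model. It is only an illustration of the behaviour described in these steps: ToyScanner and its methods are hypothetical and do not call or mirror any real driver, wpa_supplicant or WifiWorker.js API.

```javascript
// Toy model of the stuck state described in the STR.
class ToyScanner {
  constructor() {
    this.schedScanActive = false;
  }

  // Marks a scheduled scan as outstanding.
  startSchedScan() {
    this.schedScanActive = true;
  }

  // Clears the outstanding scheduled scan.
  stopSchedScan() {
    this.schedScanActive = false;
  }

  // Returns true if a normal scan is accepted, false if it is rejected
  // because a scheduled scan is still considered in progress.
  triggerScan() {
    return !this.schedScanActive;
  }
}
```

In this model, a scheduled scan that never completes leaves schedScanActive set, so every later normal scan is rejected until something clears it. That matches the symptom of an empty network list and no automatic reconnect.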


STR for the wifi driver returning a "device or resource busy" error (not reproducible every time):
   0. Make sure manager.schedScanRecovery is true (the default setting).
   1. Connect to AP1.
   2. Connect to AP2.
   3. Turn off AP2; the phone will connect to AP1 automatically.
   4. Press the power button to turn off the phone's screen and wait for about 2 minutes. The phone will go into sleep mode.
   5. Turn on the screen.

Expected: the phone is connected to AP1 automatically.
Actual: wpa_supplicant gets stuck and shows "Failed to initiate AP scan" in logcat. Only rebooting the device recovers it.
Flags: needinfo?(vchang)
More information: I just got a chance to try the wpa_supplicant binary from the unagi vendor.
It works very well; the schedule scan bug does not occur with it.
(In reply to Vincent Chang[:vchang] from comment #32)
> More information, just get a chance to try wpa_supplicant binary from unagi
> vendor.
> It works very well on schedule scan bug.

So what's left to be done here?  Presumably we won't be providing the wpa_supplicant binary for shipping devices?
Flags: needinfo?(vchang)
(In reply to Andrew Overholt [:overholt] from comment #33)
> (In reply to Vincent Chang[:vchang] from comment #32)
> > More information, just get a chance to try wpa_supplicant binary from unagi
> > vendor.
> > It works very well on schedule scan bug.
> 
> So what's left to be done here?  Presumably we won't be providing the
> wpa_supplicant binary for shipping devices?

The wpa_supplicant for unagi and otoro comes from the codeaurora repo. Apparently there is a compatibility bug between wpa_supplicant_codeaurora and the wifi driver, because I don't observe the schedule scan bug when I replace wpa_supplicant_codeaurora with wpa_supplicant_unagi_vendor.
It would be nice if our friend could reproduce the problem and help fix it, so that we don't need the workaround in gecko, which might make the wifi driver get stuck. We need to reboot the device when we encounter this situation.
Flags: needinfo?(vchang)
Why do the moz builds not just use the wpa_supplicant from the vendor, as that's ultimately what is being used in the product?
(In reply to Michael Vines [:m1] from comment #35)
> Why do the moz builds not just use the wpa_supplicant from the vendor as
> that's what is ultimately what is being used in the product?

Makes sense to me.  We should be using whichever libs are closest to the shipping version.
(In reply to Michael Vines [:m1] from comment #35)
> Why do the moz builds not just use the wpa_supplicant from the vendor as
> that's what is ultimately what is being used in the product?

Who would do this work?  Do the right people have access to the wpa_supplicant binary?  Is it okay from a legal standpoint for us to do this?
Flags: needinfo?
Moz should already have access to the blobs
Flags: needinfo?
I think mwu would know the most about that.
Flags: needinfo?(mwu)
(In reply to Andrew Overholt [:overholt] from comment #37)
> (In reply to Michael Vines [:m1] from comment #35)
> > Why do the moz builds not just use the wpa_supplicant from the vendor as
> > that's what is ultimately what is being used in the product?
> 
> Who would do this work?  Do the right people have access to the
> wpa_supplicant binary?  Is it okay from a legal standpoint for us to do this?

If the original Otoro/Unagi Android image contains the right wpa_supplicant binary, we can simply extract it in the build script without any legal issue, just as we extract other blobs.

Vincent, do you know whether the wpa_supplicant received from the partner is the same as the one in the Android image?
> If the original Otoro/Unagi Android image contains the right wpa_supplicant
> binary, we can simply extract it in build script without any legal issue,
> just like what we extract other blobs.
> 
> Vincent, do you know whether the wpa_supplicant received from partner is
> same as the one in Android image?
The checksums are different for these two blobs. But the wpa_supplicant binary from the Android image seems to work well.
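
For anyone who wants to repeat the comparison, a minimal Node.js sketch is shown here. The choice of SHA-256 and the helper names (sha256Hex, blobsMatch) are assumptions; the thread does not say which checksum tool was used, and the file paths are placeholders supplied by the caller.

```javascript
const crypto = require("crypto");
const fs = require("fs");

// Returns the SHA-256 digest of a Buffer or string as lowercase hex.
function sha256Hex(data) {
  return crypto.createHash("sha256").update(data).digest("hex");
}

// Compares two files by digest. Both paths are placeholders, for example a
// binary pulled from the Android image and one received from the partner.
function blobsMatch(pathA, pathB) {
  return sha256Hex(fs.readFileSync(pathA)) === sha256Hex(fs.readFileSync(pathB));
}
```

A mismatch only shows that the two files differ byte for byte. It says nothing about which build options or source revisions account for the difference.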
(In reply to Vincent Chang[:vchang] from comment #41)
> The checksum are different for these two blobs. But wpa_supplicant binary
> from Android image seems working well.

I thought you'd tried this in Berlin and it didn't work. I wonder what changed.
Does this even need to block?  If it's a vendor thing that we can't control ...
(In reply to Blake Kaplan (:mrbkap) from comment #42)
> (In reply to Vincent Chang[:vchang] from comment #41)
> > The checksum are different for these two blobs. But wpa_supplicant binary
> > from Android image seems working well.
> 
> I thought you'd tried this in Berlin and it didn't work. I wonder what
> changed.

I used the wpa_supplicant from otoro in Berlin, but I used the wpa_supplicant from unagi in yesterday's try. I'm not sure what the difference is between these two binaries.
This doesn't appear to need to block since there's nothing we can do and things are in the OEM's hands.  Re-nomming.
blocking-b2g: tef+ → tef?
blocking-b2g: tef? → -
blocking-basecamp: - → ---
Assignee: mvines → nobody
Flags: needinfo?(ggrisco)
Flags: needinfo?(mwu)
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX