Closed Bug 1047258 Opened 7 years ago Closed 5 years ago

WiFi with Captive Portal and data connection can mess up

Categories

(Firefox OS Graveyard :: Wifi, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(tracking-b2g:backlog)

RESOLVED WONTFIX
tracking-b2g backlog

People

(Reporter: gerard-majax, Unassigned)

Details

(Keywords: foxfood)

[Blocking Requested - why for this release]: This is breaking network connectivity in real life use.

I got into a state where neither WiFi nor data worked on my Nexus S running master, but I believe it's unrelated to the device.

STR:
 0. Be in an area with imperfect captive portal WiFi coverage
 1. Make sure WiFi and data are disabled
 2. Enable WiFi
 3. While it is requesting IP address, enable data

Expected:
 Data connection should be up and displayed in status bar.
 When Wi-Fi finally gets connected, I should get the captive portal notification.
 When connecting to a website, I should be redirected to the captive portal.

Actual:
 I see the pending WiFi icon in the status bar.
 Once connected, going to a website (namely google.fr) ends up with a network error/unreachable page displayed in the browser.

As soon as I disable the data connection, I can get to the captive portal by hitting the reload button. With the same STR but disabling WiFi and reloading the page, I get to google.fr on the data connection.
Component: RIL → Wifi
I don't think it's related to the captive portal. It's more likely something wrong with the routing table in step 3, possibly a race condition on the routing table, because there may be async operations in step 3: (data call) set IP info of data call -- while setting IP info -> (wifi connected) remove IP info of data call, then set IP info of wifi.
Now we have two modules changing IP info at the same time, which might give an unexpected result.
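
To make the suspected interleaving concrete, here is a minimal sketch in Python. It is not Gecko code: the function names, the `state` dictionary and the split of the data call setup into two steps are all assumptions made for illustration, and the addresses are only sample values. It shows how a WiFi handler running between two asynchronous steps of the data call setup could leave the default route on an interface whose IP info was already removed.

```python
# Hypothetical model of two modules updating shared IP info.
# Each "yield" marks a point where the other module may run.

def data_call_setup(state):
    # Step A: set IP info of the data call (illustrative address).
    state["addr"]["rmnet0"] = "111.81.199.208/27"
    yield
    # Step B: later async step, point the default route at the data call.
    state["default_route"] = "rmnet0"
    yield

def wifi_connected(state):
    # Remove IP info of the data call, then set IP info of WiFi.
    state["addr"].pop("rmnet0", None)
    state["addr"]["wlan0"] = "10.247.30.92/21"
    state["default_route"] = "wlan0"
    yield

def run(schedule):
    """Run the two tasks in the given order of steps and return the final state."""
    state = {"addr": {}, "default_route": None}
    tasks = {"data": data_call_setup(state), "wifi": wifi_connected(state)}
    for name in schedule:
        next(tasks[name], None)
    return state

if __name__ == "__main__":
    # Data call finishes first, then WiFi takes over: consistent state.
    print(run(["data", "data", "wifi"]))
    # WiFi handler runs between the two data call steps: the default route
    # ends up on rmnet0, which no longer has any IP info.
    print(run(["data", "wifi", "data"]))
```

If something like the second schedule happens on the device, neither interface would be usable until one of them is disabled and the state is rebuilt, which would match the recovery described in the report.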

But I don't think it's a blocker, because it can be recovered by disabling one of the connections.
How high is the reproduction rate?
The behavior should be the same as quickly switching on the data call right after WiFi starts connecting, but I can't reproduce it that way so far.
Could QA check what the reproduction rate is here? Thanks.
Keywords: qawanted
(In reply to howie [:howie] from comment #3)
> Could QA check what the reproduction rate is here? Thanks.

Gerry - Could you assist with the QA request here? I don't think US QA has access to a captive portal setup on our side, but I think your team does.
Flags: needinfo?(gchang)
I could recreate the problem on Flame only once, with the following build.
The reproduction rate is very low: I tested this more than 40 times and hit it only once.
I think this is a timing issue, because it requires quickly switching on the data call right after WiFi starts connecting to the captive portal.

Gaia      54c3c19d439f7dbafda5c6cc3b4850b545a068ba
Gecko     https://hg.mozilla.org/mozilla-central/rev/bd44d84142e8
BuildID   20140807160201
Version   34.0a1
Flags: needinfo?(gchang)
(In reply to Gerry Chang [:cfchang] from comment #5)
> I could recreate the problem on Flame only once, with the following build.
> The reproduction rate is very low: I tested this more than 40 times and hit
> it only once.
> I think this is a timing issue, because it requires quickly switching on the
> data call right after WiFi starts connecting to the captive portal.
> 
> Gaia      54c3c19d439f7dbafda5c6cc3b4850b545a068ba
> Gecko     https://hg.mozilla.org/mozilla-central/rev/bd44d84142e8
> BuildID   20140807160201
> Version   34.0a1

And I do trigger this often enough that it is annoying my dogfooding. I'm sorry, I cannot help more than saying this happens at the train station with the WiFi and data connectivity available there. It can probably be triggered more easily in poor network conditions (WiFi AP with bad signal, crowded area, etc.).
Keywords: qawanted
We'd like to know if this is a regression. Gerry, thank you very much for this.
Flags: needinfo?(gchang)
Sorry for the wrong tag. Please run a branch check on 1.4 and 2.0, thanks.
(In reply to Alexandre LISSY :gerard-majax from comment #6)
> And I do trigger this often enough that it is annoying my dogfooding. I'm
> sorry, I cannot help more than saying this happens at the train station with
> the WiFi and data connectivity available there. It can probably be triggered
> more easily in poor network conditions (WiFi AP with bad signal, crowded
> area, etc.).

It is an issue that we need to fix. However, even if we fix this bug, the user still
may not be able to surf the internet in such bad network conditions. So, I wonder
if this is a blocker.
My experience shows that once connected, there is no problem surfing; it's totally usable even if not perfect.
Removing the Qa-wanted tag (as indicated in comment 4 - we do not have access to a captive portal here) - the NI to Gerry (thanks Gerry!) should be all that is necessary to get a branch check.
Keywords: qawanted
I tried to reproduce this by inserting code that enables the data connection at different steps of the WiFi connection process.
I tested at the steps of just associating[1], just connected[2] and right before running DHCP[3], but couldn't reproduce it at any of these steps.

[1] http://hg.mozilla.org/mozilla-central/file/d7e78f0c1465/dom/wifi/WifiWorker.js#l2076
[2] http://hg.mozilla.org/mozilla-central/file/d7e78f0c1465/dom/wifi/WifiWorker.js#l617
[3] http://hg.mozilla.org/mozilla-central/file/d7e78f0c1465/dom/wifi/WifiWorker.js#l627
I can't recreate this problem on 2.0, either.
I used the build below:
Gaia      8b1b64ca3347e015d7a57df6d053f95cd26046ca
Gecko     https://hg.mozilla.org/releases/mozilla-b2g32_v2_0/rev/2f288e8aea09
BuildID   20140813160201
Version   32.0
Flags: needinfo?(gchang)
There is a very short window, which I estimate at less than 500 ms, in which both the data connection and the WiFi interface are active, because NetworkManager makes sure WiFi is connected and then asks RIL to disable the data connection:

netcfg
> rmnet0   UP   111.81.199.208/27  0x00000041 00:00:00:00:00:00
> lo       UP        127.0.0.1/8   0x00000049 00:00:00:00:00:00
> wlan0    UP     10.247.30.92/21  0x00001043 00:0a:f5:df:60:60

routing table
> Iface   Destination  Gateway   Flags  RefCnt  Use  Metric  Mask      MTU  Window  IRTT
> wlan0   00000000     0118F70A  0003   0       0    0       00000000  0    0       0
> wlan0   0018F70A     00000000  0001   0       0    322     00F8FFFF  0    0       0
> rmnet0  C0C7516F     00000000  0001   0       0    0       E0FFFFFF  0    0       0
> rmnet0  01015FA8     D1C7516F  0007   0       0    0       FFFFFFFF  0    0       0
> rmnet0  01C05FA8     D1C7516F  0007   0       0    0       FFFFFFFF  0    0       0

DNS property
> [dhcp.wlan0.dns1]: [10.247.75.5]
> [dhcp.wlan0.dns2]: []
> [dhcp.wlan0.dns3]: []
> [dhcp.wlan0.dns4]: []
> [net.dns1]: [10.247.75.5]
> [net.dns2]: [168.95.192.1]
> [net.dnschange]: [7]
> [net.rmnet0.dns1]: [168.95.1.1]
> [net.rmnet0.dns2]: [168.95.192.1]
> [net.wlan0.dns1]: [10.247.75.5]
> [net.wlan0.dns2]: [0.0.0.0]

But the default route has already changed to WiFi by then, and the data connection interface then goes down:

netcfg
> rmnet0   DOWN       0.0.0.0/0   0x00000000 00:00:00:00:00:00
> lo       UP       127.0.0.1/8   0x00000049 00:00:00:00:00:00
> wlan0    UP    10.247.30.92/21  0x00001043 00:0a:f5:df:60:60

routing table
> Iface   Destination  Gateway   Flags  RefCnt  Use  Metric  Mask      MTU  Window  IRTT
> wlan0   00000000     0118F70A  0003   0       0    0       00000000  0    0       0
> wlan0   0018F70A     00000000  0001   0       0    322     00F8FFFF  0    0       0

DNS property
> [dhcp.wlan0.dns1]: [10.247.75.5]
> [dhcp.wlan0.dns2]: []
> [dhcp.wlan0.dns3]: []
> [dhcp.wlan0.dns4]: []
> [net.dns1]: [10.247.75.5]
> [net.dns2]: [168.95.192.1]
> [net.dnschange]: [7]
> [net.rmnet0.dns1]: []
> [net.rmnet0.dns2]: []
> [net.wlan0.dns1]: [10.247.75.5]
> [net.wlan0.dns2]: [0.0.0.0]

Since the default route and DNS do not change, captive portal detection should not be affected by the coexistence of the data connection and the WiFi interface: traffic always uses the WiFi interface once it's connected.
Furthermore, captive portal detection is executed after the data connection is disabled (confirmed by adding a debug message).
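
For anyone reading the dumps above: the routing table fields look like the /proc/net/route layout, where addresses are hex in little-endian byte order. That is an assumption, but decoding them this way matches the netcfg output. A small Python helper (names are mine, not from any tool) to decode them:

```python
import socket
import struct

def decode_route_hex(value: str) -> str:
    """Convert a little-endian hex address field to dotted-quad notation."""
    return socket.inet_ntoa(struct.pack("<I", int(value, 16)))

def prefix_length(mask_hex: str) -> int:
    """Count the set bits of a hex netmask field to get the prefix length."""
    return bin(int(mask_hex, 16)).count("1")

if __name__ == "__main__":
    # (iface, destination, gateway, mask) rows copied from the first dump.
    rows = [
        ("wlan0", "00000000", "0118F70A", "00000000"),
        ("wlan0", "0018F70A", "00000000", "00F8FFFF"),
        ("rmnet0", "C0C7516F", "00000000", "E0FFFFFF"),
        ("rmnet0", "01015FA8", "D1C7516F", "FFFFFFFF"),
        ("rmnet0", "01C05FA8", "D1C7516F", "FFFFFFFF"),
    ]
    for iface, dest, gateway, mask in rows:
        print(iface,
              f"{decode_route_hex(dest)}/{prefix_length(mask)}",
              "via", decode_route_hex(gateway))
```

Decoded this way, the only default route (0.0.0.0/0) is on wlan0 via 10.247.24.1 in both dumps, and the two rmnet0 host routes point at the data call DNS servers 168.95.1.1 and 168.95.192.1.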

So I don't think the problem is caused by the window in which both interfaces are up.
Another hypothesis is that the captive portal detector tries to send an HTTP request through WiFi to check whether a captive portal exists, and the request/response packet is lost due to poor signal, or WiFi is disconnected and the connection switches to the data call for a very short time.
As far as I remember, all traffic is handled by Necko, and I am not sure whether Necko will get blocked in this case.
But I have to figure out how to create such a scenario first.
Triage: Not a blocker due to the reproduction rate and user impact, but we will keep investigating and tracking this.
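
To illustrate the lost-probe hypothesis, here is a rough Python sketch of how a probe result could be classified. This is not the Gecko detector: `classify_probe`, `EXPECTED_BODY` and the three result names are hypothetical, and a `None` status stands in for a request that timed out or whose response was lost.

```python
from typing import Optional

# Hypothetical canonical content the probe URL is expected to return.
EXPECTED_BODY = "success\n"

def classify_probe(status: Optional[int], body: Optional[str]) -> str:
    """Classify one captive portal probe.

    status is None when no response arrived (timeout, packet loss,
    or the interface went away mid-request).
    """
    if status is None:
        return "unknown"   # cannot tell whether a portal exists
    if status == 200 and body == EXPECTED_BODY:
        return "open"      # canonical content came back untouched
    return "captive"       # redirect or rewritten content: portal present

if __name__ == "__main__":
    print(classify_probe(200, EXPECTED_BODY))
    print(classify_probe(302, None))
    print(classify_probe(None, None))
```

If the detector ends up in the "unknown" case and does not retry, no captive portal notification would be shown even though WiFi is connected, which is one way the symptom in this bug could appear under poor signal.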
blocking-b2g: 2.1? → backlog
blocking-b2g: backlog → ---
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX