Closed Bug 820617 Opened 12 years ago Closed 12 years ago

Add a hook to make NetworkManager not manage offline status and use it in Marionette for B2G CI

Categories

(Core :: DOM: Device Interfaces, defect, P1)

All
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla20
blocking-basecamp +
Tracking Status
firefox19 --- fixed
firefox20 --- fixed
b2g18 --- fixed

People

(Reporter: jgriffin, Assigned: philikon)

References

Details

Attachments

(1 file)

Marionette opens an nsIServerSocket on 0.0.0.0:2828 to listen for connections.  On the unagi and other configs, this works fine.  On the panda only, attempting to connect to this will cause this error when Marionette tries to write to the socket:

E/GeckoConsole( 1243): [JavaScript Error: "[Exception... "Component returned failure code: 0x804b0010 (NS_ERROR_OFFLINE) [nsIOutputStream.write]"  nsresult: "0x804b0010 (NS_ERROR_OFFLINE)"  location: "JS frame :: chrome://global/content/devtools/dbg-transport.js :: DT_onOutputStreamReady :: line 94"  data: no]" {file: "chrome://global/content/devtools/dbg-transport.js" line: 94}]

To reproduce:  use a build from Dec 10 or newer, find the IP of the panda, and attempt to telnet to port 2828 on that IP.

It is possible to connect to this port from localhost successfully (following an adb forward), but not from a remote IP.  This blocks all buildbot gaia work.
Blocks gaia tests in TBPL, so noming for bb+
blocking-basecamp: --- → ?
(In reply to Jonathan Griffin (:jgriffin) from comment #1)
> Blocks gaia tests in TBPL, so noming for bb+

We need this to work.  Jonathan, is this something that you can change or something that needs a fix in the platform?
Assignee: nobody → jgriffin
blocking-basecamp: ? → +
(In reply to Andrew Overholt [:overholt] from comment #2)
> (In reply to Jonathan Griffin (:jgriffin) from comment #1)
> > Blocks gaia tests in TBPL, so noming for bb+
> 
> We need this to work.  Jonathan, is this something that you can change or
> something that needs a fix in the platform?

This isn't a Marionette bug, we need someone in the platform to take a look.
Assignee: jgriffin → nobody
Michael, can you take this?  Not too many people have access to Pandaboards AFAIK.
Assignee: nobody → mwu
Flags: needinfo?(mwu)
FYI, there are spare panda boards at the Toronto office which can be loaned.
We can also order and ship panda boards in a very short period of time.
I can't get to this till wednesday/thursday due to travel.

This sounds like an offline mode bug. I'm cc'ing some people who might know about offline mode.
Flags: needinfo?(mwu)
Josh Aas has asked some of the networking team members to take a look and hopefully one of them will be able to take this.
Target Milestone: --- → B2G C3 (12dec-1jan)
I ran tcpdump while attempting 'telnet 192.168.1.17 2828' and this is the output:

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:27:47.764396 IP (tos 0x10, ttl 64, id 49017, offset 0, flags [DF], proto TCP (6), length 60)
    jgriffin-ubuntu11-64bit.local.40020 > 192.168.1.17.2828: Flags [S], cksum 0x83a3 (incorrect -> 0x623c), seq 1093783542, win 14600, options [mss 1460,sackOK,TS val 3109007 ecr 0,nop,wscale 7], length 0
13:27:47.766384 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.1.17.2828 > jgriffin-ubuntu11-64bit.local.40020: Flags [S.], cksum 0x4a7d (correct), seq 3690802056, ack 1093783543, win 14360, options [mss 1448,sackOK,TS val 4391 ecr 3109007,nop,wscale 4], length 0
13:27:47.766447 IP (tos 0x10, ttl 64, id 49018, offset 0, flags [DF], proto TCP (6), length 52)
    jgriffin-ubuntu11-64bit.local.40020 > 192.168.1.17.2828: Flags [.], cksum 0x839b (incorrect -> 0xb0df), ack 1, win 115, options [nop,nop,TS val 3109007 ecr 4391], length 0
13:27:47.768194 IP (tos 0x0, ttl 64, id 15022, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.17.2828 > jgriffin-ubuntu11-64bit.local.40020: Flags [F.], cksum 0xadcf (correct), seq 1, ack 1, win 898, options [nop,nop,TS val 4391 ecr 3109007], length 0
13:27:47.768284 IP (tos 0x10, ttl 64, id 49019, offset 0, flags [DF], proto TCP (6), length 52)
    jgriffin-ubuntu11-64bit.local.40020 > 192.168.1.17.2828: Flags [F.], cksum 0x839b (incorrect -> 0xb0dc), seq 1, ack 2, win 115, options [nop,nop,TS val 3109008 ecr 4391], length 0
13:27:47.769384 IP (tos 0x0, ttl 64, id 15023, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.17.2828 > jgriffin-ubuntu11-64bit.local.40020: Flags [.], cksum 0xadcd (correct), ack 2, win 898, options [nop,nop,TS val 4391 ecr 3109008], length 0
13:27:52.771720 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has jgriffin-ubuntu11-64bit.local tell 192.168.1.17, length 46
13:27:52.771733 ARP, Ethernet (len 6), IPv4 (len 4), Reply jgriffin-ubuntu11-64bit.local is-at 00:0c:29:5f:e5:24 (oui Unknown), length 28
My theory is that on that hardware/environment necko doesn't correctly detect offline status (link up status).

It's probably a regression from bug 87717.  CC'ing author of the patch.
(In reply to Honza Bambas (:mayhemer) from comment #9)
> My theory is that on that hardware/environment necko doesn't correctly
> detect offline status (link up status).
> 
> It's probably a regression from bug 87717.  CC'ing author of the patch.

you cc'd me  but I didn't write that. I've added the patch author.
(In reply to Patrick McManus [:mcmanus] from comment #10)
> (In reply to Honza Bambas (:mayhemer) from comment #9)
> > My theory is that on that hardware/environment necko doesn't correctly
> > detect offline status (link up status).
> > 
> > It's probably a regression from bug 87717.  CC'ing author of the patch.
> 
> you cc'd me  but I didn't write that. I've added the patch author.

Ah, sorry, overlook.
A tcpdump from a similar situation, but against the unagi, which works correctly:

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:45:59.224833 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.1.18 tell jgriffin-ubuntu11-64bit.local, length 28
16:45:59.318618 ARP, Ethernet (len 6), IPv4 (len 4), Reply 192.168.1.18 is-at 48:28:2f:f9:be:25 (oui Unknown), length 46
16:45:59.318631 IP (tos 0x10, ttl 64, id 9008, offset 0, flags [DF], proto TCP (6), length 60)
    jgriffin-ubuntu11-64bit.local.41788 > 192.168.1.18.2828: Flags [S], cksum 0x83a4 (incorrect -> 0x114c), seq 264433279, win 14600, options [mss 1460,sackOK,TS val 6081871 ecr 0,nop,wscale 7], length 0
16:45:59.413214 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.1.18.2828 > jgriffin-ubuntu11-64bit.local.41788: Flags [S.], cksum 0x22ed (correct), seq 3801571169, ack 264433280, win 14480, options [mss 1460,sackOK,TS val 4294945997 ecr 6081871,nop,wscale 6], length 0
16:45:59.413264 IP (tos 0x10, ttl 64, id 9009, offset 0, flags [DF], proto TCP (6), length 52)
    jgriffin-ubuntu11-64bit.local.41788 > 192.168.1.18.2828: Flags [.], cksum 0x839c (incorrect -> 0x89a5), ack 1, win 115, options [nop,nop,TS val 6081919 ecr 4294945997], length 0
16:45:59.472350 IP (tos 0x0, ttl 64, id 62380, offset 0, flags [DF], proto TCP (6), length 121)
    192.168.1.18.2828 > jgriffin-ubuntu11-64bit.local.41788: Flags [P.], cksum 0x739d (correct), seq 1:70, ack 1, win 227, options [nop,nop,TS val 4294946003 ecr 6081919], length 69
16:45:59.472384 IP (tos 0x10, ttl 64, id 9010, offset 0, flags [DF], proto TCP (6), length 52)
    jgriffin-ubuntu11-64bit.local.41788 > 192.168.1.18.2828: Flags [.], cksum 0x839c (incorrect -> 0x894b), ack 70, win 115, options [nop,nop,TS val 6081934 ecr 4294946003], length 0
(In reply to Jonathan Griffin (:jgriffin) from comment #0)
> On the panda
> only, attempting to connect to this will cause this error when Marionette
> tries to write to the socket:
> 
> E/GeckoConsole( 1243): [JavaScript Error: "[Exception... "Component returned
> failure code: 0x804b0010 (NS_ERROR_OFFLINE) [nsIOutputStream.write]" 
> nsresult: "0x804b0010 (NS_ERROR_OFFLINE)"  location: "JS frame ::
> chrome://global/content/devtools/dbg-transport.js :: DT_onOutputStreamReady
> :: line 94"  data: no]" {file:
> "chrome://global/content/devtools/dbg-transport.js" line: 94}]

The callback (|nsIOutputStreamCallback|) gets called both when the stream is ready and when it's closed, so this is surely being called for the closed case.

> To reproduce:  use a build from Dec 10 or newer, find the IP of the panda,

What's the significance of builds after 10 December?  That is, does that represent a possible regression range, or was something turned on then and turns out it's broken?

TCP dumps probably won't reveal anything here.  The bug is likely either we never came online or we went offline when we shouldn't've. 

(In reply to Honza Bambas (:mayhemer) from comment #9)
> My theory is that on that hardware/environment necko doesn't correctly
> detect offline status (link up status).

Not too familiar with Marionette, but it appears it doesn't rely on link status?  In testing/marionette/marionette-listener.js:61 (|registerSelf|):  

> Services.io.manageOfflineStatus = false;

I didn't see anywhere that it turns it back on.

When does the connect happen?  We start offline (though we go online pretty soon); could it be before we're online for this device?
(In reply to Adam [:hobophobe] from comment #13)
> (In reply to Jonathan Griffin (:jgriffin) from comment #0)
> > On the panda
> > only, attempting to connect to this will cause this error when Marionette
> > tries to write to the socket:
> > 
> > E/GeckoConsole( 1243): [JavaScript Error: "[Exception... "Component returned
> > failure code: 0x804b0010 (NS_ERROR_OFFLINE) [nsIOutputStream.write]" 
> > nsresult: "0x804b0010 (NS_ERROR_OFFLINE)"  location: "JS frame ::
> > chrome://global/content/devtools/dbg-transport.js :: DT_onOutputStreamReady
> > :: line 94"  data: no]" {file:
> > "chrome://global/content/devtools/dbg-transport.js" line: 94}]
> 
> The callback (|nsIOutputStreamCallback|) gets called both when the stream is
> ready and when it's closed, so this is surely being called for the closed
> case.
> 
> > To reproduce:  use a build from Dec 10 or newer, find the IP of the panda,
> 
> What's the significance of builds after 10 December?  That is, does that
> represent a possible regression range, or was something turned on then and
> turns out it's broken?
> 

Before Dec 10, Marionette wasn't functional on the pandas at all due to bug 800138.  The fix for that landed on Dec 10.

> TCP dumps probably won't reveal anything here.  The bug is likely either we
> never came online or we went offline when we shouldn't've. 
> 
> (In reply to Honza Bambas (:mayhemer) from comment #9)
> > My theory is that on that hardware/environment necko doesn't correctly
> > detect offline status (link up status).
> 
> Not too familiar with Marionette, but it appears it doesn't rely on link
> status?  In testing/marionette/marionette-listener.js:61 (|registerSelf|):  
> 
> > Services.io.manageOfflineStatus = false;
> 
> I didn't see anywhere that it turns it back on.

We never turn it back on, since it can interfere with tests; see bug 777145 where this change was made.

> 
> When does the connect happen?  We start offline (though we go online pretty
> soon); could it be before we're online for this device?

The problem can be reproduced using 'telnet |ip of panda board| 2828' at any time; it isn't the case that we're attempting this before the device is actually online.
One difference between the unagis and the pandas is that on the pandas we use only an ethernet connection, vs a wifi connection on unagis and other devices.
Looking at desktop, this is the flow leading to the issue:

0. [connection occurs]
1. netwerk/base/src/nsSocketTransport2.cpp:~1050 |nsSocketTransport::InitiateSocket| 
2. netwerk/base/src/nsSocketTransport2.cpp:~470 |nsSocketTransport::OnSocketReady| 
3. toolkit/devtools/debugger/dbg-transport.js:~90 |DebuggerTransport::onOutputStreamReady| 

In |InitiateSocket|, a check was added in the patch for bug 87717:

>     if (gIOService->IsOffline() &&
>         !PR_IsNetAddrType(&mNetAddr, PR_IpAddrLoopback))
>         return NS_ERROR_OFFLINE;

So, the failure is being set offline when we shouldn't be.  It may be that we're still using a managed connection when we shouldn't be, or something else may be setting us offline?

Maybe something causes gonk (dom/system/gonk/NetworkManager.js:~410 |setAndConfigureActive|) to set us offline?

Debugging the target for the offline management state (|nsIOService::SetManageOfflineStatus|) and the offline state (|nsIOService::SetOffline|) should find the cause.
Similar to bug 820517
I'm looking at http://dxr.mozilla.org/mozilla-central/dom/system/gonk/nsINetworkManager.idl.html and http://dxr.mozilla.org/mozilla-central/dom/system/gonk/NetworkManager.js.html

I don't see an ethernet type for connection, just wifi and various mobile type connections.  NETWORK_TYPE_WIFI = 0 in the idl, so I will take a wild guess and say that even the ethernet connection is typed as wifi, causing this issue.
Sorry, that was bug 821084
I would like to remove bb+ since this is not a bug that blocks us from shipping.
It would be awesome to fix it so we could hopefully have b2g Gaia UI tests but it not something necessarily to ship.

Please bb- this bug if you agree with this.
blocking-basecamp: + → ?
(In reply to Armen Zambrano G. [:armenzg] from comment #19)
> I would like to remove bb+ since this is not a bug that blocks us from
> shipping.
> It would be awesome to fix it so we could hopefully have b2g Gaia UI tests
> but it not something necessarily to ship.
> 
> Please bb- this bug if you agree with this.

We're really like automated tests so let's keep it as a blocker for at least a bit longer.
Assignee: mwu → philipp
blocking-basecamp: ? → +
Priority: -- → P1
(In reply to Adam [:hobophobe] from comment #16)
> Maybe something causes gonk (dom/system/gonk/NetworkManager.js:~410
> |setAndConfigureActive|) to set us offline?

Yup. In Gonk, not having Wifi or a cellular data connection will put Gecko into offline mode. Which then sadly results in non-localhost connections being denied (see also bug 821084 for a similar symptom for the same cause).

We have three options:

1. The "proper" fix:

 a. Either make Gecko's NetworkManager aware of Ethernet.

 b. Or get rid of the "offline" handling in NetworkManager, and instead implement nsINetworkLinkService for Gonk. This could facilitate information from NetworkManager, but ideally would just look at the system's network configuration.

 One of these would be the ideal solution, but will also require a good amount of engineering. So not something we'd want right now I'd say.

2. Special-case the Pandaboard. We can detect when we're running on a Pandaboard and simply always be "online" when running on that platform. The only caveat that I can see is that it will prevent us from writing tests that involve offline mode on those devices. But given the constraints with Gecko killing the Marionette control socket when going to offline mode (bug 821084), I'm not sure we could do this right now anyway.

3. Introduce a hook that would put Marionette in control. I imagine we introduce a simple pref, e.g. similar to "network.manage-offline-status" which tells Gecko whether it should control "offline" mode based on nsINetworkLinkService information or not. In our case, let's say "network.gonk.manage-offline-status" would always be true normally. Marionette could set it to false, thereby disabling the NetworkManager's "offline" handling. I like this option the best.

jgriffin, what do you think?
Attached patch option #3 v1Splinter Review
Here's a simple patch for option #3. With this the error described in this bug no longer occurs.
Attachment #695896 - Flags: review?(jgriffin)
:jgriffin, are you available for a review?  anything sitting longer than a day for review should be reviewed by someone else.
I can review today.
Comment on attachment 695896 [details] [diff] [review]
option #3 v1

Review of attachment 695896 [details] [diff] [review]:
-----------------------------------------------------------------

Nice, this works great on the panda, and also fixes the problem on the unagi described in bug 821084.
Attachment #695896 - Flags: review?(jgriffin) → review+
Pushed to try to make sure this doesn't break any of the existing emulator tests:

https://tbpl.mozilla.org/?tree=Try&rev=e9985870526c
https://hg.mozilla.org/integration/mozilla-inbound/rev/14a22134efd2
Component: Builds → DOM: Device Interfaces
Product: Boot2Gecko → Core
Summary: Connecting to an nsIServerSocket on panda fails with NS_ERROR_OFFLINE → Add a hook to make NetworkManager not manage offline status and use it in Marionette for B2G CI
Target Milestone: B2G C3 (12dec-1jan) → mozilla20
https://hg.mozilla.org/mozilla-central/rev/14a22134efd2
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: