Last Comment Bug 820617 - Add a hook to make NetworkManager not manage offline status and use it in Marionette for B2G CI
: Add a hook to make NetworkManager not manage offline status and use it in Mar...
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: DOM: Device Interfaces (show other bugs)
: unspecified
: All Gonk (Firefox OS)
: P1 normal (vote)
: mozilla20
Assigned To: Philipp von Weitershausen [:philikon]
:
Mentors:
: 821084 (view as bug list)
Depends on:
Blocks: 802317
  Show dependency treegraph
 
Reported: 2012-12-11 16:01 PST by Jonathan Griffin (:jgriffin)
Modified: 2012-12-30 16:54 PST (History)
21 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed
fixed
fixed


Attachments
option #3 v1 (9.52 KB, patch)
2012-12-26 18:17 PST, Philipp von Weitershausen [:philikon]
jgriffin: review+
Details | Diff | Splinter Review

Description Jonathan Griffin (:jgriffin) 2012-12-11 16:01:40 PST
Marionette opens an nsIServerSocket on 0.0.0.0:2828 to listen for connections.  On the unagi and other configs, this works fine.  On the panda only, attempting to connect to this will cause this error when Marionette tries to write to the socket:

E/GeckoConsole( 1243): [JavaScript Error: "[Exception... "Component returned failure code: 0x804b0010 (NS_ERROR_OFFLINE) [nsIOutputStream.write]"  nsresult: "0x804b0010 (NS_ERROR_OFFLINE)"  location: "JS frame :: chrome://global/content/devtools/dbg-transport.js :: DT_onOutputStreamReady :: line 94"  data: no]" {file: "chrome://global/content/devtools/dbg-transport.js" line: 94}]

To reproduce:  use a build from Dec 10 or newer, find the IP of the panda, and attempt to telnet to port 2828 on that IP.

It is possible to connect to this port from localhost successfully (following an adb forward), but not from a remote IP.  This blocks all buildbot gaia work.
Comment 1 Jonathan Griffin (:jgriffin) 2012-12-11 17:41:32 PST
Blocks gaia tests in TBPL, so noming for bb+
Comment 2 Andrew Overholt [:overholt] 2012-12-12 11:07:43 PST
(In reply to Jonathan Griffin (:jgriffin) from comment #1)
> Blocks gaia tests in TBPL, so noming for bb+

We need this to work.  Jonathan, is this something that you can change or something that needs a fix in the platform?
Comment 3 Jonathan Griffin (:jgriffin) 2012-12-12 13:18:48 PST
(In reply to Andrew Overholt [:overholt] from comment #2)
> (In reply to Jonathan Griffin (:jgriffin) from comment #1)
> > Blocks gaia tests in TBPL, so noming for bb+
> 
> We need this to work.  Jonathan, is this something that you can change or
> something that needs a fix in the platform?

This isn't a Marionette bug, we need someone in the platform to take a look.
Comment 4 Andrew Overholt [:overholt] 2012-12-13 07:22:45 PST
Michael, can you take this?  Not too many people have access to Pandaboards AFAIK.
Comment 5 Armen Zambrano [:armenzg] - Engineering productivity 2012-12-13 07:28:00 PST
FYI, there are spare panda boards at the Toronto office which can be loaned.
We can also order and ship panda boards in a very short period of time.
Comment 6 Michael Wu [:mwu] 2012-12-15 09:34:19 PST
I can't get to this till wednesday/thursday due to travel.

This sounds like an offline mode bug. I'm cc'ing some people who might know about offline mode.
Comment 7 Andrew Overholt [:overholt] 2012-12-17 07:03:17 PST
Josh Aas has asked some of the networking team members to take a look and hopefully one of them will be able to take this.
Comment 8 Jonathan Griffin (:jgriffin) 2012-12-17 13:28:33 PST
I ran tcpdump while attempting 'telnet 192.168.1.17 2828' and this is the output:

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:27:47.764396 IP (tos 0x10, ttl 64, id 49017, offset 0, flags [DF], proto TCP (6), length 60)
    jgriffin-ubuntu11-64bit.local.40020 > 192.168.1.17.2828: Flags [S], cksum 0x83a3 (incorrect -> 0x623c), seq 1093783542, win 14600, options [mss 1460,sackOK,TS val 3109007 ecr 0,nop,wscale 7], length 0
13:27:47.766384 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.1.17.2828 > jgriffin-ubuntu11-64bit.local.40020: Flags [S.], cksum 0x4a7d (correct), seq 3690802056, ack 1093783543, win 14360, options [mss 1448,sackOK,TS val 4391 ecr 3109007,nop,wscale 4], length 0
13:27:47.766447 IP (tos 0x10, ttl 64, id 49018, offset 0, flags [DF], proto TCP (6), length 52)
    jgriffin-ubuntu11-64bit.local.40020 > 192.168.1.17.2828: Flags [.], cksum 0x839b (incorrect -> 0xb0df), ack 1, win 115, options [nop,nop,TS val 3109007 ecr 4391], length 0
13:27:47.768194 IP (tos 0x0, ttl 64, id 15022, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.17.2828 > jgriffin-ubuntu11-64bit.local.40020: Flags [F.], cksum 0xadcf (correct), seq 1, ack 1, win 898, options [nop,nop,TS val 4391 ecr 3109007], length 0
13:27:47.768284 IP (tos 0x10, ttl 64, id 49019, offset 0, flags [DF], proto TCP (6), length 52)
    jgriffin-ubuntu11-64bit.local.40020 > 192.168.1.17.2828: Flags [F.], cksum 0x839b (incorrect -> 0xb0dc), seq 1, ack 2, win 115, options [nop,nop,TS val 3109008 ecr 4391], length 0
13:27:47.769384 IP (tos 0x0, ttl 64, id 15023, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.17.2828 > jgriffin-ubuntu11-64bit.local.40020: Flags [.], cksum 0xadcd (correct), ack 2, win 898, options [nop,nop,TS val 4391 ecr 3109008], length 0
13:27:52.771720 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has jgriffin-ubuntu11-64bit.local tell 192.168.1.17, length 46
13:27:52.771733 ARP, Ethernet (len 6), IPv4 (len 4), Reply jgriffin-ubuntu11-64bit.local is-at 00:0c:29:5f:e5:24 (oui Unknown), length 28
Comment 9 Honza Bambas (:mayhemer) 2012-12-17 14:10:18 PST
My theory is that on that hardware/environment necko doesn't correctly detect offline status (link up status).

It's probably a regression from bug 87717.  CC'ing author of the patch.
Comment 10 Patrick McManus [:mcmanus] 2012-12-17 14:22:52 PST
(In reply to Honza Bambas (:mayhemer) from comment #9)
> My theory is that on that hardware/environment necko doesn't correctly
> detect offline status (link up status).
> 
> It's probably a regression from bug 87717.  CC'ing author of the patch.

you cc'd me  but I didn't write that. I've added the patch author.
Comment 11 Honza Bambas (:mayhemer) 2012-12-17 14:55:00 PST
(In reply to Patrick McManus [:mcmanus] from comment #10)
> (In reply to Honza Bambas (:mayhemer) from comment #9)
> > My theory is that on that hardware/environment necko doesn't correctly
> > detect offline status (link up status).
> > 
> > It's probably a regression from bug 87717.  CC'ing author of the patch.
> 
> you cc'd me  but I didn't write that. I've added the patch author.

Ah, sorry, overlook.
Comment 12 Jonathan Griffin (:jgriffin) 2012-12-17 16:46:56 PST
A tcpdump from a similar situation, but against the unagi, which works correctly:

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:45:59.224833 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.1.18 tell jgriffin-ubuntu11-64bit.local, length 28
16:45:59.318618 ARP, Ethernet (len 6), IPv4 (len 4), Reply 192.168.1.18 is-at 48:28:2f:f9:be:25 (oui Unknown), length 46
16:45:59.318631 IP (tos 0x10, ttl 64, id 9008, offset 0, flags [DF], proto TCP (6), length 60)
    jgriffin-ubuntu11-64bit.local.41788 > 192.168.1.18.2828: Flags [S], cksum 0x83a4 (incorrect -> 0x114c), seq 264433279, win 14600, options [mss 1460,sackOK,TS val 6081871 ecr 0,nop,wscale 7], length 0
16:45:59.413214 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.1.18.2828 > jgriffin-ubuntu11-64bit.local.41788: Flags [S.], cksum 0x22ed (correct), seq 3801571169, ack 264433280, win 14480, options [mss 1460,sackOK,TS val 4294945997 ecr 6081871,nop,wscale 6], length 0
16:45:59.413264 IP (tos 0x10, ttl 64, id 9009, offset 0, flags [DF], proto TCP (6), length 52)
    jgriffin-ubuntu11-64bit.local.41788 > 192.168.1.18.2828: Flags [.], cksum 0x839c (incorrect -> 0x89a5), ack 1, win 115, options [nop,nop,TS val 6081919 ecr 4294945997], length 0
16:45:59.472350 IP (tos 0x0, ttl 64, id 62380, offset 0, flags [DF], proto TCP (6), length 121)
    192.168.1.18.2828 > jgriffin-ubuntu11-64bit.local.41788: Flags [P.], cksum 0x739d (correct), seq 1:70, ack 1, win 227, options [nop,nop,TS val 4294946003 ecr 6081919], length 69
16:45:59.472384 IP (tos 0x10, ttl 64, id 9010, offset 0, flags [DF], proto TCP (6), length 52)
    jgriffin-ubuntu11-64bit.local.41788 > 192.168.1.18.2828: Flags [.], cksum 0x839c (incorrect -> 0x894b), ack 70, win 115, options [nop,nop,TS val 6081934 ecr 4294946003], length 0
Comment 13 Adam [:hobophobe] 2012-12-17 17:58:08 PST
(In reply to Jonathan Griffin (:jgriffin) from comment #0)
> On the panda
> only, attempting to connect to this will cause this error when Marionette
> tries to write to the socket:
> 
> E/GeckoConsole( 1243): [JavaScript Error: "[Exception... "Component returned
> failure code: 0x804b0010 (NS_ERROR_OFFLINE) [nsIOutputStream.write]" 
> nsresult: "0x804b0010 (NS_ERROR_OFFLINE)"  location: "JS frame ::
> chrome://global/content/devtools/dbg-transport.js :: DT_onOutputStreamReady
> :: line 94"  data: no]" {file:
> "chrome://global/content/devtools/dbg-transport.js" line: 94}]

The callback (|nsIOutputStreamCallback|) gets called both when the stream is ready and when it's closed, so this is surely being called for the closed case.

> To reproduce:  use a build from Dec 10 or newer, find the IP of the panda,

What's the significance of builds after 10 December?  That is, does that represent a possible regression range, or was something turned on then and turns out it's broken?

TCP dumps probably won't reveal anything here.  The bug is likely either we never came online or we went offline when we shouldn't've. 

(In reply to Honza Bambas (:mayhemer) from comment #9)
> My theory is that on that hardware/environment necko doesn't correctly
> detect offline status (link up status).

Not too familiar with Marionette, but it appears it doesn't rely on link status?  In testing/marionette/marionette-listener.js:61 (|registerSelf|):  

> Services.io.manageOfflineStatus = false;

I didn't see anywhere that it turns it back on.

When does the connect happen?  We start offline (though we go online pretty soon); could it be before we're online for this device?
Comment 14 Jonathan Griffin (:jgriffin) 2012-12-17 18:26:09 PST
(In reply to Adam [:hobophobe] from comment #13)
> (In reply to Jonathan Griffin (:jgriffin) from comment #0)
> > On the panda
> > only, attempting to connect to this will cause this error when Marionette
> > tries to write to the socket:
> > 
> > E/GeckoConsole( 1243): [JavaScript Error: "[Exception... "Component returned
> > failure code: 0x804b0010 (NS_ERROR_OFFLINE) [nsIOutputStream.write]" 
> > nsresult: "0x804b0010 (NS_ERROR_OFFLINE)"  location: "JS frame ::
> > chrome://global/content/devtools/dbg-transport.js :: DT_onOutputStreamReady
> > :: line 94"  data: no]" {file:
> > "chrome://global/content/devtools/dbg-transport.js" line: 94}]
> 
> The callback (|nsIOutputStreamCallback|) gets called both when the stream is
> ready and when it's closed, so this is surely being called for the closed
> case.
> 
> > To reproduce:  use a build from Dec 10 or newer, find the IP of the panda,
> 
> What's the significance of builds after 10 December?  That is, does that
> represent a possible regression range, or was something turned on then and
> turns out it's broken?
> 

Before Dec 10, Marionette wasn't functional on the pandas at all due to bug 800138.  The fix for that landed on Dec 10.

> TCP dumps probably won't reveal anything here.  The bug is likely either we
> never came online or we went offline when we shouldn't've. 
> 
> (In reply to Honza Bambas (:mayhemer) from comment #9)
> > My theory is that on that hardware/environment necko doesn't correctly
> > detect offline status (link up status).
> 
> Not too familiar with Marionette, but it appears it doesn't rely on link
> status?  In testing/marionette/marionette-listener.js:61 (|registerSelf|):  
> 
> > Services.io.manageOfflineStatus = false;
> 
> I didn't see anywhere that it turns it back on.

We never turn it back on, since it can interfere with tests; see bug 777145 where this change was made.

> 
> When does the connect happen?  We start offline (though we go online pretty
> soon); could it be before we're online for this device?

The problem can be reproduced using 'telnet |ip of panda board| 2828' at any time; it isn't the case that we're attempting this before the device is actually online.
Comment 15 Jonathan Griffin (:jgriffin) 2012-12-18 09:52:37 PST
One difference between the unagis and the pandas is that on the pandas we use only an ethernet connection, vs a wifi connection on unagis and other devices.
Comment 16 Adam [:hobophobe] 2012-12-18 16:58:27 PST
Looking at desktop, this is the flow leading to the issue:

0. [connection occurs]
1. netwerk/base/src/nsSocketTransport2.cpp:~1050 |nsSocketTransport::InitiateSocket| 
2. netwerk/base/src/nsSocketTransport2.cpp:~470 |nsSocketTransport::OnSocketReady| 
3. toolkit/devtools/debugger/dbg-transport.js:~90 |DebuggerTransport::onOutputStreamReady| 

In |InitiateSocket|, a check was added in the patch for bug 87717:

>     if (gIOService->IsOffline() &&
>         !PR_IsNetAddrType(&mNetAddr, PR_IpAddrLoopback))
>         return NS_ERROR_OFFLINE;

So, the failure is being set offline when we shouldn't be.  It may be that we're still using a managed connection when we shouldn't be, or something else may be setting us offline?

Maybe something causes gonk (dom/system/gonk/NetworkManager.js:~410 |setAndConfigureActive|) to set us offline?

Debugging the target for the offline management state (|nsIOService::SetManageOfflineStatus|) and the offline state (|nsIOService::SetOffline|) should find the cause.
Comment 17 JP Rosevear [:jpr] 2012-12-19 06:37:18 PST
Similar to bug 820517
I'm looking at http://dxr.mozilla.org/mozilla-central/dom/system/gonk/nsINetworkManager.idl.html and http://dxr.mozilla.org/mozilla-central/dom/system/gonk/NetworkManager.js.html

I don't see an ethernet type for connection, just wifi and various mobile type connections.  NETWORK_TYPE_WIFI = 0 in the idl, so I will take a wild guess and say that even the ethernet connection is typed as wifi, causing this issue.
Comment 18 JP Rosevear [:jpr] 2012-12-19 09:43:52 PST
Sorry, that was bug 821084
Comment 19 Armen Zambrano [:armenzg] - Engineering productivity 2012-12-20 10:39:09 PST
I would like to remove bb+ since this is not a bug that blocks us from shipping.
It would be awesome to fix it so we could hopefully have b2g Gaia UI tests but it not something necessarily to ship.

Please bb- this bug if you agree with this.
Comment 20 Andrew Overholt [:overholt] 2012-12-20 10:48:06 PST
(In reply to Armen Zambrano G. [:armenzg] from comment #19)
> I would like to remove bb+ since this is not a bug that blocks us from
> shipping.
> It would be awesome to fix it so we could hopefully have b2g Gaia UI tests
> but it not something necessarily to ship.
> 
> Please bb- this bug if you agree with this.

We're really like automated tests so let's keep it as a blocker for at least a bit longer.
Comment 21 Philipp von Weitershausen [:philikon] 2012-12-26 15:11:18 PST
(In reply to Adam [:hobophobe] from comment #16)
> Maybe something causes gonk (dom/system/gonk/NetworkManager.js:~410
> |setAndConfigureActive|) to set us offline?

Yup. In Gonk, not having Wifi or a cellular data connection will put Gecko into offline mode. Which then sadly results in non-localhost connections being denied (see also bug 821084 for a similar symptom for the same cause).

We have three options:

1. The "proper" fix:

 a. Either make Gecko's NetworkManager aware of Ethernet.

 b. Or get rid of the "offline" handling in NetworkManager, and instead implement nsINetworkLinkService for Gonk. This could facilitate information from NetworkManager, but ideally would just look at the system's network configuration.

 One of these would be the ideal solution, but will also require a good amount of engineering. So not something we'd want right now I'd say.

2. Special-case the Pandaboard. We can detect when we're running on a Pandaboard and simply always be "online" when running on that platform. The only caveat that I can see is that it will prevent us from writing tests that involve offline mode on those devices. But given the constraints with Gecko killing the Marionette control socket when going to offline mode (bug 821084), I'm not sure we could do this right now anyway.

3. Introduce a hook that would put Marionette in control. I imagine we introduce a simple pref, e.g. similar to "network.manage-offline-status" which tells Gecko whether it should control "offline" mode based on nsINetworkLinkService information or not. In our case, let's say "network.gonk.manage-offline-status" would always be true normally. Marionette could set it to false, thereby disabling the NetworkManager's "offline" handling. I like this option the best.

jgriffin, what do you think?
Comment 22 Philipp von Weitershausen [:philikon] 2012-12-26 18:17:15 PST
Created attachment 695896 [details] [diff] [review]
option #3 v1

Here's a simple patch for option #3. With this the error described in this bug no longer occurs.
Comment 23 Faramarz [:faramarz] 2012-12-27 11:36:59 PST
:jgriffin, are you available for a review?  anything sitting longer than a day for review should be reviewed by someone else.
Comment 24 Jonathan Griffin (:jgriffin) 2012-12-28 10:04:42 PST
I can review today.
Comment 25 Jonathan Griffin (:jgriffin) 2012-12-28 15:31:52 PST
Comment on attachment 695896 [details] [diff] [review]
option #3 v1

Review of attachment 695896 [details] [diff] [review]:
-----------------------------------------------------------------

Nice, this works great on the panda, and also fixes the problem on the unagi described in bug 821084.
Comment 26 Jonathan Griffin (:jgriffin) 2012-12-28 15:34:09 PST
Pushed to try to make sure this doesn't break any of the existing emulator tests:

https://tbpl.mozilla.org/?tree=Try&rev=e9985870526c
Comment 27 Philipp von Weitershausen [:philikon] 2012-12-29 10:15:05 PST
https://hg.mozilla.org/integration/mozilla-inbound/rev/14a22134efd2
Comment 28 Philipp von Weitershausen [:philikon] 2012-12-29 10:16:16 PST
*** Bug 821084 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.