Closed Bug 898074 Opened 8 years ago Closed 3 years ago

Frequent "raise TimeoutException(message='socket.timeout', status=ErrorCodes.TIMEOUT, stacktrace=None)" during WebAPI tests

Categories: Firefox OS Graveyard :: General (defect)
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: (Reporter: jgriffin, Unassigned)
Keywords: intermittent-failure
Whiteboard: [hide-me: 8/2/13][leave open]
Attachments: (5 files)

Frequently during WebAPI tests, the Marionette socket is closed on the B2G side for some unknown reason, resulting in a test timeout.  This occurs randomly during any test; it's not test-specific.  It also does not occur on desktop Marionette test runs.

In the log, we see:

21:34:10     INFO -  I/Gecko   (   45): Handler function DebuggerTransport instance's this.hooks.onPacket threw an exception: TypeError: self.hooks is null
21:34:10     INFO -  I/Gecko   (   45): Call stack:
21:34:10     INFO -  I/Gecko   (   45): DT__processIncoming/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/server/transport.js:183
21:34:10     INFO -  I/Gecko   (   45): makeInfallible/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/DevToolsUtils.js:61
21:34:10     INFO -  I/Gecko   (   45):
21:34:10     INFO -  I/Gecko   (   45): 1374726521776	Marionette	TRACE	Connection closed.

These lines are from the debugger transport layer that Marionette uses, but this doesn't appear to be a bug in the debugger itself; rather, it seems that something is interrupting Marionette's socket randomly during tests.

Example logs:

https://tbpl.mozilla.org/php/getParsedLog.php?id=25639819&tree=Birch&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=25698886&tree=Birch&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=25721350&tree=Birch&full=1

This problem is causing the failure rate of WebAPI tests to exceed 10%, which is grounds for hiding the tests in TBPL.  I've asked the sheriffs to give engineering a chance to debug this before they do.  Can we get someone to look at this?

If this isn't fixed in a reasonable timeframe, the tests will likely be hidden, at which point they will not prevent bad commits from landing.

This occurs on both mozilla-b2g18 and mozilla-central.
Looking at the above stack trace, it's pretty odd.  It means that this.hooks is null *inside* _processIncoming.  Panos, how is that possible?
Flags: needinfo?(past)
(In reply to Jonathan Griffin (:jgriffin) from comment #1)
> Looking at the above stack trace, it's pretty odd.  It means that this.hooks
> is null *inside* _processIncoming.  Panos, how is that possible?

Actually it looks like Jim Blandy or Dave Camp own this file.
Flags: needinfo?(past) → needinfo?(dcamp)
Could be related to bug 881101?
Yeah, bug 881101 nulled out hooks on shutdown, but it looks like there still might be an outstanding event that tries to access them.

Should be safe to wrap that with if (self.hooks)
Flags: needinfo?(dcamp)
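
For reference, a minimal sketch of the guard being suggested, written against the shape of the onPacket dispatch in transport.js (variable names follow that file; this is an illustration of the idea, not the attached patch):

let self = this;
Services.tm.currentThread.dispatch(makeInfallible(function() {
  // Guard against the transport having been closed (and this.hooks nulled)
  // before this dispatched callback gets a chance to run.
  if (self.hooks) {
    self.hooks.onPacket(parsed);
  }
}, "DebuggerTransport instance's this.hooks.onPacket"), 0);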
I'll run this on try and see how it affects this issue.
Assignee: nobody → jgriffin
pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=084e6972ee75

I'll retrigger the Mn tests like 20 times on this push to attempt to reproduce the problem.  If those are all green, then it's very likely this patch solves it.
(In reply to Jonathan Griffin (:jgriffin) from comment #6)
> pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=084e6972ee75
> 
> I'll retrigger the Mn tests like 20 times on this push to attempt to
> reproduce the problem.  If those are all green, then it's very likely this
> patch solves it.

This patch doesn't help; the same problem occurs, but without the error about "self.hooks" being null in the log.  *something* is causing our socket to close, but I can't tell what.
Assignee: jgriffin → nobody
A log with more logging enabled:

https://tbpl.mozilla.org/php/getParsedLog.php?id=25887988&tree=Cedar&full=1#error1

For some reason, onStopRequest is getting called while a connection should still be active:

http://mxr.mozilla.org/mozilla-central/source/toolkit/devtools/server/transport.js#126
Tweaking summary so that TBPL can suggest this, now that bug 892958 has landed.
Depends on: 892958
Summary: Frequent TimeoutException(message='socket.timeout', status=ErrorCodes.TIMEOUT, stacktrace=None) during WebAPI tests → Frequent "raise TimeoutException(message='socket.timeout', status=ErrorCodes.TIMEOUT, stacktrace=None)" during WebAPI tests
https://tbpl.mozilla.org/php/getParsedLog.php?id=25970588&tree=Mozilla-Central

(Have now regenerated on the above log, so the bugscache will suggest this next time).
FYI, I just posted to dev-b2g that I will be hiding this suite by default on Friday due to the ongoing high failure rate.
Whiteboard: [hide-me: 8/2/13]
Jonas, your last post to the dev-b2g thread discussed the lack of testing in B2G. Can you please help find an owner on this so that the situation doesn't erode more?
Flags: needinfo?(jonas)
Some observations about this problem:

- we never see it during the first Marionette connection, only in subsequent ones (after the original was closed)
- we can run endurance tests for 12 hours or more that use only one connection, and not see this, but it will likely occur during a run of just a few minutes if we are closing and starting new sessions (meaning we close the connection, and then open a new one)
- we don't see this on desktop Firefox, just B2G
I don't have a b2g build handy to debug this, but have you tried getting a stack trace from nsSocketTransport::CloseWithStatus?
(In reply to TinderboxPushlog Robot from comment #15)
Recent changes to the TBPL log parser allow us to star this bug directly now instead of filing a new bug for each test that hits this. Apologies in advance for the spam.
I added some logging to nsSocketInputStream::CloseWithStatus, and see that this is called with the same parameters for expected closures as well as the unexpected one that's breaking the WebAPI tests.

This is an expected closure:

19:19:17     INFO -  I/Gecko   (   45): 1375323099902	Marionette	INFO	sendToClient: {"from":"0","ok":true}, {2610f9cb-2647-4777-849f-c3f72f644401}, {2610f9cb-2647-4777-849f-c3f72f644401}
19:19:17     INFO -  I/Gecko:Netwerk(   45): nsSocketInputStream::CloseWithStatus [this=0x4466beb0 reason=0]
19:19:17     INFO -  I/Gecko:Netwerk(   45): nsSocketInputStream::CloseWithStatus [this=0x4466beb0 reason=80470002]
19:19:17     INFO -  I/Gecko   (   45): 1375323099939	Marionette	INFO	sendToClient: {"from":"0","ok":true}, {54bfd6fb-5a27-462f-aba3-d88fca1ef2b5}, {54bfd6fb-5a27-462f-aba3-d88fca1ef2b5}
19:19:17     INFO -  I/Gecko   (   45): 1375323099954	Marionette	TRACE	Connection closed.

And this is the one that breaks the tests:

19:19:17     INFO -  I/Gecko   (   45): 1375323100246	Marionette	DEBUG	accepted connection on 127.0.0.1:36888
19:19:17     INFO -  I/Gecko:Netwerk(   45): nsSocketInputStream::CloseWithStatus [this=0x4466beb0 reason=0]
19:19:17     INFO -  I/Gecko:Netwerk(   45): nsSocketInputStream::CloseWithStatus [this=0x4466beb0 reason=80470002]
19:19:17     INFO -  I/Gecko   (   45): 1375323280446	Marionette	TRACE	Got: {"to": "root", "type": "getMarionetteID"}
19:19:17     INFO -  I/Gecko   (   45): 1375323280460	Marionette	INFO	sendToClient: {"from":"0","ok":true}, {552cfcbc-424d-4d54-86c0-5bd75f317ff3}, {552cfcbc-424d-4d54-86c0-5bd75f317ff3}
19:19:17     INFO -  I/Gecko   (   45): Handler function DebuggerTransport instance's this.hooks.onPacket threw an exception: TypeError: self.hooks is null
19:19:17     INFO -  I/Gecko   (   45): Call stack:
19:19:17     INFO -  I/Gecko   (   45): DT__processIncoming/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/server/transport.js:183
19:19:17     INFO -  I/Gecko   (   45): makeInfallible/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/DevToolsUtils.js:61
19:19:17     INFO -  I/Gecko   (   45):

full log:  https://tbpl.mozilla.org/php/getParsedLog.php?id=26001089&tree=Try&full=1
So from looking at the log, it appears that there is a race or logic error. I'm making some leaps here, since I don't really understand the code.

It looks like onStopRequest (speculation on my part; https://mxr.mozilla.org/mozilla-central/source/toolkit/devtools/server/transport.js#126) gets called, closes the socket, and sets this.hooks to null.

It appears that just before onStopRequest was called, a new connection came along with some data. But since onStopRequest set the hooks to null, it causes an exception.

So it seems like there are two sessions which overlap, but the code assumes that there can only be one.

I think that there are 2 ways to deal with this:

1 - we shouldn't be calling accept while the socket is open (there is a listen backlog that may help here as well). I'm familiar with the underlying listen/accept/connect stuff, but not with how these are being used in Marionette. If the server can only deal with one connection at a time, then it shouldn't call listen with a backlog of 1, and shouldn't call accept until the previously accepted socket has been closed. The client should detect ECONNREFUSED and retry.

2 - We need to modify the .js code to deal with multiple concurrent connections and not use globals.
(In reply to Dave Hylands [:dhylands] from comment #21)

> 1 - we shouldn't be calling accept while the socket is open (there is a
> listen backlog that may help here as well).

As far as I can tell, we're not.  The log shows:

19:19:17     INFO -  I/Gecko:Netwerk(   45): nsSocketInputStream::CloseWithStatus [this=0x4466beb0 reason=0]
19:19:17     INFO -  I/Gecko:Netwerk(   45): nsSocketInputStream::CloseWithStatus [this=0x4466beb0 reason=80470002]
19:19:17     INFO -  I/Gecko   (   45): 1375323099939	Marionette	INFO	sendToClient: {"from":"0","ok":true}, {54bfd6fb-5a27-462f-aba3-d88fca1ef2b5}, {54bfd6fb-5a27-462f-aba3-d88fca1ef2b5}
19:19:17     INFO -  I/Gecko   (   45): 1375323099954	Marionette	TRACE	Connection closed.
19:19:17     INFO -  I/Gecko   (   45): 1375323100246	Marionette	DEBUG	accepted connection on 127.0.0.1:36888
19:19:17     INFO -  I/Gecko:Netwerk(   45): nsSocketInputStream::CloseWithStatus [this=0x4466beb0 reason=0]
19:19:17     INFO -  I/Gecko:Netwerk(   45): nsSocketInputStream::CloseWithStatus [this=0x4466beb0 reason=80470002]
19:19:17     INFO -  I/Gecko   (   45): 1375323280446	Marionette	TRACE	Got: {"to": "root", "type": "getMarionetteID"}
19:19:17     INFO -  I/Gecko   (   45): 1375323280460	Marionette	INFO	sendToClient: {"from":"0","ok":true}, {552cfcbc-424d-4d54-86c0-5bd75f317ff3}, {552cfcbc-424d-4d54-86c0-5bd75f317ff3}
19:19:17     INFO -  I/Gecko   (   45): Handler function DebuggerTransport instance's this.hooks.onPacket threw an exception: TypeError: self.hooks is null
19:19:17     INFO -  I/Gecko   (   45): Call stack:
19:19:17     INFO -  I/Gecko   (   45): DT__processIncoming/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/server/transport.js:183
19:19:17     INFO -  I/Gecko   (   45): makeInfallible/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/DevToolsUtils.js:61
19:19:17     INFO -  I/Gecko   (   45):

i.e., "Connection closed", followed by "accepted connection".

> It appears that just before onStopRequest was called, a new connection came
> along with some data. But since onStopRequest set the hooks to null, it
> causes an exception.

In this log yes, and this is the most common pattern, but we've also seen cases where this problem occurs several seconds into the next test.

> then it shouldn't
> call listen with a backlog of 1, and shouldn't call accept until the previously
> accepted socket has been closed. The client should detect ECONNREFUSED and retry.

I'll set the backlog to 0 and see what happens.

> 
> 2 - We need to modify the .js code to deal with multiple concurrent
> connections and not use globals.

It's more than just an issue of globals that prevents us from doing it; there are issues of focus, etc., that prevent multiple Marionette sessions from running concurrently.  I don't think this will change.
So, something I noticed in transport.js:

Near the end of DT__processIncoming, it dispatches a function to deliver the packet via onPacket.

However, the call to DT_close appears to happen inline. So even though DT_close is called after DT__processIncoming, it's possible that the onPacket callback won't run before DT_close does; the close should be dispatched the same way as the onPacket call, to ensure that they happen in the right order.
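
A rough sketch of that suggestion, reusing the dispatch pattern the transport already uses for onPacket (this is an illustration, not the patch attached later):

let self = this;
// Queue the teardown on the same event loop used for the onPacket dispatch,
// so it cannot run ahead of a packet callback that was queued before it.
Services.tm.currentThread.dispatch(makeInfallible(function() {
  self.close();  // close() is what ends up setting this.hooks to null
}, "DebuggerTransport.prototype.onStopRequest"), 0);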
Changing the backlog to 0 showed no occurrences of this problem in 25 retriggers.  That's not a guarantee it's fixed, but it's pretty good evidence.  I didn't have to make any changes to the client (i.e., we are not seeing any connection-refused errors), so I don't understand how or why this fixes the problem, but it would be great if it does.
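
In terms of the standard nsIServerSocket API, the change amounts to passing 0 as the backlog argument when Marionette opens its listening port. A sketch (illustrative only, not the literal patch; the listener variable is hypothetical):

// Open the Marionette listening socket with a backlog of 0 so the kernel
// does not queue additional pending connections behind the active one.
let serverSocket = Components.classes["@mozilla.org/network/server-socket;1"]
                             .createInstance(Components.interfaces.nsIServerSocket);
serverSocket.init(port, true /* loopback only */, 0 /* backlog */);
serverSocket.asyncListen(connectionListener);  // connectionListener: an nsIServerSocketListener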
Attachment #784523 - Flags: review?(dhylands)
Assignee: nobody → jgriffin
Comment on attachment 784523 [details] [diff] [review]
Open Marionette's port with a backlog of 0,

Review of attachment 784523 [details] [diff] [review]:
-----------------------------------------------------------------

I don't see how this can make things any worse, but I'm still dubious that this is the entire problem.
Attachment #784523 - Flags: review?(dhylands) → review+
(In reply to Dave Hylands [:dhylands] from comment #29)
> Comment on attachment 784523 [details] [diff] [review]
> Open Marionette's port with a backlog of 0,
> 
> Review of attachment 784523 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I don't see how this can make things any worse, but I'm still dubious that
> this is the entire problem.

I agree.  I'll land and we'll see what happens.
(In reply to Dave Hylands [:dhylands] from comment #27)
> So something I noticed in transport.js
> 
> Near the end of DT__processIncoming, it dispatches a function to process the
> onPacket
> 
> However, the call to DT_close appears to happen inline. So even though
> DT_close is called after DT__processIncoming, it's possible that the
> onPacket callback won't happen before the DT_close, so the close should
> dispatched the same way as the onPacket call to ensure that they happen in
> the right order.

Actually, it looks like the onStopRequest function is what needs to be dispatched (since this is what sets hooks to null)
Depends on: 900670
https://hg.mozilla.org/mozilla-central/rev/9d9856cf1648

Leaving open for now.
Whiteboard: [hide-me: 8/2/13] → [hide-me: 8/2/13][leave open]
I'm also seeing the following from my logs (it doesn't trigger a failure)

01 TRACE	DT_onDataAvailable
02 TRACE	Got: {"to": "0", "session": "6-b2g", "type": "deleteSession"}
03 TRACE	dispatched Got
04 INFO	sendToClient: {"from":"0","ok":true}, {1b22a6be-f34b-4c99-80fb-17780f147ca1}, {1b22a6be-f34b-4c99-80fb-17780f147ca1}
05 TRACE	DT_send
06 TRACE	DT_flushOutgoing
07 TRACE	DT_onOutputStreamReady
08 TRACE	DT_flushOutgoing
09 TRACE	DT_onStopRequest
10 TRACE	DT_close
11 INFO	sendToClient: {"from":"0","ok":true}, {730fe457-35a3-49ac-94f9-d86dd9a31570}, {730fe457-35a3-49ac-94f9-d86dd9a31570}
12 TRACE	DT_send
13 TRACE	DT_flushOutgoing
14 TRACE	DT_onStartRequest: setting this.hooks to null
15 TRACE	DT_onOutputStreamReady
16 TRACE	Connection closed.
17 TRACE	DT_close

What looks peculiar is that we call sendToClient (line 11) AFTER calling close (line 10). That also seems wrong.
That is weird, I'll see where that message is originating from.
(In reply to Jonathan Griffin (:jgriffin) from comment #35)
> That is weird, I'll see where that message is originating from.

We're actually calling deleteSession twice, when we should be calling it once.  Still debugging...
I'm not sure if this is related to this problem, but it's good to do anyway.  Pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=29d57c6719d2
Attachment #784655 - Flags: review?(dhylands)
Comment on attachment 784655 [details] [diff] [review]
Don't send multiple responses to deleteSession,

Review of attachment 784655 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/marionette/marionette-server.js
@@ +1928,5 @@
>  
>        // if there is only 1 window left, delete the session
>        if (numOpenWindows === 1){
> +        this.sessionTearDown();
> +        this.sendOk(command_id);

Aren't these 2 lines really the same thing as this.deleteSession() ?
(In reply to Dave Hylands [:dhylands] from comment #38)
> Comment on attachment 784655 [details] [diff] [review]
> Don't send multiple responses to deleteSession,
> 
> Review of attachment 784655 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: testing/marionette/marionette-server.js
> @@ +1928,5 @@
> >  
> >        // if there is only 1 window left, delete the session
> >        if (numOpenWindows === 1){
> > +        this.sessionTearDown();
> > +        this.sendOk(command_id);
> 
> Aren't these 2 lines really the same thing as this.deleteSession() ?

Yes, except that we're now using the correct command_id for sendOk, rather than (incorrectly) generating a new one via deleteSession.
Attachment #784655 - Flags: review?(dhylands) → review+
This is how I think onStopRequest should be coded.

I did a try run to verify that the change didn't break anything.
https://tbpl.mozilla.org/?tree=Try&rev=4b30802389ef
Attachment #784737 - Flags: review?(jgriffin)
Comment on attachment 784737 [details] [diff] [review]
Dispatches close so that it happens after onPacket

I'll pass this review to dcamp since he owns this file.
Attachment #784737 - Flags: review?(jgriffin) → review?(dcamp)
Well, that didn't fix things: I still see one failure (out of about 20 runs).
Adding just some dumpn statements seems to make it run better:
https://tbpl.mozilla.org/?tree=Try&rev=affee059699e

To me this says that there is a race happening somewhere, and the dump statements are just changing the timing slightly.
If a few dump statements are enough to fix, we could probably temporarily bandaid this by adding a 1s delay in the client between session open/close.
(In reply to Jonathan Griffin (:jgriffin) from comment #50)
> If a few dump statements are enough to fix, we could probably temporarily
> bandaid this by adding a 1s delay in the client between session open/close.

Well, I think that there are multiple failure points.

I have yet to see a failure in transport.js. I did eventually see a failure (after retrying about 40 times), but the failure I saw (and the ones I've seen in the past day from m-c and m-i) is this one:

20:24:53     INFO -  TEST-START test_battery_status_charging.js
20:27:53     INFO -  /home/cltbld/talos-slave/test/build/tests/marionette/tests/dom/battery/test/marionette/test_battery_status_charging.js, runTest (marionette_test.MarionetteJSTestCase) ... ERROR
20:27:53     INFO -  ======================================================================
20:27:53     INFO -  ERROR: /home/cltbld/talos-slave/test/build/tests/marionette/tests/dom/battery/test/marionette/test_battery_status_charging.js, runTest (marionette_test.MarionetteJSTestCase)
20:27:53     INFO -  ----------------------------------------------------------------------
20:27:53     INFO -  Traceback (most recent call last):
20:27:53     INFO -    File "/home/cltbld/talos-slave/test/build/tests/marionette/marionette/marionette_test.py", line 123, in run
20:27:53     INFO -      self.setUp()
20:27:53     INFO -    File "/home/cltbld/talos-slave/test/build/tests/marionette/marionette/marionette_test.py", line 236, in setUp
20:27:53     INFO -      self.marionette.start_session()
20:27:53     INFO -    File "/home/cltbld/talos-slave/test/build/tests/marionette/marionette/marionette.py", line 446, in start_session
20:27:53     INFO -      self.session = self._send_message('newSession', 'value')
20:27:53     INFO -    File "/home/cltbld/talos-slave/test/build/tests/marionette/marionette/marionette.py", line 342, in _send_message
20:27:53    ERROR -      raise TimeoutException(message='socket.timeout', status=ErrorCodes.TIMEOUT, stacktrace=None)
20:27:53    ERROR -  TEST-UNEXPECTED-FAIL | test_battery_status_charging.js | TimeoutException: socket.timeout

So I'm going to start looking at that code.
Comment on attachment 784737 [details] [diff] [review]
Dispatches close so that it happens after onPacket

Review of attachment 784737 [details] [diff] [review]:
-----------------------------------------------------------------

This should be ok.
Attachment #784737 - Flags: review?(dcamp) → review+
These last two failures are of a different pattern; here, after the initial error, all subsequent tests time out during newSession.  There may be some recent regression that's causing this; we'll have to closely examine future occurrences.
Duplicate of this bug: 894359
Given the number of reports that are still incoming, it seems like this issue still exists, right?

Did the landed patch reduce the number of failures?
(In reply to Jonas Sicking (:sicking) from comment #71)
> Given the number of reports that are still incoming, it seems like this
> issue still exists, right?
> 
> Did the landed patch reduce the number of failures?

The number of failures seems to have been reduced over the last 3 patches landed, but the most recent landing makes us fail harder than before.  I'm considering backing it out based on some try runs that are still ongoing.
The test_ril_code_quality.py failures show a different pattern of failure in the logs, so I've filed a separate bug, bug 902203.
(In reply to Jonathan Griffin (:jgriffin) from comment #50)
> If a few dump statements are enough to fix, we could probably temporarily
> bandaid this by adding a 1s delay in the client between session open/close.

I tried this on a tryserver run and found it did not improve the situation any.
(In reply to Phil Ringnalda (:philor) from comment #79)
> https://hg.mozilla.org/mozilla-central/rev/c92dc0497dfb

This change has made us fail harder, but doesn't seem to have reduced the overall failure rate.  See https://tbpl.mozilla.org/?tree=Try&rev=e7e0cbcdc3c5 (with this patch) and https://tbpl.mozilla.org/?tree=Try&rev=1b2825c0a155 (without it).  With this patch, every test after the first socket.timeout also dies with a socket.timeout.  Without this patch, tests after the socket.timeout recover.

I'm going to back this out.
Hi All,

:sicking asked me to look into this as well. My current gut feeling is that finishing one test and starting the next isn't happening serially as we expect, which is causing the server socket to be unexpectedly closed when we attempt to connect. I'm still trying to track down where all this is happening and could definitely use a few pointers. Thanks!
(In reply to Ghislain 'Aus' Lacroix from comment #100)
> Hi All,
> 
> :sicking asked me to look into this as well. My current gut feeling is that
> finishing one test and starting the next isn't happening serially as we
> expect, which is causing the server socket to be unexpectedly closed when we
> attempt to connect. I'm still trying to track down where all this is
> happening and could definitely use a few pointers. Thanks!

I made a try run in which I added a 1s delay between finishing one test and starting the next, and it didn't impact this problem at all.

I'd be happy to try a longer delay, but I don't think this is the problem, as we see the expected disconnect from the previous test when it ends.  The disconnects under normal operation are always triggered from the Python side, here:  http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/marionette_test.py#265.

The next connection is created here:  http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/marionette_test.py#236 and http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/client.py#85
Per request, here's a try job with a 10s delay between tearDown and the next setUp:  https://tbpl.mozilla.org/?tree=Try&rev=7dc1e47a92c2

I'll retrigger a bunch of times after the initial run.
(In reply to Jonathan Griffin (:jgriffin) from comment #114)
> Per request, here's a try job with a 10s delay between tearDown and the next
> setUp:  https://tbpl.mozilla.org/?tree=Try&rev=7dc1e47a92c2
> 
> I'll retrigger a bunch of times after the initial run.

The error still occurs here:  https://tbpl.mozilla.org/php/getParsedLog.php?id=26462634&tree=Try&full=1
I've pushed a new version to try with a little more instrumentation on the Python side; this should be enough to categorically determine whether the problem lies on the Python side or not:

https://tbpl.mozilla.org/?tree=Try&rev=3ecb104263ce
Output from the latest try run:

15:21:47     INFO -  test_that_we_can_pass_in_floats (test_execute_script.TestExecuteContent) ... Tue Aug 13 15:21:47 2013 sleeping 10s
15:21:57     INFO -  Tue Aug 13 15:21:57 2013 open the socket
15:21:58     INFO -  Tue Aug 13 15:21:58 2013 close the socket
15:21:58     INFO -  ok
15:21:58     INFO -  ======================================================================
15:21:58     INFO -  ERROR: test_complex_return_values (test_execute_script.TestExecuteContent)
15:21:58     INFO -  ----------------------------------------------------------------------
15:21:58     INFO -  Traceback (most recent call last):
15:21:58     INFO -    File "/home/cltbld/talos-slave/test/build/tests/marionette/marionette/marionette_test.py", line 123, in run
15:21:58     INFO -      self.setUp()
15:21:58     INFO -    File "/home/cltbld/talos-slave/test/build/tests/marionette/marionette/marionette_test.py", line 303, in setUp
15:21:58     INFO -      CommonTestCase.setUp(self)
15:21:58     INFO -    File "/home/cltbld/talos-slave/test/build/tests/marionette/marionette/marionette_test.py", line 238, in setUp
15:21:58     INFO -      self.marionette.start_session()
15:21:58     INFO -    File "/home/cltbld/talos-slave/test/build/tests/marionette/marionette/marionette.py", line 445, in start_session
15:21:58     INFO -      self.session = self._send_message('newSession', 'value')
15:21:58     INFO -    File "/home/cltbld/talos-slave/test/build/tests/marionette/marionette/marionette.py", line 341, in _send_message
15:21:58    ERROR -      raise TimeoutException(message='socket.timeout', status=ErrorCodes.TIMEOUT, stacktrace=None)
15:21:58    ERROR -  TEST-UNEXPECTED-FAIL | test_execute_script.py TestExecuteContent.test_complex_return_values | TimeoutException: socket.timeout
15:21:58     INFO -  ----------------------------------------------------------------------
15:21:58     INFO -  Ran 18 tests in 377.684s
15:21:58  WARNING -  FAILED (errors=1)
15:21:58     INFO -  TEST-START test_simpletest_fail.js
15:21:58     INFO -  /home/cltbld/talos-slave/test/build/tests/marionette/tests/testing/marionette/client/marionette/tests/unit/test_simpletest_fail.js, runTest (marionette_test.MarionetteJSTestCase) ... Tue Aug 13 15:21:58 2013 sleeping 10s
15:22:08     INFO -  Tue Aug 13 15:22:08 2013 open the socket
15:22:09     INFO -  Tue Aug 13 15:22:09 2013 close the socket
15:22:09     INFO -  ok

For the socket.timeout error, I see neither the "sleeping 10s" nor "open the socket" messages I would expect.  Will investigate further.
Actually, the Python output is exactly as I would expect.  Here's some more:

https://tbpl.mozilla.org/php/getParsedLog.php?id=26571555&tree=Try&full=1#error1

20:53:49     INFO -  test_emulator_cmd (test_emulator.TestEmulatorContent) ... started test <bound method TestEmulatorContent.test_emulator_cmd of <test_emulator.TestEmulatorContent testMethod=test_emulator_cmd>>
20:53:49     INFO -  calling setUp
20:53:49     INFO -  Wed Aug 14 20:53:49 2013 sleeping 10s
20:53:59     INFO -  Wed Aug 14 20:53:59 2013 open the socket
20:53:59     INFO -  calling testMethod
20:53:59     INFO -  tearDown
20:54:00     INFO -  cleanTest
20:54:00     INFO -  Wed Aug 14 20:54:00 2013 close the socket
20:54:00     INFO -  ok
20:54:00     INFO -  test_emulator_order (test_emulator.TestEmulatorContent) ... started test <bound method TestEmulatorContent.test_emulator_order of <test_emulator.TestEmulatorContent testMethod=test_emulator_order>>
20:54:00     INFO -  calling setUp
20:54:00     INFO -  Wed Aug 14 20:54:00 2013 sleeping 10s
20:54:10     INFO -  Wed Aug 14 20:54:10 2013 open the socket
20:57:10     INFO -  Wed Aug 14 20:57:10 2013 close the socket
20:57:10     INFO -  ERROR
20:57:10     INFO -  cleanTest
20:57:11     INFO -  test_emulator_orientation (test_emulator.TestEmulatorScreen) ... started test <bound method TestEmulatorScreen.test_emulator_orientation of <test_emulator.TestEmulatorScreen testMethod=test_emulator_orientation>>
20:57:11     INFO -  calling setUp
20:57:11     INFO -  Wed Aug 14 20:57:11 2013 sleeping 10s
20:57:21     INFO -  Wed Aug 14 20:57:21 2013 open the socket
20:57:21     INFO -  calling testMethod
20:57:21     INFO -  tearDown
20:57:21     INFO -  cleanTest
20:57:21     INFO -  Wed Aug 14 20:57:21 2013 close the socket
20:57:21     INFO -  ok

The tests are not being interleaved, the socket is not being closed unexpectedly from the Python side, and putting a 10s delay between socket close and open does not impact this failure.  So I'm back to thinking something funny is happening on the gecko side.
The gecko log which corresponds to the above is:

21:46:27     INFO -  08-14 23:54:10.570    45    45 I Gecko   : 1376538850574	Marionette	DEBUG	accepted connection on 127.0.0.1:38981
21:46:27     INFO -  08-14 23:54:10.620    45    45 I Gecko   : 1376538850622	Marionette	TRACE	Got: {"to": "root", "type": "getMarionetteID"}
21:46:27     INFO -  08-14 23:57:10.801    45    45 I Gecko   : 1376539030795	Marionette	TRACE	Got: {"to": "0", "type": "newSession"}
21:46:27     INFO -  08-14 23:57:10.811    45    45 I Gecko   : Handler function DebuggerTransport instance's this.hooks.onPacket threw an exception: TypeError: self.hooks is null
21:46:27     INFO -  08-14 23:57:10.811    45    45 I Gecko   : Call stack:
21:46:27     INFO -  08-14 23:57:10.811    45    45 I Gecko   : DT__processIncoming/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/server/transport.js:183
21:46:27     INFO -  08-14 23:57:10.811    45    45 I Gecko   : makeInfallible/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/DevToolsUtils.js:61
21:46:27     INFO -  08-14 23:57:10.811    45    45 I Gecko   :

This shows something odd; the "newSession" command from the Python side isn't being logged by the debugger transport until the Python side reports a socket.timeout; it should be logged immediately after the socket is opened.

It looks like the message is getting stuck somewhere in the transport layer, and not delivered to Marionette.

:dhylands, does this suggest anything?
I added some more debugging statements, and ran another try job:

https://tbpl.mozilla.org/?tree=Try&rev=54f767ba6e9a

The result seems to show that the problem is somewhere on the gecko/debugger transport side.

On the Python side:

19:30:05     INFO -  sock.send 74:{"to": "0", "session": "6-b2g", "type": "setScriptTimeout", "value": 1000}
19:30:05     INFO -  received {u'ok': True, u'from': u'0'}
19:30:05     INFO -  calling testMethod
19:30:05     INFO -  sock.send 74:{"to": "0", "session": "6-b2g", "type": "setScriptTimeout", "value": 5000}
19:34:05     INFO -  Tue Aug 20 19:34:05 2013 close the socket

Note 4 minutes elapse after sending the request before the socket.timeout. 

On the gecko side:

08-20 22:30:05.763    45    45 I Gecko   : 1377052205763	Marionette	TRACE	incoming: 74:{"to": "0", "session": "6-b2g", "type": "setScriptTimeout", "value": 1000}
08-20 22:30:05.773    45    45 I Gecko   : 1377052205771	Marionette	TRACE	Got: {"to": "0", "session": "6-b2g", "type": "setScriptTimeout", "value": 1000}
08-20 22:30:05.783    45    45 I Gecko   : 1377052205779	Marionette	TRACE	Incomplete request, no sep:
08-20 22:30:05.792    45    45 I Gecko   : 1377052205790	Marionette	INFO	sendToClient: {"from":"0","ok":true}, {e18dc090-4922-4ebd-a5b6-1868df4b2c06}, {e18dc090-4922-4ebd-a5b6-1868df4b2c06}
08-20 22:34:05.952    45    45 I Gecko   : 1377052445946	Marionette	TRACE	calling readInputStreamToString
08-20 22:34:05.962    45    45 I Gecko   : 1377052445954	Marionette	TRACE	incoming: 74:{"to": "0", "session": "6-b2g", "type": "setScriptTimeout", "value": 5000}
08-20 22:34:05.962    45    45 I Gecko   : 1377052445964	Marionette	TRACE	Got: {"to": "0", "session": "6-b2g", "type": "setScriptTimeout", "value": 5000}
08-20 22:34:05.982    45    45 I Gecko   : 1377052445973	Marionette	TRACE	Incomplete request, no sep:
08-20 22:34:05.992    45    45 I Gecko   : Handler function DebuggerTransport instance's this.hooks.onPacket threw an exception: TypeError: self.hooks is null
08-20 22:34:05.992    45    45 I Gecko   : Call stack:
08-20 22:34:05.992    45    45 I Gecko   : DT__processIncoming/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/server/transport.js:187
08-20 22:34:05.992    45    45 I Gecko   : makeInfallible/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/DevToolsUtils.js:61

The last "setScriptTimeout" message from the client is not read from the socket in gecko until the socket is closed, 4 minutes after the client sent it.  Somehow, the data is getting stuck somewhere in the socket on the gecko/B2G side.

Given that we do not see this problem on b2g desktop builds (or desktop Firefox), I wonder if this is some low-level problem with the emulator.

Dave, passing this back to you.
Flags: needinfo?(dhylands)
So, if I had to guess, I'd say that there is probably a bug in the network driver being used by the emulator. Unfortunately, the emulator is really old (2.6.29).

If I'm understanding things, the client is sending a command to the server (setScriptTimeout) and waiting for a response, and the response never comes.

Can we introduce the notion of a NO-OP command?

Then instead of waiting for the longer time period, we could wait for a shorter period of time (say 10-30 seconds), and if we don't get a response, then send a NO-OP.

The receiving end would just ignore the NO-OP commands.

Then, if it is just the driver getting stuck, the NO-OP may be enough to wake it up and cause the data to be released to the other side.
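
To make that concrete, a hypothetical server-side handler in marionette-server.js could look like the sketch below; the method name and registration are assumptions, and the client side would send {"to": "0", "type": "noOp"} whenever it has waited 10-30 seconds without a reply:

// Hypothetical keep-alive command. It carries no payload and expects no
// reply; it exists only to poke the transport so that any data stuck in the
// emulator's network driver gets flushed through to this side.
noOp: function MDA_noOp(aRequest) {
  // Intentionally empty: the receiving end just ignores NO-OP commands.
},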
Yes, I can implement a hacky POC of this and see if it resolves the problem, and if it does, make a more robust implementation.
(In reply to Dave Hylands [:dhylands] from comment #202)
> So, if I had to guess, I'd say that there is probably a bug in the network
> driver being used by the emulator. Unfortunately, the emulator is really old
> (2.6.29)

Could we use a newer emulator? :-)
We're actually moving to JB emulators for 1.2, but I'm not sure how long we'll have to run tests on the "old" ones.
pushed to try a patch which implements the no-op idea from comment #202: 

https://tbpl.mozilla.org/?tree=Try&rev=29afa794f577
(In reply to Jonathan Griffin (:jgriffin) from comment #213)
> pushed to try a patch which implements the no-op idea from comment #202: 
> 
> https://tbpl.mozilla.org/?tree=Try&rev=29afa794f577

Pushed a new version to try; it fixes a problem in the previous version where we were sometimes writing data to the socket too quickly, resulting in "[Errno 11] Resource temporarily unavailable".

https://tbpl.mozilla.org/?tree=Try&rev=98036720855d
(In reply to Jonathan Griffin (:jgriffin) from comment #203)
> Yes, I can implement a hacky POC of this and see if it resolves the problem,
> and if it does, make a more robust implementation.

This approach did not work.  https://tbpl.mozilla.org/php/getParsedLog.php?id=26987562&tree=Try&full=1#error1

16:48:33     INFO -  08-25 23:34:02.098    45    45 I Gecko   : 1377473642095	Marionette	INFO	sendToClient: {"from":"0","ok":true}, {7514bbc0-f06d-45f3-bbf9-bc30065a476d}, {7514bbc0-f06d-45f3-bbf9-bc30065a476d}
16:48:33     INFO -  08-25 23:38:02.888    45    45 I Gecko   : 1377473882880	Marionette	DEBUG	accepted connection on 127.0.0.1:38969
16:48:33     INFO -  08-25 23:38:02.908    45    45 I Gecko   : 1377473882909	Marionette	TRACE	Got: {"to": "0", "type": "noOp"}
16:48:33     INFO -  08-25 23:38:02.918    45    45 I Gecko   : 1377473882916	Marionette	TRACE	Got: {"to": "0", "type": "noOp"}
16:48:33     INFO -  08-25 23:38:02.928    45    45 I Gecko   : 1377473882924	Marionette	TRACE	Got: {"to": "0", "type": "noOp"}
16:48:33     INFO -  08-25 23:38:02.938    45    45 I Gecko   : 1377473882937	Marionette	TRACE	Got: {"to": "0", "type": "noOp"}
16:48:33     INFO -  08-25 23:38:02.948    45    45 I Gecko   : 1377473882947	Marionette	TRACE	Got: {"to": "0", "type": "noOp"}
16:48:33     INFO -  08-25 23:38:02.958    45    45 I Gecko   : 1377473882956	Marionette	TRACE	Got: {"to": "0", "type": "noOp"}
16:48:33     INFO -  08-25 23:38:02.968    45    45 I Gecko   : 1377473882965	Marionette	TRACE	Got: {"to": "0", "type": "noOp"}
16:48:33     INFO -  08-25 23:38:02.978    45    45 I Gecko   : 1377473882976	Marionette	TRACE	Got: {"to": "0", "type": "noOp"}
16:48:33     INFO -  08-25 23:38:02.978    45    45 I Gecko   : Handler function DebuggerTransport.prototype.onStopRequest threw an exception: TypeError: this.curBrowser is null
16:48:33     INFO -  08-25 23:38:02.978    45    45 I Gecko   : Stack: MDA_switchToGlobalMM@chrome://marionette/content/marionette-server.js:181
16:48:33     INFO -  08-25 23:38:02.978    45    45 I Gecko   : MDA_sessionTearDown@chrome://marionette/content/marionette-server.js:1927
16:48:33     INFO -  08-25 23:38:02.978    45    45 I Gecko   : MSC_onClosed@chrome://marionette/content/marionette-server.js:166
16:48:33     INFO -  08-25 23:38:02.978    45    45 I Gecko   : DT_onStopRequest@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/server/transport.js:130
16:48:33     INFO -  08-25 23:38:02.978    45    45 I Gecko   : makeInfallible/<@chrome://marionette/content/marionette-server.js -> resource://gre/modules/devtools/DevToolsUtils.js:72

The 'newSession' command, in this case, just disappeared.  The server received the subsequent 'noOp' commands.

Another approach we could try is to re-issue commands that die due to a socket.timeout, but this is starting to get pretty hacky.  :(
> Another approach we could try is to re-issue commands that die due to a 
> socket.timeout, but this is starting to get pretty hacky.  :(

This won't work either.  As shown in the above log, all the noOp commands (which were sent at 30s intervals) were not received until the socket was closed, so repeating failed commands will not be effective.

We could potentially close the socket and open a new one and try again, but some state would be lost, and this would lead to confusing errors when it failed.

I don't think we can really fix this from the client side.
I guess the next thing to try is to have the server side periodically poll.

Right now, it seems to be relying entirely on a data available event. If that event gets missed then that might explain things.

In low-level terms, I'd do this by using a select with a timeout, and do a non-blocking read each timeout. The read should normally just return zero bytes. This uses a bit more CPU, but is definitely more robust.

I'm not sure how this translates into the marionette framework, but that seems like the next direction to pursue.
Flags: needinfo?(dhylands)
(In reply to Dave Hylands [:dhylands] from comment #233)
> I guess the next thing to try is to have the server side periodically poll.
> 
> Right now, it seems to be relying entirely on a data available event. If
> that event gets missed then that might explain things.
> 
> In low-level terms, I'd do this by using a select with a timeout, and do a
> non-blocking read each timeout. The read should normally just return zero
> bytes. This uses a bit more CPU, but is definitely more robust.
> 
> I'm not sure how this translates into the marionette framework, but that
> seems like the next direction to pursue.

There is no select capability for xpcom sockets; even the lowest level API available, https://developer.mozilla.org/en-US/docs/Web/API/TCPSocket, uses event handlers.  It doesn't look like the select/read pattern is possible from xpcom JS.