Closed Bug 1362293 Opened 5 years ago Closed 4 years ago

Intermittent IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)

Categories

(Testing :: Marionette, defect, P3)

Version 3
defect

Tracking

(firefox57 fixed)

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: robert.strong.bugs, Assigned: whimboo)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

https://queue.taskcluster.net/v1/task/QhF7g-_ITg2FioQIg0SVjw/runs/0/artifacts/public/logs/live_backing.log

[task 2017-05-05T04:17:14.989160Z] 04:17:14     INFO - Calling ['/home/worker/workspace/build/venv/bin/python', '-u', '/home/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py', '--gecko-log=/home/worker/workspace/build/blobber_upload_dir/gecko.log', '--log-raw=-', '-vv', '--log-raw=/home/worker/workspace/build/blobber_upload_dir/marionette_raw.log', '--log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/marionette_errorsummary.log', '--log-html=/home/worker/workspace/build/blobber_upload_dir/report.html', '--binary=/home/worker/workspace/build/application/firefox/firefox', '--address=localhost:2828', '--symbols-path=https://queue.taskcluster.net/v1/task/WkysWBE5T7aqhYU4g63dIA/artifacts/public/build/target.crashreporter-symbols.zip', '/home/worker/workspace/build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit-tests.ini'] with output_timeout 1000
[task 2017-05-05T04:17:15.409099Z] 04:17:15     INFO - Using workspace for temporary data: "/home/worker/workspace"
[task 2017-05-05T04:17:15.411063Z] 04:17:15     INFO - mozversion application_buildid: 20170505035020
[task 2017-05-05T04:17:15.412802Z] 04:17:15     INFO - mozversion application_changeset: bd5e30c8066b3a077822c894024bd752e8e9792c
[task 2017-05-05T04:17:15.414355Z] 04:17:15     INFO - mozversion application_display_name: Nightly
[task 2017-05-05T04:17:15.415891Z] 04:17:15     INFO - mozversion application_id: {ec8030f7-c20a-464f-9b0e-13a3a9e97384}
[task 2017-05-05T04:17:15.417251Z] 04:17:15     INFO - mozversion application_name: Firefox
[task 2017-05-05T04:17:15.418834Z] 04:17:15     INFO - mozversion application_remotingname: firefox
[task 2017-05-05T04:17:15.420019Z] 04:17:15     INFO - mozversion application_repository: https://hg.mozilla.org/projects/oak
[task 2017-05-05T04:17:15.420578Z] 04:17:15     INFO - mozversion application_vendor: Mozilla
[task 2017-05-05T04:17:15.421303Z] 04:17:15     INFO - mozversion application_version: 55.0a1
[task 2017-05-05T04:17:15.421400Z] 04:17:15     INFO - mozversion platform_buildid: 20170505035020
[task 2017-05-05T04:17:15.422045Z] 04:17:15     INFO - mozversion platform_changeset: bd5e30c8066b3a077822c894024bd752e8e9792c
[task 2017-05-05T04:17:15.422652Z] 04:17:15     INFO - mozversion platform_repository: https://hg.mozilla.org/projects/oak
[task 2017-05-05T04:17:15.422795Z] 04:17:15     INFO - mozversion platform_version: 55.0a1
[task 2017-05-05T04:17:15.423438Z] 04:17:15     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -no-remote -marionette -profile /tmp/tmpr1nxNw.mozrunner
[task 2017-05-05T04:19:21.850904Z] 04:19:21    ERROR - Failure during harness execution
[task 2017-05-05T04:19:21.861006Z] 04:19:21    ERROR - Traceback (most recent call last):
[task 2017-05-05T04:19:21.861065Z] 04:19:21    ERROR - 
[task 2017-05-05T04:19:21.867418Z] 04:19:21    ERROR -   File "/home/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py", line 92, in cli
[task 2017-05-05T04:19:21.867513Z] 04:19:21    ERROR -     failed = harness_instance.run()
[task 2017-05-05T04:19:21.872985Z] 04:19:21    ERROR - 
[task 2017-05-05T04:19:21.873080Z] 04:19:21    ERROR -   File "/home/worker/workspace/build/tests/marionette/harness/marionette_harness/runtests.py", line 72, in run
[task 2017-05-05T04:19:21.873140Z] 04:19:21    ERROR -     runner.run_tests(tests)
[task 2017-05-05T04:19:21.873185Z] 04:19:21    ERROR - 
[task 2017-05-05T04:19:21.873275Z] 04:19:21    ERROR -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_harness/runner/base.py", line 818, in run_tests
[task 2017-05-05T04:19:21.873347Z] 04:19:21    ERROR -     self.marionette = self.driverclass(**self._build_kwargs())
[task 2017-05-05T04:19:21.873393Z] 04:19:21    ERROR - 
[task 2017-05-05T04:19:21.873770Z] 04:19:21    ERROR -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 611, in __init__
[task 2017-05-05T04:19:21.874316Z] 04:19:21    ERROR -     self.raise_for_port(timeout=startup_timeout)
[task 2017-05-05T04:19:21.874833Z] 04:19:21    ERROR - 
[task 2017-05-05T04:19:21.875342Z] 04:19:21    ERROR -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-05-05T04:19:21.875885Z] 04:19:21    ERROR -     m._handle_socket_failure()
[task 2017-05-05T04:19:21.876393Z] 04:19:21    ERROR - 
[task 2017-05-05T04:19:21.877010Z] 04:19:21    ERROR -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-05-05T04:19:21.877587Z] 04:19:21    ERROR -     return func(*args, **kwargs)
[task 2017-05-05T04:19:21.880531Z] 04:19:21    ERROR - 
[task 2017-05-05T04:19:21.881105Z] 04:19:21    ERROR -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 697, in raise_for_port
[task 2017-05-05T04:19:21.881668Z] 04:19:21    ERROR -     self.host, self.port))
[task 2017-05-05T04:19:21.882240Z] 04:19:21    ERROR - 
[task 2017-05-05T04:19:21.882884Z] 04:19:21    ERROR - IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)
[task 2017-05-05T04:19:21.883634Z] 04:19:21    ERROR - 
[task 2017-05-05T04:19:21.927997Z] 04:19:21    ERROR - Return code: 1
[task 2017-05-05T04:19:21.928795Z] 04:19:21    ERROR - No checks run.
Component: Firefox UI Tests → Marionette
QA Contact: hskupin
Robert, do you have the treeherder link for me? I would also need the gecko.log to examine the details. Thanks.
Flags: needinfo?(robert.strong.bugs)
Summary: IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!) → Intermittent IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)
Th raw log doesn't contain the output from the gecko process. So here it is:
https://public-artifacts.taskcluster.net/QhF7g-_ITg2FioQIg0SVjw/0/public/test_info//gecko.log

As we can see Marionette started successfully on port 2828 at 1493957903,384 (Fri, 05 May 2017 04:18:23 GMT). The above log tells that Firefox got started at 04:17:15, so it took a bit more than 1 minute to bring it up, and the observer notification "sessionstore-windows-restored" sent. 

The Python client timed out at 04:19:21, which is an additional minute. Given that it constantly tries on this port, I could only assume that there are more delays in Firefox itself. But hard to tell. Maybe a try build with enabled socket log might help if this problem is more often occurring in the future.
CC'ing Geoff because he is working on a couple of other investigations when we have a busted Firefox startup, like bug 1352671 or bug 1261598.
In the case of an ASAN build I see the following output of Marionette in the log:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=111383412&lineNumber=1204

[task 2017-07-01T21:34:32.594790Z] 21:34:32     INFO -  1498944872582	Marionette	INFO	Listening on port 2828
[task 2017-07-01T21:34:32.692007Z] 21:34:32     INFO -  JavaScript error: chrome://marionette/content/server.js, line 350: NS_ERROR_SOCKET_ADDRESS_IN_USE: Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]

This is strange because the `listening` output line should only be visible when the server got correctly started, and then even afterward. I wonder if there is a situation when we try to create the server twice. I will file a Marionette bug for that.
(In reply to Henrik Skupin (:whimboo) [partly available 07/10
-07/14] from comment #13)

> In the case of an ASAN build I see the following output of
> Marionette in the log:
> 
> https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=111383412&lineNumber=1204
> 
> [task 2017-07-01T21:34:32.594790Z] 21:34:32     INFO -  1498944872582 Marionette	INFO	Listening on port 2828
> [task 2017-07-01T21:34:32.692007Z] 21:34:32     INFO -  JavaScript error:
> chrome://marionette/content/server.js, line 350:
> NS_ERROR_SOCKET_ADDRESS_IN_USE: Component returned failure code: 0x804b0036
> (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]
> 
> This is strange because the `listening` output line should only
> be visible when the server got correctly started, and then even
> afterward. I wonder if there is a situation when we try to create
> the server twice. I will file a Marionette bug for that.

In

	https://searchfox.org/mozilla-central/rev/b7e531410ebc1c7d74a78e86a894e69608db318c/testing/marionette/components/marionette.js#257-280

there might, I suppose, be a race condition if

>   if (this.running || !this.enabled || !this.finalUIStartup) {

matches and this.running has not yet been set to true, which only
happens on :275.  Perhaps we need to set an initialisation flag
right at the beginning when this function is called to prevent it
from being called if e.g. we receive two calls to init() in close
succession?

There is also a possibility that we already have something listening
on port 2828.  Ideally we should use port 0 to have the system
atomically assign a free port number, and read the used port number
from marionette.port before attempting to connect.
Please see bug 1254136 for ongoing work about the double server socket start() call. Sorry for marking the wrong bug as dependency.
Depends on: 1254136
No longer depends on: 1378041
See Also: → 1382029
With bug 1254136 fixed we get the following observer notifications and correctly start the server:

[task 2017-08-17T14:53:48.609571Z] 14:53:48     INFO -  1502981628606	Marionette	DEBUG	Received observer notification "profile-after-change"
[task 2017-08-17T14:54:05.775492Z] 14:54:05     INFO -  1502981645767	Marionette	DEBUG	Received observer notification "command-line-startup"
[task 2017-08-17T14:54:05.776573Z] 14:54:05     INFO -  1502981645767	Marionette	INFO	Enabled via --marionette
[task 2017-08-17T14:55:16.219257Z] 14:55:16     INFO -  1502981716209	Marionette	DEBUG	Received observer notification "sessionstore-windows-restored"
[task 2017-08-17T14:55:21.041807Z] 14:55:21     INFO -  1502981721035	Marionette	INFO	Listening on port 2828

It means that whether the client is not constantly trying to connect, or it cannot establish a connection because it gets blocked. I would like to see some more failure reports especially for debug builds over the next days.
:whimboo- I see this reporting a medium frequency of failures last week and continuing on this week:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1362293&startday=2017-08-14&endday=2017-08-25&tree=all

is there enough data to see a pattern or figure out what is going on?
Flags: needinfo?(hskupin)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #24)
> :whimboo- I see this reporting a medium frequency of failures last week and
> continuing on this week:
> https://brasstacks.mozilla.com/orangefactor/
> ?display=Bug&bugid=1362293&startday=2017-08-14&endday=2017-08-25&tree=all
> 
> is there enough data to see a pattern or figure out what is going on?

The only idea I have here is that we do something similar as what geckodriver does in sending info level logging to stdout for each connection attempt from the client. Then we could see if the client still tries to connect, and why it fails.

I will file a new bug for this particular improvement.
Flags: needinfo?(hskupin)
With my patch on bug 1394381 landed for a short moment on autoland before it got backed out, I was able to reproduce this bug. So here some logging:

https://treeherder.mozilla.org/logviewer.html#?job_id=129146798&repo=autoland&lineNumber=1431

[task 2017-09-07T06:31:53.982817Z] 06:31:53     INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox -no-remote -marionette -profile /tmp/tmpf_6SeT.mozrunner
[task 2017-09-07T06:31:53.985692Z] 06:31:53     INFO -  Connection attempt to Marionette failed ([Errno 111] Connection refused)
[..]
[task 2017-09-07T06:32:45.113433Z] 06:32:45     INFO -  Connection attempt to Marionette failed ([Errno 111] Connection refused)
[task 2017-09-07T06:32:46.120309Z] 06:32:46     INFO -  Connection attempt to Marionette failed ([Errno 111] Connection refused)
[task 2017-09-07T06:32:46.621424Z] 06:32:46     INFO -  1504765966607	Marionette	DEBUG	Received observer notification "command-line-startup"
[task 2017-09-07T06:32:46.623572Z] 06:32:46     INFO -  1504765966608	Marionette	INFO	Enabled via --marionette
[..]
[task 2017-09-07T06:32:53.145463Z] 06:32:53     INFO -  Connection attempt to Marionette failed ([Errno 111] Connection refused)
[..]
[task 2017-09-07T06:32:58.746529Z] 06:32:58     INFO -  1504765978742	Marionette	DEBUG	Received observer notification "sessionstore-windows-restored"
[..]
[task 2017-09-07T06:33:12.072290Z] 06:33:12     INFO -  1504765992061	Marionette	INFO	Listening on port 2828
[..]
[task 2017-09-07T06:34:55.255229Z] 06:34:55    ERROR - Failure during harness execution

As you can see we start Firefox and Marionette client is trying to connect to the server. About a minute later we receive the "sessionstore-windows-restored" observer notification which starts the initialization of Marionette, and the server is up and running another 14s later. Interestingly any connection attempt from the client stopped after 06:32:53, which is exactly 1 minute after we started the application. The startup timeout in use here is by default 120s, and the mozharness config also doesn't override it as I can see here. Even if we would run into this timeout an exception would be thrown. But nothing is seen here.

So something I will try now is to force delaying the initialization of the server component to see if I can reproduce an early abort of `wait_for_port`.
Ok, the problem here is because of the following situation... Because of bug 1397734 the Marionette harness sets a timeout value for startup of the browser until the server is ready to 60s. This differs from what Marionette driver has as default, and perfectly explains what I was seeing in comment 28. Additionally to that bug 1397675 plays into account because we delay the forced shutdown by 120s by waiting for the application to quit, which doesn't make sense at all at this point.

With both bugs fixed this failure should go away when we find a sane default for our jobs in CI.
Priority: -- → P3
There are no more aborts visible since Sep 12th which is nearly a week. Before we had multiple failures a day. So I would say this is fixed now by my patches as referenced in comment 29.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in before you can comment on or make changes to this bug.