Intermittent mochitest/reftest Automation Error: Received unexpected exception while running application after error: [Errno 10061] No connection could be made because the target machine actively refused it

NEW
Unassigned

Status

Testing
Mochitest Chrome
25 days ago
2 days ago

People

(Reporter: aryx, Unassigned)

Tracking

({intermittent-failure})

Version 3
intermittent-failure
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [stockwell needswork])

https://treeherder.mozilla.org/logviewer.html#?job_id=88044829&repo=mozilla-inbound

23:38:15     INFO - TEST-START | toolkit/components/remotebrowserutils/tests/browser/browser_RemoteWebNavigation.js
23:38:15     INFO - TEST-SKIP | toolkit/components/remotebrowserutils/tests/browser/browser_RemoteWebNavigation.js | took 0ms
23:38:15     INFO -  dir: browser/base/content/test/pageinfo
23:38:15     INFO - Increasing default timeout to 90 seconds
23:38:17     INFO -  mozprofile.addons WARNING | Could not install c:\slave\test\build\tests\mochitest\extensions\mozscreenshots: [Errno 2] No such file or directory: 'c:\\slave\\test\\build\\tests\\mochitest\\extensions\\mozscreenshots\\install.rdf'
23:38:18     INFO -  c:\slave\test\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL
23:38:18     INFO - MochitestServer : launching [u'c:\\slave\\test\\build\\tests\\bin\\xpcshell.exe', '-g', 'c:\\slave\\test\\build\\application\\firefox', '-v', '170', '-f', 'c:\\slave\\test\\build\\tests\\bin\\components\\httpd.js', '-e', "const _PROFILE_PATH = 'c:\\\\users\\\\cltbld\\\\appdata\\\\local\\\\temp\\\\tmpit0jny.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', 'c:\\slave\\test\\build\\tests\\mochitest\\server.js']
23:38:18     INFO - runtests.py | Server pid: 3420
23:38:18     INFO - runtests.py | Websocket server pid: 3372
23:38:18     INFO - runtests.py | SSL tunnel pid: 3700
23:38:18     INFO -  Couldn't convert chrome URL: chrome://branding/locale/brand.properties
23:38:18     INFO -  Couldn't convert chrome URL: chrome://branding/locale/brand.properties
23:38:19     INFO - runtests.py | Running with e10s: False
23:38:19     INFO - runtests.py | Running tests: start.
23:38:19     INFO - 
23:38:19     INFO - Application command: c:\slave\test\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\cltbld\appdata\local\temp\tmpit0jny.mozrunner
23:38:19     INFO -  [3420] WARNING: Could not get the program name for a cubeb stream.: 'NS_SUCCEEDED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/dom/media/CubebUtils.cpp, line 317
23:38:19     INFO - runtests.py | Application pid: 1956
23:38:19     INFO - TEST-INFO | started process GECKO(1956)
23:38:19     INFO - GECKO(1956) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to c:\users\cltbld\appdata\local\temp\tmpit0jny.mozrunner\runtests_leaks.log
23:38:19     INFO - GECKO(1956) | [1956] WARNING: Failed to load startupcache file correctly, removing!: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/startupcache/StartupCache.cpp, line 218
23:38:19     INFO - GECKO(1956) | [1956] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpcom/base/nsSystemInfo.cpp, line 117
23:38:19     INFO - GECKO(1956) | [1956] WARNING: This method is lossy. Use GetCanonicalPath !: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpcom/io/nsLocalFileWin.cpp, line 3572
23:38:23     INFO - GECKO(1956) | ++DOCSHELL 0F245000 == 1 [pid = 1956] [id = {ee67b6f8-4579-4dd7-9749-c9cc84f28909}]
23:38:23     INFO - GECKO(1956) | ++DOMWINDOW == 1 (0F245400) [pid = 1956] [serial = 1] [outer = 00000000]
23:38:23     INFO - GECKO(1956) | ++DOMWINDOW == 2 (0F245800) [pid = 1956] [serial = 2] [outer = 0F245400]
23:38:24     INFO - GECKO(1956) | [1956] WARNING: DWM not enabled, falling back to software vsync: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/gfx/thebes/gfxWindowsPlatform.cpp, line 1895
23:38:24     INFO - GECKO(1956) | ++DOCSHELL 0F48B400 == 2 [pid = 1956] [id = {26704789-4e7c-4371-a12d-f4dccec85fe1}]
23:38:24     INFO - GECKO(1956) | ++DOMWINDOW == 3 (0F48B800) [pid = 1956] [serial = 3] [outer = 00000000]
23:38:24     INFO - GECKO(1956) | ++DOMWINDOW == 4 (0F48BC00) [pid = 1956] [serial = 4] [outer = 0F48B800]
23:38:26     INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26     INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26     INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26     INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26     INFO - GECKO(1956) | ++DOCSHELL 10DEBC00 == 3 [pid = 1956] [id = {d41c9df6-feec-49ee-af69-6ab54802bc26}]
23:38:26     INFO - GECKO(1956) | ++DOMWINDOW == 5 (10DEC000) [pid = 1956] [serial = 5] [outer = 00000000]
23:38:26     INFO - GECKO(1956) | ++DOMWINDOW == 6 (10DEC400) [pid = 1956] [serial = 6] [outer = 10DEC000]
23:38:27     INFO - GECKO(1956) | ++DOMWINDOW == 7 (11892400) [pid = 1956] [serial = 7] [outer = 0F48B800]
23:38:27     INFO - GECKO(1956) | ++DOCSHELL 0D548000 == 4 [pid = 1956] [id = {cfff4cf8-490e-4d3d-9635-7023fc43150a}]
23:38:27     INFO - GECKO(1956) | ++DOMWINDOW == 8 (0F244800) [pid = 1956] [serial = 8] [outer = 00000000]
23:38:27     INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27     INFO - GECKO(1956) | ++DOMWINDOW == 9 (12579400) [pid = 1956] [serial = 9] [outer = 0F244800]
23:38:27     INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27     INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27     INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27     INFO - GECKO(1956) | ++DOMWINDOW == 10 (12B63800) [pid = 1956] [serial = 10] [outer = 0F244800]
23:40:20     INFO -  Traceback (most recent call last):
23:40:20     INFO -    File "c:\slave\test\build\tests\mochitest\runtests.py", line 2493, in doTests
23:40:20     INFO -      marionette_args=marionette_args,
23:40:20     INFO -    File "c:\slave\test\build\tests\mochitest\runtests.py", line 2093, in runApp
23:40:20     INFO -      self.marionette.start_session(timeout=port_timeout)
23:40:20     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
23:40:20     INFO -      m._handle_socket_failure()
23:40:20     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
23:40:20     INFO -      return func(*args, **kwargs)
23:40:20     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1319, in start_session
23:40:20     INFO -      self.protocol, _ = self.client.connect()
23:40:20     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
23:40:20     INFO -      self.sock.connect((self.addr, self.port))
23:40:20     INFO -    File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
23:40:20     INFO -      return getattr(self._sock,name)(*args)
23:40:20     INFO -  error: [Errno 10061] No connection could be made because the target machine actively refused it
23:40:20    ERROR - Automation Error: Received unexpected exception while running application
23:40:20    ERROR -

Comment 1

23 days ago
14 failures in 845 pushes (0.017 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 5
* autoland: 5
* mozilla-central: 2
* try: 1
* mozilla-aurora: 1

Platform breakdown:
* windows8-64: 11
* windows7-32-vm: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352671&startday=2017-03-27&endday=2017-04-02&tree=all
this just started a week ago and we have 42 failures so far.

doing a handful of retriggers nearby where this started:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20debug%20x64%20mochitest&tochange=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&fromchange=d8e59908610f4e0c922ee01904c4817adf3934f0

this needs a lot of data to get confidence in a patch, right now I am looking at 1291397.
Whiteboard: [stockwell needswork]
this is hard to find the root cause:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20debug%20x64%20mochitest-e10s-1&tochange=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&fromchange=d8e59908610f4e0c922ee01904c4817adf3934f0&group_state=expanded

this hints at bug 1352389- doing 70+ retriggers/revision gets me to that conclusion, I am happy to do more though.

and to repeat the error:
0:41:52     INFO - runtests.py | Application pid: 2912
10:41:52     INFO - TEST-INFO | started process GECKO(2912)
10:43:53     INFO -  Traceback (most recent call last):
10:43:53     INFO -    File "C:\slave\test\build\tests\mochitest\runtests.py", line 2493, in doTests
10:43:53     INFO -      marionette_args=marionette_args,
10:43:53     INFO -    File "C:\slave\test\build\tests\mochitest\runtests.py", line 2093, in runApp
10:43:53     INFO -      self.marionette.start_session(timeout=port_timeout)
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
10:43:53     INFO -      m._handle_socket_failure()
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
10:43:53     INFO -      return func(*args, **kwargs)
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1319, in start_session
10:43:53     INFO -      self.protocol, _ = self.client.connect()
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
10:43:53     INFO -      self.sock.connect((self.addr, self.port))
10:43:53     INFO -    File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
10:43:53     INFO -      return getattr(self._sock,name)(*args)
10:43:53     INFO -  error: [Errno 10061] No connection could be made because the target machine actively refused it
10:43:53    ERROR - Automation Error: Received unexpected exception while running application
10:43:53    ERROR - 

this seems to happen in the middle of a test pass, or randomly at the beginning, etc.


:smaug, I see you authored bug 1352389, do you think this could be causing us to have Marionette fail to connect to the server in the browser?
Flags: needinfo?(bugs)

Comment 4

18 days ago
Very very unlikely. 
That just dropped extra script blocker from stack, which means we don't need to allocate script runner from stack to run stuff when the outermost script blocker goes out from stack.

If my patch caused that, some setup is likely wrong elsewhere.
Flags: needinfo?(bugs)
Summary: Intermittent mochitest Automation Error: Received unexpected exception while running application after error: [Errno 10061] No connection could be made because the target machine actively refused it → Intermittent mochitest/reftest Automation Error: Received unexpected exception while running application after error: [Errno 10061] No connection could be made because the target machine actively refused it

Comment 5

16 days ago
49 failures in 867 pushes (0.057 failures/push) were associated with this bug in the last 7 days. 

This is the #37 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* mozilla-inbound: 23
* autoland: 17
* mozilla-central: 7
* mozilla-aurora: 2

Platform breakdown:
* windows8-64: 40
* windows7-32-vm: 6
* windows7-32: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352671&startday=2017-04-03&endday=2017-04-09&tree=all
:whimboo, I see that marionette is failing to establish a connection, could you help shed some light on what is going on here?
Flags: needinfo?(hskupin)
Do we know why it started on April 1st? Was there a change in mozilla-central? First failure happened on inbound btw. Maybe it got to central via a merge later that day. Or might this have been a change in the infrastructure?

I would propose that we check the pushlog between March 31st and April 1st.

A quick look showed me the following change for `mozJSComponentLoader.cpp`, and given that Marionette is a JS component I wonder if this could have had an influence:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&selectedJob=87998560

At least it looks like that the Marionette code is not getting started at all. And as such no server socket will be created. But all this is vague given that I don't really have Gecko log output. So if that failure is happening that often, why not enable Gecko logging and printing all out.
Flags: needinfo?(hskupin)
:whimboo, looking at a larger range you can see this occurring a few pushes in the past:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20x64%20debug&tochange=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&fromchange=d8e59908610f4e0c922ee01904c4817adf3934f0

you can see we did some backouts around where this first started.  Is there more ways to debug this or do any of the revisions in that range look interesting to you?
Flags: needinfo?(hskupin)

Comment 9

9 days ago
54 failures in 894 pushes (0.06 failures/push) were associated with this bug in the last 7 days. 

This is the #24 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* autoland: 28
* mozilla-inbound: 18
* mozilla-central: 6
* try: 2

Platform breakdown:
* windows8-64: 43
* windows7-32-vm: 8
* windows7-32: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352671&startday=2017-04-10&endday=2017-04-16&tree=all
the first occurrence of this is 6551d56d6587, but this is a clobber build, the previous clobber was 3 pushes prior in rev 13d070200dfe, here is the range:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20x64%20debug&tochange=6551d56d65871c61e52bbd8e0a728ec70f488aed&fromchange=13d070200dfef7296a1d6b6aa0a877a0d37c434b

the only code change in there is 30fe7c61b5db but that is backed out shortly thereafter and we still see this failure- so I am confused why we get this.

I would prefer to debug this as a real problem vs tracking it down by bisecting.  

In looking at:
10:41:52     INFO - TEST-INFO | started process GECKO(2912)
10:43:53     INFO -  Traceback (most recent call last):
10:43:53     INFO -    File "C:\slave\test\build\tests\mochitest\runtests.py", line 2493, in doTests
10:43:53     INFO -      marionette_args=marionette_args,
10:43:53     INFO -    File "C:\slave\test\build\tests\mochitest\runtests.py", line 2093, in runApp
10:43:53     INFO -      self.marionette.start_session(timeout=port_timeout)
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
10:43:53     INFO -      m._handle_socket_failure()
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
10:43:53     INFO -      return func(*args, **kwargs)
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1319, in start_session
10:43:53     INFO -      self.protocol, _ = self.client.connect()
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
10:43:53     INFO -      self.sock.connect((self.addr, self.port))
10:43:53     INFO -    File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
10:43:53     INFO -      return getattr(self._sock,name)(*args)
10:43:53     INFO -  error: [Errno 10061] No connection could be made because the target machine actively refused it
10:43:53    ERROR - Automation Error: Received unexpected exception while running application

why do we actively refuse a connection?  Is there more info we could get here about processes running, a debug log of firefox networking?  open ports and connections on the machine?
This could be the Windows version of bug 1261598. I'll see if I can verify that.
Flags: needinfo?(gbrown)
(In reply to Joel Maher ( :jmaher) from comment #8)
> you can see we did some backouts around where this first started.  Is there
> more ways to debug this or do any of the revisions in that range look
> interesting to you?

There is nothing obvious to me. Sorry.

(In reply to Joel Maher ( :jmaher) from comment #10)
> why do we actively refuse a connection?  Is there more info we could get
> here about processes running, a debug log of firefox networking?  open ports
> and connections on the machine?

We could add some psutils output to the log when we run test jobs through Mozharness. Especially in failure cases. Maybe this could help.
Flags: needinfo?(hskupin)
I would propose that we get try builds up with tracing output enabled for Marionette. It could give us a hint if the server component of it fails to create the socket.

This should be able to get done by the consumer of Marionette which in those cases would be the Mochitest runner. There is a `verbose` argument to the Marionette `__init__()` method. It should be set to `2`. 

https://dxr.mozilla.org/mozilla-central/rev/a374c35469935a874fefe64d3e07003fc5bc8884/testing/mochitest/runtests.py#2103

Otherwise setting the preference `marionette.logging` should also work. We could do it via the pref_general.py file which Mochitests also utilize, or? I would suggest that we turn this on by default for all test suites. Given that Marionette is only used to install addons, it should not add that much to the log file anyway.
Flags: needinfo?(ato)
Flags: needinfo?(ato)
See Also: → bug 1353236
(In reply to Geoff Brown [:gbrown] from comment #11)
> This could be the Windows version of bug 1261598. I'll see if I can verify that.

I've been trying, but I cannot reproduce this failure with my logging. Unlucky?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=19f0da4c10f4e09852e1c5b5483923a14d10d9e9
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bba198dad5ebe86ad35561727a8c97b36e62b79
I finally reproduced with my logging, but the failed run did not get to my logging. That suggests that Firefox is hanging on startup earlier than in bug 1261598.

https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown@mozilla.com&selectedJob=92818111
Flags: needinfo?(gbrown)
Better link, for the curious: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6635301672f6595ade83979a93a912a164df7868&selectedJob=92818111
I notice now that failed runs do not have the normal startup chatter:

19:56:44     INFO - TEST-INFO | started process GECKO(2968)
19:56:44     INFO - GECKO(2968) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to c:\users\cltbld~1.t-w\appdata\local\temp\tmpdtspxy.mozrunner\runtests_leaks.log
19:56:44     INFO - GECKO(2968) | [2968] WARNING: Failed to load startupcache file correctly, removing!: file c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/startupcache/StartupCache.cpp, line 218
19:56:44     INFO - GECKO(2968) | [2968] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/xpcom/base/nsSystemInfo.cpp, line 117
19:56:45     INFO - GECKO(2968) | [2968] WARNING: This method is lossy. Use GetCanonicalPath !: file c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/xpcom/io/nsLocalFileWin.cpp, line 3572

vs

16:02:50     INFO - TEST-INFO | started process GECKO(3844)
16:04:51     INFO -  Traceback (most recent call last):
...
16:04:51     INFO -  error: [Errno 10061] No connection could be made because the target machine actively refused it
16:04:51    ERROR - Automation Error: Received unexpected exception while running application

I think that means we're not even running NS_InitXPCOM.
46 failures in 817 pushes (0.056 failures/push) were associated with this bug in the last 7 days. 

This is the #15 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* mozilla-central: 15
* autoland: 15
* mozilla-inbound: 13
* try: 1
* mozilla-beta: 1
* graphics: 1

Platform breakdown:
* windows8-64: 34
* windows7-32-vm: 10
* windows7-32: 1
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352671&startday=2017-04-17&endday=2017-04-23&tree=all
You need to log in before you can comment on or make changes to this bug.