Bug 1352671 (Closed): Opened 7 years ago; Closed 7 years ago

Intermittent mochitest/reftest Automation Error: Received unexpected exception while running application after error: [Errno 10061] No connection could be made because the target machine actively refused it

Categories

(Testing :: Mochitest, defect)

Version: 3
Hardware: Unspecified
OS: Windows 10
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
firefox57: fixed
firefox58: fixed

People

(Reporter: aryx, Assigned: whimboo)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:other])

https://treeherder.mozilla.org/logviewer.html#?job_id=88044829&repo=mozilla-inbound

23:38:15     INFO - TEST-START | toolkit/components/remotebrowserutils/tests/browser/browser_RemoteWebNavigation.js
23:38:15     INFO - TEST-SKIP | toolkit/components/remotebrowserutils/tests/browser/browser_RemoteWebNavigation.js | took 0ms
23:38:15     INFO -  dir: browser/base/content/test/pageinfo
23:38:15     INFO - Increasing default timeout to 90 seconds
23:38:17     INFO -  mozprofile.addons WARNING | Could not install c:\slave\test\build\tests\mochitest\extensions\mozscreenshots: [Errno 2] No such file or directory: 'c:\\slave\\test\\build\\tests\\mochitest\\extensions\\mozscreenshots\\install.rdf'
23:38:18     INFO -  c:\slave\test\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL
23:38:18     INFO - MochitestServer : launching [u'c:\\slave\\test\\build\\tests\\bin\\xpcshell.exe', '-g', 'c:\\slave\\test\\build\\application\\firefox', '-v', '170', '-f', 'c:\\slave\\test\\build\\tests\\bin\\components\\httpd.js', '-e', "const _PROFILE_PATH = 'c:\\\\users\\\\cltbld\\\\appdata\\\\local\\\\temp\\\\tmpit0jny.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', 'c:\\slave\\test\\build\\tests\\mochitest\\server.js']
23:38:18     INFO - runtests.py | Server pid: 3420
23:38:18     INFO - runtests.py | Websocket server pid: 3372
23:38:18     INFO - runtests.py | SSL tunnel pid: 3700
23:38:18     INFO -  Couldn't convert chrome URL: chrome://branding/locale/brand.properties
23:38:18     INFO -  Couldn't convert chrome URL: chrome://branding/locale/brand.properties
23:38:19     INFO - runtests.py | Running with e10s: False
23:38:19     INFO - runtests.py | Running tests: start.
23:38:19     INFO - 
23:38:19     INFO - Application command: c:\slave\test\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\cltbld\appdata\local\temp\tmpit0jny.mozrunner
23:38:19     INFO -  [3420] WARNING: Could not get the program name for a cubeb stream.: 'NS_SUCCEEDED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/dom/media/CubebUtils.cpp, line 317
23:38:19     INFO - runtests.py | Application pid: 1956
23:38:19     INFO - TEST-INFO | started process GECKO(1956)
23:38:19     INFO - GECKO(1956) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to c:\users\cltbld\appdata\local\temp\tmpit0jny.mozrunner\runtests_leaks.log
23:38:19     INFO - GECKO(1956) | [1956] WARNING: Failed to load startupcache file correctly, removing!: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/startupcache/StartupCache.cpp, line 218
23:38:19     INFO - GECKO(1956) | [1956] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpcom/base/nsSystemInfo.cpp, line 117
23:38:19     INFO - GECKO(1956) | [1956] WARNING: This method is lossy. Use GetCanonicalPath !: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpcom/io/nsLocalFileWin.cpp, line 3572
23:38:23     INFO - GECKO(1956) | ++DOCSHELL 0F245000 == 1 [pid = 1956] [id = {ee67b6f8-4579-4dd7-9749-c9cc84f28909}]
23:38:23     INFO - GECKO(1956) | ++DOMWINDOW == 1 (0F245400) [pid = 1956] [serial = 1] [outer = 00000000]
23:38:23     INFO - GECKO(1956) | ++DOMWINDOW == 2 (0F245800) [pid = 1956] [serial = 2] [outer = 0F245400]
23:38:24     INFO - GECKO(1956) | [1956] WARNING: DWM not enabled, falling back to software vsync: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/gfx/thebes/gfxWindowsPlatform.cpp, line 1895
23:38:24     INFO - GECKO(1956) | ++DOCSHELL 0F48B400 == 2 [pid = 1956] [id = {26704789-4e7c-4371-a12d-f4dccec85fe1}]
23:38:24     INFO - GECKO(1956) | ++DOMWINDOW == 3 (0F48B800) [pid = 1956] [serial = 3] [outer = 00000000]
23:38:24     INFO - GECKO(1956) | ++DOMWINDOW == 4 (0F48BC00) [pid = 1956] [serial = 4] [outer = 0F48B800]
23:38:26     INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26     INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26     INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26     INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26     INFO - GECKO(1956) | ++DOCSHELL 10DEBC00 == 3 [pid = 1956] [id = {d41c9df6-feec-49ee-af69-6ab54802bc26}]
23:38:26     INFO - GECKO(1956) | ++DOMWINDOW == 5 (10DEC000) [pid = 1956] [serial = 5] [outer = 00000000]
23:38:26     INFO - GECKO(1956) | ++DOMWINDOW == 6 (10DEC400) [pid = 1956] [serial = 6] [outer = 10DEC000]
23:38:27     INFO - GECKO(1956) | ++DOMWINDOW == 7 (11892400) [pid = 1956] [serial = 7] [outer = 0F48B800]
23:38:27     INFO - GECKO(1956) | ++DOCSHELL 0D548000 == 4 [pid = 1956] [id = {cfff4cf8-490e-4d3d-9635-7023fc43150a}]
23:38:27     INFO - GECKO(1956) | ++DOMWINDOW == 8 (0F244800) [pid = 1956] [serial = 8] [outer = 00000000]
23:38:27     INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27     INFO - GECKO(1956) | ++DOMWINDOW == 9 (12579400) [pid = 1956] [serial = 9] [outer = 0F244800]
23:38:27     INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27     INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27     INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27     INFO - GECKO(1956) | ++DOMWINDOW == 10 (12B63800) [pid = 1956] [serial = 10] [outer = 0F244800]
23:40:20     INFO -  Traceback (most recent call last):
23:40:20     INFO -    File "c:\slave\test\build\tests\mochitest\runtests.py", line 2493, in doTests
23:40:20     INFO -      marionette_args=marionette_args,
23:40:20     INFO -    File "c:\slave\test\build\tests\mochitest\runtests.py", line 2093, in runApp
23:40:20     INFO -      self.marionette.start_session(timeout=port_timeout)
23:40:20     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
23:40:20     INFO -      m._handle_socket_failure()
23:40:20     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
23:40:20     INFO -      return func(*args, **kwargs)
23:40:20     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1319, in start_session
23:40:20     INFO -      self.protocol, _ = self.client.connect()
23:40:20     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
23:40:20     INFO -      self.sock.connect((self.addr, self.port))
23:40:20     INFO -    File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
23:40:20     INFO -      return getattr(self._sock,name)(*args)
23:40:20     INFO -  error: [Errno 10061] No connection could be made because the target machine actively refused it
23:40:20    ERROR - Automation Error: Received unexpected exception while running application
23:40:20    ERROR -
This just started a week ago, and we have 42 failures so far.

Doing a handful of retriggers near where this started:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20debug%20x64%20mochitest&tochange=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&fromchange=d8e59908610f4e0c922ee01904c4817adf3934f0

This needs a lot of data to gain confidence in a patch; right now I am looking at bug 1291397.
Whiteboard: [stockwell needswork]
The root cause here is hard to find:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20debug%20x64%20mochitest-e10s-1&tochange=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&fromchange=d8e59908610f4e0c922ee01904c4817adf3934f0&group_state=expanded

This hints at bug 1352389; doing 70+ retriggers per revision gets me to that conclusion, though I am happy to do more.

and to repeat the error:
10:41:52     INFO - runtests.py | Application pid: 2912
10:41:52     INFO - TEST-INFO | started process GECKO(2912)
10:43:53     INFO -  Traceback (most recent call last):
10:43:53     INFO -    File "C:\slave\test\build\tests\mochitest\runtests.py", line 2493, in doTests
10:43:53     INFO -      marionette_args=marionette_args,
10:43:53     INFO -    File "C:\slave\test\build\tests\mochitest\runtests.py", line 2093, in runApp
10:43:53     INFO -      self.marionette.start_session(timeout=port_timeout)
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
10:43:53     INFO -      m._handle_socket_failure()
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
10:43:53     INFO -      return func(*args, **kwargs)
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1319, in start_session
10:43:53     INFO -      self.protocol, _ = self.client.connect()
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
10:43:53     INFO -      self.sock.connect((self.addr, self.port))
10:43:53     INFO -    File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
10:43:53     INFO -      return getattr(self._sock,name)(*args)
10:43:53     INFO -  error: [Errno 10061] No connection could be made because the target machine actively refused it
10:43:53    ERROR - Automation Error: Received unexpected exception while running application
10:43:53    ERROR - 

This seems to happen in the middle of a test pass, or randomly at the beginning, and so on.


:smaug, I see you authored bug 1352389, do you think this could be causing us to have Marionette fail to connect to the server in the browser?
Flags: needinfo?(bugs)
Very unlikely.
That patch just dropped an extra script blocker from the stack, which means we don't need to allocate a script runner on the stack to run pending work when the outermost script blocker goes out of scope.

If my patch caused that, some setup is likely wrong elsewhere.
Flags: needinfo?(bugs)
Summary: Intermittent mochitest Automation Error: Received unexpected exception while running application after error: [Errno 10061] No connection could be made because the target machine actively refused it → Intermittent mochitest/reftest Automation Error: Received unexpected exception while running application after error: [Errno 10061] No connection could be made because the target machine actively refused it
:whimboo, I see that marionette is failing to establish a connection, could you help shed some light on what is going on here?
Flags: needinfo?(hskupin)
Do we know why it started on April 1st? Was there a change in mozilla-central? The first failure happened on inbound, by the way; maybe it reached central via a merge later that day. Or might this have been a change in the infrastructure?

I would propose that we check the pushlog between March 31st and April 1st.

A quick look showed me the following change for `mozJSComponentLoader.cpp`, and given that Marionette is a JS component, I wonder if this could have had an influence:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&selectedJob=87998560

At least it looks like the Marionette code is not getting started at all, and as such no server socket is created. But all of this is vague, given that I don't have any Gecko log output. If this failure is happening that often, why not enable Gecko logging and print everything out?
Flags: needinfo?(hskupin)
:whimboo, looking at a larger range, you can see this occurring a few pushes earlier:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20x64%20debug&tochange=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&fromchange=d8e59908610f4e0c922ee01904c4817adf3934f0

You can see we did some backouts around where this first started. Are there more ways to debug this, or do any of the revisions in that range look interesting to you?
Flags: needinfo?(hskupin)
The first occurrence of this is 6551d56d6587, but that is a clobber build; the previous clobber was 3 pushes earlier, rev 13d070200dfe. Here is the range:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20x64%20debug&tochange=6551d56d65871c61e52bbd8e0a728ec70f488aed&fromchange=13d070200dfef7296a1d6b6aa0a877a0d37c434b

The only code change in there is 30fe7c61b5db, but that was backed out shortly thereafter and we still see this failure, so I am confused about why we get it.

I would prefer to debug this as a real problem rather than tracking it down by bisecting.

In looking at:
10:41:52     INFO - TEST-INFO | started process GECKO(2912)
10:43:53     INFO -  Traceback (most recent call last):
10:43:53     INFO -    File "C:\slave\test\build\tests\mochitest\runtests.py", line 2493, in doTests
10:43:53     INFO -      marionette_args=marionette_args,
10:43:53     INFO -    File "C:\slave\test\build\tests\mochitest\runtests.py", line 2093, in runApp
10:43:53     INFO -      self.marionette.start_session(timeout=port_timeout)
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
10:43:53     INFO -      m._handle_socket_failure()
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
10:43:53     INFO -      return func(*args, **kwargs)
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1319, in start_session
10:43:53     INFO -      self.protocol, _ = self.client.connect()
10:43:53     INFO -    File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
10:43:53     INFO -      self.sock.connect((self.addr, self.port))
10:43:53     INFO -    File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
10:43:53     INFO -      return getattr(self._sock,name)(*args)
10:43:53     INFO -  error: [Errno 10061] No connection could be made because the target machine actively refused it
10:43:53    ERROR - Automation Error: Received unexpected exception while running application

Why do we actively refuse a connection? Is there more information we could get here, such as the processes running, a debug log of Firefox networking, or the open ports and connections on the machine?
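For what it's worth, errno 10061 (WSAECONNREFUSED on Windows) specifically means the OS reached the target address but nothing was listening on the port, i.e. the Marionette server socket was never created. A minimal stdlib sketch (a hypothetical helper, not harness code) that distinguishes this case from other socket failures:

```python
import errno
import socket

def probe_port(host, port, timeout=2.0):
    """Try to connect; return None on success or the failing errno.

    errno.ECONNREFUSED (10061 on Windows) means the machine answered
    but nothing is listening on the port -- the server never started.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()
    return None

if __name__ == "__main__":
    # Grab an ephemeral port that nothing listens on, then probe it.
    tmp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tmp.bind(("127.0.0.1", 0))
    free_port = tmp.getsockname()[1]
    tmp.close()
    print(probe_port("127.0.0.1", free_port) == errno.ECONNREFUSED)
```

Logging the returned errno next to a process listing at failure time would tell us whether the browser process is still alive while refusing the port.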
This could be the Windows version of bug 1261598. I'll see if I can verify that.
Flags: needinfo?(gbrown)
(In reply to Joel Maher ( :jmaher) from comment #8)
> you can see we did some backouts around where this first started.  Is there
> more ways to debug this or do any of the revisions in that range look
> interesting to you?

There is nothing obvious to me. Sorry.

(In reply to Joel Maher ( :jmaher) from comment #10)
> why do we actively refuse a connection?  Is there more info we could get
> here about processes running, a debug log of firefox networking?  open ports
> and connections on the machine?

We could add some psutil output to the log when we run test jobs through Mozharness, especially in failure cases. Maybe that could help.
Flags: needinfo?(hskupin)
I would propose that we get try builds up with tracing output enabled for Marionette. It could give us a hint as to whether its server component fails to create the socket.

This can be done by the consumer of Marionette, which in this case is the mochitest runner. There is a `verbose` argument to the Marionette `__init__()` method; it should be set to `2`.

https://dxr.mozilla.org/mozilla-central/rev/a374c35469935a874fefe64d3e07003fc5bc8884/testing/mochitest/runtests.py#2103

Otherwise, setting the preference `marionette.logging` should also work. We could do that via the pref_general.py file which mochitests also utilize, couldn't we? I would suggest that we turn this on by default for all test suites. Given that Marionette is only used to install add-ons, it should not add much to the log file anyway.
Flags: needinfo?(ato)
Flags: needinfo?(ato)
See Also: → 1353236
(In reply to Geoff Brown [:gbrown] from comment #11)
> This could be the Windows version of bug 1261598. I'll see if I can verify that.

I've been trying, but I cannot reproduce this failure with my logging. Unlucky?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=19f0da4c10f4e09852e1c5b5483923a14d10d9e9
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bba198dad5ebe86ad35561727a8c97b36e62b79
I finally reproduced with my logging, but the failed run did not get to my logging. That suggests that Firefox is hanging on startup earlier than in bug 1261598.

https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown@mozilla.com&selectedJob=92818111
Flags: needinfo?(gbrown)
I notice now that failed runs do not have the normal startup chatter:

19:56:44     INFO - TEST-INFO | started process GECKO(2968)
19:56:44     INFO - GECKO(2968) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to c:\users\cltbld~1.t-w\appdata\local\temp\tmpdtspxy.mozrunner\runtests_leaks.log
19:56:44     INFO - GECKO(2968) | [2968] WARNING: Failed to load startupcache file correctly, removing!: file c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/startupcache/StartupCache.cpp, line 218
19:56:44     INFO - GECKO(2968) | [2968] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/xpcom/base/nsSystemInfo.cpp, line 117
19:56:45     INFO - GECKO(2968) | [2968] WARNING: This method is lossy. Use GetCanonicalPath !: file c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/xpcom/io/nsLocalFileWin.cpp, line 3572

vs

16:02:50     INFO - TEST-INFO | started process GECKO(3844)
16:04:51     INFO -  Traceback (most recent call last):
...
16:04:51     INFO -  error: [Errno 10061] No connection could be made because the target machine actively refused it
16:04:51    ERROR - Automation Error: Received unexpected exception while running application

I think that means we're not even running NS_InitXPCOM.
This is still failing quite frequently. As this is a Marionette issue, can we get someone with Marionette expertise to own it?
Flags: needinfo?(dburns)
I haven't managed to get back to my earlier investigation, but my conclusion from comments 15 through 17 was that the browser is hanging on startup well before Marionette is started; I think it's anything but Marionette.
(In reply to Joel Maher ( :jmaher) from comment #19)
> this is still failing quite frequently- as this is a marionette issue, can
> we get someone with marionette expertise to own this?

According to comment 20, this is a bigger issue than Marionette; I will find an owner if it turns out to be a Marionette issue.
Flags: needinfo?(dburns)
Flags: needinfo?(gbrown)
A significant number of these failures have been reported under bug 1353236; that's been closed so I expect a frequency increase here.
Flags: needinfo?(gbrown)
See Also: → 1261598
Depends on: 1362901
I narrowed down the hang (crash?) to bug 1362901 - a problem in install_rust_panic_hook(), which was introduced just before these failures started.
There has been an excellent reduction in frequency here due to the change in bug 1362901/bug 1358151.

I will keep this open for a bit to monitor, but we may be able to resolve it as a duplicate of bug 1358151.
Whiteboard: [stockwell needswork] → [stockwell fixed]
A few failures linger -- can't quite close this.
Assignee: gbrown → nobody
I can see the following in the logs:

19:05:03     INFO - GECKO(1144) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x28008A,name=PBrowser::Msg_UpdateNativeWindowHandle) Channel error: cannot send/recv
19:05:03     INFO - GECKO(1144) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x28007E,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv
19:05:03     INFO - GECKO(1144) | Unable to read VR Path Registry from C:\Users\cltbld.T-W864-IX-319.001\AppData\Local\openvr\openvrpaths.vrpath
19:05:03     INFO - GECKO(1144) | JavaScript error: resource:///modules/ContentCrashHandlers.jsm, line 135: TypeError: WeakMap key must be an object, got undefined

Judging by the JavaScript failure, this refers to an "oop-frameloader-crashed" situation, which would perfectly explain what we see here. Due to that failure the crash handler doesn't seem to kill Firefox, and it hangs.

I will file a bug to get the JavaScript error fixed.
Whiteboard: [stockwell fixed] → [stockwell unknown]
Depends on: 1378036
There has been a huge increase in these failures since Sep 15th, where Marionette fails after 120s of application startup:

14:20:14     INFO - Application command: C:\slave\test\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\cltbld\appdata\local\temp\tmpcv9aot.mozrunner
14:20:14     INFO - runtests.py | Application pid: 7332
14:20:14     INFO - TEST-INFO | started process GECKO(7332)
14:22:15     INFO -  Traceback (most recent call last):

Generally this failure occurs for tests run on Windows TC workers. Because I worked on bug 1399592 last week, I wondered about Marionette's `startup_timeout` for Windows workers, and indeed the default value of 120s is used there, while for Linux we use 180s.

I would propose that we explicitly set this timeout in the mozharness configs for Windows. This should drop the number of failures again. Otherwise this should also depend on bug 1382162.
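The effect of `startup_timeout` is just a bounded wait for the Marionette port to start accepting connections; a minimal stdlib sketch of that logic (a hypothetical helper illustrating the behavior, not the harness code):

```python
import socket
import time

def wait_for_port(host, port, timeout=180.0, poll=0.1):
    """Poll until something accepts a TCP connection on (host, port).

    Returns True once a connect succeeds, False if `timeout` seconds
    elapse first -- the extra headroom a larger startup_timeout buys
    slow Windows workers.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=poll):
                return True
        except OSError:
            time.sleep(poll)
    return False
```

With the default of 120s, a debug build that needs, say, 150s to register the Marionette component will always hit the traceback above even though nothing is actually broken.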
Depends on: 1382162
Depends on: 1400786
With my recent changes the startup_timeout should now be 180s, but I still see tests failing after 120s:

09:24:40     INFO - Application command: C:\slave\test\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\cltbld\appdata\local\temp\tmpjr0x3l.mozrunner
09:24:40     INFO - runtests.py | Application pid: 5188
09:24:40     INFO - TEST-INFO | started process GECKO(5188)
09:26:41     INFO -  Traceback (most recent call last):

I wonder where this is coming from. Maybe somewhere we do not correctly set Marionette's startup_timeout and thus fall back to the default of 120s.
So I think something is really busted with the build when this failure happens, because in all other cases we see the following within the same run:

18:18:10     INFO - Application command: C:\slave\test\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\cltbld\appdata\local\temp\tmpjajzd5.mozrunner
18:18:10     INFO - runtests.py | Application pid: 1372
18:18:10     INFO - TEST-INFO | started process GECKO(1372)
18:18:11     INFO - GECKO(1372) | 1505783891676	Marionette	INFO	Enabled via --marionette
18:18:16     INFO - GECKO(1372) | 1505783896234	Marionette	INFO	Listening on port 2828

Marionette is enabled within 1s and starts listening after 6s. Why does nothing happen during some starts of Firefox? We should at least see "Enabled via --marionette", which means the Marionette component got loaded, registered, and enabled. But nothing happens within 2 minutes. That gives me the feeling that something is horribly broken during application startup.

Lately this only happens on Windows 10 machines for opt/pgo builds.
Blocks: 1386590
OS: Unspecified → Windows 10
I checked a couple of the latest results as reported through OF and noticed that only test jobs which still run on the buildbot testers seem to be affected. All jobs run via TC are fine.

Joel, what is our plan with those old builders? Do we have to keep them for now due to some specific hardware requirements?

I could bump the startup_timeout for Marionette for those jobs by modifying the mochitest options here:
https://hg.mozilla.org/mozilla-central/file/tip/testing/mozharness/configs/unittests/win_unittest.py#l90
Flags: needinfo?(jmaher)
That is a good way to reduce the problem. It is odd that we don't see this on debug; maybe that helps narrow it down further. We are mostly seeing mochitest-chrome-X on win10-pgo; there are other data points, but that is the largest sample.

Historically we have not been able to get mochitest-chrome-X converted for Windows in AWS due to 1 or 2 tests; possibly we need to push harder on those, disable them, and migrate :)

As it stands there are no concrete plans to run the Windows 10 tests from buildbot on AWS; the list right now is:
* mochitest-media
* mochitest-clipboard
* mochitest-gpu
* mochitest-webgl
* mochitest-browser-chrome
* mochitest-chrome
* reftest
* reftest-no-accel

We are really close to getting all the tests running on Windows 10. What is odd is that we don't see these issues on Windows 8, which is 100% hardware.
Flags: needinfo?(jmaher)
Depends on: 1401170
Here is a try push running mochitest-chrome on a TaskCluster VM for Win7/10:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=77b341eb9ba768225a79fbfbe734c689afe2487f

^ jobs are still in progress.
When the trees reopen, I will move mochitest-chrome to the VMs, which will reduce the frequency of this.
Depends on: 1401184
Joel, the browser-chrome mochitest suite seems to be the most affected, and it still runs on buildbot machines. Are there blockers, or could this suite also be moved over to TC as you did for mochitest-chrome yesterday?
Flags: needinfo?(jmaher)
I moved mochitest-chrome to TaskCluster. browser-chrome had a lot of leaks which were not trivial to track down, although on opt we could get browser-chrome onto TC faster. As a note, I think on Tuesday we moved browser-chrome to Windows 10, which is why we didn't see it last week.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #64)
> browser-chrome to windows10, so that is why we didn't see it last week.

On which platform version did those tests run before? Was it Windows 7 or 8? Do we already have a bug which tracks the move of those jobs to TC? Opt would be great to get done earlier because that's exactly the type of build which fails most.
We migrated from Windows 8 to Windows 10 (hardware to hardware, bug 1397226).

Right now opt is failing quite a bit; it looks like a few unique test cases, so maybe this is doable:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4dcb574fae73c45cf5ebeb7db795a05cae79f00a
OK, a new push with 5 tests disabled; let's see how green this is, and if so we can probably work on fixing these 5 issues:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=148c429105c2f1376c870d0f00592f87c23480cc
Whiteboard: [stockwell unknown] → [stockwell needswork]
Joel, there was a massive increase in this failure on Saturday. Can you recall anything which landed lately that could have caused it? It mainly affects tests running on Windows 10 (buildbot machines).
Flags: needinfo?(jmaher)
The main thing that would have changed is the screen resolution. The good news is that mochitest-browser-chrome is ready to move to virtual machines; sadly, I don't think that will help.

As a note, the spike on Saturday is actually misleading: we have fewer builds but the same number of failures, so there is no coalescing and we get a full set of failures instead of some builds with no failures and some with many.
Flags: needinfo?(jmaher)
Looks like the move of the mochitest-browser-chrome tests from hardware to VMs (bug 1402068) also helped a lot with this bug. Since it landed on inbound yesterday, no more failures are visible for this kind of job.
Depends on: 1402068
All of those failures are coming from Buildbot machines. So the move of browser-chrome mochitests to TC workers helped a lot!

Similar to bug 1261598, the most affected test jobs are mochitest-gl-e10s-*, but I also see mochitest-media-e10s. Joel, what holds us back from getting those onto TC? Is that related to graphics too? Maybe we can file tracking bugs at least for those outstanding migrations, so that we can mark dependencies correctly?
Flags: needinfo?(jmaher)
Depends on: 1403484
I will work on getting bug 1403523 annotated properly; it will take a day or two, as I am in meetings most of Wednesday and Thursday.
Flags: needinfo?(jmaher)
We just got a bunch of these in bug 1403220. e.g.

10:35:13     INFO -  REFTEST INFO | Application command: Z:\task_1506418603\build\application\firefox\firefox.exe -marionette -profile c:\users\genericworker\appdata\local\temp\tmpnrbrlw.mozrunner
10:35:14     INFO -  1506422114370	Marionette	INFO	Enabled via --marionette
10:38:14     INFO -  Traceback (most recent call last):
10:38:14     INFO -    File "Z:\task_1506418603\build\tests\reftest\runreftest.py", line 854, in <module>
10:38:14     INFO -      sys.exit(run_test_harness(parser, options))
10:38:14     INFO -    File "Z:\task_1506418603\build\tests\reftest\runreftest.py", line 846, in run_test_harness
10:38:14     INFO -      result = reftest.runTests(options.tests, options)
10:38:14     INFO -    File "Z:\task_1506418603\build\tests\reftest\runreftest.py", line 507, in runTests
10:38:14     INFO -      return self.runSerialTests(manifests, options, cmdargs)
10:38:14     INFO -    File "Z:\task_1506418603\build\tests\reftest\runreftest.py", line 766, in runSerialTests
10:38:14     INFO -      debuggerInfo=debuggerInfo)
10:38:14     INFO -    File "Z:\task_1506418603\build\tests\reftest\runreftest.py", line 702, in runApp
10:38:14     INFO -      marionette.start_session()
10:38:14     INFO -    File "Z:\task_1506418603\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
10:38:14     INFO -      m._handle_socket_failure()
10:38:14     INFO -    File "Z:\task_1506418603\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
10:38:14     INFO -      return func(*args, **kwargs)
10:38:14     INFO -    File "Z:\task_1506418603\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1228, in start_session
10:38:14     INFO -      self.protocol, _ = self.client.connect()
10:38:14     INFO -    File "Z:\task_1506418603\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
10:38:14     INFO -      self.sock.connect((self.addr, self.port))
10:38:14     INFO -    File "c:\mozilla-build\python\Lib\socket.py", line 228, in meth
10:38:14     INFO -      return getattr(self._sock,name)(*args)
10:38:14     INFO -  socket.error: [Errno 10061] No connection could be made because the target machine actively refused it
10:38:14    ERROR - Return code: 1
10:38:14    ERROR - No tests run or test summary not found

In that bug, it appears as if Firefox is crashing during or shortly after startup.

What's really unfortunate is that the socket I/O error thrown by Marionette appears to be masking a general application-level failure. It isn't obvious from the log output that the launched process died unexpectedly. At the very least, I'd expect some kind of log message around unexpected process termination. Ideally it would print any crash info, if available.
Depends on: 1403220
Hm, interesting! So this is indeed a problem in the test harness which uses Marionette: it doesn't correctly handle the exception thrown by marionette_driver, and as such cannot print the crash details.

So check that:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#2156

If `start_session` throws an exception, we completely skip everything after it! This is awful! I believe the same applies to reftests, and maybe wptrunner.
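The fix amounts to catching the socket error, running the crash reporting that currently gets skipped, and re-raising. A sketch of that control flow with stand-in objects (none of these names are the real harness API; they only illustrate the shape of the fix):

```python
class StubMarionette:
    """Stand-in for marionette_driver's client (hypothetical)."""
    def start_session(self):
        raise OSError(10061, "No connection could be made")

def run_app(marionette, check_for_crashes, log):
    """If start_session() fails, still run crash reporting before
    bailing out, so minidumps are surfaced instead of dropped."""
    try:
        return marionette.start_session()
    except OSError as exc:
        log.append("start_session failed: %s" % exc)
        check_for_crashes()  # print crash info before re-raising
        raise
```

The key point is the bare `raise`: the harness still fails the job, but the log now contains the crash details that explain why the port was refused.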
Depends on: 1403616
Bug 1403220 doesn't really block fixing this failure. The patch I already have up on bug 1403616 will actually fix it, after which this failure and bug 1261598 should be gone completely.
No longer depends on: 1403220
With the patch on bug 1403616 landed, we no longer see this failure, but bug 1397201 instead.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:other]
Component: Mochitest Chrome → Mochitest