Closed
Bug 1352671
Opened 8 years ago
Closed 7 years ago
Intermittent mochitest/reftest Automation Error: Received unexpected exception while running application after error: [Errno 10061] No connection could be made because the target machine actively refused it
Categories
(Testing :: Mochitest, defect)
Tracking
(firefox57 fixed, firefox58 fixed)
RESOLVED
FIXED
People
(Reporter: aryx, Assigned: whimboo)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:other])
https://treeherder.mozilla.org/logviewer.html#?job_id=88044829&repo=mozilla-inbound
23:38:15 INFO - TEST-START | toolkit/components/remotebrowserutils/tests/browser/browser_RemoteWebNavigation.js
23:38:15 INFO - TEST-SKIP | toolkit/components/remotebrowserutils/tests/browser/browser_RemoteWebNavigation.js | took 0ms
23:38:15 INFO - dir: browser/base/content/test/pageinfo
23:38:15 INFO - Increasing default timeout to 90 seconds
23:38:17 INFO - mozprofile.addons WARNING | Could not install c:\slave\test\build\tests\mochitest\extensions\mozscreenshots: [Errno 2] No such file or directory: 'c:\\slave\\test\\build\\tests\\mochitest\\extensions\\mozscreenshots\\install.rdf'
23:38:18 INFO - c:\slave\test\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL
23:38:18 INFO - MochitestServer : launching [u'c:\\slave\\test\\build\\tests\\bin\\xpcshell.exe', '-g', 'c:\\slave\\test\\build\\application\\firefox', '-v', '170', '-f', 'c:\\slave\\test\\build\\tests\\bin\\components\\httpd.js', '-e', "const _PROFILE_PATH = 'c:\\\\users\\\\cltbld\\\\appdata\\\\local\\\\temp\\\\tmpit0jny.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', 'c:\\slave\\test\\build\\tests\\mochitest\\server.js']
23:38:18 INFO - runtests.py | Server pid: 3420
23:38:18 INFO - runtests.py | Websocket server pid: 3372
23:38:18 INFO - runtests.py | SSL tunnel pid: 3700
23:38:18 INFO - Couldn't convert chrome URL: chrome://branding/locale/brand.properties
23:38:18 INFO - Couldn't convert chrome URL: chrome://branding/locale/brand.properties
23:38:19 INFO - runtests.py | Running with e10s: False
23:38:19 INFO - runtests.py | Running tests: start.
23:38:19 INFO -
23:38:19 INFO - Application command: c:\slave\test\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\cltbld\appdata\local\temp\tmpit0jny.mozrunner
23:38:19 INFO - [3420] WARNING: Could not get the program name for a cubeb stream.: 'NS_SUCCEEDED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/dom/media/CubebUtils.cpp, line 317
23:38:19 INFO - runtests.py | Application pid: 1956
23:38:19 INFO - TEST-INFO | started process GECKO(1956)
23:38:19 INFO - GECKO(1956) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to c:\users\cltbld\appdata\local\temp\tmpit0jny.mozrunner\runtests_leaks.log
23:38:19 INFO - GECKO(1956) | [1956] WARNING: Failed to load startupcache file correctly, removing!: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/startupcache/StartupCache.cpp, line 218
23:38:19 INFO - GECKO(1956) | [1956] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpcom/base/nsSystemInfo.cpp, line 117
23:38:19 INFO - GECKO(1956) | [1956] WARNING: This method is lossy. Use GetCanonicalPath !: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpcom/io/nsLocalFileWin.cpp, line 3572
23:38:23 INFO - GECKO(1956) | ++DOCSHELL 0F245000 == 1 [pid = 1956] [id = {ee67b6f8-4579-4dd7-9749-c9cc84f28909}]
23:38:23 INFO - GECKO(1956) | ++DOMWINDOW == 1 (0F245400) [pid = 1956] [serial = 1] [outer = 00000000]
23:38:23 INFO - GECKO(1956) | ++DOMWINDOW == 2 (0F245800) [pid = 1956] [serial = 2] [outer = 0F245400]
23:38:24 INFO - GECKO(1956) | [1956] WARNING: DWM not enabled, falling back to software vsync: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/gfx/thebes/gfxWindowsPlatform.cpp, line 1895
23:38:24 INFO - GECKO(1956) | ++DOCSHELL 0F48B400 == 2 [pid = 1956] [id = {26704789-4e7c-4371-a12d-f4dccec85fe1}]
23:38:24 INFO - GECKO(1956) | ++DOMWINDOW == 3 (0F48B800) [pid = 1956] [serial = 3] [outer = 00000000]
23:38:24 INFO - GECKO(1956) | ++DOMWINDOW == 4 (0F48BC00) [pid = 1956] [serial = 4] [outer = 0F48B800]
23:38:26 INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26 INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26 INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26 INFO - GECKO(1956) | [1956] WARNING: getting z level of unregistered window: file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/xpfe/appshell/nsWindowMediator.cpp, line 620
23:38:26 INFO - GECKO(1956) | ++DOCSHELL 10DEBC00 == 3 [pid = 1956] [id = {d41c9df6-feec-49ee-af69-6ab54802bc26}]
23:38:26 INFO - GECKO(1956) | ++DOMWINDOW == 5 (10DEC000) [pid = 1956] [serial = 5] [outer = 00000000]
23:38:26 INFO - GECKO(1956) | ++DOMWINDOW == 6 (10DEC400) [pid = 1956] [serial = 6] [outer = 10DEC000]
23:38:27 INFO - GECKO(1956) | ++DOMWINDOW == 7 (11892400) [pid = 1956] [serial = 7] [outer = 0F48B800]
23:38:27 INFO - GECKO(1956) | ++DOCSHELL 0D548000 == 4 [pid = 1956] [id = {cfff4cf8-490e-4d3d-9635-7023fc43150a}]
23:38:27 INFO - GECKO(1956) | ++DOMWINDOW == 8 (0F244800) [pid = 1956] [serial = 8] [outer = 00000000]
23:38:27 INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27 INFO - GECKO(1956) | ++DOMWINDOW == 9 (12579400) [pid = 1956] [serial = 9] [outer = 0F244800]
23:38:27 INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27 INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27 INFO - GECKO(1956) | [1956] WARNING: 'NS_FAILED(rv)', file c:/builds/moz2_slave/m-in-w32-d-0000000000000000000/build/src/caps/BasePrincipal.cpp, line 380
23:38:27 INFO - GECKO(1956) | ++DOMWINDOW == 10 (12B63800) [pid = 1956] [serial = 10] [outer = 0F244800]
23:40:20 INFO - Traceback (most recent call last):
23:40:20 INFO - File "c:\slave\test\build\tests\mochitest\runtests.py", line 2493, in doTests
23:40:20 INFO - marionette_args=marionette_args,
23:40:20 INFO - File "c:\slave\test\build\tests\mochitest\runtests.py", line 2093, in runApp
23:40:20 INFO - self.marionette.start_session(timeout=port_timeout)
23:40:20 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
23:40:20 INFO - m._handle_socket_failure()
23:40:20 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
23:40:20 INFO - return func(*args, **kwargs)
23:40:20 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1319, in start_session
23:40:20 INFO - self.protocol, _ = self.client.connect()
23:40:20 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
23:40:20 INFO - self.sock.connect((self.addr, self.port))
23:40:20 INFO - File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
23:40:20 INFO - return getattr(self._sock,name)(*args)
23:40:20 INFO - error: [Errno 10061] No connection could be made because the target machine actively refused it
23:40:20 ERROR - Automation Error: Received unexpected exception while running application
23:40:20 ERROR -
Comment hidden (Intermittent Failures Robot)
Comment 2•8 years ago
This just started a week ago and we have 42 failures so far.
I am doing a handful of retriggers near where this started:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20debug%20x64%20mochitest&tochange=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&fromchange=d8e59908610f4e0c922ee01904c4817adf3934f0
This needs a lot of data to get confidence in a patch; right now I am looking at bug 1291397.
Whiteboard: [stockwell needswork]
Comment 3•8 years ago
It is hard to find the root cause for this:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20debug%20x64%20mochitest-e10s-1&tochange=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&fromchange=d8e59908610f4e0c922ee01904c4817adf3934f0&group_state=expanded
This hints at bug 1352389. Doing 70+ retriggers per revision gets me to that conclusion; I am happy to do more, though.
And to repeat the error:
10:41:52 INFO - runtests.py | Application pid: 2912
10:41:52 INFO - TEST-INFO | started process GECKO(2912)
10:43:53 INFO - Traceback (most recent call last):
10:43:53 INFO - File "C:\slave\test\build\tests\mochitest\runtests.py", line 2493, in doTests
10:43:53 INFO - marionette_args=marionette_args,
10:43:53 INFO - File "C:\slave\test\build\tests\mochitest\runtests.py", line 2093, in runApp
10:43:53 INFO - self.marionette.start_session(timeout=port_timeout)
10:43:53 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
10:43:53 INFO - m._handle_socket_failure()
10:43:53 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
10:43:53 INFO - return func(*args, **kwargs)
10:43:53 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1319, in start_session
10:43:53 INFO - self.protocol, _ = self.client.connect()
10:43:53 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
10:43:53 INFO - self.sock.connect((self.addr, self.port))
10:43:53 INFO - File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
10:43:53 INFO - return getattr(self._sock,name)(*args)
10:43:53 INFO - error: [Errno 10061] No connection could be made because the target machine actively refused it
10:43:53 ERROR - Automation Error: Received unexpected exception while running application
10:43:53 ERROR -
This seems to happen in the middle of a test pass, or randomly at the beginning, etc.
:smaug, I see you authored bug 1352389. Do you think it could be causing Marionette to fail to connect to the server in the browser?
Flags: needinfo?(bugs)
Comment 4•8 years ago
Very, very unlikely.
That patch just dropped an extra script blocker from the stack, which means we don't need to allocate a script runner on the stack to run pending work when the outermost script blocker goes off the stack.
If my patch caused this, some setup is likely wrong elsewhere.
Flags: needinfo?(bugs)
Reporter
Updated•8 years ago
Summary: Intermittent mochitest Automation Error: Received unexpected exception while running application after error: [Errno 10061] No connection could be made because the target machine actively refused it → Intermittent mochitest/reftest Automation Error: Received unexpected exception while running application after error: [Errno 10061] No connection could be made because the target machine actively refused it
Comment hidden (Intermittent Failures Robot)
Comment 6•8 years ago
:whimboo, I see that Marionette is failing to establish a connection. Could you help shed some light on what is going on here?
Flags: needinfo?(hskupin)
Assignee
Comment 7•8 years ago
Do we know why this started on April 1st? Was there a change in mozilla-central? The first failure happened on inbound, by the way; maybe it got to central via a merge later that day. Or might this have been a change in the infrastructure?
I would propose that we check the pushlog between March 31st and April 1st.
A quick look showed me the following change for `mozJSComponentLoader.cpp`, and given that Marionette is a JS component I wonder if this could have had an influence:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&selectedJob=87998560
At least it looks like the Marionette code is not getting started at all, and as such no server socket will be created. But all this is vague given that I don't really have Gecko log output. If this failure is happening that often, why not enable Gecko logging and print it all out?
Flags: needinfo?(hskupin)
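For illustration, a minimal sketch of what enabling extra Gecko logging from the harness could look like, assuming the environment for the Firefox process is assembled somewhere in runtests.py; the helper name and the chosen MOZ_LOG modules are assumptions, not existing harness code:

# Hypothetical sketch: enable verbose Gecko (MOZ_LOG) output for the launched
# Firefox so that a silent startup failure at least leaves some trace in the
# harness log. Helper name, hook point and module list are assumptions.
import os

def build_env_with_gecko_logging(base_env=None):
    """Return a copy of the environment with MOZ_LOG-based logging enabled."""
    env = dict(base_env if base_env is not None else os.environ)
    # Verbose (level 5) socket logging; the output goes to stdout/stderr,
    # which the harness already forwards into the log as GECKO(pid) lines.
    env["MOZ_LOG"] = "nsSocketTransport:5,nsHostResolver:3"
    return env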
Comment 8•8 years ago
:whimboo, looking at a larger range, you can see this occurring a few pushes further in the past:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20x64%20debug&tochange=e31e16fd6f9f0ab9d4994f3ddc83c43b77370d80&fromchange=d8e59908610f4e0c922ee01904c4817adf3934f0
You can see we did some backouts around where this first started. Are there more ways to debug this, or do any of the revisions in that range look interesting to you?
Flags: needinfo?(hskupin)
Comment hidden (Intermittent Failures Robot)
Comment 10•8 years ago
The first occurrence of this is 6551d56d6587, but that is a clobber build; the previous clobber was 3 pushes earlier, at rev 13d070200dfe. Here is the range:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20x64%20debug&tochange=6551d56d65871c61e52bbd8e0a728ec70f488aed&fromchange=13d070200dfef7296a1d6b6aa0a877a0d37c434b
The only code change in there is 30fe7c61b5db, but that was backed out shortly thereafter and we still see this failure, so I am confused about why we get it.
I would prefer to debug this as a real problem rather than tracking it down by bisecting.
Looking at:
10:41:52 INFO - TEST-INFO | started process GECKO(2912)
10:43:53 INFO - Traceback (most recent call last):
10:43:53 INFO - File "C:\slave\test\build\tests\mochitest\runtests.py", line 2493, in doTests
10:43:53 INFO - marionette_args=marionette_args,
10:43:53 INFO - File "C:\slave\test\build\tests\mochitest\runtests.py", line 2093, in runApp
10:43:53 INFO - self.marionette.start_session(timeout=port_timeout)
10:43:53 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
10:43:53 INFO - m._handle_socket_failure()
10:43:53 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
10:43:53 INFO - return func(*args, **kwargs)
10:43:53 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1319, in start_session
10:43:53 INFO - self.protocol, _ = self.client.connect()
10:43:53 INFO - File "C:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
10:43:53 INFO - self.sock.connect((self.addr, self.port))
10:43:53 INFO - File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
10:43:53 INFO - return getattr(self._sock,name)(*args)
10:43:53 INFO - error: [Errno 10061] No connection could be made because the target machine actively refused it
10:43:53 ERROR - Automation Error: Received unexpected exception while running application
Why do we actively refuse a connection? Is there more info we could get here, such as the processes running, a debug log of Firefox networking, or open ports and connections on the machine?
Comment 11•8 years ago
This could be the Windows version of bug 1261598. I'll see if I can verify that.
Flags: needinfo?(gbrown)
Assignee
Comment 12•8 years ago
(In reply to Joel Maher ( :jmaher) from comment #8)
> you can see we did some backouts around where this first started. Is there
> more ways to debug this or do any of the revisions in that range look
> interesting to you?
There is nothing obvious to me. Sorry.
(In reply to Joel Maher ( :jmaher) from comment #10)
> why do we actively refuse a connection? Is there more info we could get
> here about processes running, a debug log of firefox networking? open ports
> and connections on the machine?
We could add some psutil output to the log when we run test jobs through Mozharness, especially in failure cases. Maybe this could help.
Flags: needinfo?(hskupin)
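As a rough illustration of the psutil idea, the harness could dump something like the following when the Marionette connection fails; the helper name, the logger interface, and the call site are assumptions:

# Hypothetical failure-time diagnostics: list running processes and anything
# bound to the Marionette port. Helper name and call site are assumptions.
import psutil

def log_system_state(log, marionette_port=2828):
    """Dump process and socket info to help debug 'connection refused'."""
    for proc in psutil.process_iter():
        try:
            info = proc.as_dict(attrs=["pid", "name", "cmdline"])
        except psutil.NoSuchProcess:
            continue
        log.info("process: {}".format(info))
    for conn in psutil.net_connections(kind="inet"):
        # Show whether anything is (still) bound to the Marionette port.
        if conn.laddr and conn.laddr[1] == marionette_port:
            log.info("port {} owned by pid {} (status: {})".format(
                marionette_port, conn.pid, conn.status))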
Assignee
Comment 13•8 years ago
I would propose that we get try builds up with tracing output enabled for Marionette. It could give us a hint as to whether the server component fails to create the socket.
This can be done by the consumer of Marionette, which in those cases is the mochitest runner: there is a `verbose` argument to the Marionette `__init__()` method, and it should be set to `2`.
https://dxr.mozilla.org/mozilla-central/rev/a374c35469935a874fefe64d3e07003fc5bc8884/testing/mochitest/runtests.py#2103
Otherwise, setting the preference `marionette.logging` should also work. We could do that via the prefs_general.js file which mochitests also use, couldn't we? I would suggest that we turn this on by default for all test suites; given that Marionette is only used to install add-ons, it should not add that much to the log file anyway.
Flags: needinfo?(ato)
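To make the two options above concrete, a small sketch; only the `verbose=2` argument and the `marionette.logging` preference come from the comment, the surrounding host/port values and the pref value are assumptions:

# Sketch of the two options described above; surrounding details are assumed.
from marionette_driver.marionette import Marionette

# Option 1: construct the harness' Marionette client with trace output.
marionette = Marionette(host="127.0.0.1", port=2828, verbose=2)

# Option 2: enable server-side logging through a preference in the test
# profile (depending on the release this pref takes a boolean or a level).
extra_prefs = {"marionette.logging": "trace"}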
Assignee
Updated•8 years ago
Flags: needinfo?(ato)
Comment 14•8 years ago
(In reply to Geoff Brown [:gbrown] from comment #11)
> This could be the Windows version of bug 1261598. I'll see if I can verify that.
I've been trying, but I cannot reproduce this failure with my logging. Unlucky?
https://treeherder.mozilla.org/#/jobs?repo=try&revision=19f0da4c10f4e09852e1c5b5483923a14d10d9e9
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bba198dad5ebe86ad35561727a8c97b36e62b79
Comment 15•8 years ago
I finally reproduced this with my logging patch applied, but the failed run never reached my logging. That suggests that Firefox is hanging on startup even earlier than in bug 1261598.
https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown@mozilla.com&selectedJob=92818111
Flags: needinfo?(gbrown)
Comment 16•8 years ago
Better link, for the curious: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6635301672f6595ade83979a93a912a164df7868&selectedJob=92818111
Comment 17•8 years ago
I notice now that failed runs do not have the normal startup chatter:
19:56:44 INFO - TEST-INFO | started process GECKO(2968)
19:56:44 INFO - GECKO(2968) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to c:\users\cltbld~1.t-w\appdata\local\temp\tmpdtspxy.mozrunner\runtests_leaks.log
19:56:44 INFO - GECKO(2968) | [2968] WARNING: Failed to load startupcache file correctly, removing!: file c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/startupcache/StartupCache.cpp, line 218
19:56:44 INFO - GECKO(2968) | [2968] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/xpcom/base/nsSystemInfo.cpp, line 117
19:56:45 INFO - GECKO(2968) | [2968] WARNING: This method is lossy. Use GetCanonicalPath !: file c:/builds/moz2_slave/autoland-w64-d-000000000000000/build/src/xpcom/io/nsLocalFileWin.cpp, line 3572
vs
16:02:50 INFO - TEST-INFO | started process GECKO(3844)
16:04:51 INFO - Traceback (most recent call last):
...
16:04:51 INFO - error: [Errno 10061] No connection could be made because the target machine actively refused it
16:04:51 ERROR - Automation Error: Received unexpected exception while running application
I think that means we're not even running NS_InitXPCOM.
Comment hidden (Intermittent Failures Robot)
Comment 19•8 years ago
This is still failing quite frequently. As this is a Marionette issue, can we get someone with Marionette expertise to own it?
Flags: needinfo?(dburns)
Comment 20•8 years ago
I haven't managed to get back to my earlier investigation, but my conclusion from comments 15 through 17 was that the browser is hanging on startup well before Marionette is started. I think it's anything but Marionette.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 23•8 years ago
(In reply to Joel Maher ( :jmaher) from comment #19)
> this is still failing quite frequently- as this is a marionette issue, can
> we get someone with marionette expertise to own this?
According to comment 20, this is a bigger issue than Marionette; I will find an owner if and when it becomes a Marionette issue.
Flags: needinfo?(dburns)
Updated•8 years ago
Flags: needinfo?(gbrown)
Comment 24•8 years ago
A significant number of these failures have been reported under bug 1353236; that's been closed so I expect a frequency increase here.
Comment hidden (Intermittent Failures Robot)
Comment 26•8 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d5597553484fcdaf5333410cfd11d02b6302de4 reproduces startup hangs on Windows 8.
Execution reaches:
https://dxr.mozilla.org/mozilla-central/rev/a748acbebbde373a88868dc02910fb2bc5e6a023/toolkit/xre/nsWindowsWMain.cpp#109
but does not reach:
https://dxr.mozilla.org/mozilla-central/rev/a748acbebbde373a88868dc02910fb2bc5e6a023/xpcom/build/XPCOMInit.cpp#468
Comment hidden (Intermittent Failures Robot)
Comment 28•8 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=674cdc86a4519873f20e0d7309d71c1fea547431
Execution reaches:
https://dxr.mozilla.org/mozilla-central/rev/a748acbebbde373a88868dc02910fb2bc5e6a023/toolkit/xre/nsWindowsWMain.cpp#109
https://dxr.mozilla.org/mozilla-central/rev/4a6a71f4aa22e4dc3961884ce505ce34bdd799a2/browser/app/nsBrowserApp.cpp#306
but does not reach:
https://dxr.mozilla.org/mozilla-central/rev/a748acbebbde373a88868dc02910fb2bc5e6a023/xpcom/build/XPCOMInit.cpp#468
Assignee: nobody → gbrown
Comment 29•8 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1f6cbbd3257e65c28afb50e5a00e089a3e633493
Execution reaches:
https://dxr.mozilla.org/mozilla-central/rev/a748acbebbde373a88868dc02910fb2bc5e6a023/toolkit/xre/nsWindowsWMain.cpp#109
https://dxr.mozilla.org/mozilla-central/rev/4a6a71f4aa22e4dc3961884ce505ce34bdd799a2/browser/app/nsBrowserApp.cpp#306
https://dxr.mozilla.org/mozilla-central/rev/33b92d9c40562dab3d7b602368c75619f1d793f7/browser/app/nsBrowserApp.cpp#235
but does not reach:
https://dxr.mozilla.org/mozilla-central/rev/a748acbebbde373a88868dc02910fb2bc5e6a023/xpcom/build/XPCOMInit.cpp#468
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 33•8 years ago
I narrowed the hang (crash?) down to bug 1362901: a problem in install_rust_panic_hook(), which was introduced just before these failures started.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 36•8 years ago
There has been an excellent reduction in frequency here due to the change in bug 1362901/bug 1358151.
I will keep this open for a bit to monitor, but we may be able to resolve it as a duplicate of bug 1358151.
Whiteboard: [stockwell needswork] → [stockwell fixed]
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 44•8 years ago
I can see the following in the logs:
19:05:03 INFO - GECKO(1144) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x28008A,name=PBrowser::Msg_UpdateNativeWindowHandle) Channel error: cannot send/recv
19:05:03 INFO - GECKO(1144) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x28007E,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv
19:05:03 INFO - GECKO(1144) | Unable to read VR Path Registry from C:\Users\cltbld.T-W864-IX-319.001\AppData\Local\openvr\openvrpaths.vrpath
19:05:03 INFO - GECKO(1144) | JavaScript error: resource:///modules/ContentCrashHandlers.jsm, line 135: TypeError: WeakMap key must be an object, got undefined
Judging by the JavaScript failure, this refers to an "oop-frameloader-crashed" situation, which would perfectly explain the behavior we see here. Due to that failure the crash handler doesn't seem to kill Firefox, and it hangs.
I will file a bug to get the JavaScript error fixed.
Assignee
Updated•8 years ago
Whiteboard: [stockwell fixed] → [stockwell unknown]
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 55•7 years ago
There has been a huge increase in these failures since Sep 15th, where Marionette fails 120s after application startup:
14:20:14 INFO - Application command: C:\slave\test\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\cltbld\appdata\local\temp\tmpcv9aot.mozrunner
14:20:14 INFO - runtests.py | Application pid: 7332
14:20:14 INFO - TEST-INFO | started process GECKO(7332)
14:22:15 INFO - Traceback (most recent call last):
Generally this failure affects tests run on Windows TC workers. Because I worked on bug 1399592 last week, I wondered about the `startup_timeout` of Marionette for Windows workers, and indeed the default value is used, which is 120s now. For Linux we use 180s.
I would propose that we explicitly set this timeout in the mozharness configs for Windows; that should drop the number of failures again. Otherwise this also depends on bug 1382162.
Depends on: 1382162
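A sketch of what explicitly bumping the timeout could look like where the harness builds its marionette_args; the 180s value mirrors the Linux setting mentioned above, and the exact keys are assumptions:

# Illustrative only: pass an explicit startup timeout instead of relying on
# the 120 s default; 180 s mirrors what is used on Linux per the comment above.
from marionette_driver.marionette import Marionette

marionette_args = {
    "host": "127.0.0.1",
    "port": 2828,
    "startup_timeout": 180,  # seconds to wait for the Marionette port
}
marionette = Marionette(**marionette_args)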
Assignee
Comment 56•7 years ago
With my recent changes the startup_timeout should now be 180s, but I still see tests failing after 120s:
09:24:40 INFO - Application command: C:\slave\test\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\cltbld\appdata\local\temp\tmpjr0x3l.mozrunner
09:24:40 INFO - runtests.py | Application pid: 5188
09:24:40 INFO - TEST-INFO | started process GECKO(5188)
09:26:41 INFO - Traceback (most recent call last):
I wonder where this is coming from. Maybe somewhere we do not correctly set the startup_timeout of Marionette and as such fall back to the default of 120s.
Assignee
Comment 57•7 years ago
So I think that something is really busted with the build when this failure happens. The reason is that in all the other cases we see the following within the same run:
18:18:10 INFO - Application command: C:\slave\test\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\cltbld\appdata\local\temp\tmpjajzd5.mozrunner
18:18:10 INFO - runtests.py | Application pid: 1372
18:18:10 INFO - TEST-INFO | started process GECKO(1372)
18:18:11 INFO - GECKO(1372) | 1505783891676 Marionette INFO Enabled via --marionette
18:18:16 INFO - GECKO(1372) | 1505783896234 Marionette INFO Listening on port 2828
Marionette is enabled immediately, within 1s, and starts listening after 6s. Why does nothing happen during some starts of Firefox? We should at least see "Enabled via --marionette", which means the Marionette component got loaded, registered, and enabled. But nothing happens within 2min. It gives me the feeling that something is horribly broken during application startup.
Lately this only happens on Windows 10 machines for opt/pgo builds.
Blocks: 1386590
OS: Unspecified → Windows 10
Assignee
Comment 58•7 years ago
I checked a couple of the latest results as reported through OrangeFactor and noticed that only test jobs which still run on the buildbot testers seem to be affected. All the jobs run via TC are fine.
Joel, what is our plan with those old builders? Do we have to keep them for now due to some specific hardware requirements?
I could bump the startup_timeout for Marionette for those jobs by modifying the mochitest options here:
https://hg.mozilla.org/mozilla-central/file/tip/testing/mozharness/configs/unittests/win_unittest.py#l90
Flags: needinfo?(jmaher)
Comment 59•7 years ago
That is a good way to reduce the problem. It is odd that we don't see this on debug; maybe that helps narrow things down further. The failures are mostly mochitest-chrome-X on win10-pgo; there are other data points, but that is the largest sample.
Historically we have not been able to get mochitest-chrome-X converted to AWS for Windows due to 1 or 2 tests; possibly we need to push harder on those, disable them, and migrate :)
As it stands there are no concrete plans to move the Windows 10 tests from buildbot to AWS; the list right now is:
* mochitest-media
* mochitest-clipboard
* mochitest-gpu
* mochitest-webgl
* mochitest-browser-chrome
* mochitest-chrome
* reftest
* reftest-no-accel
We are really close to getting all the tests running on Windows 10. What is odd is that we don't see these issues on Windows 8, which is 100% hardware.
Flags: needinfo?(jmaher)
Comment 60•7 years ago
Here is a try push running mochitest-chrome on TaskCluster VMs for Win 7/10:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=77b341eb9ba768225a79fbfbe734c689afe2487f
^ jobs are still in progress.
Comment 61•7 years ago
When the trees reopen, I will move mochitest-chrome to VMs, which will reduce the frequency of this failure.
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 63•7 years ago
Joel, the browser-chrome mochitest suite seems to be the most affected, and it still runs on buildbot machines. Are there blockers here too, or could this suite also be moved over to TC as you did for mochitest-chrome yesterday?
Flags: needinfo?(jmaher)
Comment 64•7 years ago
I moved mochitest-chrome to TaskCluster; browser-chrome had a lot of leaks which were not trivial to track down, although on opt we could get browser-chrome onto TC faster. As a note, I think on Tuesday we moved browser-chrome to Windows 10, so that is why we didn't see it last week.
Flags: needinfo?(jmaher)
Assignee
Comment 65•7 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #64)
> browser-chrome to windows10, so that is why we didn't see it last week.
On which platform version did those tests run before? Was it Windows 7 or 8? Do we already have a bug which tracks the move of those jobs to TC? Getting opt done earlier would be great, because that's exactly the type of build which fails most.
Comment 66•7 years ago
We migrated from Windows 8 to Windows 10 (hardware to hardware, bug 1397226).
Right now opt is failing quite a bit; it looks like a few unique test cases, so maybe this is doable:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4dcb574fae73c45cf5ebeb7db795a05cae79f00a
Comment 67•7 years ago
OK, here is a new push with 5 tests disabled. Let's see how green it is, and if it looks good we can probably work on fixing those 5 issues:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=148c429105c2f1376c870d0f00592f87c23480cc
Updated•7 years ago
Whiteboard: [stockwell unknown] → [stockwell needswork]
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 73•7 years ago
Joel, there was a massive increase in this failure on Saturday. Can you recall anything which landed lately that could have caused it? It's mainly for tests running on Windows 10 (buildbot machines).
Flags: needinfo?(jmaher)
Comment 74•7 years ago
The main thing that would have changed is the screen resolution. The good news is that mochitest-browser-chrome is ready to move to virtual machines; sadly, I don't think that will help here.
As a note, the spike on Saturday is actually misleading: we have fewer builds but the same number of failures, so there is no coalescing and we get a full set of failures instead of some builds with no failures and some builds with many.
Flags: needinfo?(jmaher)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 76•7 years ago
It looks like the move of the mochitest-browser-chrome tests from hardware to VMs (bug 1402068) also helped a lot with this bug. Since it landed on inbound yesterday, no more failures are visible for this kind of job.
Depends on: 1402068
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 78•7 years ago
All of those failures are coming from Buildbot machines. So the move of browser-chrome mochitests to TC workers helped a lot!
Similar to bug 1261598, the most affected test jobs are mochitest-gl-e10s-*, but I also see mochitest-media-e10s. Joel, what holds us back from getting those onto TC? Is that related to graphics too? Could we at least file tracking bugs for those outstanding migrations, so that we can mark dependencies correctly?
Flags: needinfo?(jmaher)
Comment 79•7 years ago
I will work on getting bug 1403523 annotated properly; it will take a day or two as I am in meetings most of Wednesday and Thursday.
Flags: needinfo?(jmaher)
Comment 80•7 years ago
We just got a bunch of these in bug 1403220, e.g.:
10:35:13 INFO - REFTEST INFO | Application command: Z:\task_1506418603\build\application\firefox\firefox.exe -marionette -profile c:\users\genericworker\appdata\local\temp\tmpnrbrlw.mozrunner
10:35:14 INFO - 1506422114370 Marionette INFO Enabled via --marionette
10:38:14 INFO - Traceback (most recent call last):
10:38:14 INFO - File "Z:\task_1506418603\build\tests\reftest\runreftest.py", line 854, in <module>
10:38:14 INFO - sys.exit(run_test_harness(parser, options))
10:38:14 INFO - File "Z:\task_1506418603\build\tests\reftest\runreftest.py", line 846, in run_test_harness
10:38:14 INFO - result = reftest.runTests(options.tests, options)
10:38:14 INFO - File "Z:\task_1506418603\build\tests\reftest\runreftest.py", line 507, in runTests
10:38:14 INFO - return self.runSerialTests(manifests, options, cmdargs)
10:38:14 INFO - File "Z:\task_1506418603\build\tests\reftest\runreftest.py", line 766, in runSerialTests
10:38:14 INFO - debuggerInfo=debuggerInfo)
10:38:14 INFO - File "Z:\task_1506418603\build\tests\reftest\runreftest.py", line 702, in runApp
10:38:14 INFO - marionette.start_session()
10:38:14 INFO - File "Z:\task_1506418603\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
10:38:14 INFO - m._handle_socket_failure()
10:38:14 INFO - File "Z:\task_1506418603\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
10:38:14 INFO - return func(*args, **kwargs)
10:38:14 INFO - File "Z:\task_1506418603\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1228, in start_session
10:38:14 INFO - self.protocol, _ = self.client.connect()
10:38:14 INFO - File "Z:\task_1506418603\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
10:38:14 INFO - self.sock.connect((self.addr, self.port))
10:38:14 INFO - File "c:\mozilla-build\python\Lib\socket.py", line 228, in meth
10:38:14 INFO - return getattr(self._sock,name)(*args)
10:38:14 INFO - socket.error: [Errno 10061] No connection could be made because the target machine actively refused it
10:38:14 ERROR - Return code: 1
10:38:14 ERROR - No tests run or test summary not found
In that bug, it appears as if Firefox is crashing during or shortly after startup.
What's really unfortunate is that the socket I/O error thrown by Marionette appears to be masking a general application-level failure. It isn't obvious from the log output that the launched process died unexpectedly. At the very least, I'd expect some kind of log message around unexpected process termination. Ideally it would print any crash info, if available.
Depends on: 1403220
Assignee
Comment 81•7 years ago
Hm, interesting! So this is indeed a problem in the test harness which uses Marionette: it doesn't correctly handle the exception thrown by marionette_driver, and as such cannot print out the crash details.
Check this:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#2156
If `start_session` throws an exception, we completely skip everything after it! This is awful! I believe the same applies to reftests, and maybe wptrunner.
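A sketch of the kind of handling the harness could grow around `start_session`, so that a crash gets reported instead of being hidden behind the socket error; the mozcrash call and the surrounding argument names are assumptions about the call site, not the actual patch:

# Sketch: surface crash information instead of letting the Marionette socket
# error mask it. Argument names and the mozcrash usage are assumptions.
import mozcrash

def start_marionette_session(marionette, log, dump_dir, symbols_path):
    try:
        return marionette.start_session()
    except IOError:
        # The browser most likely crashed or hung before the Marionette
        # server started listening, so the raw socket error is misleading.
        if not mozcrash.check_for_crashes(dump_dir, symbols_path,
                                          test_name="harness startup"):
            log.error("application terminated or hung before Marionette "
                      "was ready; no minidump found")
        raise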
Assignee
Comment 82•7 years ago
Bug 1403220 doesn't really block fixing this failure. The patch I already have up on bug 1403616 will actually do it, and then this failure and bug 1261598 should be gone completely.
No longer depends on: 1403220
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 85•7 years ago
With the patch from bug 1403616 landed, we no longer see this failure, but instead bug 1397201.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox57:
--- → fixed
status-firefox58:
--- → fixed
Resolution: --- → FIXED
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:other]
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Component: Mochitest Chrome → Mochitest
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)