mac wpt tests failing with marionette error
Categories
(Testing :: web-platform-tests, defect, P3)
Tracking
(Not tracked)
People
(Reporter: dhouse, Unassigned)
Details
Attachments
(3 files)
I see failures (380 in the last month) on the Mac minis for test-macosx64/debug-web-platform-tests-e10s-12, ending with an exception about "Loading initial page http://web-platform.test:8000/testharness_runner.html failed.". The failure is intermittent; the test does not fail every time.
Example from this morning (https://tools.taskcluster.net/groups/aywPnxcNRZi3786fsCc4Kw/tasks/afHEO9sSR7-F6QAgBQDutQ/runs/0):
"t-yosemite-r7-145":{"afHEO9sSR7-F6QAgBQDutQ":["exception 2019-02-04T18:34:15.754Z 2019-02-04T19:28:17.591Z","test-macosx64/debug-web-platform-tests-e10s-12"],[""up 1:37","worker 0","","be1beccfb86d394a75f4af64dd8e21f5fff8ff88","",10.11.3
11:14:30 WARNING - Forcibly terminating runner process
11:20:30 WARNING - Timed out waiting for browser to start
11:20:30 WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
11:20:30 CRITICAL - Loading initial page http://web-platform.test:8000/testharness_runner.html failed. Ensure that the there are no other programs bound to this port and that your firewall rules or network setup does not prevent access.
11:20:30 CRITICAL - Traceback (most recent call last):
11:20:30 CRITICAL - File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in load_runner
11:20:30 CRITICAL - self.dismiss_alert(lambda: self.marionette.navigate(url))
11:20:30 CRITICAL - File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 180, in dismiss_alert
11:20:30 CRITICAL - f()
11:20:30 CRITICAL - File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in <lambda>
11:20:30 CRITICAL - self.dismiss_alert(lambda: self.marionette.navigate(url))
11:20:30 CRITICAL - AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'
11:20:30 CRITICAL -
11:20:30 CRITICAL - Traceback (most recent call last):
11:20:30 CRITICAL - File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 141, in start_runner
11:20:30 CRITICAL - runner.run()
11:20:30 CRITICAL - File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 86, in run
11:20:30 CRITICAL - self.setup()
11:20:30 CRITICAL - File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 73, in setup
11:20:30 CRITICAL - self.executor.setup(self)
11:20:30 CRITICAL - File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 664, in setup
11:20:30 CRITICAL - self.protocol.testharness.load_runner(self.last_environment["protocol"])
11:20:30 CRITICAL - File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in load_runner
11:20:30 CRITICAL - self.dismiss_alert(lambda: self.marionette.navigate(url))
11:20:30 CRITICAL - File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 180, in dismiss_alert
11:20:30 CRITICAL - f()
11:20:30 CRITICAL - File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in <lambda>
11:20:30 CRITICAL - self.dismiss_alert(lambda: self.marionette.navigate(url))
11:20:30 CRITICAL - AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'
11:20:30 CRITICAL -
11:20:30 WARNING - Command left in remote_queue during cleanup: u'stop', ()
11:20:30 WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
11:20:30 WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
11:20:30 CRITICAL - Test harness output was not a valid structured log message:
11:20:30 CRITICAL - Traceback (most recent call last):
11:20:30 CRITICAL - File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
11:20:30 CRITICAL - send(obj)
11:20:30 CRITICAL - IOError: [Errno 32] Broken pipe
11:20:31 CRITICAL - # TBPL FAILURE #
11:20:31 WARNING - setting return code to 2
11:20:32 WARNING - returning nonzero exit status 2
The logs do not get uploaded because the worker kills itself off. I'll download the logs from one instance and attach them.
It looks like Firefox is failing to start (with an error about the WindowServer); a hypothetical check for that state is sketched after the log excerpt below:
Application command: /Users/cltbld/tasks/task_1549308498/build/application/Firefox NightlyDebug.app/Contents/MacOS/firefox --marionette about:blank -foreground -profile /var/folders/wp/68vm064n32q2f0zx_5hh1z4h00000x/T/tmpFqSCTV.mozrunner
Starting runner
PID 1733 | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /var/folders/wp/68vm064n32q2f0zx_5hh1z4h00000x/T/tmpFqSCTV.mozrunner/runtests_leaks_655.log
PID 1733 | _RegisterApplication(), FAILED TO establish the default connection to the WindowServer, _CGSDefaultConnection() is NULL.
Browser exited with return code -15
[...]
Application command: /Users/cltbld/tasks/task_1549308498/build/application/Firefox NightlyDebug.app/Contents/MacOS/firefox --marionette about:blank -foreground -profile /var/folders/wp/68vm064n32q2f0zx_5hh1z4h00000x/T/tmp1Qb0ct.mozrunner
Starting runner
STDERR: Exception in thread Thread-65:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 1082, in run
self.function(*self.args, **self.kwargs)
File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 220, in init_timeout
self.send_message("init_failed")
File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 214, in send_message
self.command_queue.put((command, args))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 100, in put
assert not self._closed
AssertionError
Timed out waiting for browser to start
Browser exited with return code -15
PROCESS LEAKS /var/folders/wp/68vm064n32q2f0zx_5hh1z4h00000x/T/tmp1Qb0ct.mozrunner/runtests_leaks_655.log
leakcheck | refcount logging is off, so leaks can't be detected!
Loading initial page http://web-platform.test:8000/testharness_runner.html failed. Ensure that the there are no other programs bound to this port and that your firewall rules or network setup does not prevent access.
Traceback (most recent call last):
File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in load_runner
self.dismiss_alert(lambda: self.marionette.navigate(url))
File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 180, in dismiss_alert
f()
File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in <lambda>
self.dismiss_alert(lambda: self.marionette.navigate(url))
AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'
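For context, the "_CGSDefaultConnection() is NULL" message generally means the process has no connection to the macOS WindowServer, i.e. the task user has no GUI/console session. A hypothetical diagnostic (an assumption on my part, not existing tooling in wptrunner or generic-worker) that could flag this state before launching Firefox:

# Hypothetical pre-launch diagnostic (assumption, not part of wptrunner or
# generic-worker): check that the task user owns the macOS console session.
# Without one, Firefox cannot connect to the WindowServer and fails as above.
import getpass
import subprocess

def console_user():
    # The owner of /dev/console is the user logged in to the GUI session.
    return subprocess.check_output(["stat", "-f", "%Su", "/dev/console"]).decode().strip()

if console_user() != getpass.getuser():
    raise SystemExit("no GUI session for %s; Firefox will not be able to reach "
                     "the WindowServer" % getpass.getuser())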
The example from earlier ran twice more. The second run hit the same problem (on machine #130), and the final run completed successfully (on machine #16, which had failed the same test 8 hours earlier for a different task group).
https://tools.taskcluster.net/groups/aywPnxcNRZi3786fsCc4Kw/tasks/afHEO9sSR7-F6QAgBQDutQ/runs/2
Hi Joel,
This is a failure specific to debug-web-platform-tests-e10s-12 that has happened regularly over the last month; it was not appearing last summer/fall from what I can see.
I've checked, and the test has passed on the same machines it has failed on, and I did not find any warnings or other issues around the times the failures occurred, so it looks like a problem with the test itself.
Do you know where I should send this bug?
Comment 6•6 years ago
I don't think this is a relops issue, and I don't see any reason it should be employee-only.
It looks like the harness is crashing out; in the supplied log we see
12:12:11 INFO - STDERR: Exception in thread Thread-65:
12:12:11 INFO - Traceback (most recent call last):
12:12:11 INFO - File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
12:12:11 INFO - self.run()
12:12:11 INFO - File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 1082, in run
12:12:11 INFO - self.function(*self.args, **self.kwargs)
12:12:11 INFO - File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 220, in init_timeout
12:12:11 INFO - self.send_message("init_failed")
12:12:11 INFO - File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 214, in send_message
12:12:11 INFO - self.command_queue.put((command, args))
12:12:11 INFO - File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 100, in put
12:12:11 INFO - assert not self._closed
After that assert fires we don't recover, and we later try to access an attribute that doesn't exist, presumably because the object was never initialised. I'm not sure why this would happen with that job in particular; it might be related to a specific test or to an environmental condition such as OOM.
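A minimal, self-contained sketch of that failure mode (simplified stand-ins for the real wptrunner classes, not the actual implementation): if the browser never starts, setup() returns without assigning the Marionette client, and the later navigation call surfaces as the AttributeError rather than the original startup failure.

# Minimal sketch of the failure mode (hypothetical names, not wptrunner itself).
class FakeMarionette(object):
    def navigate(self, url):
        print("navigating to %s" % url)

class TestharnessProtocolPart(object):
    def setup(self, browser_started):
        if not browser_started:
            # Browser start timed out; we bail out without ever assigning
            # self.marionette, and the failure is only reported asynchronously.
            return
        self.marionette = FakeMarionette()

    def load_runner(self, url):
        # Mirrors the navigate() call in executormarionette.py's load_runner():
        # it assumes setup() completed successfully.
        self.marionette.navigate(url)

part = TestharnessProtocolPart()
part.setup(browser_started=False)   # startup failed; error effectively swallowed
part.load_runner("http://web-platform.test:8000/testharness_runner.html")
# -> AttributeError: 'TestharnessProtocolPart' object has no attribute 'marionette'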
Thanks! I had marked it employee-only because I was attaching logs whose content I had not reviewed. I'm guessing they are the same as what is uploaded publicly in the Taskcluster log results for a run.
Comment 8•6 years ago
I am not sure this is a simple harness issue; possibly the machine is in a bad state rather than there being a problem with the tools or Firefox. I wonder:
- whether this is seen on other platforms/configs
- how frequently this occurs
- whether there is a pattern of jobs/failures/tools that run prior to these instances on the same machine
Possibly we need some tooling that runs before starting a test to ensure the machine is in a ready state, so that we could make this a 'blue' (retry) job on Treeherder and ignore the failure; a rough sketch of what I mean is below.
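This is hypothetical, not existing tooling; the port number comes from the error message above, and the exit-code handling is an assumption about how the task wrapper would treat it:

# Hypothetical ready-state check, run before the wpt harness starts.
import socket
import sys

def port_is_free(port, host="0.0.0.0"):
    # If we can bind the port, nothing else is currently listening on it.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind((host, port))
        return True
    except socket.error:
        return False
    finally:
        s.close()

if not port_is_free(8000):
    # A distinct exit code the wrapper could map to a "blue" retry on Treeherder.
    sys.exit(4)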
I also second :jgraham's point that this doesn't need to be confidential.
:dhouse, do you have thoughts on my three questions?
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #8)
- this is seen on other platforms/configs
I'm not certain whether we have seen this on other platforms. Perhaps someone can check the history for this test across multiple platforms? (I don't know where to look for that.)
- how frequently this occurs
For frequency on the Macs, I have counts of the failures I recovered machines from, but I don't know how many other times the test ran without failing.
I'll collect a list of the test/task IDs for the failures and attach it to this bug.
- is there a pattern of jobs/failures/tools that run prior to these instances on the same machine
I'll look at a sampling of failures today to see if there is any pattern. I think there is not one; the machines are rebooted between tasks. I'll try running the test on staging to see if I can reproduce it and then observe and control the environment (trying with a newly imaged machine versus one that has run other tests, to see if something is left over).
Reporter
Comment 10•6 years ago
Reporter
Comment 11•6 years ago
I attached a list of the tests that hit this problem (the ones I recovered machines from) for this year so far, about a month:
[david@george logs]$ cat marionette_january.csv |awk '{print $4}'|sort|uniq -c
8 "test-macosx64/debug-test-verify-wpt-e10s",
382 "test-macosx64/debug-web-platform-tests-e10s-12",
8 "test-macosx64-devedition/opt-web-platform-tests-e10s-4",
24 "test-macosx64-nightly/opt-web-platform-tests-e10s-4",
4 "test-macosx64/opt-test-verify-wpt-e10s",
219 "test-macosx64/opt-web-platform-tests-e10s-4",
Reporter
Comment 12•6 years ago
I've caught it on 336 of the ~470 machines, and it is spread across them rather than concentrated on any subset:
[david@george logs]$ cat marionette_january.csv |tr '\r' ' ' |awk '{print $1}'|sort|uniq --count|sort
1 "t-yosemite-r7-006",
1 "t-yosemite-r7-011",
1 "t-yosemite-r7-012",
1 "t-yosemite-r7-013",
1 "t-yosemite-r7-015",
1 "t-yosemite-r7-016",
[...]
5 "t-yosemite-r7-353",
5 "t-yosemite-r7-428",
5 "t-yosemite-r7-437",
5 "t-yosemite-r7-470",
6 "t-yosemite-r7-098",
6 "t-yosemite-r7-411",
[david@george logs]$ cat marionette_january.csv |tr '\r' ' ' |awk '{print $1}'|sort|uniq --count|sort|wc -l
336
Reporter
Comment 13•6 years ago
The failures are also spread across the month (no spikes that I notice):
[david@george logs]$ cat marionette_january.csv |tr '\r' ' ' |awk '{print $3}' |sed -e 's/.*201[89]\-\([-0-9]*\)T.*/\1/'|sort|uniq --count|sort -m -k2
4 01-01
16 01-02
20 01-03
15 01-04
8 01-05
8 01-06
29 01-07
22 01-08
25 01-09
30 01-10
24 01-11
15 01-12
10 01-13
21 01-14
34 01-15
27 01-16
33 01-17
20 01-18
1 01-20
41 01-21
7 01-22
23 01-23
26 01-24
19 01-25
5 01-26
4 01-27
15 01-28
23 01-29
35 01-30
22 01-31
26 02-01
6 02-04
7 02-05
15 12-28
2 12-29
2 12-30
5 12-31
The missing days are most likely days when I did not check and reboot machines (so they were rebooted by CIDuty and I did not capture any logs).
Comment 14•6 years ago
Given comment 11, I would say we are looking at some interaction from the test harness/tests that is causing this. Why would this only show up on wpt debug12/opt4?
The attached file lists the 528 tests common to the two jobs as seen last night on mozilla-central (the set of tests can shift over time).
If we can identify some of the test-verify-wpt jobs that failed, that might give us specific test names to help narrow down the list :)
Dupe of bug 1511764? The failure messages are the same, but I wonder why that other bug isn't as high-frequency. Is something being misclassified, or where are all of these errors coming from?
Reporter
Comment 16•6 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #14)
Created attachment 9042046 [details]
debug12opt4_commontests.txt
Given comment 11, I would say that we are looking at some interaction from the test harness/tests that is causing this. Why would this only show up on wpt debug12/opt4?
This attached file is the 528 tests in common on the two jobs as seen last night on mozilla-central (some tests can shift over time)
If we can identify some of the test-verify-wpt jobs that failed, that might give us specific test names that could help narrow down the list :)
Joel, is this identification something I can help with? Or would it help for me to download the logs for a few more examples?
Here are three I recovered this morning:
"t-yosemite-r7-039", "b6GYWNtYQs-wHB_6xUjtVQ", "2019-02-07T12:28:41.462Z", "test-macosx64/opt-web-platform-tests-e10s-4", "CRITICAL - AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'"
"t-yosemite-r7-179", "WO8ULzJRTQ6lYVcj3eCqOA", "2019-02-07T12:56:49.714Z", "test-macosx64/opt-web-platform-tests-e10s-4", "CRITICAL - AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'"
"t-yosemite-r7-142", "C3-1KDNESI-tVL9eCegFpQ", "2019-02-07T14:50:09.096Z", "test-macosx64/debug-web-platform-tests-e10s-12", "CRITICAL - AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'"
No logs for the failures (claim-expired):
https://queue.taskcluster.net/v1/task/b6GYWNtYQs-wHB_6xUjtVQ/status
https://tools.taskcluster.net/tasks/b6GYWNtYQs-wHB_6xUjtVQ
https://queue.taskcluster.net/v1/task/WO8ULzJRTQ6lYVcj3eCqOA/status
https://tools.taskcluster.net/tasks/WO8ULzJRTQ6lYVcj3eCqOA
https://queue.taskcluster.net/v1/task/C3-1KDNESI-tVL9eCegFpQ/status
https://tools.taskcluster.net/tasks/C3-1KDNESI-tVL9eCegFpQ
Comment 17•6 years ago
I cannot find errors in the logs referenced above. Do we get logs when these errors occur?
Reporter
Comment 18•6 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #17)
I cannot find errors in the logs referenced above. Do we get logs when these errors occur?
No, the logs do not get uploaded; generic-worker kills itself and the OS UI (bug 1475689), so I or CIDuty end up finding the machine and rebooting it. I can download the logs for a few instances the next time I do that.
Reporter
Comment 19•6 years ago
cc'ing pmoore since this may be generic-worker related
Reporter
Comment 21•5 years ago
This is not occurring now.