Closed Bug 1525109 Opened 6 years ago Closed 5 years ago

mac wpt tests failing with marionette error

Categories

(Testing :: web-platform-tests, defect, P3)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: dhouse, Unassigned)

Details

Attachments

(3 files)

I see failures (380 in the last month) on the mac minis for test-macosx64/debug-web-platform-tests-e10s-12, ending with an exception about "Loading initial page http://web-platform.test:8000/testharness_runner.html failed.". I don't believe the test fails every time it runs.

Example from this morning (https://tools.taskcluster.net/groups/aywPnxcNRZi3786fsCc4Kw/tasks/afHEO9sSR7-F6QAgBQDutQ/runs/0):

"t-yosemite-r7-145":{"afHEO9sSR7-F6QAgBQDutQ":["exception 2019-02-04T18:34:15.754Z 2019-02-04T19:28:17.591Z","test-macosx64/debug-web-platform-tests-e10s-12"],[""up  1:37","worker 0","","be1beccfb86d394a75f4af64dd8e21f5fff8ff88","",10.11.3
11:14:30  WARNING - Forcibly terminating runner process
11:20:30  WARNING - Timed out waiting for browser to start
11:20:30  WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
11:20:30 CRITICAL - Loading initial page http://web-platform.test:8000/testharness_runner.html failed. Ensure that the there are no other programs bound to this port and that your firewall rules or network setup does not prevent access.
11:20:30 CRITICAL - Traceback (most recent call last):
11:20:30 CRITICAL -   File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in load_runner
11:20:30 CRITICAL -     self.dismiss_alert(lambda: self.marionette.navigate(url))
11:20:30 CRITICAL -   File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 180, in dismiss_alert
11:20:30 CRITICAL -     f()
11:20:30 CRITICAL -   File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in <lambda>
11:20:30 CRITICAL -     self.dismiss_alert(lambda: self.marionette.navigate(url))
11:20:30 CRITICAL - AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'
11:20:30 CRITICAL - 
11:20:30 CRITICAL - Traceback (most recent call last):
11:20:30 CRITICAL -   File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 141, in start_runner
11:20:30 CRITICAL -     runner.run()
11:20:30 CRITICAL -   File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 86, in run
11:20:30 CRITICAL -     self.setup()
11:20:30 CRITICAL -   File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 73, in setup
11:20:30 CRITICAL -     self.executor.setup(self)
11:20:30 CRITICAL -   File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 664, in setup
11:20:30 CRITICAL -     self.protocol.testharness.load_runner(self.last_environment["protocol"])
11:20:30 CRITICAL -   File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in load_runner
11:20:30 CRITICAL -     self.dismiss_alert(lambda: self.marionette.navigate(url))
11:20:30 CRITICAL -   File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 180, in dismiss_alert
11:20:30 CRITICAL -     f()
11:20:30 CRITICAL -   File "/Users/cltbld/tasks/task_1549305253/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in <lambda>
11:20:30 CRITICAL -     self.dismiss_alert(lambda: self.marionette.navigate(url))
11:20:30 CRITICAL - AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'
11:20:30 CRITICAL - 
11:20:30  WARNING - Command left in remote_queue during cleanup: u'stop', ()
11:20:30  WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
11:20:30  WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
11:20:30 CRITICAL - Test harness output was not a valid structured log message: 
11:20:30 CRITICAL - Traceback (most recent call last):
11:20:30 CRITICAL -   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
11:20:30 CRITICAL -     send(obj)
11:20:30 CRITICAL - IOError: [Errno 32] Broken pipe
11:20:31 CRITICAL - # TBPL FAILURE #
11:20:31  WARNING - setting return code to 2
11:20:32  WARNING - returning nonzero exit status 2

The logs do not get uploaded since the worker kills itself off. I'll download the logs from one instance and attach them.

Group: mozilla-employee-confidential

It looks like Firefox is failing to start (an error about the WindowServer):

Application command: /Users/cltbld/tasks/task_1549308498/build/application/Firefox NightlyDebug.app/Contents/MacOS/firefox --marionette about:blank -foreground -profile /var/folders/wp/68vm064n32q2f0zx_5hh1z4h00000x/T/tmpFqSCTV.mozrunner
Starting runner
PID 1733 | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /var/folders/wp/68vm064n32q2f0zx_5hh1z4h00000x/T/tmpFqSCTV.mozrunner/runtests_leaks_655.log
PID 1733 | _RegisterApplication(), FAILED TO establish the default connection to the WindowServer, _CGSDefaultConnection() is NULL.
Browser exited with return code -15
[...]
Application command: /Users/cltbld/tasks/task_1549308498/build/application/Firefox NightlyDebug.app/Contents/MacOS/firefox --marionette about:blank -foreground -profile /var/folders/wp/68vm064n32q2f0zx_5hh1z4h00000x/T/tmp1Qb0ct.mozrunner
Starting runner
STDERR: Exception in thread Thread-65:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 1082, in run
    self.function(*self.args, **self.kwargs)
  File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 220, in init_timeout
    self.send_message("init_failed")
  File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 214, in send_message
    self.command_queue.put((command, args))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 100, in put
    assert not self._closed
AssertionError
Timed out waiting for browser to start
Browser exited with return code -15
PROCESS LEAKS /var/folders/wp/68vm064n32q2f0zx_5hh1z4h00000x/T/tmp1Qb0ct.mozrunner/runtests_leaks_655.log
leakcheck | refcount logging is off, so leaks can't be detected!
Loading initial page http://web-platform.test:8000/testharness_runner.html failed. Ensure that the there are no other programs bound to this port and that your firewall rules or network setup does not prevent access.
Traceback (most recent call last):
  File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in load_runner
    self.dismiss_alert(lambda: self.marionette.navigate(url))
  File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 180, in dismiss_alert
    f()
  File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py", line 132, in <lambda>
    self.dismiss_alert(lambda: self.marionette.navigate(url))
AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'
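
For context on that AttributeError: the Marionette client only gets attached to the executor's protocol parts once the browser is up and the protocol has connected, so when Firefox exits before it ever starts (as above), load_runner() touches an attribute that was never assigned. Below is a minimal sketch of that mechanism; the class is a stand-in, not the real wptrunner code, and only the names that appear in the traceback are taken from the source:

# Hypothetical sketch: it only illustrates why the failure surfaces as
# "object has no attribute 'marionette'" rather than as a clearer
# "browser failed to start" error.
class TestharnessPart(object):
    def setup(self, marionette_client):
        # In wptrunner this assignment effectively happens only after the
        # protocol connects to a running browser; if the browser never
        # starts, it never runs.
        self.marionette = marionette_client

    def load_runner(self, url):
        # Mirrors executormarionette.py line 132 from the traceback above.
        self.marionette.navigate(url)

part = TestharnessPart()
# Browser start timed out, so setup() was never called with a client...
part.load_runner("http://web-platform.test:8000/testharness_runner.html")
# AttributeError: 'TestharnessPart' object has no attribute 'marionette'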

The example from earlier ran twice more. On the second run it hit the same problem (on #130) and then completed successfully (on #16, which had failed the same test 8 hours earlier for a different task group).

https://tools.taskcluster.net/groups/aywPnxcNRZi3786fsCc4Kw/tasks/afHEO9sSR7-F6QAgBQDutQ/runs/2

Hi Joel,

This is a failure specifically of debug-web-platform-tests-e10s-12 that has happened regularly over the last month; from what I can see it was not appearing last summer/fall.
I've checked, and the test has passed on the same machines it has failed on, and I did not find any warnings or other issues around the times the failures occurred. So it looks like a problem with the test rather than the machines.

Do you know where I should send this bug?

Flags: needinfo?(jmaher)

I don't think this is a relops issue, and I don't see any reason it should be employee-only.

It looks like the harness is crashing out; in the supplied log we see:

12:12:11     INFO - STDERR: Exception in thread Thread-65:
12:12:11     INFO - Traceback (most recent call last):
12:12:11     INFO -   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
12:12:11     INFO -     self.run()
12:12:11     INFO -   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 1082, in run
12:12:11     INFO -     self.function(*self.args, **self.kwargs)
12:12:11     INFO -   File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 220, in init_timeout
12:12:11     INFO -     self.send_message("init_failed")
12:12:11     INFO -   File "/Users/cltbld/tasks/task_1549308498/build/tests/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py", line 214, in send_message
12:12:11     INFO -     self.command_queue.put((command, args))
12:12:11     INFO -   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 100, in put
12:12:11     INFO -     assert not self._closed

After that assert fires we don't recover, and we then try to access an attribute that doesn't exist, presumably because the object was never initialised. I'm not sure why this would happen with that job in particular; it might be related to a specific test or to an environmental condition like OOM.
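
To make the first step of that chain concrete, here is a small standalone sketch (not the wptrunner code itself) of a timer thread calling put() on a multiprocessing queue the parent has already closed. Under the Python 2.7 used in these logs that trips the `assert not self._closed` shown above; newer Python 3 releases raise ValueError instead, but either way the exception dies inside the thread and nothing downstream recovers.

import multiprocessing
import threading
import time

def init_timeout(command_queue):
    # Plays the role of testrunner.py's init_timeout -> send_message("init_failed")
    time.sleep(0.5)
    command_queue.put(("init_failed", ()))  # queue already closed -> assert/ValueError

queue = multiprocessing.Queue()
timer = threading.Thread(target=init_timeout, args=(queue,))
timer.start()

# The parent gives up (e.g. the browser never started) and closes the queue
# before the timeout thread fires.
queue.close()
timer.join()
# stderr shows "Exception in thread Thread-1: ... assert not self._closed"
# on Python 2.7, matching the "Exception in thread Thread-65" block above.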

Component: RelOps: General → web-platform-tests
Product: Infrastructure & Operations → Testing
QA Contact: klibby

Thanks! I had marked it as employee-only since I was attaching logs whose content I had not reviewed. I'm guessing they are the same as what is uploaded publicly in the taskcluster log results for a run.

Assignee: dhouse → nobody

I am not sure this is a simple harness issue; possibly the machine is in a bad state rather than there being an issue with the tools or Firefox. I wonder:

  • whether this is seen on other platforms/configs
  • how frequently this occurs
  • whether there is a pattern of jobs/failures/tools that run prior to these instances on the same machine

Possibly we need some tooling prior to starting a test to ensure a ready state, so we can make this a 'blue' job on treeherder and ignore the failure.
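
As a rough illustration of what such a pre-flight check could look like (the function names and the exact checks here are hypothetical, not an existing mozharness/wptrunner hook), something along these lines could run before the harness starts and turn a bad machine state into a retry instead of a test failure:

import socket
import subprocess
import sys

def port_is_free(host="web-platform.test", port=8000, timeout=2.0):
    # If something already answers on the wpt port, the machine is not in a
    # clean state (see the "no other programs bound to this port" message in
    # the logs above).
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
        conn.close()
        return False
    except socket.error:
        return True

def gui_session_available():
    # On macOS, `launchctl managername` reports "Aqua" only when a real GUI
    # (WindowServer) session is available; the _RegisterApplication failure
    # above suggests the task ran without one.
    try:
        out = subprocess.check_output(["launchctl", "managername"],
                                      universal_newlines=True)
        return out.strip() == "Aqua"
    except (OSError, subprocess.CalledProcessError):
        return False

if __name__ == "__main__":
    ready = port_is_free() and gui_session_available()
    # A distinct exit code could let automation mark the task for retry
    # ("blue") rather than recording a failure.
    sys.exit(0 if ready else 10)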

I also second :jgraham's point that this doesn't need to be confidential.

:dhouse- do you have thoughts on my 3 questions?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #8)

  • this is seen on other platforms/configs
    I'm not certain whether we have seen this on other platforms. Perhaps someone can check the history for this test across multiple platforms? (I don't know where to look for that.)
  • how frequently this occurs
    For frequency on the macs, I have counts for the failures I recovered machines from, but I don't know how many other times the test ran without failing.
    I'll collect a list of the test/task ids for the failures and attach it to this bug.
  • is there a pattern of jobs/failures/tools that run prior to these instances on the same machine
    I'll look at a sampling of failures today to see if there is any pattern. I think there is not one; the machines are rebooted between tasks. I'll also try running the test on staging to see if I can reproduce it while controlling the environment (a freshly imaged machine versus one that has run other tests, to see if something is left over).
Group: mozilla-employee-confidential
Attached file marionette_january.csv

I've attached a list of the tasks that hit this problem (the ones where I recovered the machine) for this year so far, so about a month:

[david@george logs]$ cat marionette_january.csv |awk '{print $4}'|sort|uniq -c
      8 "test-macosx64/debug-test-verify-wpt-e10s",
    382 "test-macosx64/debug-web-platform-tests-e10s-12",
      8 "test-macosx64-devedition/opt-web-platform-tests-e10s-4",
     24 "test-macosx64-nightly/opt-web-platform-tests-e10s-4",
      4 "test-macosx64/opt-test-verify-wpt-e10s",
    219 "test-macosx64/opt-web-platform-tests-e10s-4",

I've caught it on 336 of the ~470 machines, and it is spread across them rather than focused on any subset:

[david@george logs]$ cat marionette_january.csv |tr '\r' ' ' |awk '{print $1}'|sort|uniq --count|sort
      1 "t-yosemite-r7-006",
      1 "t-yosemite-r7-011",
      1 "t-yosemite-r7-012",
      1 "t-yosemite-r7-013",
      1 "t-yosemite-r7-015",
      1 "t-yosemite-r7-016",
[...]
      5 "t-yosemite-r7-353",
      5 "t-yosemite-r7-428",
      5 "t-yosemite-r7-437",
      5 "t-yosemite-r7-470",
      6 "t-yosemite-r7-098",
      6 "t-yosemite-r7-411",
[david@george logs]$ cat marionette_january.csv |tr '\r' ' ' |awk '{print $1}'|sort|uniq --count|sort|wc -l
336

The failures are also spread across the month (no spikes that I notice):

[david@george logs]$ cat marionette_january.csv |tr '\r' ' ' |awk '{print $3}' |sed -e 's/.*201[89]\-\([-0-9]*\)T.*/\1/'|sort|uniq --count|sort -m -k2
      4 01-01
     16 01-02
     20 01-03
     15 01-04
      8 01-05
      8 01-06
     29 01-07
     22 01-08
     25 01-09
     30 01-10
     24 01-11
     15 01-12
     10 01-13
     21 01-14
     34 01-15
     27 01-16
     33 01-17
     20 01-18
      1 01-20
     41 01-21
      7 01-22
     23 01-23
     26 01-24
     19 01-25
      5 01-26
      4 01-27
     15 01-28
     23 01-29
     35 01-30
     22 01-31
     26 02-01
      6 02-04
      7 02-05
     15 12-28
      2 12-29
      2 12-30
      5 12-31

The missing days are most likely days that I did not check and reboot machines (and so they were rebooted by CIDuty and I did not capture any logs).

Given comment 11, I would say that we are looking at some interaction between the test harness and the tests that is causing this. Why would this only show up on wpt debug chunk 12 / opt chunk 4?

The attached file lists the 528 tests that the two jobs have in common, as seen last night on mozilla-central (the set of tests in a chunk can shift over time).

If we can identify some of the test-verify-wpt jobs that failed, that might give us specific test names that could help narrow down the list :)
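
A quick way to pull those out of the attached CSV (assuming the column layout shown in the later comments on this bug: machine, task id, timestamp, job name, error) might be something like:

import csv

with open("marionette_january.csv") as f:
    for row in csv.reader(f, skipinitialspace=True):
        if len(row) >= 4 and "test-verify-wpt" in row[3]:
            # row[1] is the task id; its log (if uploaded) names the tests run.
            print(row[1])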

Flags: needinfo?(jmaher)

Dupe of bug 1511764? The failure messages are the same, but I wonder why that other bug isn't as frequent. Is something being classified incorrectly? Or where are all these errors coming from?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #14)

> Created attachment 9042046 [details]
> debug12opt4_commontests.txt
>
> Given comment 11, I would say that we are looking at some interaction from the test harness/tests that is causing this. Why would this only show up on wpt debug12/opt4?
>
> This attached file is the 528 tests in common on the two jobs as seen last night on mozilla-central (some tests can shift over time)
>
> If we can identify some of the test-verify-wpt jobs that failed, that might give us specific test names that could help narrow down the list :)

Joel, is this identification something I can help with? Or would it help for me to download the logs for a few more examples?

Here are three I recovered this morning:

"t-yosemite-r7-039", "b6GYWNtYQs-wHB_6xUjtVQ", "2019-02-07T12:28:41.462Z", "test-macosx64/opt-web-platform-tests-e10s-4", "CRITICAL - AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'"
"t-yosemite-r7-179", "WO8ULzJRTQ6lYVcj3eCqOA", "2019-02-07T12:56:49.714Z", "test-macosx64/opt-web-platform-tests-e10s-4", "CRITICAL - AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'"
"t-yosemite-r7-142", "C3-1KDNESI-tVL9eCegFpQ", "2019-02-07T14:50:09.096Z", "test-macosx64/debug-web-platform-tests-e10s-12", "CRITICAL - AttributeError: 'MarionetteTestharnessProtocolPart' object has no attribute 'marionette'"

No logs for the failures (claim-expired):
https://queue.taskcluster.net/v1/task/b6GYWNtYQs-wHB_6xUjtVQ/status
https://tools.taskcluster.net/tasks/b6GYWNtYQs-wHB_6xUjtVQ
https://queue.taskcluster.net/v1/task/WO8ULzJRTQ6lYVcj3eCqOA/status
https://tools.taskcluster.net/tasks/WO8ULzJRTQ6lYVcj3eCqOA
https://queue.taskcluster.net/v1/task/C3-1KDNESI-tVL9eCegFpQ/status
https://tools.taskcluster.net/tasks/C3-1KDNESI-tVL9eCegFpQ

Flags: needinfo?(jmaher)

I cannot find errors in the logs referenced above. Do we get logs when these errors occur?

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #17)

> I cannot find errors in the logs referenced above. Do we get logs when these errors occur?

No, the logs do not get uploaded: generic-worker kills off itself and the OS UI (bug 1475689), so I or CIDuty end up finding the machine and rebooting it. I can download the logs for a few of these the next time I do that.

cc'ing pmoore since this may be generic-worker related.

Is this still an issue?

Priority: -- → P3

This is not occurring now.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
