Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-'] due to hang in mozlog

RESOLVED FIXED in Firefox 60

Status

defect
P1
normal
RESOLVED FIXED
2 years ago
2 months ago

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

Tracking

(Depends on 1 bug, {intermittent-failure})

Version 3
mozilla60
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox58 disabled, firefox59 disabled, firefox60 fixed)

Details

(Whiteboard: [stockwell fixed][wptsync upstream])

Attachments

(6 attachments, 4 obsolete attachments)

59 bytes, text/x-review-board-request
ato
: review+
Details
59 bytes, text/x-review-board-request
jgraham
: review+
Details
1.39 KB, text/plain
Details
3.06 KB, text/plain
Details
59 bytes, text/x-review-board-request
jgraham
: review+
Details
59 bytes, text/x-review-board-request
jgraham
: review+
Details
Filed by: archaeopteryx [at] coole-files.de

https://treeherder.mozilla.org/logviewer.html#?job_id=135990490&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/eHR4XOXGRtmWRls2xymnfw/runs/0/artifacts/public/logs/live_backing.log

[task 2017-10-10T17:37:20.644Z] 17:37:20     INFO - TEST-START | /webdriver/tests/actions/key.py
[task 2017-10-10T17:54:00.677Z] 17:54:00     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/DcV_FpLGTCOfRXX2wSyYiA/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2017-10-10T17:54:00.714Z] 17:54:00    ERROR - timed out after 1000 seconds of no output
[task 2017-10-10T17:54:00.715Z] 17:54:00    ERROR - Return code: -15
[task 2017-10-10T17:54:00.715Z] 17:54:00    ERROR - No suite end message was emitted by this harness.
This particular failure is the earliest one; I couldn't find any other before it: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=656d53c7fd810507564844a6b0b732c116ad8efd&filter-searchStr=linux%20x64%20web-platform%20wdspec&selectedJob=135990490

:ato Could you please take a look at this bug?
Flags: needinfo?(ato)
Whiteboard: [stockwell needswork]
Depends on: 1409049
The orange factor report [1] seems to indicate that this is due to
a timeout, and because it mostly occurs on pgo and debug builds it
seems probable that the timeout of the test is too short.

I will try increasing it by marking it as a ‘long’ test in
https://bugzil.la/1409049.

  [1] https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1407383&entireHistory=true&tree=all
Flags: needinfo?(ato)
(In reply to Andreas Tolfsen ‹:ato› from comment #4)

> The orange factor report [1] seems to indicate that this is due to
> a timeout, and because it mostly occurs on pgo and debug builds it
> seems probable that the timeout of the test is too short.

Scratch that.  This is an overall 1000-second (16.6-minute)
timeout from mozprocess, meaning wptrunner or something else stops
responding.  I would expect wptrunner to abort the test much sooner
and return with a TEST-TIMEOUT status.
No longer depends on: 1409049
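For context, the 1000 seconds here is mozprocess's no-output timeout: the harness process is sent SIGTERM after that long with no new lines on stdout, which is why the log shows "Return code: -15". A minimal sketch of that watchdog pattern (illustrative only, not mozprocess's actual code):

```python
import subprocess
import threading
import time

def run_with_output_timeout(cmd, output_timeout):
    """Run cmd, terminating it if it stays silent for output_timeout seconds.

    A simplified sketch of mozprocess's no-output timeout; the real
    implementation lives in testing/mozbase/mozprocess.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    last_output = [time.time()]
    timed_out = [False]

    def watchdog():
        # Poll for silence; terminate() sends SIGTERM, hence "Return code: -15".
        while proc.poll() is None:
            if time.time() - last_output[0] > output_timeout:
                timed_out[0] = True
                proc.terminate()
                return
            time.sleep(0.1)

    threading.Thread(target=watchdog, daemon=True).start()
    for _line in proc.stdout:
        last_output[0] = time.time()
    proc.wait()
    return proc.returncode, timed_out[0]
```

With this shape, a harness that hangs before producing any output is killed from the outside, exactly matching the "timed out after 1000 seconds of no output" lines in the logs above.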
:ato Will you be able to look into this?
Flags: needinfo?(ato)
(In reply to Henrietta Maior [:henrietta_maior] from comment #7)

> :ato Will you be able to look into this?

I already replied to your needinfo request.  I believe this is
something more systemic in wptrunner, or an infrastructure issue.

Cc’ing jgraham to see if he can shed some light on this, or at
least move the bug to the correct component.
Flags: needinfo?(ato)
Flags: needinfo?(james)
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/879f2ee477b4
Disable /webdriver/tests/actions/key.py on linux for frequent failures. r=me, a=testonly
I disabled this test for having >200 failures in the last 30 days; please remember to re-enable it when testing a fix
Keywords: leave-open
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
maja_zf: Just to make you aware, the WPT key tests are now disabled
due to some problem in wptrunner.
To be clear, the diagnosis that the problem's in wptrunner doesn't strike me as very certain. wptrunner runs a timer thread that stops waiting for the pytest results if they exceed the timeout, and presumably that works at least sometimes, since we do get TIMEOUT results. But in this case we apparently don't get that far. That probably is a bug in wptrunner, in that it's not handling some behaviour of pytest, but I expect that something bad is happening when the test runs so that the Python interpreter never gets to run the main thread; we therefore never report an EXTERNAL-TIMEOUT, never kill the TestRunner process, and never continue.

Without someone managing to debug the problem I don't think it's clear what the underlying fault is.
Flags: needinfo?(james)
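The timer-thread pattern described above can be sketched as follows (a simplified illustration, not wptrunner's actual code; the point is that the EXTERNAL-TIMEOUT path only works if the waiting thread gets to run at all):

```python
import threading

def run_with_test_timeout(func, timeout):
    """Run func on a worker thread and give up after `timeout` seconds.

    Sketch of the wptrunner-style pattern: the harness waits on the
    thread running the test and reports EXTERNAL-TIMEOUT if it does
    not finish in time. If the interpreter never schedules this
    waiting thread (e.g. the worker is stuck in code that never
    yields), the timeout is never reported, matching the hang
    described in this bug.
    """
    result = {}

    def target():
        result["value"] = func()

    worker = threading.Thread(target=target)
    worker.daemon = True
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return ("EXTERNAL-TIMEOUT", None)
    return ("OK", result.get("value"))
```

Example: `run_with_test_timeout(lambda: 42, 5)` returns `("OK", 42)`, while a function that blocks past the timeout yields `("EXTERNAL-TIMEOUT", None)`.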
Please also see bug 1391545 for a similar output affecting Marionette tests.
I don't think this failure has anything to do with the test that got disabled (key.py). It just happened to be ordered as the first test to run. While investigating another intermittent, I'm still seeing this failure a lot despite key.py being disabled.
Now these are failing on:
TEST-START | /webdriver/tests/actions/modifier_click.py

So comment 19 is spot on: we are not executing any tests, we are just hanging.
This failure has occurred 100 times in the last 7 days. It has failed only on Linux x64, on opt, pgo, and debug build types.
Failing test suites: tests-wdspec-e10s, marionette-e10s, opt-mochitest-3, platform-tests-e10s.

Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=144012458&lineNumber=1229 

Part of the log: 

[task 2017-11-11T19:21:33.560Z] 19:21:33     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2017-11-11T19:21:33.560Z] 19:21:33     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2017-11-11T19:38:13.591Z] 19:38:13     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/daJK4fMkTQC-LumF6CRNug/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2017-11-11T19:38:13.625Z] 19:38:13    ERROR - timed out after 1000 seconds of no output
[task 2017-11-11T19:38:13.626Z] 19:38:13    ERROR - Return code: -15
[task 2017-11-11T19:38:13.626Z] 19:38:13    ERROR - No suite end message was emitted by this harness.
[task 2017-11-11T19:38:13.627Z] 19:38:13    ERROR - # TBPL FAILURE #
[task 2017-11-11T19:38:13.627Z] 19:38:13  WARNING - setting return code to 2
[task 2017-11-11T19:38:13.628Z] 19:38:13     INFO - Running post-action listener: _package_coverage_data
[task 2017-11-11T19:38:13.628Z] 19:38:13     INFO - Running post-action listener: _resource_record_post_action
[task 2017-11-11T19:38:13.628Z] 19:38:13     INFO - [mozharness: 2017-11-11 19:38:13.628489Z] Finished run-tests step (success)
[task 2017-11-11T19:38:13.628Z] 19:38:13     INFO - Running post-run listener: _resource_record_post_run
[task 2017-11-11T19:38:13.728Z] 19:38:13     INFO - Total resource usage - Wall time: 1022s; CPU: 1.0%; Read bytes: 1077248; Write bytes: 1633046528; Read time: 60; Write time: 1293480
[task 2017-11-11T19:38:13.728Z] 19:38:13     INFO - TinderboxPrint: CPU usage<br/>0.9%
[task 2017-11-11T19:38:13.728Z] 19:38:13     INFO - TinderboxPrint: I/O read bytes / time<br/>1,077,248 / 60
[task 2017-11-11T19:38:13.729Z] 19:38:13     INFO - TinderboxPrint: I/O write bytes / time<br/>1,633,046,528 / 1,293,480
[task 2017-11-11T19:38:13.729Z] 19:38:13     INFO - TinderboxPrint: CPU idle<br/>4,038.3 (99.1%)
[task 2017-11-11T19:38:13.729Z] 19:38:13     INFO - TinderboxPrint: Swap in / out<br/>0 / 0
[task 2017-11-11T19:38:13.730Z] 19:38:13     INFO - pull - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2017-11-11T19:38:13.732Z] 19:38:13     INFO - install - Wall time: 18s; CPU: 28.0%; Read bytes: 139264; Write bytes: 33554432; Read time: 28; Write time: 110332
[task 2017-11-11T19:38:13.743Z] 19:38:13     INFO - run-tests - Wall time: 1004s; CPU: 0.0%; Read bytes: 937984; Write bytes: 1595269120; Read time: 32; Write time: 1182940
[task 2017-11-11T19:38:14.070Z] 19:38:14     INFO - Running post-run listener: _upload_blobber_files
[task 2017-11-11T19:38:14.070Z] 19:38:14  WARNING - Blob upload gear skipped. Missing cmdline options.
[task 2017-11-11T19:38:14.070Z] 19:38:14     INFO - Running post-run listener: copy_logs_to_upload_dir
[task 2017-11-11T19:38:14.071Z] 19:38:14     INFO - Copying logs to upload dir...
[task 2017-11-11T19:38:14.071Z] 19:38:14     INFO - mkdir: /builds/worker/workspace/build/upload/logs
[task 2017-11-11T19:38:14.073Z] 19:38:14     INFO - Copying logs to upload dir...
[task 2017-11-11T19:38:14.075Z] 19:38:14  WARNING - returning nonzero exit status 2
[task 2017-11-11T19:38:14.099Z] cleanup
[task 2017-11-11T19:38:14.099Z] + cleanup
[task 2017-11-11T19:38:14.099Z] + local rv=2
[task 2017-11-11T19:38:14.099Z] + [[ -s /builds/worker/.xsession-errors ]]
[task 2017-11-11T19:38:14.099Z] + cp /builds/worker/.xsession-errors /builds/worker/artifacts/public/xsession-errors.log
[task 2017-11-11T19:38:14.101Z] + true
[task 2017-11-11T19:38:14.101Z] + cleanup_xvfb
[task 2017-11-11T19:38:14.102Z] pidof Xvfb
[task 2017-11-11T19:38:14.102Z] ++ pidof Xvfb
[task 2017-11-11T19:38:14.105Z] + local xvfb_pid=25
[task 2017-11-11T19:38:14.105Z] + local vnc=false
[task 2017-11-11T19:38:14.105Z] + local interactive=false
[task 2017-11-11T19:38:14.106Z] + '[' -n 25 ']'
[task 2017-11-11T19:38:14.106Z] + [[ false == false ]]
[task 2017-11-11T19:38:14.106Z] + [[ false == false ]]
[task 2017-11-11T19:38:14.106Z] + kill 25
[task 2017-11-11T19:38:14.107Z] + screen -XS xvfb quit
[task 2017-11-11T19:38:14.107Z] XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
[task 2017-11-11T19:38:14.107Z]       after 783 requests (783 known processed) with 0 events remaining.
[task 2017-11-11T19:38:14.111Z] compizconfig - Info: Backend     : ini
[task 2017-11-11T19:38:14.112Z] compizconfig - Info: Integration : true
[task 2017-11-11T19:38:14.112Z] compizconfig - Info: Profile     : default
[task 2017-11-11T19:38:14.157Z] No screen session found.
[task 2017-11-11T19:38:14.157Z] + true
[task 2017-11-11T19:38:14.157Z] + exit 2
[taskcluster 2017-11-11 19:38:14.457Z] === Task Finished ===
[taskcluster 2017-11-11 19:38:19.671Z] Unsuccessful task run with exit code: 2 completed in 2514.307 seconds

:whimboo
Do you have any updates?
Flags: needinfo?(hskupin)
(In reply to OrangeFactor Robot from comment #24)
> For more details, see:
> https://brasstacks.mozilla.com/orangefactor/
> ?display=Bug&bugid=1407383&startday=2017-10-30&endday=2017-11-05&tree=all

If that is becoming a blocking issue, James, Andreas, or Maja should have a look at this bug. They worked most on wptrunner, but I'm not sure which of them has the time right now.
Flags: needinfo?(mjzffr)
Flags: needinfo?(james)
Flags: needinfo?(ato)
Something that's happening on mochitest, marionette and wpt clearly isn't a single bug in the wpt test harness. There are either multiple unrelated issues being bucketed together here, or it's in a lower layer like mozharness or marionette.
Correct, but someone would have to investigate that first. wptrunner handles the process itself, and as such it cannot be related to the process handling in Marionette. Although we have a similar issue there (bug 1391545).
Flags: needinfo?(hskupin)
Has anyone found a reliable way to reproduce this locally?
Looking at this bug, Firefox is not starting up. There is no output from Firefox or from Marionette after geckodriver starts up for wdspec tests. For mochitest we don't use geckodriver, so that's not the cause there.
If this is happening fairly consistently, the best option is really to just request a one-click loaner and observe the tests via VNC. Given that those failures are on Linux, it is very easy to handle via Treeherder/Taskcluster.
There were 144 failures related to this bug in the last 7 days on Linux.

:whimboo any updates on this? It has spiked pretty high lately.
Whiteboard: [stockwell disabled] → [stockwell needswork]
(In reply to Andrei Ciure[:andrei_ciure_] from comment #36)
> :whimboo any updates on this? It has spiked pretty high lately.

I'm not the only one who has to work on all the intermittent failures. That doesn't scale. As such I put ni? for James, Andreas, and Maja a while ago. I hope one of them can find the time to check this bug specifically.
James, Andreas,

Looking at the log, it seems we never get to the point where we start Firefox (if I understand the flow).

# An Example of a test running

[task 2017-11-20T10:46:49.696Z] 10:46:49     INFO - Starting http server on 127.0.0.1:8000
[task 2017-11-20T10:46:49.697Z] 10:46:49     INFO - Starting http server on 127.0.0.1:8443
[task 2017-11-20T10:46:49.706Z] 10:46:49     INFO - Starting http server on 127.0.0.1:8001
[task 2017-11-20T10:46:49.717Z] 10:46:49     INFO - PID 852 | 1511174809713	geckodriver	INFO	geckodriver 0.19.1 ( '589acfe789d8'  '2017-11-20 13:59 +0800')
[task 2017-11-20T10:46:49.721Z] 10:46:49     INFO - PID 852 | 1511174809716	geckodriver	INFO	Listening on 127.0.0.1:4444
[task 2017-11-20T10:46:50.220Z] 10:46:50     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2017-11-20T10:46:50.221Z] 10:46:50     INFO - TEST-START | /webdriver/tests/actions/modifier_click.py
[task 2017-11-20T10:46:50.540Z] 10:46:50     INFO - STDOUT: ============================= test session starts ==============================
[task 2017-11-20T10:46:50.540Z] 10:46:50     INFO - STDOUT: platform linux2 -- Python 2.7.12, pytest-2.9.1, py-1.4.31, pluggy-0.3.1 -- /builds/worker/workspace/build/venv/bin/python
[task 2017-11-20T10:46:50.541Z] 10:46:50     INFO - STDOUT: rootdir: /builds/worker/workspace/build/tests/web-platform/tests/webdriver/tests/actions, inifile: 
[task 2017-11-20T10:46:50.541Z] 10:46:50     INFO - STDOUT: collecting ... 
[task 2017-11-20T10:46:50.558Z] 10:46:50     INFO - STDOUT: collected 9 items
[task 2017-11-20T10:46:50.558Z] 10:46:50     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/modifier_click.py::test_modifier_click[modifier0-ctrlKey] 
[task 2017-11-20T10:46:50.558Z] 10:46:50     INFO - PID 852 | 1511174810555	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.IRSxkIla0Kpi"
[task 2017-11-20T10:46:52.629Z] 10:46:52     INFO - PID 852 | [865, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file /builds/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 2872
[task 2017-11-20T10:47:19.603Z] 10:47:19     INFO - PID 852 | 1511174839597	Marionette	INFO	Enabled via --marionette

# An example of a failure

[task 2017-11-20T10:44:56.874Z] 10:44:56     INFO - TEST-SKIP | /webdriver/tests/contexts/maximize_window.py | took 0ms
[task 2017-11-20T10:44:56.891Z] 10:44:56     INFO - Starting runner
[task 2017-11-20T10:44:56.899Z] 10:44:56     INFO - Starting http server on 127.0.0.1:8443
[task 2017-11-20T10:44:56.900Z] 10:44:56     INFO - Starting http server on 127.0.0.1:8000
[task 2017-11-20T10:44:56.917Z] 10:44:56     INFO - PID 852 | 1511174696914	geckodriver	INFO	geckodriver 0.19.1 ( '589acfe789d8'  '2017-11-20 13:59 +0800')
[task 2017-11-20T10:44:56.921Z] 10:44:56     INFO - PID 852 | 1511174696918	geckodriver	INFO	Listening on 127.0.0.1:4444
[task 2017-11-20T10:44:57.421Z] 10:44:57     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2017-11-20T10:44:57.422Z] 10:44:57     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2017-11-20T11:01:37.471Z] 11:01:37     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/O13YgUo-SwOE8BF5SXh-kw/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2017-11-20T11:01:37.505Z] 11:01:37    ERROR - timed out after 1000 seconds of no output
[task 2017-11-20T11:01:37.506Z] 11:01:37    ERROR - Return code: -15
[task 2017-11-20T11:01:37.508Z] 11:01:37    ERROR - No suite end message was emitted by this harness.

The only obvious thing for me is that Firefox doesn't start up, but we also don't see any mozrunner logging being emitted from geckodriver.

E.g. mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.IRSxkIla0Kpi" is not shown anywhere.
(In reply to David Burns :automatedtester from comment #40)

> Looking at the log, it seems we never get to the point where
> we start Firefox (if I understand the flow).

So based on the log you pasted, geckodriver starts but never accepts
a new connection.  It would be extremely useful here, I think, with
a trace-level log from geckodriver.

I wonder if doing a try build with lots of Wd retriggers and a patch
to pass --webdriver-arg='-vv' by default would show us anything
interesting.
Flags: needinfo?(ato)
Flags: needinfo?(james)
Looking at the log with more verbose logging, we don't see pytest starting up and collecting tests.

Good run

[task 2017-11-28T12:25:29.916Z] 12:25:29     INFO - PID 847 | 1511871929913	geckodriver	INFO	geckodriver 0.19.1 ( '2fd843112adc'  '2017-11-28 12:01 +0000')
[task 2017-11-28T12:25:29.916Z] 12:25:29     INFO - Starting http server on 127.0.0.1:8443
[task 2017-11-28T12:25:29.919Z] 12:25:29     INFO - PID 847 | 1511871929913	webdriver::httpapi	DEBUG	Creating routes
[task 2017-11-28T12:25:29.926Z] 12:25:29     INFO - PID 847 | 1511871929918	geckodriver	INFO	Listening on 127.0.0.1:4444
[task 2017-11-28T12:25:30.418Z] 12:25:30     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2017-11-28T12:25:30.418Z] 12:25:30     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2017-11-28T12:25:30.762Z] 12:25:30     INFO - STDOUT: ============================= test session starts ==============================
[task 2017-11-28T12:25:30.763Z] 12:25:30     INFO - STDOUT: platform linux2 -- Python 2.7.12, pytest-2.9.1, py-1.4.31, pluggy-0.3.1 -- /builds/worker/workspace/build/venv/bin/python
[task 2017-11-28T12:25:30.763Z] 12:25:30     INFO - STDOUT: rootdir: /builds/worker/workspace/build/tests/web-platform/tests/webdriver/tests/actions, inifile: 
[task 2017-11-28T12:25:30.764Z] 12:25:30     INFO - STDOUT: collecting ... 
[task 2017-11-28T12:25:30.780Z] 12:25:30     INFO - STDOUT: collected 3 items
[task 2017-11-28T12:25:30.780Z] 12:25:30     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/key_shortcuts.py::test_mod_a_and_backspace_deletes_all_text 
[task 2017-11-28T12:25:30.781Z] 12:25:30     INFO - PID 847 | 1511871930773	webdriver::server	DEBUG	-> POST /session {"capabilities": {"alwaysMatch": {"moz:firefoxOptions": {"binary": "/builds/worker/workspace/build/application/firefox/firefox", "prefs": {"network.dns.localDomains": "web-platform.test,www.web-platform.test,www1.web-platform.test,www2.web-platform.test,xn--n8j6ds53lwwkrqhv28a.web-platform.test,xn--lve-6lad.web-platform.test"}}}}}
[task 2017-11-28T12:25:30.784Z] 12:25:30     INFO - PID 847 | 1511871930777	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.hpPAvne6jv1N"
[task 2017-11-28T12:25:30.785Z] 12:25:30     INFO - PID 847 | 1511871930780	geckodriver::marionette	TRACE	  connection attempt 0/600
[task 2017-11-28T12:25:30.886Z] 12:25:30     INFO - PID 847 | 1511871930880	geckodriver::marionette	TRACE	  connection attempt 1/600
[task 2017-11-28T12:25:30.988Z] 12:25:30     INFO - PID 847 | 1511871930981	geckodriver::marionette	TRACE	  connection attempt 2/600
[task 2017-11-28T12:25:31.089Z] 12:25:31     INFO - PID 847 | 1511871931081	geckodriver::marionette	TRACE	  connection attempt 3/600


Bad run

[task 2017-11-28T13:27:29.259Z] 13:27:29     INFO - TEST-SKIP | /webdriver/tests/element_click/select.py | took 0ms
[task 2017-11-28T13:27:29.282Z] 13:27:29     INFO - Starting http server on 127.0.0.1:8001
[task 2017-11-28T13:27:29.282Z] 13:27:29     INFO - Starting runner
[task 2017-11-28T13:27:29.285Z] 13:27:29     INFO - Starting http server on 127.0.0.1:8000
[task 2017-11-28T13:27:29.294Z] 13:27:29     INFO - Starting http server on 127.0.0.1:8443
[task 2017-11-28T13:27:29.306Z] 13:27:29     INFO - PID 849 | 1511875649302	geckodriver	INFO	geckodriver 0.19.1 ( '2fd843112adc'  '2017-11-28 12:01 +0000')
[task 2017-11-28T13:27:29.307Z] 13:27:29     INFO - PID 849 | 1511875649302	webdriver::httpapi	DEBUG	Creating routes
[task 2017-11-28T13:27:29.310Z] 13:27:29     INFO - PID 849 | 1511875649306	geckodriver	INFO	Listening on 127.0.0.1:4444
[task 2017-11-28T13:27:29.809Z] 13:27:29     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2017-11-28T13:27:29.810Z] 13:27:29     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2017-11-28T13:44:09.832Z] 13:44:09     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/UBMTC4VvSEOM_TnO_q7h0A/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2017-11-28T13:44:09.883Z] 13:44:09    ERROR - timed out after 1000 seconds of no output
[task 2017-11-28T13:44:09.883Z] 13:44:09    ERROR - Return code: -15
[task 2017-11-28T13:44:09.883Z] 13:44:09    ERROR - No suite end message was emitted by this harness.
[task 2017-11-28T13:44:09.884Z] 13:44:09    ERROR - # TBPL FAILURE #
[task 2017-11-28T13:44:09.884Z] 13:44:09  WARNING - setting return code to 2
[task 2017-11-28T13:44:09.885Z] 13:44:09     INFO - Running post-action listener: _package_coverage_data

Because we don't get pytest starting, we never get a session request, which then times out.


Dave, can you suggest how we can delve deeper into this?
Flags: needinfo?(dave.hunt)
Maybe this is also all related to bug 1421289, which is causing various 1000s timeouts for Mochitests, and Marionette tests.
Depends on: 1421289
> Dave, can you suggest how we can delve deeper into this?

I don't think I have enough context to add value here. The output in comment 60 suggests a TEST-START is being reported before the pytest session starts, which seems odd. Could you direct me to where pytest is invoked? I would suggest that either the issue is occurring before this, or perhaps pytest is throwing an exception that's being ignored.
Flags: needinfo?(dave.hunt)
So it looks like we are hitting some external timeout if we look at https://treeherder.mozilla.org/logviewer.html#?job_id=148432942&repo=try&lineNumber=1223

More debug statements going in.
Ok, I have managed to track it down to the invocation of pytest, and then it hangs, so my initial gut feeling of pytest being the cause looks likely. Now... to understand why...

We get to https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/tools/wptrunner/wptrunner/executors/pytestrunner/runner.py#54-62 before we hang.


See log with my statements in https://treeherder.mozilla.org/logviewer.html#?job_id=148724638&repo=try&lineNumber=1251
Not knowing the best way to get through all of this, I don't really know what the next steps are.

I definitely know that _something_ in pytest is not returning. We pass in a file and other arguments (see the comment above for a link to the code) and then wait for pytest. We then time out, somehow, and realise it's a timeout, but mozprocess comes along and kills us for taking too long.

Dave, if you could give me pointers on where to start looking in pytest, I would be grateful. We are using the vendored version in third_party/python/pytest in mozilla central.

James, should we, when hitting EXTERNAL-TIMEOUT, do `result = Stop` in wptrunner, as there seems to be an error we can't recover from?
Flags: needinfo?(james)
Flags: needinfo?(dave.hunt)
The pytest in the following folder is rather old:
https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/tools/pytest

Maybe it should be upgraded to a newer release?
It looks to me like pytest isn't even starting, as we'd expect to see some output in the console. You could try the pytest-instafail plugin (https://pypi.python.org/pypi/pytest-instafail/), which shows errors/failures immediately rather than waiting until the end of the suite. Alternatively, pytest-faulthandler (https://pypi.python.org/pypi/pytest-faulthandler) might shed some light on what's happening. You could also try adding some debug in WdspecRun.run(), where the thread managing the run is created and ultimately timed out.

I would also suggest trying a later pytest release, as 3.x has been available for more than a year.
Flags: needinfo?(dave.hunt)
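Of the suggestions above, a faulthandler-style dump is the most direct way to see where a silent hang sits. A sketch using the stdlib faulthandler module (Python 3 stdlib; the Python 2.7 harness in use here would need the PyPI backport, which is what pytest-faulthandler wraps):

```python
import faulthandler
import sys

def arm_hang_dump(timeout):
    """Dump every thread's traceback to stderr if the process is still
    running after `timeout` seconds, repeating each interval; a cheap
    way to see where a silent hang is stuck."""
    faulthandler.dump_traceback_later(timeout, repeat=True, file=sys.stderr)

def disarm_hang_dump():
    """Cancel a pending traceback dump once the risky section is done."""
    faulthandler.cancel_dump_traceback_later()
```

Armed just before the pytest invocation, this would have printed the stack of the stuck thread instead of leaving mozprocess to kill the harness with no diagnostics.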
Looks like updating pytest solves this.
Flags: needinfo?(james)
Attachment #8934146 - Attachment is obsolete: true
Attachment #8933670 - Attachment is obsolete: true
From try runs, the vendor update reduces the chances of it happening but doesn't outright solve it.
(In reply to David Burns :automatedtester from comment #81)

> From try runs, the vendor update reduces the chances of it
> happening but doesn't outright solve it.

Do you have an indication of the improvement?  Do you think there
> are more bugs in pytest downstream that cause it to occasionally
freeze up?
Comment on attachment 8935728 [details]
Bug 1407383: Wait for the URL to change in test before asserting

https://reviewboard.mozilla.org/r/206624/#review212366

::: testing/web-platform/tests/webdriver/tests/navigation/current_url.py:82
(Diff revision 2)
>  
>  def test_get_current_url_after_modified_location(session):
> +    start = session.transport.send("GET", "session/%s/url" % session.session_id)
>      session.execute_script("window.location.href = 'about:blank#wd_test_modification'")
> -
> +    wait(session,
> +         lambda d: session.transport.send("GET", "session/%s/url" % session.session_id) != start.body["value"],

d is unused.
Attachment #8935728 - Flags: review?(ato) → review+
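The review comment above is about the predicate passed to the test's wait() helper taking a parameter it never uses. A generic poll-until helper of the same shape, with a zero-argument predicate (illustrative; the actual helper in the wpt tests has a different signature):

```python
import time

def wait_until(condition, timeout=5, interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout`
    seconds elapse; returns the truthy value, raises on timeout."""
    end = time.time() + timeout
    while time.time() < end:
        value = condition()
        if value:
            return value
        time.sleep(interval)
    raise TimeoutError("condition not met within %ss" % timeout)
```

For the patch under review, the condition would compare the current URL against the value captured before execute_script() ran, so the assertion only fires after the navigation has actually happened.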
(In reply to Andreas Tolfsen ‹:ato› from comment #86)
> (In reply to David Burns :automatedtester from comment #81)
> 
> > From try runs, the vendor update reduces the chances of it
> > happening but doesn't outright solve it.
> 
> Do you have an indication of the improvement?  Do you think there
> are more bugs in pytest downstream that cause it to occasionally
> freeze up?

We go from a ~15% intermittent rate to ~2.3%. I feel it would be worth watching after this lands to see if it's the next top priority issue.

As for pytest, I am not sure. It could be due to the other mozprocess intermittents we have on file.
Comment on attachment 8935420 [details]
Bug 1407383 - Update  metadata for WebDriver

https://reviewboard.mozilla.org/r/206312/#review212780

I'll apply this during the next wpt update.
Attachment #8935420 - Flags: review?(james) → review+
Comment on attachment 8932433 [details]
Bug 1407383 - Update Vendored packages for Web-Platform-Tests

https://reviewboard.mozilla.org/r/203478/#review212782

I think it's easier to start a wpt update rather than pre-apply this.
Attachment #8932433 - Flags: review?(james)
Attachment #8932433 - Attachment is obsolete: true
Attachment #8935420 - Attachment is obsolete: true
Pushed by dburns@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/81d47aa03334
Wait for the URL to change in test before asserting r=ato
I can’t say I see any improvement from the OrangeFactor bot.
There are 71 failures in the last 7 days.
They occur on Linux32 and Linux x64 (ccov, jsdcov, nightly), and the affected build types are debug, opt and pgo.

Log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=153448184&lineNumber=5348

[task 2017-12-27T20:18:52.540Z] 20:18:52     INFO - TEST-START | toolkit/mozapps/extensions/test/xpinstall/browser_navigateaway3.js
[task 2017-12-27T20:18:52.622Z] 20:18:52     INFO - GECKO(4930) | Sending InstallTriggered
[task 2017-12-27T20:18:52.666Z] 20:18:52     INFO - GECKO(4930) | 1514405932656	addons.xpi	DEBUG	Download started for http://example.com/browser/toolkit/mozapps/extensions/test/xpinstall/amosigned.xpi to file /tmp/tmp-bvk.xpi
[task 2017-12-27T20:18:52.666Z] 20:18:52     INFO - GECKO(4930) | 1514405932662	addons.xpi	DEBUG	Download of http://example.com/browser/toolkit/mozapps/extensions/test/xpinstall/amosigned.xpi completed.
[task 2017-12-27T20:18:52.708Z] 20:18:52     INFO - GECKO(4930) | 1514405932704	addons.repository	DEBUG	cacheAddons: enabled false IDs ["unsigned-xpi@tests.mozilla.org"]
[task 2017-12-27T20:18:52.733Z] 20:18:52     INFO - GECKO(4930) | Sending PageLoaded
[task 2017-12-27T20:26:21.477Z] 20:26:21     INFO - GECKO(4930) | 1514406381473	addons.manager	INFO	Skipping background update check
[task 2017-12-27T20:28:21.479Z] 20:28:21     INFO - GECKO(4930) | *** Blocklist::notify: Requesting http://127.0.0.1:8888/extensions-dummy/blocklistURL
[task 2017-12-27T20:28:21.495Z] 20:28:21     INFO - GECKO(4930) | *** Blocklist::onXMLLoad: there was an error during load
[task 2017-12-27T20:45:01.531Z] 20:45:01     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/mochitest/runtests.py', '--disable-e10s', '--total-chunks', '7', '--this-chunk', '3', '--jscov-dir-prefix=/builds/worker/workspace/build/blobber_upload_dir', '--appname=/builds/worker/workspace/build/application/firefox/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/a7vaVafyQsW-qx46KzZi2w/artifacts/public/build/target.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--setpref=webgl.force-enabled=true', '--quiet', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/browser-chrome-coverage_raw.log', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/browser-chrome-coverage_errorsummary.log', '--use-test-media-devices', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--sandbox-read-whitelist=/builds/worker/workspace/build', '--log-raw=-', '--flavor=browser', '--chunk-by-runtime', '--timeout=1200']
[task 2017-12-27T20:45:01.556Z] 20:45:01    ERROR - timed out after 1000 seconds of no output
[task 2017-12-27T20:45:01.556Z] 20:45:01    ERROR - Return code: -15
[task 2017-12-27T20:45:01.557Z] 20:45:01    ERROR - No suite end message was emitted by this harness.
[task 2017-12-27T20:45:01.557Z] 20:45:01     INFO - TinderboxPrint: mochitest-browser-chrome-coverage<br/>213/0/0
[task 2017-12-27T20:45:01.557Z] 20:45:01    ERROR - # TBPL FAILURE #
[task 2017-12-27T20:45:01.558Z] 20:45:01  WARNING - setting return code to 2
[task 2017-12-27T20:45:01.558Z] 20:45:01    ERROR - The mochitest suite: browser-chrome-coverage ran with return status: FAILURE
[task 2017-12-27T20:45:01.559Z] 20:45:01     INFO - Running post-action listener: _package_coverage_data
[task 2017-12-27T20:45:01.559Z] 20:45:01     INFO - Beginning compression of JSDCov artifacts...
[task 2017-12-27T20:45:01.559Z] 20:45:01     INFO - Running command: ['zip', '-r', '/builds/worker/workspace/build/blobber_upload_dir/jsdcov_artifacts.zip', '.', '-i', 'jscov*.json'] in /builds/worker/workspace/build/blobber_upload_dir
[task 2017-12-27T20:45:01.559Z] 20:45:01     INFO - Copy/paste: zip -r /builds/worker/workspace/build/blobber_upload_dir/jsdcov_artifacts.zip . -i jscov*.json
Flags: needinfo?(ato)
It doesn’t look like the pytest/py vendor upgrade made any
difference, but I don’t have any better ideas what this could be
and I don’t currently have time allocated to dig in.
Flags: needinfo?(ato)
The latest version of pytest captures logging and includes it in the output when a test fails. It's possible that this could reveal more information when the process appears to hang. To enable live logging you need pytest 3.3.1 or later, plus the command line options -s and --log-cli-level with the appropriate log level. See https://docs.pytest.org/en/3.3.1/logging.html for more information.
:automatedtester, can you find someone to look into this high frequency failure that we have had for 3+ months?
Flags: needinfo?(dburns)
(In reply to Andreas Tolfsen ‹:ato› from comment #108)
> It doesn’t look like the pytest/py vendor upgrade made any
> difference, but I don’t have any better ideas what this could be
> and I don’t currently have time allocated to dig in.

the changes I have done for vendored code have not been downstreamed from the w3c. I will get James to schedule a sync
Flags: needinfo?(dburns)
I still see failures here, I assume a sync took place sometime in the last day or so, :automatedtester, can you follow up here?
Flags: needinfo?(dburns)
No, AFAIK the sync has caused a lot of failures to appear, so it might not have been done yet. At least I don't see an updated pytest: https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/tools/pytest

James, can you supply details regarding the sync?
Flags: needinfo?(dburns) → needinfo?(james)
Component: geckodriver → web-platform-tests
:jgraham, this still shows a lot of failures- can you help prioritize this?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #120)
> :jgraham, this still shows a lot of failures- can you help prioritize this?

the wpt sync should finish today or first thing tomorrow.
Flags: needinfo?(james)
It looks like this isn't fixed by the sync: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=83bfdce958e7de595edd901f1c60686127ca22e9

I think someone needs to get a loaner, run the test in a loop until it fails and then use gdb to figure out what's going on.
(In reply to James Graham [:jgraham] from comment #122)
> It looks like this isn't fixed by the sync:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> inbound&revision=83bfdce958e7de595edd901f1c60686127ca22e9
> 
> I think someone needs to get a loaner, run the test in a loop until it fails
> and then use gdb to figure out what's going on.

I can't see this error on that push

My original work saw it move from >15% down to 2% (comment 88). I would like to see the numbers over the next few days and then see what our next steps are.
Please note that the affected test now seems to be key_shortcuts.py and not key.py. At least in those two failure cases on inbound.
I can reproduce a hang on TC, but not on my machine (so far) by running something like:

mach wpt --test-type wdspec --repeat 1000

Unfortunately I can't figure out how to debug it on TC:

* gdb -p (to attach to the existing processes) seems to be disabled on taskcluster since it would allow interacting with processes outside the container
* gdb python --args <command line above> doesn't provide me with a way to look at the child processes (afaik)
* rr doesn't work on AWS

I'm pretty sure there's some kind of deadlock here. It could be something to do with the log output, but that's only a guess. Debugging this blind is hard, so any suggestions would be appreciated.
Similar to David, I see this on and off on macOS. gdb shouldn't be usable there anymore, so are there any instructions on how best to debug this with lldb? Just in case I see it again.
I ran these tests all night on an OSX debug build and didn't see a hang, so if you can reproduce that's very interesting.

After starting lldb it looks like you do something like

process attach --pid <pid>

I don't know if there are lldb extensions for python like the gdb ones that allow py-bt, but in any case backtraces from each running python process seem interesting.
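A slightly fuller sketch of such a session (the pid is a placeholder; `thread backtrace all` dumps a backtrace for every thread, which is what we want from each Python process):

```
$ lldb
(lldb) process attach --pid 1234
(lldb) thread backtrace all
(lldb) detach
(lldb) quit
```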
So I hit this again today myself, and I tried to debug it via lldb, but the stack looks pretty useless because it is not symbolicated. I also tried pyrasite to connect to the hanging processes, but that didn't work either.

At least the last spawned process looks a bit different for the first thread's last frames:

    + 2646 connection_recv_obj  (in _multiprocessing.so) + 90  [0x10e52545a]
    +   2646 conn_recv_string  (in _multiprocessing.so) + 68  [0x10e5259c5]
    +     2646 _conn_recvall  (in _multiprocessing.so) + 49  [0x10e525a9c]
    +       2646 read  (in libsystem_kernel.dylib) + 10  

So it seems to hang in communicating with the other processes?
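For context, `_conn_recvall` is the blocking read underneath `multiprocessing.Connection.recv()`, which blocks forever if the peer never writes. A minimal illustration of that call and the `poll(timeout)` guard that would avoid an indefinite block (illustrative only, not the harness's actual code):

```python
from multiprocessing import Pipe

parent, child = Pipe()

# parent.recv() would sit in the same read() seen in the sampled stack until
# the child writes something; poll(timeout) lets the caller give up instead.
assert parent.poll(0.1) is False   # nothing has been sent yet: times out

child.send("status")
assert parent.poll(1.0) is True    # data is available now
assert parent.recv() == "status"   # returns immediately, no blocking
```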

What I can say is that it stops in between those two lines, meaning that when the issue occurs pytest does not print its info line:

> 0:04.67 TEST_START: /webdriver/tests/actions/key_shortcuts.py
> ============================================================== test session starts ==============================================================
(In reply to James Graham [:jgraham] from comment #129)
> I ran these tests all night on an OSX debug build and didn't see a hang, so
> if you can reproduce that's very interesting.

Btw, how did you run those? With `--repeat-until-unexpected`? If so, that should not work, because if we really hang that early you have to execute the full command each time and run the test only once, repeating that until it hangs.
:jgraham, this failure is still showing up at a very high rate; can we disable all tests affected by this on Linux until it is resolved? Do you have other ideas?
Flags: needinfo?(james)
Please note that there are still a lot of falsely starred entries in the list of OF.

Also this is not only related to this specific test; it happens for whichever test runs first in the job. So disabling just the first test doesn't help. We would have to completely turn off wdspec tests.
(In reply to OrangeFactor Robot from comment #139)
> 23 failures in 141 pushes (0.163 failures/push) were associated with this
> bug yesterday.    
> 

so it's 18 failures in 141 pushes (0.127 failures/push)
a 12%+ failure rate - should we hide this test or restrict it to a single config until this is fixed?
Flags: needinfo?(dburns)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #141)
> a 12%+ failure rate- should we hide this test or restrict it to a single
> config until this is fixed?

As mentioned in comment 138, that would turn off all wdspec tests, as it's not related to a single test. From a quick search of Bugzilla[1] I think there might be an underlying mozprocess issue, or a resource issue on TC, as it appears, anecdotally, that we are hitting a deadlock, probably due to I/O.

[1] https://bugzilla.mozilla.org/buglist.cgi?quicksearch=mozprocess%20timed%20out&list_id=13974869
Flags: needinfo?(dburns)
I asked if we should restrict this to a single config until this is fixed, for example linux64/debug only.  Who is working on the mozprocess issue?  Is there a bug we can track instead of this?
Flags: needinfo?(dburns)
David, you said that the underlying issue might be in pytest. How safe is this assumption? I mean in relation to a possible issue with mozprocess, as the two would be mutually exclusive.
Flags: needinfo?(james)
Sorry if I’m missing some context here, but what uses mozprocess here?
(In reply to Andreas Tolfsen ‹:ato› from comment #145)
> Sorry if I’m missing some context here, but what uses mozprocess here?

Mozprocess starts everything and makes sure that they don't just hang by killing jobs that don't output for 1000s.
Flags: needinfo?(dburns)
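Conceptually, the no-output watchdog described above looks something like this sketch (function name and structure are illustrative, not the real mozprocess implementation):

```python
import subprocess
import sys
import threading
import time

def run_with_output_timeout(cmd, output_timeout):
    # Simplified sketch of a no-output watchdog: kill the child if it stays
    # silent for `output_timeout` seconds, like the "timed out after 1000
    # seconds of no output" path in the logs above.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    last_output = [time.time()]
    lines = []

    def reader():
        # Each line read from the child refreshes the "last seen output" time.
        for line in proc.stdout:
            last_output[0] = time.time()
            lines.append(line)

    t = threading.Thread(target=reader)
    t.start()
    timed_out = False
    while proc.poll() is None:
        if time.time() - last_output[0] > output_timeout:
            proc.kill()  # no output for too long: tear the child down
            timed_out = True
            break
        time.sleep(0.05)
    t.join()
    proc.wait()
    return proc.returncode, lines, timed_out

# A quiet, fast child exits normally well before the timeout:
rc, out, timed_out = run_with_output_timeout(
    [sys.executable, "-c", "print('ok')"], output_timeout=30)
```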
(In reply to Henrik Skupin (:whimboo) from comment #144)
> David, you said that the underlying issue might be in pytest. How safe is
> this assumption? I mean in relation to a possible issue with mozprocess,
> which both would be mutual exclusive.

I know that pytest is not returning. I am 100% confident in that. Why it's not returning I have no idea, and I know James hasn't been able to debug it either.

James' theory, which makes the most sense, is that there is an I/O deadlock somewhere. If that is the case then there might be a bigger issue, which is why I searched for mozprocess intermittents today; what I found shocked me.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #143)
> I asked if we should restrict this to a single config until this is fixed,
> for example linux64/debug only.  Who is working on the mozprocess issue?  Is
> there a bug we can track instead of this?

No one as far as I know. I thought I would look it up today to see if there is a trend, and clearly there is. I will raise one and put all the mozprocess bugs there as deps, which we can then clear up once we know it's not something more sinister underlying them.
I have put some time into this, e.g. I set up the Docker image from Taskcluster locally and tried to reproduce. But so far it seems to be easy to reproduce in CI and hard to debug there, and hard to reproduce outside CI, where it would be eas[y|ier] to debug. AFAIK only whimboo has reported seeing the hang outside the CI environment, and it's not 100% clear if that's the same issue.

I agree that the situation in CI is currently untenable. I don't really object to reducing the number of Wd jobs in the short term, but ultimately someone needs to dedicate some time to actually resolving the issue, and I haven't been able to do that recently.
David, could you please give me some pointers to where I would have to add debug print lines to actually check whether I see the identical problem with pytest? I don't know where to start, and your input could shorten the time drastically.
Flags: needinfo?(dburns)
And one note... we actually use mozprocess to start the wpt tests via mozharness. So keep in mind that it has its own copy of mozprocess in place. To reproduce this output, you should use mozharness. Also this 1000s timeout means that no output has been given in that time, which underpins the comment from David about the deadlock above.

Yes, I was able to see the deadlock on and off locally, and if I know where to get started with debugging, I can jump into that.
(In reply to Henrik Skupin (:whimboo) from comment #151)
> And one note... we actually use mozprocess to start the wpt tests via
> mozharness. So keep in mind that it has its own copy of mozprocess in place.
> To reproduce this output, you should use mozharness. Also this 1000s timeout
> means that no output has been given in that time, which underpins the
> comment from David about the deadlock above.
> 
> Yes, I was able to see the deadlock one and off locally, and if I know where
> to get started with debugging, I can jump into that.

If it's pytest as the cause, then https://searchfox.org/mozilla-central/source/testing/web-platform/tests/tools/wptrunner/wptrunner/executors/pytestrunner/runner.py#54-63 is what will hang.
Flags: needinfo?(dburns)
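One way to confirm that this call is the one that never returns is to wrap it with a watchdog timer that fires if it does not come back in time. A sketch with illustrative names, not actual harness code:

```python
import threading

def call_with_watchdog(fn, timeout, on_hang):
    # Run fn() on the current thread, but arm a timer that fires on_hang()
    # (e.g. a stack dump or a log line) if fn() has not returned within
    # `timeout` seconds. The timer is cancelled on a normal return.
    timer = threading.Timer(timeout, on_hang)
    timer.start()
    try:
        return fn()
    finally:
        timer.cancel()

# Demo: a call that returns promptly never triggers the hang callback.
hung = []
result = call_with_watchdog(lambda: 42, timeout=5.0,
                            on_hang=lambda: hung.append(True))
```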
Fun discovery from the weekend: add enough logging and this stops happening.
Using a specific test in the summary is misleading, as mentioned a couple of times earlier here. So we should really get rid of it and instead prefix the summary with the test job symbol for better starring.
Summary: Intermittent /webdriver/tests/actions/key.py Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', → Intermittent Wd failure: Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-',
Summary: Intermittent Wd failure: Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', → Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-',
There are 87 failures in the past 7 days.
Platforms: Linux opt/debug, Linux x64 opt/debug/pgo, linux32-nightly opt, linux32-devedition opt, linux64-ccov opt, linux64-jsdcov opt, linux64-nightly opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=160524801

Relevant part of the log:
[task 2018-02-06T01:40:01.065Z] 01:40:01     INFO - SUITE-START | Running 54 tests
[task 2018-02-06T01:40:01.065Z] 01:40:01     INFO - Running wdspec tests
[task 2018-02-06T01:40:01.065Z] 01:40:01     INFO - TEST-START | /webdriver/tests/actions/key.py
[task 2018-02-06T01:40:01.065Z] 01:40:01     INFO - TEST-SKIP | /webdriver/tests/actions/key.py | took 0ms
[task 2018-02-06T01:40:01.066Z] 01:40:01     INFO - TEST-START | /webdriver/tests/contexts/maximize_window.py
[task 2018-02-06T01:40:01.066Z] 01:40:01     INFO - TEST-SKIP | /webdriver/tests/contexts/maximize_window.py | took 0ms
[task 2018-02-06T01:40:01.066Z] 01:40:01     INFO - Starting http server on web-platform.test:8001
[task 2018-02-06T01:40:01.070Z] 01:40:01     INFO - Starting runner
[task 2018-02-06T01:40:01.070Z] 01:40:01     INFO - Starting http server on web-platform.test:8443
[task 2018-02-06T01:40:01.097Z] 01:40:01     INFO - PID 840 | 1517881201093	geckodriver	INFO	geckodriver 0.19.1 ( 'e6138707451a'  '2018-02-05 20:17 -0500')
[task 2018-02-06T01:40:01.101Z] 01:40:01     INFO - PID 840 | 1517881201096	geckodriver	INFO	Listening on 127.0.0.1:4444
[task 2018-02-06T01:40:01.602Z] 01:40:01     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2018-02-06T01:40:01.602Z] 01:40:01     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2018-02-06T01:56:41.662Z] 01:56:41     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/XXPgFAK3QL-xbZQq71hzZA/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--no-pause-after-test', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2018-02-06T01:56:41.663Z] 01:56:41    ERROR - timed out after 1000 seconds of no output
[task 2018-02-06T01:56:41.663Z] 01:56:41    ERROR - Return code: -15
[task 2018-02-06T01:56:41.664Z] 01:56:41    ERROR - No suite end message was emitted by this harness.
[task 2018-02-06T01:56:41.664Z] 01:56:41    ERROR - # TBPL FAILURE #
[task 2018-02-06T01:56:41.665Z] 01:56:41  WARNING - setting return code to 2
[task 2018-02-06T01:56:41.665Z] 01:56:41     INFO - Running post-action listener: _package_coverage_data
[task 2018-02-06T01:56:41.665Z] 01:56:41     INFO - Running post-action listener: _resource_record_post_action
[task 2018-02-06T01:56:41.665Z] 01:56:41     INFO - [mozharness: 2018-02-06 01:56:41.665687Z] Finished run-tests step (success)
[task 2018-02-06T01:56:41.666Z] 01:56:41     INFO - Running post-run listener: _resource_record_post_run
[task 2018-02-06T01:56:41.738Z] 01:56:41     INFO - Total resource usage - Wall time: 1021s; CPU: 1.0%; Read bytes: 442368; Write bytes: 444174336; Read time: 24; Write time: 24936
[task 2018-02-06T01:56:41.738Z] 01:56:41     INFO - TinderboxPrint: CPU usage<br/>0.6%
[task 2018-02-06T01:56:41.738Z] 01:56:41     INFO - TinderboxPrint: I/O read bytes / time<br/>442,368 / 24
[task 2018-02-06T01:56:41.739Z] 01:56:41     INFO - TinderboxPrint: I/O write bytes / time<br/>444,174,336 / 24,936
[task 2018-02-06T01:56:41.739Z] 01:56:41     INFO - TinderboxPrint: CPU idle<br/>4,038.5 (99.4%)
[task 2018-02-06T01:56:41.739Z] 01:56:41     INFO - TinderboxPrint: Swap in / out<br/>0 / 0
[task 2018-02-06T01:56:41.740Z] 01:56:41     INFO - pull - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2018-02-06T01:56:41.741Z] 01:56:41     INFO - install - Wall time: 17s; CPU: 25.0%; Read bytes: 0; Write bytes: 9338880; Read time: 0; Write time: 484
[task 2018-02-06T01:56:41.751Z] 01:56:41     INFO - run-tests - Wall time: 1004s; CPU: 0.0%; Read bytes: 204800; Write bytes: 434835456; Read time: 16; Write time: 24452
[task 2018-02-06T01:56:42.030Z] 01:56:42     INFO - Running post-run listener: _upload_blobber_files
[task 2018-02-06T01:56:42.030Z] 01:56:42  WARNING - Blob upload gear skipped. Missing cmdline options.
[task 2018-02-06T01:56:42.030Z] 01:56:42     INFO - Running post-run listener: copy_logs_to_upload_dir
[task 2018-02-06T01:56:42.031Z] 01:56:42     INFO - Copying logs to upload dir...
[task 2018-02-06T01:56:42.031Z] 01:56:42     INFO - mkdir: /builds/worker/workspace/build/upload/logs
[task 2018-02-06T01:56:42.032Z] 01:56:42     INFO - Copying logs to upload dir...
[task 2018-02-06T01:56:42.034Z] 01:56:42  WARNING - returning nonzero exit status 2
This bug has failed 85 times in the last 7 days. It failed only on Linux affecting debug, pgo and opt build types.

Failing tests: debug-web-platform-tests-wdspec-e10, opt-web-platform-tests-wdspec-e10, opt-web-platform-tests-wdspec-e10, opt-test-verify-wpt-e10.

Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=162310358&lineNumber=1222

Part of that log:  01:30:07     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2018-02-15T01:46:47.542Z] 01:46:47     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/YmxWjjgmTZy5PdEjMJy1FQ/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--no-pause-after-test', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2018-02-15T01:46:47.559Z] 01:46:47    ERROR - timed out after 1000 seconds of no output
[task 2018-02-15T01:46:47.560Z] 01:46:47    ERROR - Return code: -15
[task 2018-02-15T01:46:47.560Z] 01:46:47    ERROR - No suite end message was emitted by this harness.
[task 2018-02-15T01:46:47.561Z] 01:46:47    ERROR - # TBPL FAILURE #
[task 2018-02-15T01:46:47.561Z] 01:46:47  WARNING - setting return code to 2
[task 2018-02-15T01:46:47.562Z] 01:46:47     INFO - Running post-action listener: _package_coverage_data
[task 2018-02-15T01:46:47.562Z] 01:46:47     INFO - Running post-action listener: _resource_record_post_action
[task 2018-02-15T01:46:47.562Z] 01:46:47     INFO - [mozharness: 2018-02-15 01:46:47.561867Z] Finished run-tests step (success)
[task 2018-02-15T01:46:47.562Z] 01:46:47     INFO - Running post-run listener: _resource_record_post_run
[task 2018-02-15T01:46:47.646Z] 01:46:47     INFO - Total resource usage - Wall time: 1022s; CPU: 1.0%; Read bytes: 241664; Write bytes: 524836864; Read time: 4; Write time: 93468

The WPT key tests are disabled due to some problem in wptrunner.

new file mode 100644
--- /dev/null
+++ b/testing/web-platform/meta/webdriver/tests/actions/key.py.ini
@@ -0,0 +1,4 @@
+[key.py]
+  type: wdspec
+  disabled:
+    if (os == "linux"): https://bugzilla.mozilla.org/show_bug.cgi?id=1407383

Bug 1421289 needs to be fixed in order to make progress on the actual bug.
I really propose that we re-enable actions/key.py, given that the test module itself is not the problem here. We shouldn't see an increase in failures. The test was just inappropriately disabled a long time ago.
(In reply to Henrik Skupin (:whimboo) from comment #174)
> I really propose that we re-enable actions/key.py given that the test module
> itself is not the problem here. We shouldn't see an increase of failures.
> The test was just inappropriately disabled a long time ago.

Sounds reasonable to me.
will we re-enable that in this bug or somewhere else?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #176)
> will we re-enable that in this bug or somewhere else?

You skipped the test without a patch attached to this bug, so I assume we can do the same to unskip?
Flags: needinfo?(jmaher)
can you link to the bug where we will enable it?
Flags: needinfo?(jmaher)
We should do it here. See your comment 12.
Flags: needinfo?(jmaher)
for disabling tests we often just disable the test, not add a patch; if you want to revert that patch, go ahead!  if there are other factors other than removing the .ini file, it probably is worth a review.
Flags: needinfo?(jmaher)
Ok, we cannot re-enable key.py yet due to perma timeout failures. There are no restarts of Firefox in between each sub-test, but the very first start of Firefox seems to take extraordinarily long:

[task 2018-02-16T12:24:42.511Z] 12:24:42     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/key.py::test_lone_keyup_sends_no_events 
[task 2018-02-16T12:25:09.626Z] 12:25:09     INFO - PID 843 | GLib-GIO-Message: Using the 'memory' GSettings backend.  Your settings will not be saved or shared with other applications.
[task 2018-02-16T12:25:10.320Z] 12:25:10     INFO - PID 843 | 1518783910314	Marionette	INFO	Listening on port 2828
[task 2018-02-16T12:25:11.003Z] 12:25:11     INFO - STDOUT: PASSED

Here it took exactly 28 seconds until Firefox was ready. That is actually a different problem.
This bug
- Depends on:	1421289, 1418227
- Blocks:	1429391, 1433861

I can see there is a lot of work on this bug; as an update I want to add the following:

There are 81 failures in the last 7 days and 242 failures in the last 21 days.
- Linux x64: 50
- Linux: 23
- linux32-nightly: 2
- linux64-ccov: 3
- linux64-jsdcov: 1
- linux64-nightly: 1
- OS X 10.10: 1

- opt: 41
- debug: 27
- pgo: 13

Here is a recent log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=164092702&lineNumber=2952

[task 2018-02-24T05:37:25.838Z] 05:37:25     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/Eq0ruhAHRN6GlG2-xuqcqQ/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--no-pause-after-test', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2018-02-24T05:37:25.846Z] 05:37:25    ERROR - timed out after 1000 seconds of no output
[task 2018-02-24T05:37:25.846Z] 05:37:25    ERROR - Return code: -15
[task 2018-02-24T05:37:25.846Z] 05:37:25    ERROR - No suite end message was emitted by this harness.
[task 2018-02-24T05:37:25.846Z] 05:37:25    ERROR - # TBPL FAILURE #
[task 2018-02-24T05:37:25.846Z] 05:37:25  WARNING - setting return code to 2
Flags: needinfo?(james)
:jgraham, can we disable wdspec on linux?
I was trying to reproduce this hang with a 1-click loaner and got it fairly quickly. Sadly I cannot use GDB to get stacktraces of each of the processes because ptrace is not allowed:

> # sudo gdb --pid=3586
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
> [..]
> Attaching to process 3586
> Could not attach to process.  If your uid matches the uid of the target
> process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
> again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
> ptrace: Operation not permitted.

I also cannot turn off the restriction:

> # echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
> tee: /proc/sys/kernel/yama/ptrace_scope: Read-only file system
> 0

At least by inspecting the log I noticed that the hang happens after `Executor setup done`, which is NOT printed to the public logs as uploaded to taskcluster:

{"source": "web-platform-tests", "thread": "Thread-TestrunnerManager-1", "time": 1520242899539, "action": "log", "message": "Executor setup done", "level": "DEBUG", "pid": 3586}
^C{"source": "web-platform-tests", "thread": "Thread-TestrunnerManager-1", "time": 1520243299983, "action": "log", "message": "TestRunnerManager main loop terminating, starting cleanup", "level": "DEBUG", "pid": 3586}
Here are the instructions on how to reproduce the hang on a one-click loaner:

1) Create a one-click loaner based on https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=e0a1ba39c48913e4e88efe4fa289c23454ff5181&selectedJob=165824714
2) Run the execution step and stop it when tests are started
3) Save this attachment as run.sh and make it executable
4) Run ./run.sh and wait until it hangs
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #193)
> :jgraham, can we disable wdspec on linux?

Redirecting this to maja_zf.
Flags: needinfo?(james) → needinfo?(mjzffr)
The hang is actually caused by `__import__(modname)`, where `modname` is `tests.conftest`:

https://searchfox.org/mozilla-central/source/testing/web-platform/tests/tools/third_party/py/py/_path/local.py#668
https://searchfox.org/mozilla-central/source/testing/web-platform/tests/webdriver/tests/conftest.py

I'm going to reduce the content of this file now to see what causes it. I suspect it is the import of pytest.
Rather than disabling the entire suite, I like the idea of setting it to Tier-2 for a couple of months until we get the most broad, infrastructure-related intermittents figured out. The individual tests, for the most part, are fine and provide useful information to developers working on the features being tested.
Flags: needinfo?(mjzffr)
Reducing conftest.py showed that the hang actually comes from `fixtures.py`:

https://searchfox.org/mozilla-central/source/testing/web-platform/tests/webdriver/tests/support/fixtures.py

In that file I can remove everything except the following line:

> import webdriver

So something in that package keeps us hanging. More updates follow...
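This kind of bisection can be automated; a minimal Python 3 sketch (the helper name is mine, not from the harness) that wraps a suspect import in an alarm so a hang turns into an exception instead of a timeout kill:

```python
import signal


def import_with_timeout(name, seconds=5):
    """Import `name`, raising TimeoutError if the import blocks for
    longer than `seconds`. Unix only; relies on SIGALRM interrupting
    the main thread (which CPython 3 allows even inside a lock wait)."""
    def on_alarm(signum, frame):
        raise TimeoutError("import of %r hung" % name)

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)
    try:
        return __import__(name)
    finally:
        signal.alarm(0)                            # cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)  # restore previous handler
```

Running this against each import in the reduced conftest.py narrows the hang down one module at a time.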
I finally traced it down to StructuredLogger, specifically its `__init__` method:

https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozlog/mozlog/structuredlog.py#149

The hang happens in `with self._lock`. James, given that you know mozlog better than I do, any advice on good next steps?
Flags: needinfo?(james)
So, I would really like to see what happens if you apply the following patch to log all activity on that lock:

diff --git a/testing/mozbase/mozlog/mozlog/structuredlog.py b/testing/mozbase/mozlog/mozlog/structuredlog.py
index 74a7ef55e3f4..712560a0eeda 100644
--- a/testing/mozbase/mozlog/mozlog/structuredlog.py
+++ b/testing/mozbase/mozlog/mozlog/structuredlog.py
@@ -7,6 +7,7 @@ from __future__ import absolute_import, print_function, unicode_literals
 from multiprocessing import current_process
 from threading import current_thread, Lock
 import json
+import os
 import sys
 import time
 import traceback
@@ -69,6 +70,27 @@ Subfields for all messages:
 _default_logger_name = None
 
 
+class LoggedLock(object):
+    def __init__(self):
+        self._lock = Lock()
+
+    def acquire(self):
+        os.write(2, "Lock requested %s %s\n" % (current_process().pid, current_thread().name))
+        rv = self._lock.acquire()
+        os.write(2, "Lock acquired %s %s\n" % (current_process().pid, current_thread().name))
+        return rv
+
+    def release(self):
+        os.write(2, "Lock released by %s %s\n" % (current_process().pid, current_thread().name))
+        return self._lock.release()
+
+    def __enter__(self):
+        return self.acquire()
+
+    def __exit__(self, *args, **kwargs):
+        return self.release()
+
+
 def get_default_logger(component=None):
     """Gets the default logger if available, optionally tagged with component
     name. Will return None if not yet set
@@ -134,7 +156,7 @@ class ComponentState(object):
 
 
 class StructuredLogger(object):
-    _lock = Lock()
+    _lock = LoggedLock()
     _logger_states = {}
     """Create a structured logger with the given name
 
diff --git a/testing/web-platform/tests/tools/webdriver/webdriver/client.py b/testing/web-platform/tests/tools/webdriver/webdriver/client.py
index 622c5fce76ac..86209a366691 100644
--- a/testing/web-platform/tests/tools/webdriver/webdriver/client.py
+++ b/testing/web-platform/tests/tools/webdriver/webdriver/client.py
@@ -6,7 +6,10 @@ import transport
 
 from six import string_types
 
+import os
+os.write(2, "Before get_default_logger\n")
 from mozlog import get_default_logger
+os.write(2, "After get_default_logger\n")
 
 logger = get_default_logger()
 



That's against m-c so you may need to edit the paths to get it to apply.


I suspect what's going on is that we are randomly inheriting a locked lock into the subprocess due to the Incredible Brokenness of Python Multithreading. But I'd like to confirm that's what's going on.


Irrespective of that, we don't actually seem to use the logger anywhere? So we could delete that line and stop using mozlog in the conftest file, and the problem will go away. Apologies for not spotting this earlier; in retrospect it seems like exactly the kind of thing that could cause exactly this kind of problem.
Flags: needinfo?(james)
*Multiprocessing. Multithreading is less broken, just not fit for (many) purpose(s).
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Priority: P5 → P1
Also I don't see why we are hanging here while trying to acquire the lock, as it looks like the same lock instance gets released right before:

Lock requested 14850 Thread-TestrunnerManager-1 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0>
Lock acquired 14850 Thread-TestrunnerManager-1 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0>
{"source": "web-platform-tests", "thread": "Thread-TestrunnerManager-1", "time": 1520281029594, "action": "log", "message": "Executor setup done
", "level": "DEBUG", "pid": 14850}
Lock released by 14850 Thread-TestrunnerManager-1 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0>
*** before self._lockLock requested 14890 Thread-2 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0>
(In reply to James Graham [:jgraham] from comment #203)
> Irrespective of that, we don't actually seem to use the logger anywhere? So
> we could delete that line and stop using mozlog in the conftest file and the
> problem will go away. Apologies for not spotting this earlier. In retrospect

This doesn't help. We still hang even with the lines removed in `fixtures.py`.
Comment on attachment 8956221 [details]
Bug 1407383 - [wdspec] Remove usage of mozlog to prevent process lock.

https://reviewboard.mozilla.org/r/225132/#review231182

::: testing/web-platform/tests/webdriver/tests/support/fixtures.py:20
(Diff revision 3)
>  def ignore_exceptions(f):
>      def inner(*args, **kwargs):
>          try:
>              return f(*args, **kwargs)
>          except webdriver.error.WebDriverException as e:
> -            logger.warning("Ignored exception %s" % e)
> +            print >> sys.stderr, "Ignored exception %s" % e

Not sure what I should make of this lint failure:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=14d058dbd80c9cb2b5645f55012c61e5b3f8f454&selectedJob=166153169

Does it mean I have to use a print function or not? The wording is kinda strange. Also, I don't see this failure when running the linter locally with the exact same arguments.
Attachment #8956222 - Flags: review?(james)
Attachment #8956221 - Flags: review?(james)
Attachment #8951549 - Flags: review?(james)
Comment on attachment 8956221 [details]
Bug 1407383 - [wdspec] Remove usage of mozlog to prevent process lock.

https://reviewboard.mozilla.org/r/225132/#review231224
Attachment #8956221 - Flags: review?(james) → review+
Comment on attachment 8956222 [details]
Bug 1407383 - [wdspec] Remove unused imports from fixtures module.

https://reviewboard.mozilla.org/r/225134/#review231228
Attachment #8956222 - Flags: review?(james) → review+
Comment on attachment 8951549 [details]
Bug 1407383 - [wdspec] Re-enable key.py on Linux64.

https://reviewboard.mozilla.org/r/220862/#review231226
Attachment #8951549 - Flags: review?(james) → review+
Comment on attachment 8956221 [details]
Bug 1407383 - [wdspec] Remove usage of mozlog to prevent process lock.

https://reviewboard.mozilla.org/r/225132/#review231182

> Not sure what I should make out of this lint failure:
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=14d058dbd80c9cb2b5645f55012c61e5b3f8f454&selectedJob=166153169
> 
> Does it mean I have to use a print function or not? The wording is kinda strange. Also i don't see this failure when running the linter locally with the exact same arguments.

The problem was that the file needed to be added to the whitelist for the linter to allow the usage of `print` at all.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/880cc555591a
[wdspec] Remove unused imports from fixtures module. r=jgraham
https://hg.mozilla.org/integration/autoland/rev/c26cb6d150d5
[wdspec] Remove usage of mozlog to prevent process lock. r=jgraham
https://hg.mozilla.org/integration/autoland/rev/bf1d69a11d6f
[wdspec] Re-enable key.py on Linux64. r=jgraham
Created web-platform-tests PR https://github.com/w3c/web-platform-tests/pull/9872 for changes under testing/web-platform/tests
Whiteboard: [stockwell disable-recommended] → [stockwell disable-recommended][wptsync upstream]
Can't merge web-platform-tests PR due to failing upstream checks:
Github PR https://github.com/w3c/web-platform-tests/pull/9872
* continuous-integration/travis-ci/pr (https://travis-ci.org/w3c/web-platform-tests/builds/349768613?utm_source=github_status&utm_medium=notification)
Whiteboard: [stockwell disable-recommended][wptsync upstream] → [stockwell disable-recommended][wptsync upstream error]
Whiteboard: [stockwell disable-recommended][wptsync upstream error] → [stockwell disable-recommended][wptsync upstream]
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Whiteboard: [stockwell disable-recommended][wptsync upstream] → [stockwell fixed][wptsync upstream]
Target Milestone: --- → mozilla60
(In reply to Andreea Pavel [:apavel] from comment #230)
> Recent failure log:
> https://treeherder.mozilla.org/logviewer.html#?job_id=167034813&repo=autoland&lineNumber=377485

This has nothing to do with this bug, and is something completely different. I will file a new bug.
Flags: needinfo?(hskupin)
Summary: Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', → Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-'] due to hang in mozlog
Upstream PR merged
See Also: → 1516293