Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-'] due to hang in mozlog

RESOLVED FIXED in Firefox 60

Status

defect
P1
normal
RESOLVED FIXED
2 years ago
2 months ago

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

Tracking

(Depends on 1 bug, {intermittent-failure})

Version 3
mozilla60
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox58 disabled, firefox59 disabled, firefox60 fixed)

Details

(Whiteboard: [stockwell fixed][wptsync upstream])

Attachments

(6 attachments, 4 obsolete attachments)

59 bytes, text/x-review-board-request
ato
: review+
Details
59 bytes, text/x-review-board-request
jgraham
: review+
Details
1.39 KB, text/plain
Details
3.06 KB, text/plain
Details
59 bytes, text/x-review-board-request
jgraham
: review+
Details
59 bytes, text/x-review-board-request
jgraham
: review+
Details
Filed by: archaeopteryx [at] coole-files.de

https://treeherder.mozilla.org/logviewer.html#?job_id=135990490&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/eHR4XOXGRtmWRls2xymnfw/runs/0/artifacts/public/logs/live_backing.log

[task 2017-10-10T17:37:20.644Z] 17:37:20     INFO - TEST-START | /webdriver/tests/actions/key.py
[task 2017-10-10T17:54:00.677Z] 17:54:00     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/DcV_FpLGTCOfRXX2wSyYiA/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2017-10-10T17:54:00.714Z] 17:54:00    ERROR - timed out after 1000 seconds of no output
[task 2017-10-10T17:54:00.715Z] 17:54:00    ERROR - Return code: -15
[task 2017-10-10T17:54:00.715Z] 17:54:00    ERROR - No suite end message was emitted by this harness.
This particular failure is the earliest one; I couldn't find any other before it: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=656d53c7fd810507564844a6b0b732c116ad8efd&filter-searchStr=linux%20x64%20web-platform%20wdspec&selectedJob=135990490

:ato Could you please take a look at this bug?
Flags: needinfo?(ato)
Whiteboard: [stockwell needswork]
Depends on: 1409049
The orange factor report [1] seems to indicate that this is due to
a timeout, and because it mostly occurs on pgo and debug builds it
seems probable that the timeout of the test is too short.

I will try increasing it by marking it as a ‘long’ test in
https://bugzil.la/1409049.

  [1] https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1407383&entireHistory=true&tree=all
Flags: needinfo?(ato)
(In reply to Andreas Tolfsen ‹:ato› from comment #4)

> The orange factor report [1] seems to indicate that this is due to
> a timeout, and because it mostly occurs on pgo and debug builds it
> seems probable that the timeout of the test is too short.

Scratch that.  This is an overall 1000-second (16.6-minute)
timeout from mozprocess, meaning wptrunner or something else stops
responding.  I would expect wptrunner to abort the test much sooner
and return with a TEST-TIMEOUT status.
No longer depends on: 1409049
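For context, the 1000 seconds here is mozprocess's no-output timeout: the harness process is sent SIGTERM after that long with no new lines on stdout, which is why the log shows "Return code: -15". A minimal sketch of that watchdog pattern (illustrative only, not mozprocess's actual code):

```python
import subprocess
import threading
import time

def run_with_output_timeout(cmd, output_timeout):
    """Run cmd, terminating it if it stays silent for output_timeout seconds.

    A simplified sketch of mozprocess's no-output timeout; the real
    implementation lives in testing/mozbase/mozprocess.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    last_output = [time.time()]
    timed_out = [False]

    def watchdog():
        # Poll for silence; terminate() sends SIGTERM, hence "Return code: -15".
        while proc.poll() is None:
            if time.time() - last_output[0] > output_timeout:
                timed_out[0] = True
                proc.terminate()
                return
            time.sleep(0.1)

    threading.Thread(target=watchdog, daemon=True).start()
    for _line in proc.stdout:
        last_output[0] = time.time()
    proc.wait()
    return proc.returncode, timed_out[0]
```

With this shape, a harness that hangs before producing any output is killed from the outside, exactly matching the "timed out after 1000 seconds of no output" lines in the logs above.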
:ato Will you be able to look into this?
Flags: needinfo?(ato)
(In reply to Henrietta Maior [:henrietta_maior] from comment #7)

> :ato Will you be able to look into this?

I already replied to your needinfo request.  I believe this is
something more systemic in wptrunner, or an infrastructure issue.

Cc’ing jgraham to see if he can shed some light on this, or at
least move the bug to the correct component.
Flags: needinfo?(ato)
Flags: needinfo?(james)
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/879f2ee477b4
Disable /webdriver/tests/actions/key.py on linux for frequent failures. r=me, a=testonly
I disabled this test for having >200 failures in the last 30 days; please remember to re-enable it when testing a fix
Keywords: leave-open
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
maja_zf: Just to make you aware, the WPT key tests are now disabled
due to some problem in wptrunner.
To be clear, the diagnosis that the problem's in wptrunner doesn't strike me as very certain. wptrunner runs a timer thread that stops waiting for the pytest results if they exceed the timeout, and presumably that works at least sometimes, since we do get TIMEOUT results. But in this case we apparently don't get that far. That probably is a bug in wptrunner, in that it's not handling some behaviour of pytest, but I expect that something bad is happening when the test runs so that the Python interpreter never gets to run the main thread; we therefore never report an EXTERNAL-TIMEOUT, never kill the TestRunner process, and never continue.

Without someone managing to debug the problem I don't think it's clear what the underlying fault is.
Flags: needinfo?(james)
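The timer-thread pattern described above can be sketched as follows (a simplified illustration, not wptrunner's actual code; the point is that the EXTERNAL-TIMEOUT path only works if the waiting thread gets to run at all):

```python
import threading

def run_with_test_timeout(func, timeout):
    """Run func on a worker thread and give up after `timeout` seconds.

    Sketch of the wptrunner-style pattern: the harness waits on the
    thread running the test and reports EXTERNAL-TIMEOUT if it does
    not finish in time. If the interpreter never schedules this
    waiting thread (e.g. the worker is stuck in code that never
    yields), the timeout is never reported, matching the hang
    described in this bug.
    """
    result = {}

    def target():
        result["value"] = func()

    worker = threading.Thread(target=target)
    worker.daemon = True
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return ("EXTERNAL-TIMEOUT", None)
    return ("OK", result.get("value"))
```

Example: `run_with_test_timeout(lambda: 42, 5)` returns `("OK", 42)`, while a function that blocks past the timeout yields `("EXTERNAL-TIMEOUT", None)`.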
Please also see bug 1391545 for a similar output affecting Marionette tests.
I don't think this failure has anything to do with the test that got disabled (key.py). It just happened to be ordered as the first test to run. While investigating another intermittent, I'm still seeing this failure a lot despite key.py being disabled.
Now these are failing on:
TEST-START | /webdriver/tests/actions/modifier_click.py

So comment 19 is spot on: we are not executing any tests, we are just hanging.
This failure has occurred 100 times in the last 7 days. It has failed only on Linux x64, on opt, pgo, and debug build types.
Failing test suites: tests-wdspec-e10s, marionette-e10s, opt-mochitest-3, platform-tests-e10s.

Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=144012458&lineNumber=1229 

Part of the log: 

[task 2017-11-11T19:21:33.560Z] 19:21:33     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2017-11-11T19:21:33.560Z] 19:21:33     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2017-11-11T19:38:13.591Z] 19:38:13     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/daJK4fMkTQC-LumF6CRNug/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2017-11-11T19:38:13.625Z] 19:38:13    ERROR - timed out after 1000 seconds of no output
[task 2017-11-11T19:38:13.626Z] 19:38:13    ERROR - Return code: -15
[task 2017-11-11T19:38:13.626Z] 19:38:13    ERROR - No suite end message was emitted by this harness.
[task 2017-11-11T19:38:13.627Z] 19:38:13    ERROR - # TBPL FAILURE #
[task 2017-11-11T19:38:13.627Z] 19:38:13  WARNING - setting return code to 2
[task 2017-11-11T19:38:13.628Z] 19:38:13     INFO - Running post-action listener: _package_coverage_data
[task 2017-11-11T19:38:13.628Z] 19:38:13     INFO - Running post-action listener: _resource_record_post_action
[task 2017-11-11T19:38:13.628Z] 19:38:13     INFO - [mozharness: 2017-11-11 19:38:13.628489Z] Finished run-tests step (success)
[task 2017-11-11T19:38:13.628Z] 19:38:13     INFO - Running post-run listener: _resource_record_post_run
[task 2017-11-11T19:38:13.728Z] 19:38:13     INFO - Total resource usage - Wall time: 1022s; CPU: 1.0%; Read bytes: 1077248; Write bytes: 1633046528; Read time: 60; Write time: 1293480
[task 2017-11-11T19:38:13.728Z] 19:38:13     INFO - TinderboxPrint: CPU usage<br/>0.9%
[task 2017-11-11T19:38:13.728Z] 19:38:13     INFO - TinderboxPrint: I/O read bytes / time<br/>1,077,248 / 60
[task 2017-11-11T19:38:13.729Z] 19:38:13     INFO - TinderboxPrint: I/O write bytes / time<br/>1,633,046,528 / 1,293,480
[task 2017-11-11T19:38:13.729Z] 19:38:13     INFO - TinderboxPrint: CPU idle<br/>4,038.3 (99.1%)
[task 2017-11-11T19:38:13.729Z] 19:38:13     INFO - TinderboxPrint: Swap in / out<br/>0 / 0
[task 2017-11-11T19:38:13.730Z] 19:38:13     INFO - pull - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2017-11-11T19:38:13.732Z] 19:38:13     INFO - install - Wall time: 18s; CPU: 28.0%; Read bytes: 139264; Write bytes: 33554432; Read time: 28; Write time: 110332
[task 2017-11-11T19:38:13.743Z] 19:38:13     INFO - run-tests - Wall time: 1004s; CPU: 0.0%; Read bytes: 937984; Write bytes: 1595269120; Read time: 32; Write time: 1182940
[task 2017-11-11T19:38:14.070Z] 19:38:14     INFO - Running post-run listener: _upload_blobber_files
[task 2017-11-11T19:38:14.070Z] 19:38:14  WARNING - Blob upload gear skipped. Missing cmdline options.
[task 2017-11-11T19:38:14.070Z] 19:38:14     INFO - Running post-run listener: copy_logs_to_upload_dir
[task 2017-11-11T19:38:14.071Z] 19:38:14     INFO - Copying logs to upload dir...
[task 2017-11-11T19:38:14.071Z] 19:38:14     INFO - mkdir: /builds/worker/workspace/build/upload/logs
[task 2017-11-11T19:38:14.073Z] 19:38:14     INFO - Copying logs to upload dir...
[task 2017-11-11T19:38:14.075Z] 19:38:14  WARNING - returning nonzero exit status 2
[task 2017-11-11T19:38:14.099Z] cleanup
[task 2017-11-11T19:38:14.099Z] + cleanup
[task 2017-11-11T19:38:14.099Z] + local rv=2
[task 2017-11-11T19:38:14.099Z] + [[ -s /builds/worker/.xsession-errors ]]
[task 2017-11-11T19:38:14.099Z] + cp /builds/worker/.xsession-errors /builds/worker/artifacts/public/xsession-errors.log
[task 2017-11-11T19:38:14.101Z] + true
[task 2017-11-11T19:38:14.101Z] + cleanup_xvfb
[task 2017-11-11T19:38:14.102Z] pidof Xvfb
[task 2017-11-11T19:38:14.102Z] ++ pidof Xvfb
[task 2017-11-11T19:38:14.105Z] + local xvfb_pid=25
[task 2017-11-11T19:38:14.105Z] + local vnc=false
[task 2017-11-11T19:38:14.105Z] + local interactive=false
[task 2017-11-11T19:38:14.106Z] + '[' -n 25 ']'
[task 2017-11-11T19:38:14.106Z] + [[ false == false ]]
[task 2017-11-11T19:38:14.106Z] + [[ false == false ]]
[task 2017-11-11T19:38:14.106Z] + kill 25
[task 2017-11-11T19:38:14.107Z] + screen -XS xvfb quit
[task 2017-11-11T19:38:14.107Z] XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
[task 2017-11-11T19:38:14.107Z]       after 783 requests (783 known processed) with 0 events remaining.
[task 2017-11-11T19:38:14.111Z] compizconfig - Info: Backend     : ini
[task 2017-11-11T19:38:14.112Z] compizconfig - Info: Integration : true
[task 2017-11-11T19:38:14.112Z] compizconfig - Info: Profile     : default
[task 2017-11-11T19:38:14.157Z] No screen session found.
[task 2017-11-11T19:38:14.157Z] + true
[task 2017-11-11T19:38:14.157Z] + exit 2
[taskcluster 2017-11-11 19:38:14.457Z] === Task Finished ===
[taskcluster 2017-11-11 19:38:19.671Z] Unsuccessful task run with exit code: 2 completed in 2514.307 seconds

:whimboo
Do you have any updates?
Flags: needinfo?(hskupin)
(In reply to OrangeFactor Robot from comment #24)
> For more details, see:
> https://brasstacks.mozilla.com/orangefactor/
> ?display=Bug&bugid=1407383&startday=2017-10-30&endday=2017-11-05&tree=all

If that is becoming a blocking issue, James, Andreas, or Maja should have a look at this bug. They worked most on wptrunner, but I'm not sure which of them has the time right now.
Flags: needinfo?(mjzffr)
Flags: needinfo?(james)
Flags: needinfo?(ato)
Something that's happening on mochitest, marionette and wpt clearly isn't a single bug in the wpt test harness. There are either multiple unrelated issues being bucketed together here, or it's in a lower layer like mozharness or marionette.
Correct, but someone would have to investigate that first. wptrunner handles the process itself, and as such it cannot be related to the process handling in Marionette. Although we have a similar issue there (bug 1391545).
Flags: needinfo?(hskupin)
Has anyone found a reliable way to reproduce this locally?
Looking at this bug, Firefox is not starting up. There is no output from Firefox or from Marionette after geckodriver starts up for wdspec tests. For mochitest we don't use geckodriver, so that's not the cause there.
If this is happening fairly consistently, the best option is really to just request a one-click loaner and observe the tests via VNC. Given that those failures are on Linux, it is very easy to handle via Treeherder/Taskcluster.
There were 144 failures related to this bug in the last 7 days on Linux.

:whimboo any updates on this? It has spiked pretty high lately.
Whiteboard: [stockwell disabled] → [stockwell needswork]
(In reply to Andrei Ciure[:andrei_ciure_] from comment #36)
> :whimboo any updates on this? It has spiked pretty high lately.

I'm not the only one who has to work on all the intermittent failures. That doesn't scale. As such I put ni? for James, Andreas, and Maja a while ago. I hope one of them can find the time to check this bug specifically.
James, Andreas,

Looking at the log, it seems we never get to the point where we start Firefox (if I understand the flow).

# An Example of a test running

[task 2017-11-20T10:46:49.696Z] 10:46:49     INFO - Starting http server on 127.0.0.1:8000
[task 2017-11-20T10:46:49.697Z] 10:46:49     INFO - Starting http server on 127.0.0.1:8443
[task 2017-11-20T10:46:49.706Z] 10:46:49     INFO - Starting http server on 127.0.0.1:8001
[task 2017-11-20T10:46:49.717Z] 10:46:49     INFO - PID 852 | 1511174809713	geckodriver	INFO	geckodriver 0.19.1 ( '589acfe789d8'  '2017-11-20 13:59 +0800')
[task 2017-11-20T10:46:49.721Z] 10:46:49     INFO - PID 852 | 1511174809716	geckodriver	INFO	Listening on 127.0.0.1:4444
[task 2017-11-20T10:46:50.220Z] 10:46:50     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2017-11-20T10:46:50.221Z] 10:46:50     INFO - TEST-START | /webdriver/tests/actions/modifier_click.py
[task 2017-11-20T10:46:50.540Z] 10:46:50     INFO - STDOUT: ============================= test session starts ==============================
[task 2017-11-20T10:46:50.540Z] 10:46:50     INFO - STDOUT: platform linux2 -- Python 2.7.12, pytest-2.9.1, py-1.4.31, pluggy-0.3.1 -- /builds/worker/workspace/build/venv/bin/python
[task 2017-11-20T10:46:50.541Z] 10:46:50     INFO - STDOUT: rootdir: /builds/worker/workspace/build/tests/web-platform/tests/webdriver/tests/actions, inifile: 
[task 2017-11-20T10:46:50.541Z] 10:46:50     INFO - STDOUT: collecting ... 
[task 2017-11-20T10:46:50.558Z] 10:46:50     INFO - STDOUT: collected 9 items
[task 2017-11-20T10:46:50.558Z] 10:46:50     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/modifier_click.py::test_modifier_click[modifier0-ctrlKey] 
[task 2017-11-20T10:46:50.558Z] 10:46:50     INFO - PID 852 | 1511174810555	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.IRSxkIla0Kpi"
[task 2017-11-20T10:46:52.629Z] 10:46:52     INFO - PID 852 | [865, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file /builds/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 2872
[task 2017-11-20T10:47:19.603Z] 10:47:19     INFO - PID 852 | 1511174839597	Marionette	INFO	Enabled via --marionette

# An example of a failure

[task 2017-11-20T10:44:56.874Z] 10:44:56     INFO - TEST-SKIP | /webdriver/tests/contexts/maximize_window.py | took 0ms
[task 2017-11-20T10:44:56.891Z] 10:44:56     INFO - Starting runner
[task 2017-11-20T10:44:56.899Z] 10:44:56     INFO - Starting http server on 127.0.0.1:8443
[task 2017-11-20T10:44:56.900Z] 10:44:56     INFO - Starting http server on 127.0.0.1:8000
[task 2017-11-20T10:44:56.917Z] 10:44:56     INFO - PID 852 | 1511174696914	geckodriver	INFO	geckodriver 0.19.1 ( '589acfe789d8'  '2017-11-20 13:59 +0800')
[task 2017-11-20T10:44:56.921Z] 10:44:56     INFO - PID 852 | 1511174696918	geckodriver	INFO	Listening on 127.0.0.1:4444
[task 2017-11-20T10:44:57.421Z] 10:44:57     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2017-11-20T10:44:57.422Z] 10:44:57     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2017-11-20T11:01:37.471Z] 11:01:37     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/O13YgUo-SwOE8BF5SXh-kw/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2017-11-20T11:01:37.505Z] 11:01:37    ERROR - timed out after 1000 seconds of no output
[task 2017-11-20T11:01:37.506Z] 11:01:37    ERROR - Return code: -15
[task 2017-11-20T11:01:37.508Z] 11:01:37    ERROR - No suite end message was emitted by this harness.

The only obvious thing for me is that Firefox doesn't start up, but we also don't see any mozrunner logging being emitted from geckodriver.

E.g. mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.IRSxkIla0Kpi" is not shown anywhere.
(In reply to David Burns :automatedtester from comment #40)

> Looking at the log, it seems we never get to the point where
> we start Firefox (if I understand the flow).

So based on the log you pasted, geckodriver starts but never accepts
a new connection.  It would be extremely useful here, I think, with
a trace-level log from geckodriver.

I wonder if doing a try build with lots of Wd retriggers and a patch
to pass --webdriver-arg='-vv' by default would show us anything
interesting.
Flags: needinfo?(ato)
Flags: needinfo?(james)
Looking at the log with more verbose logging, we don't see pytest starting up and collecting tests.

Good run

[task 2017-11-28T12:25:29.916Z] 12:25:29     INFO - PID 847 | 1511871929913	geckodriver	INFO	geckodriver 0.19.1 ( '2fd843112adc'  '2017-11-28 12:01 +0000')
[task 2017-11-28T12:25:29.916Z] 12:25:29     INFO - Starting http server on 127.0.0.1:8443
[task 2017-11-28T12:25:29.919Z] 12:25:29     INFO - PID 847 | 1511871929913	webdriver::httpapi	DEBUG	Creating routes
[task 2017-11-28T12:25:29.926Z] 12:25:29     INFO - PID 847 | 1511871929918	geckodriver	INFO	Listening on 127.0.0.1:4444
[task 2017-11-28T12:25:30.418Z] 12:25:30     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2017-11-28T12:25:30.418Z] 12:25:30     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2017-11-28T12:25:30.762Z] 12:25:30     INFO - STDOUT: ============================= test session starts ==============================
[task 2017-11-28T12:25:30.763Z] 12:25:30     INFO - STDOUT: platform linux2 -- Python 2.7.12, pytest-2.9.1, py-1.4.31, pluggy-0.3.1 -- /builds/worker/workspace/build/venv/bin/python
[task 2017-11-28T12:25:30.763Z] 12:25:30     INFO - STDOUT: rootdir: /builds/worker/workspace/build/tests/web-platform/tests/webdriver/tests/actions, inifile: 
[task 2017-11-28T12:25:30.764Z] 12:25:30     INFO - STDOUT: collecting ... 
[task 2017-11-28T12:25:30.780Z] 12:25:30     INFO - STDOUT: collected 3 items
[task 2017-11-28T12:25:30.780Z] 12:25:30     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/key_shortcuts.py::test_mod_a_and_backspace_deletes_all_text 
[task 2017-11-28T12:25:30.781Z] 12:25:30     INFO - PID 847 | 1511871930773	webdriver::server	DEBUG	-> POST /session {"capabilities": {"alwaysMatch": {"moz:firefoxOptions": {"binary": "/builds/worker/workspace/build/application/firefox/firefox", "prefs": {"network.dns.localDomains": "web-platform.test,www.web-platform.test,www1.web-platform.test,www2.web-platform.test,xn--n8j6ds53lwwkrqhv28a.web-platform.test,xn--lve-6lad.web-platform.test"}}}}}
[task 2017-11-28T12:25:30.784Z] 12:25:30     INFO - PID 847 | 1511871930777	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.hpPAvne6jv1N"
[task 2017-11-28T12:25:30.785Z] 12:25:30     INFO - PID 847 | 1511871930780	geckodriver::marionette	TRACE	  connection attempt 0/600
[task 2017-11-28T12:25:30.886Z] 12:25:30     INFO - PID 847 | 1511871930880	geckodriver::marionette	TRACE	  connection attempt 1/600
[task 2017-11-28T12:25:30.988Z] 12:25:30     INFO - PID 847 | 1511871930981	geckodriver::marionette	TRACE	  connection attempt 2/600
[task 2017-11-28T12:25:31.089Z] 12:25:31     INFO - PID 847 | 1511871931081	geckodriver::marionette	TRACE	  connection attempt 3/600


Bad run

[task 2017-11-28T13:27:29.259Z] 13:27:29     INFO - TEST-SKIP | /webdriver/tests/element_click/select.py | took 0ms
[task 2017-11-28T13:27:29.282Z] 13:27:29     INFO - Starting http server on 127.0.0.1:8001
[task 2017-11-28T13:27:29.282Z] 13:27:29     INFO - Starting runner
[task 2017-11-28T13:27:29.285Z] 13:27:29     INFO - Starting http server on 127.0.0.1:8000
[task 2017-11-28T13:27:29.294Z] 13:27:29     INFO - Starting http server on 127.0.0.1:8443
[task 2017-11-28T13:27:29.306Z] 13:27:29     INFO - PID 849 | 1511875649302	geckodriver	INFO	geckodriver 0.19.1 ( '2fd843112adc'  '2017-11-28 12:01 +0000')
[task 2017-11-28T13:27:29.307Z] 13:27:29     INFO - PID 849 | 1511875649302	webdriver::httpapi	DEBUG	Creating routes
[task 2017-11-28T13:27:29.310Z] 13:27:29     INFO - PID 849 | 1511875649306	geckodriver	INFO	Listening on 127.0.0.1:4444
[task 2017-11-28T13:27:29.809Z] 13:27:29     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2017-11-28T13:27:29.810Z] 13:27:29     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2017-11-28T13:44:09.832Z] 13:44:09     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/UBMTC4VvSEOM_TnO_q7h0A/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2017-11-28T13:44:09.883Z] 13:44:09    ERROR - timed out after 1000 seconds of no output
[task 2017-11-28T13:44:09.883Z] 13:44:09    ERROR - Return code: -15
[task 2017-11-28T13:44:09.883Z] 13:44:09    ERROR - No suite end message was emitted by this harness.
[task 2017-11-28T13:44:09.884Z] 13:44:09    ERROR - # TBPL FAILURE #
[task 2017-11-28T13:44:09.884Z] 13:44:09  WARNING - setting return code to 2
[task 2017-11-28T13:44:09.885Z] 13:44:09     INFO - Running post-action listener: _package_coverage_data

Because we don't get pytest starting, we never get a session request, which then times out.


Dave, can you suggest how we can delve deeper into this?
Flags: needinfo?(dave.hunt)
Maybe this is also all related to bug 1421289, which is causing various 1000s timeouts for Mochitests, and Marionette tests.
Depends on: 1421289
> Dave, can you suggest how we can delve deeper into this?

I don't think I have enough context to add value here. The output in comment 60 suggests a TEST-START is being reported before the pytest session starts, which seems odd. Could you direct me to where pytest is invoked? I would suggest that either the issue is occurring before this, or perhaps pytest is throwing an exception that's being ignored.
Flags: needinfo?(dave.hunt)
So it looks like we are hitting some external timeout if we look at https://treeherder.mozilla.org/logviewer.html#?job_id=148432942&repo=try&lineNumber=1223

More debug statements going in.
Ok, I have managed to track it down to the invocation of pytest, and then it hangs, so my initial gut feeling of pytest being the cause looks likely. Now... to understand why...

We get to https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/tools/wptrunner/wptrunner/executors/pytestrunner/runner.py#54-62 before we hang.


See log with my statements in https://treeherder.mozilla.org/logviewer.html#?job_id=148724638&repo=try&lineNumber=1251
Not knowing the best way to get through all of this, I don't really know what the next steps are.

I definitely know that _something_ in pytest is not returning. We pass in a file and other arguments (see the comment above for a link to the code) and then wait for pytest. We then time out, somehow, and realise it's a timeout, but mozprocess comes along and kills us for taking too long.

Dave, if you could give me pointers on where to start looking in pytest, I would be grateful. We are using the vendored version in third_party/python/pytest in mozilla central.

James, should we, when hitting EXTERNAL-TIMEOUT, do `result = Stop` in wptrunner, as there seems to be an error we can't recover from?
Flags: needinfo?(james)
Flags: needinfo?(dave.hunt)
The pytest in the following folder is rather old:
https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/tools/pytest

Maybe it should be upgraded to a newer release?
It looks to me like pytest isn't even starting, as we'd expect to see some output in the console. You could try the pytest-instafail plugin (https://pypi.python.org/pypi/pytest-instafail/), which shows errors/failures immediately rather than waiting until the end of the suite. Alternatively, pytest-faulthandler (https://pypi.python.org/pypi/pytest-faulthandler) might shed some light on what's happening. You could also try adding some debug in WdspecRun.run(), where the thread managing the run is created and ultimately timed out.

I would also suggest trying a later pytest release, as 3.x has been available for more than a year.
Flags: needinfo?(dave.hunt)
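Of the suggestions above, a faulthandler-style dump is the most direct way to see where a silent hang sits. A sketch using the stdlib faulthandler module (Python 3 stdlib; the Python 2.7 harness in use here would need the PyPI backport, which is what pytest-faulthandler wraps):

```python
import faulthandler
import sys

def arm_hang_dump(timeout):
    """Dump every thread's traceback to stderr if the process is still
    running after `timeout` seconds, repeating each interval; a cheap
    way to see where a silent hang is stuck."""
    faulthandler.dump_traceback_later(timeout, repeat=True, file=sys.stderr)

def disarm_hang_dump():
    """Cancel a pending traceback dump once the risky section is done."""
    faulthandler.cancel_dump_traceback_later()
```

Armed just before the pytest invocation, this would have printed the stack of the stuck thread instead of leaving mozprocess to kill the harness with no diagnostics.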
Looks like updating pytest solves this.
Flags: needinfo?(james)
Attachment #8934146 - Attachment is obsolete: true
Attachment #8933670 - Attachment is obsolete: true
From try runs, the vendor update reduces the chances of it happening but doesn't outright solve it.
(In reply to David Burns :automatedtester from comment #81)

> From try runs, the vendor update reduces the chances of it
> happening but doesn't outright solve it.

Do you have an indication of the improvement?  Do you think there
> are more bugs in pytest downstream that cause it to occasionally
freeze up?
Comment on attachment 8935728 [details]
Bug 1407383: Wait for the URL to change in test before asserting

https://reviewboard.mozilla.org/r/206624/#review212366

::: testing/web-platform/tests/webdriver/tests/navigation/current_url.py:82
(Diff revision 2)
>  
>  def test_get_current_url_after_modified_location(session):
> +    start = session.transport.send("GET", "session/%s/url" % session.session_id)
>      session.execute_script("window.location.href = 'about:blank#wd_test_modification'")
> -
> +    wait(session,
> +         lambda d: session.transport.send("GET", "session/%s/url" % session.session_id) != start.body["value"],

d is unused.
Attachment #8935728 - Flags: review?(ato) → review+
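The review comment above is about the predicate passed to the test's wait() helper taking a parameter it never uses. A generic poll-until helper of the same shape, with a zero-argument predicate (illustrative; the actual helper in the wpt tests has a different signature):

```python
import time

def wait_until(condition, timeout=5, interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout`
    seconds elapse; returns the truthy value, raises on timeout."""
    end = time.time() + timeout
    while time.time() < end:
        value = condition()
        if value:
            return value
        time.sleep(interval)
    raise TimeoutError("condition not met within %ss" % timeout)
```

For the patch under review, the condition would compare the current URL against the value captured before execute_script() ran, so the assertion only fires after the navigation has actually happened.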
(In reply to Andreas Tolfsen ‹:ato› from comment #86)
> (In reply to David Burns :automatedtester from comment #81)
> 
> > From try runs, the vendor update reduces the chances of it
> > happening but doesn't outright solve it.
> 
> Do you have an indication of the improvement?  Do you think there
> are more bugs in pytest downstream that cause it to occasionally
> freeze up?

We go from a ~15% intermittent rate to ~2.3%. I feel it would be worth watching after this lands to see if it's the next top priority issue.

As for pytest, I am not sure. It could be due to the other mozprocess intermittents we have on file.
Comment on attachment 8935420 [details]
Bug 1407383 - Update  metadata for WebDriver

https://reviewboard.mozilla.org/r/206312/#review212780

I'll apply this during the next wpt update.
Attachment #8935420 - Flags: review?(james) → review+
Comment on attachment 8932433 [details]
Bug 1407383 - Update Vendored packages for Web-Platform-Tests

https://reviewboard.mozilla.org/r/203478/#review212782

I think it's easier to start a wpt update rather than pre-apply this.
Attachment #8932433 - Flags: review?(james)
Attachment #8932433 - Attachment is obsolete: true
Attachment #8935420 - Attachment is obsolete: true
Pushed by dburns@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/81d47aa03334
Wait for the URL to change in test before asserting r=ato
I can’t say I see any improvement from the OrangeFactor bot.
There are 71 failures in the last 7 days.
They occur on Linux32 and Linux x64 (ccov, jsdcov, nightly), and the affected build types are debug, opt and pgo.

Log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=153448184&lineNumber=5348

[task 2017-12-27T20:18:52.540Z] 20:18:52     INFO - TEST-START | toolkit/mozapps/extensions/test/xpinstall/browser_navigateaway3.js
[task 2017-12-27T20:18:52.622Z] 20:18:52     INFO - GECKO(4930) | Sending InstallTriggered
[task 2017-12-27T20:18:52.666Z] 20:18:52     INFO - GECKO(4930) | 1514405932656	addons.xpi	DEBUG	Download started for http://example.com/browser/toolkit/mozapps/extensions/test/xpinstall/amosigned.xpi to file /tmp/tmp-bvk.xpi
[task 2017-12-27T20:18:52.666Z] 20:18:52     INFO - GECKO(4930) | 1514405932662	addons.xpi	DEBUG	Download of http://example.com/browser/toolkit/mozapps/extensions/test/xpinstall/amosigned.xpi completed.
[task 2017-12-27T20:18:52.708Z] 20:18:52     INFO - GECKO(4930) | 1514405932704	addons.repository	DEBUG	cacheAddons: enabled false IDs ["unsigned-xpi@tests.mozilla.org"]
[task 2017-12-27T20:18:52.733Z] 20:18:52     INFO - GECKO(4930) | Sending PageLoaded
[task 2017-12-27T20:26:21.477Z] 20:26:21     INFO - GECKO(4930) | 1514406381473	addons.manager	INFO	Skipping background update check
[task 2017-12-27T20:28:21.479Z] 20:28:21     INFO - GECKO(4930) | *** Blocklist::notify: Requesting http://127.0.0.1:8888/extensions-dummy/blocklistURL
[task 2017-12-27T20:28:21.495Z] 20:28:21     INFO - GECKO(4930) | *** Blocklist::onXMLLoad: there was an error during load
[task 2017-12-27T20:45:01.531Z] 20:45:01     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/mochitest/runtests.py', '--disable-e10s', '--total-chunks', '7', '--this-chunk', '3', '--jscov-dir-prefix=/builds/worker/workspace/build/blobber_upload_dir', '--appname=/builds/worker/workspace/build/application/firefox/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/a7vaVafyQsW-qx46KzZi2w/artifacts/public/build/target.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--setpref=webgl.force-enabled=true', '--quiet', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/browser-chrome-coverage_raw.log', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/browser-chrome-coverage_errorsummary.log', '--use-test-media-devices', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--sandbox-read-whitelist=/builds/worker/workspace/build', '--log-raw=-', '--flavor=browser', '--chunk-by-runtime', '--timeout=1200']
[task 2017-12-27T20:45:01.556Z] 20:45:01    ERROR - timed out after 1000 seconds of no output
[task 2017-12-27T20:45:01.556Z] 20:45:01    ERROR - Return code: -15
[task 2017-12-27T20:45:01.557Z] 20:45:01    ERROR - No suite end message was emitted by this harness.
[task 2017-12-27T20:45:01.557Z] 20:45:01     INFO - TinderboxPrint: mochitest-browser-chrome-coverage<br/>213/0/0
[task 2017-12-27T20:45:01.557Z] 20:45:01    ERROR - # TBPL FAILURE #
[task 2017-12-27T20:45:01.558Z] 20:45:01  WARNING - setting return code to 2
[task 2017-12-27T20:45:01.558Z] 20:45:01    ERROR - The mochitest suite: browser-chrome-coverage ran with return status: FAILURE
[task 2017-12-27T20:45:01.559Z] 20:45:01     INFO - Running post-action listener: _package_coverage_data
[task 2017-12-27T20:45:01.559Z] 20:45:01     INFO - Beginning compression of JSDCov artifacts...
[task 2017-12-27T20:45:01.559Z] 20:45:01     INFO - Running command: ['zip', '-r', '/builds/worker/workspace/build/blobber_upload_dir/jsdcov_artifacts.zip', '.', '-i', 'jscov*.json'] in /builds/worker/workspace/build/blobber_upload_dir
[task 2017-12-27T20:45:01.559Z] 20:45:01     INFO - Copy/paste: zip -r /builds/worker/workspace/build/blobber_upload_dir/jsdcov_artifacts.zip . -i jscov*.json
Flags: needinfo?(ato)
It doesn’t look like the pytest/py vendor upgrade made any
difference, but I don’t have any better ideas what this could be
and I don’t currently have time allocated to dig in.
Flags: needinfo?(ato)
The latest version of pytest captures logging and includes it in the output when a test fails. It's possible that this could reveal more information when the process appears to hang. To enable live logging you need pytest 3.3.1 or later, plus the command line options -s and --log-cli-level with the appropriate log level. See https://docs.pytest.org/en/3.3.1/logging.html for more information.
:automatedtester, can you find someone to look into this high frequency failure that we have had for 3+ months?
Flags: needinfo?(dburns)
(In reply to Andreas Tolfsen ‹:ato› from comment #108)
> It doesn’t look like the pytest/py vendor upgrade made any
> difference, but I don’t have any better ideas what this could be
> and I don’t currently have time allocated to dig in.

the changes I have done for vendored code have not been downstreamed from the w3c. I will get James to schedule a sync
Flags: needinfo?(dburns)
I still see failures here, I assume a sync took place sometime in the last day or so, :automatedtester, can you follow up here?
Flags: needinfo?(dburns)
No, AFAIK the sync has caused a lot of failures to appear, so it might not have been done yet. At least I don't see an updated pytest: https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/tools/pytest

James, can you supply details regarding the sync?
Flags: needinfo?(dburns) → needinfo?(james)
Component: geckodriver → web-platform-tests
:jgraham, this still shows a lot of failures- can you help prioritize this?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #120)
> :jgraham, this still shows a lot of failures- can you help prioritize this?

the wpt sync should finish today or first thing tomorrow.
Flags: needinfo?(james)
It looks like this isn't fixed by the sync: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=83bfdce958e7de595edd901f1c60686127ca22e9

I think someone needs to get a loaner, run the test in a loop until it fails and then use gdb to figure out what's going on.
(In reply to James Graham [:jgraham] from comment #122)
> It looks like this isn't fixed by the sync:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> inbound&revision=83bfdce958e7de595edd901f1c60686127ca22e9
> 
> I think someone needs to get a loaner, run the test in a loop until it fails
> and then use gdb to figure out what's going on.

I can't see this error on that push

My original work saw it move from >15% down to 2% (comment 88). I would like to see the numbers over the next few days and then see what our next steps are.
Please note that the affected test now seems to be key_shortcuts.py and not key.py. At least in those two failure cases on inbound.
I can reproduce a hang on TC, but not on my machine (so far) by running something like:

mach wpt --test-type wdspec --repeat 1000

Unfortunately I can't figure out how to debug it on TC:

* gdb -p (to attach to the existing processes) seems to be disabled on taskcluster since it would allow interacting with processes outside the container
* gdb python --args <command line above> doesn't provide me with a way to look at the child processes (afaik)
* rr doesn't work on AWS

I'm pretty sure there's some kind of deadlock here. It could be something to do with the log output, but that's only a guess. Debugging this blind is hard, so any suggestions would be appreciated.
Similar to David, I see this on and off on macOS. gdb shouldn't be usable there anymore, so are there any instructions on how best to debug this with lldb? Just in case I see it again.
I ran these tests all night on an OSX debug build and didn't see a hang, so if you can reproduce that's very interesting.

After starting lldb it looks like you do something like

process attach --pid <pid>

I don't know if there are lldb extensions for python like the gdb ones that allow py-bt, but in any case backtraces from each running python process seem interesting.
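A slightly fuller sketch of such a session (the pid is a placeholder; `thread backtrace all` dumps a backtrace for every thread, which is what we want from each Python process):

```
$ lldb
(lldb) process attach --pid 1234
(lldb) thread backtrace all
(lldb) detach
(lldb) quit
```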
So I hit this again today myself, and I tried to debug it via lldb, but the stack looks pretty useless because it is not symbolicated. I also tried pyrasite to connect to the hanging processes, but that didn't work either.

At least the last spawned process looks a bit different for the first thread's last frames:

    + 2646 connection_recv_obj  (in _multiprocessing.so) + 90  [0x10e52545a]
    +   2646 conn_recv_string  (in _multiprocessing.so) + 68  [0x10e5259c5]
    +     2646 _conn_recvall  (in _multiprocessing.so) + 49  [0x10e525a9c]
    +       2646 read  (in libsystem_kernel.dylib) + 10  

So it seems to hang in communicating with the other processes?
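For context, `_conn_recvall` is the blocking read underneath `multiprocessing.Connection.recv()`, which blocks forever if the peer never writes. A minimal illustration of that call and the `poll(timeout)` guard that would avoid an indefinite block (illustrative only, not the harness's actual code):

```python
from multiprocessing import Pipe

parent, child = Pipe()

# parent.recv() would sit in the same read() seen in the sampled stack until
# the child writes something; poll(timeout) lets the caller give up instead.
assert parent.poll(0.1) is False   # nothing has been sent yet: times out

child.send("status")
assert parent.poll(1.0) is True    # data is available now
assert parent.recv() == "status"   # returns immediately, no blocking
```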

What I can say is that it stops in between those two lines, meaning that when the issue occurs pytest does not print its info line:

> 0:04.67 TEST_START: /webdriver/tests/actions/key_shortcuts.py
> ============================================================== test session starts ==============================================================
(In reply to James Graham [:jgraham] from comment #129)
> I ran these tests all night on an OSX debug build and didn't see a hang, so
> if you can reproduce that's very interesting.

Btw, how did you run those? With `--repeat-until-unexpected`? If so, that should not work, because if we really hang that early you have to execute the full command each time and run the test only once, repeating that until it hangs.
:jgraham, this failure is still showing up at a very high rate; can we disable all tests affected by this on Linux until it is resolved? Do you have other ideas?
Flags: needinfo?(james)
Please note that there are still a lot of falsely starred entries in the list of OF.

Also this is not only related to this specific test; it happens for whichever test runs first in the job. So disabling just the first test doesn't help. We would have to completely turn off wdspec tests.
(In reply to OrangeFactor Robot from comment #139)
> 23 failures in 141 pushes (0.163 failures/push) were associated with this
> bug yesterday.    
> 

so it's 18 failures in 141 pushes (0.127 failures/push)
a 12%+ failure rate - should we hide this test or restrict it to a single config until this is fixed?
Flags: needinfo?(dburns)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #141)
> a 12%+ failure rate- should we hide this test or restrict it to a single
> config until this is fixed?

As mentioned in comment 138, that would turn off all wdspec tests, as it's not related to a single test. From a quick search of Bugzilla[1] I think there might be an underlying mozprocess issue, or a resource issue on TC, as it appears, anecdotally, that we are hitting a deadlock, probably due to I/O.

[1] https://bugzilla.mozilla.org/buglist.cgi?quicksearch=mozprocess%20timed%20out&list_id=13974869
Flags: needinfo?(dburns)
I asked if we should restrict this to a single config until this is fixed, for example linux64/debug only.  Who is working on the mozprocess issue?  Is there a bug we can track instead of this?
Flags: needinfo?(dburns)
David, you said that the underlying issue might be in pytest. How safe is this assumption? I mean in relation to a possible issue with mozprocess, as the two would be mutually exclusive.
Flags: needinfo?(james)
Sorry if I’m missing some context here, but what uses mozprocess here?
(In reply to Andreas Tolfsen ‹:ato› from comment #145)
> Sorry if I’m missing some context here, but what uses mozprocess here?

Mozprocess starts everything and makes sure that they don't just hang by killing jobs that don't output for 1000s.
Flags: needinfo?(dburns)
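Conceptually, the no-output watchdog described above looks something like this sketch (function name and structure are illustrative, not the real mozprocess implementation):

```python
import subprocess
import sys
import threading
import time

def run_with_output_timeout(cmd, output_timeout):
    # Simplified sketch of a no-output watchdog: kill the child if it stays
    # silent for `output_timeout` seconds, like the "timed out after 1000
    # seconds of no output" path in the logs above.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    last_output = [time.time()]
    lines = []

    def reader():
        # Each line read from the child refreshes the "last seen output" time.
        for line in proc.stdout:
            last_output[0] = time.time()
            lines.append(line)

    t = threading.Thread(target=reader)
    t.start()
    timed_out = False
    while proc.poll() is None:
        if time.time() - last_output[0] > output_timeout:
            proc.kill()  # no output for too long: tear the child down
            timed_out = True
            break
        time.sleep(0.05)
    t.join()
    proc.wait()
    return proc.returncode, lines, timed_out

# A quiet, fast child exits normally well before the timeout:
rc, out, timed_out = run_with_output_timeout(
    [sys.executable, "-c", "print('ok')"], output_timeout=30)
```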
(In reply to Henrik Skupin (:whimboo) from comment #144)
> David, you said that the underlying issue might be in pytest. How safe is
> this assumption? I mean in relation to a possible issue with mozprocess,
> which both would be mutual exclusive.

I know that pytest is not returning. I am 100% confident in that. Why it's not returning I have no idea, and I know James hasn't been able to debug it either.

James' theory, which makes the most sense, is that there is an I/O deadlock somewhere. If that is the case then there might be a bigger issue, which is why I searched for mozprocess intermittents today; what I found shocked me.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #143)
> I asked if we should restrict this to a single config until this is fixed,
> for example linux64/debug only.  Who is working on the mozprocess issue?  Is
> there a bug we can track instead of this?

No one as far as I know. I thought I would look it up today to see if there is a trend, and clearly there is. I will raise one and put all the mozprocess bugs there as deps, which we can then clear up once we know it's not something more sinister underlying them.
I have put some time into this, e.g. I set up the Docker image from Taskcluster locally and tried to reproduce. But so far it seems to be easy to reproduce in CI and hard to debug there, and hard to reproduce outside CI, where it would be eas[y|ier] to debug. AFAIK only whimboo has reported seeing the hang outside the CI environment, and it's not 100% clear if that's the same issue.

I agree that the situation in CI is currently untenable. I don't really object to reducing the number of Wd jobs in the short term, but ultimately someone needs to dedicate some time to actually resolving the issue, and I haven't been able to do that recently.
David, could you please give me some pointers to where I would have to add debug print lines to actually check whether I see the identical problem with pytest? I don't know where to start, and your input could shorten the time drastically.
Flags: needinfo?(dburns)
And one note... we actually use mozprocess to start the wpt tests via mozharness. So keep in mind that it has its own copy of mozprocess in place. To reproduce this output, you should use mozharness. Also this 1000s timeout means that no output has been given in that time, which underpins the comment from David about the deadlock above.

Yes, I was able to see the deadlock on and off locally, and if I know where to get started with debugging, I can jump into that.
(In reply to Henrik Skupin (:whimboo) from comment #151)
> And one note... we actually use mozprocess to start the wpt tests via
> mozharness. So keep in mind that it has its own copy of mozprocess in place.
> To reproduce this output, you should use mozharness. Also this 1000s timeout
> means that no output has been given in that time, which underpins the
> comment from David about the deadlock above.
> 
> Yes, I was able to see the deadlock one and off locally, and if I know where
> to get started with debugging, I can jump into that.

If it's pytest as the cause, then https://searchfox.org/mozilla-central/source/testing/web-platform/tests/tools/wptrunner/wptrunner/executors/pytestrunner/runner.py#54-63 is what will hang.
Flags: needinfo?(dburns)
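One way to confirm that this call is the one that never returns is to wrap it with a watchdog timer that fires if it does not come back in time. A sketch with illustrative names, not actual harness code:

```python
import threading

def call_with_watchdog(fn, timeout, on_hang):
    # Run fn() on the current thread, but arm a timer that fires on_hang()
    # (e.g. a stack dump or a log line) if fn() has not returned within
    # `timeout` seconds. The timer is cancelled on a normal return.
    timer = threading.Timer(timeout, on_hang)
    timer.start()
    try:
        return fn()
    finally:
        timer.cancel()

# Demo: a call that returns promptly never triggers the hang callback.
hung = []
result = call_with_watchdog(lambda: 42, timeout=5.0,
                            on_hang=lambda: hung.append(True))
```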
Fun discovery from the weekend: add enough logging and this stops happening.
Using a specific test in the summary is misleading, as mentioned a couple of times earlier here. So we should really get rid of it and instead prefix the summary with the test job symbol for better starring.
Summary: Intermittent /webdriver/tests/actions/key.py Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', → Intermittent Wd failure: Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-',
Summary: Intermittent Wd failure: Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', → Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-',
There are 87 failures in the past 7 days.
Platforms: Linux opt/debug, Linux x64 opt/debug/pgo, linux32-nightly opt, linux32-devedition opt, linux64-ccov opt, linux64-jsdcov opt, linux64-nightly opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=160524801

Relevant part of the log:
[task 2018-02-06T01:40:01.065Z] 01:40:01     INFO - SUITE-START | Running 54 tests
[task 2018-02-06T01:40:01.065Z] 01:40:01     INFO - Running wdspec tests
[task 2018-02-06T01:40:01.065Z] 01:40:01     INFO - TEST-START | /webdriver/tests/actions/key.py
[task 2018-02-06T01:40:01.065Z] 01:40:01     INFO - TEST-SKIP | /webdriver/tests/actions/key.py | took 0ms
[task 2018-02-06T01:40:01.066Z] 01:40:01     INFO - TEST-START | /webdriver/tests/contexts/maximize_window.py
[task 2018-02-06T01:40:01.066Z] 01:40:01     INFO - TEST-SKIP | /webdriver/tests/contexts/maximize_window.py | took 0ms
[task 2018-02-06T01:40:01.066Z] 01:40:01     INFO - Starting http server on web-platform.test:8001
[task 2018-02-06T01:40:01.070Z] 01:40:01     INFO - Starting runner
[task 2018-02-06T01:40:01.070Z] 01:40:01     INFO - Starting http server on web-platform.test:8443
[task 2018-02-06T01:40:01.097Z] 01:40:01     INFO - PID 840 | 1517881201093	geckodriver	INFO	geckodriver 0.19.1 ( 'e6138707451a'  '2018-02-05 20:17 -0500')
[task 2018-02-06T01:40:01.101Z] 01:40:01     INFO - PID 840 | 1517881201096	geckodriver	INFO	Listening on 127.0.0.1:4444
[task 2018-02-06T01:40:01.602Z] 01:40:01     INFO - WebDriver HTTP server listening at http://127.0.0.1:4444/
[task 2018-02-06T01:40:01.602Z] 01:40:01     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2018-02-06T01:56:41.662Z] 01:56:41     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/XXPgFAK3QL-xbZQq71hzZA/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--no-pause-after-test', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2018-02-06T01:56:41.663Z] 01:56:41    ERROR - timed out after 1000 seconds of no output
[task 2018-02-06T01:56:41.663Z] 01:56:41    ERROR - Return code: -15
[task 2018-02-06T01:56:41.664Z] 01:56:41    ERROR - No suite end message was emitted by this harness.
[task 2018-02-06T01:56:41.664Z] 01:56:41    ERROR - # TBPL FAILURE #
[task 2018-02-06T01:56:41.665Z] 01:56:41  WARNING - setting return code to 2
[task 2018-02-06T01:56:41.665Z] 01:56:41     INFO - Running post-action listener: _package_coverage_data
[task 2018-02-06T01:56:41.665Z] 01:56:41     INFO - Running post-action listener: _resource_record_post_action
[task 2018-02-06T01:56:41.665Z] 01:56:41     INFO - [mozharness: 2018-02-06 01:56:41.665687Z] Finished run-tests step (success)
[task 2018-02-06T01:56:41.666Z] 01:56:41     INFO - Running post-run listener: _resource_record_post_run
[task 2018-02-06T01:56:41.738Z] 01:56:41     INFO - Total resource usage - Wall time: 1021s; CPU: 1.0%; Read bytes: 442368; Write bytes: 444174336; Read time: 24; Write time: 24936
[task 2018-02-06T01:56:41.738Z] 01:56:41     INFO - TinderboxPrint: CPU usage<br/>0.6%
[task 2018-02-06T01:56:41.738Z] 01:56:41     INFO - TinderboxPrint: I/O read bytes / time<br/>442,368 / 24
[task 2018-02-06T01:56:41.739Z] 01:56:41     INFO - TinderboxPrint: I/O write bytes / time<br/>444,174,336 / 24,936
[task 2018-02-06T01:56:41.739Z] 01:56:41     INFO - TinderboxPrint: CPU idle<br/>4,038.5 (99.4%)
[task 2018-02-06T01:56:41.739Z] 01:56:41     INFO - TinderboxPrint: Swap in / out<br/>0 / 0
[task 2018-02-06T01:56:41.740Z] 01:56:41     INFO - pull - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
[task 2018-02-06T01:56:41.741Z] 01:56:41     INFO - install - Wall time: 17s; CPU: 25.0%; Read bytes: 0; Write bytes: 9338880; Read time: 0; Write time: 484
[task 2018-02-06T01:56:41.751Z] 01:56:41     INFO - run-tests - Wall time: 1004s; CPU: 0.0%; Read bytes: 204800; Write bytes: 434835456; Read time: 16; Write time: 24452
[task 2018-02-06T01:56:42.030Z] 01:56:42     INFO - Running post-run listener: _upload_blobber_files
[task 2018-02-06T01:56:42.030Z] 01:56:42  WARNING - Blob upload gear skipped. Missing cmdline options.
[task 2018-02-06T01:56:42.030Z] 01:56:42     INFO - Running post-run listener: copy_logs_to_upload_dir
[task 2018-02-06T01:56:42.031Z] 01:56:42     INFO - Copying logs to upload dir...
[task 2018-02-06T01:56:42.031Z] 01:56:42     INFO - mkdir: /builds/worker/workspace/build/upload/logs
[task 2018-02-06T01:56:42.032Z] 01:56:42     INFO - Copying logs to upload dir...
[task 2018-02-06T01:56:42.034Z] 01:56:42  WARNING - returning nonzero exit status 2
This bug has failed 85 times in the last 7 days. It failed only on Linux affecting debug, pgo and opt build types.

Failing tests: debug-web-platform-tests-wdspec-e10, opt-web-platform-tests-wdspec-e10, opt-web-platform-tests-wdspec-e10, opt-test-verify-wpt-e10.

Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=162310358&lineNumber=1222

Part of that log:  01:30:07     INFO - TEST-START | /webdriver/tests/actions/key_shortcuts.py
[task 2018-02-15T01:46:47.542Z] 01:46:47     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/YmxWjjgmTZy5PdEjMJy1FQ/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--no-pause-after-test', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2018-02-15T01:46:47.559Z] 01:46:47    ERROR - timed out after 1000 seconds of no output
[task 2018-02-15T01:46:47.560Z] 01:46:47    ERROR - Return code: -15
[task 2018-02-15T01:46:47.560Z] 01:46:47    ERROR - No suite end message was emitted by this harness.
[task 2018-02-15T01:46:47.561Z] 01:46:47    ERROR - # TBPL FAILURE #
[task 2018-02-15T01:46:47.561Z] 01:46:47  WARNING - setting return code to 2
[task 2018-02-15T01:46:47.562Z] 01:46:47     INFO - Running post-action listener: _package_coverage_data
[task 2018-02-15T01:46:47.562Z] 01:46:47     INFO - Running post-action listener: _resource_record_post_action
[task 2018-02-15T01:46:47.562Z] 01:46:47     INFO - [mozharness: 2018-02-15 01:46:47.561867Z] Finished run-tests step (success)
[task 2018-02-15T01:46:47.562Z] 01:46:47     INFO - Running post-run listener: _resource_record_post_run
[task 2018-02-15T01:46:47.646Z] 01:46:47     INFO - Total resource usage - Wall time: 1022s; CPU: 1.0%; Read bytes: 241664; Write bytes: 524836864; Read time: 4; Write time: 93468

The WPT key tests are disabled due to some problem in wptrunner.

new file mode 100644
--- /dev/null
+++ b/testing/web-platform/meta/webdriver/tests/actions/key.py.ini
@@ -0,0 +1,4 @@
+[key.py]
+  type: wdspec
+  disabled:
+    if (os == "linux"): https://bugzilla.mozilla.org/show_bug.cgi?id=1407383

Bug 1421289 needs to be fixed in order to make progress on the actual bug.
I really propose that we re-enable actions/key.py, given that the test module itself is not the problem here. We shouldn't see an increase in failures. The test was just inappropriately disabled a long time ago.
(In reply to Henrik Skupin (:whimboo) from comment #174)
> I really propose that we re-enable actions/key.py given that the test module
> itself is not the problem here. We shouldn't see an increase of failures.
> The test was just inappropriately disabled a long time ago.

Sounds reasonable to me.
will we re-enable that in this bug or somewhere else?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #176)
> will we re-enable that in this bug or somewhere else?

You skipped the test without a patch attached to this bug, so I assume we can do the same to unskip?
Flags: needinfo?(jmaher)
can you link to the bug where we will enable it?
Flags: needinfo?(jmaher)
We should do it here. See your comment 12.
Flags: needinfo?(jmaher)
for disabling tests we often just disable the test, not add a patch; if you want to revert that patch, go ahead!  if there are other factors other than removing the .ini file, it probably is worth a review.
Flags: needinfo?(jmaher)
Ok, we cannot re-enable key.py yet due to perma timeout failures. There are no restarts of Firefox in between each sub-test, but the very first start of Firefox seems to take extraordinarily long:

[task 2018-02-16T12:24:42.511Z] 12:24:42     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/key.py::test_lone_keyup_sends_no_events 
[task 2018-02-16T12:25:09.626Z] 12:25:09     INFO - PID 843 | GLib-GIO-Message: Using the 'memory' GSettings backend.  Your settings will not be saved or shared with other applications.
[task 2018-02-16T12:25:10.320Z] 12:25:10     INFO - PID 843 | 1518783910314	Marionette	INFO	Listening on port 2828
[task 2018-02-16T12:25:11.003Z] 12:25:11     INFO - STDOUT: PASSED

Here it took exactly 28 seconds until Firefox was ready. That is actually a different problem.
This bug
- Depends on:	1421289, 1418227
- Blocks:	1429391, 1433861

I can see there is a lot of work on this bug; as an update I want to add the following:

There are 81 failures in the last 7 days and 242 failures in the last 21 days.
- Linux x64: 50
- Linux: 23
- linux32-nightly: 2
- linux64-ccov: 3
- linux64-jsdcov: 1
- linux64-nightly: 1
- OS X 10.10: 1

- opt: 41
- debug: 27
- pgo: 13

Here is a recent log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=164092702&lineNumber=2952

[task 2018-02-24T05:37:25.838Z] 05:37:25     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/wpt_raw.log', '--log-wptreport=/builds/worker/workspace/build/blobber_upload_dir/wptreport.json', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/wpt_errorsummary.log', '--binary=/builds/worker/workspace/build/application/firefox/firefox', '--symbols-path=https://queue.taskcluster.net/v1/task/Eq0ruhAHRN6GlG2-xuqcqQ/artifacts/public/build/target.crashreporter-symbols.zip', '--stackwalk-binary=/usr/local/bin/linux64-minidump_stackwalk', '--stackfix-dir=/builds/worker/workspace/build/tests/bin', '--run-by-dir=3', '--no-pause-after-test', '--test-type=wdspec', '--stylo-threads=4', '--webdriver-binary=/builds/worker/workspace/build/tests/bin/geckodriver', '--prefs-root=/builds/worker/workspace/build/tests/web-platform/prefs', '--processes=1', '--config=/builds/worker/workspace/build/tests/web-platform/wptrunner.ini', '--ca-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/cacert.pem', '--host-key-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.key', '--host-cert-path=/builds/worker/workspace/build/tests/web-platform/certs/web-platform.test.pem', '--certutil-binary=/builds/worker/workspace/build/tests/bin/certutil']
[task 2018-02-24T05:37:25.846Z] 05:37:25    ERROR - timed out after 1000 seconds of no output
[task 2018-02-24T05:37:25.846Z] 05:37:25    ERROR - Return code: -15
[task 2018-02-24T05:37:25.846Z] 05:37:25    ERROR - No suite end message was emitted by this harness.
[task 2018-02-24T05:37:25.846Z] 05:37:25    ERROR - # TBPL FAILURE #
[task 2018-02-24T05:37:25.846Z] 05:37:25  WARNING - setting return code to 2
Flags: needinfo?(james)
:jgraham, can we disable wdspec on linux?
I was trying to reproduce this hang with a 1-click loaner and got it fairly quickly. Sadly I cannot use GDB to get stacktraces of each of the processes because ptrace is not allowed:

> # sudo gdb --pid=3586
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
> [..]
> Attaching to process 3586
> Could not attach to process.  If your uid matches the uid of the target
> process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
> again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
> ptrace: Operation not permitted.

I also cannot turn off the restriction:

> # echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
> tee: /proc/sys/kernel/yama/ptrace_scope: Read-only file system
> 0

At least by inspecting the log I noticed that the hang happens after `Executor setup done`, which is NOT printed to the public logs as uploaded to taskcluster:

{"source": "web-platform-tests", "thread": "Thread-TestrunnerManager-1", "time": 1520242899539, "action": "log", "message": "Executor setup done", "level": "DEBUG", "pid": 3586}
^C{"source": "web-platform-tests", "thread": "Thread-TestrunnerManager-1", "time": 1520243299983, "action": "log", "message": "TestRunnerManager main loop terminating, starting cleanup", "level": "DEBUG", "pid": 3586}
Here are the instructions on how to reproduce the hang on a one-click loaner:

1) Create a one-click loaner based on https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=e0a1ba39c48913e4e88efe4fa289c23454ff5181&selectedJob=165824714
2) Run the execution step and stop it when tests are started
3) Save this attachment as run.sh and make it executable
4) Run ./run.sh and wait until it hangs
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #193)
> :jgraham, can we disable wdspec on linux?

Redirecting this to maja_zf.
Flags: needinfo?(james) → needinfo?(mjzffr)
The hang is actually caused by `__import__(modname)`, where `modname` is `tests.conftest`:

https://searchfox.org/mozilla-central/source/testing/web-platform/tests/tools/third_party/py/py/_path/local.py#668
https://searchfox.org/mozilla-central/source/testing/web-platform/tests/webdriver/tests/conftest.py

I'm going to reduce the content of this file now to see what causes it. I suspect it is the import of pytest.
Rather than disabling the entire suite, I like the idea of setting it to Tier-2 for a couple of months until we get the most broad, infrastructure-related intermittents figured out. The individual tests, for the most part, are fine and provide useful information to developers working on the features being tested.
Flags: needinfo?(mjzffr)
Reducing conftest.py showed that the hang actually comes from `fixtures.py`:

https://searchfox.org/mozilla-central/source/testing/web-platform/tests/webdriver/tests/support/fixtures.py

In that file I can remove everything except the following line:

> import webdriver

So something in that package keeps us hanging. More updates follow...
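This kind of bisection can be automated; a minimal Python 3 sketch (the helper name is mine, not from the harness) that wraps a suspect import in an alarm so a hang turns into an exception instead of a timeout kill:

```python
import signal


def import_with_timeout(name, seconds=5):
    """Import `name`, raising TimeoutError if the import blocks for
    longer than `seconds`. Unix only; relies on SIGALRM interrupting
    the main thread (which CPython 3 allows even inside a lock wait)."""
    def on_alarm(signum, frame):
        raise TimeoutError("import of %r hung" % name)

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)
    try:
        return __import__(name)
    finally:
        signal.alarm(0)                            # cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)  # restore previous handler
```

Running this against each import in the reduced conftest.py narrows the hang down one module at a time.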
I finally traced it down to StructuredLogger, specifically its `__init__` method:

https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozlog/mozlog/structuredlog.py#149

The hang happens in `with self._lock`. James, given that you know mozlog better than I do, any advice on good next steps?
Flags: needinfo?(james)
So, I would really like to see what happens if you apply the following patch to log all activity on that lock:

diff --git a/testing/mozbase/mozlog/mozlog/structuredlog.py b/testing/mozbase/mozlog/mozlog/structuredlog.py
index 74a7ef55e3f4..712560a0eeda 100644
--- a/testing/mozbase/mozlog/mozlog/structuredlog.py
+++ b/testing/mozbase/mozlog/mozlog/structuredlog.py
@@ -7,6 +7,7 @@ from __future__ import absolute_import, print_function, unicode_literals
 from multiprocessing import current_process
 from threading import current_thread, Lock
 import json
+import os
 import sys
 import time
 import traceback
@@ -69,6 +70,27 @@ Subfields for all messages:
 _default_logger_name = None
 
 
+class LoggedLock(object):
+    def __init__(self):
+        self._lock = Lock()
+
+    def acquire(self):
+        os.write(2, "Lock requested %s %s\n" % (current_process().pid, current_thread().name))
+        rv = self._lock.acquire()
+        os.write(2, "Lock acquired %s %s\n" % (current_process().pid, current_thread().name))
+        return rv
+
+    def release(self):
+        os.write(2, "Lock released by %s %s\n" % (current_process().pid, current_thread().name))
+        return self._lock.release()
+
+    def __enter__(self):
+        return self.acquire()
+
+    def __exit__(self, *args, **kwargs):
+        return self.release()
+
+
 def get_default_logger(component=None):
     """Gets the default logger if available, optionally tagged with component
     name. Will return None if not yet set
@@ -134,7 +156,7 @@ class ComponentState(object):
 
 
 class StructuredLogger(object):
-    _lock = Lock()
+    _lock = LoggedLock()
     _logger_states = {}
     """Create a structured logger with the given name
 
diff --git a/testing/web-platform/tests/tools/webdriver/webdriver/client.py b/testing/web-platform/tests/tools/webdriver/webdriver/client.py
index 622c5fce76ac..86209a366691 100644
--- a/testing/web-platform/tests/tools/webdriver/webdriver/client.py
+++ b/testing/web-platform/tests/tools/webdriver/webdriver/client.py
@@ -6,7 +6,10 @@ import transport
 
 from six import string_types
 
+import os
+os.write(2, "Before get_default_logger\n")
 from mozlog import get_default_logger
+os.write(2, "After get_default_logger\n")
 
 logger = get_default_logger()
 



That's against m-c so you may need to edit the paths to get it to apply.


I suspect what's going on is that we are randomly inheriting a locked lock into the subprocess due to the Incredible Brokenness of Python Multithreading. But I'd like to confirm that's what's going on.


Irrespective of that, we don't actually seem to use the logger anywhere? So we could delete that line and stop using mozlog in the conftest file, and the problem will go away. Apologies for not spotting this earlier; in retrospect it seems like exactly the kind of thing that could cause exactly this kind of problem.
Flags: needinfo?(james)
*Multiprocessing. Multithreading is less broken, just not fit for (many) purpose(s).
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Priority: P5 → P1
Also I don't see why we are hanging here while trying to acquire the lock, as it looks like the same lock instance gets released right before:

Lock requested 14850 Thread-TestrunnerManager-1 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0>
Lock acquired 14850 Thread-TestrunnerManager-1 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0>
{"source": "web-platform-tests", "thread": "Thread-TestrunnerManager-1", "time": 1520281029594, "action": "log", "message": "Executor setup done
", "level": "DEBUG", "pid": 14850}
Lock released by 14850 Thread-TestrunnerManager-1 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0>
*** before self._lockLock requested 14890 Thread-2 for <mozlog.structuredlog.LoggedLock object at 0x7f3bacd64bd0>
(In reply to James Graham [:jgraham] from comment #203)
> Irrespective of that, we don't actually seem to use the logger anywhere? So
> we could delete that line and stop using mozlog in the conftest file and the
> problem will go away. Apologies for not spotting this earlier. In retrospect

This doesn't help. We still hang even with the lines removed in `fixtures.py`.
Comment on attachment 8956221 [details]
Bug 1407383 - [wdspec] Remove usage of mozlog to prevent process lock.

https://reviewboard.mozilla.org/r/225132/#review231182

::: testing/web-platform/tests/webdriver/tests/support/fixtures.py:20
(Diff revision 3)
>  def ignore_exceptions(f):
>      def inner(*args, **kwargs):
>          try:
>              return f(*args, **kwargs)
>          except webdriver.error.WebDriverException as e:
> -            logger.warning("Ignored exception %s" % e)
> +            print >> sys.stderr, "Ignored exception %s" % e

Not sure what I should make of this lint failure:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=14d058dbd80c9cb2b5645f55012c61e5b3f8f454&selectedJob=166153169

Does it mean I have to use a print function or not? The wording is kinda strange. Also, I don't see this failure when running the linter locally with the exact same arguments.
Attachment #8956222 - Flags: review?(james)
Attachment #8956221 - Flags: review?(james)
Attachment #8951549 - Flags: review?(james)
Comment on attachment 8956221 [details]
Bug 1407383 - [wdspec] Remove usage of mozlog to prevent process lock.

https://reviewboard.mozilla.org/r/225132/#review231224
Attachment #8956221 - Flags: review?(james) → review+
Comment on attachment 8956222 [details]
Bug 1407383 - [wdspec] Remove unused imports from fixtures module.

https://reviewboard.mozilla.org/r/225134/#review231228
Attachment #8956222 - Flags: review?(james) → review+
Comment on attachment 8951549 [details]
Bug 1407383 - [wdspec] Re-enable key.py on Linux64.

https://reviewboard.mozilla.org/r/220862/#review231226
Attachment #8951549 - Flags: review?(james) → review+
Comment on attachment 8956221 [details]
Bug 1407383 - [wdspec] Remove usage of mozlog to prevent process lock.

https://reviewboard.mozilla.org/r/225132/#review231182

> Not sure what I should make out of this lint failure:
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=14d058dbd80c9cb2b5645f55012c61e5b3f8f454&selectedJob=166153169
> 
> Does it mean I have to use a print function or not? The wording is kinda strange. Also i don't see this failure when running the linter locally with the exact same arguments.

The problem was that the file needed to be added to the whitelist for the linter to allow the usage of `print` at all.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/880cc555591a
[wdspec] Remove unused imports from fixtures module. r=jgraham
https://hg.mozilla.org/integration/autoland/rev/c26cb6d150d5
[wdspec] Remove usage of mozlog to prevent process lock. r=jgraham
https://hg.mozilla.org/integration/autoland/rev/bf1d69a11d6f
[wdspec] Re-enable key.py on Linux64. r=jgraham
Created web-platform-tests PR https://github.com/w3c/web-platform-tests/pull/9872 for changes under testing/web-platform/tests
Whiteboard: [stockwell disable-recommended] → [stockwell disable-recommended][wptsync upstream]
Can't merge web-platform-tests PR due to failing upstream checks:
Github PR https://github.com/w3c/web-platform-tests/pull/9872
* continuous-integration/travis-ci/pr (https://travis-ci.org/w3c/web-platform-tests/builds/349768613?utm_source=github_status&utm_medium=notification)
Whiteboard: [stockwell disable-recommended][wptsync upstream] → [stockwell disable-recommended][wptsync upstream error]
Whiteboard: [stockwell disable-recommended][wptsync upstream error] → [stockwell disable-recommended][wptsync upstream]
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Whiteboard: [stockwell disable-recommended][wptsync upstream] → [stockwell fixed][wptsync upstream]
Target Milestone: --- → mozilla60
(In reply to Andreea Pavel [:apavel] from comment #230)
> Recent failure log:
> https://treeherder.mozilla.org/logviewer.html#?job_id=167034813&repo=autoland&lineNumber=377485

This has nothing to do with this bug, and is something completely different. I will file a new bug.
Flags: needinfo?(hskupin)
Summary: Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-', → Intermittent Wd Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/web-platform/runtests.py', '--log-raw=-'] due to hang in mozlog
Upstream PR merged
See Also: → 1516293