Closed Bug 1638148 Opened 4 years ago Closed 4 years ago

Intermittent Mn MacOS, Linux Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1589484627/build/venv/bin/python', ...] for TestCrash.test_crash_content_process

Categories

(Testing :: Marionette Client and Harness, defect, P5)

Version 3
defect

Tracking

(firefox-esr68 unaffected, firefox76 unaffected, firefox77 unaffected, firefox78 fixed)

RESOLVED FIXED
mozilla78
Tracking Status
firefox-esr68 --- unaffected
firefox76 --- unaffected
firefox77 --- unaffected
firefox78 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: enndeakin)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell fixed:backout])

Filed by: dluca [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=302328967&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/biV-YbLFQ72h3FCn1xzDWA/runs/0/artifacts/public/logs/live_backing.log


INFO - TEST-START | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_content_process
[task 2020-05-14T19:39:37.276Z] 19:39:37     INFO -  Suppressing GECKO output. To view, add `print(self.new_out.getvalue())` to the end of this test.
[task 2020-05-14T19:56:17.377Z] 19:56:17     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1589484627/build/venv/bin/python', '-u', '/Users/cltbld/tasks/task_1589484627/build/tests/marionette/harness/marionette_harness/runtests.py', '--setpref=toolkit.asyncshutdown.log=true', '--setpref=media.peerconnection.mtransport_process=false', '--setpref=network.process.enabled=false', '--gecko-log=-', '--log-raw=-', '-vv', '--log-raw=/Users/cltbld/tasks/task_1589484627/build/blobber_upload_dir/marionette_raw.log', '--log-errorsummary=/Users/cltbld/tasks/task_1589484627/build/blobber_upload_dir/marionette_errorsummary.log', '--log-html=/Users/cltbld/tasks/task_1589484627/build/blobber_upload_dir/report.html', '--binary=/Users/cltbld/tasks/task_1589484627/build/application/Firefox Nightly.app/Contents/MacOS/firefox', '--address=localhost:2828', u'--symbols-path=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Q8dlOzVfSsiI1meyrIDf4Q/artifacts/public/build/target.crashreporter-symbols.zip', '/Users/cltbld/tasks/task_1589484627/build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit-tests.ini']
[task 2020-05-14T19:56:17.577Z] 19:56:17    ERROR - timed out after 1000 seconds of no output
[task 2020-05-14T19:56:17.577Z] 19:56:17    ERROR - Return code: -15
[task 2020-05-14T19:56:17.578Z] 19:56:17    ERROR - No suite end message was emitted by this harness.
[task 2020-05-14T19:56:17.578Z] 19:56:17     INFO - TinderboxPrint: marionette<br/>824/0/1
[task 2020-05-14T19:56:17.578Z] 19:56:17     INFO - gecko.log not found
[task 2020-05-14T19:56:17.578Z] 19:56:17     INFO - TinderboxPrint: marionette<br/>824/0/1
[task 2020-05-14T19:56:17.578Z] 19:56:17     INFO - Marionette exited with return code -15: FAILURE
[task 2020-05-14T19:56:17.578Z] 19:56:17    ERROR - # TBPL FAILURE #
[task 2020-05-14T19:56:17.578Z] 19:56:17  WARNING - setting return code to 2
[task 2020-05-14T19:56:17.578Z] 19:56:17     INFO - Running post-action listener: _package_coverage_data
[task 2020-05-14T19:56:17.578Z] 19:56:17     INFO - Running post-action listener: _resource_record_post_action
[task 2020-05-14T19:56:17.578Z] 19:56:17     INFO - Running post-action listener: process_java_coverage_data
[task 2020-05-14T19:56:17.578Z] 19:56:17     INFO - [mozharness: 2020-05-14 19:56:17.578821Z] Finished run-tests step (success)
[task 2020-05-14T19:56:17.578Z] 19:56:17     INFO - Running post-run listener: _resource_record_post_run
[task 2020-05-14T19:56:17.701Z] 19:56:17     INFO - Total resource usage - Wall time: 1490s; CPU: 11.0%; Read bytes: 220069888; Write bytes: 1627289088; Read time: 2655; Write time: 7199
[task 2020-05-14T19:56:17.701Z] 19:56:17     INFO - TinderboxPrint: CPU usage<br/>11.0%
[task 2020-05-14T19:56:17.701Z] 19:56:17     INFO - TinderboxPrint: I/O read bytes / time<br/>220,069,888 / 2,655
[task 2020-05-14T19:56:17.701Z] 19:56:17     INFO - TinderboxPrint: I/O write bytes / time<br/>1,627,289,088 / 7,199
[task 2020-05-14T19:56:17.701Z] 19:56:17     INFO - TinderboxPrint: CPU idle<br/>5,319.1 (89.3%)
[task 2020-05-14T19:56:17.701Z] 19:56:17     INFO - TinderboxPrint: CPU system<br/>150.8 (2.5%)
[task 2020-05-14T19:56:17.701Z] 19:56:17     INFO - TinderboxPrint: CPU user<br/>489.0 (8.2%)
[task 2020-05-14T19:56:17.701Z] 19:56:17     INFO - TinderboxPrint: Swap in / out<br/>478,515,200 / 0
[task 2020-05-14T19:56:17.702Z] 19:56:17     INFO - install - Wall time: 19s; CPU: 38.0%; Read bytes: 3100672; Write bytes: 290271232; Read time: 51; Write time: 582
[task 2020-05-14T19:56:17.711Z] 19:56:17     INFO - run-tests - Wall time: 1471s; CPU: 11.0%; Read bytes: 215494656; Write bytes: 1332401664; Read time: 2575; Write time: 6597
[task 2020-05-14T19:56:17.982Z] 19:56:17  WARNING - returning nonzero exit status 2
[fetches 2020-05-14T19:56:18.001Z] removing /Users/cltbld/tasks/task_1589484627/fetches
[fetches 2020-05-14T19:56:18.002Z] finished
[taskcluster 2020-05-14T19:56:18.012Z]    Exit Code: 2
[taskcluster 2020-05-14T19:56:18.012Z]    User Time: 6m8.621984s
[taskcluster 2020-05-14T19:56:18.012Z]  Kernel Time: 1m28.166325s
[taskcluster 2020-05-14T19:56:18.012Z]    Wall Time: 25m33.544811831s
[taskcluster 2020-05-14T19:56:18.012Z]       Result: FAILED
[taskcluster 2020-05-14T19:56:18.012Z] === Task Finished ===
[taskcluster 2020-05-14T19:56:18.012Z] Task Duration: 25m33.548510968s
[taskcluster 2020-05-14T19:56:18.121Z] Uploading artifact public/logs/localconfig.json from file logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2021-05-14T19:28:15.127Z
[taskcluster 2020-05-14T19:56:18.370Z] Uploading artifact public/test_info/marionette_errorsummary.log from file build/blobber_upload_dir/marionette_errorsummary.log with content encoding "gzip", mime type "text/plain" and expiry 2021-05-14T19:28:15.127Z
[taskcluster 2020-05-14T19:56:18.802Z] Uploading artifact public/test_info/marionette_raw.log from file build/blobber_upload_dir/marionette_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2021-05-14T19:28:15.127Z
[taskcluster 2020-05-14T19:56:19.067Z] Uploading artifact public/test_info/report.html from file build/blobber_upload_dir/report.html with content encoding "gzip", mime type "text/html; charset=utf-8" and expiry 2021-05-14T19:28:15.127Z
[taskcluster 2020-05-14T19:56:19.186Z] Uploading artifact public/test_info/resource-usage.json from file build/blobber_upload_dir/resource-usage.json with content encoding "gzip", mime type "application/json" and expiry 2021-05-14T19:28:15.127Z
[taskcluster:error] exit status 2
Summary: Intermittent Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1589484627/build/venv/bin/python', '-u', '/Users/cltbld/tasks/task_1589484627/build/tests/marionette/harness/marionette_harness/runtests.py', '... → Intermittent Mn MacOS Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1589484627/build/venv/bin/python', '-u', '/Users/cltbld/tasks/task_1589484627/build/tests/marionette/harness/marionette_harness/runtests
Summary: Intermittent Mn MacOS Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1589484627/build/venv/bin/python', '-u', '/Users/cltbld/tasks/task_1589484627/build/tests/marionette/harness/marionette_harness/runtests → Intermittent Mn MacOS Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1589484627/build/venv/bin/python', '-u' after Assertion failure in -[mozRootAccessible representedView

The failing test here is test_crash_content_process:

[task 2020-05-14T19:39:37.275Z] 19:39:37     INFO - TEST-START | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_content_process
[task 2020-05-14T19:39:37.276Z] 19:39:37     INFO -  Suppressing GECKO output. To view, add `print(self.new_out.getvalue())` to the end of this test.
[task 2020-05-14T19:56:17.377Z] 19:56:17     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1589484627/build/venv/bin/python', '-u', '/Users/cltbld/tasks/task_1589484627/build/tests/marionette/harness/marionette_harness/runtests.py', '--setpref=toolkit.asyncshutdown.log=true', '--setpref=media.peerconnection.mtransport_process=false', '--setpref=network.process.enabled=false', '--gecko-log=-', '--log-raw=-', '-vv', '--log-raw=/Users/cltbld/tasks/task_1589484627/build/blobber_upload_dir/marionette_raw.log', '--log-errorsummary=/Users/cltbld/tasks/task_1589484627/build/blobber_upload_dir/marionette_errorsummary.log', '--log-html=/Users/cltbld/tasks/task_1589484627/build/blobber_upload_dir/report.html', '--binary=/Users/cltbld/tasks/task_1589484627/build/application/Firefox Nightly.app/Contents/MacOS/firefox', '--address=localhost:2828', u'--symbols-path=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Q8dlOzVfSsiI1meyrIDf4Q/artifacts/public/build/target.crashreporter-symbols.zip', '/Users/cltbld/tasks/task_1589484627/build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit-tests.ini']
[task 2020-05-14T19:56:17.577Z] 19:56:17    ERROR - timed out after 1000 seconds of no output
[task 2020-05-14T19:56:17.577Z] 19:56:17    ERROR - Return code: -15

As it looks like there was a extremely long time in shutting down the browser until it got soft-killed. Sadly we do not have any gecko log output for it. If it turns out that the test fails more often we could do a try build after adding print(self.new_out.getvalue()) at the end of the test.

Looks like it actually fails a lot. And that not only for Mac but also Linux. So far I cannot see a hang during shutdown kill on Windows. All are happening for shippable opt.

It's a fairly new regression so a backfill would be good. I just triggered some jobs on mozilla-central to know at least the affecting merge.

Blocks: 1391545
Summary: Intermittent Mn MacOS Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1589484627/build/venv/bin/python', '-u' after Assertion failure in -[mozRootAccessible representedView → Intermittent Mn MacOS, Linux Automation Error: mozprocess timed out after 1000 seconds running ['/Users/cltbld/tasks/task_1589484627/build/venv/bin/python', ...] for TestCrash.test_crash_content_process

Looks like https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&tier=1%2C2%2C3&searchStr=mn&revision=045d696faa87fc06c8040b75dec7cd6358c34067 is the causing merge on mozilla-central.

I asked in #sheriffs for a further backfill in autoland.

(In reply to Andreea Pavel [:apavel] from comment #6)

Retriggers point to bug 1594752, https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=OS%2CX%2C10.14%2CShippable%2Copt%2Ctest-macosx1014-64-shippable%2Fopt-marionette-e10s%2C%28Mn%29&selectedTaskRun=dLGvnHO1Sq-C-NfOzflmzA-0&tochange=3e68d50950bf122006b30009c8891196c21b8395&fromchange=1813cb675a68f10ca82536fad034766bdd585f83

Thanks a lot. So the failing test can be found here:

https://searchfox.org/mozilla-central/rev/4166c15e2a99a23a9b38ad62c9fdfe8e5448b354/testing/marionette/harness/marionette_harness/tests/unit/test_crash.py#138-162

I assume that due to the expected crash there is a bad interaction between the actors, which causes a hang?

Given that Tier1 jobs are affected here, I assume the patch on bug 1594752 should be backed out?

However, the assertion happens before that bug, but is not followed by the automation error, but by bug 1582736.

These assertions also happen in passing tests. So not sure how important they are at this time.

Flags: needinfo?(hskupin) → needinfo?(apavel)
Regressed by: 1594752

Actually it looks like the backfill was not complete and misses some changesets, especially the one from above which you claimed as regressor:

https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception%2Crunnable&searchStr=mn&fromchange=bf7bb4c678663022d6b33a7f0de79c9e4ba6e76c&selectedTaskRun=ce0UabpdSq-YADhWY1uiNw-0&tochange=30b5173170a437085c25a2c83472ed688dc62f28

Note that this is a failure for shippable builds on Mac and Linux.

Clearing bug 1594752 for now.

No longer regressed by: 1594752

As an update to the above comments, the regressor is still bug 1594752.
The TH link above is from the 11th of May, where the bug landed and got backed out.
Retriggers in comment 6 are made on the push from May 14th, where the bug got relanded.

Henrik, the failure is intermittent, we could probably talk to Neil and see how long it will take to be fixed.
Aryx, what do you think? backout or continue conversation ?

Flags: needinfo?(apavel) → needinfo?(aryx.bugmail)
Regressed by: 1594752

Thanks for clarifying and yes I was looking at a wrong autoland link. Note that this failure happened 33 times yesterday.

Yes, please back it out.

Flags: needinfo?(aryx.bugmail) → needinfo?(apavel)
Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(apavel)
Resolution: --- → FIXED
Whiteboard: [stockwell needswork:owner] → [stockwell fixed:backout]
Assignee: nobody → enndeakin
Target Milestone: --- → mozilla78
Product: Testing → Remote Protocol
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing
You need to log in before you can comment on or make changes to this bug.