Closed
Bug 1032849
Opened 10 years ago
Closed 9 years ago
Hangs on windows result in a mozprocess timeout rather than a stack
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 890026
People
(Reporter: emorley, Unassigned)
References
Details
In bug 1009951, bug 900112 and bug 1032842, we're seeing mochitest hangs (where we hit the harness timeout: "application timed out after 330 seconds with no output"), which are followed by a mozprocess timeout, rather than handling the hang gracefully (eg trying to kill the process and get a stack, or if that doesn't work, at least exiting the test run gracefully). eg: https://tbpl.mozilla.org/php/getParsedLog.php?id=42856603&tree=Mozilla-Central 06:21:52 INFO - 1163 INFO TEST-START | /tests/dom/plugins/test/mochitest/test_CrashService_hang.html 06:21:52 INFO - 1164 INFO TEST-INFO | /tests/dom/plugins/test/mochitest/test_CrashService_hang.html | Waiting for pruneOldCrashes 06:27:22 WARNING - TEST-UNEXPECTED-FAIL | /tests/dom/plugins/test/mochitest/test_CrashService_hang.html | application timed out after 330 seconds with no output 06:27:23 INFO - TEST-INFO | screenshot: exit status 0 06:44:03 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['C:\\slave\\test\\build\\venv\\Scripts\\python', '-u', 'C:\\slave\\test\\build\\tests\\mochitest/runtests.py', '--total-chunks', '5', '--this-chunk', '3', '--appname=C:\\slave\\test\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1404208922/firefox-33.0a1.en-US.win32.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--autorun', '--close-when-done', '--console-level=INFO', '--quiet', '--chunk-by-dir=4'] 06:44:03 ERROR - timed out after 1000 seconds of no output 06:44:03 ERROR - Return code: 572 and: https://tbpl.mozilla.org/php/getParsedLog.php?id=42851832&tree=Mozilla-Inbound 03:55:30 INFO - TEST-INFO | chrome://mochitests/content/browser/browser/base/content/test/general/browser_aboutSupport_newtab_security_state.js | Entering test checkIdentityOfAboutSupport 03:55:31 INFO - 1404215731110 Browser.Experiments.Experiments TRACE Experiments #0::enabled=true, true 03:55:31 INFO - 1404215731111 Browser.Experiments.Experiments TRACE Experiments #0::Registering instance with Addon Manager. 03:55:31 INFO - 1404215731111 Browser.Experiments.Experiments TRACE Experiments #0::Registering previous experiment add-on provider. 03:55:31 INFO - 1404215731111 Browser.Experiments.Experiments TRACE PreviousExperimentProvider #0::startup() 03:55:31 INFO - 1404215731111 Browser.Experiments.Experiments TRACE Experiments #0::_loadFromCache 04:01:01 WARNING - TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/base/content/test/general/browser_aboutSupport_newtab_security_state.js | application timed out after 330 seconds with no output 04:01:01 INFO - TEST-INFO | screenshot: exit status 0 04:17:41 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['C:\\slave\\test\\build\\venv\\Scripts\\python', '-u', 'C:\\slave\\test\\build\\tests\\mochitest/runtests.py', '--total-chunks', '3', '--this-chunk', '1', '--appname=C:\\slave\\test\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-pgo/1404203403/firefox-33.0a1.en-US.win32.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--autorun', '--close-when-done', '--console-level=INFO', '--quiet', '--browser-chrome', '--chunk-by-dir=5'] 04:17:41 ERROR - timed out after 1000 seconds of no output 04:17:41 ERROR - Return code: 572 Ted, do you have any idea why this might be happening now? (It started occurring in the last month or so iirc).
Flags: needinfo?(ted)
Reporter | ||
Updated•10 years ago
|
Comment 1•10 years ago
|
||
I don't know of anything offhand that would have changed this in that time frame. The only things I can think of to do would be to: a) Add a bunch of debug logging around process timeouts in Mochitest and see if anything stands out. b) Do a try push with the patches from bug 890026 applied and see if it reproduces in that case.
Flags: needinfo?(ted)
Comment 2•9 years ago
|
||
Adding a couple of people who might be able to help. Dan, Ahal, do either of you have time to look into this? The sheriffs are going to start hiding WinXP tests because of this if we can't fix it.
Flags: needinfo?(dminor)
Flags: needinfo?(ahalberstadt)
Comment 3•9 years ago
|
||
Also cc'ing Joel in case he knows of a contributor who might want to take this one.
Comment 4•9 years ago
|
||
This seems to be more than just windows, here's an osx one: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1424368068/mozilla-inbound_mountainlion-debug_test-mochitest-3-bm106-tests1-macosx-build57.txt.gz The "TEST-INFO | screencapture: exit 0" means we at least get here: https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#1568 I don't know how to debug this further, but at least for the non-windows case, instead of calling os.kill(), calling mozprocess.kill() might work.. I don't know what to do about windows except to maybe add a localized timeout to that function and revert to mozprocess.kill() if crashinject is what's timing out. Of course this is all assuming the timeout is happening in here. How reproducible is this? Which jobs are most likely to reproduce?
Flags: needinfo?(ahalberstadt)
Comment 5•9 years ago
|
||
Bug 900112 in the deps is one.
Comment 6•9 years ago
|
||
Also bug 836829, bug 1009951, bug 922420, bug 1101275
Comment 7•9 years ago
|
||
We might just instrument the code better so when it does occur, we have more information where it's dying.
Comment 8•9 years ago
|
||
Bug 1009951 seemed to have a lot of win/osx debug mochitest-3 failures. Assuming I can reproduce, this should at least tell us whether or not killAndGetStack is the problem: https://treeherder.mozilla.org/#/jobs?repo=try&revision=244c34ce33b2 p.s ability to remote debug test slaves somehow would be an awesome project :p
Comment 9•9 years ago
|
||
Not reproducing easily.. retriggered some more.
Comment 10•9 years ago
|
||
I'm not likely to have time to look at this in the near future.
Flags: needinfo?(dminor)
Comment 11•9 years ago
|
||
I pushed this to try the other day and found that invoking crashinject.exe is the last thing that happens before we hit the mozharness timeout. I've pushed again with ted's patches from bug 890026 to see if that helps, hopefully this will reproduce again. I found in the course of re-triggers that we do get stacks for some timeouts - the intermittent failures associated with this bug are associated with tests that intentionally cause a hang.
Comment 12•9 years ago
|
||
This was partly effective in https://treeherder.mozilla.org/logviewer.html#?job_id=8723461&repo=try We got a stack out of the plugin process, but failed to get it out of the firefox process for some reason ("20:25:32 INFO - 2015-06-18 20:25:32: minidump.cc:3834: ERROR: Minidump cannot read header"). This might be worth taking -- I imagine the plugin process stack isn't worthless for debugging, and we at least kill firefox, so we waste less time waiting for the mozprocess timeout. I'll confirm this isn't a regression for cases we're already getting a stack.
Comment 13•9 years ago
|
||
That's not great, that indicates to me that that the minidump-writing code isn't doing something right. I note in your patch that you replaced dump_directory from my original patch with utilityPath: https://hg.mozilla.org/try/rev/b914cac8b6ad#l1.59 That's definitely not right, even if you write a minidump there the harness won't find it.
Comment 14•9 years ago
|
||
Oh! Thank you for noticing that.
Comment 15•9 years ago
|
||
According to recent instances of bug 1009951 and bug 900112 this was addressed by bug 890026.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
You need to log in
before you can comment on or make changes to this bug.
Description
•