Closed Bug 1032849 Opened 10 years ago Closed 9 years ago

Hangs on windows result in a mozprocess timeout rather than a stack

Categories: Testing :: General (defect)
Platform: x86 Windows XP
Priority: Not set
Severity: normal
Tracking: Not tracked

RESOLVED DUPLICATE of bug 890026

People: emorley (Reporter), Unassigned

References

Details

In bug 1009951, bug 900112, and bug 1032842, we're seeing mochitest hangs (where we hit the harness timeout: "application timed out after 330 seconds with no output") that are followed by a mozprocess timeout, rather than the harness handling the hang gracefully (e.g. trying to kill the process and get a stack, or, failing that, at least exiting the test run cleanly).

e.g.:
https://tbpl.mozilla.org/php/getParsedLog.php?id=42856603&tree=Mozilla-Central
06:21:52     INFO -  1163 INFO TEST-START | /tests/dom/plugins/test/mochitest/test_CrashService_hang.html
06:21:52     INFO -  1164 INFO TEST-INFO | /tests/dom/plugins/test/mochitest/test_CrashService_hang.html | Waiting for pruneOldCrashes
06:27:22  WARNING -  TEST-UNEXPECTED-FAIL | /tests/dom/plugins/test/mochitest/test_CrashService_hang.html | application timed out after 330 seconds with no output
06:27:23     INFO -  TEST-INFO | screenshot: exit status 0
06:44:03     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['C:\\slave\\test\\build\\venv\\Scripts\\python', '-u', 'C:\\slave\\test\\build\\tests\\mochitest/runtests.py', '--total-chunks', '5', '--this-chunk', '3', '--appname=C:\\slave\\test\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1404208922/firefox-33.0a1.en-US.win32.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--autorun', '--close-when-done', '--console-level=INFO', '--quiet', '--chunk-by-dir=4']
06:44:03    ERROR - timed out after 1000 seconds of no output
06:44:03    ERROR - Return code: 572


and:
https://tbpl.mozilla.org/php/getParsedLog.php?id=42851832&tree=Mozilla-Inbound
03:55:30     INFO -  TEST-INFO | chrome://mochitests/content/browser/browser/base/content/test/general/browser_aboutSupport_newtab_security_state.js | Entering test checkIdentityOfAboutSupport
03:55:31     INFO -  1404215731110	Browser.Experiments.Experiments	TRACE	Experiments #0::enabled=true, true
03:55:31     INFO -  1404215731111	Browser.Experiments.Experiments	TRACE	Experiments #0::Registering instance with Addon Manager.
03:55:31     INFO -  1404215731111	Browser.Experiments.Experiments	TRACE	Experiments #0::Registering previous experiment add-on provider.
03:55:31     INFO -  1404215731111	Browser.Experiments.Experiments	TRACE	PreviousExperimentProvider #0::startup()
03:55:31     INFO -  1404215731111	Browser.Experiments.Experiments	TRACE	Experiments #0::_loadFromCache
04:01:01  WARNING -  TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/base/content/test/general/browser_aboutSupport_newtab_security_state.js | application timed out after 330 seconds with no output
04:01:01     INFO -  TEST-INFO | screenshot: exit status 0
04:17:41     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['C:\\slave\\test\\build\\venv\\Scripts\\python', '-u', 'C:\\slave\\test\\build\\tests\\mochitest/runtests.py', '--total-chunks', '3', '--this-chunk', '1', '--appname=C:\\slave\\test\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-pgo/1404203403/firefox-33.0a1.en-US.win32.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--autorun', '--close-when-done', '--console-level=INFO', '--quiet', '--browser-chrome', '--chunk-by-dir=5']
04:17:41    ERROR - timed out after 1000 seconds of no output
04:17:41    ERROR - Return code: 572

Ted, do you have any idea why this might be happening now? (It started occurring in the last month or so, IIRC.)
Flags: needinfo?(ted)
I don't know of anything offhand that would have changed this in that time frame. The only things I can think of to do would be to:
a) Add a bunch of debug logging around process timeouts in Mochitest and see if anything stands out (a rough sketch of what that could look like follows below).
b) Do a try push with the patches from bug 890026 applied and see if it reproduces in that case.
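
For illustration only, here's a minimal sketch of the kind of logging option (a) might add around the harness's timeout path; the step names and `run_crashinject` are assumptions for the example, not the actual runtests.py code:

```python
# Hypothetical sketch: time and log each step of the timeout handling
# (screenshot, crashinject, dump processing) so the log shows which step
# never returns before mozprocess gives up.
import logging
import time

log = logging.getLogger("mochitest")

def log_timeout_step(step_name):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.time()
            log.info("timeout-debug: starting %s", step_name)
            try:
                return fn(*args, **kwargs)
            finally:
                log.info("timeout-debug: finished %s after %.1fs",
                         step_name, time.time() - start)
        return wrapper
    return decorator

@log_timeout_step("crashinject")
def run_crashinject(pid):
    # placeholder for the real call that injects a crash into |pid|
    pass
```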
Flags: needinfo?(ted)
Adding a couple of people who might be able to help. Dan, Ahal, do either of you have time to look into this? The sheriffs are going to start hiding WinXP tests because of this if we can't fix it.
Flags: needinfo?(dminor)
Flags: needinfo?(ahalberstadt)
Also cc'ing Joel in case he knows of a contributor who might want to take this one.
This seems to be more than just Windows; here's an OS X one:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1424368068/mozilla-inbound_mountainlion-debug_test-mochitest-3-bm106-tests1-macosx-build57.txt.gz

The "TEST-INFO | screencapture: exit 0" means we at least get here:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#1568

I don't know how to debug this further, but at least for the non-Windows case, calling mozprocess.kill() instead of os.kill() might work. I don't know what to do about Windows except to maybe add a localized timeout to that function and fall back to mozprocess.kill() if crashinject is what's timing out. Of course, this is all assuming the timeout is happening in there.
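
Roughly what I mean, as a sketch; `run_crashinject` and `proc` are placeholders rather than real harness names, and the timeout value is made up:

```python
# Sketch of the fallback described above: give crashinject a bounded
# amount of time to produce a dump, then fall back to mozprocess's
# kill() so we don't sit around until the 1000s mozprocess timeout.
import threading

CRASHINJECT_TIMEOUT = 60  # seconds; illustrative value

def kill_and_get_stack(proc, run_crashinject):
    done = threading.Event()

    def inject():
        run_crashinject()  # e.g. spawn crashinject.exe against proc.pid
        done.set()

    t = threading.Thread(target=inject)
    t.daemon = True
    t.start()
    if not done.wait(CRASHINJECT_TIMEOUT):
        # crashinject never came back; kill the browser outright so the
        # harness can finish and report the timeout itself.
        proc.kill()  # mozprocess ProcessHandler.kill()
```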

How reproducible is this? Which jobs are most likely to reproduce it?
Flags: needinfo?(ahalberstadt)
Bug 900112 in the deps is one.
We might just instrument the code better so that when it does occur, we have more information about where it's dying.
Bug 1009951 seemed to have a lot of Windows/OS X debug mochitest-3 failures. Assuming I can reproduce it, this should at least tell us whether or not killAndGetStack is the problem:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=244c34ce33b2

P.S. The ability to remotely debug test slaves somehow would be an awesome project :p
Not reproducing easily; retriggered some more.
I'm not likely to have time to look at this in the near future.
Flags: needinfo?(dminor)
I pushed this to try the other day and found that invoking crashinject.exe is the last thing that happens before we hit the mozharness timeout. I've pushed again with Ted's patches from bug 890026 to see if that helps; hopefully this will reproduce again.

I found in the course of retriggers that we do get stacks for some timeouts; the intermittent failures associated with this bug involve tests that intentionally cause a hang.
This was partly effective in https://treeherder.mozilla.org/logviewer.html#?job_id=8723461&repo=try

We got a stack out of the plugin process, but failed to get one out of the Firefox process for some reason ("20:25:32 INFO - 2015-06-18 20:25:32: minidump.cc:3834: ERROR: Minidump cannot read header"). This might be worth taking: I imagine the plugin process stack isn't worthless for debugging, and we at least kill Firefox, so we waste less time waiting for the mozprocess timeout. I'll confirm this isn't a regression for cases where we're already getting a stack.
That's not great; it indicates to me that the minidump-writing code isn't doing something right. I note that in your patch you replaced dump_directory from my original patch with utilityPath:
https://hg.mozilla.org/try/rev/b914cac8b6ad#l1.59

That's definitely not right; even if you write a minidump there, the harness won't find it.
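
For reference, a minimal sketch of why the location matters, assuming the harness checks for dumps with mozcrash after the run; the paths and test name below are illustrative, not taken from the actual configuration:

```python
# The harness only looks for .dmp files in the dump directory it hands
# to mozcrash, so a minidump written under utilityPath is invisible to it.
import os
import mozcrash

profile_dir = "/tmp/mochitest-profile"              # illustrative
symbols_path = "firefox-crashreporter-symbols.zip"  # illustrative
dump_directory = os.path.join(profile_dir, "minidumps")

# crashinject (or whatever writes the dump) must target dump_directory,
# because this is the only place checked after the process exits:
found = mozcrash.check_for_crashes(dump_directory, symbols_path,
                                   test_name="browser_example_test.js")
```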
Oh! Thank you for noticing that.
According to recent instances of bug 1009951 and bug 900112, this was addressed by bug 890026.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE