Closed Bug 1032849 Opened 10 years ago Closed 9 years ago

Hangs on windows result in a mozprocess timeout rather than a stack

Categories: Testing :: General (defect)
Platform: x86 Windows XP
Priority: Not set
Severity: normal
Tracking: Not tracked

RESOLVED DUPLICATE of bug 890026

People: emorley (Reporter), Unassigned

References

Details

In bug 1009951, bug 900112, and bug 1032842, we're seeing mochitest hangs (where we hit the harness timeout: "application timed out after 330 seconds with no output") that are followed by a mozprocess timeout, rather than the harness handling the hang gracefully (e.g. trying to kill the process and get a stack, or, failing that, at least exiting the test run cleanly).

e.g.:
https://tbpl.mozilla.org/php/getParsedLog.php?id=42856603&tree=Mozilla-Central
06:21:52     INFO -  1163 INFO TEST-START | /tests/dom/plugins/test/mochitest/test_CrashService_hang.html
06:21:52     INFO -  1164 INFO TEST-INFO | /tests/dom/plugins/test/mochitest/test_CrashService_hang.html | Waiting for pruneOldCrashes
06:27:22  WARNING -  TEST-UNEXPECTED-FAIL | /tests/dom/plugins/test/mochitest/test_CrashService_hang.html | application timed out after 330 seconds with no output
06:27:23     INFO -  TEST-INFO | screenshot: exit status 0
06:44:03     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['C:\\slave\\test\\build\\venv\\Scripts\\python', '-u', 'C:\\slave\\test\\build\\tests\\mochitest/runtests.py', '--total-chunks', '5', '--this-chunk', '3', '--appname=C:\\slave\\test\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1404208922/firefox-33.0a1.en-US.win32.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--autorun', '--close-when-done', '--console-level=INFO', '--quiet', '--chunk-by-dir=4']
06:44:03    ERROR - timed out after 1000 seconds of no output
06:44:03    ERROR - Return code: 572


and:
https://tbpl.mozilla.org/php/getParsedLog.php?id=42851832&tree=Mozilla-Inbound
03:55:30     INFO -  TEST-INFO | chrome://mochitests/content/browser/browser/base/content/test/general/browser_aboutSupport_newtab_security_state.js | Entering test checkIdentityOfAboutSupport
03:55:31     INFO -  1404215731110	Browser.Experiments.Experiments	TRACE	Experiments #0::enabled=true, true
03:55:31     INFO -  1404215731111	Browser.Experiments.Experiments	TRACE	Experiments #0::Registering instance with Addon Manager.
03:55:31     INFO -  1404215731111	Browser.Experiments.Experiments	TRACE	Experiments #0::Registering previous experiment add-on provider.
03:55:31     INFO -  1404215731111	Browser.Experiments.Experiments	TRACE	PreviousExperimentProvider #0::startup()
03:55:31     INFO -  1404215731111	Browser.Experiments.Experiments	TRACE	Experiments #0::_loadFromCache
04:01:01  WARNING -  TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/base/content/test/general/browser_aboutSupport_newtab_security_state.js | application timed out after 330 seconds with no output
04:01:01     INFO -  TEST-INFO | screenshot: exit status 0
04:17:41     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['C:\\slave\\test\\build\\venv\\Scripts\\python', '-u', 'C:\\slave\\test\\build\\tests\\mochitest/runtests.py', '--total-chunks', '3', '--this-chunk', '1', '--appname=C:\\slave\\test\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-pgo/1404203403/firefox-33.0a1.en-US.win32.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--autorun', '--close-when-done', '--console-level=INFO', '--quiet', '--browser-chrome', '--chunk-by-dir=5']
04:17:41    ERROR - timed out after 1000 seconds of no output
04:17:41    ERROR - Return code: 572

Ted, do you have any idea why this might be happening now? (It started occurring in the last month or so, IIRC.)
Flags: needinfo?(ted)
I don't know of anything offhand that would have changed this in that time frame. The only things I can think of to do would be to:
a) Add a bunch of debug logging around process timeouts in Mochitest and see if anything stands out (a rough sketch of what that could look like follows below).
b) Do a try push with the patches from bug 890026 applied and see if it reproduces in that case.
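
For illustration only, here's a minimal sketch of the kind of logging option (a) might add around the harness's timeout path; the step names and `run_crashinject` are assumptions for the example, not the actual runtests.py code:

```python
# Hypothetical sketch: time and log each step of the timeout handling
# (screenshot, crashinject, dump processing) so the log shows which step
# never returns before mozprocess gives up.
import logging
import time

log = logging.getLogger("mochitest")

def log_timeout_step(step_name):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.time()
            log.info("timeout-debug: starting %s", step_name)
            try:
                return fn(*args, **kwargs)
            finally:
                log.info("timeout-debug: finished %s after %.1fs",
                         step_name, time.time() - start)
        return wrapper
    return decorator

@log_timeout_step("crashinject")
def run_crashinject(pid):
    # placeholder for the real call that injects a crash into |pid|
    pass
```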
Flags: needinfo?(ted)
Adding a couple of people who might be able to help. Dan, Ahal, do either of you have time to look into this? The sheriffs are going to start hiding WinXP tests because of this if we can't fix it.
Flags: needinfo?(dminor)
Flags: needinfo?(ahalberstadt)
Also cc'ing Joel in case he knows of a contributor who might want to take this one.
This seems to be more than just Windows; here's an OS X one:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1424368068/mozilla-inbound_mountainlion-debug_test-mochitest-3-bm106-tests1-macosx-build57.txt.gz

The "TEST-INFO | screencapture: exit 0" means we at least get here:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#1568

I don't know how to debug this further, but at least for the non-Windows case, calling mozprocess.kill() instead of os.kill() might work. I don't know what to do about Windows except to maybe add a localized timeout to that function and fall back to mozprocess.kill() if crashinject is what's timing out. Of course, this is all assuming the timeout is happening in there.
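
Roughly what I mean, as a sketch; `run_crashinject` and `proc` are placeholders rather than real harness names, and the timeout value is made up:

```python
# Sketch of the fallback described above: give crashinject a bounded
# amount of time to produce a dump, then fall back to mozprocess's
# kill() so we don't sit around until the 1000s mozprocess timeout.
import threading

CRASHINJECT_TIMEOUT = 60  # seconds; illustrative value

def kill_and_get_stack(proc, run_crashinject):
    done = threading.Event()

    def inject():
        run_crashinject()  # e.g. spawn crashinject.exe against proc.pid
        done.set()

    t = threading.Thread(target=inject)
    t.daemon = True
    t.start()
    if not done.wait(CRASHINJECT_TIMEOUT):
        # crashinject never came back; kill the browser outright so the
        # harness can finish and report the timeout itself.
        proc.kill()  # mozprocess ProcessHandler.kill()
```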

How reproducible is this? Which jobs are most likely to reproduce it?
Flags: needinfo?(ahalberstadt)
Bug 900112 in the deps is one.
We might just instrument the code better so that when it does occur, we have more information about where it's dying.
Bug 1009951 seemed to have a lot of Windows/OS X debug mochitest-3 failures. Assuming I can reproduce it, this should at least tell us whether or not killAndGetStack is the problem:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=244c34ce33b2

P.S. The ability to remotely debug test slaves somehow would be an awesome project :p
Not reproducing easily; retriggered some more.
I'm not likely to have time to look at this in the near future.
Flags: needinfo?(dminor)
I pushed this to try the other day and found that invoking crashinject.exe is the last thing that happens before we hit the mozharness timeout. I've pushed again with Ted's patches from bug 890026 to see if that helps; hopefully this will reproduce again.

I found in the course of retriggers that we do get stacks for some timeouts; the intermittent failures associated with this bug involve tests that intentionally cause a hang.
This was partly effective in https://treeherder.mozilla.org/logviewer.html#?job_id=8723461&repo=try

We got a stack out of the plugin process, but failed to get one out of the Firefox process for some reason ("20:25:32 INFO - 2015-06-18 20:25:32: minidump.cc:3834: ERROR: Minidump cannot read header"). This might be worth taking: I imagine the plugin process stack isn't worthless for debugging, and we at least kill Firefox, so we waste less time waiting for the mozprocess timeout. I'll confirm this isn't a regression for cases where we're already getting a stack.
That's not great; it indicates to me that the minidump-writing code isn't doing something right. I note that in your patch you replaced dump_directory from my original patch with utilityPath:
https://hg.mozilla.org/try/rev/b914cac8b6ad#l1.59

That's definitely not right; even if you write a minidump there, the harness won't find it.
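
For reference, a minimal sketch of why the location matters, assuming the harness checks for dumps with mozcrash after the run; the paths and test name below are illustrative, not taken from the actual configuration:

```python
# The harness only looks for .dmp files in the dump directory it hands
# to mozcrash, so a minidump written under utilityPath is invisible to it.
import os
import mozcrash

profile_dir = "/tmp/mochitest-profile"              # illustrative
symbols_path = "firefox-crashreporter-symbols.zip"  # illustrative
dump_directory = os.path.join(profile_dir, "minidumps")

# crashinject (or whatever writes the dump) must target dump_directory,
# because this is the only place checked after the process exits:
found = mozcrash.check_for_crashes(dump_directory, symbols_path,
                                   test_name="browser_example_test.js")
```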
Oh! Thank you for noticing that.
According to recent instances of bug 1009951 and bug 900112, this was addressed by bug 890026.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE