Open Bug 1325609 Opened 3 years ago Updated 2 months ago

Intermittent xpcshell timeout during xpcom/tests/unit/test_nsIProcess.js

Categories

(Core :: XPCOM, defect, P3)

defect

Tracking

()

Tracking Status
firefox52 --- disabled
firefox53 --- disabled
firefox54 --- disabled

People

(Reporter: gbrown, Unassigned)

References

Details

(Whiteboard: [test disabled])

Bug 1317155 tracks intermittent xpcshell test timeouts. Nearly all of the recent failures occur in test_pipe.js on linux64/debug.

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=40761877#L2247

 [task 2016-12-18T22:46:51.369651Z] 22:46:51     INFO -  TEST-START | xpcom/tests/unit/test_pipe.js

 [task 2016-12-18T23:03:31.421937Z] 23:03:31     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/home/worker/workspace/build/venv/bin/python', '-u', '/home/worker/workspace/build/tests/xpcshell/runxpcshelltests.py', '--total-chunks', '10', '--this-chunk', '10', '--symbols-path=/home/worker/workspace/build/symbols', '--test-plugin-path=/home/worker/workspace/build/application/firefox/plugins', '--log-raw=/home/worker/workspace/build/blobber_upload_dir/xpcshell_raw.log', '--log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/xpcshell_errorsummary.log', '--utility-path=tests/bin', '--xpcshell=/home/worker/workspace/build/application/firefox/xpcshell', '--manifest=tests/xpcshell/tests/xpcshell.ini']

[task 2016-12-18T23:03:31.467854Z] 23:03:31 ERROR - timed out after 1000 seconds of no output 


When test_pipe.js succeeds on linux64/debug, it completes in about 5 seconds; it seems likely that the test is intermittently hanging.
Might be related to landing bug 1134372.  Does the increase in test_pipe.js failures correlate to 12/03/2016?
Flags: needinfo?(gbrown)
(In reply to Ben Kelly [:bkelly] from comment #1)
> Does the increase in test_pipe.js
> failures correlate to 12/03/2016?

No, I don't think so. There were definitely test_pipe.js failures as early as Nov 12. I think there was a slight increase in frequency from Dec 8 until now.
Flags: needinfo?(gbrown)
I think I erred in blaming test_pipe.js: I forgot that xpcshell runs multiple tests in parallel. Many failures immediately follow the test_pipe.js START message, but there are other tests running too. 

There are also some failures which do not show test_pipe.js running. For instance:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=65880357&lineNumber=2220

[task 2017-01-03T16:58:34.798096Z] 16:58:34     INFO -  TEST-PASS | xpcom/tests/unit/test_ioutil.js | took 4185ms
[task 2017-01-03T16:58:34.949024Z] 16:58:34     INFO -  TEST-START | xpcom/tests/unit/test_nsIMutableArray.js
[task 2017-01-03T16:58:35.502039Z] 16:58:35     INFO -  TEST-PASS | xpcom/tests/unit/test_localfile.js | took 4385ms
[task 2017-01-03T16:58:35.680612Z] 16:58:35     INFO -  TEST-START | xpcom/tests/unit/test_nsIProcess.js
[task 2017-01-03T16:58:36.629717Z] 16:58:36     INFO -  TEST-PASS | xpcom/tests/unit/test_mac_bundle.js | took 4145ms
[task 2017-01-03T16:58:36.800661Z] 16:58:36     INFO -  TEST-START | xpcom/tests/unit/test_nsIProcess_stress.js
[task 2017-01-03T17:15:16.854540Z] 17:15:16     INFO - Automation Error: mozprocess timed out after 1000 seconds 

Reviewing several failures from bug 1317155, it looks to me like test_nsIMutableArray.js is always running when we hit the timeout.

test_nsIMutableArray.js was modified in bug 1311191, a few weeks before the earliest known failure.

:erahm -- Any thoughts/interest?
Flags: needinfo?(erahm)
Summary: Intermittent timeout running xpcom/tests/unit/test_pipe.js → Intermittent xpcshell timeout following xpcom/tests/unit/test_nsIMutableArray.js
I've found a few examples where test_nsIMutableArray.js completed before the timeout:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-aurora&job_id=68444593&lineNumber=2273

test_nsIProcess_stress.js is another possibility, but there have been no recent changes to that test.
Sequential runs show it's test_nsIProcess.js.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=70a0cdcc818193761f54119f7e7b2e698944c45d
Flags: needinfo?(erahm)
See Also: → 1317605
Summary: Intermittent xpcshell timeout following xpcom/tests/unit/test_nsIMutableArray.js → Intermittent xpcshell timeout during xpcom/tests/unit/test_nsIProcess.js
With sequential and verbose logging I get: https://public-artifacts.taskcluster.net/YgbyHH9-T1SaFk-vVUw5og/0/public/logs/live_backing.log

[task 2017-01-13T20:16:43.806630Z] 20:16:43     INFO -  TEST-START | xpcom/tests/unit/test_nsIProcess.js
[task 2017-01-13T20:16:43.810024Z] 20:16:43     INFO -  xpcom/tests/unit/test_nsIProcess.js | full command: ['/home/worker/workspace/build/application/firefox/xpcshell', '-g', '/home/worker/workspace/build/application/firefox', '-a', '/home/worker/workspace/build/application/firefox', '-r', '/home/worker/workspace/build/application/firefox/components/httpd.manifest', '-m', '-s', '-e', 'const _HEAD_JS_PATH = "/home/worker/workspace/build/tests/xpcshell/head.js";', '-e', 'const _MOZINFO_JS_PATH = "/tmp/xpc-profile-DyEEAl/mozinfo.json";', '-e', 'const _TESTING_MODULES_DIR = "/home/worker/workspace/build/tests/modules/";', '-f', '/home/worker/workspace/build/tests/xpcshell/head.js', '-p', '/tmp/xpc-plugins-_tMmSS', '-e', 'const _SERVER_ADDR = "localhost"', '-e', u'const _HEAD_FILES = ["/home/worker/workspace/build/tests/xpcshell/tests/xpcom/tests/unit/head_xpcom.js"];', '-e', 'const _TAIL_FILES = [];', '-e', 'const _JSDEBUGGER_PORT = 0;', '-e', u'const _TEST_FILE = ["/home/worker/workspace/build/tests/xpcshell/tests/xpcom/tests/unit/test_nsIProcess.js"];', '-e', u'const _TEST_NAME = "xpcom/tests/unit/test_nsIProcess.js"', '-e', '_execute_test(); quit(0);']
[task 2017-01-13T20:16:43.810421Z] 20:16:43     INFO -  xpcom/tests/unit/test_nsIProcess.js | current directory: u'/home/worker/workspace/build/tests/xpcshell/tests/xpcom/tests/unit'
[task 2017-01-13T20:16:43.812892Z] 20:16:43     INFO -  xpcom/tests/unit/test_nsIProcess.js | environment: ['XPCOM_DEBUG_BREAK=stack-and-abort', 'LD_LIBRARY_PATH=/home/worker/workspace/build/application/firefox', 'XPCSHELL_TEST_TEMP_DIR=/tmp/xpc-other-KOYPJi', 'MOZHTTP2_PORT=54271', 'XPCSHELL_TEST_PROFILE_DIR=/tmp/xpc-profile-DyEEAl', 'MOZ_CRASHREPORTER=1', 'MOZ_DISABLE_NONLOCAL_CONNECTIONS=1', 'MOZ_CRASHREPORTER_NO_REPORT=1']
[task 2017-01-13T20:16:44.401597Z] 20:16:44     INFO -  PROCESS | 13340 | [13340] WARNING: Couldn't get the user appdata directory. Crash events may not be produced.: file /home/worker/workspace/build/src/toolkit/crashreporter/nsExceptionHandler.cpp, line 2861
[task 2017-01-13T20:16:44.983550Z] 20:16:44     INFO -  (xpcshell/head.js) | test MAIN run_test pending (1)
[task 2017-01-13T20:16:44.996228Z] 20:16:44     INFO -  TEST-PASS | xpcom/tests/unit/test_nsIProcess.js | test_kill - [test_kill : 23] true == true
[task 2017-01-13T20:16:45.014244Z] 20:16:45     INFO -  TEST-PASS | xpcom/tests/unit/test_nsIProcess.js | test_kill - [test_kill : 33] true == true
[task 2017-01-13T20:33:25.135355Z] 20:33:25     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/home/worker/workspace/build/venv/bin/python', '-u', '/home/worker/workspace/build/tests/xpcshell/runxpcshelltests.py', '--total-chunks', '8', '--this-chunk', '8', '--symbols-path=/home/worker/workspace/build/symbols', '--test-plugin-path=/home/worker/workspace/build/application/firefox/plugins', '--log-raw=/home/worker/workspace/build/blobber_upload_dir/xpcshell_raw.log', '--log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/xpcshell_errorsummary.log', '--utility-path=tests/bin', '--sequential', '--xpcshell=/home/worker/workspace/build/application/firefox/xpcshell', '--manifest=tests/xpcshell/tests/xpcshell.ini']
[task 2017-01-13T20:33:25.237115Z] 20:33:25    ERROR - timed out after 1000 seconds of no output

...just like bug 1317605, but caught by taskcluster instead of the harness.
(In reply to Geoff Brown [:gbrown] from comment #6)
> ...just like bug 1317605, but caught by taskcluster instead of the harness.

In bug 1317605, proc.kill() fails at this point in the test; in this bug, it seems proc.kill() hangs. I'm trying to reproduce and narrow down the source of the hang on try.

I notice that test_nsIProcess_stress.js has similar code, but puts the kill() in a try/catch and ignores errors.
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=9c15d2d38301c7f0fd1564db1f6b69c5596a551e, I added logging and reproduced the hang in test_nsIProcess.js 4 times.

https://public-artifacts.taskcluster.net/RbB1nGddQP2f4teQJSq1nQ/0/public/logs/live_backing.log

[task 2017-01-18T00:11:39.946060Z] 00:11:39     INFO -  (xpcshell/head.js) | test MAIN run_test pending (1)
[task 2017-01-18T00:11:39.958644Z] 00:11:39     INFO -  TEST-PASS | xpcom/tests/unit/test_nsIProcess.js | test_kill - [test_kill : 23] true == true
[task 2017-01-18T00:11:39.961259Z] 00:11:39     INFO -  PROCESS | 13367 | zzz nsProcess::Kill
[task 2017-01-18T00:11:39.963130Z] 00:11:39     INFO -  PROCESS | 13367 | zzz no mThread!!
[task 2017-01-18T00:11:39.981014Z] 00:11:39     INFO -  TEST-PASS | xpcom/tests/unit/test_nsIProcess.js | test_kill - [test_kill : 33] true == true
[task 2017-01-18T00:28:20.100591Z] 00:28:20     INFO - Automation Error: mozprocess timed out after 1000 seconds

"[test_kill : 33]" is logged just before the test calls process.kill() but there is no "zzz nsProcess::Kill", as though nsProcess::Kill() was not called.
Nathan - I feel stuck. Can you find someone to look into this? Bug 1317155 is one of the most frequent intermittent test failures now and is almost entirely caused by the apparent hang in test_nsIProcess.js illustrated in comment 8.
Flags: needinfo?(nfroyd)
(In reply to Geoff Brown [:gbrown] from comment #9)
> Nathan - I feel stuck. Can you find someone to look into this? Bug 1317155
> is one of the most frequent intermittent test failures now and is almost
> entirely caused by the apparent hang in test_nsIProcess.js illustrated in
> comment 8.

Are you sure stderr is getting flushed before that timeout, either in the C runtime, or on the Python side?  It seems very unlikely that nsProcess::Kill *isn't* getting called, but it seems reasonable that we might not be draining stderr somewhere along the way, so we're not seeing the messages.
Flags: needinfo?(nfroyd)
I'm not sure. We use 'python -u runxpcshelltests.py', but I'm not sure about the xpcshell binary, and I suppose something could be buffered at the mozharness or taskcluster levels.
I tried to coax additional logging out of the hang, but found no change.

Bug 1317155 test failures are quite frequent and are eliminated if test_nsIProcess.js is skipped; I'm going to skip test_nsIProcess.js on linux.
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d9d333eac5e
Skip test_nsIProcess.js on linux, for frequent timeouts; r=me
Keywords: leave-open
Whiteboard: [test disabled]
Keywords: leave-open
Whiteboard: [test disabled] → [test disabled][checkin-needed-aurora][checkin-needed-beta]
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.