Closed Bug 1412035 Opened 7 years ago Closed 7 years ago

Intermittent Automation Error: mozprocess timed out after 1000 seconds running in browser/base/content/test/plugins/browser_pluginCrashCommentAndURL.js or browser_pluginCrashCommentAndURL.js | application timed out after 370 seconds with no output

Categories

(Core Graveyard :: Plug-ins, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:backout])

Filed by: archaeopteryx [at] coole-files.de

https://treeherder.mozilla.org/logviewer.html#?job_id=139787757&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/N38Sm1KGShGoYbH2L8_4Gw/runs/0/artifacts/public/logs/live_backing.log

08:29:14     INFO -  189 INFO TEST-START | browser/base/content/test/plugins/browser_pluginCrashCommentAndURL.js
08:29:14     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
08:29:14     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x51001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
08:29:14     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
08:29:14     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x51001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
08:29:15     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
08:29:15     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x51001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
08:29:15     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x51001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
08:29:44     INFO -  GECKO(16952) | [NPAPI 12948, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
08:29:44     INFO -  GECKO(16952) | [NPAPI 12948, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
08:46:24     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['Z:\\task_1509005989\\build\\venv\\Scripts\\python', '-u', 'Z:\\task_1509005989\\build\\tests\\mochitest\\runtests.py', '--total-chunks', '7', '--this-chunk', '7', '--appname=Z:\\task_1509005989\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/WHmcFJ76R_6GJDS4QE8ypA/artifacts/public/build/target.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--quiet', '--log-raw=Z:\\task_1509005989\\build\\blobber_upload_dir\\browser-chrome-chunked_raw.log', '--log-errorsummary=Z:\\task_1509005989\\build\\blobber_upload_dir\\browser-chrome-chunked_errorsummary.log', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--flavor=browser', '--chunk-by-runtime']
[taskcluster 2017-10-26T09:25:59.950Z] Aborting task - max run time exceeded!
This bug started spiking on November 2nd and has failed 81 times in the last 7 days. It fails only on Windows, on PGO and opt build types.

Here is a recent log of the fail:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=142111895&lineNumber=2175

And here is a part of the log:

05:58:06     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
05:58:06     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x59001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
05:58:07     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
05:58:07     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x59001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
05:58:07     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
05:58:07     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x59001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
05:58:07     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x59001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
05:58:36     INFO -  GECKO(1996) | [NPAPI 6104, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
05:58:36     INFO -  GECKO(1996) | [NPAPI 6104, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
06:15:16     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['Z:\\task_1509774926\\build\\venv\\Scripts\\python', '-u', 'Z:\\task_1509774926\\build\\tests\\mochitest\\runtests.py', '--total-chunks', '7', '--this-chunk', '7', '--appname=Z:\\task_1509774926\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/NKghVl6fT-mxZOT0C716hg/artifacts/public/build/target.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--quiet', '--log-raw=Z:\\task_1509774926\\build\\blobber_upload_dir\\browser-chrome-chunked_raw.log', '--log-errorsummary=Z:\\task_1509774926\\build\\blobber_upload_dir\\browser-chrome-chunked_errorsummary.log', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--flavor=browser', '--chunk-by-runtime']
[taskcluster 2017-11-04T06:55:27.408Z] Aborting task - max run time exceeded!
[taskcluster 2017-11-04T06:55:27.552Z]   Exit Code: 0
[taskcluster 2017-11-04T06:55:27.552Z]   User Time: 0s
[taskcluster 2017-11-04T06:55:27.552Z] Kernel Time: 15.625ms
[taskcluster 2017-11-04T06:55:27.552Z]   Wall Time: 59m59.574047s
[taskcluster 2017-11-04T06:55:27.552Z] Peak Memory: 4878336
[taskcluster 2017-11-04T06:55:27.553Z]      Result: IDLENESS_LIMIT_EXCEEDED
[taskcluster 2017-11-04T06:55:27.553Z] === Task Finished ===
[taskcluster 2017-11-04T06:55:27.553Z] Task Duration: 59m59.5767969s

:jimm: Hi, there has been a huge spike in this bug over the last 2 days; can you please take a look at it?
Flags: needinfo?(jmathies)
Not sure quite how, but in one stretch of seemingly-unrelated pushes on autoland, this morphed from being a mozprocess kill after 1000 seconds of hanging to being a harness kill after 370 (formerly 330) seconds.
Summary: Intermittent Automation Error: mozprocess timed out after 1000 seconds running in browser/base/content/test/plugins/browser_pluginCrashCommentAndURL.js → Intermittent Automation Error: mozprocess timed out after 1000 seconds running in browser/base/content/test/plugins/browser_pluginCrashCommentAndURL.js or browser_pluginCrashCommentAndURL.js | application timed out after 370 seconds with no output
Also adding to the confusion, this *should* have been the fault of bug 1333126, since we actually hounded the patch there out of the tree once before for doing this exact same thing, the mozprocess timeout on Win10 during browser_pluginCrashCommentAndURL.js, but this time we retriggered the suite on the parent of that bug's push a hundred or so times, and did get one instance of this before it landed. Among the possibilities is that the patch there makes this far more likely to happen, and we had a pretty low-frequency intermittent that we just ignored because a mozprocess timeout looks far more like infra than it looks like intermittent test failure, until you open the log and see that it's always happening during one test, despite the failure message not mentioning the test name.
Blocks: 1333126
(In reply to Phil Ringnalda (:philor) from comment #6)
> Also adding to the confusion, this *should* have been the fault of bug
> 1333126, since we actually hounded the patch there out of the tree once
> before for doing this exact same thing, the mozprocess timeout on Win10
> during browser_pluginCrashCommentAndURL.js, but this time we retriggered the
> suite on the parent of that bug's push a hundred or so times, and did get
> one instance of this before it landed. Among the possibilities is that the
> patch there makes this far more likely to happen, and we had a pretty
> low-frequency intermittent that we just ignored because a mozprocess timeout
> looks far more like infra than it looks like intermittent test failure,
> until you open the log and see that it's always happening during one test,
> despite the failure message not mentioning the test name.

I'm a bit stumped since I've re-tested the patch for bug 1333126 both on my local machine and on try without hitting this. What the patch does is lengthen the time it takes to process a content crash, but on most pushes I see the test taking only one to three seconds at most. The failures seem to have the test running for ~17s, which is longer than I would expect but not long enough to cause a timeout. Am I missing something here WRT how the harness works and handles global timeouts?
I assume we are crashing the process and the in-browser mochitest is not failing fast, but our harnesses (like mochitest in this case) should time out after 370 seconds of no activity. Possibly we get stuck because we define activity as polling stdout/stderr from the process, and it might be in a state where we cannot get any output, or where we assume it is dead but it has not closed.
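
To make the "activity == output" idea concrete, here is a minimal, hypothetical Python sketch of an output-based timeout in the spirit of mozprocess/runtests.py; the names and structure are illustrative, not the actual harness code. The child is only treated as hung once it has produced no output for the configured number of seconds.

import subprocess
import threading
import time

# Hypothetical sketch of an output-based timeout in the spirit of
# mozprocess/runtests.py (illustrative, not the actual harness code).
def run_with_output_timeout(cmd, output_timeout=370):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    last_output = [time.monotonic()]

    def reader():
        # Every line read from the child counts as "activity" and resets
        # the countdown.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            print(line.decode("utf-8", "replace"), end="")

    threading.Thread(target=reader, daemon=True).start()

    while proc.poll() is None:
        if time.monotonic() - last_output[0] > output_timeout:
            # No output for too long: treat the browser as hung and kill it.
            # A real harness would also take a screenshot and collect
            # minidumps at this point.
            proc.kill()
            return None
        time.sleep(1)
    return proc.returncode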

:gbrown, do you have more thoughts on the timeout escalation here?
Flags: needinfo?(gbrown)
I note that most/all recent failures are "370 seconds with no output" and those logs, like https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=142303711&lineNumber=2164, seem well behaved from the harness point of view: the browser stops producing output, the harness notices that after 370 seconds, gets a screenshot, kills the browser, collects dumps.

Older logs, as in comment 3, are puzzling. In addition to jmaher's perfectly reasonable speculation, I'd suggest that runtests.py might think that the browser is no longer running and has therefore cancelled the timeout handler; if that's the case, it is unclear what runtests.py is doing / where it is hung. There's a similar question about the mozharness "mozprocess timed out after 1000 seconds" handling: if runtests.py timed out, why didn't the mozharness script complete -- where is it hung?
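
As a rough illustration of the two layers being discussed (not the real runtests.py or mozharness code), the sketch below shows an inner "no output for 370 s" watchdog that is restarted on every line of browser output, and the failure mode where cancelling it on a wrong "browser already exited" assumption leaves only the outer 1000-second mozprocess timeout or the task's max run time to end the job. All names here are made up.

import threading

# Illustrative sketch (not real runtests.py/mozharness code) of the inner
# "no output for 370 s" watchdog owned by the test harness. The outer
# 1000 s mozprocess-style timeout in the mozharness wrapper is the only
# remaining guard if this watchdog is wrongly cancelled.
class OutputWatchdog:
    def __init__(self, timeout, on_hang):
        self.timeout = timeout
        self.on_hang = on_hang
        self._timer = None

    def poke(self):
        # Called for every line of browser output; restarts the countdown.
        self.cancel()
        self._timer = threading.Timer(self.timeout, self.on_hang)
        self._timer.daemon = True
        self._timer.start()

    def cancel(self):
        # If this runs on the mistaken assumption that the browser is gone,
        # a genuine hang will never trip the 370 s handler below.
        if self._timer is not None:
            self._timer.cancel()

def on_harness_hang():
    # In the well-behaved case the harness takes a screenshot, kills the
    # browser and collects minidumps here, then logs the familiar message.
    print("application timed out after 370 seconds with no output")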
Flags: needinfo?(gbrown)
After putting a ton of print statements in the test I think I've figured out what's going on. It seems to be the same type of issue that prevented me from landing the patch before: a stray timer from unrelated code fires right at the end of the test, after the profile has been torn down, tries to access something in the profile, fails with an exception, and causes the harness to stop.

My previous encounter was with the idle-daily notification, which I reported in bug 1409769; this time it seems to be the activity stream. See this:

https://treeherder.mozilla.org/logviewer.html#?job_id=142464151&repo=try&lineNumber=2436

This:

https://treeherder.mozilla.org/logviewer.html#?job_id=142468520&repo=try&lineNumber=2304

And this:

https://treeherder.mozilla.org/logviewer.html#?job_id=142464151&repo=try&lineNumber=2418

Somehow, someway, the patch for bug 1333126 has the magical property of precisely aligning our tests with unrelated activity so as to make them fail in new and interesting ways.
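
For readers unfamiliar with the pattern, here is a minimal Python analogue of the stray-timer race described above (the real code is browser JS; all names here are made up): unrelated code schedules a timer, the test finishes and tears the profile down, and the late-firing callback then blows up while touching profile state.

import threading
import time

# Minimal Python analogue of the stray-timer race (illustrative only).
profile = {"activity_stream_cache": []}

def activity_feed_tick():
    # Fires on its own schedule, with no knowledge of test boundaries.
    profile["activity_stream_cache"].append(time.monotonic())

def teardown_profile():
    # Test cleanup: the profile is gone from this point on.
    profile.clear()

threading.Timer(0.05, activity_feed_tick).start()
teardown_profile()   # the test ends and cleans up first...
time.sleep(0.1)      # ...then the stray timer fires against the torn-down
                     # profile and raises KeyError in the timer thread --
                     # the analogue of the exception that stopped the
                     # real harness.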
I have prepared a new try-run with the system activity feed timer ticking at an interval so long it shouldn't interfere with the tests. I'll re-trigger the tests a bunch of times to see if this fixes the issue. The run is here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8184a0d5653b5bb8f223cc59db77e56c9a7412a7

Phil, the failure you mentioned in comment 6, which you saw using a parent of my push, could have been caused by this too. If I confirm this issue, then we might want to audit the codebase for other timers running during mochitests, as well as modify the harness to cope with failures they might cause. This is the second issue caused by a stray timer that I've hit while trying to land bug 1333126, so I wonder how many more there could be, and how many of our most obscure intermittents could have been caused by them.
Quick update: the activity feed timer I pointed to in comment 12 is one of the issues that can make the harness freeze, but apparently it's not the only one. I have verified that the browser_pluginCrashCommentAndURL.js test itself finishes correctly and even calls the cleanup function; the failure is actually happening somewhere in the harness between that test and the next one.
My analysis in comment 13 was wrong; I found the real cause of the problem, see bug 1333126 comment 55. Long story short: this is an existing race in the test harness, exacerbated by I/O errors being quietly ignored. The patch in bug 1333126 just made the race condition a lot more likely, thus exposing the bug.
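
A hypothetical sketch of the "quietly ignored I/O errors" part of that analysis (not the actual harness code touched by bug 1333126 or bug 1416028): swallowing the error when the message pipe breaks turns a dead child into an indefinite wait, whereas surfacing it lets the harness fail fast with a useful message in the log.

# Hypothetical sketch, not the actual harness code.
def pump_messages(pipe, handle_message, logger):
    while True:
        try:
            data = pipe.read(4096)
        except OSError as e:
            # Pre-fix behaviour would be the equivalent of `continue` here,
            # hiding the broken channel (the "pipe error: 109" in the logs
            # above) and leaving the harness waiting forever. Reporting and
            # re-raising turns the hang into an actionable error instead.
            logger.error("message channel broke: %s", e)
            raise
        if not data:
            return  # clean EOF: the other side closed the pipe
        handle_message(data)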
No failures since the backout at https://bugzilla.mozilla.org/show_bug.cgi?id=1333126#c54.
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:backout]
I've just landed the patch for bug 1416028, which should address this failure (and possibly others too). While that patch is required to land bug 1333126, it also fixes the failure that occurred without it. Additionally, if we hit anything like this again, the harness will not hang but will leave an error message in the log instead.
Flags: needinfo?(jmathies)
No longer depends on: 1418227
Priority: -- → P2
As this is not failing anymore and we had a backout, let's resolve this bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard