Closed Bug 1412035 Opened 7 years ago Closed 7 years ago

Intermittent Automation Error: mozprocess timed out after 1000 seconds running in browser/base/content/test/plugins/browser_pluginCrashCommentAndURL.js or browser_pluginCrashCommentAndURL.js | application timed out after 370 seconds with no output

Categories

(Core Graveyard :: Plug-ins, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:backout])

Filed by: archaeopteryx [at] coole-files.de

https://treeherder.mozilla.org/logviewer.html#?job_id=139787757&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/N38Sm1KGShGoYbH2L8_4Gw/runs/0/artifacts/public/logs/live_backing.log

08:29:14     INFO -  189 INFO TEST-START | browser/base/content/test/plugins/browser_pluginCrashCommentAndURL.js
08:29:14     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
08:29:14     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x51001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
08:29:14     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
08:29:14     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x51001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
08:29:15     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
08:29:15     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x51001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
08:29:15     INFO -  GECKO(16952) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x51001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
08:29:44     INFO -  GECKO(16952) | [NPAPI 12948, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
08:29:44     INFO -  GECKO(16952) | [NPAPI 12948, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
08:46:24     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['Z:\\task_1509005989\\build\\venv\\Scripts\\python', '-u', 'Z:\\task_1509005989\\build\\tests\\mochitest\\runtests.py', '--total-chunks', '7', '--this-chunk', '7', '--appname=Z:\\task_1509005989\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/WHmcFJ76R_6GJDS4QE8ypA/artifacts/public/build/target.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--quiet', '--log-raw=Z:\\task_1509005989\\build\\blobber_upload_dir\\browser-chrome-chunked_raw.log', '--log-errorsummary=Z:\\task_1509005989\\build\\blobber_upload_dir\\browser-chrome-chunked_errorsummary.log', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--flavor=browser', '--chunk-by-runtime']
[taskcluster 2017-10-26T09:25:59.950Z] Aborting task - max run time exceeded!
This bug started spiking on November 2nd and has failed 81 times in the last 7 days. It fails only on Windows, on PGO and opt build types.

Here is a recent log of the fail:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=142111895&lineNumber=2175

And here is a part of the log:

05:58:06     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
05:58:06     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x59001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
05:58:07     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
05:58:07     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x59001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
05:58:07     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel::Call] Error: Channel error: cannot send/recv
05:58:07     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x59001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
05:58:07     INFO -  GECKO(1996) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x59001E,name=PPluginInstance::Msg_AsyncSetWindow) Channel error: cannot send/recv
05:58:36     INFO -  GECKO(1996) | [NPAPI 6104, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
05:58:36     INFO -  GECKO(1996) | [NPAPI 6104, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
06:15:16     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['Z:\\task_1509774926\\build\\venv\\Scripts\\python', '-u', 'Z:\\task_1509774926\\build\\tests\\mochitest\\runtests.py', '--total-chunks', '7', '--this-chunk', '7', '--appname=Z:\\task_1509774926\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/NKghVl6fT-mxZOT0C716hg/artifacts/public/build/target.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--quiet', '--log-raw=Z:\\task_1509774926\\build\\blobber_upload_dir\\browser-chrome-chunked_raw.log', '--log-errorsummary=Z:\\task_1509774926\\build\\blobber_upload_dir\\browser-chrome-chunked_errorsummary.log', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--flavor=browser', '--chunk-by-runtime']
[taskcluster 2017-11-04T06:55:27.408Z] Aborting task - max run time exceeded!
[taskcluster 2017-11-04T06:55:27.552Z]   Exit Code: 0
[taskcluster 2017-11-04T06:55:27.552Z]   User Time: 0s
[taskcluster 2017-11-04T06:55:27.552Z] Kernel Time: 15.625ms
[taskcluster 2017-11-04T06:55:27.552Z]   Wall Time: 59m59.574047s
[taskcluster 2017-11-04T06:55:27.552Z] Peak Memory: 4878336
[taskcluster 2017-11-04T06:55:27.553Z]      Result: IDLENESS_LIMIT_EXCEEDED
[taskcluster 2017-11-04T06:55:27.553Z] === Task Finished ===
[taskcluster 2017-11-04T06:55:27.553Z] Task Duration: 59m59.5767969s

:jimm: Hi, there has been a huge spike in this bug over the last 2 days; can you please take a look at it?
Flags: needinfo?(jmathies)
Not sure quite how, but in one stretch of seemingly-unrelated pushes on autoland, this morphed from being a mozprocess kill after 1000 seconds of hanging to being a harness kill after 370 (formerly 330) seconds.
Summary: Intermittent Automation Error: mozprocess timed out after 1000 seconds running in browser/base/content/test/plugins/browser_pluginCrashCommentAndURL.js → Intermittent Automation Error: mozprocess timed out after 1000 seconds running in browser/base/content/test/plugins/browser_pluginCrashCommentAndURL.js or browser_pluginCrashCommentAndURL.js | application timed out after 370 seconds with no output
Also adding to the confusion, this *should* have been the fault of bug 1333126, since we actually hounded the patch there out of the tree once before for doing this exact same thing, the mozprocess timeout on Win10 during browser_pluginCrashCommentAndURL.js, but this time we retriggered the suite on the parent of that bug's push a hundred or so times, and did get one instance of this before it landed. Among the possibilities is that the patch there makes this far more likely to happen, and we had a pretty low-frequency intermittent that we just ignored because a mozprocess timeout looks far more like infra than it looks like intermittent test failure, until you open the log and see that it's always happening during one test, despite the failure message not mentioning the test name.
Blocks: 1333126
(In reply to Phil Ringnalda (:philor) from comment #6)
> Also adding to the confusion, this *should* have been the fault of bug
> 1333126, since we actually hounded the patch there out of the tree once
> before for doing this exact same thing, the mozprocess timeout on Win10
> during browser_pluginCrashCommentAndURL.js, but this time we retriggered the
> suite on the parent of that bug's push a hundred or so times, and did get
> one instance of this before it landed. Among the possibilities is that the
> patch there makes this far more likely to happen, and we had a pretty
> low-frequency intermittent that we just ignored because a mozprocess timeout
> looks far more like infra than it looks like intermittent test failure,
> until you open the log and see that it's always happening during one test,
> despite the failure message not mentioning the test name.

I'm a bit stumped since I've re-tested the patch for bug 1333126 both on my local machine and on try without hitting this. What the patch does is lengthen the time it takes to process a content crash, but on most pushes I see the test taking only one to three seconds at most. The failures seem to have the test running for ~17s, which is longer than I would expect but not long enough to cause a timeout. Am I missing something here WRT how the harness works and handles global timeouts?
I assume we are crashing the process and the in-browser mochitest is not failing fast, but our harnesses (like mochitest in this case) should time out after 370 seconds of no activity. Possibly we get stuck because we define activity as polling stdout/stderr from the process, and it might be in a state where we cannot get any output, or where we assume it is dead but it has not closed.
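
To make the "activity == output" idea concrete, here is a minimal, hypothetical Python sketch of an output-based timeout in the spirit of mozprocess/runtests.py; the names and structure are illustrative, not the actual harness code. The child is only treated as hung once it has produced no output for the configured number of seconds.

import subprocess
import threading
import time

# Hypothetical sketch of an output-based timeout in the spirit of
# mozprocess/runtests.py (illustrative, not the actual harness code).
def run_with_output_timeout(cmd, output_timeout=370):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    last_output = [time.monotonic()]

    def reader():
        # Every line read from the child counts as "activity" and resets
        # the countdown.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            print(line.decode("utf-8", "replace"), end="")

    threading.Thread(target=reader, daemon=True).start()

    while proc.poll() is None:
        if time.monotonic() - last_output[0] > output_timeout:
            # No output for too long: treat the browser as hung and kill it.
            # A real harness would also take a screenshot and collect
            # minidumps at this point.
            proc.kill()
            return None
        time.sleep(1)
    return proc.returncode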

:gbrown, do you have more thoughts on the timeout escalation here?
Flags: needinfo?(gbrown)
I note that most/all recent failures are "370 seconds with no output" and those logs, like https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=142303711&lineNumber=2164, seem well behaved from the harness point of view: the browser stops producing output, the harness notices that after 370 seconds, gets a screenshot, kills the browser, collects dumps.

Older logs, as in comment 3, are puzzling. In addition to jmaher's perfectly reasonable speculation, I'd suggest that runtests.py might think that the browser is no longer running and has therefore cancelled the timeout handler; if that's the case, it is unclear what runtests.py is doing / where it is hung. There's a similar question about the mozharness "mozprocess timed out after 1000 seconds" handling: if runtests.py timed out, why didn't the mozharness script complete -- where is it hung?
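
As a rough illustration of the two layers being discussed (not the real runtests.py or mozharness code), the sketch below shows an inner "no output for 370 s" watchdog that is restarted on every line of browser output, and the failure mode where cancelling it on a wrong "browser already exited" assumption leaves only the outer 1000-second mozprocess timeout or the task's max run time to end the job. All names here are made up.

import threading

# Illustrative sketch (not real runtests.py/mozharness code) of the inner
# "no output for 370 s" watchdog owned by the test harness. The outer
# 1000 s mozprocess-style timeout in the mozharness wrapper is the only
# remaining guard if this watchdog is wrongly cancelled.
class OutputWatchdog:
    def __init__(self, timeout, on_hang):
        self.timeout = timeout
        self.on_hang = on_hang
        self._timer = None

    def poke(self):
        # Called for every line of browser output; restarts the countdown.
        self.cancel()
        self._timer = threading.Timer(self.timeout, self.on_hang)
        self._timer.daemon = True
        self._timer.start()

    def cancel(self):
        # If this runs on the mistaken assumption that the browser is gone,
        # a genuine hang will never trip the 370 s handler below.
        if self._timer is not None:
            self._timer.cancel()

def on_harness_hang():
    # In the well-behaved case the harness takes a screenshot, kills the
    # browser and collects minidumps here, then logs the familiar message.
    print("application timed out after 370 seconds with no output")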
Flags: needinfo?(gbrown)
After putting a ton of print statements in the test I think I've figured out what's going on. It seems to be the same type of issue that prevented me from landing the patch before: a stray timer from unrelated code fires right at the end of the test, after the profile has been torn down, tries to access something in the profile, fails with an exception, and causes the harness to stop.

My previous encounter was with the idle-daily notification, which I reported in bug 1409769; this time it seems to be the activity stream. See this:

https://treeherder.mozilla.org/logviewer.html#?job_id=142464151&repo=try&lineNumber=2436

This:

https://treeherder.mozilla.org/logviewer.html#?job_id=142468520&repo=try&lineNumber=2304

And this:

https://treeherder.mozilla.org/logviewer.html#?job_id=142464151&repo=try&lineNumber=2418

Somehow, someway, the patch for bug 1333126 has the magical property of precisely aligning our tests with unrelated activity so as to make them fail in new and interesting ways.
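
For readers unfamiliar with the pattern, here is a minimal Python analogue of the stray-timer race described above (the real code is browser JS; all names here are made up): unrelated code schedules a timer, the test finishes and tears the profile down, and the late-firing callback then blows up while touching profile state.

import threading
import time

# Minimal Python analogue of the stray-timer race (illustrative only).
profile = {"activity_stream_cache": []}

def activity_feed_tick():
    # Fires on its own schedule, with no knowledge of test boundaries.
    profile["activity_stream_cache"].append(time.monotonic())

def teardown_profile():
    # Test cleanup: the profile is gone from this point on.
    profile.clear()

threading.Timer(0.05, activity_feed_tick).start()
teardown_profile()   # the test ends and cleans up first...
time.sleep(0.1)      # ...then the stray timer fires against the torn-down
                     # profile and raises KeyError in the timer thread --
                     # the analogue of the exception that stopped the
                     # real harness.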
I have prepared a new try-run with the system activity feed timer ticking at an interval so long it shouldn't interfere with the tests. I'll re-trigger the tests a bunch of times to see if this fixes the issue. The run is here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8184a0d5653b5bb8f223cc59db77e56c9a7412a7

Phil, the failure you mentioned in comment 6, which you saw using a parent of my push, could have been caused by this too. If I confirm this issue, then we might want to audit the codebase for other timers running during mochitests, as well as modify the harness to cope with failures they might cause. This is the second issue caused by a stray timer that I've hit while trying to land bug 1333126, so I wonder how many more there could be, and how many of our most obscure intermittents could have been caused by them.
Quick update: the activity feed timer I pointed to in comment 12 is one of the issues that can make the harness freeze, but apparently it's not the only one. I have verified that the browser_pluginCrashCommentAndURL.js test itself finishes correctly and even calls the cleanup function; the failure is actually happening somewhere in the harness between that test and the next one.
My analysis in comment 13 was wrong; I found the real cause of the problem, see bug 1333126 comment 55. Long story short: this is an existing race in the test harness, exacerbated by I/O errors being quietly ignored. The patch in bug 1333126 just made the race condition a lot more likely, thus exposing the bug.
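
A hypothetical sketch of the "quietly ignored I/O errors" part of that analysis (not the actual harness code touched by bug 1333126 or bug 1416028): swallowing the error when the message pipe breaks turns a dead child into an indefinite wait, whereas surfacing it lets the harness fail fast with a useful message in the log.

# Hypothetical sketch, not the actual harness code.
def pump_messages(pipe, handle_message, logger):
    while True:
        try:
            data = pipe.read(4096)
        except OSError as e:
            # Pre-fix behaviour would be the equivalent of `continue` here,
            # hiding the broken channel (the "pipe error: 109" in the logs
            # above) and leaving the harness waiting forever. Reporting and
            # re-raising turns the hang into an actionable error instead.
            logger.error("message channel broke: %s", e)
            raise
        if not data:
            return  # clean EOF: the other side closed the pipe
        handle_message(data)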
No failures since the backout at https://bugzilla.mozilla.org/show_bug.cgi?id=1333126#c54.
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:backout]
I've just landed the patch for bug 1416028, which should address this failure (and possibly others too). While that patch is required to land bug 1333126, it also fixes the failure that occurred without it. Additionally, if we hit anything like this again, the harness will not hang but will leave an error message in the log instead.
Flags: needinfo?(jmathies)
No longer depends on: 1418227
Priority: -- → P2
As this is not failing anymore and we had a backout, let's resolve this bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard