Closed Bug 1741152 Opened 3 years ago Closed 2 years ago

GeckoChildProcessHost::AsyncLaunch fails intermittently

Categories

(Core :: IPC, defect)

defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: jstutte, Unassigned)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 1 obsolete file)

There is a series of intermittents that carry a similar sequence of events:

[task 2021-11-10T11:51:23.357Z] 11:51:23     INFO - PID 8821 | [Parent 8821, IPC I/O Parent] WARNING: Failed to launch tab subprocess: file /builds/worker/checkouts/gecko/ipc/glue/GeckoChildProcessHost.cpp:750
[task 2021-11-10T11:51:23.364Z] 11:51:23     INFO - PID 8821 | JavaScript error: chrome://browser/content/tabbrowser.js, line 1909: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIWebProgress.removeProgressListener]
[task 2021-11-10T11:51:23.367Z] 11:51:23     INFO - PID 8821 | JavaScript error: chrome://browser/content/tabbrowser.js, line 1909: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIWebProgress.removeProgressListener]
[task 2021-11-10T11:51:23.594Z] 11:51:23     INFO - PID 8821 | 1636545083586	Marionette	ERROR	TypeError: can't access property "addMessageListener", this.mm is null(chrome://remote/content/marionette/dom.js:50:5) JS Stack trace: WebElementEventTarget@dom.js:50:5
[task 2021-11-10T11:51:23.594Z] 11:51:23     INFO - PID 8821 | switchToTab@browser.js:398:26
[task 2021-11-10T11:51:23.596Z] 11:51:23     INFO - PID 8821 | GeckoDriver.prototype.setWindowHandle@driver.js:1232:35
[task 2021-11-10T11:51:23.596Z] 11:51:23     INFO - PID 8821 | GeckoDriver.prototype.switchToWindow@driver.js:1172:18
[task 2021-11-10T11:51:23.596Z] 11:51:23     INFO - PID 8821 | despatch@server.js:306:40
[task 2021-11-10T11:51:23.596Z] 11:51:23     INFO - PID 8821 | execute@server.js:279:16
[task 2021-11-10T11:51:23.596Z] 11:51:23     INFO - PID 8821 | onPacket/<@server.js:252:20
[task 2021-11-10T11:51:23.596Z] 11:51:23     INFO - PID 8821 | onPacket@server.js:253:9
[task 2021-11-10T11:51:23.596Z] 11:51:23     INFO - PID 8821 | _onJSONObjectReady/<@transport.js:500:20
[task 2021-11-10T11:51:23.596Z] 11:51:23     INFO - NoSuchWindowException on command, setting status to CRASH
[task 2021-11-10T11:51:23.597Z] 11:51:23     INFO - TEST-UNEXPECTED-CRASH | /service-workers/service-worker/state.https.html | expected OK
[task 2021-11-10T11:51:23.598Z] 11:51:23     INFO - TEST-INFO took 617ms
[task 2021-11-10T11:51:23.604Z] 11:51:23     INFO - PID 8821 | 1636545083604	Marionette	INFO	Stopped listening on port 47567
[task 2021-11-10T11:51:23.784Z] 11:51:23     INFO - PID 8821 | WARNING: A blocker encountered an error while we were waiting.
[task 2021-11-10T11:51:23.787Z] 11:51:23     INFO - PID 8821 |           Blocker:  SessionStore: flushing all windows
[task 2021-11-10T11:51:23.788Z] 11:51:23     INFO - PID 8821 |           Phase: quit-application-granted
[task 2021-11-10T11:51:23.788Z] 11:51:23     INFO - PID 8821 |           State: {"total":-1,"current":-1}
[task 2021-11-10T11:51:23.788Z] 11:51:23     INFO - PID 8821 | WARNING: TypeError: can't access property "sendAsyncMessage", mm is null

The GeckoChildProcessHost related message points to GeckoChildProcessHost::AsyncLaunch.

It is not clear if this might contribute to unexpected states like bug 1741149 or bug 1741145.

Blocks: 1738766
Blocks: 1740654, 1725437
Blocks: 1740873, 1738508
Blocks: 1724556, 1736301
Blocks: 1732097, 1694370
Blocks: 1726330

(In reply to Jens Stutte [:jstutte] from comment #0)

It is not clear if this might contribute to unexpected states like bug 1741149 or bug 1741145.

According to Nika this is most probably the case, and we do not expect process launch failures to happen such frequently on our try environment.

Blocks: 1741527
Blocks: 1741536
See Also: → 1741145, 1741149
Assignee: nobody → jstutte
Status: NEW → ASSIGNED

I am not the right person to understand and fix this, probably. I am just proposing a simple diagnostic patch.

Assignee: jstutte → nobody
Status: ASSIGNED → NEW
Keywords: leave-open
Assignee: nobody → nika
Status: NEW → ASSIGNED
Assignee: nika → nobody
Status: ASSIGNED → NEW
Assignee: nobody → nika
Status: NEW → ASSIGNED
Attachment #9251119 - Attachment is obsolete: true

I think Nika's patch is more helpful for diagnostics, for sure. The only recommendation I can give is to verify the error propagation if we can concentrate the resulting intermittents on one single bug in order to reduce the noise for unrelated tests somehow. Thanks for working on this!

Pushed by nlayzell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ea061f724a39
Add diagnostic warnings to process launch codepaths, r=ipc-reviewers,mccr8

:aryx, would you recommend to close all the other bugs as duplicate, too?

Flags: needinfo?(aryx.bugmail)

Not necessary, they should get closed automatically after 3 weeks without a failure instance. Sheriffs will try to identify future instances of this failure in yet unreported tests as duplicates of this one (and use this one for classification) but this requires manually checking the logs, a few might slip through.

Flags: needinfo?(aryx.bugmail)

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #12)

Not necessary, they should get closed automatically after 3 weeks without a failure instance. Sheriffs will try to identify future instances of this failure in yet unreported tests as duplicates of this one (and use this one for classification) but this requires manually checking the logs, a few might slip through.

Thanks! I might do so as well as this will hopefully help to reduce the noise in my bugmail until we are able to fix this.

Blocks: 1679951
Assignee: nika → nobody
Status: ASSIGNED → NEW

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=363881376&repo=mozilla-central&lineNumber=6230

[task 2022-01-12T17:38:53.715Z] 17:38:53     INFO - TEST-START | /html/semantics/embedded-content/the-img-element/usemap-casing.html
[task 2022-01-12T17:38:53.730Z] 17:38:53     INFO - Closing window cfac1958-f8e2-4ec6-b9e0-4c526e5fde45
[task 2022-01-12T17:38:53.921Z] 17:38:53     INFO - PID 7491 | JavaScript error: resource://gre/modules/URLQueryStrippingListService.jsm, line 90: TypeError: can't access property "removeEventListener", Services.cpmm.sharedData is null
[task 2022-01-12T17:38:54.062Z] 17:38:54     INFO - PID 7491 | [Parent 7491, IPC I/O Parent] WARNING: Failed to launch tab subprocess: file /builds/worker/checkouts/gecko/ipc/glue/GeckoChildProcessHost.cpp:756
[task 2022-01-12T17:38:54.068Z] 17:38:54     INFO - PID 7491 | JavaScript error: chrome://browser/content/tabbrowser.js, line 1912: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIWebProgress.removeProgressListener]
[task 2022-01-12T17:38:54.071Z] 17:38:54     INFO - PID 7491 | JavaScript error: chrome://browser/content/tabbrowser.js, line 1912: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIWebProgress.removeProgressListener]
[task 2022-01-12T17:38:54.353Z] 17:38:54     INFO - PID 7491 | 1642009134347	Marionette	ERROR	TypeError: can't access property "addMessageListener", this.mm is null(chrome://remote/content/marionette/dom.js:50:5) JS Stack trace: WebElementEventTarget@dom.js:50:5
[task 2022-01-12T17:38:54.353Z] 17:38:54     INFO - PID 7491 | switchToTab@browser.js:398:26
[task 2022-01-12T17:38:54.356Z] 17:38:54     INFO - PID 7491 | GeckoDriver.prototype.setWindowHandle@driver.js:1194:35
[task 2022-01-12T17:38:54.357Z] 17:38:54     INFO - PID 7491 | GeckoDriver.prototype.switchToWindow@driver.js:1134:18
[task 2022-01-12T17:38:54.357Z] 17:38:54     INFO - PID 7491 | despatch@server.js:306:40
[task 2022-01-12T17:38:54.358Z] 17:38:54     INFO - PID 7491 | execute@server.js:279:16
[task 2022-01-12T17:38:54.358Z] 17:38:54     INFO - PID 7491 | onPacket/<@server.js:252:20
[task 2022-01-12T17:38:54.358Z] 17:38:54     INFO - PID 7491 | onPacket@server.js:253:9
[task 2022-01-12T17:38:54.358Z] 17:38:54     INFO - PID 7491 | _onJSONObjectReady/<@transport.js:500:20
[task 2022-01-12T17:38:54.359Z] 17:38:54     INFO - NoSuchWindowException on command, setting status to CRASH
[task 2022-01-12T17:38:54.361Z] 17:38:54     INFO - TEST-UNEXPECTED-CRASH | /html/semantics/embedded-content/the-img-element/usemap-casing.html | expected OK
[task 2022-01-12T17:38:54.361Z] 17:38:54     INFO - TEST-INFO took 673ms
[task 2022-01-12T17:38:54.367Z] 17:38:54     INFO - PID 7491 | 1642009134366	Marionette	INFO	Stopped listening on port 52448
[task 2022-01-12T17:38:54.655Z] 17:38:54     INFO - PID 7491 | WARNING: A blocker encountered an error while we were waiting.
[task 2022-01-12T17:38:54.658Z] 17:38:54     INFO - PID 7491 |           Blocker:  SessionStore: flushing all windows
[task 2022-01-12T17:38:54.658Z] 17:38:54     INFO - PID 7491 |           Phase: quit-application-granted
[task 2022-01-12T17:38:54.659Z] 17:38:54     INFO - PID 7491 |           State: {"total":-1,"current":-1}
[task 2022-01-12T17:38:54.659Z] 17:38:54     INFO - PID 7491 | WARNING: TypeError: can't access property "sendAsyncMessage", mm is null
[task 2022-01-12T17:38:54.659Z] 17:38:54     INFO - PID 7491 | WARNING: flush@resource:///modules/sessionstore/TabStateFlusher.jsm:131:7
[task 2022-01-12T17:38:54.660Z] 17:38:54     INFO - PID 7491 | flushWindow@resource:///modules/sessionstore/TabStateFlusher.jsm:173:28
[task 2022-01-12T17:38:54.660Z] 17:38:54     INFO - PID 7491 | flushWindow@resource:///modules/sessionstore/TabStateFlusher.jsm:42:36
[task 2022-01-12T17:38:54.661Z] 17:38:54     INFO - PID 7491 | flushAllWindowsAsync@resource:///modules/sessionstore/SessionStore.jsm:2145:50
[task 2022-01-12T17:38:54.661Z] 17:38:54     INFO - PID 7491 | ssi_onQuitApplicationGranted/<@resource:///modules/sessionstore/SessionStore.jsm:2063:32
[task 2022-01-12T17:38:54.661Z] 17:38:54     INFO - PID 7491 | trigger@resource://gre/modules/AsyncShutdown.jsm:747:23
[task 2022-01-12T17:38:54.661Z] 17:38:54     INFO - PID 7491 | _wait@resource://gre/modules/AsyncShutdown.jsm:899:15
[task 2022-01-12T17:38:54.661Z] 17:38:54     INFO - PID 7491 | wait@resource://gre/modules/AsyncShutdown.jsm:880:34
[task 2022-01-12T17:38:54.663Z] 17:38:54     INFO - PID 7491 | observe@resource://gre/modules/AsyncShutdown.jsm:557:10
[task 2022-01-12T17:38:54.663Z] 17:38:54     INFO - PID 7491 | GeckoDriver.prototype.quit@chrome://remote/content/marionette/driver.js:2680:22
[task 2022-01-12T17:38:54.663Z] 17:38:54     INFO - PID 7491 | despatch@chrome://remote/content/marionette/server.js:306:40
[task 2022-01-12T17:38:54.664Z] 17:38:54     INFO - PID 7491 | execute@chrome://remote/content/marionette/server.js:279:16
[task 2022-01-12T17:38:54.664Z] 17:38:54     INFO - PID 7491 | onPacket/<@chrome://remote/content/marionette/server.js:252:20
[task 2022-01-12T17:38:54.664Z] 17:38:54     INFO - PID 7491 | onPacket@chrome://remote/content/marionette/server.js:253:9
[task 2022-01-12T17:38:54.664Z] 17:38:54     INFO - PID 7491 | _onJSONObjectReady/<@chrome://remote/content/marionette/transport.js:500:20
[task 2022-01-12T17:38:55.133Z] 17:38:55     INFO - PID 7491 | JavaScript error: resource:///modules/sessionstore/TabStateFlusher.jsm, line 131: TypeError: can't access property "sendAsyncMessage", mm is null
[task 2022-01-12T17:38:55.192Z] 17:38:55     INFO - PID 7491 | JavaScript error: resource://gre/modules/URLQueryStrippingListService.jsm, line 90: TypeError: can't access property "removeEventListener", Services.cpmm.sharedData is null
[task 2022-01-12T17:38:55.196Z] 17:38:55     INFO - PID 7491 | ###!!! [Parent][PCompositorManagerParent] Error: RunMessage(msgname=PCompositorBridge::Msg___delete__) Channel closing: too late to send/recv, messages will be lost
[task 2022-01-12T17:38:55.198Z] 17:38:55     INFO - PID 7491 | ###!!! [Parent][PImageBridgeParent] Error: RunMessage(msgname=PImageBridge::Msg_WillClose) Channel closing: too late to send/recv, messages will be lost
[task 2022-01-12T17:38:55.236Z] 17:38:55     INFO - PID 7491 | JavaScript error: resource://gre/modules/URLQueryStrippingListService.jsm, line 90: TypeError: can't access property "removeEventListener", Services.cpmm.sharedData is null
[task 2022-01-12T17:38:59.878Z] 17:38:59     INFO - Browser exited with return code 0
[task 2022-01-12T17:38:59.880Z] 17:38:59     INFO - Closing logging queue
[task 2022-01-12T17:38:59.880Z] 17:38:59     INFO - queue closed
[task 2022-01-12T17:38:59.959Z] 17:38:59     INFO - INFO | runtests.py | TSan using symbolizer at /builds/worker/workspace/build/application/firefox/llvm-symbolizer
[task 2022-01-12T17:38:59.960Z] 17:38:59     INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmphlij7w5f
[task 2022-01-12T17:38:59.986Z] 17:38:59     INFO - PID 8174 | 1642007882967	Marionette	INFO	Marionette enabled
[task 2022-01-12T17:38:59.987Z] 17:38:59     INFO - PID 8174 | console.warn: SearchSettings: "get: No settings file exists, new profile?" (new NotFoundError("Could not open the file at /tmp/tmpohuqwazn/search.json.mozlz4", (void 0)))
[task 2022-01-12T17:38:59.987Z] 17:38:59     INFO - PID 8174 | 1642007912756	Marionette	INFO	Listening on port 39886
[task 2022-01-12T17:38:59.988Z] 17:38:59     INFO - PID 8174 | ###!!! [Parent][PCompositorManagerParent] Error: RunMessage(msgname=PCompositorBridge::Msg___delete__) Channel closing: too late to send/recv, messages will be lost
[task 2022-01-12T17:38:59.988Z] 17:38:59     INFO - PID 8174 | 1642008273003	addons.xpi	ERROR	System addon update list error Error: Failed downloading XML, status: 0, reason: error
[task 2022-01-12T17:38:59.989Z] 17:38:59     INFO - Starting runner

This seems to happen only on TSan builds.

Something we could do is log the error code when fork fails; that's not a lot of info but for the real fork it would at least separate ENOMEM (out of memory) from EAGAIN (various resource limits on the number of processes/threads). However, TSan interposes fork, but the interposer appears to tail-call the real fork on all paths, but there's a C++ RAII object on the stack (ScopedInterceptor); the various functions called by the destructor might not mess up errno, given that they appear to CHECK (release assert) that the syscalls they use succeed, but it's something to keep in mind.

Interestingly it appears that the error in this case is probably not due to fork failing. I don't see a line being logged above https://treeherder.mozilla.org/logviewer?job_id=363881376&repo=mozilla-central&lineNumber=6217, and we should hopefully see https://searchfox.org/mozilla-central/rev/d4b9c457db637fde655592d9e2048939b7ab2854/ipc/chromium/src/base/process_util_linux.cc#274 logged there if it failed there. (It occurs to me that this uses the chromium logging, and I don't remember how this maps, so perhaps we're not showing it? That would be unfortunate...)

Assuming we're not accidentally suppressing the diagnostic logs I added in the bug, the only codepath which can return an error here without logging is https://searchfox.org/mozilla-central/rev/d4b9c457db637fde655592d9e2048939b7ab2854/ipc/glue/GeckoChildProcessHost.cpp#1070-1071 failing.

It's probably worth going through, checking what the behaviour of the chromium logging is around warnings, and seeing if we show them on debug builds.

Apparently these tests are all not running in debug mode, so we should change the logs I added in the patch on this bug to emit on release mode as well.

ni? :jld

Flags: needinfo?(jld)
See Also: → 1649857

Sorry for the noise, I wanted to link that to bug 1721793.

See Also: 1649857

Note that CHROMIUM_LOG with level WARNING or higher will log by
default (via NS_DebugBreak) on all build types.

Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4d1e424f1bad
Log IPC fork() failures on non-debug builds. r=nika
See Also: → 1658875

So far, no more failures reported after the landing in comment #55 (the one in comment #56 has a build ID before that commit and, as expected, no log messages). The patch itself shouldn't have affected this; its changes are only to code that runs after fork has already returned an error.

Bug 1658474 landed around the same time and also affects child process handling, but it doesn't make much sense for it to have affected this; also, that patch shouldn't change anything if Gecko Media Plugins aren't being used, and I don't think that applies to all (or any?) of the failures seen here.

Flags: needinfo?(jld)

No reports in almost 3 months, although I don't know exactly what would have caused it. This bug can be reopened if it starts happening again.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: