GeckoChildProcessHost::AsyncLaunch fails intermittently
Categories
(Core :: IPC, defect)
Tracking
()
People
(Reporter: jstutte, Unassigned)
References
Details
(Keywords: intermittent-failure)
Attachments
(2 files, 1 obsolete file)
There is a series of intermittents that carry a similar sequence of events:
[task 2021-11-10T11:51:23.357Z] 11:51:23 INFO - PID 8821 | [Parent 8821, IPC I/O Parent] WARNING: Failed to launch tab subprocess: file /builds/worker/checkouts/gecko/ipc/glue/GeckoChildProcessHost.cpp:750
[task 2021-11-10T11:51:23.364Z] 11:51:23 INFO - PID 8821 | JavaScript error: chrome://browser/content/tabbrowser.js, line 1909: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIWebProgress.removeProgressListener]
[task 2021-11-10T11:51:23.367Z] 11:51:23 INFO - PID 8821 | JavaScript error: chrome://browser/content/tabbrowser.js, line 1909: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIWebProgress.removeProgressListener]
[task 2021-11-10T11:51:23.594Z] 11:51:23 INFO - PID 8821 | 1636545083586 Marionette ERROR TypeError: can't access property "addMessageListener", this.mm is null(chrome://remote/content/marionette/dom.js:50:5) JS Stack trace: WebElementEventTarget@dom.js:50:5
[task 2021-11-10T11:51:23.594Z] 11:51:23 INFO - PID 8821 | switchToTab@browser.js:398:26
[task 2021-11-10T11:51:23.596Z] 11:51:23 INFO - PID 8821 | GeckoDriver.prototype.setWindowHandle@driver.js:1232:35
[task 2021-11-10T11:51:23.596Z] 11:51:23 INFO - PID 8821 | GeckoDriver.prototype.switchToWindow@driver.js:1172:18
[task 2021-11-10T11:51:23.596Z] 11:51:23 INFO - PID 8821 | despatch@server.js:306:40
[task 2021-11-10T11:51:23.596Z] 11:51:23 INFO - PID 8821 | execute@server.js:279:16
[task 2021-11-10T11:51:23.596Z] 11:51:23 INFO - PID 8821 | onPacket/<@server.js:252:20
[task 2021-11-10T11:51:23.596Z] 11:51:23 INFO - PID 8821 | onPacket@server.js:253:9
[task 2021-11-10T11:51:23.596Z] 11:51:23 INFO - PID 8821 | _onJSONObjectReady/<@transport.js:500:20
[task 2021-11-10T11:51:23.596Z] 11:51:23 INFO - NoSuchWindowException on command, setting status to CRASH
[task 2021-11-10T11:51:23.597Z] 11:51:23 INFO - TEST-UNEXPECTED-CRASH | /service-workers/service-worker/state.https.html | expected OK
[task 2021-11-10T11:51:23.598Z] 11:51:23 INFO - TEST-INFO took 617ms
[task 2021-11-10T11:51:23.604Z] 11:51:23 INFO - PID 8821 | 1636545083604 Marionette INFO Stopped listening on port 47567
[task 2021-11-10T11:51:23.784Z] 11:51:23 INFO - PID 8821 | WARNING: A blocker encountered an error while we were waiting.
[task 2021-11-10T11:51:23.787Z] 11:51:23 INFO - PID 8821 | Blocker: SessionStore: flushing all windows
[task 2021-11-10T11:51:23.788Z] 11:51:23 INFO - PID 8821 | Phase: quit-application-granted
[task 2021-11-10T11:51:23.788Z] 11:51:23 INFO - PID 8821 | State: {"total":-1,"current":-1}
[task 2021-11-10T11:51:23.788Z] 11:51:23 INFO - PID 8821 | WARNING: TypeError: can't access property "sendAsyncMessage", mm is null
The GeckoChildProcessHost related message points to GeckoChildProcessHost::AsyncLaunch
.
It is not clear if this might contribute to unexpected states like bug 1741149 or bug 1741145.
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Comment 1•3 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #0)
It is not clear if this might contribute to unexpected states like bug 1741149 or bug 1741145.
According to Nika this is most probably the case, and we do not expect process launch failures to happen such frequently on our try environment.
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Comment 2•3 years ago
|
||
Updated•3 years ago
|
Reporter | ||
Comment 3•3 years ago
|
||
I am not the right person to understand and fix this, probably. I am just proposing a simple diagnostic patch.
Comment 4•3 years ago
|
||
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Reporter | ||
Comment 5•3 years ago
|
||
I think Nika's patch is more helpful for diagnostics, for sure. The only recommendation I can give is to verify the error propagation if we can concentrate the resulting intermittents on one single bug in order to reduce the noise for unrelated tests somehow. Thanks for working on this!
Pushed by nlayzell@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ea061f724a39 Add diagnostic warnings to process launch codepaths, r=ipc-reviewers,mccr8
Comment 7•3 years ago
|
||
bugherder |
Updated•3 years ago
|
Reporter | ||
Comment 11•3 years ago
|
||
:aryx, would you recommend to close all the other bugs as duplicate, too?
Comment 12•3 years ago
|
||
Not necessary, they should get closed automatically after 3 weeks without a failure instance. Sheriffs will try to identify future instances of this failure in yet unreported tests as duplicates of this one (and use this one for classification) but this requires manually checking the logs, a few might slip through.
Reporter | ||
Comment 13•3 years ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #12)
Not necessary, they should get closed automatically after 3 weeks without a failure instance. Sheriffs will try to identify future instances of this failure in yet unreported tests as duplicates of this one (and use this one for classification) but this requires manually checking the logs, a few might slip through.
Thanks! I might do so as well as this will hopefully help to reduce the noise in my bugmail until we are able to fix this.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 45•3 years ago
|
||
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=363881376&repo=mozilla-central&lineNumber=6230
[task 2022-01-12T17:38:53.715Z] 17:38:53 INFO - TEST-START | /html/semantics/embedded-content/the-img-element/usemap-casing.html
[task 2022-01-12T17:38:53.730Z] 17:38:53 INFO - Closing window cfac1958-f8e2-4ec6-b9e0-4c526e5fde45
[task 2022-01-12T17:38:53.921Z] 17:38:53 INFO - PID 7491 | JavaScript error: resource://gre/modules/URLQueryStrippingListService.jsm, line 90: TypeError: can't access property "removeEventListener", Services.cpmm.sharedData is null
[task 2022-01-12T17:38:54.062Z] 17:38:54 INFO - PID 7491 | [Parent 7491, IPC I/O Parent] WARNING: Failed to launch tab subprocess: file /builds/worker/checkouts/gecko/ipc/glue/GeckoChildProcessHost.cpp:756
[task 2022-01-12T17:38:54.068Z] 17:38:54 INFO - PID 7491 | JavaScript error: chrome://browser/content/tabbrowser.js, line 1912: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIWebProgress.removeProgressListener]
[task 2022-01-12T17:38:54.071Z] 17:38:54 INFO - PID 7491 | JavaScript error: chrome://browser/content/tabbrowser.js, line 1912: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIWebProgress.removeProgressListener]
[task 2022-01-12T17:38:54.353Z] 17:38:54 INFO - PID 7491 | 1642009134347 Marionette ERROR TypeError: can't access property "addMessageListener", this.mm is null(chrome://remote/content/marionette/dom.js:50:5) JS Stack trace: WebElementEventTarget@dom.js:50:5
[task 2022-01-12T17:38:54.353Z] 17:38:54 INFO - PID 7491 | switchToTab@browser.js:398:26
[task 2022-01-12T17:38:54.356Z] 17:38:54 INFO - PID 7491 | GeckoDriver.prototype.setWindowHandle@driver.js:1194:35
[task 2022-01-12T17:38:54.357Z] 17:38:54 INFO - PID 7491 | GeckoDriver.prototype.switchToWindow@driver.js:1134:18
[task 2022-01-12T17:38:54.357Z] 17:38:54 INFO - PID 7491 | despatch@server.js:306:40
[task 2022-01-12T17:38:54.358Z] 17:38:54 INFO - PID 7491 | execute@server.js:279:16
[task 2022-01-12T17:38:54.358Z] 17:38:54 INFO - PID 7491 | onPacket/<@server.js:252:20
[task 2022-01-12T17:38:54.358Z] 17:38:54 INFO - PID 7491 | onPacket@server.js:253:9
[task 2022-01-12T17:38:54.358Z] 17:38:54 INFO - PID 7491 | _onJSONObjectReady/<@transport.js:500:20
[task 2022-01-12T17:38:54.359Z] 17:38:54 INFO - NoSuchWindowException on command, setting status to CRASH
[task 2022-01-12T17:38:54.361Z] 17:38:54 INFO - TEST-UNEXPECTED-CRASH | /html/semantics/embedded-content/the-img-element/usemap-casing.html | expected OK
[task 2022-01-12T17:38:54.361Z] 17:38:54 INFO - TEST-INFO took 673ms
[task 2022-01-12T17:38:54.367Z] 17:38:54 INFO - PID 7491 | 1642009134366 Marionette INFO Stopped listening on port 52448
[task 2022-01-12T17:38:54.655Z] 17:38:54 INFO - PID 7491 | WARNING: A blocker encountered an error while we were waiting.
[task 2022-01-12T17:38:54.658Z] 17:38:54 INFO - PID 7491 | Blocker: SessionStore: flushing all windows
[task 2022-01-12T17:38:54.658Z] 17:38:54 INFO - PID 7491 | Phase: quit-application-granted
[task 2022-01-12T17:38:54.659Z] 17:38:54 INFO - PID 7491 | State: {"total":-1,"current":-1}
[task 2022-01-12T17:38:54.659Z] 17:38:54 INFO - PID 7491 | WARNING: TypeError: can't access property "sendAsyncMessage", mm is null
[task 2022-01-12T17:38:54.659Z] 17:38:54 INFO - PID 7491 | WARNING: flush@resource:///modules/sessionstore/TabStateFlusher.jsm:131:7
[task 2022-01-12T17:38:54.660Z] 17:38:54 INFO - PID 7491 | flushWindow@resource:///modules/sessionstore/TabStateFlusher.jsm:173:28
[task 2022-01-12T17:38:54.660Z] 17:38:54 INFO - PID 7491 | flushWindow@resource:///modules/sessionstore/TabStateFlusher.jsm:42:36
[task 2022-01-12T17:38:54.661Z] 17:38:54 INFO - PID 7491 | flushAllWindowsAsync@resource:///modules/sessionstore/SessionStore.jsm:2145:50
[task 2022-01-12T17:38:54.661Z] 17:38:54 INFO - PID 7491 | ssi_onQuitApplicationGranted/<@resource:///modules/sessionstore/SessionStore.jsm:2063:32
[task 2022-01-12T17:38:54.661Z] 17:38:54 INFO - PID 7491 | trigger@resource://gre/modules/AsyncShutdown.jsm:747:23
[task 2022-01-12T17:38:54.661Z] 17:38:54 INFO - PID 7491 | _wait@resource://gre/modules/AsyncShutdown.jsm:899:15
[task 2022-01-12T17:38:54.661Z] 17:38:54 INFO - PID 7491 | wait@resource://gre/modules/AsyncShutdown.jsm:880:34
[task 2022-01-12T17:38:54.663Z] 17:38:54 INFO - PID 7491 | observe@resource://gre/modules/AsyncShutdown.jsm:557:10
[task 2022-01-12T17:38:54.663Z] 17:38:54 INFO - PID 7491 | GeckoDriver.prototype.quit@chrome://remote/content/marionette/driver.js:2680:22
[task 2022-01-12T17:38:54.663Z] 17:38:54 INFO - PID 7491 | despatch@chrome://remote/content/marionette/server.js:306:40
[task 2022-01-12T17:38:54.664Z] 17:38:54 INFO - PID 7491 | execute@chrome://remote/content/marionette/server.js:279:16
[task 2022-01-12T17:38:54.664Z] 17:38:54 INFO - PID 7491 | onPacket/<@chrome://remote/content/marionette/server.js:252:20
[task 2022-01-12T17:38:54.664Z] 17:38:54 INFO - PID 7491 | onPacket@chrome://remote/content/marionette/server.js:253:9
[task 2022-01-12T17:38:54.664Z] 17:38:54 INFO - PID 7491 | _onJSONObjectReady/<@chrome://remote/content/marionette/transport.js:500:20
[task 2022-01-12T17:38:55.133Z] 17:38:55 INFO - PID 7491 | JavaScript error: resource:///modules/sessionstore/TabStateFlusher.jsm, line 131: TypeError: can't access property "sendAsyncMessage", mm is null
[task 2022-01-12T17:38:55.192Z] 17:38:55 INFO - PID 7491 | JavaScript error: resource://gre/modules/URLQueryStrippingListService.jsm, line 90: TypeError: can't access property "removeEventListener", Services.cpmm.sharedData is null
[task 2022-01-12T17:38:55.196Z] 17:38:55 INFO - PID 7491 | ###!!! [Parent][PCompositorManagerParent] Error: RunMessage(msgname=PCompositorBridge::Msg___delete__) Channel closing: too late to send/recv, messages will be lost
[task 2022-01-12T17:38:55.198Z] 17:38:55 INFO - PID 7491 | ###!!! [Parent][PImageBridgeParent] Error: RunMessage(msgname=PImageBridge::Msg_WillClose) Channel closing: too late to send/recv, messages will be lost
[task 2022-01-12T17:38:55.236Z] 17:38:55 INFO - PID 7491 | JavaScript error: resource://gre/modules/URLQueryStrippingListService.jsm, line 90: TypeError: can't access property "removeEventListener", Services.cpmm.sharedData is null
[task 2022-01-12T17:38:59.878Z] 17:38:59 INFO - Browser exited with return code 0
[task 2022-01-12T17:38:59.880Z] 17:38:59 INFO - Closing logging queue
[task 2022-01-12T17:38:59.880Z] 17:38:59 INFO - queue closed
[task 2022-01-12T17:38:59.959Z] 17:38:59 INFO - INFO | runtests.py | TSan using symbolizer at /builds/worker/workspace/build/application/firefox/llvm-symbolizer
[task 2022-01-12T17:38:59.960Z] 17:38:59 INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox --marionette about:blank -profile /tmp/tmphlij7w5f
[task 2022-01-12T17:38:59.986Z] 17:38:59 INFO - PID 8174 | 1642007882967 Marionette INFO Marionette enabled
[task 2022-01-12T17:38:59.987Z] 17:38:59 INFO - PID 8174 | console.warn: SearchSettings: "get: No settings file exists, new profile?" (new NotFoundError("Could not open the file at /tmp/tmpohuqwazn/search.json.mozlz4", (void 0)))
[task 2022-01-12T17:38:59.987Z] 17:38:59 INFO - PID 8174 | 1642007912756 Marionette INFO Listening on port 39886
[task 2022-01-12T17:38:59.988Z] 17:38:59 INFO - PID 8174 | ###!!! [Parent][PCompositorManagerParent] Error: RunMessage(msgname=PCompositorBridge::Msg___delete__) Channel closing: too late to send/recv, messages will be lost
[task 2022-01-12T17:38:59.988Z] 17:38:59 INFO - PID 8174 | 1642008273003 addons.xpi ERROR System addon update list error Error: Failed downloading XML, status: 0, reason: error
[task 2022-01-12T17:38:59.989Z] 17:38:59 INFO - Starting runner
Comment 46•3 years ago
|
||
This seems to happen only on TSan builds.
Something we could do is log the error code when fork
fails; that's not a lot of info but for the real fork
it would at least separate ENOMEM
(out of memory) from EAGAIN
(various resource limits on the number of processes/threads). However, TSan interposes fork
, but the interposer appears to tail-call the real fork
on all paths, but there's a C++ RAII object on the stack (ScopedInterceptor
); the various functions called by the destructor might not mess up errno
, given that they appear to CHECK
(release assert) that the syscalls they use succeed, but it's something to keep in mind.
Comment hidden (Intermittent Failures Robot) |
Comment 48•3 years ago
|
||
Interestingly it appears that the error in this case is probably not due to fork
failing. I don't see a line being logged above https://treeherder.mozilla.org/logviewer?job_id=363881376&repo=mozilla-central&lineNumber=6217, and we should hopefully see https://searchfox.org/mozilla-central/rev/d4b9c457db637fde655592d9e2048939b7ab2854/ipc/chromium/src/base/process_util_linux.cc#274 logged there if it failed there. (It occurs to me that this uses the chromium logging, and I don't remember how this maps, so perhaps we're not showing it? That would be unfortunate...)
Assuming we're not accidentally suppressing the diagnostic logs I added in the bug, the only codepath which can return an error here without logging is https://searchfox.org/mozilla-central/rev/d4b9c457db637fde655592d9e2048939b7ab2854/ipc/glue/GeckoChildProcessHost.cpp#1070-1071 failing.
It's probably worth going through, checking what the behaviour of the chromium logging is around warnings, and seeing if we show them on debug builds.
Comment 49•3 years ago
|
||
Apparently these tests are all not running in debug mode, so we should change the logs I added in the patch on this bug to emit on release mode as well.
Reporter | ||
Comment 51•3 years ago
|
||
Sorry for the noise, I wanted to link that to bug 1721793.
Comment hidden (Intermittent Failures Robot) |
Comment 53•2 years ago
|
||
Note that CHROMIUM_LOG
with level WARNING
or higher will log by
default (via NS_DebugBreak
) on all build types.
Comment 54•2 years ago
|
||
Pushed by jedavis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4d1e424f1bad Log IPC fork() failures on non-debug builds. r=nika
Comment 55•2 years ago
|
||
bugherder |
Comment hidden (Intermittent Failures Robot) |
Comment 57•2 years ago
|
||
So far, no more failures reported after the landing in comment #55 (the one in comment #56 has a build ID before that commit and, as expected, no log messages). The patch itself shouldn't have affected this; its changes are only to code that runs after fork
has already returned an error.
Bug 1658474 landed around the same time and also affects child process handling, but it doesn't make much sense for it to have affected this; also, that patch shouldn't change anything if Gecko Media Plugins aren't being used, and I don't think that applies to all (or any?) of the failures seen here.
Comment 58•2 years ago
|
||
No reports in almost 3 months, although I don't know exactly what would have caused it. This bug can be reopened if it starts happening again.
Updated•2 years ago
|
Description
•