Do not indefinitely attempt to launch non-content child processes if they fail with STARTED_BUSY
Categories
(GeckoView :: General, task)
Tracking
(firefox146 fixed)
| Tracking | Status | |
|---|---|---|
| firefox146 | --- | fixed |
People
(Reporter: jnicol, Assigned: jnicol)
References
Details
Attachments
(1 file)
On Android we can only launch a single instance of non-content child processes, such as the GPU process, as we have only defined a single service for each, eg here. See bug 1844829.
This is fine for normal uses as we only need a single instance. But in tests where we have multiple GeckoRuntimes, i.e. xpcshell tests as a whole and a few particular geckoview-junit tests, it means we enter an indefinite loop of attempting to launch the process, getting a STARTED_BUSY error code, and retrying again. We get loads of this spam in the logcat:
ServiceChildProcess: This process belongs to a different GeckoRuntime owner: 2be1f43f-96a7-4c19-8e38-936982545e3a process: e2f3ae06-e30a-4568-94ae-7700600f4a43
Eventually the GPU process launch times out, and the tests appear to pass, so I thought these warnings were innocuous. But recently we've attempted to retry failed GPU process launches (bug 1991502) or increase the timeout (bug 1994984), and we can see that we are being killed by the OS:
Killing 13612:org.mozilla.geckoview.test_runner:xpcshell0/u0a195 (adj 500): Too many Binders sent to SYSTEM
In fact it turns out in some xpcshell tests we are already hitting this issue and being killed by the OS, but the harness does not report it as an failure. Actually it is disguising an actual legitimate failing test because before the test has a chance to timeout and fail the xpcshell process is being killed.
Until bug 1844829 is fixed and we can actually run multiple instances of non-content child processes, we should work around the problem by not indefinitely retrying to launch them. This means these tests will be run without a GPU process, which is not ideal. But the current situation is that they will eventually run without one anyway after the GPU process launch times out, that is unless the test process is killed before the timeout!
| Assignee | ||
Updated•3 days ago
|
| Assignee | ||
Comment 1•2 days ago
|
||
On Android it is not possible to run multiple instances of non-content
child processes, as we have only declared a single service of each
type in the manifest. This is fine in normal operation as there only
needs to be a single instance of each, but for xpcshell tests and
certain geckoview-junit tests we run multiple GeckoRuntimes in
parallel, each of which will attempt to launch its own child
processes.
Currently when a child process launch fails with a STARTED_BUSY error
code we will indefinitely retry launching it. This works for content
processes as eventually we will find a service that is not in use, but
for non-content process this will never succeed. We end up spamming
the logcat with error messages, and eventually the parent process will
get killed by the OS for making too many binder requests.
In bug 1844829 we will implement a proper solution to allow multiple
instances of these processes. But in the meantime, this patch makes us
avoid indefinitely retrying to launch busy non-content child
processes. This does mean that child processes will quickly fail to
launch in certain tests, which is not ideal. But it is an improvement
on the current situation where they never successfully launch anyway,
and instead are stuck in a loop attempting to launch until either the
test completes or the process is killed.
This causes certain tests to now fail on CI. In reality these tests
were already broken, but (after an increased timeout) were incorrectly
being marked as passing due to the process eventually being killed.
These tests are now skipped.
Comment 3•1 day ago
|
||
| bugherder | ||
Description
•