Closed Bug 1995541 Opened 3 days ago Closed 1 day ago

Do not indefinitely attempt to launch non-content child processes if they fail with STARTED_BUSY

Categories

(GeckoView :: General, task)

All
Android
task

Tracking

(firefox146 fixed)

RESOLVED FIXED
146 Branch
Tracking Status
firefox146 --- fixed

People

(Reporter: jnicol, Assigned: jnicol)

References

Details

Attachments

(1 file)

On Android we can only launch a single instance of non-content child processes, such as the GPU process, as we have only defined a single service for each, eg here. See bug 1844829.

This is fine for normal uses as we only need a single instance. But in tests where we have multiple GeckoRuntimes, i.e. xpcshell tests as a whole and a few particular geckoview-junit tests, it means we enter an indefinite loop of attempting to launch the process, getting a STARTED_BUSY error code, and retrying again. We get loads of this spam in the logcat:

ServiceChildProcess: This process belongs to a different GeckoRuntime owner: 2be1f43f-96a7-4c19-8e38-936982545e3a process: e2f3ae06-e30a-4568-94ae-7700600f4a43

Eventually the GPU process launch times out, and the tests appear to pass, so I thought these warnings were innocuous. But recently we've attempted to retry failed GPU process launches (bug 1991502) or increase the timeout (bug 1994984), and we can see that we are being killed by the OS:

Killing 13612:org.mozilla.geckoview.test_runner:xpcshell0/u0a195 (adj 500): Too many Binders sent to SYSTEM

In fact it turns out in some xpcshell tests we are already hitting this issue and being killed by the OS, but the harness does not report it as an failure. Actually it is disguising an actual legitimate failing test because before the test has a chance to timeout and fail the xpcshell process is being killed.

Until bug 1844829 is fixed and we can actually run multiple instances of non-content child processes, we should work around the problem by not indefinitely retrying to launch them. This means these tests will be run without a GPU process, which is not ideal. But the current situation is that they will eventually run without one anyway after the GPU process launch times out, that is unless the test process is killed before the timeout!

Depends on: 1992996
Depends on: 1995596
Depends on: 1994165
No longer depends on: 1992996
Blocks: 1994984

On Android it is not possible to run multiple instances of non-content
child processes, as we have only declared a single service of each
type in the manifest. This is fine in normal operation as there only
needs to be a single instance of each, but for xpcshell tests and
certain geckoview-junit tests we run multiple GeckoRuntimes in
parallel, each of which will attempt to launch its own child
processes.

Currently when a child process launch fails with a STARTED_BUSY error
code we will indefinitely retry launching it. This works for content
processes as eventually we will find a service that is not in use, but
for non-content process this will never succeed. We end up spamming
the logcat with error messages, and eventually the parent process will
get killed by the OS for making too many binder requests.

In bug 1844829 we will implement a proper solution to allow multiple
instances of these processes. But in the meantime, this patch makes us
avoid indefinitely retrying to launch busy non-content child
processes. This does mean that child processes will quickly fail to
launch in certain tests, which is not ideal. But it is an improvement
on the current situation where they never successfully launch anyway,
and instead are stuck in a loop attempting to launch until either the
test completes or the process is killed.

This causes certain tests to now fail on CI. In reality these tests
were already broken, but (after an increased timeout) were incorrectly
being marked as passing due to the process eventually being killed.
These tests are now skipped.

Pushed by jnicol@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/43609422072d https://hg.mozilla.org/integration/autoland/rev/61e23be0aee4 Do not indefinitely attempt to launch non-content child processes if they fail with STARTED_BUSY. r=geckoview-reviewers,toolkit-telemetry-reviewers,janerik,tcampbell
Status: NEW → RESOLVED
Closed: 1 day ago
Resolution: --- → FIXED
Target Milestone: --- → 146 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: