Open Bug 1530770 (android-e10s-multi) Opened 2 years ago Updated 2 months ago

[Meta] Enable Android e10s-multi

Categories

(GeckoView :: General, enhancement, P3)

Unspecified
Android
enhancement

Tracking

(Not tracked)

People

(Reporter: snorp, Assigned: aklotz)

References

(Depends on 4 open bugs, Blocks 2 open bugs)

Details

(Whiteboard: [geckoview:m1910])

Right now we only support a single content process. We should support more than that.

Alias: android-e10s-multi
Blocks: e10s-multi
OS: All → Android

P1. Important for crash resilience and Fission, but not needed for Fenix MVP.

Priority: -- → P1
Whiteboard: [geckoview:fenix:p3] [geckoview:fxr:p2]
See Also: → 1535365

James says that we should enable e10s-multi for Fenix MVP.

Assignee: nobody → snorp
Summary: Enable multi-e10s → Enable Android e10s-multi
Whiteboard: [geckoview:fenix:p3] [geckoview:fxr:p2] → [geckoview:fenix:m5] [geckoview:fxr:p2]

Assigning to Bobby because James says Bobby's working on e10s-multi. This is a nice-to-have for Fenix MVP.

Whiteboard: [geckoview:fenix:m5] [geckoview:fxr:p2] → [geckoview:fenix:p2] [geckoview:fxr:p2]

I've rebased snorp's patch and debugged it a bit. The main problem causing the test failures is that we're asked to create more than three content processes, which we can't do, and then just hang. This happens because the parent spawns three child processes and then shuts a few of them down, leading it to believe it has runway to start more.

AFAICT There are two problems in the code that lead to this condition:
(1) At present, the standard shutdown path in the child process doesn't seem to actually cause the process to be terminated such that we receive onServiceDisconnected in the parent. I think if this were an opt build we'd eventually terminate via the ForceKillTimer (which is disabled for debug builds), but then we'd still hit the second problem:
(2) It's racey, because child process shutdown is fundamentally asynchronous. We'll need to handle this better.

Let's tackle (1) first. I've traced us all the way to returning from XRE_InitChildProcess, which returns out of nativeRun. But the java portion of the app continues to run. What's the right way for this to be hooked up?

Flags: needinfo?(snorp)

(In reply to Bobby Holley (:bholley) from comment #4)

I've rebased snorp's patch and debugged it a bit. The main problem causing the test failures is that we're asked to create more than three content processes, which we can't do, and then just hang. This happens because the parent spawns three child processes and then shuts a few of them down, leading it to believe it has runway to start more.

AFAICT There are two problems in the code that lead to this condition:
(1) At present, the standard shutdown path in the child process doesn't seem to actually cause the process to be terminated such that we receive onServiceDisconnected in the parent. I think if this were an opt build we'd eventually terminate via the ForceKillTimer (which is disabled for debug builds), but then we'd still hit the second problem:

Ah. The way we handle this in GeckoRuntime is to listen[0] for the Gecko:Exited message. We need to do something similar in GeckoServiceChildProcess and then stop the service.

(2) It's racey, because child process shutdown is fundamentally asynchronous. We'll need to handle this better.

I thought we had something hacked up to just kill children in order to avoid this, but maybe I'm misremembering?

Let's tackle (1) first. I've traced us all the way to returning from XRE_InitChildProcess, which returns out of nativeRun. But the java portion of the app continues to run. What's the right way for this to be hooked up?

[0] https://searchfox.org/mozilla-central/source/mobile/android/geckoview/src/main/java/org/mozilla/geckoview/GeckoRuntime.java#151

Flags: needinfo?(snorp)

Adding [geckoview:fenix:m7] whiteboard tag because we want e10s-multi soon but it's not strictly a Fenix MVP release blocker.

Whiteboard: [geckoview:fenix:p2] [geckoview:fxr:p2] → [geckoview:fenix:m7] [geckoview:fxr:p2]

I spoke with James and I recommend explicitly not targeting MVP for this as it adds risk.

Deferring this bug from Fenix's M7 (July) milestone to the M8 backlog for later in Q3.

Whiteboard: [geckoview:fenix:m7] [geckoview:fxr:p2] → [geckoview:fenix:m8]
Whiteboard: [geckoview:fenix:m8] → [geckoview:fenix:m8] [geckoview:fxr:p2]
Assignee: snorp → bobbyholley

I'm editing a bunch of GeckoView bugs. If you'd like to filter all this bugmail, search and destroy emails containing this UUID:

e88a5094-0fc0-4b7c-b7c5-aef00a11dbc9

Priority: P1 → P2

Fenix issue requesting e10s-multi so one tab crash doesn't crash every tab:
https://github.com/mozilla-mobile/fenix/issues/2863

Defer e10s-multi to October because we may get more engineering help then.

Whiteboard: [geckoview:fenix:m8] [geckoview:fxr:p2] → [geckoview:m1910] [geckoview:fxr:p2]
Rank: 13
Whiteboard: [geckoview:m1910] [geckoview:fxr:p2] → [geckoview:fxr:p2]
Whiteboard: [geckoview:fxr:p2]

Aaron will start working on Android e10s-multi in October, but we expect e10s-multi to be a multi-month project.

Assignee: bholley → aklotz
Priority: P2 → P1
Whiteboard: [geckoview:m1910]
Depends on: 1594820
Depends on: 1595834
Depends on: 1600663
Priority: P1 → --
Summary: Enable Android e10s-multi → [Meta] Enable Android e10s-multi
Depends on: 1608299
No longer depends on: 1595834
No longer blocks: 1560995
Depends on: 1611554
Blocks: 1619655
Depends on: 1622060
Depends on: 1625325
Depends on: 1625326
No longer blocks: 1535365
Depends on: 1634097
Depends on: 1634178
Depends on: 1647470
Depends on: 1647883
Depends on: 1654570
Depends on: 1654817
You need to log in before you can comment on or make changes to this bug.