Incorrect handling of Android launch failure

RESOLVED FIXED in Firefox 69

Status

()

defect
P1
critical
RESOLVED FIXED
3 months ago
Last month

People

(Reporter: davidb, Assigned: jld)

Tracking

(Blocks 1 bug, {crash, regression, topcrash})

unspecified
mozilla69
Unspecified
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(geckoview66 unaffected, firefox-esr60 unaffected, firefox66 unaffected, firefox67 unaffected, firefox67.0.1 unaffected, firefox68 wontfix, firefox69 fixed)

Details

(Whiteboard: [geckoview:fenix:m6])

Attachments

(1 attachment)

This bug is for crash report bp-438d7634-2103-4bcf-a85c-56ac50190416.

Top 10 frames of crashing thread:

0 libxul.so mozilla::layers::PCompositorManager::CreateEndpoints ipc/glue/ProtocolUtils.h:852
1 libxul.so mozilla::gfx::GPUProcessManager::CreateContentBridges gfx/ipc/GPUProcessManager.cpp:797
2 libxul.so mozilla::dom::ContentParent::LaunchSubprocessInternal const dom/ipc/ContentParent.cpp:2539
3 libxul.so mozilla::dom::ContentParent::LaunchSubprocessInternal dom/ipc/ContentParent.cpp:2211
4 libxul.so mozilla::dom::ContentParent::LaunchSubprocessSync dom/ipc/ContentParent.cpp:2234
5 libxul.so mozilla::dom::ContentParent::GetNewOrUsedBrowserProcess dom/ipc/ContentParent.cpp:899
6 libxul.so mozilla::dom::ContentParent::CreateBrowser dom/ipc/ContentParent.cpp:1142
7 libxul.so nsFrameLoader::TryRemoteBrowser dom/base/nsFrameLoader.cpp:2605
8 libxul.so nsFrameLoader::ShowRemoteFrame dom/base/nsFrameLoader.cpp:868
9 libxul.so nsFrameLoader::Show dom/base/nsFrameLoader.cpp:737

This is a fenix crash.

Priority: -- → P3
Whiteboard: [geckoview] → [geckoview:fenix]

There have only been five reports with the signature so far.

Whiteboard: [geckoview:fenix] → [geckoview:fenix:p5]
Duplicate of this bug: 1552201

GV doesn't use a GPU process but this stack looks like Gecko is trying to access a GPU process.

Crash Signature: [@ mozilla::layers::PCompositorManager::CreateEndpoints] → [@ mozilla::ipc::CreateEndpoints<T>] [@ mozilla::layers::PCompositorManager::CreateEndpoints]
Whiteboard: [geckoview:fenix:p5] → [geckoview:fenix:p3]

The naming of the code here doesn't really match the way we talk about it. This code is creating IPC connections for the content process to connect to where the GPU code will run, which might be a standalone GPU process, or might just be the parent.

We're crashing because of this assertion - https://searchfox.org/mozilla-central/rev/94c6b5f06d2464f6780a52f32e917d25ddc30d6b/ipc/glue/ProtocolUtils.h#860

It looks like base::ProcessId is always just some form of integer, so that would mean aChildDestPid==0.

That values comes from here - https://searchfox.org/mozilla-central/rev/94c6b5f06d2464f6780a52f32e917d25ddc30d6b/dom/ipc/ContentParent.cpp#2512

We also verify that value here - https://searchfox.org/mozilla-central/rev/94c6b5f06d2464f6780a52f32e917d25ddc30d6b/ipc/glue/ProtocolUtils.cpp#587

Except that time it's being compared against kInvalidProcessId (-1).

Is the crashing assertion just wrong? It seems like 0 might be a valid value for a ProcessId.

Flags: needinfo?(jld)

0 isn't a valid ProcessId; POSIX requires pids to be positive integers. (Linux in particular uses pid 0 to indicate the nonexistence of a pid in some cases, like for getppid().)

Leaving needinfo because I want to look at this more closely to see if I can see how the 0 got in here. It's interesting that this happens only(?) on Android, because child process launching goes through the Android service mechanism instead of using normal Unix APIs, so there might be a bug there.

Since Comment 3 it looks as if we are up to 95 crashes for the first signature and 39 for the second.

Adding the [geckoview:fenix:m6] whiteboard tag because we should try to fix this crash for Fenix MVP (GeckoView 68). This crash signature is top native code crash in Fenix.

Whiteboard: [geckoview:fenix:p3] → [geckoview:fenix:m6]

The Android process launch path returns 0 to indicate failure to launch. We should be checking for that here, and we're not.

And probably this debug assertion should be promoted at least to a MOZ_DIAGNOSTIC_ASSERT (or maybe a MOZ_RELEASE_ASSERT, if we're going to end up hitting another MOZ_RELEASE_ASSERT anyway).

I don't know exactly why we'd be failing to launch the content process service, but since this is mobile, maybe OOM? In any case, assuming that it's launch failure, even with the error handling fixed we'd still fail to load the document, but it wouldn't take down the browser and in theory the frameloader could try to handle it.

I'm going to assume that that is what's going on and make that change, because we should be doing that check regardless.

Assignee: nobody → jld
Component: Graphics: Layers → IPC
Flags: needinfo?(jld)
Priority: P3 → P2
Summary: Crash in [@ mozilla::layers::PCompositorManager::CreateEndpoints] → Incorrect handling of Android launch failure | Crash in [@ mozilla::layers::PCompositorManager::CreateEndpoints]

I doubt that this is a regression. Did we release GeckoView to a wider audience sometime around 2019-05-16, maybe?

If there is a change that made this occur much more frequently relative to the size of the userbase, we should try to find out what it might be.

Flags: needinfo?(cpeterson)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #9)

I doubt that this is a regression. Did we release GeckoView to a wider audience sometime around 2019-05-16, maybe?

Yes. We shipped Focus 8.0.11 Beta with GeckoView 68 Nightly. However, we had been shipping earlier snapshots of GeckoView 68 Nightly without these crashes:

2019-04-22 - Shipped Focus 8.0.9 Beta with a snapshot of GeckoView 68 Nightly
2019-04-25 - Promoted Focus 8.0.9 Beta to Release
2019-05-10 - Shipped Focus 8.0.10 Beta with a snapshot of GeckoView 68 Nightly
2019-05-15 - Shipped Focus 8.0.11 Beta with a snapshot of GeckoView 68 Nightly

So it seems like something may have regressed in Focus or GeckoView 68 between 2019-05-10 and 2019-05-15?

@ Dylan, do you recall any particular Focus or A-C changes in Focus 8.0.11 that might have caused problems launching content processes?

If not in Focus or A-C, do you know the Gecko changesets used for Focus 8.0.10 and 8.0.11? We can produce a Gecko changelog between those snapshots and look for suspicious Gecko or GV changes.

Flags: needinfo?(cpeterson) → needinfo?(droeh)

Focus 8.0.10 with GeckoView 2019-04-29 (hg revision 420e18a75314b8123b515d8a93cbacd145ecb03c)
https://github.com/mozilla-mobile/focus-android/commit/6678ad426ccca236ddc0caedcbd521d5f39d7d2c

Focus 8.0.11 with GeckoView 2019-05-31 (hg revision fa3cfee27619ddc9bcbcf70555bda4eb1e815146)
https://github.com/mozilla-mobile/focus-android/commit/1f0cf098a9c357acc49563ed8b8aec18a4529a1c

m-c changelog between those revisions:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=420e18a75314b8123b515d8a93cbacd145ecb03c&tochange=fa3cfee27619ddc9bcbcf70555bda4eb1e815146

Could APZ/GPU process Bug 1533673 be related? Just a guess, since it touched GPU process code.

Priority: P2 → P1

This crash signature has really spiked in Fenix. Even if we handle errors returned from LaunchAndroidService, we don't know yet why the process is failing to launch. What can we do in the meantime to get more diagnostic information on the root cause?

(In reply to Chris Peterson [:cpeterson] from comment #11)

Could APZ/GPU process Bug 1533673 be related? Just a guess, since it touched GPU process code.

Adding a needinfo for Kats on this part since this is our top fenix gecko crash. Kats could that bug be involved here somehow?

Flags: needinfo?(kats)

(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #13)

Adding a needinfo for Kats on this part since this is our top fenix gecko crash. Kats could that bug be involved here somehow?

I doubt it. The problem seems to be with launching a process, not IPC. And GPU process code is not relevant here, as Matt pointed out.

Flags: needinfo?(kats)

(In reply to Chris Peterson [:cpeterson] from comment #10)

@ Dylan, do you recall any particular Focus or A-C changes in Focus 8.0.11 that might have caused problems launching content processes?

If not in Focus or A-C, do you know the Gecko changesets used for Focus 8.0.10 and 8.0.11? We can produce a Gecko changelog between those snapshots and look for suspicious Gecko or GV changes.

I don't really see any way this could be caused by a Focus or AC change; they shouldn't really have any involvement in this.

As for the changesets used, I think you're looking at 420e18a75314b8123b515d8a93cbacd145ecb03c for 8.0.10 and fa3cfee27619ddc9bcbcf70555bda4eb1e815146 for 8.0.11

Flags: needinfo?(droeh)

(In reply to Chris Peterson [:cpeterson] from comment #12)

This crash signature has really spiked in Fenix. Even if we handle errors returned from LaunchAndroidService, we don't know yet why the process is failing to launch. What can we do in the meantime to get more diagnostic information on the root cause?

You'll need to ask someone who knows more about Android how to instrument these service launch failures; it's very different from how we manage processes on any other OS we support, and I don't even have a good idea of what failure modes to consider (besides cross-platform concerns like OOM).

Note that, with the crash fixed, we'll get SUBPROCESS_LAUNCH_FAILURE telemetry instead.

68.0b0 has over 1800 crashes. Adding the topcrash keyword.

Keywords: topcrash

Currently the Android implementation of process launch signals failure by
returning pid 0 through an out-parameter; we don't check for that, so we
report success and then the bad pid spreads through IPC until it sets off
an assertion somewhere else. This patch adds the missing check and
strengthens an assertion that would have caught the problem more directly.

Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/158f25256022
Detect Android launch failure. r=snorp
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69

68=fix-optional. We don't need to uplift this error handling cleanup to GeckoView 68 Beta for Fenix because it doesn't fix or, AFAIU, help diagnose the root cause of the content process launch failures.

I filed bug 1555447 to implement snorp's suggestion that GeckoView crash itself closer to the error inside LaunchAndroidService so we can stuff more diagnostic information in an exception report.

No longer blocks: 1555447
Depends on: 1555447

I filed new bug 1557449 to track the actual CreateEndpoint crash.

Blocks: 1557449
Crash Signature: [@ mozilla::ipc::CreateEndpoints<T>] [@ mozilla::layers::PCompositorManager::CreateEndpoints]
Summary: Incorrect handling of Android launch failure | Crash in [@ mozilla::layers::PCompositorManager::CreateEndpoints] → Incorrect handling of Android launch failure
You need to log in before you can comment on or make changes to this bug.