Reenable bug 1348361 (async content process launching) on nightly and investigate/fix crashes (bug 1446900).

RESOLVED WONTFIX

Status

()

enhancement
P2
normal
RESOLVED WONTFIX
Last year
8 months ago

People

(Reporter: spohl, Assigned: jld)

Tracking

60 Branch
Unspecified
Windows 10
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(crash signature)

Attachments

(3 attachments, 1 obsolete attachment)

We should reenable bug 1348361 on nightly and investigate the crashes that were reported in bug 1446900.
Posted patch Enable bug 1348361 on nightly (obsolete) — Splinter Review
Attachment #8975622 - Flags: review?(ted)
(In reply to Stephen A Pohl [:spohl] from comment #1)
> Created attachment 8975622 [details] [diff] [review]
> Enable bug 1348361 on nightly

This switches ASYNC_CONTENTPROC_LAUNCH (introduced in attachment 8975169 [details] [diff] [review]) on for nightly.
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] [@ mozilla::ipc::IToplevelProtocol::Open]
Attachment #8975622 - Flags: review?(ted) → review+
Updated commit message to what landed on inbound. Carrying over r+.
Attachment #8975622 - Attachment is obsolete: true
Attachment #8979318 - Flags: review+
Keywords: leave-open
These crash signatures are expected to start showing up again in Nightly now that we have reenabled bug 1348361.
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] [@ mozilla::ipc::IToplevelProtocol::Open]
I'd like to start by investigating the two major crash signatures here, i.e.:

IToplevelProtocol::OtherPid
IToplevelProtocol::OtherPid const

These crashes only occur when mOtherPidState is equal to ProcessIdState::eError[1], which is only the case when the process failed to launch for some reason and we ended up calling ContentProcessHost::OnProcessLaunchError()[2]. This is the case when PerformAsyncLaunch[3], i.e. PerformAsyncLaunchInternal[4] returned false.

The MOZ_RELEASE_ASSERTs that I propose to add in this patch seem hacky, but they seem like the quickest way to isolate the problem by gathering more meaningful crash dumps. I'm going to remove them as soon as we were able to determine if there is only one, or possibly several different places where we're currently failing in GeckoChildProcessHost::PerformAsyncLaunchInternal. Also, asserts are behind the ASYNC_CONTENTPROC_LAUNCH define, which is only enabled on nightly.

[1] https://hg.mozilla.org/mozilla-central/annotate/fa5724780fe76d6ccbbd08d978342a1db6a43d49/ipc/glue/ProtocolUtils.cpp#l711
[2] https://dxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#559
[3] https://dxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#551
[4] https://dxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#625
Attachment #8983527 - Flags: review?(jld)
Comment on attachment 8983527 [details] [diff] [review]
Debug GeckoChildProcessHost::PerformAsyncLaunchInternal

Review of attachment 8983527 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/GeckoChildProcessHost.cpp
@@ +627,5 @@
>    // We rely on the fact that InitializeChannel() has already been processed
>    // on the IO thread before this point is reached.
>    if (!GetChannel()) {
> +#ifdef ASYNC_CONTENTPROC_LAUNCH
> +    MOZ_RELEASE_ASSERT(1 == 2);

FYI, if you just want to crash with a distinct crash annotation message for each crash site, MOZ_CRASH with an explanation string will also work, even on non-debug builds: https://searchfox.org/mozilla-central/rev/3737701cfab93ccea04c0e9cab211ad10f931d87/mfbt/Assertions.h#263

(Also, MOZ_CRASH writes __LINE__ to NULL to defeat the compiler's basic block merging, so they should all have correct source line info if they're from a build we have symbols for.)

But these “assertions” are temporary, so this can land as-is.
Attachment #8983527 - Flags: review?(jld) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/25f96c342f93d6f5bd60d8c4b3573887377cbf6e
Bug 1461459: Add nightly-only release asserts in GeckoChildProcessHost::PerformAsyncLaunchInternal to investigate failures to asynchronously launch content processes. r=jld
There are 3 crashes with signature "mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunchInternal" (from the same install).
The moz_crash_reason is MOZ_RELEASE_ASSERT(13 == 14).
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] [@ mozilla::ipc::IToplevelProtocol::Open] → [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] [@ mozilla::ipc::IToplevelProtocol::Open] [@ mozilla::ipc::GeckoChildProcessHost::PerformAsyncL…
Here's one on Mac in the (3 == 4) condition: bp-6a4147cc-8a60-40d7-8bba-eb4550180606

The rest are (13 == 14); currently 3 on Windows, 4 on Android.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #12)
> Here's one on Mac in the (3 == 4) condition:
> bp-6a4147cc-8a60-40d7-8bba-eb4550180606
> 
> The rest are (13 == 14); currently 3 on Windows, 4 on Android.

Thanks, this matches what I'm seeing. Both assert reasons are due to base::LaunchApp (or LaunchAndroidService on Android) failing to set |process| to anything valid, i.e. leaving it null.
(In reply to Stephen A Pohl [:spohl] from comment #13)
> (In reply to Jed Davis [:jld] (⏰UTC-6) from comment #12)
> > The rest are (13 == 14); currently 3 on Windows, 4 on Android.
> 
> Thanks, this matches what I'm seeing. Both assert reasons are due to
> base::LaunchApp (or LaunchAndroidService on Android) failing to set
> |process| to anything valid, i.e. leaving it null.

I believe that would usually be SandboxBroker::LaunchApp on Windows.  I notice that both LaunchApp()s return a boolean for success, which GeckoChildProcessHost mostly isn't looking at, and instead it's assuming that the out parameter is left unchanged on failure (which happens to be true at the moment, but it probably shouldn't be depending on that).
Let's get a better understanding of what's going on in base::LaunchApp. Focusing on macOS first.
Attachment #8984537 - Flags: review?(jld)
Comment on attachment 8984537 [details] [diff] [review]
Debug base::LaunchApp on macOS

Review of attachment 8984537 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, and I noticed some things while looking at the code.

The Mac crashes are all (3 == 4), which means failing to get a Mach message from the child process (or timing out waiting for it).  That could be a side-effect of failing to launch the process, because we don't check for failure where we should be.

But also… we don't construct the ReceivePort for that until after LaunchApp returns, and the comments in mach_ipc_mac.mm say that that creates and register a new port, while the MachPortSender constructor seems to assume the port is already registered and only looks it up.  I don't know Mach IPC very well, but it seems possible for the child to try to look up and send to the port before the parent registers it, in which case everything fails.

However, this patch covers that: there's a crash for every reason LaunchApp could fail, so if we keep seeing (3 == 4) with build IDs later than when this lands, then it's probably something about the Mach messaging.

Also, in case this is one of the posix_spawn* calls failing, and we want to find out the errno (or other bits that aren't privacy-sensistive), MOZ_CRASH_UNSAFE_PRINTF seems to be the right tool for the job.  (It's named "unsafe" to discourage using it if there might be heap corruption, but that's not a problem here.)
Attachment #8984537 - Flags: review?(jld) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/edb403643cefb276a5ab7cdd9364e873b18fdb82
Bug 1461459: Add nightly-only MOZ_CRASH statements in base::LaunchApp on macOS to investigate failures to asynchronously launch content processes. r=jld
The most recent Mac crashes I see - https://bit.ly/2yp7VBD - seem to all have MOZ_RELEASE_ASSERT(3 == 4) as the crash reason. These are crashes in 20180618220118 and 20180617220505.
Handing this off as discussed via IRC.
Assignee: spohl.mozilla.bugs → jld
There are also significant-looking numbers of failures to launch on Windows; the Mac crashes are somewhere around 25x to 50x the rate per user, but there are a lot more Windows users.  There is some telemetry (SANDBOX_FAILED_LAUNCH_KEYED) but it's not telling me anything useful.

We're going to have to do something about those, and that's likely going to take the form of making child process launching fallible and handling failure semi-gracefully (as we currently do) instead of taking down the entire browser.  I expect that this will be even more important post-Fission.
Although early in the 63 crash cycle, mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunchInternal signature is at the top of Mac nightly crash stats.
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #22)
> Although early in the 63 crash cycle,
> mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunchInternal signature is
> at the top of Mac nightly crash stats.

To avoid confusion and concern: this is only enabled in Nightly and will not be enabled anywhere else until we have isolated and resolved the issues here. We have this enabled on Nightly to collect the crash data necessary to fix the issues.
I just ran into https://crash-stats.mozilla.com/report/index/53a114d3-47d1-43a5-bcc6-8bfdb0180629 when I accidentally clicked the Firefox icon in the Dock while it was in the process of upgrading.
Depends on: 1467331
(In reply to Dirkjan Ochtman (:djc) from comment #24)
> I just ran into
> https://crash-stats.mozilla.com/report/index/53a114d3-47d1-43a5-bcc6-
> 8bfdb0180629 when I accidentally clicked the Firefox icon in the Dock while
> it was in the process of upgrading.

Bug 1469691 covers that; thanks for the report.
https://crash-stats.mozilla.com/report/index/ee320dbf-713c-4b3a-b8cd-dd6db0180712 appears to be another crash with a similar Moz Crash Reason - should I add the signature to this bug? Thanks.
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #26)
> https://crash-stats.mozilla.com/report/index/ee320dbf-713c-4b3a-b8cd-
> dd6db0180712 appears to be another crash with a similar Moz Crash Reason -
> should I add the signature to this bug? Thanks.

No, that signature (and the entire stack) doesn't make sense for this.
The future of async content process launch is going in a different direction: ContentProcess will still delay post-launch initialization until the actual pid/handle is available, but it will use MozPromise chaining to yield control of the main thread until then.  Currently this is doable for preallocated processes (bug 1446161), and for the general case Project Fission will be changing the frame loader logic so it can asynchronously await the new process.
Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Resolution: --- → WONTFIX
Can we remove all the code behind ASYNC_CONTENTPROC_LAUNCH now? https://searchfox.org/mozilla-central/search?q=ASYNC_CONTENTPROC_LAUNCH&case=false&regexp=false&path=
Flags: needinfo?(jld)
(In reply to Marco Castelluccio [:marco] from comment #29)
> Can we remove all the code behind ASYNC_CONTENTPROC_LAUNCH now?
> https://searchfox.org/mozilla-central/
> search?q=ASYNC_CONTENTPROC_LAUNCH&case=false&regexp=false&path=

There's a patch for that in bug 1446161; I'm in the process of handling the review comments for that patch stack, so it should be landing soon.
Flags: needinfo?(jld)
You need to log in before you can comment on or make changes to this bug.