Closed Bug 1446900 Opened 2 years ago Closed Last year

Crash in mozilla::ipc::IToplevelProtocol::OtherPid

Categories

(Core :: IPC, defect, P1, blocker)

60 Branch
Unspecified
Windows 10
defect

Tracking

()

VERIFIED FIXED
mozilla62
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 blocking verified
firefox62 + verified

People

(Reporter: calixte, Assigned: spohl)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file, 2 obsolete files)

This bug was filed from the Socorro interface and is
report bp-4d3582e4-854f-4cc0-ae06-5470f0180318.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll mozilla::ipc::IToplevelProtocol::OtherPid ipc/glue/ProtocolUtils.cpp:602
1 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::ipc::FileDescriptor>::Write ipc/glue/FileDescriptor.cpp:235
2 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::dom::FileDescOrError>::Write ipc/ipdl/PContent.cpp:4049
3 xul.dll mozilla::dom::PContentParent::SendPScriptCacheConstructor ipc/ipdl/PContentParent.cpp:1128
4 xul.dll mozilla::ScriptPreloader::InitContentChild js/xpconnect/loader/ScriptPreloader.cpp:184
5 xul.dll mozilla::dom::ContentParent::InitInternal dom/ipc/ContentParent.cpp:2319
6 xul.dll mozilla::dom::ContentParent::LaunchSubprocess dom/ipc/ContentParent.cpp:2073
7 xul.dll mozilla::dom::ContentParent::PreallocateProcess dom/ipc/ContentParent.cpp:608
8 xul.dll mozilla::PreallocatedProcessManagerImpl::AllocateNow dom/ipc/PreallocatedProcessManager.cpp:269
9 xul.dll mozilla::detail::RunnableMethodImpl<nsCOMPtr<nsIThreadPool>, nsresult  xpcom/threads/nsThreadUtils.h:1200

=============================================================

There are 2 crashes (from 2 installations) in nightly 61 with buildid 20180317220121. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1348361.

[1] https://hg.mozilla.org/mozilla-central/rev?node=dd3dc46a8d7d
Flags: needinfo?(agaynor)
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid] → [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid]
Hmmm, I think this is what would happen if the process failed to spawn for any reason. OtherPid() doesn't have any way of signaling that the process is in an errored state so we crash. We probably need a better way to handle asynchronous process spawn failures -- though I'm not sure what circumstances they occur in, or what we do when they occur! (They seem to be very low volume, which is good).

:spohl, I believe this has been true since the very earliest iterations of the async process spawning patch; was async spawn errors something that was ever considered/discussed?
Flags: needinfo?(spohl.mozilla.bugs)
Discussed via IRC
Flags: needinfo?(spohl.mozilla.bugs)
Memorializing the conclusion here:

1) Need to make sure we understand the previous behaviour on m-c, :spohl thinks we already crashed when a content process failed to spawn, in which case this is really just a change-in-stack, not a new crash.

2) If this is a regression, the right fix is probably to move from OtherPid() blocking to structuring the code so we don't even try to call OtherPid() until the process is spawned, probably with some sort of Promise/Resolver abstraction.

I'm going to attempt to setup an environment where the process fails to spawn and compare.
Flags: needinfo?(agaynor)
Previous behavior was in DEBUG builds to crash, and in release builds to render a blank content area that could not be interacted with: no new tabs, no entering URLs into the browser.

In light of that, I don't think this is a critical issue, but it's likely by work on bug 1446161 will overlap with resolving this.
Assignee: nobody → agaynor
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] → [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] [@ mozilla::ipc::IToplevelProtocol::~IToplevelProtocol]
In addition to the signature I added above, I also see another one [@ mozilla::ipc::IProtocol::GetActorEventTarget ]. Alex - is this all part of this bug or do I need to file another?
Flags: needinfo?(agaynor)
Do you have a link to one of those crashes?
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #6)
> Do you have a link to one of those crashes?

Sure, https://crash-stats.mozilla.com/report/index/e24dbbd2-c86c-4720-ac85-8cbfb0180408
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] [@ mozilla::ipc::IToplevelProtocol::~IToplevelProtocol] → [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] [@ mozilla::ipc::IToplevelProtocol::~IToplevelProtocol] [@ mozilla::ipc::IToplevelProtocol::Ope…
That stacktrace is pretty confusing to me, but I believe it's the same underlying issue.
Flags: needinfo?(agaynor)
Adding another signature - https://bit.ly/2HU2Btc - that has the same crash reason: MOZ_RELEASE_ASSERT(mOtherPidState == ProcessIdState::eReady)
Crash Signature: mozilla::ipc::IToplevelProtocol::Open] → mozilla::ipc::IToplevelProtocol::Open] [@ IDMap<T>::Remove]
I observed this crash when I had a single google doc tab open, suspended my laptop, then unsupended it 1.5 hours later.
This is actually one of the more frequent parent process crashes we're seeing on 61 at the moment.
Hey Alex, this is currently 11% of our browser crashes on Beta61. Any chance you can take another look or provide an update?
Flags: needinfo?(agaynor)
#3 & #5 topcrash on Beta61. Bumping this up to blocker level.
Severity: critical → blocker
Priority: P3 → P1
It might make sense to back out bug 1348361, reenable on nightly only and gather data until we have this fixed. Also, we should follow up on the perf gains to see what we have gained from bug 1348361.
Attached patch Place bug 1348361 behind ifdef (obsolete) — Splinter Review
I suggest that we place bug 1348361 behind an ifdef. Once it is confirmed on Nightly that we fixed these crashes and that there are no other new crashes, this patch could be uplifted. We could then reenable bug 1348361 on Nightly and work to fix these crashes with async process launching.
Assignee: agaynor → spohl.mozilla.bugs
Status: NEW → ASSIGNED
Attachment #8975144 - Flags: review?(jld)
Attached patch Enable bug 1348361 on nightly (obsolete) — Splinter Review
Once we have confirmed that the crashes are fixed by reverting bug 1348361 (attachment 8975144 [details] [diff] [review]), we should reenable it on nightly to gather data, investigate and fix the crashes.
Attachment #8975164 - Flags: review?(ted)
Fix OR'd ifdef.
Attachment #8975144 - Attachment is obsolete: true
Attachment #8975144 - Flags: review?(jld)
Attachment #8975169 - Flags: review?(jld)
(In reply to Stephen A Pohl [:spohl] from comment #20)
> Created attachment 8975169 [details] [diff] [review]
> Place bug 1348361 behind ifdef
> 
> Fix OR'd ifdef.

s/OR/AND/
Comment on attachment 8975144 [details] [diff] [review]
Place bug 1348361 behind ifdef

Review of attachment 8975144 [details] [diff] [review]:
-----------------------------------------------------------------

Mostly looks good, but doesn't seem to handle the error case correctly — that is, the same way it was handled before the async pid changes, which (if I understand correctly) was that OtherPid would crash if it were called because it release-asserted that the pid was valid, but ContentParent::LaunchSubprocess would have returned false so it (probably?) wasn't being called.

::: dom/ipc/ContentParent.cpp
@@ +1546,5 @@
>      }
>    }
>  #endif
>  
> +#ifdef MOZ_CODE_COVERAGE && ASYNC_CONTENTPROC_LAUNCH

#if defined(MOZ_CODE_COVERAGE) && defined(ASYNC_CONTENTPROC_LAUNCH)

@@ +2067,4 @@
>    if (!mSubprocess->Launch(extraArgs)) {
> +#else
> +  if (!mSubprocess->LaunchAndWaitForProcessHandle(extraArgs)) {
> +#endif

This looks like a good place for a SetOtherProcessId(kInvalidProcessId, ProcessIdState::eError) to replace the one in OnProcessLaunchError.

::: ipc/glue/GeckoChildProcessHost.cpp
@@ +559,1 @@
>      OnProcessLaunchError();

If OnProcessLaunchError isn't called, won't the pid state just stay at ePending and cause OtherPid to block forever instead of crashing?
Attachment #8975144 - Attachment is obsolete: false
Comment on attachment 8975144 [details] [diff] [review]
Place bug 1348361 behind ifdef

(Fixing mid-air collision.)
Attachment #8975144 - Attachment is obsolete: true
Comment on attachment 8975169 [details] [diff] [review]
Place bug 1348361 behind ifdef

Review of attachment 8975169 [details] [diff] [review]:
-----------------------------------------------------------------

See comment #22.
Attachment #8975169 - Flags: review?(jld) → review-
Comment on attachment 8975169 [details] [diff] [review]
Place bug 1348361 behind ifdef

(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #22)
> Comment on attachment 8975144 [details] [diff] [review]
> Place bug 1348361 behind ifdef
> 
> Review of attachment 8975144 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Mostly looks good, but doesn't seem to handle the error case correctly —
> that is, the same way it was handled before the async pid changes, which (if
> I understand correctly) was that OtherPid would crash if it were called
> because it release-asserted that the pid was valid, but
> ContentParent::LaunchSubprocess would have returned false so it (probably?)
> wasn't being called.

This isn't trying to address the error case. All indicators are that we're not encountering the previous error case, but newly introduced errors from the patches in bug 1348361. The new crashes are top crashers. Previously, if there were crashes, they didn't seem to rank anywhere in the top crashers. This patch intends to be a minimal patch to revert to the behavior pre- bug 1348361 to be uplifted to beta.

> ::: dom/ipc/ContentParent.cpp
> @@ +1546,5 @@
> >      }
> >    }
> >  #endif
> >  
> > +#ifdef MOZ_CODE_COVERAGE && ASYNC_CONTENTPROC_LAUNCH
> 
> #if defined(MOZ_CODE_COVERAGE) && defined(ASYNC_CONTENTPROC_LAUNCH)
> 
> @@ +2067,4 @@
> >    if (!mSubprocess->Launch(extraArgs)) {
> > +#else
> > +  if (!mSubprocess->LaunchAndWaitForProcessHandle(extraArgs)) {
> > +#endif
> 
> This looks like a good place for a SetOtherProcessId(kInvalidProcessId,
> ProcessIdState::eError) to replace the one in OnProcessLaunchError.

This is helpful for a patch that will address the new error cases from async process launching. However, it shouldn't be relevant to this patch that tries to revert to pre- bug 1348361 behavior.


> ::: ipc/glue/GeckoChildProcessHost.cpp
> @@ +559,1 @@
> >      OnProcessLaunchError();
> 
> If OnProcessLaunchError isn't called, won't the pid state just stay at
> ePending and cause OtherPid to block forever instead of crashing?

Without async process launching, this is set in ContentParent::OnChannelConnected.
Attachment #8975169 - Flags: review- → review?(jld)
(In reply to Stephen A Pohl [:spohl] from comment #25)
> Comment on attachment 8975169 [details] [diff] [review]
> Place bug 1348361 behind ifdef
> 
> (In reply to Jed Davis [:jld] (⏰UTC-6) from comment #22)
> > Comment on attachment 8975144 [details] [diff] [review]
> > ::: ipc/glue/GeckoChildProcessHost.cpp
> > @@ +559,1 @@
> > >      OnProcessLaunchError();
> > 
> > If OnProcessLaunchError isn't called, won't the pid state just stay at
> > ePending and cause OtherPid to block forever instead of crashing?
> 
> Without async process launching, this is set in
> ContentParent::OnChannelConnected.

Oh, I misread your comment. Let me double-check.
(In reply to Stephen A Pohl [:spohl] from comment #26)
> (In reply to Stephen A Pohl [:spohl] from comment #25)
> > Comment on attachment 8975169 [details] [diff] [review]
> > Place bug 1348361 behind ifdef
> > 
> > (In reply to Jed Davis [:jld] (⏰UTC-6) from comment #22)
> > > Comment on attachment 8975144 [details] [diff] [review]
> > > ::: ipc/glue/GeckoChildProcessHost.cpp
> > > @@ +559,1 @@
> > > >      OnProcessLaunchError();
> > > 
> > > If OnProcessLaunchError isn't called, won't the pid state just stay at
> > > ePending and cause OtherPid to block forever instead of crashing?
> > 
> > Without async process launching, this is set in
> > ContentParent::OnChannelConnected.
> 
> Oh, I misread your comment. Let me double-check.

Ok, so we launch the subprocess synchronously in ContentParent::LaunchSubprocess and OtherPid shouldn't get called if we were unable to launch the process successfully because we return early in the failure case.
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #4)
> Previous behavior was in DEBUG builds to crash, and in release builds to
> render a blank content area that could not be interacted with: no new tabs,
> no entering URLs into the browser.

Just want to highlight this. Assuming the analysis is correct, this is exposing a content process launch problem on Windows and OSX that up until now, we just ignored.

Looking through crashes I see most uptimes < 15 seconds, and I see a lot of foreign 3rd party libraries loaded in the browser on Windows.

We'll double down on it next week. Crash rates don't look that serious yet. Most of the signatures are 2x+ crash per install.
(In reply to Stephen A Pohl [:spohl] from comment #25)
> This isn't trying to address the error case. All indicators are that we're
> not encountering the previous error case, but newly introduced errors from
> the patches in bug 1348361. The new crashes are top crashers. Previously, if
> there were crashes, they didn't seem to rank anywhere in the top crashers.
> This patch intends to be a minimal patch to revert to the behavior pre- bug
> 1348361 to be uplifted to beta.

Right, but it looks like your patch will turn those crashes into hangs, and setting the state to eError would make them continue being crashes, like they were before bug 1348361.

> > If OnProcessLaunchError isn't called, won't the pid state just stay at
> > ePending and cause OtherPid to block forever instead of crashing?
> 
> Without async process launching, this is set in
> ContentParent::OnChannelConnected.

But OnChannelConnected isn't called until the parent gets the child's Hello message, which it can't send if it didn't launch: https://searchfox.org/mozilla-central/rev/babf96cf0c37b608e9663e2af73a4d21819c4af1/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#551
Comment on attachment 8975169 [details] [diff] [review]
Place bug 1348361 behind ifdef

Review of attachment 8975169 [details] [diff] [review]:
-----------------------------------------------------------------

Summarizing discussion on IRC: ContentParent::LaunchSubprocess is private, and all of its call sites are static methods that destroy the object immediately if it returns false, so we don't need to care about the pid state in that case.  It might be nice to assert that that can't happen, but not essential.
Attachment #8975169 - Flags: review?(jld) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/250e4ae8ede9fce6f3f7a357438bc6fd6a56416f
Bug 1446900: Place bug 1348361 (async content process launching) behind an ifdef and disable it to avoid numerous top crashers. r=jld
Looking good so far with today's Nightlies! Stephen, can you please nominate this patch for Beta uplift as soon as you're comfortable doing? Ideally, I'd like to include it in Monday's 61.0b5 builds.
Flags: needinfo?(agaynor) → needinfo?(spohl.mozilla.bugs)
Clearing signatures that don't have the MOZ_RELEASE_ASSERT(mOtherPidState == ProcessIdState::eReady) crash reason. Furthermore, all crash reports for these two signatures occurred in versions prior to 61 when bug 1348361 landed.
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] [@ mozilla::ipc::IToplevelProtocol::~IToplevelProtocol] [@ mozilla::ipc::IToplevelProtocol::Ope… → [@ mozilla::ipc::IToplevelProtocol::OtherPid] [@ mozilla::ipc::IToplevelProtocol::OtherPid const] [@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] [@ mozilla::ipc::IToplevelProtocol::Open]
Comment on attachment 8975169 [details] [diff] [review]
Place bug 1348361 behind ifdef

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1348361
[User impact if declined]: top crashes
[Is this code covered by automated tests?]: Launching of child processes is exercised throughout. However, there is no dedicated test for this patch that reverts to the way we used to launch child processes prior to bug 1348361.
[Has the fix been verified in Nightly?]: Yes. No crash reports on nightly since landing.
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: medium/minimal
[Why is the change risky/not risky?]: This reverts our code paths to the way we used to launch child processes prior to the landing of bug 1348361. However, since this isn't a straight backout, there is a risk of introducing new problems. Attempts were made to keep the patch as minimal and straightforward as possible.
[String changes made/needed]: none
Flags: needinfo?(spohl.mozilla.bugs)
Attachment #8975169 - Flags: approval-mozilla-beta?
Comment on attachment 8975169 [details] [diff] [review]
Place bug 1348361 behind ifdef

Nightly crash data is still looking good, let's get this landed for 61.0b5. Is there anything in particular we should be on the lookout for from a regression standpoint after this lands?
Attachment #8975169 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Ryan VanderMeulen [:RyanVM] from comment #36)
> Comment on attachment 8975169 [details] [diff] [review]
> Place bug 1348361 behind ifdef
> 
> Nightly crash data is still looking good, let's get this landed for 61.0b5.
> Is there anything in particular we should be on the lookout for from a
> regression standpoint after this lands?

It is expected that there will be an uptick in users who encounter blank tabs that cannot be interacted with, since this was the behavior pre- bug 1348361. What would be of concern (and a regression) is if the landing of this patch caused any new crashes, or if the number of users encountering blank tabs outpaces the number of crashes that were reported here from landing bug 1348361.
Per IRC discussion with Stephen, we're going to move the re-enabling patch to a different bug to simplify tracking.
Status: ASSIGNED → RESOLVED
Closed: Last year
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
Attachment #8975164 - Attachment is obsolete: true
Attachment #8975164 - Flags: review?(ted)
No crash reports from any builds including this fix -> verified.
Status: RESOLVED → VERIFIED

Similar bug just happened.
https://bugzilla.mozilla.org/show_bug.cgi?id=1564127

Reopen?

Flags: needinfo?(ryanvm)

No, tracking it in a new bug is the right thing to do.

Flags: needinfo?(ryanvm)
You need to log in before you can comment on or make changes to this bug.