Closed
Bug 1446900
Opened 7 years ago
Closed 6 years ago
Crash in mozilla::ipc::IToplevelProtocol::OtherPid
Categories
(Core :: IPC, defect, P1)
Tracking
()
VERIFIED
FIXED
mozilla62
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox-esr60 | --- | unaffected |
firefox59 | --- | wontfix |
firefox60 | --- | wontfix |
firefox61 | blocking | verified |
firefox62 | + | verified |
People
(Reporter: calixte, Assigned: spohl)
References
(Blocks 1 open bug)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file, 2 obsolete files)
4.37 KB,
patch
|
jld
:
review+
RyanVM
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-4d3582e4-854f-4cc0-ae06-5470f0180318.
=============================================================
Top 10 frames of crashing thread:
0 xul.dll mozilla::ipc::IToplevelProtocol::OtherPid ipc/glue/ProtocolUtils.cpp:602
1 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::ipc::FileDescriptor>::Write ipc/glue/FileDescriptor.cpp:235
2 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::dom::FileDescOrError>::Write ipc/ipdl/PContent.cpp:4049
3 xul.dll mozilla::dom::PContentParent::SendPScriptCacheConstructor ipc/ipdl/PContentParent.cpp:1128
4 xul.dll mozilla::ScriptPreloader::InitContentChild js/xpconnect/loader/ScriptPreloader.cpp:184
5 xul.dll mozilla::dom::ContentParent::InitInternal dom/ipc/ContentParent.cpp:2319
6 xul.dll mozilla::dom::ContentParent::LaunchSubprocess dom/ipc/ContentParent.cpp:2073
7 xul.dll mozilla::dom::ContentParent::PreallocateProcess dom/ipc/ContentParent.cpp:608
8 xul.dll mozilla::PreallocatedProcessManagerImpl::AllocateNow dom/ipc/PreallocatedProcessManager.cpp:269
9 xul.dll mozilla::detail::RunnableMethodImpl<nsCOMPtr<nsIThreadPool>, nsresult xpcom/threads/nsThreadUtils.h:1200
=============================================================
There are 2 crashes (from 2 installations) in nightly 61 with buildid 20180317220121. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1348361.
[1] https://hg.mozilla.org/mozilla-central/rev?node=dd3dc46a8d7d
Flags: needinfo?(agaynor)
Reporter | ||
Updated•7 years ago
|
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid] → [@ mozilla::ipc::IToplevelProtocol::OtherPid]
[@ mozilla::ipc::IToplevelProtocol::OtherPid const]
[@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid]
Comment 1•7 years ago
|
||
Hmmm, I think this is what would happen if the process failed to spawn for any reason. OtherPid() doesn't have any way of signaling that the process is in an errored state so we crash. We probably need a better way to handle asynchronous process spawn failures -- though I'm not sure what circumstances they occur in, or what we do when they occur! (They seem to be very low volume, which is good).
:spohl, I believe this has been true since the very earliest iterations of the async process spawning patch; was async spawn errors something that was ever considered/discussed?
Flags: needinfo?(spohl.mozilla.bugs)
Comment 3•7 years ago
|
||
Memorializing the conclusion here:
1) Need to make sure we understand the previous behaviour on m-c, :spohl thinks we already crashed when a content process failed to spawn, in which case this is really just a change-in-stack, not a new crash.
2) If this is a regression, the right fix is probably to move from OtherPid() blocking to structuring the code so we don't even try to call OtherPid() until the process is spawned, probably with some sort of Promise/Resolver abstraction.
I'm going to attempt to setup an environment where the process fails to spawn and compare.
Flags: needinfo?(agaynor)
Comment 4•7 years ago
|
||
Previous behavior was in DEBUG builds to crash, and in release builds to render a blank content area that could not be interacted with: no new tabs, no entering URLs into the browser.
In light of that, I don't think this is a critical issue, but it's likely by work on bug 1446161 will overlap with resolving this.
Updated•7 years ago
|
Assignee: nobody → agaynor
Updated•7 years ago
|
Priority: -- → P3
Updated•7 years ago
|
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid]
[@ mozilla::ipc::IToplevelProtocol::OtherPid const]
[@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid] → [@ mozilla::ipc::IToplevelProtocol::OtherPid]
[@ mozilla::ipc::IToplevelProtocol::OtherPid const]
[@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid]
[@ mozilla::ipc::IToplevelProtocol::~IToplevelProtocol]
Comment 5•7 years ago
|
||
In addition to the signature I added above, I also see another one [@ mozilla::ipc::IProtocol::GetActorEventTarget ]. Alex - is this all part of this bug or do I need to file another?
Flags: needinfo?(agaynor)
Comment 6•7 years ago
|
||
Do you have a link to one of those crashes?
Comment 7•7 years ago
|
||
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #6)
> Do you have a link to one of those crashes?
Sure, https://crash-stats.mozilla.com/report/index/e24dbbd2-c86c-4720-ac85-8cbfb0180408
Updated•6 years ago
|
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid]
[@ mozilla::ipc::IToplevelProtocol::OtherPid const]
[@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid]
[@ mozilla::ipc::IToplevelProtocol::~IToplevelProtocol] → [@ mozilla::ipc::IToplevelProtocol::OtherPid]
[@ mozilla::ipc::IToplevelProtocol::OtherPid const]
[@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid]
[@ mozilla::ipc::IToplevelProtocol::~IToplevelProtocol]
[@ mozilla::ipc::IToplevelProtocol::Ope…
Comment 8•6 years ago
|
||
That stacktrace is pretty confusing to me, but I believe it's the same underlying issue.
Flags: needinfo?(agaynor)
Updated•6 years ago
|
Comment 9•6 years ago
|
||
Adding another signature - https://bit.ly/2HU2Btc - that has the same crash reason: MOZ_RELEASE_ASSERT(mOtherPidState == ProcessIdState::eReady)
Crash Signature: mozilla::ipc::IToplevelProtocol::Open] → mozilla::ipc::IToplevelProtocol::Open]
[@ IDMap<T>::Remove]
Comment 10•6 years ago
|
||
I observed this crash when I had a single google doc tab open, suspended my laptop, then unsupended it 1.5 hours later.
Comment 11•6 years ago
|
||
This is actually one of the more frequent parent process crashes we're seeing on 61 at the moment.
status-firefox-esr60:
--- → affected
tracking-firefox61:
--- → +
Comment 12•6 years ago
|
||
Hey Alex, this is currently 11% of our browser crashes on Beta61. Any chance you can take another look or provide an update?
Comment 13•6 years ago
|
||
#3 & #5 topcrash on Beta61. Bumping this up to blocker level.
Assignee | ||
Comment 14•6 years ago
|
||
It might make sense to back out bug 1348361, reenable on nightly only and gather data until we have this fixed. Also, we should follow up on the perf gains to see what we have gained from bug 1348361.
Assignee | ||
Comment 15•6 years ago
|
||
Assignee | ||
Comment 16•6 years ago
|
||
Assignee | ||
Comment 17•6 years ago
|
||
I suggest that we place bug 1348361 behind an ifdef. Once it is confirmed on Nightly that we fixed these crashes and that there are no other new crashes, this patch could be uplifted. We could then reenable bug 1348361 on Nightly and work to fix these crashes with async process launching.
Assignee: agaynor → spohl.mozilla.bugs
Status: NEW → ASSIGNED
Attachment #8975144 -
Flags: review?(jld)
Assignee | ||
Comment 18•6 years ago
|
||
Once we have confirmed that the crashes are fixed by reverting bug 1348361 (attachment 8975144 [details] [diff] [review]), we should reenable it on nightly to gather data, investigate and fix the crashes.
Attachment #8975164 -
Flags: review?(ted)
Assignee | ||
Updated•6 years ago
|
Keywords: leave-open
Assignee | ||
Comment 19•6 years ago
|
||
Assignee | ||
Comment 20•6 years ago
|
||
Fix OR'd ifdef.
Attachment #8975144 -
Attachment is obsolete: true
Attachment #8975144 -
Flags: review?(jld)
Attachment #8975169 -
Flags: review?(jld)
Assignee | ||
Comment 21•6 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #20)
> Created attachment 8975169 [details] [diff] [review]
> Place bug 1348361 behind ifdef
>
> Fix OR'd ifdef.
s/OR/AND/
Comment 22•6 years ago
|
||
Comment on attachment 8975144 [details] [diff] [review]
Place bug 1348361 behind ifdef
Review of attachment 8975144 [details] [diff] [review]:
-----------------------------------------------------------------
Mostly looks good, but doesn't seem to handle the error case correctly — that is, the same way it was handled before the async pid changes, which (if I understand correctly) was that OtherPid would crash if it were called because it release-asserted that the pid was valid, but ContentParent::LaunchSubprocess would have returned false so it (probably?) wasn't being called.
::: dom/ipc/ContentParent.cpp
@@ +1546,5 @@
> }
> }
> #endif
>
> +#ifdef MOZ_CODE_COVERAGE && ASYNC_CONTENTPROC_LAUNCH
#if defined(MOZ_CODE_COVERAGE) && defined(ASYNC_CONTENTPROC_LAUNCH)
@@ +2067,4 @@
> if (!mSubprocess->Launch(extraArgs)) {
> +#else
> + if (!mSubprocess->LaunchAndWaitForProcessHandle(extraArgs)) {
> +#endif
This looks like a good place for a SetOtherProcessId(kInvalidProcessId, ProcessIdState::eError) to replace the one in OnProcessLaunchError.
::: ipc/glue/GeckoChildProcessHost.cpp
@@ +559,1 @@
> OnProcessLaunchError();
If OnProcessLaunchError isn't called, won't the pid state just stay at ePending and cause OtherPid to block forever instead of crashing?
Attachment #8975144 -
Attachment is obsolete: false
Comment 23•6 years ago
|
||
Comment on attachment 8975144 [details] [diff] [review]
Place bug 1348361 behind ifdef
(Fixing mid-air collision.)
Attachment #8975144 -
Attachment is obsolete: true
Comment 24•6 years ago
|
||
Comment on attachment 8975169 [details] [diff] [review]
Place bug 1348361 behind ifdef
Review of attachment 8975169 [details] [diff] [review]:
-----------------------------------------------------------------
See comment #22.
Attachment #8975169 -
Flags: review?(jld) → review-
Assignee | ||
Comment 25•6 years ago
|
||
Comment on attachment 8975169 [details] [diff] [review]
Place bug 1348361 behind ifdef
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #22)
> Comment on attachment 8975144 [details] [diff] [review]
> Place bug 1348361 behind ifdef
>
> Review of attachment 8975144 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Mostly looks good, but doesn't seem to handle the error case correctly —
> that is, the same way it was handled before the async pid changes, which (if
> I understand correctly) was that OtherPid would crash if it were called
> because it release-asserted that the pid was valid, but
> ContentParent::LaunchSubprocess would have returned false so it (probably?)
> wasn't being called.
This isn't trying to address the error case. All indicators are that we're not encountering the previous error case, but newly introduced errors from the patches in bug 1348361. The new crashes are top crashers. Previously, if there were crashes, they didn't seem to rank anywhere in the top crashers. This patch intends to be a minimal patch to revert to the behavior pre- bug 1348361 to be uplifted to beta.
> ::: dom/ipc/ContentParent.cpp
> @@ +1546,5 @@
> > }
> > }
> > #endif
> >
> > +#ifdef MOZ_CODE_COVERAGE && ASYNC_CONTENTPROC_LAUNCH
>
> #if defined(MOZ_CODE_COVERAGE) && defined(ASYNC_CONTENTPROC_LAUNCH)
>
> @@ +2067,4 @@
> > if (!mSubprocess->Launch(extraArgs)) {
> > +#else
> > + if (!mSubprocess->LaunchAndWaitForProcessHandle(extraArgs)) {
> > +#endif
>
> This looks like a good place for a SetOtherProcessId(kInvalidProcessId,
> ProcessIdState::eError) to replace the one in OnProcessLaunchError.
This is helpful for a patch that will address the new error cases from async process launching. However, it shouldn't be relevant to this patch that tries to revert to pre- bug 1348361 behavior.
> ::: ipc/glue/GeckoChildProcessHost.cpp
> @@ +559,1 @@
> > OnProcessLaunchError();
>
> If OnProcessLaunchError isn't called, won't the pid state just stay at
> ePending and cause OtherPid to block forever instead of crashing?
Without async process launching, this is set in ContentParent::OnChannelConnected.
Attachment #8975169 -
Flags: review- → review?(jld)
Assignee | ||
Comment 26•6 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #25)
> Comment on attachment 8975169 [details] [diff] [review]
> Place bug 1348361 behind ifdef
>
> (In reply to Jed Davis [:jld] (⏰UTC-6) from comment #22)
> > Comment on attachment 8975144 [details] [diff] [review]
> > ::: ipc/glue/GeckoChildProcessHost.cpp
> > @@ +559,1 @@
> > > OnProcessLaunchError();
> >
> > If OnProcessLaunchError isn't called, won't the pid state just stay at
> > ePending and cause OtherPid to block forever instead of crashing?
>
> Without async process launching, this is set in
> ContentParent::OnChannelConnected.
Oh, I misread your comment. Let me double-check.
Assignee | ||
Comment 27•6 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #26)
> (In reply to Stephen A Pohl [:spohl] from comment #25)
> > Comment on attachment 8975169 [details] [diff] [review]
> > Place bug 1348361 behind ifdef
> >
> > (In reply to Jed Davis [:jld] (⏰UTC-6) from comment #22)
> > > Comment on attachment 8975144 [details] [diff] [review]
> > > ::: ipc/glue/GeckoChildProcessHost.cpp
> > > @@ +559,1 @@
> > > > OnProcessLaunchError();
> > >
> > > If OnProcessLaunchError isn't called, won't the pid state just stay at
> > > ePending and cause OtherPid to block forever instead of crashing?
> >
> > Without async process launching, this is set in
> > ContentParent::OnChannelConnected.
>
> Oh, I misread your comment. Let me double-check.
Ok, so we launch the subprocess synchronously in ContentParent::LaunchSubprocess and OtherPid shouldn't get called if we were unable to launch the process successfully because we return early in the failure case.
Comment 28•6 years ago
|
||
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #4)
> Previous behavior was in DEBUG builds to crash, and in release builds to
> render a blank content area that could not be interacted with: no new tabs,
> no entering URLs into the browser.
Just want to highlight this. Assuming the analysis is correct, this is exposing a content process launch problem on Windows and OSX that up until now, we just ignored.
Looking through crashes I see most uptimes < 15 seconds, and I see a lot of foreign 3rd party libraries loaded in the browser on Windows.
We'll double down on it next week. Crash rates don't look that serious yet. Most of the signatures are 2x+ crash per install.
Comment 29•6 years ago
|
||
(In reply to Stephen A Pohl [:spohl] from comment #25)
> This isn't trying to address the error case. All indicators are that we're
> not encountering the previous error case, but newly introduced errors from
> the patches in bug 1348361. The new crashes are top crashers. Previously, if
> there were crashes, they didn't seem to rank anywhere in the top crashers.
> This patch intends to be a minimal patch to revert to the behavior pre- bug
> 1348361 to be uplifted to beta.
Right, but it looks like your patch will turn those crashes into hangs, and setting the state to eError would make them continue being crashes, like they were before bug 1348361.
> > If OnProcessLaunchError isn't called, won't the pid state just stay at
> > ePending and cause OtherPid to block forever instead of crashing?
>
> Without async process launching, this is set in
> ContentParent::OnChannelConnected.
But OnChannelConnected isn't called until the parent gets the child's Hello message, which it can't send if it didn't launch: https://searchfox.org/mozilla-central/rev/babf96cf0c37b608e9663e2af73a4d21819c4af1/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#551
Comment 30•6 years ago
|
||
Comment on attachment 8975169 [details] [diff] [review]
Place bug 1348361 behind ifdef
Review of attachment 8975169 [details] [diff] [review]:
-----------------------------------------------------------------
Summarizing discussion on IRC: ContentParent::LaunchSubprocess is private, and all of its call sites are static methods that destroy the object immediately if it returns false, so we don't need to care about the pid state in that case. It might be nice to assert that that can't happen, but not essential.
Attachment #8975169 -
Flags: review?(jld) → review+
Assignee | ||
Comment 31•6 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/250e4ae8ede9fce6f3f7a357438bc6fd6a56416f
Bug 1446900: Place bug 1348361 (async content process launching) behind an ifdef and disable it to avoid numerous top crashers. r=jld
Comment 32•6 years ago
|
||
bugherder |
Comment 33•6 years ago
|
||
Looking good so far with today's Nightlies! Stephen, can you please nominate this patch for Beta uplift as soon as you're comfortable doing? Ideally, I'd like to include it in Monday's 61.0b5 builds.
Flags: needinfo?(agaynor) → needinfo?(spohl.mozilla.bugs)
Assignee | ||
Comment 34•6 years ago
|
||
Clearing signatures that don't have the MOZ_RELEASE_ASSERT(mOtherPidState == ProcessIdState::eReady) crash reason. Furthermore, all crash reports for these two signatures occurred in versions prior to 61 when bug 1348361 landed.
Crash Signature: [@ mozilla::ipc::IToplevelProtocol::OtherPid]
[@ mozilla::ipc::IToplevelProtocol::OtherPid const]
[@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid]
[@ mozilla::ipc::IToplevelProtocol::~IToplevelProtocol]
[@ mozilla::ipc::IToplevelProtocol::Ope… → [@ mozilla::ipc::IToplevelProtocol::OtherPid]
[@ mozilla::ipc::IToplevelProtocol::OtherPid const]
[@ mozilla::ipc::IToplevelProtocol::OtherPidMaybeInvalid]
[@ mozilla::ipc::IToplevelProtocol::Open]
Assignee | ||
Comment 35•6 years ago
|
||
Comment on attachment 8975169 [details] [diff] [review]
Place bug 1348361 behind ifdef
Approval Request Comment
[Feature/Bug causing the regression]: Bug 1348361
[User impact if declined]: top crashes
[Is this code covered by automated tests?]: Launching of child processes is exercised throughout. However, there is no dedicated test for this patch that reverts to the way we used to launch child processes prior to bug 1348361.
[Has the fix been verified in Nightly?]: Yes. No crash reports on nightly since landing.
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: medium/minimal
[Why is the change risky/not risky?]: This reverts our code paths to the way we used to launch child processes prior to the landing of bug 1348361. However, since this isn't a straight backout, there is a risk of introducing new problems. Attempts were made to keep the patch as minimal and straightforward as possible.
[String changes made/needed]: none
Flags: needinfo?(spohl.mozilla.bugs)
Attachment #8975169 -
Flags: approval-mozilla-beta?
Comment 36•6 years ago
|
||
Comment on attachment 8975169 [details] [diff] [review]
Place bug 1348361 behind ifdef
Nightly crash data is still looking good, let's get this landed for 61.0b5. Is there anything in particular we should be on the lookout for from a regression standpoint after this lands?
Attachment #8975169 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 37•6 years ago
|
||
bugherder uplift |
Assignee | ||
Comment 38•6 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #36)
> Comment on attachment 8975169 [details] [diff] [review]
> Place bug 1348361 behind ifdef
>
> Nightly crash data is still looking good, let's get this landed for 61.0b5.
> Is there anything in particular we should be on the lookout for from a
> regression standpoint after this lands?
It is expected that there will be an uptick in users who encounter blank tabs that cannot be interacted with, since this was the behavior pre- bug 1348361. What would be of concern (and a regression) is if the landing of this patch caused any new crashes, or if the number of users encountering blank tabs outpaces the number of crashes that were reported here from landing bug 1348361.
Comment 39•6 years ago
|
||
Per IRC discussion with Stephen, we're going to move the re-enabling patch to a different bug to simplify tracking.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
Updated•6 years ago
|
Attachment #8975164 -
Attachment is obsolete: true
Attachment #8975164 -
Flags: review?(ted)
Comment 40•6 years ago
|
||
No crash reports from any builds including this fix -> verified.
Comment hidden (obsolete) |
Comment hidden (obsolete) |
Comment 44•5 years ago
|
||
Similar bug just happened.
https://bugzilla.mozilla.org/show_bug.cgi?id=1564127
Reopen?
Flags: needinfo?(ryanvm)
Comment 45•5 years ago
|
||
No, tracking it in a new bug is the right thing to do.
Flags: needinfo?(ryanvm)
You need to log in
before you can comment on or make changes to this bug.
Description
•