Closed Bug 1673711 Opened 3 months ago Closed 2 months ago

Release assert crash on pid != kInvalidProcess in [@ mozilla::dom::ContentParent::KillHard]

Categories

(Core :: DOM: Content Processes, defect, P1)

Unspecified
Windows 7
defect

Tracking

()

RESOLVED FIXED
84 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox81 --- unaffected
firefox82 --- unaffected
firefox83 + fixed
firefox84 --- fixed

People

(Reporter: aryx, Assigned: nika)

Details

(Keywords: crash)

Crash Data

Attachments

(3 files)

2 of these crashes today with 84.0a1 20201027095021 - both on Windows 7 but on different machines.

Crash report: https://crash-stats.mozilla.org/report/index/8d9141ca-8479-4dea-a86f-6831b0201027

MOZ_CRASH Reason: MOZ_RELEASE_ASSERT(pid != kInvalidProcessId)

Top 10 frames of crashing thread:

0 xul.dll mozilla::dom::ContentParent::KillHard dom/ipc/ContentParent.cpp:3844
1 xul.dll nsTimerEvent::Run xpcom/threads/TimerThread.cpp:251
2 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:515
3 xul.dll mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:85:7'>::Run xpcom/threads/nsThreadUtils.h:577
4 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1197
5 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:87
6 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:327
7 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:309
8 xul.dll nsBaseAppShell::Run widget/nsBaseAppShell.cpp:137
9 xul.dll nsAppShell::Run widget/windows/nsAppShell.cpp:602

This looks like an assert from mozilla::ipc::IToplevelProtocol::OtherPid, which doesn't seem like quite the same thing as the stack.

Wait, I got confused. KillHard does call OtherPid here: if (!base::OpenProcessHandle(OtherPid(), &otherProcessHandle)) {, but that doesn't quite match up with the line number.

Presumably this could happen if we call KillHard before we call ContentParent::OnChannelConnected() to set the PID?

These are the change sets added in the build where this first appeared: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b1a74943bc51bd3e62ea52242ec5e403ea3760bb&tochange=3d6ed9f4cf340215fb48fe1a8d414ac7e18488e7

Bug 1670557 is in that range and involves ITopLevelProtocol. It isn't obvious to me how that would cause this.

Flags: needinfo?(nika)

I found a few other crashes with this same release assert in the last few months, but they weren't from KillHard.

Summary: Crash in [@ mozilla::dom::ContentParent::KillHard] → Release assert crash on pid != kInvalidProcessI in [@ mozilla::dom::ContentParent::KillHard]
Summary: Release assert crash on pid != kInvalidProcessI in [@ mozilla::dom::ContentParent::KillHard] → Release assert crash on pid != kInvalidProcess in [@ mozilla::dom::ContentParent::KillHard]

If this started with bug 1670557, then I presume that these crashes used to crash in that code, and due to the fix are now making it later and crashing somewhere else. The function where the crash previously started was ContentParent::StartForceKillTimer, which would start a timer to ForceKill the content process within a certain timeframe.

The stacks we see here make sense in that context, in that StartForceKillTimer was previously crashing if called before the actor was opened, but now will succeed in starting the timer. When the timer resolves, it will call KillHard with a timer on the stack (https://searchfox.org/mozilla-central/rev/c409dd9235c133ab41eba635f906aa16e050c197/dom/ipc/ContentParent.cpp#3768). The crash is then likely happening here, when we read OtherPid() to get which process to kill. (https://searchfox.org/mozilla-central/rev/c409dd9235c133ab41eba635f906aa16e050c197/dom/ipc/ContentParent.cpp#3833)

I'm not sure why the line numbers in the crash report don't line up here, but I'm inclined to believe the MozCrashReason over the stack.

Looking at this a bit more, I am seeing other issues which can happen when we kill a content process before it's been connected, which should probably be looked into. We may need a new "LaunchingDead" state for a content process which has already been marked as dead, but is still launching and therefore doesn't have a PID yet.

Tracking for 83 as it blocks uplifting the crash fix in bug 1670557

Assignee: nobody → nika
Flags: needinfo?(nika)

Making P1 because this bug blocks high-volume crash bug 1670557.

Severity: -- → S3
Priority: -- → P1

Previously, it was possible to accidentaly shut down a content process while it
was still launching if the process switch which requested the switch was
cancelled.

mIPCOpen would incorrectly be true before the actual IPC channel had been
opened. By instead using the IPC-provided CanSend() method, we can avoid this.

Comment on attachment 9185771 [details]
Bug 1673711 - Part 1: Don't shutdown processes before launching,

Beta/Release Uplift Approval Request

  • User impact if declined: High-frequency crash from bug 1670557
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: Bug 1670557
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Preemptively requesting beta review, as we have a tight schedule to uplift the fixes if we will be landing them.

I believe that the first part is all that is necessary in order to uplift fix the high-frequency crash from bug 1670557. Parts 2 and 3 are more in-depth fixes to how we launch processes which aim to also fix other potential ways this crash could be triggered.

Marking as medium risk largely because of the fragility of the process launch codepath, and the patch being untested on nightly.

I've chosen to only request approval for the first patch in the stack, in order to keep the change to uplift as small as possible, but we may want to uplift the entire stack for a more thorough fix to the issue.

  • String changes made/needed: None
Attachment #9185771 - Flags: approval-mozilla-beta?

Comment on attachment 9185771 [details]
Bug 1673711 - Part 1: Don't shutdown processes before launching,

Topcrash mitigation for our 83 release, uplift approved for our last beta, thanks Nika!
(We will take this patch in beta 9 even though it is not reviewed yet as we need time to evaluate the impact on crashes in bug 1670557)

Attachment #9185771 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Pushed by nlayzell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/337c396332ae
Part 1: Don't shutdown processes before launching, r=jesup
https://hg.mozilla.org/integration/autoland/rev/9395730616c3
Part 2: Use CanSend instead of mIPCOpen in ContentParent, r=kmag
https://hg.mozilla.org/integration/autoland/rev/1086a6f8a332
Part 3: Handle LaunchSubprocessResolve on an already-dead content process, r=kmag

Comment on attachment 9185772 [details]
Bug 1673711 - Part 2: Use CanSend instead of mIPCOpen in ContentParent,

Beta/Release Uplift Approval Request

  • User impact if declined: See comment 12
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): The previous attempt to fix this issue (just uplifting part 1) did not succeed. I think, based on my tracing of the code, that uplifting part 2 as well will fix the issue, but I could easily be wrong again.

I believe that the issue is being caused by a different reason than I originally expected. Instead of the issue being caused by a cancelled process switch during a launch, I think it's being caused by a failed process launch, likely during shutdown, and then some buggy code attempting to start a new shutdown after the process launch already failed.

The code in this part would handle this situation. Previously the failed process launch codepath would have left mIPCOpen as true, whereas CanSend() will correctly return false, meaning that methods will early abort before the crashy calls.

It may be possible to uplift a smaller patch by adding targeted CanSend() calls in specific methods. Adding a single CanSend() check to ContentParent::MaybeBeginShutDown() before the call to StartForceKillTimer (https://searchfox.org/mozilla-beta/rev/805b8b92155021238969abffa8e36c0cb21e34f3/dom/ipc/ContentParent.cpp#1573) should also fix this specific case, but may be less thorough.

   // We're dying now, prevent anything from re-using this process.
   MarkAsDead();
+
+  // We can't shut down without an IPC connection.
+  if (!CanSend()) {
+    return;
+  }
+
   StartForceKillTimer();

I would like to write an automated test case for this type of failure to make sure we don't regress it again, and I have some ideas for how it could be possible, but I don't think it can be written in a reasonable timeframe for uplift.

  • String changes made/needed: None
Attachment #9185772 - Flags: approval-mozilla-beta?

Comment on attachment 9185772 [details]
Bug 1673711 - Part 2: Use CanSend instead of mIPCOpen in ContentParent,

This is still our biggest new top crasher in 83 beta, so let's take this patch in our RC build, thanks!

Attachment #9185772 - Flags: approval-mozilla-beta? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.