Closed Bug 1789803 Opened 2 years ago Closed 1 year ago

NotifyTabDestroying may start the ShutDownKill timer but apparently BrowserParent::ActorDestroy did not arrive (in time?)

Categories

(Core :: XPCOM, task)

task

Tracking

()

RESOLVED DUPLICATE of bug 1837467

People

(Reporter: jstutte, Unassigned)

References

(Blocks 1 open bug)

Details

We see some child process shutdown hangs where the main thread is apparently just waiting to be notified on mMainThreadCV.

Example of main thread stack (build id 20220904213226)

 	ntdll.dll!NtWaitForAlertByThreadId()	Unbekannt
 	ntdll.dll!RtlSleepConditionVariableSRW()	Unbekannt
 	KERNELBASE.dll!SleepConditionVariableSRW()	Unbekannt
 	mozglue.dll!mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl & lock) Zeile 50	C++
 	[Inlineframe] xul.dll!mozilla::OffTheBooksCondVar::Wait() Zeile 58	C++
 	xul.dll!mozilla::TaskController::ProcessPendingMTTask(bool aMayWait) Zeile 474	C++
 	[Inlineframe] xul.dll!mozilla::TaskController::InitializeInternal::<lambda_2>::operator()() Zeile 190	C++
 	[Inlineframe] xul.dll!mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:190:7'>::Run() Zeile 531	C++
>	xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Zeile 1205	C++
 	[Inlineframe] xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) Zeile 465	C++
 	xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate * aDelegate) Zeile 107	C++
 	[Inlineframe] xul.dll!MessageLoop::RunInternal() Zeile 381	C++
 	xul.dll!MessageLoop::RunHandler() Zeile 375	C++
 	xul.dll!MessageLoop::Run() Zeile 357	C++
 	xul.dll!nsBaseAppShell::Run() Zeile 152	C++

This TaskController condvar has been added by bug 1606706. IIUC before that we would have never waited here.

Edit: See comment 6.

Blocks: 1789810
No longer blocks: IPCError_ShutDownKill

Might be related to bug 1678330.

See Also: → 1678330
Flags: needinfo?(bas)

I mean.. it's meant to block if you pass aMayWait equals true, it's possible the -calling- code is incorrectly passing that, but I don't clearly see any potential for a bug in here? Like if there's a task it immediately breaks from the loop and will exit the function, the only way I could ever see this waiting, is if it's called with aMayWait is true, and there's no task in the queue, which is exactly what it's supposed to do. aMayWait is not supposed to be called with aMayWait is true on shutdown. See: https://hg.mozilla.org/mozilla-central/file/93ba1d57fd3385fe992e8723dfdd27169404b511/xpcom/threads/nsThread.cpp#l1077

Flags: needinfo?(bas)

If I inspect the crash cited above with VS related to the condition bool reallyWait = aMayWait && (mNestedEventLoopDepth > 0 || !ShuttingDown());, it seems that (if I can trust the nsThread this from the dump, but it looks all reasonable):

  • mNestedEventLoopDepth is 1 which means it must have been 0 when we were evaluating reallyWait
  • mShutdownContext is nullptr, thus we are not yet shutting down the main thread

which would mean that reallyWait should have been true because we called NS_ProcessNextEvent(thisThread, true);.

So I'd assume the main thread entered this state as expected before we were expecting any shutdown.

IIUC mMainThreadCV.Notify(); happens only from EnsureMainThreadTasksScheduled which in turn is called in three places:

  • TaskController::ProcessPendingMTTask which we can probably exclude to be the one we wait for as we are in there on our same stack
  • TaskController::MaybeInterruptTask
  • TaskController::RunPoolThread

In the dump, all 4 TaskController threads are waiting on mThreadPoolCV.Wait();.

None of the other threads shows sign of any activity that could schedule a new task. In particular the IPC I/O thread seems to be just waiting for input, as well.

The "IPCShutdownState" annotation says - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) . As expected, the paired dump from the parent browser is inside ContentParent::KillHard run by a timer. So we should have passed here:

  SignalImpendingShutdownToContentJS();
  StartForceKillTimer();

  if (aSendShutDown) {
    MaybeAsyncSendShutDownMessage();
  }

So if aSendShutDown is not true, we will not send any shutdown message to the content process but interrupt content JS and start the force kill timer, which might explain this situation if nothing else happens inside the content process.

This is the case for ContentParent::NotifyTabDestroying.

I am wondering if we should even start the ForceKillTimer in this case? IIUC we send the shutdown message later as a consequence inside ContentParent::NotifyTabDestroyed and that would be the right moment to activate the timer ?

Flags: needinfo?(smaug)

IIUC NotifyTabDestroyed can only come from BrowserParent::ActorDestroy, which in turn seems to be triggered by the DelayedDeleteRunnable dispatched inside BrowserChild::RecvDestroy.
SendDestroy(); inside DestroyInternal comes from BrowserParent::Destroy where I read:

  DestroyInternal();

  mIsDestroyed = true;

  Manager()->NotifyTabDestroying();

IIUC we rely in our book-keeping on the right order between NotifyTabDestroyed and NotifyTabDestroying - could it be that we see a race where we end up with NotifyTabDestroyed being called before its corrisponding NotifyTabDestroying ?

(In reply to Jens Stutte [:jstutte] from comment #4)

IIUC we rely in our book-keeping on the right order between NotifyTabDestroyed and NotifyTabDestroying - could it be that we see a race where we end up with NotifyTabDestroyed being called before its corrisponding NotifyTabDestroying ?

Actually I don't think that can happen, given that we run both on the same thread, IIUC.

Just a recap of my understanding of how we do our book keeping here:

  1. NotifyTabDestroying is meant to mark a content process as dead, if mNumDestroyingTabs reaches the total number of BrowserParents managed by this ContentParent. Only in this case it will start the ForceKillTimer and mark it as dead.
  2. NotifyTabDestroyed is meant to actually shutdown the process if the number of managed BrowserParents reached 1

The symptoms we see in the crash could well apply to a situation where we did 1. but 2. did not happen or did not find the right number of expected browser parents for whatever reasons and thus did not trigger the actual shutdown.

But I cannot see anything obvious here that would make us fail our book keeping here, assuming that MarkAsDead will always prevent us from adding a new BrowserParent and that BrowserParent::ActorDestroy is triggered for all instances.

Summary: Verify if ProcessPendingMTTTask might block in unknown edge cases → NotifyTabDestroying may start the ShutDownKill timer but apparently BrowserParent::ActorDestroy did not arrive (in time?)
Flags: needinfo?(smaug)

ContentParent::NotifyTabDestroying calls MaybeBeginShutDown with aSendShutDown == false.

MaybeBeginShutDown will do StartForceKillTimer(); regardless of aSendShutDown == false but will not send anything to the child.

BrowserParent::ActorDestroy will then NotifyTabDestroyed which will finally cause the shutdown message to be sent.

If we wait to long (or endlessly, who knows?) for BrowserParent::ActorDestroy to happen in the parent we might see this situation, also in some recent hangs like this.

It is amazing that I almost forgot about this and made the same analysis (but apparently better) on bug 1837467.

Status: NEW → RESOLVED
Closed: 1 year ago
Duplicate of bug: 1837467
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.