NotifyTabDestroying may start the ShutDownKill timer but apparently BrowserParent::ActorDestroy did not arrive (in time?)
Categories
(Core :: XPCOM, task)
Tracking
()
People
(Reporter: jstutte, Unassigned)
References
(Blocks 1 open bug)
Details
We see some child process shutdown hangs where the main thread is apparently just waiting to be notified on mMainThreadCV
.
Example of main thread stack (build id 20220904213226)
ntdll.dll!NtWaitForAlertByThreadId() Unbekannt
ntdll.dll!RtlSleepConditionVariableSRW() Unbekannt
KERNELBASE.dll!SleepConditionVariableSRW() Unbekannt
mozglue.dll!mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl & lock) Zeile 50 C++
[Inlineframe] xul.dll!mozilla::OffTheBooksCondVar::Wait() Zeile 58 C++
xul.dll!mozilla::TaskController::ProcessPendingMTTask(bool aMayWait) Zeile 474 C++
[Inlineframe] xul.dll!mozilla::TaskController::InitializeInternal::<lambda_2>::operator()() Zeile 190 C++
[Inlineframe] xul.dll!mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:190:7'>::Run() Zeile 531 C++
> xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Zeile 1205 C++
[Inlineframe] xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) Zeile 465 C++
xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate * aDelegate) Zeile 107 C++
[Inlineframe] xul.dll!MessageLoop::RunInternal() Zeile 381 C++
xul.dll!MessageLoop::RunHandler() Zeile 375 C++
xul.dll!MessageLoop::Run() Zeile 357 C++
xul.dll!nsBaseAppShell::Run() Zeile 152 C++
This TaskController
condvar has been added by bug 1606706. IIUC before that we would have never waited here.
Edit: See comment 6.
Reporter | ||
Updated•2 years ago
|
Updated•2 years ago
|
Comment 2•2 years ago
|
||
I mean.. it's meant to block if you pass aMayWait equals true, it's possible the -calling- code is incorrectly passing that, but I don't clearly see any potential for a bug in here? Like if there's a task it immediately breaks from the loop and will exit the function, the only way I could ever see this waiting, is if it's called with aMayWait is true, and there's no task in the queue, which is exactly what it's supposed to do. aMayWait is not supposed to be called with aMayWait is true on shutdown. See: https://hg.mozilla.org/mozilla-central/file/93ba1d57fd3385fe992e8723dfdd27169404b511/xpcom/threads/nsThread.cpp#l1077
Reporter | ||
Comment 3•2 years ago
•
|
||
If I inspect the crash cited above with VS related to the condition bool reallyWait = aMayWait && (mNestedEventLoopDepth > 0 || !ShuttingDown());
, it seems that (if I can trust the nsThread
this
from the dump, but it looks all reasonable):
mNestedEventLoopDepth
is1
which means it must have been0
when we were evaluatingreallyWait
mShutdownContext
isnullptr
, thus we are not yet shutting down the main thread
which would mean that reallyWait
should have been true
because we called NS_ProcessNextEvent(thisThread, true);
.
So I'd assume the main thread entered this state as expected before we were expecting any shutdown.
IIUC mMainThreadCV.Notify();
happens only from EnsureMainThreadTasksScheduled
which in turn is called in three places:
TaskController::ProcessPendingMTTask
which we can probably exclude to be the one we wait for as we are in there on our same stackTaskController::MaybeInterruptTask
TaskController::RunPoolThread
In the dump, all 4 TaskController
threads are waiting on mThreadPoolCV.Wait();
.
None of the other threads shows sign of any activity that could schedule a new task. In particular the IPC I/O thread seems to be just waiting for input, as well.
The "IPCShutdownState" annotation says - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected)
. As expected, the paired dump from the parent browser is inside ContentParent::KillHard
run by a timer. So we should have passed here:
SignalImpendingShutdownToContentJS();
StartForceKillTimer();
if (aSendShutDown) {
MaybeAsyncSendShutDownMessage();
}
So if aSendShutDown
is not true
, we will not send any shutdown message to the content process but interrupt content JS and start the force kill timer, which might explain this situation if nothing else happens inside the content process.
This is the case for ContentParent::NotifyTabDestroying
.
I am wondering if we should even start the ForceKillTimer
in this case? IIUC we send the shutdown message later as a consequence inside ContentParent::NotifyTabDestroyed
and that would be the right moment to activate the timer ?
Reporter | ||
Comment 4•2 years ago
|
||
IIUC NotifyTabDestroyed
can only come from BrowserParent::ActorDestroy
, which in turn seems to be triggered by the DelayedDeleteRunnable
dispatched inside BrowserChild::RecvDestroy
.
SendDestroy();
inside DestroyInternal
comes from BrowserParent::Destroy
where I read:
DestroyInternal();
mIsDestroyed = true;
Manager()->NotifyTabDestroying();
IIUC we rely in our book-keeping on the right order between NotifyTabDestroyed
and NotifyTabDestroying
- could it be that we see a race where we end up with NotifyTabDestroyed
being called before its corrisponding NotifyTabDestroying
?
Reporter | ||
Comment 5•2 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #4)
IIUC we rely in our book-keeping on the right order between
NotifyTabDestroyed
andNotifyTabDestroying
- could it be that we see a race where we end up withNotifyTabDestroyed
being called before its corrispondingNotifyTabDestroying
?
Actually I don't think that can happen, given that we run both on the same thread, IIUC.
Reporter | ||
Comment 6•2 years ago
•
|
||
Just a recap of my understanding of how we do our book keeping here:
NotifyTabDestroying
is meant to mark a content process as dead, ifmNumDestroyingTabs
reaches the total number ofBrowserParent
s managed by thisContentParent
. Only in this case it will start theForceKillTimer
and mark it as dead.NotifyTabDestroyed
is meant to actually shutdown the process if the number of managedBrowserParent
s reached 1
The symptoms we see in the crash could well apply to a situation where we did 1. but 2. did not happen or did not find the right number of expected browser parents for whatever reasons and thus did not trigger the actual shutdown.
But I cannot see anything obvious here that would make us fail our book keeping here, assuming that MarkAsDead
will always prevent us from adding a new BrowserParent
and that BrowserParent::ActorDestroy
is triggered for all instances.
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Comment 7•2 years ago
|
||
ContentParent::NotifyTabDestroying
calls MaybeBeginShutDown
with aSendShutDown == false
.
MaybeBeginShutDown
will do StartForceKillTimer();
regardless of aSendShutDown == false
but will not send anything to the child.
BrowserParent::ActorDestroy
will then NotifyTabDestroyed
which will finally cause the shutdown message to be sent.
If we wait to long (or endlessly, who knows?) for BrowserParent::ActorDestroy
to happen in the parent we might see this situation, also in some recent hangs like this.
Reporter | ||
Comment 8•1 year ago
|
||
It is amazing that I almost forgot about this and made the same analysis (but apparently better) on bug 1837467.
Description
•