Lots of shutdown hangs with ShutdownAndMaybeGrabShutdownProfileFirst on the content MT stack
Categories
(Core :: Graphics, task, P3)
Tracking
()
People
(Reporter: mccr8, Unassigned)
References
(Blocks 1 open bug)
Details
Attachments
(1 obsolete file)
While looking through the shutdown hangs, I've noticed an oddly large number of them have the URL chrome://gfxsanity/content/sanitytest.html , which I think is some kind of self testing the graphics code does. Looking at 76 only, we have about 7600 shutdownkill hang reports, and around 1000 crash reports with the sanity test URL. In fact, it looks like almost all of the crash reports with sanitytest URLs (all but 6!) are hangs. Maybe there's nothing to be done, but maybe that's something worth investigating.
Here's a couple of examples:
bp-998a7de5-fc1a-4d22-8025-5e0530200312
bp-742d9ad8-a000-4166-84e3-db71f0200312
Reporter | ||
Comment 1•5 years ago
|
||
Maybe sanitytest.html is loaded in a content process that opens and shuts down quickly? That might cause issues. The hangs are in all sorts of different code, so it doesn't feel like the test itself would be the problem.
Updated•5 years ago
|
Comment 2•5 years ago
|
||
Has this been happening for a while is it a more recent thing?
Reporter | ||
Comment 3•5 years ago
|
||
Most of the crashes are on 75 and later, but maybe we just autosubmit content process crashes on beta. I'm not sure of this history of this. We've mostly been ignoring shutdown hangs until recently.
Comment 4•3 years ago
|
||
I cracked up one of those:
ntdll.dll!NtWaitForAlertByThreadId() Unbekannt
ntdll.dll!RtlSleepConditionVariableSRW() Unbekannt
KERNELBASE.dll!SleepConditionVariableSRW() Unbekannt
mozglue.dll!mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl & lock) Zeile 50 C++
[Inlineframe] xul.dll!mozilla::OffTheBooksCondVar::Wait() Zeile 58 C++
[Inlineframe] xul.dll!mozilla::TaskController::GetRunnableForMTTask(bool aReallyWait) Zeile 586 C++
xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Zeile 1139 C++
[Inlineframe] xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) Zeile 465 C++
[Inlineframe] xul.dll!mozilla::SpinEventLoopUntil(const nsTSubstring<char> & aVeryGoodReasonToDoThis, mozilla::ThreadEventTarget::Dispatch::<lambda_4> && aPredicate, nsIThread * aThread) Zeile 176 C++
> [Inlineframe] xul.dll!mozilla::ThreadEventTarget::Dispatch(already_AddRefed<nsIRunnable> aEvent, unsigned int aFlags) Zeile 89 C++
xul.dll!nsThread::Dispatch(already_AddRefed<nsIRunnable> aEvent, unsigned int aFlags) Zeile 676 C++
xul.dll!mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst(nsTString<char> * aOutShutdownProfile) Zeile 107 C++
xul.dll!mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown() Zeile 60 C++
xul.dll!mozilla::dom::ContentChild::ShutdownInternal() Zeile 3026 C++
xul.dll!mozilla::dom::ContentChild::RecvShutdown() Zeile 2968 C++
xul.dll!mozilla::dom::PContentChild::OnMessageReceived(const IPC::Message & msg__) Zeile 12123 C++
It seems something goes wrong with the DISPATCH_SYNC
and we are stuck inside its SpinEventLoopUntil
. AFAICS from that crash, the ProfilerChild
thread is just waiting for new events:
> mozglue.dll!mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl & lock) Zeile 50 C++
[Inlineframe] xul.dll!mozilla::OffTheBooksCondVar::Wait() Zeile 58 C++
[Inlineframe] xul.dll!mozilla::ThreadEventQueue::GetEvent(bool aMayWait, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> * aLastEventDelay) Zeile 179 C++
xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Zeile 1141 C++
[Inlineframe] xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) Zeile 465 C++
xul.dll!mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate * aDelegate) Zeile 330 C++
[Inlineframe] xul.dll!MessageLoop::RunInternal() Zeile 381 C++
xul.dll!MessageLoop::RunHandler() Zeile 375 C++
xul.dll!MessageLoop::Run() Zeile 357 C++
That would imply, that nsThreadSyncDispatch::Run
ran on the target thread, unless there is a way for bool success = mSink->PutEvent(...)
to fail without returning false
.
Comment 5•3 years ago
|
||
There might be an edge case around mIsShutDown
here while dispatching back to the mOrigin
, though it is not clear to me how that could have been triggered.
Updated•3 years ago
|
Comment 6•3 years ago
•
|
||
Looking at a newer instance cb421913-16e7-4779-9147-d937f0230105 there is a very strange thing in the annotation:
IPCShutdownState:
- NotifiedImpendingShutdown
- HangMonitorChild::RecvRequestContentJSInterrupt (expected)
- NotifiedImpendingShutdown
- HangMonitorChild::RecvRequestContentJSInterrupt (expected)
- RecvShutdownConfirmedHP entry
- RecvShutdown entry
- content-child-will-shutdown started
- ShutdownInternal entry
- content-child-shutdown started
- StartForceKillTimer
- SendFinishShutdown (sending)
- SendFinishShutdown (sent)
where SendFinishShutdown (sent)
would mean we actually arrived at the very end of ShutdownInternal
. But the hanging stack tells us, we are only at the beginning.
But the annotation does not tell us that we entered ShutdownInternal entry
twice, so I am really puzzled how this can happen.
Comment 7•3 years ago
|
||
All instances that carry SendFinishShutdown (sent)
as IPCShutdownState
I clicked on seem to be stuck inside ShutdownAndMaybeGrabShutdownProfileFirst
, so it does not depend on the URL.
IIUC bug 1710607 tried to solve a similar problem, but apparently it did just change the symptoms.
My naive guess would be that the content process receives the shutdown message twice, though I'd have expected the ProfilerChild
thread to be long gone then and/or to see more/different annotations (but those might just not have arrived anymore having basically shut down everything after SendFinishShutdown (sent)
?).
Comment 8•3 years ago
|
||
In theory this should never happen, but we want to be explicitly sure it doesn't.
Updated•3 years ago
|
Comment 9•3 years ago
•
|
||
I do not expect that patch to solve this issue, at least from code reading it seems impossible that the parent sends the shutdown message twice. But if it does we should understand, why (and I do not have a better guess for now).
Updated•2 years ago
|
Updated•2 years ago
|
Comment 10•2 years ago
|
||
This is still happening from time to time, after bug 1825360 the stack slightly changed, but in principle this is still a sync dispatch.
Apparently we dispatch ChildProfilerController::ShutdownProfilerChild
to the ProfilerChild
thread (so the event must have arrived in its queue) and then wait for the event to finish. But that thread is just idling, according to its stack here. It seems unlikely that the event ran (as we are waiting for mIsPending
and that is always executed inside the task's Run
).
Ways how this might happen:
- something removed the event from the queue without executing it (but I see no sign for this in the code I looked at)
- the machine is preferring to give cycles to other processes than us
- cosmic rays...
At least in this stack trace I see not much actionable.
Comment 11•2 years ago
|
||
Looking at some recent instances it seems that many of them carry the ProfilerChildShutdownPhase : "Not profiling - SendShutdownProfile (sent)"
annotation and hang inside ChildProfilerController::GrabShutdownProfileAndShutdown
.
Now that annotation is written only after we did GrabShutdownProfileAndShutdown
- where we hang in all of these crashes I looked at. That would indicate we come here twice.
And the IPCShutdownState annotation shows a full cycle of shutdown already, too:
- NotifiedImpendingShutdown - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - content-child-shutdown started - StartForceKillTimer - SendFinishShutdown (sending) - SendFinishShutdown (sent)
That seems to indicate that while we are waiting for the parent process to acknowledge we are finished we:
- receive a
ContentChild::RecvInitProfiler
message, otherwisemProfileController
would be nullptr - receive another
ContentChild::RecvShutdown
that will make us hang
I can only speculate why we do not see another RecvShutdown entry
in this case, though.
Anyhow it seems that there is something weird ongoing with asking us to shutdown more than once.
Description
•