Open Bug 1622110 Opened 5 years ago Updated 2 years ago

Lots of shutdown hangs with ShutdownAndMaybeGrabShutdownProfileFirst on the content MT stack

Categories

(Core :: Graphics, task, P3)

task

Tracking

()

People

(Reporter: mccr8, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 obsolete file)

While looking through the shutdown hangs, I've noticed an oddly large number of them have the URL chrome://gfxsanity/content/sanitytest.html , which I think is some kind of self testing the graphics code does. Looking at 76 only, we have about 7600 shutdownkill hang reports, and around 1000 crash reports with the sanity test URL. In fact, it looks like almost all of the crash reports with sanitytest URLs (all but 6!) are hangs. Maybe there's nothing to be done, but maybe that's something worth investigating.

Here's a couple of examples:
bp-998a7de5-fc1a-4d22-8025-5e0530200312
bp-742d9ad8-a000-4166-84e3-db71f0200312

Maybe sanitytest.html is loaded in a content process that opens and shuts down quickly? That might cause issues. The hangs are in all sorts of different code, so it doesn't feel like the test itself would be the problem.

Flags: needinfo?(jmuizelaar)
Priority: -- → P3

Has this been happening for a while is it a more recent thing?

Flags: needinfo?(jmuizelaar) → needinfo?(continuation)

Most of the crashes are on 75 and later, but maybe we just autosubmit content process crashes on beta. I'm not sure of this history of this. We've mostly been ignoring shutdown hangs until recently.

Flags: needinfo?(continuation)
See Also: → 1625844

This is still happening

I cracked up one of those:

 	ntdll.dll!NtWaitForAlertByThreadId()	Unbekannt
 	ntdll.dll!RtlSleepConditionVariableSRW()	Unbekannt
 	KERNELBASE.dll!SleepConditionVariableSRW()	Unbekannt
 	mozglue.dll!mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl & lock) Zeile 50	C++
 	[Inlineframe] xul.dll!mozilla::OffTheBooksCondVar::Wait() Zeile 58	C++
 	[Inlineframe] xul.dll!mozilla::TaskController::GetRunnableForMTTask(bool aReallyWait) Zeile 586	C++
 	xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Zeile 1139	C++
 	[Inlineframe] xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) Zeile 465	C++
 	[Inlineframe] xul.dll!mozilla::SpinEventLoopUntil(const nsTSubstring<char> & aVeryGoodReasonToDoThis, mozilla::ThreadEventTarget::Dispatch::<lambda_4> && aPredicate, nsIThread * aThread) Zeile 176	C++
>	[Inlineframe] xul.dll!mozilla::ThreadEventTarget::Dispatch(already_AddRefed<nsIRunnable> aEvent, unsigned int aFlags) Zeile 89	C++
 	xul.dll!nsThread::Dispatch(already_AddRefed<nsIRunnable> aEvent, unsigned int aFlags) Zeile 676	C++
 	xul.dll!mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst(nsTString<char> * aOutShutdownProfile) Zeile 107	C++
 	xul.dll!mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown() Zeile 60	C++
 	xul.dll!mozilla::dom::ContentChild::ShutdownInternal() Zeile 3026	C++
 	xul.dll!mozilla::dom::ContentChild::RecvShutdown() Zeile 2968	C++
 	xul.dll!mozilla::dom::PContentChild::OnMessageReceived(const IPC::Message & msg__) Zeile 12123	C++

It seems something goes wrong with the DISPATCH_SYNC and we are stuck inside its SpinEventLoopUntil. AFAICS from that crash, the ProfilerChild thread is just waiting for new events:

>	mozglue.dll!mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl & lock) Zeile 50	C++
 	[Inlineframe] xul.dll!mozilla::OffTheBooksCondVar::Wait() Zeile 58	C++
 	[Inlineframe] xul.dll!mozilla::ThreadEventQueue::GetEvent(bool aMayWait, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> * aLastEventDelay) Zeile 179	C++
 	xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Zeile 1141	C++
 	[Inlineframe] xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) Zeile 465	C++
 	xul.dll!mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate * aDelegate) Zeile 330	C++
 	[Inlineframe] xul.dll!MessageLoop::RunInternal() Zeile 381	C++
 	xul.dll!MessageLoop::RunHandler() Zeile 375	C++
 	xul.dll!MessageLoop::Run() Zeile 357	C++

That would imply, that nsThreadSyncDispatch::Run ran on the target thread, unless there is a way for bool success = mSink->PutEvent(...) to fail without returning false.

There might be an edge case around mIsShutDown here while dispatching back to the mOrigin, though it is not clear to me how that could have been triggered.

Severity: normal → S3

Looking at a newer instance cb421913-16e7-4779-9147-d937f0230105 there is a very strange thing in the annotation:

IPCShutdownState:
 - NotifiedImpendingShutdown
 - HangMonitorChild::RecvRequestContentJSInterrupt (expected)
 - NotifiedImpendingShutdown
 - HangMonitorChild::RecvRequestContentJSInterrupt (expected)
 - RecvShutdownConfirmedHP entry
 - RecvShutdown entry
 - content-child-will-shutdown started
 - ShutdownInternal entry
 - content-child-shutdown started
 - StartForceKillTimer
 - SendFinishShutdown (sending)
 - SendFinishShutdown (sent)

where SendFinishShutdown (sent) would mean we actually arrived at the very end of ShutdownInternal. But the hanging stack tells us, we are only at the beginning.

But the annotation does not tell us that we entered ShutdownInternal entry twice, so I am really puzzled how this can happen.

All instances that carry SendFinishShutdown (sent) as IPCShutdownState I clicked on seem to be stuck inside ShutdownAndMaybeGrabShutdownProfileFirst, so it does not depend on the URL.

IIUC bug 1710607 tried to solve a similar problem, but apparently it did just change the symptoms.

My naive guess would be that the content process receives the shutdown message twice, though I'd have expected the ProfilerChild thread to be long gone then and/or to see more/different annotations (but those might just not have arrived anymore having basically shut down everything after SendFinishShutdown (sent)?).

Summary: Lots of shutdown hangs with URL chrome://gfxsanity/content/sanitytest.html → Lots of shutdown hangs with ShutdownAndMaybeGrabShutdownProfileFirst on the content MT stack
See Also: → 1710607

In theory this should never happen, but we want to be explicitly sure it doesn't.

Assignee: nobody → jstutte
Status: NEW → ASSIGNED

I do not expect that patch to solve this issue, at least from code reading it seems impossible that the parent sends the shutdown message twice. But if it does we should understand, why (and I do not have a better guess for now).

Keywords: leave-open
Attachment #9311032 - Attachment is obsolete: true
Assignee: jstutte → nobody
Status: ASSIGNED → NEW
Keywords: leave-open

This is still happening from time to time, after bug 1825360 the stack slightly changed, but in principle this is still a sync dispatch.

Apparently we dispatch ChildProfilerController::ShutdownProfilerChild to the ProfilerChild thread (so the event must have arrived in its queue) and then wait for the event to finish. But that thread is just idling, according to its stack here. It seems unlikely that the event ran (as we are waiting for mIsPending and that is always executed inside the task's Run).

Ways how this might happen:

  • something removed the event from the queue without executing it (but I see no sign for this in the code I looked at)
  • the machine is preferring to give cycles to other processes than us
  • cosmic rays...

At least in this stack trace I see not much actionable.

Looking at some recent instances it seems that many of them carry the ProfilerChildShutdownPhase : "Not profiling - SendShutdownProfile (sent)" annotation and hang inside ChildProfilerController::GrabShutdownProfileAndShutdown.

Now that annotation is written only after we did GrabShutdownProfileAndShutdown - where we hang in all of these crashes I looked at. That would indicate we come here twice.

And the IPCShutdownState annotation shows a full cycle of shutdown already, too:

 - NotifiedImpendingShutdown - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - content-child-shutdown started - StartForceKillTimer - SendFinishShutdown (sending) - SendFinishShutdown (sent)

That seems to indicate that while we are waiting for the parent process to acknowledge we are finished we:

I can only speculate why we do not see another RecvShutdown entry in this case, though.

Anyhow it seems that there is something weird ongoing with asking us to shutdown more than once.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: