Closed Bug 1613798 Opened 4 years ago Closed 3 years ago

Crash in [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst]

Categories

(Core :: Gecko Profiler, defect, P1)

defect

Tracking

()

RESOLVED FIXED
90 Branch
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox74 --- wontfix
firefox75 --- wontfix
firefox76 --- wontfix
firefox77 --- wontfix
firefox78 --- wontfix
firefox79 --- wontfix
firefox81 --- wontfix
firefox82 --- wontfix
firefox83 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 --- wontfix
firefox90 --- fixed

People

(Reporter: mccr8, Assigned: mozbugz)

References

Details

(Keywords: crash)

Crash Data

This bug is for crash report bp-d9367f2d-b3b3-4990-bdab-1db030200206.

Top 10 frames of crashing thread:

0 ntdll.dll KiFastSystemCallRet 
1 ntdll.dll NtWaitForKeyedEvent 
2 ntdll.dll RtlSleepConditionVariableSRW 
3 kernel32.dll SleepConditionVariableSRW 
4 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
5 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue>::GetEvent xpcom/threads/ThreadEventQueue.cpp:208
6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1140
7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
8 xul.dll nsThread::Shutdown xpcom/threads/nsThread.cpp:909
9 xul.dll mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst tools/profiler/gecko/ChildProfilerController.cpp:62

This is a pretty common shutdown hang on the Nightly I looked at. It looks like the profiler is trying to grab a profile during shutdown, but it takes too long and then we hang.

This is happening on Windows 10 too.

OS: Windows 7 → Windows

I'll need to have a deeper look, but some initial thoughts:
The main operation that could take a long time is ChildProfilerController::ShutdownProfilerChild, and especially profiler_get_profile under it, which would run in another thread (and the main thread waits for that thread to empty its queue and shutdown)... But these functions is not visible in these reports!?

Flags: needinfo?(gsquelart)
Priority: -- → P2

None of the other threads is doing any profiler-related work but your comment gave me an idea of why this might be happening. While triaging these crashes I stumbled upon more than one crash when the main thread was waiting on an event-queue - potentially after posting a runnable - but no other thread was doing any work. My gut feeling is that during the shutdown procedure we're killing the runnables in the event queues (or just stopping them) so the main thread will be waiting on something that never happens.

Do we really need to synchronously shutdown the thread there? Is there any reason that we couldn't use AsyncShutdown()?

Crash Signature: [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] → [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill]
Crash Signature: [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill] → [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill]
Hardware: x86 → All
Version: unspecified → Trunk

This is one of the top overall Nightly crashes. It would be nice if we could do something to mitigate it.

This is not a real crash, it's just us being forced to kill a content process that's not shutting down within 5 seconds. It's not directly visible by users but it warns us that there's something slow (or stuck) in the content process shutdown procedure.

It should also only occur while the profiler is running, unless there's a bug. The profiler isn't used by a significant share of our users, so I'm surprised this crash shows up that highly.

Crash Signature: [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill] → [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill | RtlpWakeConditionVariable | nsThread::Dispatch]

Another signature for the same crash.

Crash Signature: [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill | RtlpWakeConditionVariable | nsThread::Dispatch] → [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown ] [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFi…
Flags: needinfo?(gsquelart)

Coming back here because of bug 1638753, which sounded the alarm (again).

(In reply to Gabriele Svelto [:gsvelto] from comment #7)

This is not a real crash, it's just us being forced to kill a content process that's not shutting down within 5 seconds. It's not directly visible by users but it warns us that there's something slow (or stuck) in the content process shutdown procedure.

Thank you for this information!

Looking at a couple of reports, they do have the ProfilerChild thread doing a ChildProfilerController::ShutdownProfilerChild() (which is the last task sent to the ProfilerChild thread, and now the main thread is waiting for that ProfilerChild thread to shutdown).
And that task on the ProfilerChild thread seems to be waiting for a MessageChannel::SynchronouslyClose().
Could there be a deadlock between that channel sync-close and the main thread's ContentChild currently processing a message and waiting for the ProfilerChild thread to shutdown?

I'll investigate further... But other ideas are always welcome!

Assignee: nobody → gsquelart

Since the status are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.

After bug 1637048 the volume here dropped almost to zero, confirming that this was a case of content processes being slow to shut down in this particular phase. Given the above I'd say we can lower the priority of this bug but it would still be nice to speed up that particular shutdown phase if possible.

Crash Signature: [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown ] [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFi… → [@ IPCError-browser | ShutDownKill | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown ] [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerC…
Crash Signature: [@ IPCError-browser | ShutDownKill | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown ] [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerC… → [@ IPCError-browser | ShutDownKill | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerCo…

I need to look into this.

Severity: normal → S3
Flags: needinfo?(gsquelart)
Priority: P2 → P1
Crash Signature: [@ IPCError-browser | ShutDownKill | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerCo… → [@ IPCError-browser | ShutDownKill | RtlpWakeConditionVariable | nsThread::Dispatch] [@ IPCError-browser | ShutDownKill | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdow…
Crash Signature: [@ IPCError-browser | ShutDownKill | RtlpWakeConditionVariable | nsThread::Dispatch] [@ IPCError-browser | ShutDownKill | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdow… → [@ IPCError-browser | ShutDownKill | PR_GetCurrentThread | trunc | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | ShutDownKill | RtlpWakeConditionVariable | nsThread::Dispatch] [@ IPCError-br…
Crash Signature: nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | ShutDownKill | trunc | trunc | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] → nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | ShutDownKill | trunc | trunc | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | Shut…

The candidate fix in bug 1707537 landed in 90.0a1 20210427095509, but I still see crash reports after that, so the hunt continues...

Crash Signature: nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | ShutDownKill | trunc | trunc | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | Shut… → nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | ShutDownKill | trunc | trunc | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | Shu…
Crash Signature: mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] → mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill | RtlAcquireSRWLockExclusive | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | S…

More signatures.
And a few reports on Mac (those with __psynch_cvwait) so it's not only Windows.

Crash Signature: IPCError-browser | ShutDownKill | mozilla::TimeStamp::Now | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] → IPCError-browser | ShutDownKill | mozilla::TimeStamp::Now | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill | mozglue.dll | nsThread::Shutdown | mozilla::ChildProfilerCo…
OS: Windows → All

Too many signatures added in comment 21, which hid the graph.
I've restricted the list to those that were known previously, plus the missing ones that had at least 2 reports in the last month.

To see everything, do a search for "ChildProfilerController".

Crash Signature: [@ IPCError-browser | ShutDownKill | PR_GetCurrentThread | trunc | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown] [@ IPCError-browser | ShutDownKill | RtlpWakeConditionVariable | nsThread::Dispatch] [@ IPCError-br… → [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdo…

With bug 1710607 landed in build 20210512174859, it already looks good with zero crashes since then. I'll monitor for a few more days before declaring victory...

Zero crash reports since build 20210512174859, I'm declaring it:
Fixed by bug 1710607.

No need to uplift, this only happens on Nightly. (So technically speaking, previous versions that graduated to beta or release are no longer affected anyway, but there's no tracking flag that captures this subtlety.)

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(gsquelart)
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch
You need to log in before you can comment on or make changes to this bug.