Open Bug 1613798 Opened 9 months ago Updated 6 days ago

Crash in [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst]

Categories

(Core :: Gecko Profiler, defect, P2)

All
Windows
defect

Tracking

()

Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox74 --- wontfix
firefox75 --- wontfix
firefox76 --- wontfix
firefox77 --- wontfix
firefox78 --- wontfix
firefox79 --- wontfix
firefox81 --- wontfix
firefox82 --- wontfix
firefox83 --- fix-optional

People

(Reporter: mccr8, Assigned: gerald)

References

Details

(Keywords: crash)

Crash Data

This bug is for crash report bp-d9367f2d-b3b3-4990-bdab-1db030200206.

Top 10 frames of crashing thread:

0 ntdll.dll KiFastSystemCallRet 
1 ntdll.dll NtWaitForKeyedEvent 
2 ntdll.dll RtlSleepConditionVariableSRW 
3 kernel32.dll SleepConditionVariableSRW 
4 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
5 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue>::GetEvent xpcom/threads/ThreadEventQueue.cpp:208
6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1140
7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
8 xul.dll nsThread::Shutdown xpcom/threads/nsThread.cpp:909
9 xul.dll mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst tools/profiler/gecko/ChildProfilerController.cpp:62

This is a pretty common shutdown hang on the Nightly I looked at. It looks like the profiler is trying to grab a profile during shutdown, but it takes too long and then we hang.

This is happening on Windows 10 too.

OS: Windows 7 → Windows
Duplicate of this bug: 1333464

I'll need to have a deeper look, but some initial thoughts:
The main operation that could take a long time is ChildProfilerController::ShutdownProfilerChild, and especially profiler_get_profile under it, which would run in another thread (and the main thread waits for that thread to empty its queue and shutdown)... But these functions is not visible in these reports!?

Flags: needinfo?(gsquelart)
Priority: -- → P2

None of the other threads is doing any profiler-related work but your comment gave me an idea of why this might be happening. While triaging these crashes I stumbled upon more than one crash when the main thread was waiting on an event-queue - potentially after posting a runnable - but no other thread was doing any work. My gut feeling is that during the shutdown procedure we're killing the runnables in the event queues (or just stopping them) so the main thread will be waiting on something that never happens.

Do we really need to synchronously shutdown the thread there? Is there any reason that we couldn't use AsyncShutdown()?

See Also: → 1611555
Crash Signature: [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] → [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill]
Crash Signature: [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill] → [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill]
Hardware: x86 → All
Version: unspecified → Trunk

This is one of the top overall Nightly crashes. It would be nice if we could do something to mitigate it.

This is not a real crash, it's just us being forced to kill a content process that's not shutting down within 5 seconds. It's not directly visible by users but it warns us that there's something slow (or stuck) in the content process shutdown procedure.

It should also only occur while the profiler is running, unless there's a bug. The profiler isn't used by a significant share of our users, so I'm surprised this crash shows up that highly.

Duplicate of this bug: 1628264
Crash Signature: [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill] → [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill | RtlpWakeConditionVariable | nsThread::Dispatch]

Another signature for the same crash.

Crash Signature: [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFirst] [@ IPCError-browser | ShutDownKill | RtlpWakeConditionVariable | nsThread::Dispatch] → [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown ] [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFi…
Flags: needinfo?(gsquelart)
Duplicate of this bug: 1638753

Coming back here because of bug 1638753, which sounded the alarm (again).

(In reply to Gabriele Svelto [:gsvelto] from comment #7)

This is not a real crash, it's just us being forced to kill a content process that's not shutting down within 5 seconds. It's not directly visible by users but it warns us that there's something slow (or stuck) in the content process shutdown procedure.

Thank you for this information!

Looking at a couple of reports, they do have the ProfilerChild thread doing a ChildProfilerController::ShutdownProfilerChild() (which is the last task sent to the ProfilerChild thread, and now the main thread is waiting for that ProfilerChild thread to shutdown).
And that task on the ProfilerChild thread seems to be waiting for a MessageChannel::SynchronouslyClose().
Could there be a deadlock between that channel sync-close and the main thread's ContentChild currently processing a message and waiting for the ProfilerChild thread to shutdown?

I'll investigate further... But other ideas are always welcome!

Assignee: nobody → gsquelart

Since the status are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.

After bug 1637048 the volume here dropped almost to zero, confirming that this was a case of content processes being slow to shut down in this particular phase. Given the above I'd say we can lower the priority of this bug but it would still be nice to speed up that particular shutdown phase if possible.

Crash Signature: [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown ] [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerController::ShutdownAndMaybeGrabShutdownProfileFi… → [@ IPCError-browser | ShutDownKill | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::ChildProfilerController::GrabShutdownProfileAndShutdown ] [@ IPCError-browser | ShutDownKill | nsThread::Shutdown | mozilla::ChildProfilerC…
You need to log in before you can comment on or make changes to this bug.