Deadlock from BackgroundHangThread::NotifyActivity() creating a timer while holding mGraphMutex
Categories
(Core :: XPCOM, defect)
Tracking
()
People
(Reporter: mccr8, Assigned: florian)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file)
Crash report: https://crash-stats.mozilla.org/report/index/b60f23b3-169d-4588-9fd6-476530230217
Reason: EXCEPTION_BREAKPOINT
Top 10 frames of crashing thread:
0 ntdll.dll ZwWaitForAlertByThreadId
1 ntdll.dll RtlAcquireSRWLockExclusive
2 xul.dll mozilla::OffTheBooksMutex::Lock xpcom/threads/Mutex.h:65
2 xul.dll mozilla::Monitor::Lock xpcom/threads/Monitor.h:31
2 xul.dll mozilla::detail::BaseMonitorAutoLock<mozilla::Monitor>::BaseMonitorAutoLock xpcom/threads/Monitor.h:129
2 xul.dll TimerThread::RemoveTimer xpcom/threads/TimerThread.cpp:780
2 xul.dll TimerThreadWrapper::RemoveTimer xpcom/threads/nsTimerImpl.cpp:109
2 xul.dll nsTimerImpl::InitCommon xpcom/threads/nsTimerImpl.cpp:395
3 xul.dll nsTimerImpl::InitHighResolutionWithCallback xpcom/threads/nsTimerImpl.cpp:446
3 xul.dll nsTimer::InitHighResolutionWithCallback xpcom/threads/nsTimerImpl.h:201
This appears to be a reasonably common source of hangs. In the main thread, BackgroundHangThread::NotifyActivity() is holding TaskController::mGraphMutex, then it tries to create a new timer, which requires acquiring TimerThread::mMonitor. Unfortunately, over on the timer thread, it is attempting to call NS_DispatchToMainThread (frequently from FOG_RegisterContentChildShutdown()) while holding TimerThread::mMonitor, which tries to acquire mGraphMutex.
On Element, Nika said: "We should definitely not be holding mGraphMutex when in NotifyActivity".
Reporter | ||
Comment 1•2 years ago
|
||
I think this timer usage was added in bug 965392.
Reporter | ||
Comment 2•2 years ago
|
||
Nika also said: "We probably need to unlock around those like we do in a bunch of places in ExecuteNextTaskOnlyMainThreadInternal."
Reporter | ||
Comment 3•2 years ago
|
||
BHR is Nightly only.
Reporter | ||
Comment 4•2 years ago
|
||
There are around 900 hangs in the last week with BackgroundHangThread::NotifyActivity() in the proto signature, but given that there are around 800 hangs with BackgroundHangManager::CollectCPUUsage() in the proto signature, I'm guessing at most 100 of them are this. That's still a decent amount given that this is only on Nightly. (Here's an example of a BackgroundHangManager::CollectCPUUsage() hang, which I assume is due to some OS function being slow: bp-0843cd24-eff9-48f1-9985-42aef0230217)
Comment 5•2 years ago
|
||
:florian, since you are the author of the regressor, bug 965392, could you take a look?
For more information, please visit auto_nag documentation.
Comment 6•2 years ago
•
|
||
(In reply to Andrew McCreight [:mccr8] from comment #4)
... there are around 800 hangs with BackgroundHangManager::CollectCPUUsage() in the proto signature ... (Here's an example of a BackgroundHangManager::CollectCPUUsage() hang, which I assume is due to some OS function being slow: bp-0843cd24-eff9-48f1-9985-42aef0230217)
What is interesting about those is that we end up inside ZwQuerySystemInformation
which seems to be deprecated since Windows 8 but we get there through the apparently non-deprecated GetSystemTimes
function. Not sure if it would be worth exploring if there are alternatives for that one, too?
Edit: Errrh, the documentation says for SYSTEM_PROCESSOR_PERFORMANCE_INFORMATION
: "Use GetSystemTimes instead to retrieve this information.". So probably nothing we can do here on our side?
Comment 7•2 years ago
|
||
FWIW, got the same crash or something awfully similar just now on MacOS.
This occurred on a Google Docs document which did not load after Nightly update and it crashed after I closed the tab. Can't reproduce unfortunately.
Comment hidden (obsolete) |
Assignee | ||
Comment 9•2 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #0)
This appears to be a reasonably common source of hangs. In the main thread, BackgroundHangThread::NotifyActivity() is holding TaskController::mGraphMutex, then it tries to create a new timer, which requires acquiring TimerThread::mMonitor. Unfortunately, over on the timer thread, it is attempting to call NS_DispatchToMainThread (frequently from FOG_RegisterContentChildShutdown()) while holding TimerThread::mMonitor, which tries to acquire mGraphMutex.
Given this was filed 2 days after bug 1814718 landed, and that bug introduced the use of Glean in the timer thread, could this be a regression from bug 1814718 rather than bug 965392? Maybe it's a more actionable variation of bug 1818758?
Reporter | ||
Comment 10•2 years ago
|
||
(In reply to Florian Quèze [:florian] from comment #9)
Given this was filed 2 days after bug 1814718 landed, and that bug introduced the use of Glean in the timer thread, could this be a regression from bug 1814718 rather than bug 965392? Maybe it's a more actionable variation of bug 1818758?
That may have caused the deadlock to happen more often (Glean was the most frequent thing on the stack in the timer thread that I saw, but not the only one), but my understanding from discussion on the XPCOM channel is that it is okay to dispatch to the main thread from the timer thread, but it is not okay to create a timer while holding the TaskController mutex.
Assignee | ||
Comment 11•2 years ago
|
||
Updated•2 years ago
|
Comment 12•2 years ago
|
||
The severity field is not set for this bug.
:nika, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 13•2 years ago
|
||
Reporter | ||
Comment 14•2 years ago
|
||
This is Nightly only, so I'll just mark it S3. There's a fix now so the rating is largely irrelevant.
Comment 15•2 years ago
|
||
bugherder |
Reporter | ||
Comment 18•2 years ago
|
||
The original issue does seem to have cleared up. There are still some hangs with NotifyActivity on the crashing stack, but it is the mozilla::GetGlobalCPUStats() version noted by jstutte in comment 6. That appears to account for a third of the crashes with this hang signature on Nightly 112 with NotifyActivity on the crashing stack.
Description
•