Closed Bug 1817498 Opened 1 year ago Closed 1 year ago

Deadlock from BackgroundHangThread::NotifyActivity() creating a timer while holding mGraphMutex

Categories

(Core :: XPCOM, defect)

Unspecified
Windows 11
defect

Tracking

()

RESOLVED FIXED
112 Branch
Tracking Status
firefox-esr102 --- disabled
firefox110 --- disabled
firefox111 --- disabled
firefox112 --- fixed

People

(Reporter: mccr8, Assigned: florian)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/b60f23b3-169d-4588-9fd6-476530230217

Reason: EXCEPTION_BREAKPOINT

Top 10 frames of crashing thread:

0  ntdll.dll  ZwWaitForAlertByThreadId  
1  ntdll.dll  RtlAcquireSRWLockExclusive  
2  xul.dll  mozilla::OffTheBooksMutex::Lock  xpcom/threads/Mutex.h:65
2  xul.dll  mozilla::Monitor::Lock  xpcom/threads/Monitor.h:31
2  xul.dll  mozilla::detail::BaseMonitorAutoLock<mozilla::Monitor>::BaseMonitorAutoLock  xpcom/threads/Monitor.h:129
2  xul.dll  TimerThread::RemoveTimer  xpcom/threads/TimerThread.cpp:780
2  xul.dll  TimerThreadWrapper::RemoveTimer  xpcom/threads/nsTimerImpl.cpp:109
2  xul.dll  nsTimerImpl::InitCommon  xpcom/threads/nsTimerImpl.cpp:395
3  xul.dll  nsTimerImpl::InitHighResolutionWithCallback  xpcom/threads/nsTimerImpl.cpp:446
3  xul.dll  nsTimer::InitHighResolutionWithCallback  xpcom/threads/nsTimerImpl.h:201

This appears to be a reasonably common source of hangs. In the main thread, BackgroundHangThread::NotifyActivity() is holding TaskController::mGraphMutex, then it tries to create a new timer, which requires acquiring TimerThread::mMonitor. Unfortunately, over on the timer thread, it is attempting to call NS_DispatchToMainThread (frequently from FOG_RegisterContentChildShutdown()) while holding TimerThread::mMonitor, which tries to acquire mGraphMutex.

On Element, Nika said: "We should definitely not be holding mGraphMutex when in NotifyActivity".

I think this timer usage was added in bug 965392.

Keywords: regression
Regressed by: 965392

Nika also said: "We probably need to unlock around those like we do in a bunch of places in ExecuteNextTaskOnlyMainThreadInternal."

Blocks: 1789810

There are around 900 hangs in the last week with BackgroundHangThread::NotifyActivity() in the proto signature, but given that there are around 800 hangs with BackgroundHangManager::CollectCPUUsage() in the proto signature, I'm guessing at most 100 of them are this. That's still a decent amount given that this is only on Nightly. (Here's an example of a BackgroundHangManager::CollectCPUUsage() hang, which I assume is due to some OS function being slow: bp-0843cd24-eff9-48f1-9985-42aef0230217)

:florian, since you are the author of the regressor, bug 965392, could you take a look?

For more information, please visit auto_nag documentation.

Flags: needinfo?(florian)

(In reply to Andrew McCreight [:mccr8] from comment #4)

... there are around 800 hangs with BackgroundHangManager::CollectCPUUsage() in the proto signature ... (Here's an example of a BackgroundHangManager::CollectCPUUsage() hang, which I assume is due to some OS function being slow: bp-0843cd24-eff9-48f1-9985-42aef0230217)

What is interesting about those is that we end up inside ZwQuerySystemInformation which seems to be deprecated since Windows 8 but we get there through the apparently non-deprecated GetSystemTimes function. Not sure if it would be worth exploring if there are alternatives for that one, too?

Edit: Errrh, the documentation says for SYSTEM_PROCESSOR_PERFORMANCE_INFORMATION: "Use GetSystemTimes instead to retrieve this information.". So probably nothing we can do here on our side?

FWIW, got the same crash or something awfully similar just now on MacOS.
This occurred on a Google Docs document which did not load after Nightly update and it crashed after I closed the tab. Can't reproduce unfortunately.

(In reply to Andrew McCreight [:mccr8] from comment #0)

This appears to be a reasonably common source of hangs. In the main thread, BackgroundHangThread::NotifyActivity() is holding TaskController::mGraphMutex, then it tries to create a new timer, which requires acquiring TimerThread::mMonitor. Unfortunately, over on the timer thread, it is attempting to call NS_DispatchToMainThread (frequently from FOG_RegisterContentChildShutdown()) while holding TimerThread::mMonitor, which tries to acquire mGraphMutex.

Given this was filed 2 days after bug 1814718 landed, and that bug introduced the use of Glean in the timer thread, could this be a regression from bug 1814718 rather than bug 965392? Maybe it's a more actionable variation of bug 1818758?

Flags: needinfo?(florian)
See Also: → 1814718, 1818758

(In reply to Florian Quèze [:florian] from comment #9)

Given this was filed 2 days after bug 1814718 landed, and that bug introduced the use of Glean in the timer thread, could this be a regression from bug 1814718 rather than bug 965392? Maybe it's a more actionable variation of bug 1818758?

That may have caused the deadlock to happen more often (Glean was the most frequent thing on the stack in the timer thread that I saw, but not the only one), but my understanding from discussion on the XPCOM channel is that it is okay to dispatch to the main thread from the timer thread, but it is not okay to create a timer while holding the TaskController mutex.

Assignee: nobody → florian
Status: NEW → ASSIGNED

The severity field is not set for this bug.
:nika, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(nika)
Pushed by fqueze@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b74f3944f8dd
Unlock the TaskController mGraphMutex mutex before calling into BackgroundHangThread APIs, r=bas.

This is Nightly only, so I'll just mark it S3. There's a fix now so the rating is largely irrelevant.

Severity: -- → S3
Flags: needinfo?(nika)
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch
Duplicate of this bug: 1796060

The original issue does seem to have cleared up. There are still some hangs with NotifyActivity on the crashing stack, but it is the mozilla::GetGlobalCPUStats() version noted by jstutte in comment 6. That appears to account for a third of the crashes with this hang signature on Nightly 112 with NotifyActivity on the crashing stack.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: