Crash in [@ shutdownhang | PR_MD_WAIT_CV | PR_JoinThread | nsThreadShutdownAckEvent::Run]
Categories
(Core :: XPCOM, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr91 | --- | unaffected |
firefox101 | + | fixed |
firefox102 | --- | fixed |
firefox103 | --- | fixed |
People
(Reporter: aryx, Assigned: jstutte)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file)
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
RyanVM
:
approval-mozilla-release+
|
Details | Review |
New Windows crash signature, 277 crash reports for Firefox 101.0 from 277 installations.
Crash report: https://crash-stats.mozilla.org/report/index/008e8b45-47c5-40de-934c-77f8a0220602
MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)
Top 10 frames of crashing thread:
0 ntdll.dll NtWaitForSingleObject
1 KERNELBASE.dll WaitForSingleObjectEx
2 nss3.dll PR_MD_WAIT_CV nsprpub/pr/src/md/windows/w95cv.c:252
3 nss3.dll PR_JoinThread nsprpub/pr/src/threads/combined/pruthr.c:1567
4 xul.dll nsThreadShutdownAckEvent::Run xpcom/threads/nsThread.cpp:208
5 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:780
6 xul.dll mozilla::TaskController::ProcessPendingMTTask xpcom/threads/TaskController.cpp:390
7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:465
8 xul.dll nsThread::Shutdown xpcom/threads/nsThread.cpp:875
9 xul.dll nsThreadManager::ShutdownNonMainThreads xpcom/threads/nsThreadManager.cpp:401
Comment 1•2 years ago
|
||
This looks like we're hanging while shutting down background threads. It's interesting that we're specifically stuck in PR_JoinThread, as generally nsThreadShutdownAckEvent is only dispatched when the thread is properly dead already so we wouldn't spend very long there.
Redirecting to :jstutte who is more familiar with the situation around shutdown hangs right now.
Updated•2 years ago
|
Assignee | ||
Comment 2•2 years ago
|
||
Hmm, I suspect bug 1764119 being involved here. I see this spiked up since it was uplifted to 101.0b4. Need to investigate more, though.
Assignee | ||
Comment 3•2 years ago
|
||
Looking at some crash reports, in all of these we have 1 or more RunProcess
threads stuck inside _PR_NotifyJoinWaiters(thread);
, that is: after the main thread func has been executed and the nsThreadShutdownAckEvent
has been posted.
The main thread is blocking on PR_JoinThread
waiting for a thread->term
condvar to be notified. I cannot see from the stack, which thread it's waiting for, though, even in Visual Studio.
I assume we have some interleaving of events here, in this particular situation we seem to:
- Dying thread: End
nsThread::ThreadFunc
, posting thensThreadShutdownAckEvent
- Dying thread: Enter
_PR_NotifyJoinWaiters
, set the thread state to_PR_JOIN_WAIT
, notify the condvarthread->term
and then wait forthread->md.blocked_sema
to become green - Joining Thread: Execute the
nsThreadShutdownAckEvent
where we want to get notified on condvarthread->term
Looking at _PR_NotifyJoinWaiters
:
if (thread->term != NULL) {
PR_Lock(_pr_terminationCVLock);
_PR_THREAD_LOCK(thread);
thread->state = _PR_JOIN_WAIT;
if ( !_PR_IS_NATIVE_THREAD(thread) ) {
_PR_MISCQ_LOCK(thread->cpu);
_PR_ADD_JOINQ(thread, thread->cpu);
_PR_MISCQ_UNLOCK(thread->cpu);
}
--> Should we make a copy of thread->term and null it out already here?
I believe waiting on it makes only sense if the joining thread arrives
earlier than us where it waits for it?
_PR_THREAD_UNLOCK(thread);
PR_NotifyCondVar(thread->term);
PR_Unlock(_pr_terminationCVLock);
_PR_MD_WAIT(thread, PR_INTERVAL_NO_TIMEOUT);
PR_ASSERT(thread->state != _PR_JOIN_WAIT);
}
Assignee | ||
Comment 4•2 years ago
•
|
||
OK, spending a bit more time with the dmp in Visual Studio shows me, that we apparently wait on a different thread, and that thread seams to be in a same-stack-deadlock:
ntdll.dll!NtWaitForAlertByThreadId() Unbekannt
ntdll.dll!RtlAcquireSRWLockExclusive() Unbekannt
[Inlineframe] xul.dll!mozilla::OffTheBooksMutex::Lock() Zeile 65 C++
[Inlineframe] xul.dll!mozilla::detail::BaseAutoLock<mozilla::Mutex &>::BaseAutoLock(mozilla::Mutex & aLock) Zeile 236 C++
[Inlineframe] xul.dll!mozilla::ThreadEventQueue::PutEventInternal(already_AddRefed<nsIRunnable> && aEvent, mozilla::EventQueuePriority aPriority, mozilla::ThreadEventQueue::NestedSink * aSink) Zeile 116 C++
--> locks mMutex
a second time
[Inlineframe] xul.dll!mozilla::ThreadEventQueue::PutEvent(already_AddRefed<nsIRunnable> && aEvent, mozilla::EventQueuePriority aPriority) Zeile 73 C++
[Inlineframe] xul.dll!mozilla::ThreadEventTarget::Dispatch(already_AddRefed<nsIRunnable> aEvent, unsigned int aFlags) Zeile 98 C++
xul.dll!nsThread::Dispatch(already_AddRefed<nsIRunnable> aEvent, unsigned int aFlags) Zeile 662 C++
xul.dll!mozilla::LazyIdleThread::ShutdownThread() Zeile 272 C++
[Inlineframe] xul.dll!mozilla::LazyIdleThread::Shutdown() Zeile 503 C++
[Inlineframe] xul.dll!mozilla::LazyIdleThread::~LazyIdleThread() Zeile 49 C++
[Inlineframe] xul.dll!mozilla::LazyIdleThread::SelfDestruct() Zeile 323 C++
xul.dll!mozilla::LazyIdleThread::Release() Zeile 341 C++
[Inlineframe] xul.dll!nsCOMPtr_base::~nsCOMPtr_base() Zeile 328 C++
xul.dll!mozilla::ThreadEventQueue::SetObserver(nsIThreadObserver * aObserver) Zeile 280 C++
--> locks mMutex
the first time
xul.dll!nsThread::SetObserver(nsIThreadObserver * aObs) Zeile 1297 C++
> xul.dll!nsThread::ThreadFunc(void * aArg) Zeile 448 C++
nss3.dll!_PR_NativeRunThread(void * arg) Zeile 421 C
nss3.dll!pr_root(void * arg) Zeile 140 C
ucrtbase.dll!thread_start<unsigned int (__cdecl*)(void *),1>() Unbekannt
kernel32.dll!BaseThreadInitThunk() Unbekannt
[Inlineframe] mozglue.dll!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,void (*)(int, void *, void *)>::operator()(int & aArgs, void * & aArgs, void * & aArgs) Zeile 150 C++
mozglue.dll!patched_BaseThreadInitThunk(int aIsInitialThread, void * aStartAddress, void * aThreadParam) Zeile 572 C++
ntdll.dll!RtlUserThreadStart() Unbekannt
A straight forward fix would be to use a RecursiveMutex
here, I think.
I assume this is happening more often now because we permit more event dispatching during shutdown since bug 1764119 landed, but the potential for the deadlock was always there.
(The good news is that I can stop the endless compile on my Windows laptop as I wanted to poke on _PR_NotifyJoinWaiters
under Windows...)
Assignee | ||
Comment 5•2 years ago
|
||
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/598649d8d10a Reduce the scope of the mMutex lock in ThreadEventQueue::SetObserver to avoid a potential recursive deadlock. r=xpcom-reviewers,jesup,nika
Comment 7•2 years ago
|
||
bugherder |
Assignee | ||
Comment 8•2 years ago
|
||
Comment on attachment 9279582 [details]
Bug 1772281 - Reduce the scope of the mMutex lock in ThreadEventQueue::SetObserver to avoid a potential recursive deadlock. r?#xpcom-reviewers
Beta/Release Uplift Approval Request
- User impact if declined: Possible shutdown hangs, that appear to be more frequent since bug 1764119 landed.
- Is this code covered by automated tests?: Unknown
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The change does just reduce the scope of a lock to not include unrelated processing.
- String changes made/needed:
- Is Android affected?: Unknown
Comment 9•2 years ago
|
||
bugherder uplift |
Updated•2 years ago
|
Assignee | ||
Comment 10•2 years ago
|
||
Comment on attachment 9279582 [details]
Bug 1772281 - Reduce the scope of the mMutex lock in ThreadEventQueue::SetObserver to avoid a potential recursive deadlock. r?#xpcom-reviewers
Beta/Release Uplift Approval Request
- User impact if declined: Possible shutdown hangs, that appear to be more frequent since bug 1764119 landed.
- Is this code covered by automated tests?: Unknown
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The fix landed in 102.0b5, so far no crashes.
- String changes made/needed:
- Is Android affected?: Unknown
Comment 11•2 years ago
|
||
Comment on attachment 9279582 [details]
Bug 1772281 - Reduce the scope of the mMutex lock in ThreadEventQueue::SetObserver to avoid a potential recursive deadlock. r?#xpcom-reviewers
Crash rate on Nightly and Beta is looking good. Approved for 101.0.1.
Comment 12•2 years ago
|
||
bugherder uplift |
Updated•2 years ago
|
Description
•