Closed Bug 1772281 Opened 2 years ago Closed 2 years ago

Crash in [@ shutdownhang | PR_MD_WAIT_CV | PR_JoinThread | nsThreadShutdownAckEvent::Run]

Categories

(Core :: XPCOM, defect)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
103 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox101 + fixed
firefox102 --- fixed
firefox103 --- fixed

People

(Reporter: aryx, Assigned: jstutte)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

New Windows crash signature, 277 crash reports for Firefox 101.0 from 277 installations.

Crash report: https://crash-stats.mozilla.org/report/index/008e8b45-47c5-40de-934c-77f8a0220602

MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForSingleObject 
1 KERNELBASE.dll WaitForSingleObjectEx 
2 nss3.dll PR_MD_WAIT_CV nsprpub/pr/src/md/windows/w95cv.c:252
3 nss3.dll PR_JoinThread nsprpub/pr/src/threads/combined/pruthr.c:1567
4 xul.dll nsThreadShutdownAckEvent::Run xpcom/threads/nsThread.cpp:208
5 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:780
6 xul.dll mozilla::TaskController::ProcessPendingMTTask xpcom/threads/TaskController.cpp:390
7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:465
8 xul.dll nsThread::Shutdown xpcom/threads/nsThread.cpp:875
9 xul.dll nsThreadManager::ShutdownNonMainThreads xpcom/threads/nsThreadManager.cpp:401
Flags: needinfo?(nika)

This looks like we're hanging while shutting down background threads. It's interesting that we're specifically stuck in PR_JoinThread, as generally nsThreadShutdownAckEvent is only dispatched when the thread is properly dead already so we wouldn't spend very long there.

Redirecting to :jstutte who is more familiar with the situation around shutdown hangs right now.

Flags: needinfo?(nika) → needinfo?(jstutte)

Hmm, I suspect bug 1764119 being involved here. I see this spiked up since it was uplifted to 101.0b4. Need to investigate more, though.

See Also: → 1769821
See Also: 1769821

Looking at some crash reports, in all of these we have 1 or more RunProcess threads stuck inside _PR_NotifyJoinWaiters(thread);, that is: after the main thread func has been executed and the nsThreadShutdownAckEvent has been posted.

The main thread is blocking on PR_JoinThread waiting for a thread->term condvar to be notified. I cannot see from the stack, which thread it's waiting for, though, even in Visual Studio.

I assume we have some interleaving of events here, in this particular situation we seem to:

  1. Dying thread: End nsThread::ThreadFunc, posting the nsThreadShutdownAckEvent
  2. Dying thread: Enter _PR_NotifyJoinWaiters, set the thread state to _PR_JOIN_WAIT, notify the condvar thread->term and then wait for thread->md.blocked_sema to become green
  3. Joining Thread: Execute the nsThreadShutdownAckEvent where we want to get notified on condvar thread->term

Looking at _PR_NotifyJoinWaiters:

    if (thread->term != NULL) {
        PR_Lock(_pr_terminationCVLock);
        _PR_THREAD_LOCK(thread);
        thread->state = _PR_JOIN_WAIT;
        if ( !_PR_IS_NATIVE_THREAD(thread) ) {
            _PR_MISCQ_LOCK(thread->cpu);
            _PR_ADD_JOINQ(thread, thread->cpu);
            _PR_MISCQ_UNLOCK(thread->cpu);
        }
--> Should we make a copy of thread->term and null it out already here?
    I believe waiting on it makes only sense if the joining thread arrives 
    earlier than us where it waits for it?
        _PR_THREAD_UNLOCK(thread);
        PR_NotifyCondVar(thread->term);
        PR_Unlock(_pr_terminationCVLock);
        _PR_MD_WAIT(thread, PR_INTERVAL_NO_TIMEOUT);
        PR_ASSERT(thread->state != _PR_JOIN_WAIT);
    }
Flags: needinfo?(jstutte)

OK, spending a bit more time with the dmp in Visual Studio shows me, that we apparently wait on a different thread, and that thread seams to be in a same-stack-deadlock:

 	ntdll.dll!NtWaitForAlertByThreadId()	Unbekannt
 	ntdll.dll!RtlAcquireSRWLockExclusive()	Unbekannt
 	[Inlineframe] xul.dll!mozilla::OffTheBooksMutex::Lock() Zeile 65	C++
 	[Inlineframe] xul.dll!mozilla::detail::BaseAutoLock<mozilla::Mutex &>::BaseAutoLock(mozilla::Mutex & aLock) Zeile 236	C++
 	[Inlineframe] xul.dll!mozilla::ThreadEventQueue::PutEventInternal(already_AddRefed<nsIRunnable> && aEvent, mozilla::EventQueuePriority aPriority, mozilla::ThreadEventQueue::NestedSink * aSink) Zeile 116	C++

--> locks mMutex a second time

 	[Inlineframe] xul.dll!mozilla::ThreadEventQueue::PutEvent(already_AddRefed<nsIRunnable> && aEvent, mozilla::EventQueuePriority aPriority) Zeile 73	C++
 	[Inlineframe] xul.dll!mozilla::ThreadEventTarget::Dispatch(already_AddRefed<nsIRunnable> aEvent, unsigned int aFlags) Zeile 98	C++
 	xul.dll!nsThread::Dispatch(already_AddRefed<nsIRunnable> aEvent, unsigned int aFlags) Zeile 662	C++
 	xul.dll!mozilla::LazyIdleThread::ShutdownThread() Zeile 272	C++
 	[Inlineframe] xul.dll!mozilla::LazyIdleThread::Shutdown() Zeile 503	C++
 	[Inlineframe] xul.dll!mozilla::LazyIdleThread::~LazyIdleThread() Zeile 49	C++
 	[Inlineframe] xul.dll!mozilla::LazyIdleThread::SelfDestruct() Zeile 323	C++
 	xul.dll!mozilla::LazyIdleThread::Release() Zeile 341	C++
 	[Inlineframe] xul.dll!nsCOMPtr_base::~nsCOMPtr_base() Zeile 328	C++
 	xul.dll!mozilla::ThreadEventQueue::SetObserver(nsIThreadObserver * aObserver) Zeile 280	C++

--> locks mMutex the first time

 	xul.dll!nsThread::SetObserver(nsIThreadObserver * aObs) Zeile 1297	C++
>	xul.dll!nsThread::ThreadFunc(void * aArg) Zeile 448	C++
 	nss3.dll!_PR_NativeRunThread(void * arg) Zeile 421	C
 	nss3.dll!pr_root(void * arg) Zeile 140	C
 	ucrtbase.dll!thread_start<unsigned int (__cdecl*)(void *),1>()	Unbekannt
 	kernel32.dll!BaseThreadInitThunk()	Unbekannt
 	[Inlineframe] mozglue.dll!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,void (*)(int, void *, void *)>::operator()(int & aArgs, void * & aArgs, void * & aArgs) Zeile 150	C++
 	mozglue.dll!patched_BaseThreadInitThunk(int aIsInitialThread, void * aStartAddress, void * aThreadParam) Zeile 572	C++
 	ntdll.dll!RtlUserThreadStart()	Unbekannt

A straight forward fix would be to use a RecursiveMutex here, I think.

I assume this is happening more often now because we permit more event dispatching during shutdown since bug 1764119 landed, but the potential for the deadlock was always there.

(The good news is that I can stop the endless compile on my Windows laptop as I wanted to poke on _PR_NotifyJoinWaiters under Windows...)

Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Attachment #9279582 - Attachment description: Bug 1772281 - Have a dedicated lock for mObserver to avoid a potential recursive lock in ThreadEventQueue. r?#xpcom-reviewers → Bug 1772281 - Have a dedicated recursive lock for mObserver to avoid a potential recursive lock in ThreadEventQueue. r?#xpcom-reviewers
Attachment #9279582 - Attachment description: Bug 1772281 - Have a dedicated recursive lock for mObserver to avoid a potential recursive lock in ThreadEventQueue. r?#xpcom-reviewers → Bug 1772281 - Reduce the scope of the mMutex lock in ThreadEventQueue::SetObserver to avoid a potential recursive deadlock. r?#xpcom-reviewers
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/598649d8d10a
Reduce the scope of the mMutex lock in ThreadEventQueue::SetObserver to avoid a potential recursive deadlock. r=xpcom-reviewers,jesup,nika
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 103 Branch

Comment on attachment 9279582 [details]
Bug 1772281 - Reduce the scope of the mMutex lock in ThreadEventQueue::SetObserver to avoid a potential recursive deadlock. r?#xpcom-reviewers

Beta/Release Uplift Approval Request

  • User impact if declined: Possible shutdown hangs, that appear to be more frequent since bug 1764119 landed.
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change does just reduce the scope of a lock to not include unrelated processing.
  • String changes made/needed:
  • Is Android affected?: Unknown
Attachment #9279582 - Flags: approval-mozilla-beta?
Attachment #9279582 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9279582 [details]
Bug 1772281 - Reduce the scope of the mMutex lock in ThreadEventQueue::SetObserver to avoid a potential recursive deadlock. r?#xpcom-reviewers

Beta/Release Uplift Approval Request

  • User impact if declined: Possible shutdown hangs, that appear to be more frequent since bug 1764119 landed.
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The fix landed in 102.0b5, so far no crashes.
  • String changes made/needed:
  • Is Android affected?: Unknown
Attachment #9279582 - Flags: approval-mozilla-release?

Comment on attachment 9279582 [details]
Bug 1772281 - Reduce the scope of the mMutex lock in ThreadEventQueue::SetObserver to avoid a potential recursive deadlock. r?#xpcom-reviewers

Crash rate on Nightly and Beta is looking good. Approved for 101.0.1.

Attachment #9279582 - Flags: approval-mozilla-release? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: