[QM shutdown] Investigate hangs in shutdownAndJoinIOThread
Categories: Core :: Storage: Quota Manager, task
People: Reporter: jstutte, Assigned: jstutte
References: Blocks 1 open bug
Details: Keywords: leave-open
Attachments: 2 files
There is a flavor of QM shutdown hang that occurs as late as shutdownAndJoinIOThread.
Cracking open minidump 7f828ade-919a-45d9-aebf-4fa6c0220517, it looks as if:
- All expected work on the QuotaManager IO thread has ended.
- We are stuck inside _PR_NotifyJoinWaiters, waiting for the thread->md.blocked_sema semaphore to be unblocked.
This semaphore seems to be created during thread initialization and acquired through _PR_MD_WAIT.
It seems there is some (rare) bookkeeping or ordering error in releasing this semaphore that makes us block on ourselves at the end of the thread.
There might be a good reason for this that I am not aware of, but the _PR_MD_WAIT at the very end of the thread execution does look like a possible deadlock.
Comment 1 (Assignee) • 2 years ago
:mccr8, you may have more insights here?
Comment 2 • 2 years ago
Kris is a better person to ask about thread shutdown.
Comment 3 (Assignee) • 2 years ago
OK, looking at an example of this crash, it seems that:
- The IPDL Background thread is running QuotaManager::Shutdown and is stuck in SpinEventLoopUntil, waiting for context->GetCompleted() until the timeout fires.
- The QuotaManager IO thread is waiting for the IPDL Background thread to call PR_JoinThread in order to finalize the joining. This implies in particular that the nsThreadShutdownAckEvent should already have been dispatched from the terminating QuotaManager IO thread to the joining IPDL Background thread. At least from code reading there seems to be no way we could end up here with mJoiningThread already nulled out (this should be possible only for threads in a thread pool).
Once nsThreadShutdownAckEvent runs, there seems to be no code path on which we could fail to set mCompleted to true. So AFAICS the remaining investigation paths are:
1. nsThreadShutdownAckEvent has not been dispatched, for whatever strange reason.
2. nsThreadShutdownAckEvent has been dispatched but has never been run (FWIW: it is derived from CancelableRunnable?).
3. SpinEventLoopUntil never gets to check the bail-out condition after having executed nsThreadShutdownAckEvent (which sounds very unlikely and would mean we somehow sit inside NS_ProcessNextEvent until the timer fires?).
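To make the handshake described above easier to reason about, here is a minimal standalone model in standard C++. It is only a sketch and not the XPCOM implementation: EventQueue, ShutdownContext and SimulateShutdown are hypothetical names invented for this illustration, with EventQueue::SpinUntil standing in for SpinEventLoopUntil and the dispatched lambda standing in for nsThreadShutdownAckEvent.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the event queue of the joining thread.
class EventQueue {
 public:
  // May be called from any thread.
  void Dispatch(std::function<void()> aEvent) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mEvents.push_back(std::move(aEvent));
    }
    mCondVar.notify_one();
  }

  // Runs queued events on the calling thread until aDone() is true or the
  // timeout elapses with nothing left to run. Returns whether aDone() held.
  bool SpinUntil(const std::function<bool()>& aDone,
                 std::chrono::milliseconds aTimeout) {
    assert(aDone && "a bail-out condition is required");
    const auto deadline = std::chrono::steady_clock::now() + aTimeout;
    while (!aDone()) {
      std::function<void()> event;
      {
        std::unique_lock<std::mutex> lock(mMutex);
        const bool haveEvent = mCondVar.wait_until(
            lock, deadline, [this] { return !mEvents.empty(); });
        if (!haveEvent) {
          return false;  // Timed out: this is the hang seen on the joiner.
        }
        event = std::move(mEvents.front());
        mEvents.pop_front();
      }
      event();  // Run outside the lock, as an event loop would.
    }
    return true;
  }

 private:
  std::mutex mMutex;
  std::condition_variable mCondVar;
  std::deque<std::function<void()>> mEvents;
};

// Hypothetical, simplified shutdown context; only touched on the joiner.
struct ShutdownContext {
  bool mCompleted = false;
};

// Returns true if the joining side saw the shutdown complete in time.
// aDispatchAck == false models the path where the ack is never dispatched.
bool SimulateShutdown(bool aDispatchAck, std::chrono::milliseconds aTimeout) {
  EventQueue joiningQueue;
  ShutdownContext context;
  std::thread ioThread([&] {
    // All expected work on the terminating thread has ended here.
    if (aDispatchAck) {
      joiningQueue.Dispatch([&context] { context.mCompleted = true; });
    }
  });
  const bool completed = joiningQueue.SpinUntil(
      [&context] { return context.mCompleted; }, aTimeout);
  ioThread.join();
  return completed;
}
```

In this model, SimulateShutdown(true, ...) completes as soon as the ack runs, while SimulateShutdown(false, ...) only returns once the timeout fires, which mirrors the first of the remaining investigation paths.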
Comment 4 (Assignee) • 2 years ago
There is a code path that can make us not dispatch the nsThreadShutdownAckEvent at the end of nsThread::ThreadFunc. The only known and legitimate way to get there should be via StopWaitingAndLeakThread. We want to ensure that, if we take this code path, someone else has already marked the shutdown context as complete.
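The invariant described here can be sketched as a small standalone model. This is an illustration under stated assumptions, not the code from the attached patch: the ShutdownContext fields, ExitAction, ClassifyThreadExit, CheckThreadExit and the StopWaitingAndLeak helper are all hypothetical simplifications, and the explicit abort only imitates a release assertion.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical, simplified stand-in for the shutdown context.
struct ShutdownContext {
  bool mHasJoiningThread = true;  // Cleared once the joiner stops waiting.
  bool mCompleted = false;        // Set when the shutdown counts as complete.
};

enum class ExitAction { DispatchAck, AlreadyCompleted, WouldHangJoiner };

// Decides what the terminating thread has to do at the end of its function.
ExitAction ClassifyThreadExit(const ShutdownContext& aContext) {
  if (aContext.mHasJoiningThread) {
    return ExitAction::DispatchAck;  // Normal path: tell the joiner.
  }
  if (aContext.mCompleted) {
    return ExitAction::AlreadyCompleted;  // Joiner gave up and marked it done.
  }
  return ExitAction::WouldHangJoiner;  // Nobody will ever set mCompleted.
}

// Models the joiner giving up: it must mark the context complete itself.
void StopWaitingAndLeak(ShutdownContext& aContext) {
  assert(aContext.mHasJoiningThread && "nobody was waiting");
  aContext.mCompleted = true;
  aContext.mHasJoiningThread = false;
}

// Release-style check: unlike assert(), this is not compiled out by NDEBUG.
void CheckThreadExit(const ShutdownContext& aContext) {
  if (ClassifyThreadExit(aContext) == ExitAction::WouldHangJoiner) {
    std::fputs("thread exit would hang the joining thread\n", stderr);
    std::abort();
  }
}
```

The point of the classification is that skipping the ack is acceptable only in the AlreadyCompleted state; any other way of reaching the no-ack path is treated as a bug and fails loudly instead of hanging the joiner.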
Comment 5 (Assignee) • 2 years ago
(In reply to Jens Stutte [:jstutte] from comment #4)
> Created attachment 9280430 [details]
> Bug 1770451 - Release assert if a thread shutdown is unexpectedly going to cause a hang on the joining thread. r?#xpcom-reviewers
This patch is meant to address or rule out case 1, and adds some more diagnostics.
Comment 6 (Assignee) • 2 years ago
Depends on D148761
Comment 7 (Assignee) • 2 years ago
(In reply to Jens Stutte [:jstutte] from comment #6)
> Created attachment 9280444 [details]
> Bug 1770451 - Trace important shutdown events in a thread's name if MOZ_DIAGNOSTIC_ASSERT_ENABLED. r?#xpcom-reviewers
This patch is meant to help us see that we actually dispatched the nsThreadShutdownAckEvent, for case 2. I still need to figure out an easy way to trace whether it has actually been executed on the joining thread.
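The idea of recording shutdown progress in the thread's name, so that it shows up in a crash report, could look roughly like the following. This is a hedged sketch rather than the patch itself: TracedThreadName and the marker strings are hypothetical, and a real implementation would additionally have to hand the resulting string to the platform's thread-naming facility.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper that accumulates shutdown markers in a thread name.
class TracedThreadName {
 public:
  explicit TracedThreadName(std::string aBaseName)
      : mName(std::move(aBaseName)) {}

  // Appends a short marker for one shutdown step and returns the new name.
  const std::string& Mark(const std::string& aStep) {
    assert(!aStep.empty() && "a marker must not be empty");
    mName += ',';
    mName += aStep;
    return mName;
  }

  const std::string& Get() const { return mName; }

 private:
  std::string mName;
};
```

Because each marker is appended in order, the final name read from a minidump would show how far the thread got before the hang, for example whether the ack was dispatched at all.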
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/dc770b6f3dc0
Release assert if a thread shutdown is unexpectedly going to cause a hang on the joining thread. r=xpcom-reviewers,nika
https://hg.mozilla.org/integration/autoland/rev/6356ff8c8204
Trace important shutdown events in a thread's name if MOZ_DIAGNOSTIC_ASSERT_ENABLED. r=xpcom-reviewers,barret
Comment 9 (bugherder) • 2 years ago
Comment 10 • 1 year ago
The leave-open keyword is set and there has been no activity for 6 months.
:jstutte, maybe it's time to close this bug?
For more information, please visit the auto_nag documentation.