Open Bug 1770451 Opened 2 years ago Updated 11 months ago

[QM shutdown] Investigate hangs in shutdownAndJoinIOThread

Categories

(Core :: Storage: Quota Manager, task)

ASSIGNED

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug)

Details

(Keywords: leave-open)

Attachments

(2 files)

There is a flavor of QM shutdown hang that occurs as late as shutdownAndJoinIOThread.

Cracking open minidump 7f828ade-919a-45d9-aebf-4fa6c0220517, it looks as if:

  1. All expected work on the QuotaManager IO thread ended
  2. We are stuck inside _PR_NotifyJoinWaiters waiting for the thread->md.blocked_sema semaphore to be unblocked.

It seems that this semaphore is created during thread initialization and that it is acquired through _PR_MD_WAIT.
There appears to be some (rare) bookkeeping or ordering error in releasing this semaphore that makes us block on ourselves at the end of the thread.

There might be a good reason for this that I am not aware of, but the _PR_MD_WAIT at the very end of the thread's execution does look like a possible deadlock.

:mccr8, you may have more insights here?

Flags: needinfo?(continuation)

Kris is a better person to ask about thread shutdown.

Flags: needinfo?(continuation) → needinfo?(kwright)

OK, looking at an example of this crash, it seems that:

  • The IPDL Background thread is running the QuotaManager::Shutdown and is stuck on the SpinEventLoopUntil waiting for context->GetCompleted() until the timeout fires.

  • The QuotaManager IO thread is waiting for the IPDL Background thread to call PR_JoinThread in order to finalize the join. This implies, in particular, that the nsThreadShutdownAckEvent should already have been dispatched from the terminating QuotaManager IO thread to the joining IPDL Background thread. At least from code reading, there seems to be no way we could end up here with mJoiningThread already nulled out (this should be possible only for threads in a thread pool).

Once nsThreadShutdownAckEvent runs there seems to be no code path on which we could not have set mCompleted to true. So AFAICS the remaining investigation paths are:

  1. nsThreadShutdownAckEvent has not been dispatched for whatever strange reason
  2. nsThreadShutdownAckEvent has been dispatched but has never been run (FWIW: it is derived from CancelableRunnable?)
  3. SpinEventLoopUntil never gets to check the bail-out condition after having executed nsThreadShutdownAckEvent (which sounds very unlikely and would mean we somehow sit inside NS_ProcessNextEvent until the timer fires?)
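
To make the three cases easier to reason about, here is a minimal model of the handshake in standard C++. It is only a sketch and not the XPCOM code: EventQueue, ShutdownContext and ShutdownAndJoin are hypothetical stand-ins for the joining thread's event queue, the shutdown context and the spin-until-completed loop, and the lambda plays the role of nsThreadShutdownAckEvent.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the joining thread's event queue.
class EventQueue {
 public:
  void Dispatch(std::function<void()> aEvent) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mEvents.push_back(std::move(aEvent));
    }
    mCondVar.notify_one();
  }

  // Runs at most one event. Returns false if nothing arrived before aDeadline.
  bool ProcessNextEvent(std::chrono::steady_clock::time_point aDeadline) {
    std::function<void()> event;
    {
      std::unique_lock<std::mutex> lock(mMutex);
      if (!mCondVar.wait_until(lock, aDeadline,
                               [this] { return !mEvents.empty(); })) {
        return false;
      }
      event = std::move(mEvents.front());
      mEvents.pop_front();
    }
    event();  // Run outside the lock, on the calling (joining) thread.
    return true;
  }

 private:
  std::mutex mMutex;
  std::condition_variable mCondVar;
  std::deque<std::function<void()>> mEvents;
};

// Hypothetical shutdown context; only touched on the joining thread.
struct ShutdownContext {
  bool mCompleted = false;
};

// Returns true if the ack event ran before the timeout, false otherwise.
bool ShutdownAndJoin(bool aWorkerDispatchesAck,
                     std::chrono::milliseconds aTimeout) {
  EventQueue joiningQueue;
  ShutdownContext context;

  // The terminating thread: its last action is to dispatch the ack event.
  std::thread worker([&] {
    if (aWorkerDispatchesAck) {
      joiningQueue.Dispatch([&context] { context.mCompleted = true; });
    }
  });

  // The joining thread spins its own queue until completed or timed out.
  const auto deadline = std::chrono::steady_clock::now() + aTimeout;
  while (!context.mCompleted) {
    if (!joiningQueue.ProcessNextEvent(deadline)) {
      break;  // Timeout: this is where the shutdown hang would be reported.
    }
  }

  worker.join();
  return context.mCompleted;
}
```

In this model, ShutdownAndJoin(false, ...) corresponds to case 1 and returns false once the timeout expires. Cases 2 and 3 cannot happen in the model, because the only way the loop can observe the ack is by running it and then re-checking mCompleted.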
Keywords: leave-open

There is a code path that can make us skip dispatching the nsThreadShutdownAckEvent at the end of nsThread::ThreadFunc. The only known and legitimate way to get there should be via StopWaitingAndLeakThread. We want to ensure that, if we take this code path, someone else has already marked the shutdown context as complete.
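
The invariant can be sketched as a small decision function. This is an illustration under assumptions, not the contents of the patch: ShutdownContext, mHasJoiningThread, ExitAction and DecideThreadExit are hypothetical names, and the WouldHang result marks the spot where a release assert would fire.

```cpp
#include <cassert>

// Hypothetical, simplified view of the shutdown context at thread exit.
struct ShutdownContext {
  bool mCompleted = false;        // Has someone marked the shutdown complete?
  bool mHasJoiningThread = true;  // False once the joiner stopped waiting.
};

enum class ExitAction {
  DispatchAck,  // Normal path: tell the joining thread we are done.
  SkipAck,      // Joiner gave up, and the context is already complete.
  WouldHang,    // Joiner is gone but nobody completed the context.
};

// Decides what the terminating thread should do as its last step.
ExitAction DecideThreadExit(const ShutdownContext& aContext) {
  if (aContext.mHasJoiningThread) {
    return ExitAction::DispatchAck;
  }
  return aContext.mCompleted ? ExitAction::SkipAck : ExitAction::WouldHang;
}
```

Only the WouldHang combination would leave the joining thread spinning until its timeout, which is why it is the case worth asserting on.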

Assignee: nobody → jstutte
Status: NEW → ASSIGNED

(In reply to Jens Stutte [:jstutte] from comment #4)

Created attachment 9280430 [details]
Bug 1770451 - Release assert if a thread shutdown is unexpectedly going to cause a hang on the joining thread. r?#xpcom-reviewers

This wants to address or rule out case 1 and adds some more diagnostics.

(In reply to Jens Stutte [:jstutte] from comment #6)

Created attachment 9280444 [details]
Bug 1770451 - Trace important shutdown events in a thread's name if MOZ_DIAGNOSTIC_ASSERT_ENABLED. r?#xpcom-reviewers

This wants to help us see whether we actually dispatched the nsThreadShutdownAckEvent, for case 2. I still need to figure out an easy way to trace whether it has actually been executed on the joining thread.
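
As a rough illustration of the idea of tracing through the thread name, the sketch below appends a one-character event tag to a name while staying within 15 characters, which is the limit Linux imposes on thread names (16 bytes including the terminator). TagThreadName, the ",X" suffix format and the tag letters are assumptions made for this example and do not describe what the patch does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Linux thread names hold at most 15 characters plus the terminating NUL.
constexpr std::size_t kMaxThreadNameLength = 15;

// Hypothetical helper: returns aBaseName with ",<tag>" appended, truncating
// the base name if needed so that the result still fits the limit.
std::string TagThreadName(const std::string& aBaseName, char aEventTag) {
  std::string tagged = aBaseName.substr(0, kMaxThreadNameLength - 2);
  tagged += ',';
  tagged += aEventTag;
  return tagged;
}
```

A tag such as 'A' for "ack dispatched" would then be visible in the thread list of a minidump, provided the resulting string is actually applied as the thread's name.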

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/dc770b6f3dc0
Release assert if a thread shutdown is unexpectedly going to cause a hang on the joining thread. r=xpcom-reviewers,nika
https://hg.mozilla.org/integration/autoland/rev/6356ff8c8204
Trace important shutdown events in a thread's name if MOZ_DIAGNOSTIC_ASSERT_ENABLED. r=xpcom-reviewers,barret
Flags: needinfo?(kwright)

The leave-open keyword is there and there is no activity for 6 months.
:jstutte, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)
Flags: needinfo?(jstutte)
