Closed Bug 422472 Opened 16 years ago Closed 16 years ago

Lock re-entrance in TimerThread, deadlock

Categories

(Core :: XPCOM, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla1.9beta5

People

(Reporter: mayhemer, Assigned: mayhemer)

References

()

Details

(Keywords: hang)

Attachments

(1 file, 2 obsolete files)

Happened with one day old TRUNK build during exit.

This is back trace of the main thread:

0  0x900248c7 in semaphore_wait_signal_trap ()
#1  0x90001582 in pthread_mutex_lock ()
#2  0x00677bab in PR_Lock (lock=0x2312220) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:206
#3  0x0177c839 in nsAutoLock::nsAutoLock (this=0xbfffef3c, aLock=0x2312220) at ../../dist/include/xpcom/nsAutoLock.h:219
#4  0x0172d80f in TimerThread::RemoveTimer (this=0x23120f0, aTimer=0x23de5f0) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/xpcom/threads/TimerThread.cpp:378
#5  0x0172bfd5 in nsTimerImpl::Cancel (this=0x23de5f0) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/xpcom/threads/nsTimerImpl.cpp:286
#6  0x28b97b1d in imgContainer::Anim::~Anim (this=0x3450bd80) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/modules/libpr0n/src/imgContainer.h:194
#7  0x28b4d792 in imgContainer::~imgContainer (this=0x2d6ca2f0) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/modules/libpr0n/src/imgContainer.cpp:94
#8  0x28b4a884 in imgContainer::Release (this=0x2d6ca2f0) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/modules/libpr0n/src/imgContainer.cpp:73
#9  0x017915d9 in nsTimerImpl::ReleaseCallback (this=0x23de5f0) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/xpcom/threads/nsTimerImpl.h:119
#10 0x0172de17 in TimerThread::Shutdown (this=0x23120f0) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/xpcom/threads/TimerThread.cpp:169
#11 0x0172bf6d in nsTimerImpl::Shutdown () at /Users/starapica/Documents/mozilla/TRUNK/mozilla/xpcom/threads/nsTimerImpl.cpp:199
#12 0x016d54ce in NS_ShutdownXPCOM_P (servMgr=0x2312e04) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/xpcom/build/nsXPComInit.cpp:739
#13 0x00207972 in ScopedXPCOMStartup::~ScopedXPCOMStartup (this=0xbffff4ac) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/toolkit/xre/nsAppRunner.cpp:909
#14 0x0020f3b5 in XRE_main (argc=4, argv=0xbffff7d4, aAppData=0x230b9e0) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/toolkit/xre/nsAppRunner.cpp:3195
#15 0x00002798 in main (argc=4, argv=0xbffff7d4) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/browser/app/nsBrowserApp.cpp:158

TimerThread.cpp:169 iterates all timers and calls ReleaseCallback on each under the mLock being held.
TimerThread.cpp:378 enters the same mLock a second time.

I am curios why the assertion !mLocked didn't fail as I am used to on windows.

Just to report everything, this is back trace of the timer thread:
#0  0x900248c7 in semaphore_wait_signal_trap ()
#1  0x90001582 in pthread_mutex_lock ()
#2  0x90047d7c in pthread_cond_timedwait ()
#3  0x00677f5a in pt_TimedWait (cv=0x2311dd4, ml=0x2312220, timeout=97) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:280
#4  0x00678472 in PR_WaitCondVar (cvar=0x2311dd0, timeout=97) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:407
#5  0x0172dbce in TimerThread::Run (this=0x23120f0) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/xpcom/threads/TimerThread.cpp:334
#6  0x0172893d in nsThread::ProcessNextEvent (this=0x2357400, mayWait=1, result=0xb011ee8c) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/xpcom/threads/nsThread.cpp:510
#7  0x016cc8f3 in NS_ProcessNextEvent_P (thread=0x2357400, mayWait=1) at nsThreadUtils.cpp:227
#8  0x01728b4a in nsThread::ThreadFunc (arg=0x2357400) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/xpcom/threads/nsThread.cpp:254
#9  0x0067efab in _pt_root (arg=0x2357610) at /Users/starapica/Documents/mozilla/TRUNK/mozilla/nsprpub/pr/src/pthreads/ptthread.c:221
#10 0x90024227 in _pthread_body ()

It simply waits for notification.

STR:
1. go to http://images.google.cz/imgres?imgurl=http://www.getoggz.com/anim/AnimationWizard8.gif&imgrefurl=http://www.getoggz.com/GetOggz_Com/About_Oggz/&h=240&w=327&sz=700&hl=cs&start=9&um=1&tbnid=X8qxaVeqzz53NM:&tbnh=87&tbnw=118&prev=/images%3Fq%3Doggz%26um%3D1%26hl%3Dcs%26lr%3Dlang_cs%26sa%3DN

2. use Minefield > Quit to exit
3. Minefield hangs sometimes. This seems not to be reproducible.

My suggestion is to copy mTimers array to a local array copy, delete mTimers under the lock and then release the timers using local array copy outside the lock.
That sounds correct. Can you submit a patch?
Attached patch Fix (obsolete) — Splinter Review
Tested on Win and Mac, unfortunately I cannot reproduce this hang anymore to be sure this helps and doesn't break anything.
Attachment #308940 - Flags: review?(benjamin)
Comment on attachment 308940 [details] [diff] [review]
Fix

>Index: xpcom/threads/TimerThread.cpp

>+  nsTimerImpl *timer;
>+  PRInt32 timersCount = timers.Count();
>+  for (PRInt32 i = 0; i < timersCount; i++) {
>+    timer = static_cast<nsTimerImpl*>(timers[i]);

No need for "timer" to be outside the loop:

  nsTimerImpl *timer = static_cast<nsTimerImpl*>(timers[i]);

with that nit fixed, r=me... I'd like sr=brendan on this one for double-checking

drivers, we should really take this for 1.9/190x, since it's a timing-dependent hang that you can experience in the real world. Honza, can you attach an updated patch so we can request sr+approval on something that's ready for checkin?
Attachment #308940 - Flags: review?(benjamin) → review+
Attached patch Fix final (obsolete) — Splinter Review
I also tested for leaks - none is present.
Attachment #308940 - Attachment is obsolete: true
Attachment #308946 - Flags: superreview?(brendan)
Attachment #308946 - Flags: review?(benjamin)
Attachment #308946 - Flags: review?(benjamin) → review+
Flags: blocking1.9?
Please cc: me in future when requesting review (wish bugzilla did that for you). Thanks,

/be
Comment on attachment 308946 [details] [diff] [review]
Fix final

This is a good fix in general, to avoid deadlock calling out of module through a destructor to who-knows-what lock-taking code.

/be
Attachment #308946 - Flags: superreview?(brendan) → superreview+
Comment on attachment 308946 [details] [diff] [review]
Fix final

Wouldn't nsTArray<nsTimerImpl*> be type-safe without modifying reference counts?  nsVoidArray is rarely the right answer.  Also would be good to describe in comments (citing this bug by number) why the copy-to-local-array dance has to happen.

>   // These two internal helper methods must be called while mLock is held.
>   // AddTimerInternal returns the position where the timer was added in the
>   // list, or -1 if it failed.
>   PRInt32 AddTimerInternal(nsTimerImpl *aTimer);
>   PRBool  RemoveTimerInternal(nsTimerImpl *aTimer);
>+  void    ReleaseTimerInternal(nsTimerImpl *aTimer);

Can you update two to three here, please?
I thought about nsTArray, but it would require doing some magic with the existing mTimers array, which isn't worth it. We already have a copy-constructor that makes copying a nsVoidArray easy.
(In reply to comment #7)
> (From update of attachment 308946 [details] [diff] [review])
> Can you update two to three here, please?
> 

No, because ReleaseTimerInternal may not be called under the lock during shutdown - that is fix for this bug together with call of timer->ReleaseCallback() outside the lock. However, ReleaseTimerInternal might be called under the lock when called from other places through RemoveTimer(Internal) but it will not release the timer its self because there is always left one more strong reference to it.

(In reply to comment #7)
> (From update of attachment 308946 [details] [diff] [review])
> Wouldn't nsTArray<nsTimerImpl*> be type-safe without modifying reference
> counts?  nsVoidArray is rarely the right answer.  Also would be good to
> describe in comments (citing this bug by number) why the copy-to-local-array
> dance has to happen.

I wouldn't change this now, it is too risky IMHO. But filling a new bug for this issue (more cleanup then issue) might be appropriate.

I agree that commenting the code change is in place. I will submit one more patch to reflect it.
Status: NEW → ASSIGNED
This patch should be checked in.
Attachment #308946 - Attachment is obsolete: true
Attachment #309070 - Flags: approval1.9?
Comment on attachment 309070 [details] [diff] [review]
Fix final + added explanatory comment

a1.9+=damons
Attachment #309070 - Flags: approval1.9? → approval1.9+
Keywords: checkin-needed
Checking in xpcom/threads/TimerThread.cpp;
/cvsroot/mozilla/xpcom/threads/TimerThread.cpp,v  <--  TimerThread.cpp
new revision: 1.30; previous revision: 1.29
done
Checking in xpcom/threads/TimerThread.h;
/cvsroot/mozilla/xpcom/threads/TimerThread.h,v  <--  TimerThread.h
new revision: 1.12; previous revision: 1.11
done
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Keywords: checkin-needed
OS: Mac OS X → All
Hardware: PC → All
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9beta5
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: