Closed Bug 166371 Opened 22 years ago Closed 22 years ago

Crashes at [@ nsThreadPool::Shutdown] in Trunk, N700

Categories

(Core :: Networking, defect)

x86
Windows 98
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla1.2alpha

People

(Reporter: greer, Assigned: timeless)

References

()

Details

(Keywords: crash, qawanted, topcrash)

Crash Data

Attachments

(1 file, 3 obsolete files)

Crashes with the nsThreadPool::Shutdown signature are near the top of the list
of N7.00 crashes. The signature is also showing up in Trunk builds as recently
as 8/26. This needs to be fixed in the next point release.

The stack for this crash is mentioned in bug 134724 comment #19, but may be a
different issue.

Assigning to timeless (owner of bug 134724) for first look.

 Stack trace(Frame) 

	 nsThreadPool::Shutdown	[d:\builds\seamonkey\mozilla\xpcom\threads\nsThread.cpp
 line 745] 
	 nsFileTransportService::Shutdown
[d:\builds\seamonkey\mozilla\netwerk\base\src\nsFileTransportService.cpp  line 226] 
	 nsIOService::Observe
[d:\builds\seamonkey\mozilla\netwerk\base\src\nsIOService.cpp  line 1045] 
	 nsObserverService::NotifyObservers
[d:\builds\seamonkey\mozilla\xpcom\ds\nsObserverService.cpp  line 213] 
	 NS_ShutdownXPCOM	[d:\builds\seamonkey\mozilla\xpcom\build\nsXPComInit.cpp 
line 544] 
	 main	[d:\builds\seamonkey\mozilla\xpfe\bootstrap\nsAppRunner.cpp  line 1847] 
	 WinMain	[d:\builds\seamonkey\mozilla\xpfe\bootstrap\nsAppRunner.cpp  line 1857] 
	 WinMainCRTStartup()  
	 KERNEL32.DLL + 0x1b560 (0xbff8b560)  
	 KERNEL32.DLL + 0x1b412 (0xbff8b412)  
	 KERNEL32.DLL + 0x19dd5 (0xbff89dd5)
Keywords: crash, qawanted, topcrash
Attached file comments
FWIW, user comments from the N700 data.
:-(, i'll look, but i could probably use some help
Assignee: timeless → timeless
ok, i'm going to bet that multiple threads call shutdown

thread 1 and 2 enter:
721 nsThreadPool::Shutdown

thread 1 reaches
733     nsAutoLock lock(mLock);
thread 2 reaches it and gets stuck
thread 1 reaches
756     mThreads = nsnull;
757     return rv;
thread 2 gets the lock and decides to do:
742     rv = mThreads->EnumerateForwards(nsThreadPool::InterruptThreads, nsnull
which is reported as line 745 by talkback just to annoy me.

How's my driving?
Status: NEW → ASSIGNED
Related to bug 160922 (multiple calls for shutdown)?
Keywords: mozilla1.2
so this patch should theoretically work assuming i got the right cause.

mShuttingDown and mThreads are really equivalent conditions both are made false
by ::Shutdown() after it acquires the lock.

This might also be a caller bug, but I don't have an environment here to try to
figure that out. For 1.2a I think a variant of this should do. darin said he'd
look at it.
Target Milestone: --- → mozilla1.2alpha
Attached patch same patch, but for the caller (obsolete) — Splinter Review
Comment on attachment 97723 [details] [diff] [review]
condition and assertion are negotiable

>Index: mozilla/xpcom/threads/nsThread.cpp

>+        NS_ASSERTION(mThreads, "Bug 166371 was Shutdown() called more than once or did someone forget to call Init()?");

can a bug really be shutdown??	perhaps a ':' would do the trick ;-)

this patch seems like the right way to protect against multiple calls
to Shutdown() on different threads.  i'd really like to know if that's
actually the problem here or not.  there is the possibility that
nsFileTransportService::Shutdown is being called while another thread
is inside nsFileTransportService::Add/RemoveSuspendedTransport because
the locking in nsFileTransportService seems pretty broken to me.

sr=darin (fwiw)
Attachment #97723 - Flags: superreview+
Comment on attachment 97727 [details] [diff] [review]
same patch, but for the caller

>Index: mozilla/netwerk/base/src/nsFileTransportService.cpp

>     PR_Lock(mLock);
>-    mShuttingDown = 1;
>+    if (mShuttingDown) {
>+        NS_ASSERTION(!mShuttingDown, "Bug 166371 was Shutdown() called more than once!");
>+        return NS_OK;
>+    }
>+    mShuttingDown = PR_TRUE;
>     PR_Unlock(mLock);


this locking is silly...  assignment to a PRUint32 is atomic on all the
processors
we care about.	if you want, you could rewrite this like so:

  if (mShuttingDown)
      return NS_OK;
  mShuttingDown = PR_TRUE;
Attachment #97727 - Flags: needs-work+
Attachment #97723 - Attachment is obsolete: true
Attachment #97727 - Attachment is obsolete: true
Comment on attachment 97733 [details] [diff] [review]
combined version per darin (fixed1.0.2)

sr=darin, what about the assertion text? :-)

we also need to seriously look into protecting mSuspendedTransportList inside
Shutdown.  things would blow up in a big way if Add/RemoveSuspendedTransport
were called while inside Shutdown :(
Attachment #97733 - Flags: superreview+
Comment on attachment 97733 [details] [diff] [review]
combined version per darin (fixed1.0.2)

r=bz on irc
checked in
leaving open to ponder locking for mSuspendedTransportList
Attachment #97733 - Attachment is obsolete: true
Attachment #97733 - Flags: review+
The assignment is atomic, but a check + assignment is not atomic, is it?  i.e.
could we have a situation where two threads enter Shutdown at approximately the
same time, thread 1 finds mShuttingDown to be false, and before it sets
mShuttingDown to true, thread 2 has checked mShuttingDown and found it to be
false also?  Then both threads would execute the entire function... it seems
like we need a lock around the check _and_ assignment.
i should have qualified my last comment with :
nsFileTransportService::Shutdown() is only called on the UI thread ;-)  we know
this because it is called from nsIOService::Observe() in response to the
"xpcom-shutdown" event.  i don't see any way for this notification to arrive on
any thread other than the UI thread.

i think the crash may actually be caused by "xpcom-shutdown" happening twice.

or, it may be caused by "xpcom-shutdown" happening while a background file
transport thread is inside Add/RemoveSuspendedTransport.  see comment #7.
timeless: Is the patch checked into Trunk only ?
namachi - seeing that it does not have any approval, I'd hope that it's only in
trunk.
Have we verified timeless's fix on the Trunk?  This is currently the #3
topcrasher for Netscape 7.0.  

If his patch is safe and we know it fixed most of the crashes...it might be
worth a shot getting it into Blackbird?  I doubt it, but just thought I would
throw that out there.
Isn't there a (very small) hole in this code?

 nsresult
 nsFileTransportService::Shutdown(void)
 {
-    PR_Lock(mLock);
-    mShuttingDown = 1;
-    PR_Unlock(mLock);
+    if (mShuttingDown)
+        return NS_OK;
+
+    mShuttingDown = PR_TRUE;

Shouldn't that be more like:

     PR_Lock(mLock);
     if (mShuttingDown)
     {
         PR_Unlock(mLock);
         return NS_OK;
     }
     mShuttingDown = PR_TRUE;
     PR_Unlock(mLock);
Component: Networking → Accessibility APIs
no, because Shutdown is only called on the UI thread.
As timeless pointed out, Darin already answered the race-condition question.  I
still prefer using locking, since it's not good to make this sort of unchecked
assumption about how other modules will call you, but in practice we're ok for
now with it.
Comment on attachment 97733 [details] [diff] [review]
combined version per darin (fixed1.0.2)

a=rjesup@wgate.com for 1.0 branch; change mozilla1.0.2+ to fixed1.0.2 when
checked in
rjesup: adding a lock here would buy you very little.  we'd crash and burn
elsewhere if someone tried to shutdown the file transport service on a
background thread.  shutdown is only called from the io service when it receives
a xpcom-shutdown notification, which always occurs on the main thread.  if that
ever changes, much code would need to be rev'd.
Component: Accessibility APIs → Networking
timeless: can you explain how this is keyword==fixed1.0.2 while the
status==assigned?
I see checkins on 9/3 for the TRUNK.  I cannot find any for the BRANCH.

Ben, please verify this fix on the TRUNK.  If it's good, then we should feel
better about it landing on the BRANCH.  Thanks.
Comment on attachment 97733 [details] [diff] [review]
combined version per darin (fixed1.0.2)

http://bonsai.mozilla.org/cvsquery.cgi?module=MozillaTinderboxAll&branch=MOZILL
A_1_0(_0)?_BRANCH&branchtype=regexp&cvsroot=/cvsroot&date=explicit&mindate=1034
201640&maxdate=1034202180&who=timeless%25mac.com
Attachment #97733 - Attachment description: combined version per darin → combined version per darin (fixed1.0.2)
Keywords: fixed1.0.2
Sorry. I'm drawing the line here. Someeone else is going to have to take this
verification.
timeless, darin.  Is this being left open for additional work (comment# 11)?  Or
can it be marked fixed?

I am not seeing this crash in any recent talkback reports and the patch is
checked in.
marking FIXED
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
verified trunk and branch
Status: RESOLVED → VERIFIED
Crash Signature: [@ nsThreadPool::Shutdown]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: