Closed Bug 332730 Opened 18 years ago Closed 18 years ago

lock reentrancy through nsCacheEntryDescriptor::Close (deadlock)

Categories

(Core :: Networking: Cache, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla1.9alpha1

People

(Reporter: benjamin, Assigned: darin.moz)

References

Details

Attachments

(2 files, 3 obsolete files)

On my xptcall branch I may have changed proxy event processing or proxied object releasing; but I'm hitting the following stack pretty consistently. Seems to me that nsCacheEntryDescriptor::Close ought to protect against reentrancy somehow:

necko.dll!nsCacheEntryDescriptor::Close()  Line 415
necko.dll!nsCacheEntryDescriptor::~nsCacheEntryDescriptor()  Line 75
necko.dll!nsCacheEntryDescriptor::`scalar deleting destructor'()  + 0xf bytes
necko.dll!nsCacheEntryDescriptor::Release()  Line 53 + 0x90 bytes
necko.dll!nsCOMPtr<nsICacheEntryDescriptor>::~nsCOMPtr<nsICacheEntryDescriptor>()  Line 584
necko.dll!nsCacheService::ProcessPendingRequests(nsCacheEntry * entry=0x04291798)  Line 1572
necko.dll!nsCacheService::CloseDescriptor(nsCacheEntryDescriptor * descriptor=0x04291818)  Line 1352 + 0xf bytes
necko.dll!nsCacheEntryDescriptor::Close()  Line 420 + 0x9 bytes
necko.dll!nsCacheEntryDescriptor::~nsCacheEntryDescriptor()  Line 75
necko.dll!nsCacheEntryDescriptor::`scalar deleting destructor'()  + 0xf bytes
necko.dll!nsCacheEntryDescriptor::Release()  Line 53 + 0x90 bytes
necko.dll!nsCOMPtr<nsICacheEntryDescriptor>::assign_assuming_AddRef(nsICacheEntryDescriptor * newPtr=0x00000000)  Line 569
necko.dll!nsCOMPtr<nsICacheEntryDescriptor>::assign_with_AddRef(nsISupports * rawPtr=0x00000000)  Line 1225
necko.dll!nsCOMPtr<nsICacheEntryDescriptor>::operator=(nsICacheEntryDescriptor * rhs=0x00000000)  Line 714
necko.dll!nsHttpChannel::CloseCacheEntry(unsigned int status=0)  Line 1747
necko.dll!nsHttpChannel::OnStopRequest(nsIRequest * request=0x04291ed8, nsISupports * ctxt=0x00000000, unsigned int status=0)  Line 4162
necko.dll!nsInputStreamPump::OnStateStop()  Line 567
necko.dll!nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream * stream=0x04291c80)  Line 391 + 0xb bytes
xpcom_core.dll!nsInputStreamReadyEvent::EventHandler(PLEvent * plevent=0x04291fc4)  Line 121
xpcom_core.dll!PL_HandleEvent(PLEvent * self=0x04291fc4)  Line 688 + 0xa bytes
er... is the destructor reentering itself? and why does stuff call ProcessPendingEvents while holding a lock?
The destructor is reentering, but not for the same object... it's a different object being released/destroyed.
Alfred, any thoughts?
In this case, nsCacheEntryDescriptor::SetDataSize has a lock - entries are evicted, which eventually causes a CacheEntryDescriptor to get Closed, which tries to get the same lock (I think)
*** Bug 339819 has been marked as a duplicate of this bug. ***
*** Bug 340916 has been marked as a duplicate of this bug. ***
The cache service has one global lock.  See bug 340916 for a Linux stack.
Summary: lock reentrancy through nsCacheEntryDescriptor::Close → lock reentrancy through nsCacheEntryDescriptor::Close (deadlock)
Attached patch v1 patch (obsolete) — Splinter Review
This should do the trick.  My solution is to "proxy the release" of the cache entry descriptor to the thread where OnCacheEntryAvailable is called.  Strictly speaking we should be able to release cache entry descriptors on any thread, but in this case we have to avoid releasing it while holding the cache service lock.  That means deferring the Release until later, which in this case can be when the cache listener's OnCacheEntryAvailable method is called.

I also noticed that we should proxy the destruction of the cache listener to the thread on which its OnCacheEntryAvailable method is called.  That way we avoid executing other code during the cache service lock.
Attachment #224974 - Flags: review?(alfredkayser)
-> me
Assignee: nobody → darin
Severity: normal → critical
Status: NEW → ASSIGNED
Priority: -- → P1
That patch doesn't fix it.  All it does is add a ProxyRelease on the stack, because it calls the Release immediately.

#4  0x002a4214 in PR_Lock (lock=0x866ec20) at ptsynch.c:206
#5  0x01306095 in nsAutoLock::nsAutoLock (this=0xbfb4253c, aLock=0x866ec20)
    at ../../../../dist/include/xpcom/nsAutoLock.h:183
#6  0x01373730 in nsCacheEntryDescriptor::Close (this=0x100146d8)
    at nsCacheEntryDescriptor.cpp:414
#7  0x0137379f in nsCacheEntryDescriptor::~nsCacheEntryDescriptor$delete (
    this=0x100146d8) at nsCacheEntryDescriptor.cpp:73
#8  0x0137320f in nsCacheEntryDescriptor::Release (this=0x100146d8)
    at nsCacheEntryDescriptor.cpp:51
#9  0x00db6462 in nsCOMPtr_base::~nsCOMPtr_base$base (this=0x1055a39c)
    at nsCOMPtr.cpp:81
#10 0x00596842 in nsCOMPtr<nsICacheEntryDescriptor>::~nsCOMPtr (
    this=0x1055a39c) at ../../../dist/include/xpcom/nsCOMPtr.h:542
#11 0x0059616a in nsMsgMailNewsUrl::~nsMsgMailNewsUrl$base (this=0x1055a37c)
    at nsMsgMailNewsUrl.cpp:79
#12 0x02818527 in nsImapUrl::~nsImapUrl$delete (this=0x1055a378)
    at nsImapUrl.cpp:151
#13 0x00595e87 in nsMsgMailNewsUrl::Release (this=0x1055a37c)
    at nsMsgMailNewsUrl.cpp:82
---Type <return> to continue, or q <return> to quit---
#14 0x028182e8 in nsImapUrl::Release (this=0x1055a378) at nsImapUrl.cpp:155
#15 0x00db6462 in nsCOMPtr_base::~nsCOMPtr_base$base (this=0xfdc61ec)
    at nsCOMPtr.cpp:81
#16 0x00ebd43c in nsCOMPtr<nsIURI>::~nsCOMPtr (this=0xfdc61ec)
    at ../../../dist/include/xpcom/nsCOMPtr.h:542
#17 0x00ebfb0b in imgRequest::~imgRequest$delete (this=0xfdc61d0)
    at imgRequest.cpp:91
#18 0x00ebdb7d in imgRequest::Release (this=0xfdc61d0) at imgRequest.cpp:74
#19 0x00dc227c in NS_ProxyRelease (target=0x8561930, doomed=0xfdc61d0,
    alwaysProxy=0) at nsProxyRelease.cpp:75
#20 0x01371e6a in nsCacheEntry::~nsCacheEntry (this=0x1049f960)
    at nsCacheEntry.cpp:93
#21 0x0137a231 in nsMemoryCacheDevice::EvictEntry (this=0x88f3110,
    entry=0x1049f960, deleteEntry=1) at nsMemoryCacheDevice.cpp:349
#22 0x0137a425 in nsMemoryCacheDevice::EvictEntriesIfNecessary (this=0x88f3110)
    at nsMemoryCacheDevice.cpp:370
#23 0x0137a66e in nsMemoryCacheDevice::OnDataSizeChange (this=0x88f3110,
    entry=0x10b3fca0, deltaSize=13719552) at nsMemoryCacheDevice.cpp:317
#24 0x0137798b in nsCacheService::OnDataSizeChange (entry=0x10b3fca0,
    deltaSize=13719552) at nsCacheService.cpp:1376
#25 0x01373eb2 in nsCacheEntryDescriptor::SetDataSize (this=0x10b40048,
    dataSize=13719552) at nsCacheEntryDescriptor.cpp:223
Whoops.  I need to set the forceAsync flag to true!  Thanks for testing the patch.
Actually that stack is a little different.  The nsCacheEntry data item is being released while holding onto the cache lock, which is causing a similar problem to the bug reported.
Attached patch v2 patch (obsolete) — Splinter Review
OK, here's a better version of the patch.  It adds support for queuing up objects to be released when we exit the cache lock, and it makes use of that to delay the release of nsCacheEntry::mData as well as nsCacheRequest::mListener.
Attachment #224974 - Attachment is obsolete: true
Attachment #225400 - Flags: superreview?(bzbarsky)
Attachment #225400 - Flags: review?(cbiesinger)
Attachment #224974 - Flags: review?(alfredkayser)
Comment on attachment 225400 [details] [diff] [review]
v2 patch

+    // This method locks the cache service.  Use nsCacheServiceAutoLock instead
+    // of calling this method directly.
+    static void      Lock();
+
+    // This method unlocks the cache service.
+    static void      Unlock();

why not make this private then, and nsCacheServiceAutoLock a friend?
Attachment #225400 - Flags: review?(cbiesinger) → review+
Comment on attachment 225400 [details] [diff] [review]
v2 patch

This patch has been working great for me, no lockups in two days.  Thank you!
I won't get to this until sometime next week...
Attachment #225400 - Flags: superreview?(bzbarsky) → superreview?(bryner)
Comment on attachment 225400 [details] [diff] [review]
v2 patch

>Index: netwerk/cache/src/nsCacheEntry.h
>+    nsISupports *           mData;           // 

Document that this is a strong ref and that we're not using an nsCOMPtr to make it easier to call nsCacheService::ReleaseObject_Locked?

>Index: netwerk/cache/src/nsCacheRequest.h

>+    nsICacheListener *         mListener;

Same here.

sr=bzbarsky with those nits fixed
Attachment #225400 - Flags: superreview?(bryner) → superreview+
> why not make this private then, and nsCacheServiceAutoLock a friend?

OK, sure... I'll do that.
fixed-on-trunk
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Attached patch v2.1 patch (obsolete) — Splinter Review
Here's the version of the patch that I commited to the trunk.  I'd like to get this on the 1.8 branch once it has baked a bit.  We might want to even consider it for a 1.8.0.x update.
Attachment #225400 - Attachment is obsolete: true
Attachment #226262 - Flags: approval1.8.0.6?
Attachment #226262 - Flags: approval-branch-1.8.1?
This might have regressed Tp on btek a bit.
this also seems to have leaked 6K on nye bloat tbox.  bug 341700 is causing a bunch of leakage, and this checkin might have just made that bug leak more.  balsa has been orange since the checkin.  :(
Attachment #226262 - Flags: approval-branch-1.8.1? → approval1.8.1?
The leak log on balsa looks like this:

--ALL-LEAKS-----------------------------------leaks------leaks%-----------------------
TOTAL                                          8204     -10.32%
nsDiskCacheStreamIO                            2752     -14.00%
nsDiskCacheBinding                             2236     -14.00%
XPCWrappedNative                               1456       0.00%
nsLocalFile                                     496     -20.00%
nsXPCWrappedJS                                  240       0.00%
XPCNativeScriptableShared                       216       0.00%
nsXPCWrappedJSClass                             176       0.00%
nsCacheService                                  124       0.00%
xptiInterfaceInfo                                80       0.00%
nsThread                                         72       0.00%
nsRect                                           64       0.00%
nsXPCComponents                                  56       0.00%
nsStringBuffer                                   48     -14.29%
nsBaseAppShell                                   48       0.00%
nsJSID                                           36       0.00%
nsCacheEntryHashTable                            36       0.00%
nsSystemPrincipal                                36       0.00%
BackstagePass                                    20       0.00%
nsRunnable                                       12       0.00%

So it does look likely that this patch is causing leaks there...
Attachment #226262 - Flags: approval1.8.1?
Attachment #226262 - Flags: approval1.8.0.6?
backed out.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
*** Bug 340224 has been marked as a duplicate of this bug. ***
OK, I found the problem.  nsCacheEntry's destructor had a bogus early return that was preventing mData from being released.  For disk cache entries, mData is a nsDiskCacheBinding, and it has an owning reference to a nsDiskCacheStreamIO.  Hence, the leak is fixed by simply removing the bogus early return.  That was not causing a problem previously because the code used to use a nsCOMPtr for mData, which would have its destructor called irregardless.

Revised patch checked into trunk.
Attached patch v2.2 patchSplinter Review
Here's the patch that I checked in.  (Ignore the unintended printf.)
Attachment #226262 - Attachment is obsolete: true
fixed-on-trunk
Status: REOPENED → RESOLVED
Closed: 18 years ago18 years ago
Resolution: --- → FIXED
(In reply to comment #20)
> Created an attachment (id=226262) [details]
> v2.1 patch
> 
> Here's the version of the patch that I commited to the trunk.  I'd like to get
> this on the 1.8 branch once it has baked a bit.  We might want to even consider
> it for a 1.8.0.x update.

This never made it into 1.8. 

Should it land on 1.8 for Thunderbird 2, which is going to live on for many more months?  bug 339819 cites TB.  Other examples? 
* Bug 338232 cites Seamonkey (Frequent PR_Assert failures "lock->owner != me")
* bug 400435 thunderbird appeared to deadlock  (won't let me view attachment so can't tell)

Is the patch too high risk?
I don't think we know if this is a large problem for 2.0, or how high a risk the patch is. The patch would also affect FF 2.0x while that's still around as well (but perhaps it's end-of-life is nigh or arrived, I'm not sure)
Years ago Scott did a series of m-builds and I suggest doing am M-Build with this patch put in for regression testing. It would enable some fact finding to make a landing derision. The FTP directory is branched out of the normal navigation flow so has a low risk of inadvertent access.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: