Closed Bug 464419 Opened 16 years ago Closed 15 years ago

thunderbird crashed when exiting NS_LogCOMPtrRelease_P

Categories

(MailNews Core :: Database, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b2

People

(Reporter: eagle.lu, Assigned: rkent)

References

Details

(Keywords: crash, topcrash)

Attachments

(1 file, 2 obsolete files)

I built and run the latest Thunderbird trunk code, TB crashed when exiting.

re-produce steps:
1. open a terminal
2. ./thunderbird -P
3. create a new profile
4. set up a Gmail IMAP account
5. log in
6. quit from TB after showing all messages are shown in Inbox

TB crashed and I got following call stack:
#6  0xb7cf9c6d in NS_LogCOMPtrRelease_P (aCOMPtr=0xaf8491f8, 
    aObject=0xaf8e1d00)
    at /home/brian/moz-ws/community3/1112/comm-central/mozilla/xpcom/base/nsTraceRefcntImpl.cpp:1156
#7  0xb5b0a447 in ~nsCOMPtr (this=0xaf8491f8)
    at ../../../mozilla/dist/include/xpcom/nsCOMPtr.h:508
#8  0xb5b0718b in ~nsParseMailMessageState (this=0xaf8491e0)
    at /home/brian/moz-ws/community3/1112/comm-central/mailnews/local/src/nsParseMailbox.cpp:540
#9  0xb5b07a08 in nsParseMailMessageState::Release (this=0xaf8491e0)
    at /home/brian/moz-ws/community3/1112/comm-central/mailnews/local/src/nsParseMailbox.cpp:502
#10 0xb42f042c in ~nsCOMPtr (this=0xb2d539b8)
    at ../../../mozilla/dist/include/xpcom/nsCOMPtr.h:510
---Type <return> to continue, or q <return> to quit--- 
#11 0xb42e937a in ~nsImapMailFolder (this=0xb2d538b0)
    at /home/brian/moz-ws/community3/1112/comm-central/mailnews/imap/src/nsImapMailFolder.cpp:262
#12 0xb614c1ce in nsRDFResource::Release (this=0xb2d538b0)
    at nsRDFResource.cpp:78
#13 0xb60ff0fd in nsMsgDBFolder::Release (this=0xb2d538b0)
    at /home/brian/moz-ws/community3/1112/comm-central/mailnews/base/util/nsMsgDBFolder.cpp:142
#14 0xb42e902b in nsImapMailFolder::Release (this=0xb2d538b0)
    at /home/brian/moz-ws/community3/1112/comm-central/mailnews/imap/src/nsImapMailFolder.cpp:265
#15 0xb4609874 in nsCOMPtr<nsIMsgFolder>::assign_assuming_AddRef (
    this=0xaf848fbc, newPtr=0x0)
    at ../../../../mozilla/dist/include/xpcom/nsCOMPtr.h:495
#16 0xb46098aa in nsCOMPtr<nsIMsgFolder>::assign_with_AddRef (this=0xaf848fbc, 
    rawPtr=0x0) at ../../../../mozilla/dist/include/xpcom/nsCOMPtr.h:1171
#17 0xb460b796 in nsCOMPtr<nsIMsgFolder>::operator= (this=0xaf848fbc, rhs=0x0)
    at ../../../../mozilla/dist/include/xpcom/nsCOMPtr.h:640
#18 0xb4603ea1 in nsMsgDatabase::CleanupCache ()
    at /home/brian/moz-ws/community3/1112/comm-central/mailnews/db/msgdb/src/nsMsgDatabase.cpp:694
#19 0xb45f2a55 in msgDBModuleDtor (self=0xb632ba30)
    at /home/brian/moz-ws/community3/1112/comm-central/mailnews/db/msgdb/build/n---Type <return> to continue, or q <return> to quit---
sMsgDBFactory.cpp:72
#20 0xb7c74e2e in nsGenericModule::Shutdown (this=0xb632ba30)
    at nsGenericFactory.cpp:340
#21 0xb7c74e5c in ~nsGenericModule (this=0xb632ba30)
    at nsGenericFactory.cpp:237
#22 0xb7c75039 in nsGenericModule::Release (this=0xb632ba30)
    at nsGenericFactory.cpp:245
#23 0xb7cdfbe8 in nsCOMPtr<nsIModule>::assign_assuming_AddRef (
    this=0xb6335288, newPtr=0x0) at ../../dist/include/xpcom/nsCOMPtr.h:495
#24 0xb7cdfc6c in nsCOMPtr<nsIModule>::assign_with_AddRef (this=0xb6335288, 
    rawPtr=0x0) at ../../dist/include/xpcom/nsCOMPtr.h:1171
#25 0xb7ce29ec in nsCOMPtr<nsIModule>::operator= (this=0xb6335288, rhs=0x0)
    at ../../dist/include/xpcom/nsCOMPtr.h:640
#26 0xb7ce1eb9 in nsNativeModuleLoader::ReleaserFunc (aHashedFile=0xb632d284, 
    aLoadData=@0xb6335288)
    at /home/brian/moz-ws/community3/1112/comm-central/mozilla/xpcom/components/nsNativeComponentLoader.cpp:219
#27 0xb7ce2867 in nsBaseHashtable<nsHashableHashKey, nsNativeModuleLoader::NativeLoadData, nsNativeModuleLoader::NativeLoadData>::s_EnumStub (
    table=0xb6c61290, hdr=0xb6335280, number=23, arg=0xbfbe883c)
    at ../../dist/include/xpcom/nsBaseHashtable.h:346
#28 0xb7c62ec9 in PL_DHashTableEnumerate (table=0xb6c61290, 

I think the root cause is 
In class nsParseMailMessageState, there is a memeber  
nsCOMPtr<nsIMsgDatabase> m_backupMailDB;
when ~nsParseMailMessageState() is called, m_backupMailDB will be
dereferenced.  But the objec it points to has been deleted at 

http://mxr.mozilla.org/comm-central/source/mailnews/db/msgdb/src/nsMsgDatabase.cpp#706

The codes around this "delete" look a little weird. I don't understand 
why it was written like this.
Severity: normal → critical
Priority: -- → P1
This bug is related to the patch of the bug 267575
Component: General → Database
Keywords: crash
Product: Thunderbird → MailNews Core
QA Contact: general → database
Summary: thunderbird crashed when exiting → thunderbird crashed when exiting NS_LogCOMPtrRelease_P
While tracing out crashes related to bug 471071, I am seeing a shutdown crash that I traced to the same bit of code that is discussed here in comment 0 (pMessageDB->ForceClosed() at nsMsgDatabase.cpp#737). I traced that code to the original bug 267575, where Boying Lu kindly left a comment pointing to this bug.

Basically, I applied the patch from bug 471071, and I am running a js nsIMsgDatabase listener in a custom extension, similar to the do-nothing extension attached to that bug, but containing the full functionality of the original extension rather than the stripped down version there. I attached a stack trace in that bug. Like Boying Lu, I also find the delete at http://mxr.mozilla.org/comm-central/source/mailnews/db/msgdb/src/nsMsgDatabase.cpp#748 to be very peculiar, but I'm not sure if that is really a cause of what I am seeing.

I still don't have a full story yet, but I am continuing to investigate.
Bug 471678 is #1 topcrash, making this bug topcrash as well.
Flags: blocking-thunderbird3?
Keywords: topcrash
I can reproduce this on Windows using the STR in comment 0, so taking.
Assignee: nobody → kent
Status: NEW → ASSIGNED
OS: Linux → All
Hardware: x86 → All
Let me give my history and interpretation of the "delete pMessageDB" at http://mxr.mozilla.org/comm-central/source/mailnews/db/msgdb/src/nsMsgDatabase.cpp#748 and why I think it must go.

This line was originally added in revision 1.186. It originally had the comment on it "// try this...shake out people holding onto db." This was before there was routine use of bugzilla bugs for changes. The CVS patch said "add in use hdr cache, performance enhancement" (this is not part of the in use hdr cache, so it must be a "performance enhancement".)

The assumption behind this delete seems to be that ForceClosed() will get rid of all legitimate users of a database in the cache, and anyone left standing, since they were too stupid to go away with ForceClosed(), are surely also too stupid to try to properly release their db reference. Since they are too stupid to release themselves, we are going to stop a shutdown memory leak by just deleting the object. Or maybe "shake out people" in the comments means "force a crash so that people will fix their stupid routines," which seems to have succeeded today.

But people are getting smarter. js XPCOM users, or people using nsCOMPtr variables (such as the guilty backup database in this case) will eventually call a Release() and crash. Let's not penalize people for doing the Right Thing.

So I think that I'll try a couple of things that should get rid of the crash.

1) There is an undocumented assumption that anyone holding on to a db reference must also be a db listener, and release that reference with the OnAnnouncerGoingAway() callback. That is what makes ForceClosed() work. I'll add that to users of the backup database.

2) Let's change this "delete pMessageDB" to an assertion instead of a delete.

It would be good to have a more robust mechanism to close the database, that does not rely on this OnAnnouncerGoingAway, but for now let's just document this in the idl, and leave it.
Attached patch Remove delete in CleanupCache (obsolete) — Splinter Review
I cure the crash by reversing the order of the access to the elements in the cache, and also removing the delete. As it turns out, it is actually the reverse of the order that cures the crash, as the backup database then gets deleted naturally without causing the refcount to be != 0 on any pass. Had I not reversed, then there would be one pass where refcount == 1, but on another pass two items would be deleted, which would have cleaned out the cache without needing the delete pMessageDB.

But that is all coincidence. There are issues I want to investigate with the backup database closing, which I will pursue in bug 476911. This current patch should prevent further crashes from this delete, which will then result in the assertion that was already in the code. This is a better solution than crashing.

I'd like to get this in ASAP because changes like this can have unforeseen consequences, and I would like to see both that it cures the current top crasher, as well as has no unforeseen issues, before we get into the final mad rush to TB beta 2.

I added Neil as a reviewer since he had opinions about this code in bug 471071 comment 16.
Attachment #360564 - Flags: superreview?(bienvenu)
Attachment #360564 - Flags: review?(neil)
Target Milestone: --- → Thunderbird 3.0b2
wouldn't a trivial fix for this crash be to set m_msgParser to null in nsImapMailFolder::Shutdown()? We should obviously fix the underlying problem, but if you're worried about unforeseen consequences, that would seem like a much safer interim fix.
There are a number of possible trivial fixes for this, and for each case where this delete pMessageDB fires. I even mentioned one trivial fix - reverse the order of the cache removal. I'll investigate other underlying issues in bug 476911

But no underlying fixes at all are really needed in this case - other than to stop this delete pMessageDB from firing, because it is that delete that prevents the natural shutdown releases from doing their job. The vast majority of the time that this delete pMessageDB fires, it is going to cause problems. Do we really want to go through a painful process of identifying the causes of each case in a crash, instead of an Assertion? Admittedly we will probably ignore a shutdown Assertion, but is a crash to get our attention the right answer? A shutdown memory leak, while regrettable, is much less of a problem than a shutdown crash. I'm not even sure this leak occurs any more, a lot has changed since 2000.
Comment on attachment 360564 [details] [diff] [review]
Remove delete in CleanupCache

This loop looks scary - any chance you could convert this to an nsTObserverArray? ;-)
(In reply to comment #10)
I seem to be doing reverse loops alot lately, so I've gotten so forward loops are starting to look scarier than reverse loops ;-)

I have never implemented an nsTObserverArray (though I have cursed other's implementations of them since it makes it really hard to add a breakpoint on the call to the listener. But I have the same problem with NS_Assert). My limited understanding of them is that they are trying to solve the same problem that this loop is doing.

But you don't REALLY want to try to convert the message database cache into an observer array, do you? It seems to me that would be forcing it into a mold it does not really match, for the sake of this one loop.
OK, so I tried running with this and the patch to bug 471071, but unfortunately I hit the assertion when I shut down...
Would you be able to set a debugger on the assertion you are hitting, and tell me the name and account type of any databases that are still in the cache after cache cleanup?
Attachment #360564 - Flags: review?(neil) → review+
Comment on attachment 360564 [details] [diff] [review]
Remove delete in CleanupCache

I still think an nsTObserverArray is the correct long-term fix.
Thinking about this some more, perhaps we should downgrade the assertion to a warning. What is the problem we are trying to solve with this code? I'm not sure the consequences of failure are worth an assertion.
Attachment #360564 - Flags: superreview?(bienvenu)
Comment on attachment 360564 [details] [diff] [review]
Remove delete in CleanupCache

I'm seeing the assertion get hit on a few unit tests. I'll need to fix that obviously.
I propose here that we replace the assertion with a warning, as it is really a test for memory leaks - which do not fit the criteria for an assertion.

The original "delete pMessageDB" was an attempt to eliminate shutdown memory leaks, but because it broke XPCOM rules it caused occasional issues. I believe that the Assertion from the first patch, that I removed in this one, caught cases where a leak was occurring, that might have been fixed by the previous delete. Perhaps Neil hit one of those in his tests, and we were certainly hitting them in the unit tests.

For the unit test issues, I poured over reference count logs to determine the source of the leak. In ~nsMsgHdr there were cases where m_mdbRow == 0, but there was still a reference to the db, which then was not getting released. That has been fixed. There might be some high-level effects of this, as the missing release() could have caused symptoms where the database could not be closed correctly in operations such as reindex or compact - but I don't really have any evidence of that.

With the change that I added, I could leave the assertion in and it would now pass unit tests - but I still think that we should remove it, so that's what the patch does.

Anyone, have a look. Neil, sorry to ask you to review again, but this is not a trivial change from the original patch.
Attachment #360564 - Attachment is obsolete: true
Attachment #360865 - Flags: superreview?(bienvenu)
Attachment #360865 - Flags: review?(neil)
Attachment #360865 - Attachment is patch: true
Attachment #360865 - Attachment mime type: application/octet-stream → text/plain
Comment on attachment 360865 [details] [diff] [review]
Cleanup ~nsMsgHdr leak, change assertion to a warning

>+    if (GetNumInCache() != 0)
>+      NS_WARNING("some msg dbs left open");  // better not be any open db's.
Nit: use NS_WARN_IF_FALSE
Attachment #360865 - Flags: review?(neil) → review+
(In reply to comment #18)
> Nit: use NS_WARN_IF_FALSE

I used the two-line solution on purpose to make it easier to add a debugger to this. I've already seen that these leaks are intermittent, and I wanted to be able to easily add a breakpoint to the warning line. Is there a way to do that with NS_WARN_IF_FALSE for an expression like GetNumInCache()?
I'll run with this...
(In reply to comment #19)
> (In reply to comment #18)
> > Nit: use NS_WARN_IF_FALSE
> I used the two-line solution on purpose to make it easier to add a debugger to
> this. I've already seen that these leaks are intermittent, and I wanted to be
> able to easily add a breakpoint to the warning line.
Then use #ifdef DEBUG to avoid release builds from calling GetNumInCache().
I'm hitting that warning still; one of the db's was a backup database, which I think would have been fixed by my previous suggestion; one was a local folder that gets pop3 mail filtered into it.
The backup db probably needs a listener when used, so that OnAnnouncerGoingAway can remove the reference. I opened bug 476491 to investigate. The copy service might be another issue, if it has a db reference it will need a listener or other method as well.

So the decision is whether to land this patch and fix the warnings as they occur, just continue to leave the old code (with its occasional crashes), or to try to fix all discovered leaks before we do anything else. My guess is that other cases will be easier than the message header case that I already fixed, as the simple solution is to just add the listener when possible. (Though msgHdr could still cause issues in other contexts, since it holds onto a db reference but does not have a listener. Adding the listener there is not easy.)

These extra cases where this warning is occurring should be fixed regardless of what we do with this bug, as they represent cases where ForceClosed did not successfully remove all db references. If ForceClosed() fails, then we could get all sorts of strange issues with things that need it, like reindex, rename,  or compact.

So where would you like for me to go with this, David? BTW this patch is not nearly as important to me as bug 471071. I hope that the issues here do not also cloud consideration of that other patch.
I'm planning on landing this as long as I don't run into any new issues; I was just recording for posterity the warnings I saw.

The msg hdr db issue could probably be handled in nsMsgDatabase::ClearHeaderEnumerator by adding a line to release m_mdb. I'll try it out and attach a patch, if that doesn't cause problems.

I'm still working through my review q; I'll get to bug 471071 this weekend, I hope!
I'm going to update the patch to include the #ifdef DEBUG, and your suggestion from comment 8 - though I do wonder if the delete at shutdown is just masking a problem that we should really be solving.
I've already done both of those locally...but I wasn't suggesting a delete, just a releasing of the reference, which I think is fair game at shutdown time...
Well here's my updated version, feel free to do your own variation.
Attachment #360865 - Attachment is obsolete: true
Attachment #360865 - Flags: superreview?(bienvenu)
Comment on attachment 361005 [details] [diff] [review]
#ifdef for warning, null parser at IMAP shutdown

sr=bienvenu, though I changed it to use NS_WARN_IF_FALSE per Neil.
Attachment #361005 - Flags: superreview+
should be fixed...
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment on attachment 361005 [details] [diff] [review]
#ifdef for warning, null parser at IMAP shutdown

GetNumInCache() was only defined in DEBUG builds, so I had to move it outside an #ifdef DEBUG to get this to build.
(In reply to comment #28)
> (From update of attachment 361005 [details] [diff] [review])
> sr=bienvenu, though I changed it to use NS_WARN_IF_FALSE per Neil.
Actually you don't need the #ifdef DEBUG with the NS_WARN_IF_FALSE.
Flags: blocking-thunderbird3?
several days after 3.0b2 ships this can be verified against 3.0b2 topcrash list to see if it's 100% gone. should be obvious from the numbers - right now it's still #1 topcrash for 3.0b1 
1 nsCOMPtr_base::~nsCOMPtr_base() 60
You need to log in before you can comment on or make changes to this bug.