Closed Bug 471071 Opened 15 years ago Closed 15 years ago

js db listeners cause double destroy of nsMsgDatabase

Categories

(MailNews Core :: Database, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b2

People

(Reporter: Bienvenu, Assigned: rkent)

Details

Attachments

(4 files, 2 obsolete files)

Attached patch proposed fixSplinter Review
If you add a listener to a db from js, sending the announcer going away notification to the js listener can cause a double destroy of the db, because it can addref the db pointer, and then releaseref it at gc time, sending the ref count back to 0, and causing another delete.

I have a possible fix for this, but I wonder if there's some built-in mechanism to fix this. I'll attach a patch.
Attachment #354414 - Flags: superreview?(bugzilla)
Attachment #354414 - Flags: review?(neil)
David, have you got an example of where this may be happening? or a test-case? It seems a pretty strange scenario.
rkent has a do-nothing extension that demonstrates the problem - I could attach it, w/ his permission...
of course you have my permission.  Please attach the extension.
Running with this patch, I no longer get the shutdown crash in the do-nothing extension. However, I am still getting a shutdown crash in the original extension from which I derived the do-nothing extension. Let me investigate some more to see if I can understand what is happening.
The crash I'm getting on shutdown is on this line http://mxr.mozilla.org/comm-central/source/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp#3390 :

    else if(NS_SUCCEEDED(wrapper->mThread->IsOnCurrentThread(&val)) && !val)

because wrapper->mThread is 0.

Call Stack:

	xpc3250.dll!DEBUG_CheckWrapperThreadSafety(const XPCWrappedNative * wrapper=0x04e8d380)  Line 3390 + 0x22 bytes	C++
 	xpc3250.dll!XPCCallContext::XPCCallContext(XPCContext::LangType callerLanguage=LANG_JS, JSContext * cx=0x04ffc750, JSObject * obj=0x029d1020, JSObject * funobj=0x00000000, long name=106641244, unsigned int argc=4294967295, long * argv=0x00000000, long * rval=0x00000000)  Line 151 + 0xc bytes	C++
 	xpc3250.dll!XPC_WN_NoHelper_Resolve(JSContext * cx=0x04ffc750, JSObject * obj=0x029d1020, long idval=106641244)  Line 714	C++
 	js3250.dll!js_LookupPropertyWithFlags(JSContext * cx=0x04ffc750, JSObject * obj=0x029d1020, long id=106641244, unsigned int flags=65535, JSObject * * objp=0x0012e328, JSProperty * * propp=0x0012e318)  Line 3508 + 0xf bytes	C++
 	js3250.dll!js_GetPropertyHelper(JSContext * cx=0x04ffc750, JSObject * obj=0x029d1020, long id=106641244, long * vp=0x0012f190, JSPropCacheEntry * * entryp=0x0012ef7c)  Line 3772 + 0x23 bytes	C++
 	js3250.dll!js_Interpret(JSContext * cx=0x04ffc750)  Line 4403 + 0x28 bytes	C++
 	js3250.dll!js_Invoke(JSContext * cx=0x04ffc750, unsigned int argc=1, long * vp=0x069768d0, unsigned int flags=0)  Line 1331 + 0x9 bytes	C++
 	xpc3250.dll!nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS * wrapper=0x069c7928, unsigned short methodIndex=7, const XPTMethodDescriptor * info=0x05ede748, nsXPTCMiniVariant * nativeParams=0x0012f5ac)  Line 1606 + 0x1b bytes	C++
 	xpc3250.dll!nsXPCWrappedJS::CallMethod(unsigned short methodIndex=7, const XPTMethodDescriptor * info=0x05ede748, nsXPTCMiniVariant * params=0x0012f5ac)  Line 564	C++
 	xpcom_core.dll!PrepareAndDispatch(nsXPTCStubBase * self=0x069c7990, unsigned int methodIndex=7, unsigned int * args=0x0012f66c, unsigned int * stackBytesToPop=0x0012f65c)  Line 114 + 0x21 bytes	C++
 	xpcom_core.dll!SharedStub()  Line 142	C++
 	msgdb.dll!nsMsgDatabase::NotifyAnnouncerGoingAway()  Line 700 + 0x51 bytes	C++
 	msgdb.dll!nsMsgDatabase::NotifyAnnouncerGoingAway()  Line 700 + 0x51 bytes	C++
 	msgdb.dll!nsMsgDatabase::ForceClosed()  Line 1240	C++
 	msgdb.dll!nsMailDatabase::ForceClosed()  Line 119	C++
 	msgdb.dll!nsMsgDatabase::CleanupCache()  Line 738	C++
 	msgdb.dll!msgDBModuleDtor(nsIModule * self=0x05ec3de8)  Line 73	C++
 	xpcom_core.dll!nsGenericModule::Shutdown()  Line 340 + 0xc bytes	C++
 	xpcom_core.dll!nsGenericModule::~nsGenericModule()  Line 243	C++
 	xpcom_core.dll!nsGenericModule::`scalar deleting destructor'()  + 0xf bytes	C++
 	xpcom_core.dll!nsGenericModule::Release()  Line 245 + 0x8e bytes	C++
 	xpcom_core.dll!nsCOMPtr<nsIModule>::assign_assuming_AddRef(nsIModule * newPtr=0x00000000)  Line 496	C++
 	xpcom_core.dll!nsCOMPtr<nsIModule>::assign_with_AddRef(nsISupports * rawPtr=0x00000000)  Line 1172	C++
 	xpcom_core.dll!nsCOMPtr<nsIModule>::operator=(nsIModule * rhs=0x00000000)  Line 641	C++
 	xpcom_core.dll!nsNativeModuleLoader::ReleaserFunc(nsIHashable * aHashedFile=0x05ec3abc, nsNativeModuleLoader::NativeLoadData & aLoadData={...}, void * __formal=0x00000000)  Line 220	C++
 	xpcom_core.dll!nsBaseHashtable<nsHashableHashKey,nsNativeModuleLoader::NativeLoadData,nsNativeModuleLoader::NativeLoadData>::s_EnumStub(PLDHashTable * table=0x022c2668, PLDHashEntryHdr * hdr=0x04c433e0, unsigned int number=3, void * arg=0x0012f7e8)  Line 346 + 0x1e bytes	C++
 	xpcom_core.dll!PL_DHashTableEnumerate(PLDHashTable * table=0x022c2668, PLDHashOperator (PLDHashTable *, PLDHashEntryHdr *, unsigned int, void *)* etor=0x0048c060, void * arg=0x0012f7e8)  Line 735 + 0x19 bytes	C
 	xpcom_core.dll!nsBaseHashtable<nsHashableHashKey,nsNativeModuleLoader::NativeLoadData,nsNativeModuleLoader::NativeLoadData>::Enumerate(PLDHashOperator (nsIHashable *, nsNativeModuleLoader::NativeLoadData &, void *)* enumFunc=0x0048bc30, void * userArg=0x00000000)  Line 221 + 0x12 bytes	C++
 	xpcom_core.dll!nsNativeModuleLoader::UnloadLibraries()  Line 259	C++
 	xpcom_core.dll!nsComponentManagerImpl::Shutdown()  Line 750	C++
 	xpcom_core.dll!NS_ShutdownXPCOM_P(nsIServiceManager * servMgr=0x00000000)  Line 843 + 0xb bytes	C++
 	xul.dll!ScopedXPCOMStartup::~ScopedXPCOMStartup()  Line 948 + 0xc bytes	C++
 	xul.dll!XRE_main(int argc=1, char * * argv=0x0204f4f0, const nsXREAppData * aAppData=0x0204f998)  Line 3311	C++
 	thunderbird.exe!NS_internal_main(int argc=1, char * * argv=0x0204f4f0)  Line 103 + 0x12 bytes	C++
 	thunderbird.exe!wmain(int argc=1, wchar_t * * argv=0x0204cd00)  Line 87 + 0xd bytes	C++
 	thunderbird.exe!__tmainCRTStartup()  Line 594 + 0x19 bytes	C
 	thunderbird.exe!wmainCRTStartup()  Line 414	C
 	kernel32.dll!7d4e7d2a() 	
 	[Frames below may be incorrect and/or missing, no symbols loaded for kernel32.dll]
Attached patch Stops crash, but unit test fails (obsolete) — Splinter Review
As far as I can tell, the crash occurs because NotifyAnnouncerGoingAway is being called indirectly in nsMsgDatabase::CleanupCache for the database with the js listener. All other databases are already removed from the cache in the database destructor, but because we are now forcing the database object with the js listener to stay alive, though the database is closed in the folder shutdown routines, the program attempts a last-ditch effort to remove it from the cache. But the database is mostly gone by then, and so the removal attempt causes a crash when it tries to form the wrapped XPCOM object to do the onAnnouncerGoingAway call.

I fix this crash in the current attachment by adding to bienvenu's patch a one line call to remove the db from the cache during closing, but this unfortunately causes gloda unit tests to fail - and possibly other issues that I have not tested for. I'm pretty sure I could get around that by adding the removefromcache call in the folder shutdown, but I'd probably have to add removeFromCache to nsIMsgDBFolder which I am reluctant to do. Hopefully Bienvenu will have some ideas, otherwise I'll keep trying things.
oops, change "removeFromCache to nsIMsgDBFolder" to "removeFromCache to nsIMsgDatabase"
Having dug around the code a little bit, I'm fairly convinced that we're getting our refcounts or something wrong somewhere.

From what I'm seeing in the debugger, we're not getting a ForceClosed notification for the database that is being removed - the cycle collection is just going straight into the destructor.

Looking at bayesListener, this is holding onto the db (once I uncommented the line), the bayesListener is registered and held onto by nsMsgDBService and the db. So, there should be no reason for bayesListener to be cleaned up automatically.

However, somehow the db is getting destructed, causing the NotifyAnnouncerGoingAway which even then I wouldn't expect to release the listener because nsMsgDBService should be holding on to it (and the service in js).

This doesn't quite make sense, but if I'm right on my analysis and assumptions, then I think we're under-counting on our references to the db somehow and releasing it too early.
please find/file a bug for the mThread crash (a testcase is great, but either way, i'll deal w/ it). thanks :) - and reference the bug here
(In reply to comment #9)
> Having dug around the code a little bit, I'm fairly convinced that we're
> getting our refcounts or something wrong somewhere.
>

Checkout this code at http://mxr.mozilla.org/comm-central/source/mailnews/db/msgdb/src/nsMsgDatabase.cpp#741 if you want to see some creative refcounting around the db:

        if (refcount != 0)
        {
          // The destructor may cause the remaining references to be
          // released, so stabilize the refcount and then manually
          // delete.
          ++pMessageDB->mRefCnt;
          delete pMessageDB;
        }

There's been some discussion of this at bug 464419. I can't believe that code's correct - and I've seen it fire with refcount=3 and refcount=7 - but as far as I can tell it is a different issue than this current bug.
Attached patch Now passes unit tests (obsolete) — Splinter Review
I hate to keep adding methods to the interface, but that's what I did. This avoids the problems discussed in comment 6 by explicitly removing the database from the cache during shutdown, rather than relying on the cleanup code. I'll ask for review with the same people Bienvenu requested, though he should really bless this himself as well.

Timeless, I don't know enough about the xpcom stuff to know if the mThread stuff is a real bug, or just us using it incorrectly. With this fix, I don't need the issue resolved either. Plus the STR would be "Apply this patch we decided not to use, then try this partially finished extension, and you can make it crash". I'd be embarrassed to file such a bug.
Attachment #354798 - Attachment is obsolete: true
Attachment #354848 - Flags: superreview?(neil)
Attachment #354848 - Flags: review?(bugzilla)
w/o your full extension, I'm not sure what problem you're actually seeing, since the do-nothing extension was working with your patch. Can you send me your full extension?

re Mark's comments, the msg db service doesn't hold a reference to the db. nsMsgDBService::UnregisterPendingListener temporarily holds a comptr reference, but that is long gone by the time the js gc runs.
Here's the current version of the extension, slightly modified to simplify which folders are open, that gives a shutdown crash for me with Bienvenu's original patch.

To use this, I setup a test profile. I use Pop3 and a non-global inbox. Under the local inbox, I have a folder "Crashme" which is the only folder that the extension is currently loading a db listener for. I have three simple text-based messages in the Crashme folder.

To demonstrate the problem, I open TB with the extension loaded, in the small test profile. Then I click on the Crashme folder (which opens that db) and then back to the root of the POP3 account. I test two different cases. You can either exit immediately, or wait 4-10 seconds for garbage collection to occur before you exit. The crash occurs on shutdown.
(In reply to comment #0)
> I wonder if there's some built-in mechanism to fix this.
Isn't that the whole point of "stabilising" the refcount?
(In reply to comment #11)
>         if (refcount != 0)
>         {
>           // The destructor may cause the remaining references to be
>           // released, so stabilize the refcount and then manually
>           // delete.
>           ++pMessageDB->mRefCnt;
>           delete pMessageDB;
>         }
> 
> There's been some discussion of this at bug 464419. I can't believe that code's
> correct - and I've seen it fire with refcount=3 and refcount=7 - but as far as
> I can tell it is a different issue than this current bug.
It's not correct, only Release() is allowed to call delete. Trying to clean up the cache when one of the databases has a JS listener would definitely be one way of triggering this bug.
(In reply to comment #15)
> (In reply to comment #0)
> > I wonder if there's some built-in mechanism to fix this.
> Isn't that the whole point of "stabilising" the refcount?
I see now; that fails badly when GC is involved...
OK, so as far as I can see, except for the virtual folder change listener, which I can't actually see being added to a database, listeners should have an owning reference to the database to which they're listening, which means that it will never announce it is going away unless it is forced closed. So, I recommend:

1. Not announcing the database is going away in the destructor
2. Clearing out the list of listeners in ForceClosed just to be sure
Attachment #354414 - Flags: review?(neil) → review-
Attachment #354848 - Flags: superreview?(neil) → superreview-
pending listeners should not be required to have an owning reference to a db - it defeats the whole purpose of a pending listener, which is to have a listener when the db is open, but not to require the db be held open.
OK, so what's a pending listener, and where can I find one in the code?
the account manager registers them for the virtual folders to keep counts updated:

http://mxr.mozilla.org/comm-central/search?string=registerpending&find=&findi=&filter=^[^\0]*%24&hitlimit=&tree=comm-central
Sure, but it doesn't really care when the db goes away.
Although Bienvenu started this, I really need for this to move forward so I'll continue with another patch. Unless Bienvenu responds and says he is working on this, I'll do a patch along the following lines:

I basically agree with Neil's recommendations in comment 18, with some small changes. The main point of OnAnnouncerGoingAway is really to cleanup all references to the db, so that the db object can be destroyed. It is not really needed therefore to call this during the constructor - and that call is the root of all of the problems in this bug. So let's remove that call.

The minimum functionality required of OnAnnouncerGoingAway seems to be to remove its "this" from the listener list. For lightweight listeners, such as the type that I would like to add in extensions (or the in-tree virtual folder listeners), there is no local reference to the db, and this becomes the only point of this call. But the db knows about its listeners, it can take responsibility itself for removing them when it is being destroyed. Then there is no real need for the OnAnnouncerGoingAway call during the destructor. Lightweight listeners care more about pending listeners than actual listeners.

So when Neil suggests removing any remaining listeners in ForceClosed, I would instead do it in the destructor as normal behaviour, rather than "just to be sure". In practice I think that just means removing the assertion in ~nsMsgDatabase.
So I think the way forward for this bug would be:

- Write some unit tests that demonstrate the current problems.
- Fix the code in the way that Neil/Kent have suggested.

I think if we're going to do a change, then unit tests are required for us to be confident that we've understood the issues correctly and that we've fixed them correctly.
Attachment #354414 - Flags: superreview?(bugzilla) → superreview-
Attachment #354848 - Flags: review?(bugzilla) → review-
(In reply to comment #22)
> Sure, but it doesn't really care when the db goes away.

The point is that we want the db's to go away, to release their memory and file handles. Pending listeners tend to live for the life of the application. If pending listeners held their dbs open, then the db's would be open for the life of the application.
(In reply to comment #25)
> (In reply to comment #22)
> > Sure, but it doesn't really care when the db goes away.
> The point is that we want the db's to go away
Yes, but safely, rather than in a way that crashes...
I've been running this patch for the last few days without any issues. I still want to look at it one more time myself before I ask for review though. It's really the same approach advocated by Neil - with the extra fix of the shutdown crash added.

I think I could do a unit test that would crash with the existing code fairly easily. The difficulty of testing for the wider effects though is that it is hard to know all of the ways that a db might be left open by other code, which could prevent successful closing of the database.

I do think that fixing this is important though, as the powerful db listeners could then be used in javascript and extensions.
Assignee: bienvenu → kent
Attachment #354848 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Comment on attachment 359932 [details] [diff] [review]
Don't call ..goingAway during destruction

I think this works for the test case because nsMsgDBFolder::Shutdown is called pretty early in the shutdown process before other things start going away. But it relies on the fact that most consumers who get hold of nsIMsgDatabases do it through the folder - but if someone has a reference to the db and the corresponding folder does not, I'm not sure what will happen with this patch (e.g., object A gets the db from the folder and holds a reference, but some other code comes along and sets the msgDatabase on the folder to null - in that case, ForceClosed won't get called on the db from the folder).

This might still be the right thing to do, but it would be the removal of NotifyAnnouncerGoingAway from the destructor that would be the difference maker.
Just to be clear, the removal of NotifyAnnouncerGoingAway from the destructor solves a different problem than the change to nsMsgDBFolder::Shutdown. The shutdown error will occur even with the first change.

I don't really understand why the shutdown error occurs (see stack in comment 6), but it seems to be because the databases are being closed too late in the process. Making them close earlier, with the folder shutdown, solves the crashes for me.

The whole process of shutting down the databases seems rather risky to me. Currently, ForceClosed() really relies on the fact that anyone who has a reference to the database will also have a listener that will delete its reference when NotifyAnnouncerGoingAway() is called - which is not a stated requirement anywhere.  The database cache is another authority on who has a database open (by which I mean the file open) - but that is also not completely reliable, since we assume that being in the cache also implies that the database is somehow also valid, thus invalid databases can be open yet not in the cache.

Still, I don't think this bug is the place to address these issues - I have bug 475662 where I will propose some more radical, riskier changes. I'd like for this bug to just keep the darn code from crashing when I use a js-based db listener.
Comment on attachment 359932 [details] [diff] [review]
Don't call ..goingAway during destruction

Though this does not solve allof the issues iwth closing message databases, it moves in the right direction. I'd like to see it in beta 2 so that I can let others try soft tagging.
Attachment #359932 - Flags: superreview?(bienvenu)
Attachment #359932 - Flags: review?(neil)
Target Milestone: --- → Thunderbird 3.0b2
Comment on attachment 359932 [details] [diff] [review]
Don't call ..goingAway during destruction

>@@ -924,9 +923,7 @@ nsMsgDatabase::~nsMsgDatabase()
>     m_mdbEnv->Release(); //??? is this right?
>     m_mdbEnv = nsnull;
>   }
>-
>-  // Better not be any listeners, because we're going away.
>-  NS_ASSERTION(m_ChangeListeners.IsEmpty(), "shouldn't have any listeners");
>+  m_ChangeListeners.Clear();
Presumably ~nsMsgDatabase will invoke ~m_ChangeListners which Clear()s anyway?
(In reply to comment #31)
> Presumably ~nsMsgDatabase will invoke ~m_ChangeListners which Clear()s anyway?

Presumably. I have no problems leaving it out.
Weird. I had originally applied this patch and the patch from bug 464419, and I was triggering the cache shutdown assertion. I then backed this patch out, and the assertion went way, and I then reapplied this patch, and the assertion... still isn't firing!
Attachment #359932 - Flags: review?(neil) → review+
Comment on attachment 359932 [details] [diff] [review]
Don't call ..goingAway during destruction

fix checked in.
Attachment #359932 - Flags: superreview?(bienvenu) → superreview+
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.