Closed Bug 492116 Opened 15 years ago Closed 15 years ago

Venkman exposed flaw in nsMsgPropertyEnumerator, leads to crashes with Venkman

Categories

(MailNews Core :: Database, defect)

x86
Windows Vista
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Bienvenu, Assigned: Bienvenu)

Details

(Keywords: crash)

Attachments

(3 files)

Attached patch proposed fixSplinter Review
Whenever I try to use Venkman, I crash in the nsMsgPropertyEnumerator. Venkman is getting all the properties of a header, including the enumerator, and when it gets garbage collected, it crashes. 

The fix is to make sure that the property enumerator holds a reference to the hdr, and to make sure the rowCellCursor is freed before the hdr.
Attachment #376492 - Flags: superreview?(bugzilla)
Attachment #376492 - Flags: review?(kent)
I fear that we are creating here a section of black-magic code, that in the future will be considered untouchable because nobody will really understand the reason that the original crash occurred. I tried to get the code to crash in javascript following what I thought was the theory of the new patch, writing a little test that created a property enumerator from a header, nulled the header, did garbage collection using .garbageCollect(), then exercised the propertyEnumerator, but it did not crash. I've also used Venkman since the original patch landed without probelms, so I cannot verify that "Whenever I try to use Venkman" the code crashes.

I think we should do proposed fix if you are seeing the problem, and the patch fixes it, but for the sake of the future do you have any suggestions of how one might reproduce the crash?
I see the crash in multiple trees, and have been seeing it for a while. It might have to do with some of the view patches I've been running with, but I suspect not. It happens whenever new mail arrives. 

The problem might be in Mork, actually. Mork has its own ref-counting system, and I had to drag it kicking and screaming into the xpcom world. I think the row cell cursor stuff might not have been dragged into the xpcom world correctly, since up until now, no one was using it. I'll poke around that a little.
FWIW, Mork does addref the row cell cursor correctly, so that's not it. And if I run Venkman w/o this patch, I do keep crashing. I'll try to find some time to come up with an explanation of what's going on.
There is something funny going on with the mork row references, that manifests itself in much more serious situations than this including several top crashers, such as bug 480090. I guess I am still hoping that if the root cause gets solved in that blocker bug, then perhaps this patch won't be needed. Or if you can develop a reproducible test case from nsMsgPropertyEnumerator issues, then that could be used to understand the underlying problem that might also be used to address those other bugs. But maybe that is hoping for too much.

Just in case anyone is wondering, the crash from bug 480090 started occurring before the patch creating nsMsgPropertyEnumerator landed.
If I'm running Venkman, the msg hdr gets destroyed before the property enumerator (normally I think this doesn't happen since the db caches msg hdrs so you somehow need to get a msg hdr that's not cached - I don't know how that's happening, but we have to deal with it). The msg hdr does a release ref on the mdb row, which only has a single XPCom reference to it, so it gets deleted. But the row cell cursor is still holding onto it, though it doesn't hold an xpcom reference. Mork has its own ref counting system internally, and we only use XPCOM references for the things that get returned outside of Mork. I'm a little leery of pushing xpcom ref counting semantics further into Mork (e.g., making mork row cell cursor hold an xpcom reference to the row object) because I really don't want to introduce reference cycles/leaks into Mork. But I can try...
Thanks for the analysis. Do you think that may also be the root cause of some of our more common mork crashers as well?
(In reply to comment #6)
> Thanks for the analysis. Do you think that may also be the root cause of some
> of our more common mork crashers as well?

I suspect the other mork crashers are less subtle (e.g.,  bug 480090). I believe that's a simple case of our non-mork objects (e.g., the offline imap operations) holding references to mork memory past the time the mork db has been closed. If you think of Mork as using memory pools, and closing the db as freeing the pool, then you can see why that's bad.  For msg hdrs, we solve this by having the nsMsgDatabase wrapper around Mork db's keep track of all the extant msg hdrs in a hash table, and clearing their mdb row pointers before we close the mork db. 

I suppose one solution for the offline imap operation issue is to put them in the same hash table, and rejigger the class hierarchy of those objects a little so that they both implement an interface that has a clear method, and they'd clear their mdb rows when we iterate through the hash table.  Or they could share a base class that has the mdb row...That would not fix the property enumerator issue, though, since that's not a case of the db getting closed.

I did try tweaking Mork to use a bit more xp com ref counting, and it does fix the crash, but it also caused some assertions in the case that used to crash that made me a bit nervous...so I'm still inclined to go with some variant of my patch - I'll see if I can simplify it a little.
this is a patch that fixes the crash, but generated a slightly disturbing assertion.
This unit test will crash when run, and is fixed by bienvenu's patch. It crashes in the row object. I'm not sure if this is something we want to promote though, since trying to do a database operation after a forceDBClosed is not really valid.
Attachment #376492 - Flags: review?(kent) → review+
(In reply to comment #9)
> Created an attachment (id=380460) [details]
> Unit test that will crash
> 
> This unit test will crash when run, and is fixed by bienvenu's patch. It
> crashes in the row object. I'm not sure if this is something we want to promote
> though, since trying to do a database operation after a forceDBClosed is not
> really valid.

What does not promoting it amount to?  The semantics of forceDBClosed aren't really compatible with our JS extension world, so we need to wrap everything in foam rubber... naive/lazy JS code shouldn't be able to crash the system.

(I think it's okay for orphaned objects to go into a state where they simply throw exceptions no matter what you try to do to them, but no crasing.)
By "promote" I originally meant "do we try to include this unit test in the testing framework" - but then it sort of morphed into "do we allow in general any DB operation after forceDBClosed". So I am uncertain whether to encourage this unit test to be landed.

On the issue of ForceDBClosed() itself, I hope it is clear that there are some major issues in the whole summary database framework that need revamping. Perhaps we should have a summit at some point to discuss ways forward, but that is not going to happen pre-TB3. The existing framework was never intended to be used with XPCOM - much less modern javascript - and it has been an amazing accomplishment to make it work as well as it does.
pinging for sr - I'd like to land this before I start dealing with the forceDBClosed issue, to avoid conflicts.
Comment on attachment 376492 [details] [diff] [review]
proposed fix

This is causing test_index_messages.js to hang if I apply the patch from bug 438922 to my tree as well (this makes sure we clean up after finishing each unit test). This makes me nervous.

Unfortunately, the patch that would get me more info with check-interactive is broken (bug 384339) so I can't get more info at the moment :-(
there shouldn't be any property enumerators extant at shut-down time - the only code that uses them in the test cases uses stack variables.
Do any of the property enumerators ever get exposed at all to XPConnect?  It can and will hang onto everything it sees until a GC pass happens.  Perhaps forcing a GC collection in that shutdown helper patch?
the property enumerator is accessible via xpconnect, but only venkman, which has a lovely habit of enumerating the properties for fun, actually accesses the property enumerator via xpconnect (other than rkent's extensions, which may also access the property enumerator)
There is an existing unit test for the propertyEnumerator that accesses it via xpconnect.
Standard8, could that have been the test that hung? I don't think the test_index_messages test exposes the property enumerator to xpconnect, but if it copy/moves messages, it will cause a property enumerator to be created and destroyed in c++.
Comment on attachment 376492 [details] [diff] [review]
proposed fix

I guess I hadn't checked the patch in the way I thought I had - I've just tried reproducing and I can reproduce the hang without or without your patch (but with my shutdown patch). So its obviously not this patch that is the issue.

>   nsCOMPtr<nsIMdbEnv> m_mdbEnv;
>   nsCOMPtr<nsIMdbStore> m_mdbStore;
>+  nsRefPtr<nsMsgHdr> m_hdr;

Can you add a short note here saying that we keep hold of it to stop m_hdr being deleted and tidying up the other objects that this enumerator requires?

sr=Standard8
Attachment #376492 - Flags: superreview?(bugzilla) → superreview+
fix checked in with comment added.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Severity: normal → critical
Keywords: crash
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: