Closed Bug 1299920 Opened 3 years ago Closed 3 years ago

release weak references in the IMAP protocol on the main thread

Categories

(MailNews Core :: Networking: IMAP, defect, critical)

defect
Not set
critical

Tracking

(thunderbird51 unaffected, thunderbird52+ fixed, thunderbird53 fixed, seamonkey2.49esr fixed, seamonkey2.50 fixed)

RESOLVED FIXED
Thunderbird 53.0
Tracking Status
thunderbird51 --- unaffected
thunderbird52 + fixed
thunderbird53 --- fixed
seamonkey2.49esr --- fixed
seamonkey2.50 --- fixed

People

(Reporter: jorgk-bmo, Assigned: rkent)

References

Details

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1299116 +++

See bug 1299116 comment #32.
I'm not sure I am interpreting the stack correctly from https://bugzilla.mozilla.org/attachment.cgi?id=8787354

Can anyone tell if this is complaining about the release of nsMsgProtocol itself, or one of the member variables (like m_server)?

Honza Bambas (:mayhemer) can you help this poor old hack interpret this?
Flags: needinfo?(honzab.moz)
My 0,5 cents worth:
Looking at the macro expansion in DXR of NS_IMPL_RELEASE_INHERITED(nsImapProtocol, nsMsgProtocol ), this is about the release of nsMsgProtocol. Equally NS_IMPL_ISUPPORTS(nsMsgProtocol, nsIChannel, nsIStreamListener, ... is about the release of nsMsgProtocol. Where do you see m_server?
m_server is a member variable of the protocol, so it also needs to be released when the protocol is released.
Ok, I can now confirm that Seamonkey crashes every time if I disconnect the network and open Mail & newsgroups with an imap account in the profile. 

Should i do a debug build later and see where it crashes in imap?
(In reply to Frank-Rainer Grahl from comment #4)
> Ok, I can now confirm that Seamonkey crashes every time if I disconnect the
> network and open Mail & newsgroups with an imap account in the profile. 
So you're saying it crashes in a non-debug build? That's not caused by a MOZ_ASSERT then. I will try it later, right now I can't disconnect my network.
Can you write down the exact STR. Do you need to enter a password?
 
> Should i do a debug build later and see where it crashes in imap?
Well, if it's not too much hassle, it would be good to get the crash confirmed. I though you had already done your debug build.
(In reply to Jorg K (GMT+2, PTO during summer) from comment #5)
> (In reply to Frank-Rainer Grahl from comment #4)
> > Ok, I can now confirm that Seamonkey crashes every time if I disconnect the
> > network and open Mail & newsgroups with an imap account in the profile. 
> So you're saying it crashes in a non-debug build? That's not caused by a
> MOZ_ASSERT then. I will try it later, right now I can't disconnect my
> network.
> Can you write down the exact STR. Do you need to enter a password?

Its the assert because its also triggered by Nightly.

> > Should i do a debug build later and see where it crashes in imap?
> Well, if it's not too much hassle, it would be good to get the crash
> confirmed. I though you had already done your debug build.

This was the clobber build. Old lappi just needs 5-6h to do a full compile and I didn't let it run overnight :)

A message box pops up says it can't connect to the server and then it asserts with the weak reference message visible in the crash output.
(In reply to Kent James (:rkent) from comment #3)
> m_server is a member variable of the protocol, so it also needs to be
> released when the protocol is released.
Sorry, I see it now, m_server in nsImapProtocol, subclass of nsMsgProtocol.
(In reply to Frank-Rainer Grahl from comment #6)
> A message box pops up says it can't connect to the server and then it
> asserts with the weak reference message visible in the crash output.
I'll try it later when I can disconnect my network. Maybe this belongs to the other bug though.
I did the following test:
Disconnected my network. Started TB on a profile with IMAP configured. Accessed an IMAP folder with no local storage (I have that for my cache tests), so IMAP needs to go fetching messages from the server, which doesn't work due to no network connection. Result is the very same crash described in attachment 8787354 [details], which is subject of this bug.

At least, we have an easy to reproduce test case now, since sending 30+ messages and hoping for the crash to happen wasn't at all fun (see bug 1299116 comment #30).

(In reply to Frank-Rainer Grahl from comment #6)
> (In reply to Jorg K (GMT+2, PTO during summer) from comment #5)
> > So you're saying it crashes in a non-debug build? That's not caused by a
> > MOZ_ASSERT then.
> Its the assert because its also triggered by Nightly.
Frank-Rainer, that makes no sense. A Nightly, at least for Thunderbird, is a non-debug build, so MOZ_ASSERT does *NOT* trigger there. If you have a crash in a non-debug build, you are seeing a different issue. Again, I'm happy to try stuff, but I need exact STR. With your description (IMAP + no network) I managed to reproduce the crash we're discussing in this bug here.
Jorg see Honzas comment here:

https://bugzilla.mozilla.org/show_bug.cgi?id=1299116#c9

Its triggered in Nightly and debug builds. So all you need is 'NIGHTLY_BUILD': '1' in your config.status which is always there in c-c and it will assert unless you also have enabled profiling:

> #if defined(DEBUG) || (defined(NIGHTLY_BUILD) && !defined(MOZ_PROFILING))

Back at the desktop and happily compiling all trees now. Just holler if a debug build is still needed. Looks like you reproduced the crash.
(In reply to Frank-Rainer Grahl from comment #10)

Sorry, I'm completely confused.

> Jorg see Honzas comment here:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1299116#c9
I've read bug 1299116 comment #9 and I can also see that we're not talking about a simple MOZ_ASSERT, but something more sophisticated:
https://dxr.mozilla.org/comm-central/rev/b7f7ae14590aced450bb0b0469dfb38edd2c0ace/mozilla/xpcom/glue/nsWeakReference.cpp#15
https://dxr.mozilla.org/comm-central/rev/b7f7ae14590aced450bb0b0469dfb38edd2c0ace/mozilla/xpcom/glue/nsDebug.h#345

So debug *and* nightly builds should crash here:
https://dxr.mozilla.org/comm-central/rev/b7f7ae14590aced450bb0b0469dfb38edd2c0ace/mozilla/xpcom/glue/nsDebug.h#354

However, observed bahaviour is this:
With the steps described in comment #9, the first click on a message in an IMAP folder without local storages crashes immediately when using my own local debug build.

Doing the very same with a Daily/Nightly build does *NOT* crash. That's why I mistakenly assumed that we were talking about a simple MOZ_ASSERT here, which we are not.

So please explain why the local debug build crashes and the downloaded Daily/Nightly does not crash.

P.S.: config.status is not a file I know or manipulate since it only appears in the object directory.
>> So please explain why the local debug build crashes and the downloaded Daily/Nightly does not crash.

IanN pointed it out to me. My local build has debug and profiling turned off. The Seamonkey and TB nightlies enable profiling so no crash with them. The assert does not happen when profiling is turned on in the Nightly and debug is off.
(In reply to Kent James (:rkent) from comment #1)
> I'm not sure I am interpreting the stack correctly from
> https://bugzilla.mozilla.org/attachment.cgi?id=8787354
> 
> Can anyone tell if this is complaining about the release of nsMsgProtocol
> itself, or one of the member variables (like m_server)?
> 
> Honza Bambas (:mayhemer) can you help this poor old hack interpret this?

According that nsMsgProtocol doesn't impl nsSupportsWeakReference, it's release of last reference to something nsMsgProtocol keeps a strong reference to.

can also be decomposed as:

nsMsgProtocol::Release() -> ~nsMsgProtocol -> member's ~nsCOMPtr() -> referred object's Release() /* apparently the last one */ -> nsSupportsWeakReference::~nsSupportsWeakReference() -> ClearWeakReferences() -> mProxy->NoticeReferentDestruction()

all just optimized by the compiler to be seen only as:

nsMsgProtocol::Release() -> mProxy->NoticeReferentDestruction()


It's unfortunate you can't see the instance object nor type of the comptr being destroyed.  However, apparently this shows the problem at its whole glance - an object that has been taken a weak reference - that weak ref still exists (mProxy would be null otherwise), is released on a different thread than the weak ref has been worked with on.
Flags: needinfo?(honzab.moz)
I got a crash while testing bug 1297368, with this stack:

 	xul.dll!nsWeakReference::NoticeReferentDestruction() Line 65	C++
 	xul.dll!nsSupportsWeakReference::ClearWeakReferences() Line 161	C++
 	xul.dll!nsSupportsWeakReference::~nsSupportsWeakReference() Line 47	C++
>	xul.dll!nsImapProtocol::~nsImapProtocol() Line 587	C++
 	[External Code]	
 	xul.dll!nsMsgProtocol::Release() Line 48	C++
 	xul.dll!nsImapProtocol::Release() Line 297	C++
 	xul.dll!nsCOMPtr<nsIRunnable>::~nsCOMPtr<nsIRunnable>() Line 406	C++
 	xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Line 1090	C++

Concerning "According that nsMsgProtocol doesn't impl nsSupportsWeakReference, it's release of last reference to something nsMsgProtocol keeps a strong reference to." but note that nsIMapProtocol DOES support weak reference, and that is what crashed here.

Clearly we need to ensure that the imap protocol is only released on the main thread.
(In reply to Kent James (:rkent) from comment #14)
> Concerning "According that nsMsgProtocol doesn't impl
> nsSupportsWeakReference, it's release of last reference to something
> nsMsgProtocol keeps a strong reference to." but note that nsIMapProtocol
> DOES support weak reference, and that is what crashed here.

Ups!  I had to miss that.  If that's so, then it's a bit easier to find a fix.
"it's a bit easier to find a fix"

Honza, I'd be happy to do the coding if you could point me in the correct direction of existing Mozilla code that ensures an object is released on the main thread. I could probably figure it out on my own, but if the direction is obvious to you and you could show me, it would make my life easier.
I coded something like this once:
  nsCOMPtr<nsIThread> mainThread = do_GetMainThread();
    if (mainThread) {
      NS_ProxyRelease(mainThread, static_cast<mozIPersonalDictionary *>(dict));

Maybe completely unrelated. Anyway, Kent's question is in comment #16.
Flags: needinfo?(honzab.moz)
(In reply to Jorg K (GMT+2) from comment #17)
> I coded something like this once:
>   nsCOMPtr<nsIThread> mainThread = do_GetMainThread();
>     if (mainThread) {
>       NS_ProxyRelease(mainThread, static_cast<mozIPersonalDictionary
> *>(dict));
> 
> Maybe completely unrelated. Anyway, Kent's question is in comment #16.

Yes that is the approach. I spent a couple of hours looking at this last week, what I am struggling with is figuring out where the reference is to the protocol that is getting released on the wrong thread. Once I find that, them the proxy release should work.
(In reply to Kent James (:rkent) from comment #16)
> "it's a bit easier to find a fix"
> 
> Honza, I'd be happy to do the coding if you could point me in the correct
> direction of existing Mozilla code that ensures an object is released on the
> main thread. I could probably figure it out on my own, but if the direction
> is obvious to you and you could show me, it would make my life easier.

Oh, I didn't want to say anything like that I would know what exactly to do or point you to anything concrete.  

We also have nsMainThreadPtrHolder et al thing.  It might help to keep the code more clean.

Other option might be to always proxy Release() to the main thread.  I did something like this for localStorage at https://dxr.mozilla.org/mozilla-central/rev/ea104eeb14cc54da9a06c3766da63f73117723a0/dom/storage/DOMStorageCache.cpp#102
Flags: needinfo?(honzab.moz)
Assignee: nobody → rkent
ref also bug 993734 and bug 1005336
Blocks: 1133892
One of the things holding me up looking at this is that I have no way to recreate the crash reliably for testing. Does anyone know how to make this crash happen on demand?
If you have an IMAP account it should be pretty easy. I was able to crash it always in SeaMonkey.

1.) compile Nightly with profiling turned off or just debug on. This satisfies:

> #if defined(DEBUG) || (defined(NIGHTLY_BUILD) && !defined(MOZ_PROFILING))

In SeaMonkey if an IMAP account is configured all I needed to do was open the Mail & Newsgroups window -> Bang. I TB you likely need to only start it. See comment 4 from me and comment 9 from Jorg.

> Ok, I can now confirm that Seamonkey crashes every time if I disconnect the network and open 
> Mail & newsgroups with an imap account in the profile.
(In reply to Kent James (:rkent) from comment #21)
> One of the things holding me up looking at this is that I have no way to
> recreate the crash reliably for testing. Does anyone know how to make this
> crash happen on demand?
Just read comment #9, first paragraph. I'll copy it here for you ;-)
===
Disconnected my network. Started TB on a profile with IMAP configured. Accessed an IMAP folder with no local storage (I have that for my cache tests), so IMAP needs to go fetching messages from the server, which doesn't work due to no network connection. Result is the very same crash described in attachment 8787354 [details], which is subject of this bug.
===

Happy crashing! Make Thunderbird crash again!
I couldn't make it crash with 10-26 nightly build on Windows.  Do I read correctly that a debug build is not required?
No, you don't. A debug build *IS* required to see the crash. FRG does something special to even crash in non-debug builds, see comment #10 and comment #12.

Repeating: This only crashes in TB debug builds (which is very annoying for developers) and shows that something is wrong in our logic. If it crashed in normal builds, this would be a TOP TOP TOP TOP CRASH.
Contrary to rumor, I really have no idea what I am doing in this thread management stuff. But I can run a debugger, and see that the crash is occurring after the protocol runnable completes. Here I have cargo-culted a release to main thread. Hopefully someone with clue will throw up their hands in disgust, and propose the correct fix here.

But this patch has the merit that it stops the crash, at least for the reproducible case.
Attachment #8817302 - Flags: review?(jorgk)
(In reply to Kent James (:rkent) from comment #26)
> Contrary to rumor, I really have no idea what I am doing in this thread
> management stuff.
Well, I think you're the resident IMAP expert and the person who has the best idea about this stuff. "The best idea" doesn't imply that it's a good idea, just the best of all the people available.

> Here I have cargo-culted a release to main thread.
Can someone please explain the term "cargo-cult". I've looked it up but didn't find any good explanation.

> Hopefully someone with clue will throw up their
> hands in disgust, and propose the correct fix here.
If we had such a someone, we would have given them the bug.

> But this patch has the merit that it stops the crash, at least for the
> reproducible case.
Nice.

I was going to complain about the comment (who is "I", why is kludge lower case the second 'this' refers to an object, right?) ...

+  // For reasons that I do not understand, this runnable is resulting in the
+  // destruction of this on the imap thread, which causes grief for weak references.
+  // kludge to allow NS_ReleaseOnMainThread

but let me ask a question first.

'this' is the protocol object which you now release:
+  nsCOMPtr<nsIImapProtocol> releaseOnMain(this);
+  NS_ReleaseOnMainThread(releaseOnMain.forget());

Where was is released before? Or did it just casually die in some garbage collection?

Let me ask Honza for feedback, OK?
Comment on attachment 8817302 [details] [diff] [review]
kludge using NS_ReleaseOnMainThread

Hi Honza, if you're not in Hawaii, can you give us your opinion here.
Thanking you in advance.
Attachment #8817302 - Flags: feedback?(honzab.moz)
I shouldn't work at 2 AM:
Where was *it* released before? Or did it just casually die in some garbage collection?
Where was it released? The imap thread runnable was passed to nsThread.cpp when the protocol started, and at the point it exited all references had gone away, so it was released in Thread.cpp when the runnable exited.

BTW I am travelling for 5 days, I'll have my travel computer but no development environment.

"Cargo cult" means that I copied other code that I did not fully understand. All of the references that I could find to NS_ReleaseOnMainThread were used on .forget() from an nsCOMPtr, so that is what I did.
Kent, thank you so much for your answers, I know, you're a busy man ;-)

Especially ...
===
The imap thread runnable was passed to nsThread.cpp when the protocol started, and at the point it exited all references had gone away, so it was released in Thread.cpp when the runnable exited.
===
... will give Honza a better (not to say "good") understanding.

If I read it correctly, we now pro-actively release the object on the right thread instead of having it being released by something that does sound like some sort of "garbage collection" on the wrong thread. Sounds good to me. Also, there are plenty of calls to |NS_ReleaseOnMainThread()| in that file, so it appears that this one simply was missed(??).

Honza, can I please have your thoughts.
Comment on attachment 8817302 [details] [diff] [review]
kludge using NS_ReleaseOnMainThread

Review of attachment 8817302 [details] [diff] [review]:
-----------------------------------------------------------------

Nice work to locate this!  Looks good.
Attachment #8817302 - Flags: feedback?(honzab.moz) → feedback+
Comment on attachment 8817302 [details] [diff] [review]
kludge using NS_ReleaseOnMainThread

This looks reasonable, fixes the problem and has expert feedback+
I'll land this with a slightly modified comment.
Thanks, Kent!
Attachment #8817302 - Flags: review?(jorgk) → review+
While this patch fixes the reproducible case described in comment #9, it still doesn't eliminate all gremlins from the box. Last night I was running a debug build and it crashed with this patch applied. Sadly I missed to grab the stack.

FRG, can you please try with this (landed) patch and see whether you can still get it to crash, preferably in a reproducible fashion.
Flags: needinfo?(frgrahl)
Looks good for SeaMonkey 2.50a1. I just reverted my backout of the original patch in m-c and with the patches for imap now in the tree I am unable to crash SeaMonkey using my non debug non-profiling mozconfig. When I encounter a new problem I will see that I can reproduce it.

Thanks
Flags: needinfo?(frgrahl)
So far I was unable to crash 2.50 with the patch installed. Could it be taken to TB 52? SeaMonkey may do a 2.49 ESR cycle too and this might add to the overall stability of the mailnews component.
Damn, I meant to uplift this, but my query missed it since there is no Target Milestone set here, grrrrrr.

Here's the landing with DONTBUILD since I've just done uplifts:

Aurora (TB 52):
https://hg.mozilla.org/releases/comm-aurora/rev/11ad0c351b82bf77ac789d966bdd6c0bf46a01fb
Target Milestone: --- → Thunderbird 53.0
Thanks. Just also saw that it already had c-a approval.
You need to log in before you can comment on or make changes to this bug.