Closed Bug 1299116 Opened 8 years ago Closed 8 years ago

nsMsgIncomingServer weak reference used on multiple threads

Categories

(MailNews Core :: Networking: IMAP, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 51.0

People

(Reporter: mayhemer, Assigned: rkent)

References

Details

Attachments

(2 files, 1 obsolete file)

Hit MOZ_CRASH(nsWeakReference not thread-safe) at /builds/slave/tb-c-cen-l64-d-000000000000000/build/mozilla/xpcom/glue/nsWeakReference.cpp:144
#01: nsQueryReferent::operator() [xpcom/glue/nsWeakReference.cpp:77]
#02: nsCOMPtr<nsIMsgIncomingServer>::assign_from_helper [xpcom/glue/nsCOMPtr.h:1157]
#03: nsCOMPtr<nsIMsgIncomingServer>::nsCOMPtr [xpcom/glue/nsCOMPtr.h:560]
#04: nsImapProtocol::GetPassword [/builds/slave/tb-c-cen-l64-d-000000000000000/build/mailnews/imap/src/nsImapProtocol.cpp:8238]
#05: nsImapProtocol::TryToLogon [/builds/slave/tb-c-cen-l64-d-000000000000000/build/mailnews/imap/src/nsImapProtocol.cpp:8421]
#06: nsImapProtocol::ProcessCurrentURL [/builds/slave/tb-c-cen-l64-d-000000000000000/build/mailnews/imap/src/nsImapProtocol.cpp:1734]
#07: nsImapProtocol::ImapThreadMainLoop [/builds/slave/tb-c-cen-l64-d-000000000000000/build/mailnews/imap/src/nsImapProtocol.cpp:1378]
#08: nsImapProtocol::Run [xpcom/glue/nsCOMPtr.h:741]
#09: nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:1058]
#10: NS_ProcessNextEvent [xpcom/glue/nsThreadUtils.cpp:290]
#11: mozilla::ipc::MessagePumpForNonMainThreads::Run [ipc/glue/MessagePump.cpp:339]
#12: MessageLoop::RunInternal [ipc/chromium/src/base/message_loop.cc:233]
#13: MessageLoop::Run [ipc/chromium/src/base/message_loop.cc:490]
#14: nsThread::ThreadFunc [xpcom/threads/nsThread.cpp:461]
#15: _pt_root [nsprpub/pr/src/pthreads/ptthread.c:219]
#16: libpthread.so.0 + 0x7e9a
#17: libc.so.6 + 0xf338d

From [1].  The only implementation of nsIMsgMailNewsUrl that supports weak reference is [2].

Possible solutions:
- work with weak reference solely on a single thread (i.e. all do_QueryReferent calls and release of the referred object happen on the same thread)
- use strong references instead ; if we know when to release it, otherwise the referred object leaks or in some rare cases may break logic based on objects' deletion being a trigger (I've seen such crazy constructions)
- use raw references instead ; weak-ref can be removed in cases when the referred object is ensured to live longer than the referrers (caution needed!)

Is there anyone who knows this code really well?


[1] https://dxr.mozilla.org/comm-central/rev/8c7ddccf95c4168995fa6e57182133f22d21be03/mailnews/imap/src/nsImapProtocol.cpp#681
[2] https://dxr.mozilla.org/comm-central/rev/8c7ddccf95c4168995fa6e57182133f22d21be03/mailnews/base/util/nsMsgIncomingServer.h#38

See also: bug 383490
Apparently, there is number of derivatives of this base class:
https://dxr.mozilla.org/comm-central/search?q=%2Bderived%3AnsMsgIncomingServer

Finding all of them and fix the usage may be really hard.
I also forgot to mention one more option:

- implement thread safe weak reference support ; something I have in mind a long time, but never tried to implement it, this should anyway be used only as a last resort option
(In reply to Honza Bambas (:mayhemer) from comment #0)

> Possible solutions:
> - work with weak reference solely on a single thread (i.e. all
> do_QueryReferent calls and release of the referred object happen on the same
> thread)
> - use strong references instead ; if we know when to release it, otherwise
> the referred object leaks or in some rare cases may break logic based on
> objects' deletion being a trigger (I've seen such crazy constructions)
> - use raw references instead ; weak-ref can be removed in cases when the
> referred object is ensured to live longer than the referrers (caution
> needed!)
> 
> Is there anyone who knows this code really well?
The original implementers (like David Bienvenu and Mark Banner) have all left the building. Kent and Joshua would know most about this, but I wouldn't say "really well". NI'ing Kent.

Can you please explain a little more for the laymen how the references [1] and [2] relate to the crash here:
https://dxr.mozilla.org/comm-central/rev/8c7ddccf95c4168995fa6e57182133f22d21be03/mailnews/imap/src/nsImapProtocol.cpp#8237
nsCOMPtr<nsIMsgIncomingServer> server = do_QueryReferent(m_server, &rv);

The problem is that 'nsWeakPtr   m_server;' (nsImapProtocol.h) can't be queried like this, right?

Oh, at
https://dxr.mozilla.org/comm-central/rev/8c7ddccf95c4168995fa6e57182133f22d21be03/mailnews/imap/src/nsImapProtocol.cpp#682 is another
nsCOMPtr<nsIMsgIncomingServer> server = do_QueryReferent(m_server);
and that can blow up, too? In total, there are six of them in nsImapProtocol.cpp.
Flags: needinfo?(rkent)
(In reply to Jorg K (GMT+2, PTO during summer) from comment #4)
> (In reply to Honza Bambas (:mayhemer) from comment #0)
> 
> > Possible solutions:
> > - work with weak reference solely on a single thread (i.e. all
> > do_QueryReferent calls and release of the referred object happen on the same
> > thread)
> > - use strong references instead ; if we know when to release it, otherwise
> > the referred object leaks or in some rare cases may break logic based on
> > objects' deletion being a trigger (I've seen such crazy constructions)
> > - use raw references instead ; weak-ref can be removed in cases when the
> > referred object is ensured to live longer than the referrers (caution
> > needed!)
> > 
> > Is there anyone who knows this code really well?
> The original implementers (like David Bienvenu and Mark Banner) have all
> left the building. Kent and Joshua would know most about this, but I
> wouldn't say "really well". NI'ing Kent.
> 
> Can you please explain a little more for the laymen how the references [1]
> and [2] relate to the crash here:

(Referring to comment 0) 
[1] is the place the test fails the assertion the first time - apparently, that do_QueryReferrent has been preceded by an earlier do_QueryReferrent made on a different thread.

[2] shows the class we are trying to query for, but later I found out that it's just a base class and there are number of derived classes.  Each of them can be the one we are trying to query at [1].

> https://dxr.mozilla.org/comm-central/rev/
> 8c7ddccf95c4168995fa6e57182133f22d21be03/mailnews/imap/src/nsImapProtocol.
> cpp#8237
> nsCOMPtr<nsIMsgIncomingServer> server = do_QueryReferent(m_server, &rv);
> 
> The problem is that 'nsWeakPtr   m_server;' (nsImapProtocol.h) can't be
> queried like this, right?

It's not that simple to say "it's wrong at this place", and I think I've already explained that, but for completeness once more:

There has been a simple rule added in bug 956338 - query for a weak reference + dereference of that weak reference + release of the referred object must all happen strictly only on a single thread.  The internal weak proxy object remembers the thread id on its first access and enforces any other access to happen on that very same thread since.

Hence, the whole architecture of referencing the server objects must be checked and probably overhauled with one of the suggested solutions, because that rule is apparently being broken here.  Question is what and how breaks it.  The task to fix this may not be trivial!

> 
> Oh, at
> https://dxr.mozilla.org/comm-central/rev/
> 8c7ddccf95c4168995fa6e57182133f22d21be03/mailnews/imap/src/nsImapProtocol.
> cpp#682 is another
> nsCOMPtr<nsIMsgIncomingServer> server = do_QueryReferent(m_server);
> and that can blow up, too? In total, there are six of them in
> nsImapProtocol.cpp.

There are even more:
https://dxr.mozilla.org/comm-central/search?q=regexp%3AnsCOMPtr%3CnsIMsgIncomingServer%3E+.*+%3D+do_QueryReferent&redirect=false

First of all, one would need to understand the lifetime of the server objects, and why we are using weak referencing for them at all - maybe there is no need?  According history, this has been added a long ago, so hard to say easily.  Someone has to dive into that code and find out.  If there is no one on the mailnews team, I can definitely help.
(In reply to Honza Bambas (:mayhemer) from comment #5)
> First of all, one would need to understand the lifetime of the server
> objects, and why we are using weak referencing for them at all - maybe there
> is no need?  According history, this has been added a long ago, so hard to
> say easily.  Someone has to dive into that code and find out.  If there is
> no one on the mailnews team, I can definitely help.
I'd say the only semi-available knowledgeable person would be Kent (since Joshua is down to very limited contributions). Kent is tied up in a million other things, so your help would be much appreciated. Perhaps we let Kent say something first.
Well its very very late in the Firefox 51 beta cycle. Personally I would prefer a backout of bug 956338 for now and putting it back in when 52 starts. This breaks both Thunderbird and Seamonkey and I am not sure it can be fixed in the next two weeks if its really that complicated.

The good thing is that after its fixed a lot of potential / real crashes will no longer happen.
The crashes are in debug builds only.
(In reply to Frank-Rainer Grahl from comment #7)
> Well its very very late in the Firefox 51 beta cycle. 

Bug 956338 has landed on m-c only (Nightly), and will not be uplifted.  I'm not sure why you refer to Beta here which is now 49.

The checks are also engaged only in nightly release builds and will not be engaged in aurora+ release builds (#ifdef NIGHTLY_BUILD.)  But will be engaged in all debug builds.

> Personally I would
> prefer a backout of bug 956338 for now and putting it back in when 52
> starts. This breaks both Thunderbird and Seamonkey and I am not sure it can
> be fixed in the next two weeks if its really that complicated.
> 
> The good thing is that after its fixed a lot of potential / real crashes
> will no longer happen.

I really don't want to back this out from m-c (affecting the browser), unless there is no other way for c-c to fix in a reasonable time frame.  Could that patch be also backed out only from m-c clone c-c is linking to?  I mean, a "local" patch.

I still want to wait for Kent that can make a call what/who should do here.

We could also use some existing c-c specific #define as a temp fix to disallow the checks (could land on m-c).
>> #ifdef NIGHTLY_BUILD.) 

Thanks. That explains why my non debug build fails. Then its not so bad. Can be backed out locally until fixed in Nightly.
The only multi-threaded use should be in the imap protocol. There is existing support there for a proxy to the server object for non-thread safe calls. I think that any uses of the weak pointer m_server could be adapted to the server proxy object.

I don't have that much experience with all of this, I'm just the last man standing.
Flags: needinfo?(rkent)
Are comments in bug 455966, bug 383489, bug 410747 comment 11 ...  helpful?

If not, just pose the question(s) (If they are different from comment 0) and I will PM bienvenu.
Attached patch Add server methods to serverSink (obsolete) — Splinter Review
Here is what needs to be done. This mostly works, but unfortunately I am still crashing because I am hitting this in ProcessCurrentURL():

  if (!m_imapMailFolderSink)
    SetupSinkProxy(); // try this again. Evil, but I'm desperate.

which calls methods, on the Imap thread, which is intended as main thread  only. So to make progress, we will need to figure out the source of the Evil mentioned here.
This doesn't seem to crash.

The previously failing point:

  if (!m_imapMailFolderSink)
    SetupSinkProxy(); // try this again. Evil, but I'm desperate.

seems to occur when the connection is reused to process a new URL, that is when

m_imapServerSink->LoadNextQueuedUrl(this, &anotherUrlRun);

returns a true anotherUrlRun. That is a normal situation. I can make that happen reliably by rapidly selecting successive folders in the folder pane, which queues multiple URLs on different folders to open those folders.

Somehow the thread management in this routine seems like one big kludge, but I followed the existing design and extended it to support main thread only weak references.
Assignee: nobody → rkent
Attachment #8786543 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8786632 - Flags: review?(jorgk)
What motivated you to fix this so quickly? ;-)

Try run here:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=4df3a9d3afc2
How silly of me, I submitted and "opt" try when this only fails in debug mode. Grr. Here again:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=902b0f0d42c8
Comment on attachment 8786632 [details] [diff] [review]
rev 2, run proxy setup on main thread

Review of attachment 8786632 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me and works. Thanks!
Attachment #8786632 - Flags: review?(jorgk) → review+
https://hg.mozilla.org/comm-central/rev/e5fd1b18853f
I assume this was ready for landing ;-)
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 51.0
Blocks: 1299076
"What motivated you to fix this so quickly? ;-)"

I'm taking a few days break from my C++ -> JS conversion in prep for the Brno trip. It seemed like I was the best candidate to fix it, so it was either now or in two weeks.
(In reply to Kent James (:rkent) from comment #19)
> "What motivated you to fix this so quickly? ;-)"
> 
> I'm taking a few days break from my C++ -> JS conversion in prep for the
> Brno trip. It seemed like I was the best candidate to fix it, so it was
> either now or in two weeks.

Thanks, great work!
Is there any reason to keep this bug hidden?
I don't know why it was a security issue in the first place.
(In reply to Jorg K (GMT+2, PTO during summer) from comment #22)
> I don't know why it was a security issue in the first place.

Because you never know to what this turns out to be?  Could well show being a critical vulnerability.  If you believe it's safe to open this bug, do so.
(In reply to Honza Bambas (:mayhemer) from comment #23)
> If you believe it's safe to open this bug, do so.

(In other words - if you believe this cannot be exploited.)
Group: mail-core-security
Not sure if its 100% fixed. The crash data is unfortunately imho more or less useless because no symbols available:

https://crash-stats.mozilla.com/report/index/309fdb88-e1fd-431c-90fb-a4af72160901

Conditions were like this prior to chrashing:
https://bugzilla.mozilla.org/show_bug.cgi?id=406929

I was asked for the password for sending Sending succeeded and then a message box popped up that the message couldn't be moved. Seamonkey 2.48a1 then crashed. If I can reproduce it I will do a debug build and see if I find the line where it occurs.

But the message was moved to the sent folder.
Hmm, there are a few calls to do_QueryReferent() left:
https://dxr.mozilla.org/comm-central/search?q=regexp%3AnsCOMPtr%3CnsIMsgIncomingServer%3E+.*+%3D+do_QueryReferent&redirect=false

What about those, Kent? In the context of asking for a password (see previous comment), the one in
nsImapProtocol::OnPromptStart() looks suspicious.
Flags: needinfo?(rkent)
I got a second crash by just opening mail & newsgroups with the network in the vm cut so it might be related to the error prompt. Will do a clobber build first to be sure but this takes some time on the laptop.
Calls to do_QueryReferent are OK when called from the UI thread, in fact I did not remove the weak reference so you should expect it is still used in some cases.

To justify a removal, you need to show a path where the method can be called from the IMAP thread.

Folder is known not-thread-safe, so call there should be OK.

The IMAP protocol calls mostly appear in methods that are flagged as UI-thread only. Concerning OnPromptStart, you would need to show a path to that from the IMAP thread. If so, it would need fixing by adding a proxy call to the main thread like the other cases.
Flags: needinfo?(rkent)
OK, this crashes:
IMAP account with stored SMTP password, no IMAP password stored. Send a message. Message gets sent, then it crashes. I'll attach a stack dump in a minute.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached file 1299116-crash.txt
Well, I wanted to attach the crash "in a minute", but it crashed on one message and then on the 31st (!!). Anyway, I was persistent and here is the crash.
Kent, this is a different problem, right?

Frank-Rainer, have you been able to capture any crashes?
Flags: needinfo?(rkent)
Well it is a related problem. We need to make sure that all weak references in the protocol release on the main thread.

See nsLDAPConnectionRunnable::~nsLDAPConnectionRunnable() for en example of how this is done. Care to give it a try?
Flags: needinfo?(rkent)
(In reply to Kent James (:rkent) from comment #32)
> See nsLDAPConnectionRunnable::~nsLDAPConnectionRunnable() for en example of
> how this is done. Care to give it a try?
Not really. I have other bugs I want to look into where my time is more effectively spent. I'd rather watch the master at work here, the last man standing ;-)
Let's continue this in bug 1299920. Honza predicted a few problems ;-)
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Why isn't this one exploitable, while the m-c equivalents of these fixes are considered potentially risky?
Flags: needinfo?(vseerror)
>> Frank-Rainer, have you been able to capture any crashes?

Lappi takes ages to compile suite so no. I am back at the desk later today and do a debug build. If something still needed I will post it in bug 1299920.
(In reply to aleth [:aleth] from comment #35)
> Why isn't this one exploitable, while the m-c equivalents of these fixes are
> considered potentially risky?

I don't know, but it does not have a sec-xxx designation and two key devs and questioned that it's exploitable. So it seems the only thing favoring explioitability is the refernce to m-c.
Flags: needinfo?(vseerror)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: