Closed Bug 902158 Opened 11 years ago Closed 10 years ago

crash in nsInterfaceRequestorAgg::Release / NS_ProxyRelease at 0x5a5a5a5a

Categories

(MailNews Core :: Networking, defect)

All
Windows 7
defect
Not set
critical

Tracking

(firefox23 affected, firefox24+ affected, firefox25 affected, thunderbird34 fixed, thunderbird35 fixed, thunderbird_esr3134+ fixed)

VERIFIED FIXED
Thunderbird 36.0
Tracking Status
firefox23 --- affected
firefox24 + affected
firefox25 --- affected
thunderbird34 --- fixed
thunderbird35 --- fixed
thunderbird_esr31 34+ fixed

People

(Reporter: scoobidiver, Assigned: rkent)

References

Details

(4 keywords, Whiteboard: [tbird betatopcrash][[tbird topcrash])

Crash Data

Attachments

(3 files, 3 obsolete files)

It's #2 top crasher in TB 23.0b1.

There are mainly two kinds of stack traces:
Frame 	Module 	Signature 	Source
0 	mozglue.dll 	moz_abort 	memory/build/jemalloc_config.c
1 	mozglue.dll 	arena_run_reg_dalloc 	memory/mozjemalloc/jemalloc.c
2 	mozglue.dll 	arena_dalloc_small 	memory/mozjemalloc/jemalloc.c
3 	mozglue.dll 	arena_dalloc 	memory/mozjemalloc/jemalloc.c
4 	mozglue.dll 	je_free 	memory/mozjemalloc/jemalloc.c
5 	xul.dll 	nsInterfaceRequestorAgg::Release() 	xpcom/base/nsInterfaceRequestorAgg.cpp
6 	xul.dll 	nsCOMPtr_base::assign_with_AddRef(nsISupports *) 	objdir-tb/mozilla/xpcom/build/nsCOMPtr.cpp
7 	xul.dll 	nsNSSSocketInfo::SetNotificationCallbacks(nsIInterfaceRequestor *) 	security/manager/ssl/src/nsNSSIOLayer.cpp
8 	xul.dll 	nsSocketTransport::OnSocketDetached(PRFileDesc *) 	netwerk/base/src/nsSocketTransport2.cpp
9 	xul.dll 	nsSocketTransportService::DetachSocket(nsSocketTransportService::SocketContext *,nsSocketTransportService::SocketContext *) 	netwerk/base/src/nsSocketTransportService2.cpp
10 	xul.dll 	nsSocketTransportService::DoPollIteration(bool) 	netwerk/base/src/nsSocketTransportService2.cpp
11 	xul.dll 	nsSocketTransportService::Run() 	netwerk/base/src/nsSocketTransportService2.cpp
12 	xul.dll 	nsThread::ProcessNextEvent(bool,bool *) 	xpcom/threads/nsThread.cpp
13 	xul.dll 	NS_ProcessNextEvent(nsIThread *,bool) 	objdir-tb/mozilla/xpcom/build/nsThreadUtils.cpp
14 	xul.dll 	nsThread::ThreadFunc(void *) 	xpcom/threads/nsThread.cpp
15 	nspr4.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c
16 	nspr4.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c
17 	msvcr100.dll 	_callthreadstartex 	f:\\dd\\vctools\\crt_bld\\self_x86\\crt\\src\\threadex.c
18 	msvcr100.dll 	_threadstartex 	f:\\dd\\vctools\\crt_bld\\self_x86\\crt\\src\\threadex.c
19 	kernel32.dll 	BaseThreadInitThunk 	
20 	ntdll.dll 	__RtlUserThreadStart 	
21 	ntdll.dll 	_RtlUserThreadStart

Frame 	Module 	Signature 	Source
0 	mozglue.dll 	moz_abort 	memory/mozjemalloc/jemalloc.c
1 	mozglue.dll 	arena_run_reg_dalloc 	memory/mozjemalloc/jemalloc.c
2 	mozglue.dll 	arena_dalloc_small 	memory/mozjemalloc/jemalloc.c
3 	mozglue.dll 	arena_dalloc 	memory/mozjemalloc/jemalloc.c
4 	mozglue.dll 	je_free 	memory/mozjemalloc/jemalloc.c
5 	xul.dll 	nsInterfaceRequestorAgg::Release() 	xpcom/base/nsInterfaceRequestorAgg.cpp
6 	xul.dll 	nsCOMPtr_base::assign_with_AddRef(nsISupports *) 	objdir-tb/mozilla/xpcom/build/nsCOMPtr.cpp
7 	xul.dll 	nsNSSSocketInfo::SetNotificationCallbacks(nsIInterfaceRequestor *) 	security/manager/ssl/src/nsNSSIOLayer.cpp
8 	xul.dll 	nsSocketTransport::SetSecurityCallbacks(nsIInterfaceRequestor *) 	netwerk/base/src/nsSocketTransport2.cpp
9 	xul.dll 	nsMsgProtocol::CloseSocket() 	mailnews/base/util/nsMsgProtocol.cpp
10 	xul.dll 	nsPop3Protocol::ProcessProtocolState(nsIURI *,nsIInputStream *,unsigned __int64,unsigned int) 	mailnews/local/src/nsPop3Protocol.cpp
11 	xul.dll 	nsMsgProtocol::OnDataAvailable(nsIRequest *,nsISupports *,nsIInputStream *,unsigned __int64,unsigned int) 	mailnews/base/util/nsMsgProtocol.cpp
12 	xul.dll 	nsInputStreamPump::OnStateTransfer() 	netwerk/base/src/nsInputStreamPump.cpp
13 	xul.dll 	nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream *) 	netwerk/base/src/nsInputStreamPump.cpp
14 	xul.dll 	nsInputStreamReadyEvent::Run() 	xpcom/io/nsStreamUtils.cpp
15 	xul.dll 	nsThread::ProcessNextEvent(bool,bool *) 	xpcom/threads/nsThread.cpp
16 	xul.dll 	NS_ProcessNextEvent(nsIThread *,bool) 	objdir-tb/mozilla/xpcom/build/nsThreadUtils.cpp
17 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) 	ipc/glue/MessagePump.cpp
18 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc
19 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc
20 	xul.dll 	nsBaseAppShell::Run() 	widget/xpwidgets/nsBaseAppShell.cpp
21 	xul.dll 	nsAppShell::Run() 	widget/windows/nsAppShell.cpp
22 	xul.dll 	nsAppStartup::Run() 	toolkit/components/startup/nsAppStartup.cpp
23 	xul.dll 	XREMain::XRE_mainRun() 	toolkit/xre/nsAppRunner.cpp
24 	xul.dll 	XREMain::XRE_main(int,char * * const,nsXREAppData const *) 	toolkit/xre/nsAppRunner.cpp
25 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp
26 	thunderbird.exe 	do_main 	mail/app/nsMailApp.cpp
27 	thunderbird.exe 	NS_internal_main(int,char * *) 	mail/app/nsMailApp.cpp
28 	thunderbird.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp
29 	thunderbird.exe 	__tmainCRTStartup 	f:/dd/vctools/crt_bld/self_x86/crt/src/crtexe.c
30 	kernel32.dll 	BaseThreadInitThunk 	
31 	ntdll.dll 	__RtlUserThreadStart 	
32 	ntdll.dll 	_RtlUserThreadStart

More reports at:
https://crash-stats.mozilla.com/report/list?product=Thunderbird&signature=moz_abort+|+arena_run_reg_dalloc+|+arena_dalloc_small+|+arena_dalloc+|+je_free+|+nsInterfaceRequestorAgg%3A%3ARelease%28%29
https://crash-stats.mozilla.com/report/list?product=Thunderbird&signature=jemalloc_crash+|+arena_dalloc+|+arena_run_reg_dalloc+|+nsInterfaceRequestorAgg%3A%3ARelease%28%29
Is this a regression from a Firefox landing ?
(In reply to bhavana bajaj [:bajaj] (on vacation until 08/14/2013) from comment #1)
> Is this a regression from a Firefox landing ?
TB 17.0 is unaffected so it's regression. In addition, the first stack trace has no TB frame so it might be a Networking regression.

The second stack trace (with TB frames) first showed up in 19.0a1/20121119. The regression range might be (this kind of signature can morph easily and no crash stats are available from the advanced search in 2012):
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=87703fb491e0&tochange=be3a0b4edebd
http://hg.mozilla.org/comm-central/pushloghtml?fromchange=1824e8daac72&tochange=7a8e57d2fbc2
It might be a regression from bug 812203.
Version: Trunk → 19 Branch
Whiteboard: [tbird topcrash] → [tbird betatopcrash]
Tracking and needinfo'ing mark to have this on his radar.
Flags: needinfo?(mbanner)
Ok, thanks.
Flags: needinfo?(mbanner)
stack is along is similar to bug 692981 / bug 673438
user geo frequently crashes using current daily 
example bp-4b298a1d-64bf-4841-a35f-159ba2130811
Thanks!
topcrash is being replaced by more precise keywords per https://bugzilla.mozilla.org/show_bug.cgi?id=927557#c3
Keywords: topcrash-win
Odd...why does this signature not appear in TB24 release? [1]
sig appears for alpha and beta 19-25. #1 crash for TB25.0b1, just as it was topcrash for prior betas

Also, is this really an NSS crash?


Matt's crash with TB27.0a1 bp-bf69d93f-0c77-43b8-86e7-aa4dd2131025

perhaps an odd manifestation of bug 673438?  [@ nsMsgProtocol::CloseSocket()] | [@ nsPop3Protocol::ProcessProtocolState]

[1] <12 crashes in 1 week of 24.x releases for any sig containing nsInterfaceRequestorAgg::Release - four are https://crash-stats.mozilla.com/report/list?signature=nsInterfaceRequestorAgg%3A%3ARelease%28%29&product=Thunderbird&query_type=contains&range_unit=weeks&process_type=any&version=Thunderbird%3A24.1.0&version=Thunderbird%3A24.0.1&version=Thunderbird%3A24.0&hang_type=any&date=2013-10-31+15%3A00%3A00&range_value=1#reports
Flags: needinfo?(brian)
Whiteboard: [tbird betatopcrash] → [tbird betatopcrash][dev-only]
Not really sure what I can do here without a regression  range or STR.
Flags: needinfo?(brian)
NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool) is perhaps related https://crash-stats.mozilla.com/report/list?signature=NS_ProxyRelease%28nsIEventTarget*%2C+nsISupports*%2C+bool%29&product=Thunderbird&query_type=is_exactly&range_unit=weeks&process_type=any&hang_type=any&date=2014-04-22+17%3A00%3A00&range_value=4#tab-sigsummary

TB31.0a1 bp-870f8b8c-b384-4217-9dc6-f6a332140419
Crash Signature: [@ moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfaceRequestorAgg::Release()] [@ jemalloc_crash | arena_dalloc | arena_run_reg_dalloc | nsInterfaceRequestorAgg::Release() ] → [@ moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfaceRequestorAgg::Release()] [@ jemalloc_crash | arena_dalloc | arena_run_reg_dalloc | nsInterfaceRequestorAgg::Release() ] [@ NS_ProxyRelease(nsIEventTarget*, ns…
See Also: → 673438
See Also: → 701533
#1 crash for TB31, now that version 31 is out for a month.
22% of all versoin 31 crashes
50% are startup

crashes with comment and emai address - https://crash-stats.mozilla.com/search/?signature=~NS_ProxyRelease%28nsIEventTarget*%2C+nsISupports*%2C+bool%29&email=~%40&email=!~.de&email=!~artistress%40gmail.com&email=!~.pl&user_comments=~a&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform
Summary: crash in nsInterfaceRequestorAgg::Release → crash in nsInterfaceRequestorAgg::Release / NS_ProxyRelease at 0x5a5a5a5a
Whiteboard: [tbird betatopcrash][dev-only] → [tbird betatopcrash]
Should we take this back to pop component?

at 25% of crashes I don't think we can update users to version 31.x automatically until this gets resolved - perhaps it deserves a different bug since behavior/crash rate is different in version 31 compared to version 24?

v31 crashes I sampled are all via nsPop3Protocol. 

For example bp-fc740d5a-8d45-4743-add4-0df322140828
 0 	xul.dll	NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool)	xpcom/glue/nsProxyRelease.cpp
1 	xul.dll	nsInterfaceRequestorAgg::~nsInterfaceRequestorAgg()	xpcom/base/nsInterfaceRequestorAgg.cpp
2 	xul.dll	nsInterfaceRequestorAgg::Release()	xpcom/base/nsInterfaceRequestorAgg.cpp
3 	xul.dll	nsCOMPtr_base::assign_with_AddRef(nsISupports*)	xpcom/glue/nsCOMPtr.cpp
4 	xul.dll	nsNSSSocketInfo::SetNotificationCallbacks(nsIInterfaceRequestor*)	security/manager/ssl/src/nsNSSIOLayer.cpp
5 	xul.dll	nsSocketTransport::SetSecurityCallbacks(nsIInterfaceRequestor*)	netwerk/base/src/nsSocketTransport2.cpp
6 	xul.dll	nsMsgProtocol::CloseSocket()	mailnews/base/util/nsMsgProtocol.cpp
7 	xul.dll	nsPop3Protocol::ProcessProtocolState(nsIURI*, nsIInputStream*, unsigned __int64, unsigned int)	mailnews/local/src/nsPop3Protocol.cpp
8 	xul.dll	nsMsgProtocol::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned __int64, unsigned int)	mailnews/base/util/nsMsgProtocol.cpp
Flags: needinfo?(ludovic)
Whiteboard: [tbird betatopcrash] → [tbird betatopcrash][[tbird topcrash]
(In reply to Wayne Mery (:wsmwk) from comment #13)
> Should we take this back to pop component?
> 
> at 25% of crashes I don't think we can update users to version 31.x
> automatically until this gets resolved - perhaps it deserves a different bug
> since behavior/crash rate is different in version 31 compared to version 24?
> 

It' still when we ask infor to neck that we crash. What would be nice would be to find which server makes us crash.
Flags: needinfo?(ludovic)
For some users seeing NS_ProxyRelease in version 31 seems to be a continuation of version 24 OOM | Small aka bug 1028720.  (And yet some OOM | Small users have completely stopped crashing in version 31)


(In reply to Ludovic Hirlimann [:Usul] from comment #14)
> (In reply to Wayne Mery (:wsmwk) from comment #13)
> > Should we take this back to pop component?
> > 
> > at 25% of crashes I don't think we can update users to version 31.x
> > automatically until this gets resolved - perhaps it deserves a different bug
> > since behavior/crash rate is different in version 31 compared to version 24?
> > 
> 
> It' still when we ask infor to neck that we crash. 

what?

> What would be nice would be to find which server makes us crash.

I'm not so sure there is a "server testcase" - are you thinking there are specific servers?  Although it's all pop, so far according to users it seems not to be predictable.
Blocks: 1028720
Flags: needinfo?(ludovic)
Product: Core → MailNews Core
Version: 19 Branch → 31
in 31.1.1 NS_ProxyRelease is #1 at 40% of crashes

Unclear if user n8kpl is a representative example, but he/she started crashing in version 31.1.1
bp-d5ddc241-d5b5-4262-81b4-19e782140915
bp-36d323d9-d035-41f9-9254-04c4f2140915
bp-6a069cac-0692-41a0-bbf4-001ff2140915
bp-70f3ff00-1882-4b73-b8ee-626b22140915
bp-e54d16d9-33f4-4b6f-a81a-681e72140916 

some of the users with
 NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool) eg bp-f4bd6a0b-7b3a-4211-adf9-a9f372140828

are also crashing in version 31 with 
 nsProxyReleaseEvent::Run() eg bp-c17cdb9c-8167-49d9-915b-b79a72140828

Do we need a new bug report for what's happening/regressed in 31.1.1?
Flags: needinfo?(irving)
(In reply to Wayne Mery (:wsmwk) from comment #16)
> some of the users with
>  NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool) eg
> bp-f4bd6a0b-7b3a-4211-adf9-a9f372140828
> 
> are also crashing in version 31 with 
>  nsProxyReleaseEvent::Run() eg bp-c17cdb9c-8167-49d9-915b-b79a72140828
> 
> Do we need a new bug report for what's happening/regressed in 31.1.1?

I don't think so. Both of the crash addresses are an address of already freed region in jemalloc. So I am guessing the reason of those crashes are the same. I have no idea how did it happen though.
The object that is being release prematurely is the nsInterfaceRequestorAgg object being created in nsPop3Protocol here as "ir":

          nsCOMPtr<nsIInterfaceRequestor> aggregrateIR;
          MsgNewInterfaceRequestorAggregation(notificationCallbacks, ir, getter_AddRefs(aggregrateIR));
          ir = aggregrateIR;

That ir is passed to the network socket, and its local reference discarded, here:

    rv = OpenNetworkSocketWithInfo(hostName.get(), port, connectionType, proxyInfo, ir);
    (which calls  strans->SetSecurityCallbacks(callbacks); with callbacks = ir)

We are now relying on the socket transport to hold onto the reference. Something seems to be going wrong with that release and destructor, so that the members of nsInterfaceRequestorAgg point to released memory. I can't figure out what, but one possible approach would be to add an additional reference to that object in the nsPOP3Protocol object.

I'll show a patch that shows this.
Attached patch 902158.patchSplinter Review
What do you think, Irving? Since we don't really understand the root cause this may just move the error, but it also is pretty low risk.
Attachment #8498466 - Flags: review?(irving)
I think rkent is right. I could imagine a possible crash scenario.

507     nsCOMPtr<nsIInterfaceRequestor> ir;
508     if (m_socketType != nsMsgSocketType::plain)
509     {
510       nsCOMPtr<nsIMsgWindow> msgwin;
511       mailnewsUrl->GetMsgWindow(getter_AddRefs(msgwin));
512       if (!msgwin)
513         GetTopmostMsgWindow(getter_AddRefs(msgwin));
514       if (msgwin)
515       {
516         nsCOMPtr<nsIDocShell> docshell;
517         msgwin->GetRootDocShell(getter_AddRefs(docshell));
518         ir = do_QueryInterface(docshell);
519         nsCOMPtr<nsIInterfaceRequestor> notificationCallbacks;
520         msgwin->GetNotificationCallbacks(getter_AddRefs(notificationCallbacks));
521         if (notificationCallbacks)
522         {
523           nsCOMPtr<nsIInterfaceRequestor> aggregrateIR;
524           MsgNewInterfaceRequestorAggregation(notificationCallbacks, ir, getter_AddRefs(aggregrateIR));

ir is now held by aggregrateIR,
 
525           ir = aggregrateIR;

ir is dereferenced. Now ir is held only by aggregrateIR as mSecond in nsInterfaceRequestorAgg.

526         }
527       }
528     }
529 
530     int32_t port = 0;
531     nsCString hostName;
532     aURL->GetPort(&port);
533     nsCOMPtr<nsIMsgIncomingServer> server = do_QueryInterface(m_pop3Server);
534     if (server)
535       server->GetRealHostName(hostName);
536 
537     nsCOMPtr<nsIProxyInfo> proxyInfo;
538     rv = MsgExamineForProxy("pop", hostName.get(), port, getter_AddRefs(proxyInfo));
539     if (NS_FAILED(rv)) proxyInfo = nullptr;
540 
541     const char *connectionType = nullptr;
542     if (m_socketType == nsMsgSocketType::SSL)
543       connectionType = "ssl";
544     else if (m_socketType == nsMsgSocketType::trySTARTTLS ||
545           m_socketType == nsMsgSocketType::alwaysSTARTTLS)
546         connectionType = "starttls";
547 
548     rv = OpenNetworkSocketWithInfo(hostName.get(), port, connectionType, proxyInfo, ir);
549     if (NS_FAILED(rv) && m_socketType == nsMsgSocketType::trySTARTTLS)
550     {
551       m_socketType = nsMsgSocketType::plain;
552       rv = OpenNetworkSocketWithInfo(hostName.get(), port, nullptr, proxyInfo, ir);
553     }
554 
555     if(NS_FAILED(rv))
556       return rv;

If this error occurs here, aggregrateIR is dereferenced. Then, 

55 nsInterfaceRequestorAgg::~nsInterfaceRequestorAgg()
56 {
57   nsIInterfaceRequestor* iir = nullptr;
58   mFirst.swap(iir);
59   if (iir) {
60     NS_ProxyRelease(mConsumerTarget, iir);
61   }
62   iir = nullptr;
63   mSecond.swap(iir);

iir does not hold the reference here because iir is just a pointer.
Who is in charge of the reference of the original ir?
So I am guessing the address of iir is now 0x5a5a5a5a.


But I can't understand  why the crash line is nsProxyRelease.cpp#l45?

45        rv = target->IsOnCurrentThread(&onCurrentThread);

target, mConsumerTarget in nsInterfaceRequestorAgg, is also dereferenced...
Hiro, thanks for the detailed analysis.

(In reply to Hiroyuki Ikezoe (:hiro) from comment #20)
> I think rkent is right. I could imagine a possible crash scenario.
> ...
> If this error occurs here, aggregrateIR is dereferenced. Then, 
> 
> 55 nsInterfaceRequestorAgg::~nsInterfaceRequestorAgg()
> 56 {
> 57   nsIInterfaceRequestor* iir = nullptr;
> 58   mFirst.swap(iir);
> 59   if (iir) {
> 60     NS_ProxyRelease(mConsumerTarget, iir);
> 61   }
> 62   iir = nullptr;
> 63   mSecond.swap(iir);
> 
> iir does not hold the reference here because iir is just a pointer.
> Who is in charge of the reference of the original ir?
> So I am guessing the address of iir is now 0x5a5a5a5a.

nsCOMPtr<>.swap() doesn't decrement the reference count of the object (see http://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsCOMPtr.h#775), so 'iir' should point to an object that still has refcount == 1; NS_ProxyRelease decrements the refcount.

> But I can't understand  why the crash line is nsProxyRelease.cpp#l45?
> 
> 45        rv = target->IsOnCurrentThread(&onCurrentThread);
> 
> target, mConsumerTarget in nsInterfaceRequestorAgg, is also dereferenced...

The crash signature makes it look like the whole nsInterfaceRequestorAgg object is being freed and nulled out; I thin the the pointer stored in mConsumerTarget is 0x5a..., not the memory pointed to by mConsumerTarget. That makes it look like 'this' (the nsInterfaceRequestorAgg) is being freed and overwritten with 0xA5.. between line 58 and line 65 of ~nsInterfaceRequestorAgg(). That suggests to me that we're double-freeing the nsInterfaceRequestorAgg either because of a thread race, or recursively when we free mFirst.

The constructor is public, which makes me nervous that somebody could be creating an nsInterfaceRequestorAgg on the stack unless there's some other mechanism preventing that. However, a DXR search doesn't find any suspicious uses.
Assignee: nobody → kent
Flags: needinfo?(irving)
OK, now I hate the MOZILLA_INTERNAL_API split even more.

Here's another possibility: when we create these interface aggregators in mailnews, we call msgNewInterfaceAggregation(), which is defined in nsMsgUtils.h with different definitions #ifdef MOZILLA_INTERNAL_API:

true at http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.h#337

false at http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.h#454

The external API version of the factory calls through http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.cpp#1992 to http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.cpp#1967, which constructs a version of the object that's not compatible with the one from xpcom/base/nsInterfaceRequestorAgg.cpp. The 0x5a5a5a5a we're reading for mConsumerTarget is because that member is after then end of the memory allocated for the MsgInterfaceRequestorAgg object in nsMsgUtils.cpp.

My *strong* preference is to burn the external API version of this definition all the way to the ground, and always use the definition in xpcom/base/nsInterfaceRequestorAgg.cpp
Flags: needinfo?(josh)
(In reply to Irving Reid from comment #22)
> OK, now I hate the MOZILLA_INTERNAL_API split even more.
Well, it wouldn't have been so bad if more of the internal symbols had been exposed, but there you go, it got forked at the time because it was the only way we could build against the libxul SDK.

> Here's another possibility: when we create these interface aggregators in
> mailnews, we call msgNewInterfaceAggregation(), which is defined in
> nsMsgUtils.h with different definitions #ifdef MOZILLA_INTERNAL_API:
Indeed, but all Mozilla releases are compiled with MOZILLA_INTERNAL_API, and building without it fails on trunk (although it's possible that some Linux distros shipped xulrunner versions at the time that it did build).

> The 0x5a5a5a5a we're reading for mConsumerTarget
And of course mConsumerTarget was added when bug 804655/bug 812203 changed the behaviour of nsInterfaceRequestorAgg, as mentioned in comment #2.
Blocks: 804655
Flags: needinfo?(ludovic)
The MOZILLA_INTERNAL_API distinction is still necessary because things are different if you're libxul and if you're not (if you're for example, a XPCOM binary component or an executable loading libxul with the xpcom standalone glue).
I was so proud of myself for figuring out Comment 22, and then I went off and thought about it more and convinced myself I was wrong.

1) If the mConsumerTarget field was 0x5a... because the object had been aliased to the wrong type and the code was accessing out of bounds, the *first* NS_ProxyRelease in line 60 would fail, but we're seeing the *second* call fail.

2) As far as I can tell, we're calling everything through virtual interfaces, so we wouldn't end up calling the nsInterfaceRequestorAgg destructor on an *msg*InterfaceRequestorAgg anyway.

Now I'm coming back around to what I said in comment 21 - it seems likely to me that either the NS_ProxyRelease of mFirst is somehow recursing around and causing a double free of the nsInterfaceRequestorAgg itself, or a thread race is doing it - except that unless code somewhere in that path is deliberately manipulating the reference count outside the standard implementation macros, I can't see anything that would trigger the extra free.

That said, the mailnews version of MsgInterfaceRequestorAgg could still cause us unrelated thread safety problems, since it doesn't meet xpcom's assumptions about the interface requestor object taking care of making sure things are destroyed on the correct thread.
I was racking my brains to try and figure out why it could be possible for a refcounted object to get destroyed twice, and I think I have a plausible story.

In the refcount 'Release' definition http://dxr.mozilla.org/comm-central/source/mozilla/xpcom/glue/nsISupportsImpl.h#621 (the definition of NS_IMPL_RELEASE_WITH_DESTROY), after we've decided that an object's refcount has gone to 0, we change it back to 1 to protect us against double-frees in the case that some part of the destructor briefly grabs a (counted) reference. Unfortunately, that makes us vulnerable to a double-free if someone has an extra reference to that object and only *decrements* the reference count. I don't know of a way for that to happen without there being a mistake in reference counting somewhere in the code (otherwise the first release wouldn't have changed the count to 0).

I haven't traced through code to see when and where that might happen, though based on comment 25 I'd guess the offending reference is held by whatever is in the mFirst field of the nsInterfaceRequestorAgg. At this point I really need to work on some other things; I hope someone can move forward from here...
Attached patch Possible fixSplinter Review
I hope I am getting closer to the cause of the double free.
One is of course secCtrl->SetNotificationCallbacks(threadsafeCallbacks); in nsSocketTransport::SetSecurityCallbacks from nsMsgProtocol::CloseSocket().

The other is secCtrl->SetNotificationCallbacks(nullptr) at
http://mxr.mozilla.org/mozilla-esr31/source/netwerk/base/src/nsSocketTransport2.cpp#1890 from nsSocketTransportService::DoPollIteration().

I don't check all crash reports but I am guessing all of reports have DoPollIteration symbols.

A typical report is:
https://crash-stats.mozilla.com/report/index/878f5c39-14fb-4cdb-991d-7d7652140929
The stack trace of thread 3 in this reports shows us exactly the other double free position.
Attachment #8500253 - Flags: feedback?(irving)
Attachment #8500253 - Flags: feedback?(kent)
Comment on attachment 8500253 [details] [diff] [review]
Possible fix

Review of attachment 8500253 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for finding the crash report evidence of the release happening on the other thread, Hiro, that's a great catch. Normally a race condition like that would happen really rarely, but the activity on the main thread is triggered by the onSocketReady calls just a few lines above where the I/O thread tries to clear the callback (http://mxr.mozilla.org/mozilla-esr31/source/netwerk/base/src/nsSocketTransport2.cpp#1882 and then http://mxr.mozilla.org/mozilla-esr31/source/netwerk/base/src/nsSocketTransport2.cpp#1890) so the two threads always try to null out the security callbacks at around the same time.

Looking at how the security callbacks are handled elsewhere in mozilla-central, nsHTTPConnection only calls nsSocketTransport.SetSecurityCallbacks on the I/O thread, see http://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnection.cpp#144 and look around that file for socketTransport->SetSecurityCallbacks. It also only seems to do so when it's doing particularly tricky things.

The MDN documentation for nsISocketTransport says that the 'securityCallbacks' attribute should not be changed after the socket is opened: https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsISocketTransport#Attributes. I'd like to get feedback from an NSS / Networking expert, but I wonder if the fix is to *not* clear the security callbacks in our destructor?
Attachment #8500253 - Flags: feedback?(irving) → feedback+
Looks like my presence isn't required here.
Flags: needinfo?(josh)
Comment on attachment 8500253 [details] [diff] [review]
Possible fix

briansmith, can you please give us some feedback or suggest a proper person to step forward to fix this bug.
Please see comment #27 and comment #28.
Attachment #8500253 - Flags: feedback?(brian)
Comment on attachment 8500253 [details] [diff] [review]
Possible fix

Review of attachment 8500253 [details] [diff] [review]:
-----------------------------------------------------------------

I'm going to let Richard handle this.

See also https://bugzilla.mozilla.org/buglist.cgi?quicksearch=nsNSSSocketInfo.

Usually if you need a lock in a setter, you also need the same lock in the corresponding getter.
Attachment #8500253 - Flags: feedback?(brian) → feedback?(rlb)
Would we be able to fix this bug by requiring that MsgNewInterfaceRequestorAggregation proxy releases through the main thread? Is that what nsMainThreadPtrHandle is trying to do?
Comment on attachment 8500253 [details] [diff] [review]
Possible fix

Review of attachment 8500253 [details] [diff] [review]:
-----------------------------------------------------------------

My threading foo is not really strong enough to comment on the approach of locking versus other methods of insuring that the release does not occur on both threads. I don't believe though that we should simply " if the fix is to *not* clear the security callbacks in our destructor?" as other examples, such as nsHTTPConnection, also clear the security callbacks (at least on error).
Attachment #8500253 - Flags: feedback?(kent)
(In reply to Kent James (:rkent) from comment #32)
> Would we be able to fix this bug by requiring that
> MsgNewInterfaceRequestorAggregation proxy releases through the main thread?
> Is that what nsMainThreadPtrHandle is trying to do?

Yes, that is the purpose of nsMainThreadPtrHandle (as well as optionally preventing any dereferencing of the pointer on non-main threads).
But we can't fix the crash on NSS side. Am I right?

See for example:
https://crash-stats.mozilla.com/report/index/b10f8b61-ce27-4af2-8397-9b2c52140929

You mean nsMainThreadPtrHandle should be used on both threads?
To be clear, I have no idea if nsMainThreadPtrHandle is a good solution here. I was answering the second question in comment 34.
Thanks briansmith and jdm. I did not know that nsNSSSocketInfo is not thread-safe actually.

For the record, I am putting bug 427948 into dependency list.

Anyway I am waiting for feedbacks from Richard.
Depends on: 427948
Blocks: 783998
A little history on the nulling of securityCallbacks, as a comment to "but I wonder if the fix is to *not* clear the security callbacks in our destructor?" in comment 28.

The nulling of the security callbacks in nsMsgProtocol were set in bug 176919 in 2003, but that was replacing and even earlier call that nulled the notification callbacks directly. So this is ancient, original code.

Meanwhile, in 2005 bug 285991 adds the nulling of securityCallbacks in nsSocketTransport::OnSocketDetached because of an nsHTTPS leak.

So one possible interpretation is that the nulling of the securityCallbacks in the msg destructor was breaking the cycle, that subsequently was broken at a different place, hence negating then need for that nulling as Irving suggests.
This is another possible fix what we can do only in comm-central without any modifications of mozilla-central.

Luckily nsSocketTransport::SetSecurityCallbacks uses NS_GetCurrentThread() instead of NS_GetMainThread() so we can call the method off the main thread.

Josh, what do you think about this?
Attachment #8502913 - Flags: feedback?(josh)
The previous patch contains a horrible mistake passed always nullptr to SetSecurityCallbacks().
Attachment #8502913 - Attachment is obsolete: true
Attachment #8502913 - Flags: feedback?(josh)
Attachment #8502932 - Flags: feedback?(josh)
Comment on attachment 8502932 [details] [diff] [review]
SetSecurityCallbacks on socket thread v2

Clearing feedback? flag.

The patch causes an issue that exception dialog is not open.
Attachment #8502932 - Flags: feedback?(josh)
Attached patch Use nsMainThreadPtrHolder (obsolete) — Splinter Review
I did totally mistake. We can protect us from the crash if we use nsMainThreadPtrHolder in comm-central because the callbacks, which is passing to SetSecurityCallbacks, is only created in our side.
Attachment #8502932 - Attachment is obsolete: true
Attachment #8502968 - Flags: review?(kent)
Next week Hiro, rkent, and Irving will all be in Toronto together. At that point we will review this bug and decide the best course of action. Currently I am leaning to simply removing the offending call to SetSecurityCallbacks in CloseSocket, as I believe that it duplicates a call that was added later in necko code.
(In reply to Kent James (:rkent) from comment #43)
> Next week Hiro, rkent, and Irving will all be in Toronto together. At that
> point we will review this bug and decide the best course of action.
> Currently I am leaning to simply removing the offending call to
> SetSecurityCallbacks in CloseSocket, as I believe that it duplicates a call
> that was added later in necko code.

Perhaps the best approach is create a try build without it, and we can look for a memory leak, (If I understand the background I am reading here). I do not see the crash,  but a memory leak should become apparent here in a couple of days if there were one.
(In reply to Matt from comment #44)
> (In reply to Kent James (:rkent) from comment #43)
> > Next week Hiro, rkent, and Irving will all be in Toronto together. At that
> > point we will review this bug and decide the best course of action.
> > Currently I am leaning to simply removing the offending call to
> > SetSecurityCallbacks in CloseSocket, as I believe that it duplicates a call
> > that was added later in necko code.
> 
> Perhaps the best approach is create a try build without it, and we can look
> for a memory leak, (If I understand the background I am reading here). I do
> not see the crash,  but a memory leak should become apparent here in a
> couple of days if there were one.

I don't think we can check all possible cases of nulling out the security callbacks because we can not reproduce the crash case yet.
(In reply to Hiroyuki Ikezoe (:hiro) from comment #45)
> I don't think we can check all possible cases of nulling out the security
> callbacks because we can not reproduce the crash case yet.

Here is another user with plenty of these crashes.
https://support.mozilla.org/en-US/questions/1024501
Example: bp-953500eb-82d1-4c9d-b03f-293ae2141010
Comment on attachment 8498466 [details] [diff] [review]
902158.patch

Review of attachment 8498466 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like we've settled on a different approach.
Attachment #8498466 - Flags: review?(irving)
Comment on attachment 8500253 [details] [diff] [review]
Possible fix

Review of attachment 8500253 [details] [diff] [review]:
-----------------------------------------------------------------

It's good general practice to specify what resource the lock is controlling access to.  So a more descriptive name than "mLock" would be helpful.  (mCallbackLock?)  

Then you need to acquire the lock before you do anything with the protected resource.  It seems like the resource in this case is mCallbacks, so you should lock at least in GetNotificationCallbacks.
Attachment #8500253 - Flags: feedback?(rlb)
The fix we agreed on at the TB summit. Not well tested because of the broken tree.
Attachment #8507116 - Flags: review?(kent)
Attachment #8507116 - Flags: review?(hiikezoe)
Attachment #8502968 - Attachment is obsolete: true
Attachment #8502968 - Flags: review?(kent)
Attachment #8507116 - Flags: review?(hiikezoe) → review+
Attachment #8507116 - Flags: review?(kent) → review+
Given Nightlies are broken at the moment, is their a test version that those affected could try out.
https://hg.mozilla.org/comm-central/rev/1e2c95e9daf3

We'll either need to sort out the rest of the Nightly build issues or uplift this to Aurora to get a version people can test.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment on attachment 8507116 [details] [diff] [review]
Don't null the socket security callbacks from the main thread

[Triage Comment]
Lets get this onto aurora and beta now - that'll give us a wider audience. Then we might be able to get onto esr in time for the next release.
Attachment #8507116 - Flags: approval-comm-esr31?
Attachment #8507116 - Flags: approval-comm-beta+
Attachment #8507116 - Flags: approval-comm-aurora+
as is typical*, the only way to know whether we have a good fix for this crash is A. have it ship in release, or B. directly query users who are frequently crashing.  We aren't doing A. and B. will require a week of feedback from those users.  I've emailed about a dozen users and expect roughly half will reply within several days, so I will report results by end of next week.

* despite this being a topcrash, the crash rate (5 per week) is not sufficient in auora or nightly to get a statistically significant result. But if we were shipping a new beta today (which we currently are not) then we'd know within a 2-3 days, because the beta crash rate of 500 per week is sufficient to get quick statistical feedback.
Flags: needinfo?(vseerror)
good results - all (early) reports from about 12 users I have testing the 36.0a1 nightly build are positive - so far no crashes and no ill effects.  But as expected, general statistics out of crash-stats have not been helpful.

I'd prefer to see bigger testing via beta before shipping this in an ESR release, but that won't happen unless we build a new one before Nov 24. But *if* we deem the patch is low risk then I guess it doesn't matter much if it gets tested in beta.
Flags: needinfo?(vseerror)
Comment on attachment 8507116 [details] [diff] [review]
Don't null the socket security callbacks from the main thread

At this stage, I think we should just go for it.
Attachment #8507116 - Flags: approval-comm-esr31? → approval-comm-esr31+
Thanks irving, hiro, kent, briansmith, jdm ...

v.fixed in TB31.3.0 per https://crash-stats.mozilla.com/query/?product=Thunderbird&version=Thunderbird%3A31.3.0&range_value=1&range_unit=weeks&date=12%2F03%2F2014+14%3A00%3A00&query_search=signature&query_type=contains&query=&reason=&release_channels=&build_id=&process_type=any&hang_type=any

So this 10x rate crash signature is gone.  But it masked to some extent bug 902158.   So signature nsInterfaceRequestorAgg::GetInterface(nsID const&, void**) returns to its glory as #1 crasher, as it is in version 24 - at a rate of 2x compared to the #2 crash signature OOM | small.
Status: RESOLVED → VERIFIED
See Also: → 907096
(In reply to Wayne Mery (:wsmwk) from comment #58)
> Thanks irving, hiro, kent, briansmith, jdm ...
> 
> v.fixed in TB31.3.0 per
> https://crash-stats.mozilla.com/query/
> ?product=Thunderbird&version=Thunderbird%3A31.3.
> 0&range_value=1&range_unit=weeks&date=12%2F03%2F2014+14%3A00%3A00&query_searc
> h=signature&query_type=contains&query=&reason=&release_channels=&build_id=&pr
> ocess_type=any&hang_type=any
> 
> So this 10x rate crash signature is gone.  But it masked to some extent bug
> 902158.   So signature nsInterfaceRequestorAgg::GetInterface(nsID const&,
> void**) returns to its glory as #1 crasher, as it is in version 24 - at a
> rate of 2x compared to the #2 crash signature OOM | small.

I think I meant to say, "masked by bug 907096".
Blocks: 955947
Target Milestone: --- → Thunderbird 36.0
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: