Closed Bug 902158 Opened 12 years ago Closed 10 years ago

crash in nsInterfaceRequestorAgg::Release / NS_ProxyRelease at 0x5a5a5a5a

Categories

(MailNews Core :: Networking, defect)

All
Windows 7
defect
Not set
critical

Tracking

(firefox23 affected, firefox24+ affected, firefox25 affected, thunderbird34 fixed, thunderbird35 fixed, thunderbird_esr3134+ fixed)

VERIFIED FIXED
Thunderbird 36.0
Tracking Status
firefox23 --- affected
firefox24 + affected
firefox25 --- affected
thunderbird34 --- fixed
thunderbird35 --- fixed
thunderbird_esr31 34+ fixed

People

(Reporter: scoobidiver, Assigned: rkent)

References

Details

(4 keywords, Whiteboard: [tbird betatopcrash][[tbird topcrash])

Crash Data

Attachments

(3 files, 3 obsolete files)

It's #2 top crasher in TB 23.0b1. There are mainly two kinds of stack traces: Frame Module Signature Source 0 mozglue.dll moz_abort memory/build/jemalloc_config.c 1 mozglue.dll arena_run_reg_dalloc memory/mozjemalloc/jemalloc.c 2 mozglue.dll arena_dalloc_small memory/mozjemalloc/jemalloc.c 3 mozglue.dll arena_dalloc memory/mozjemalloc/jemalloc.c 4 mozglue.dll je_free memory/mozjemalloc/jemalloc.c 5 xul.dll nsInterfaceRequestorAgg::Release() xpcom/base/nsInterfaceRequestorAgg.cpp 6 xul.dll nsCOMPtr_base::assign_with_AddRef(nsISupports *) objdir-tb/mozilla/xpcom/build/nsCOMPtr.cpp 7 xul.dll nsNSSSocketInfo::SetNotificationCallbacks(nsIInterfaceRequestor *) security/manager/ssl/src/nsNSSIOLayer.cpp 8 xul.dll nsSocketTransport::OnSocketDetached(PRFileDesc *) netwerk/base/src/nsSocketTransport2.cpp 9 xul.dll nsSocketTransportService::DetachSocket(nsSocketTransportService::SocketContext *,nsSocketTransportService::SocketContext *) netwerk/base/src/nsSocketTransportService2.cpp 10 xul.dll nsSocketTransportService::DoPollIteration(bool) netwerk/base/src/nsSocketTransportService2.cpp 11 xul.dll nsSocketTransportService::Run() netwerk/base/src/nsSocketTransportService2.cpp 12 xul.dll nsThread::ProcessNextEvent(bool,bool *) xpcom/threads/nsThread.cpp 13 xul.dll NS_ProcessNextEvent(nsIThread *,bool) objdir-tb/mozilla/xpcom/build/nsThreadUtils.cpp 14 xul.dll nsThread::ThreadFunc(void *) xpcom/threads/nsThread.cpp 15 nspr4.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c 16 nspr4.dll pr_root nsprpub/pr/src/md/windows/w95thred.c 17 msvcr100.dll _callthreadstartex f:\\dd\\vctools\\crt_bld\\self_x86\\crt\\src\\threadex.c 18 msvcr100.dll _threadstartex f:\\dd\\vctools\\crt_bld\\self_x86\\crt\\src\\threadex.c 19 kernel32.dll BaseThreadInitThunk 20 ntdll.dll __RtlUserThreadStart 21 ntdll.dll _RtlUserThreadStart Frame Module Signature Source 0 mozglue.dll moz_abort memory/mozjemalloc/jemalloc.c 1 mozglue.dll arena_run_reg_dalloc memory/mozjemalloc/jemalloc.c 2 mozglue.dll arena_dalloc_small memory/mozjemalloc/jemalloc.c 3 mozglue.dll arena_dalloc memory/mozjemalloc/jemalloc.c 4 mozglue.dll je_free memory/mozjemalloc/jemalloc.c 5 xul.dll nsInterfaceRequestorAgg::Release() xpcom/base/nsInterfaceRequestorAgg.cpp 6 xul.dll nsCOMPtr_base::assign_with_AddRef(nsISupports *) objdir-tb/mozilla/xpcom/build/nsCOMPtr.cpp 7 xul.dll nsNSSSocketInfo::SetNotificationCallbacks(nsIInterfaceRequestor *) security/manager/ssl/src/nsNSSIOLayer.cpp 8 xul.dll nsSocketTransport::SetSecurityCallbacks(nsIInterfaceRequestor *) netwerk/base/src/nsSocketTransport2.cpp 9 xul.dll nsMsgProtocol::CloseSocket() mailnews/base/util/nsMsgProtocol.cpp 10 xul.dll nsPop3Protocol::ProcessProtocolState(nsIURI *,nsIInputStream *,unsigned __int64,unsigned int) mailnews/local/src/nsPop3Protocol.cpp 11 xul.dll nsMsgProtocol::OnDataAvailable(nsIRequest *,nsISupports *,nsIInputStream *,unsigned __int64,unsigned int) mailnews/base/util/nsMsgProtocol.cpp 12 xul.dll nsInputStreamPump::OnStateTransfer() netwerk/base/src/nsInputStreamPump.cpp 13 xul.dll nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream *) netwerk/base/src/nsInputStreamPump.cpp 14 xul.dll nsInputStreamReadyEvent::Run() xpcom/io/nsStreamUtils.cpp 15 xul.dll nsThread::ProcessNextEvent(bool,bool *) xpcom/threads/nsThread.cpp 16 xul.dll NS_ProcessNextEvent(nsIThread *,bool) objdir-tb/mozilla/xpcom/build/nsThreadUtils.cpp 17 xul.dll mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) ipc/glue/MessagePump.cpp 18 xul.dll MessageLoop::RunHandler() ipc/chromium/src/base/message_loop.cc 19 xul.dll MessageLoop::Run() ipc/chromium/src/base/message_loop.cc 20 xul.dll nsBaseAppShell::Run() widget/xpwidgets/nsBaseAppShell.cpp 21 xul.dll nsAppShell::Run() widget/windows/nsAppShell.cpp 22 xul.dll nsAppStartup::Run() toolkit/components/startup/nsAppStartup.cpp 23 xul.dll XREMain::XRE_mainRun() toolkit/xre/nsAppRunner.cpp 24 xul.dll XREMain::XRE_main(int,char * * const,nsXREAppData const *) toolkit/xre/nsAppRunner.cpp 25 xul.dll XRE_main toolkit/xre/nsAppRunner.cpp 26 thunderbird.exe do_main mail/app/nsMailApp.cpp 27 thunderbird.exe NS_internal_main(int,char * *) mail/app/nsMailApp.cpp 28 thunderbird.exe wmain toolkit/xre/nsWindowsWMain.cpp 29 thunderbird.exe __tmainCRTStartup f:/dd/vctools/crt_bld/self_x86/crt/src/crtexe.c 30 kernel32.dll BaseThreadInitThunk 31 ntdll.dll __RtlUserThreadStart 32 ntdll.dll _RtlUserThreadStart More reports at: https://crash-stats.mozilla.com/report/list?product=Thunderbird&signature=moz_abort+|+arena_run_reg_dalloc+|+arena_dalloc_small+|+arena_dalloc+|+je_free+|+nsInterfaceRequestorAgg%3A%3ARelease%28%29 https://crash-stats.mozilla.com/report/list?product=Thunderbird&signature=jemalloc_crash+|+arena_dalloc+|+arena_run_reg_dalloc+|+nsInterfaceRequestorAgg%3A%3ARelease%28%29
Is this a regression from a Firefox landing ?
(In reply to bhavana bajaj [:bajaj] (on vacation until 08/14/2013) from comment #1) > Is this a regression from a Firefox landing ? TB 17.0 is unaffected so it's regression. In addition, the first stack trace has no TB frame so it might be a Networking regression. The second stack trace (with TB frames) first showed up in 19.0a1/20121119. The regression range might be (this kind of signature can morph easily and no crash stats are available from the advanced search in 2012): http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=87703fb491e0&tochange=be3a0b4edebd http://hg.mozilla.org/comm-central/pushloghtml?fromchange=1824e8daac72&tochange=7a8e57d2fbc2 It might be a regression from bug 812203.
Version: Trunk → 19 Branch
Whiteboard: [tbird topcrash] → [tbird betatopcrash]
Tracking and needinfo'ing mark to have this on his radar.
Flags: needinfo?(mbanner)
Ok, thanks.
Flags: needinfo?(mbanner)
stack is along is similar to bug 692981 / bug 673438
user geo frequently crashes using current daily example bp-4b298a1d-64bf-4841-a35f-159ba2130811
Thanks!
topcrash is being replaced by more precise keywords per https://bugzilla.mozilla.org/show_bug.cgi?id=927557#c3
Keywords: topcrash-win
Odd...why does this signature not appear in TB24 release? [1] sig appears for alpha and beta 19-25. #1 crash for TB25.0b1, just as it was topcrash for prior betas Also, is this really an NSS crash? Matt's crash with TB27.0a1 bp-bf69d93f-0c77-43b8-86e7-aa4dd2131025 perhaps an odd manifestation of bug 673438? [@ nsMsgProtocol::CloseSocket()] | [@ nsPop3Protocol::ProcessProtocolState] [1] <12 crashes in 1 week of 24.x releases for any sig containing nsInterfaceRequestorAgg::Release - four are https://crash-stats.mozilla.com/report/list?signature=nsInterfaceRequestorAgg%3A%3ARelease%28%29&product=Thunderbird&query_type=contains&range_unit=weeks&process_type=any&version=Thunderbird%3A24.1.0&version=Thunderbird%3A24.0.1&version=Thunderbird%3A24.0&hang_type=any&date=2013-10-31+15%3A00%3A00&range_value=1#reports
Flags: needinfo?(brian)
Whiteboard: [tbird betatopcrash] → [tbird betatopcrash][dev-only]
Not really sure what I can do here without a regression range or STR.
Flags: needinfo?(brian)
Crash Signature: [@ moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfaceRequestorAgg::Release()] [@ jemalloc_crash | arena_dalloc | arena_run_reg_dalloc | nsInterfaceRequestorAgg::Release() ] → [@ moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfaceRequestorAgg::Release()] [@ jemalloc_crash | arena_dalloc | arena_run_reg_dalloc | nsInterfaceRequestorAgg::Release() ] [@ NS_ProxyRelease(nsIEventTarget*, ns…
See Also: → 673438
See Also: → 701533
Summary: crash in nsInterfaceRequestorAgg::Release → crash in nsInterfaceRequestorAgg::Release / NS_ProxyRelease at 0x5a5a5a5a
Whiteboard: [tbird betatopcrash][dev-only] → [tbird betatopcrash]
Should we take this back to pop component? at 25% of crashes I don't think we can update users to version 31.x automatically until this gets resolved - perhaps it deserves a different bug since behavior/crash rate is different in version 31 compared to version 24? v31 crashes I sampled are all via nsPop3Protocol. For example bp-fc740d5a-8d45-4743-add4-0df322140828 0 xul.dll NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool) xpcom/glue/nsProxyRelease.cpp 1 xul.dll nsInterfaceRequestorAgg::~nsInterfaceRequestorAgg() xpcom/base/nsInterfaceRequestorAgg.cpp 2 xul.dll nsInterfaceRequestorAgg::Release() xpcom/base/nsInterfaceRequestorAgg.cpp 3 xul.dll nsCOMPtr_base::assign_with_AddRef(nsISupports*) xpcom/glue/nsCOMPtr.cpp 4 xul.dll nsNSSSocketInfo::SetNotificationCallbacks(nsIInterfaceRequestor*) security/manager/ssl/src/nsNSSIOLayer.cpp 5 xul.dll nsSocketTransport::SetSecurityCallbacks(nsIInterfaceRequestor*) netwerk/base/src/nsSocketTransport2.cpp 6 xul.dll nsMsgProtocol::CloseSocket() mailnews/base/util/nsMsgProtocol.cpp 7 xul.dll nsPop3Protocol::ProcessProtocolState(nsIURI*, nsIInputStream*, unsigned __int64, unsigned int) mailnews/local/src/nsPop3Protocol.cpp 8 xul.dll nsMsgProtocol::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned __int64, unsigned int) mailnews/base/util/nsMsgProtocol.cpp
Flags: needinfo?(ludovic)
Whiteboard: [tbird betatopcrash] → [tbird betatopcrash][[tbird topcrash]
(In reply to Wayne Mery (:wsmwk) from comment #13) > Should we take this back to pop component? > > at 25% of crashes I don't think we can update users to version 31.x > automatically until this gets resolved - perhaps it deserves a different bug > since behavior/crash rate is different in version 31 compared to version 24? > It' still when we ask infor to neck that we crash. What would be nice would be to find which server makes us crash.
Flags: needinfo?(ludovic)
For some users seeing NS_ProxyRelease in version 31 seems to be a continuation of version 24 OOM | Small aka bug 1028720. (And yet some OOM | Small users have completely stopped crashing in version 31) (In reply to Ludovic Hirlimann [:Usul] from comment #14) > (In reply to Wayne Mery (:wsmwk) from comment #13) > > Should we take this back to pop component? > > > > at 25% of crashes I don't think we can update users to version 31.x > > automatically until this gets resolved - perhaps it deserves a different bug > > since behavior/crash rate is different in version 31 compared to version 24? > > > > It' still when we ask infor to neck that we crash. what? > What would be nice would be to find which server makes us crash. I'm not so sure there is a "server testcase" - are you thinking there are specific servers? Although it's all pop, so far according to users it seems not to be predictable.
Blocks: 1028720
Flags: needinfo?(ludovic)
Product: Core → MailNews Core
Version: 19 Branch → 31
in 31.1.1 NS_ProxyRelease is #1 at 40% of crashes Unclear if user n8kpl is a representative example, but he/she started crashing in version 31.1.1 bp-d5ddc241-d5b5-4262-81b4-19e782140915 bp-36d323d9-d035-41f9-9254-04c4f2140915 bp-6a069cac-0692-41a0-bbf4-001ff2140915 bp-70f3ff00-1882-4b73-b8ee-626b22140915 bp-e54d16d9-33f4-4b6f-a81a-681e72140916 some of the users with NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool) eg bp-f4bd6a0b-7b3a-4211-adf9-a9f372140828 are also crashing in version 31 with nsProxyReleaseEvent::Run() eg bp-c17cdb9c-8167-49d9-915b-b79a72140828 Do we need a new bug report for what's happening/regressed in 31.1.1?
Flags: needinfo?(irving)
(In reply to Wayne Mery (:wsmwk) from comment #16) > some of the users with > NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool) eg > bp-f4bd6a0b-7b3a-4211-adf9-a9f372140828 > > are also crashing in version 31 with > nsProxyReleaseEvent::Run() eg bp-c17cdb9c-8167-49d9-915b-b79a72140828 > > Do we need a new bug report for what's happening/regressed in 31.1.1? I don't think so. Both of the crash addresses are an address of already freed region in jemalloc. So I am guessing the reason of those crashes are the same. I have no idea how did it happen though.
The object that is being release prematurely is the nsInterfaceRequestorAgg object being created in nsPop3Protocol here as "ir": nsCOMPtr<nsIInterfaceRequestor> aggregrateIR; MsgNewInterfaceRequestorAggregation(notificationCallbacks, ir, getter_AddRefs(aggregrateIR)); ir = aggregrateIR; That ir is passed to the network socket, and its local reference discarded, here: rv = OpenNetworkSocketWithInfo(hostName.get(), port, connectionType, proxyInfo, ir); (which calls strans->SetSecurityCallbacks(callbacks); with callbacks = ir) We are now relying on the socket transport to hold onto the reference. Something seems to be going wrong with that release and destructor, so that the members of nsInterfaceRequestorAgg point to released memory. I can't figure out what, but one possible approach would be to add an additional reference to that object in the nsPOP3Protocol object. I'll show a patch that shows this.
Attached patch 902158.patchSplinter Review
What do you think, Irving? Since we don't really understand the root cause this may just move the error, but it also is pretty low risk.
Attachment #8498466 - Flags: review?(irving)
I think rkent is right. I could imagine a possible crash scenario. 507 nsCOMPtr<nsIInterfaceRequestor> ir; 508 if (m_socketType != nsMsgSocketType::plain) 509 { 510 nsCOMPtr<nsIMsgWindow> msgwin; 511 mailnewsUrl->GetMsgWindow(getter_AddRefs(msgwin)); 512 if (!msgwin) 513 GetTopmostMsgWindow(getter_AddRefs(msgwin)); 514 if (msgwin) 515 { 516 nsCOMPtr<nsIDocShell> docshell; 517 msgwin->GetRootDocShell(getter_AddRefs(docshell)); 518 ir = do_QueryInterface(docshell); 519 nsCOMPtr<nsIInterfaceRequestor> notificationCallbacks; 520 msgwin->GetNotificationCallbacks(getter_AddRefs(notificationCallbacks)); 521 if (notificationCallbacks) 522 { 523 nsCOMPtr<nsIInterfaceRequestor> aggregrateIR; 524 MsgNewInterfaceRequestorAggregation(notificationCallbacks, ir, getter_AddRefs(aggregrateIR)); ir is now held by aggregrateIR, 525 ir = aggregrateIR; ir is dereferenced. Now ir is held only by aggregrateIR as mSecond in nsInterfaceRequestorAgg. 526 } 527 } 528 } 529 530 int32_t port = 0; 531 nsCString hostName; 532 aURL->GetPort(&port); 533 nsCOMPtr<nsIMsgIncomingServer> server = do_QueryInterface(m_pop3Server); 534 if (server) 535 server->GetRealHostName(hostName); 536 537 nsCOMPtr<nsIProxyInfo> proxyInfo; 538 rv = MsgExamineForProxy("pop", hostName.get(), port, getter_AddRefs(proxyInfo)); 539 if (NS_FAILED(rv)) proxyInfo = nullptr; 540 541 const char *connectionType = nullptr; 542 if (m_socketType == nsMsgSocketType::SSL) 543 connectionType = "ssl"; 544 else if (m_socketType == nsMsgSocketType::trySTARTTLS || 545 m_socketType == nsMsgSocketType::alwaysSTARTTLS) 546 connectionType = "starttls"; 547 548 rv = OpenNetworkSocketWithInfo(hostName.get(), port, connectionType, proxyInfo, ir); 549 if (NS_FAILED(rv) && m_socketType == nsMsgSocketType::trySTARTTLS) 550 { 551 m_socketType = nsMsgSocketType::plain; 552 rv = OpenNetworkSocketWithInfo(hostName.get(), port, nullptr, proxyInfo, ir); 553 } 554 555 if(NS_FAILED(rv)) 556 return rv; If this error occurs here, aggregrateIR is dereferenced. Then, 55 nsInterfaceRequestorAgg::~nsInterfaceRequestorAgg() 56 { 57 nsIInterfaceRequestor* iir = nullptr; 58 mFirst.swap(iir); 59 if (iir) { 60 NS_ProxyRelease(mConsumerTarget, iir); 61 } 62 iir = nullptr; 63 mSecond.swap(iir); iir does not hold the reference here because iir is just a pointer. Who is in charge of the reference of the original ir? So I am guessing the address of iir is now 0x5a5a5a5a. But I can't understand why the crash line is nsProxyRelease.cpp#l45? 45 rv = target->IsOnCurrentThread(&onCurrentThread); target, mConsumerTarget in nsInterfaceRequestorAgg, is also dereferenced...
Hiro, thanks for the detailed analysis. (In reply to Hiroyuki Ikezoe (:hiro) from comment #20) > I think rkent is right. I could imagine a possible crash scenario. > ... > If this error occurs here, aggregrateIR is dereferenced. Then, > > 55 nsInterfaceRequestorAgg::~nsInterfaceRequestorAgg() > 56 { > 57 nsIInterfaceRequestor* iir = nullptr; > 58 mFirst.swap(iir); > 59 if (iir) { > 60 NS_ProxyRelease(mConsumerTarget, iir); > 61 } > 62 iir = nullptr; > 63 mSecond.swap(iir); > > iir does not hold the reference here because iir is just a pointer. > Who is in charge of the reference of the original ir? > So I am guessing the address of iir is now 0x5a5a5a5a. nsCOMPtr<>.swap() doesn't decrement the reference count of the object (see http://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsCOMPtr.h#775), so 'iir' should point to an object that still has refcount == 1; NS_ProxyRelease decrements the refcount. > But I can't understand why the crash line is nsProxyRelease.cpp#l45? > > 45 rv = target->IsOnCurrentThread(&onCurrentThread); > > target, mConsumerTarget in nsInterfaceRequestorAgg, is also dereferenced... The crash signature makes it look like the whole nsInterfaceRequestorAgg object is being freed and nulled out; I thin the the pointer stored in mConsumerTarget is 0x5a..., not the memory pointed to by mConsumerTarget. That makes it look like 'this' (the nsInterfaceRequestorAgg) is being freed and overwritten with 0xA5.. between line 58 and line 65 of ~nsInterfaceRequestorAgg(). That suggests to me that we're double-freeing the nsInterfaceRequestorAgg either because of a thread race, or recursively when we free mFirst. The constructor is public, which makes me nervous that somebody could be creating an nsInterfaceRequestorAgg on the stack unless there's some other mechanism preventing that. However, a DXR search doesn't find any suspicious uses.
Assignee: nobody → kent
Flags: needinfo?(irving)
OK, now I hate the MOZILLA_INTERNAL_API split even more. Here's another possibility: when we create these interface aggregators in mailnews, we call msgNewInterfaceAggregation(), which is defined in nsMsgUtils.h with different definitions #ifdef MOZILLA_INTERNAL_API: true at http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.h#337 false at http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.h#454 The external API version of the factory calls through http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.cpp#1992 to http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.cpp#1967, which constructs a version of the object that's not compatible with the one from xpcom/base/nsInterfaceRequestorAgg.cpp. The 0x5a5a5a5a we're reading for mConsumerTarget is because that member is after then end of the memory allocated for the MsgInterfaceRequestorAgg object in nsMsgUtils.cpp. My *strong* preference is to burn the external API version of this definition all the way to the ground, and always use the definition in xpcom/base/nsInterfaceRequestorAgg.cpp
Flags: needinfo?(josh)
(In reply to Irving Reid from comment #22) > OK, now I hate the MOZILLA_INTERNAL_API split even more. Well, it wouldn't have been so bad if more of the internal symbols had been exposed, but there you go, it got forked at the time because it was the only way we could build against the libxul SDK. > Here's another possibility: when we create these interface aggregators in > mailnews, we call msgNewInterfaceAggregation(), which is defined in > nsMsgUtils.h with different definitions #ifdef MOZILLA_INTERNAL_API: Indeed, but all Mozilla releases are compiled with MOZILLA_INTERNAL_API, and building without it fails on trunk (although it's possible that some Linux distros shipped xulrunner versions at the time that it did build). > The 0x5a5a5a5a we're reading for mConsumerTarget And of course mConsumerTarget was added when bug 804655/bug 812203 changed the behaviour of nsInterfaceRequestorAgg, as mentioned in comment #2.
Blocks: 804655
Flags: needinfo?(ludovic)
The MOZILLA_INTERNAL_API distinction is still necessary because things are different if you're libxul and if you're not (if you're for example, a XPCOM binary component or an executable loading libxul with the xpcom standalone glue).
I was so proud of myself for figuring out Comment 22, and then I went off and thought about it more and convinced myself I was wrong. 1) If the mConsumerTarget field was 0x5a... because the object had been aliased to the wrong type and the code was accessing out of bounds, the *first* NS_ProxyRelease in line 60 would fail, but we're seeing the *second* call fail. 2) As far as I can tell, we're calling everything through virtual interfaces, so we wouldn't end up calling the nsInterfaceRequestorAgg destructor on an *msg*InterfaceRequestorAgg anyway. Now I'm coming back around to what I said in comment 21 - it seems likely to me that either the NS_ProxyRelease of mFirst is somehow recursing around and causing a double free of the nsInterfaceRequestorAgg itself, or a thread race is doing it - except that unless code somewhere in that path is deliberately manipulating the reference count outside the standard implementation macros, I can't see anything that would trigger the extra free. That said, the mailnews version of MsgInterfaceRequestorAgg could still cause us unrelated thread safety problems, since it doesn't meet xpcom's assumptions about the interface requestor object taking care of making sure things are destroyed on the correct thread.
I was racking my brains to try and figure out why it could be possible for a refcounted object to get destroyed twice, and I think I have a plausible story. In the refcount 'Release' definition http://dxr.mozilla.org/comm-central/source/mozilla/xpcom/glue/nsISupportsImpl.h#621 (the definition of NS_IMPL_RELEASE_WITH_DESTROY), after we've decided that an object's refcount has gone to 0, we change it back to 1 to protect us against double-frees in the case that some part of the destructor briefly grabs a (counted) reference. Unfortunately, that makes us vulnerable to a double-free if someone has an extra reference to that object and only *decrements* the reference count. I don't know of a way for that to happen without there being a mistake in reference counting somewhere in the code (otherwise the first release wouldn't have changed the count to 0). I haven't traced through code to see when and where that might happen, though based on comment 25 I'd guess the offending reference is held by whatever is in the mFirst field of the nsInterfaceRequestorAgg. At this point I really need to work on some other things; I hope someone can move forward from here...
Attached patch Possible fixSplinter Review
I hope I am getting closer to the cause of the double free. One is of course secCtrl->SetNotificationCallbacks(threadsafeCallbacks); in nsSocketTransport::SetSecurityCallbacks from nsMsgProtocol::CloseSocket(). The other is secCtrl->SetNotificationCallbacks(nullptr) at http://mxr.mozilla.org/mozilla-esr31/source/netwerk/base/src/nsSocketTransport2.cpp#1890 from nsSocketTransportService::DoPollIteration(). I don't check all crash reports but I am guessing all of reports have DoPollIteration symbols. A typical report is: https://crash-stats.mozilla.com/report/index/878f5c39-14fb-4cdb-991d-7d7652140929 The stack trace of thread 3 in this reports shows us exactly the other double free position.
Attachment #8500253 - Flags: feedback?(irving)
Attachment #8500253 - Flags: feedback?(kent)
Comment on attachment 8500253 [details] [diff] [review] Possible fix Review of attachment 8500253 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for finding the crash report evidence of the release happening on the other thread, Hiro, that's a great catch. Normally a race condition like that would happen really rarely, but the activity on the main thread is triggered by the onSocketReady calls just a few lines above where the I/O thread tries to clear the callback (http://mxr.mozilla.org/mozilla-esr31/source/netwerk/base/src/nsSocketTransport2.cpp#1882 and then http://mxr.mozilla.org/mozilla-esr31/source/netwerk/base/src/nsSocketTransport2.cpp#1890) so the two threads always try to null out the security callbacks at around the same time. Looking at how the security callbacks are handled elsewhere in mozilla-central, nsHTTPConnection only calls nsSocketTransport.SetSecurityCallbacks on the I/O thread, see http://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnection.cpp#144 and look around that file for socketTransport->SetSecurityCallbacks. It also only seems to do so when it's doing particularly tricky things. The MDN documentation for nsISocketTransport says that the 'securityCallbacks' attribute should not be changed after the socket is opened: https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsISocketTransport#Attributes. I'd like to get feedback from an NSS / Networking expert, but I wonder if the fix is to *not* clear the security callbacks in our destructor?
Attachment #8500253 - Flags: feedback?(irving) → feedback+
Looks like my presence isn't required here.
Flags: needinfo?(josh)
Comment on attachment 8500253 [details] [diff] [review] Possible fix briansmith, can you please give us some feedback or suggest a proper person to step forward to fix this bug. Please see comment #27 and comment #28.
Attachment #8500253 - Flags: feedback?(brian)
Comment on attachment 8500253 [details] [diff] [review] Possible fix Review of attachment 8500253 [details] [diff] [review]: ----------------------------------------------------------------- I'm going to let Richard handle this. See also https://bugzilla.mozilla.org/buglist.cgi?quicksearch=nsNSSSocketInfo. Usually if you need a lock in a setter, you also need the same lock in the corresponding getter.
Attachment #8500253 - Flags: feedback?(brian) → feedback?(rlb)
Would we be able to fix this bug by requiring that MsgNewInterfaceRequestorAggregation proxy releases through the main thread? Is that what nsMainThreadPtrHandle is trying to do?
Comment on attachment 8500253 [details] [diff] [review] Possible fix Review of attachment 8500253 [details] [diff] [review]: ----------------------------------------------------------------- My threading foo is not really strong enough to comment on the approach of locking versus other methods of insuring that the release does not occur on both threads. I don't believe though that we should simply " if the fix is to *not* clear the security callbacks in our destructor?" as other examples, such as nsHTTPConnection, also clear the security callbacks (at least on error).
Attachment #8500253 - Flags: feedback?(kent)
(In reply to Kent James (:rkent) from comment #32) > Would we be able to fix this bug by requiring that > MsgNewInterfaceRequestorAggregation proxy releases through the main thread? > Is that what nsMainThreadPtrHandle is trying to do? Yes, that is the purpose of nsMainThreadPtrHandle (as well as optionally preventing any dereferencing of the pointer on non-main threads).
But we can't fix the crash on NSS side. Am I right? See for example: https://crash-stats.mozilla.com/report/index/b10f8b61-ce27-4af2-8397-9b2c52140929 You mean nsMainThreadPtrHandle should be used on both threads?
To be clear, I have no idea if nsMainThreadPtrHandle is a good solution here. I was answering the second question in comment 34.
Thanks briansmith and jdm. I did not know that nsNSSSocketInfo is not thread-safe actually. For the record, I am putting bug 427948 into dependency list. Anyway I am waiting for feedbacks from Richard.
Depends on: 427948
Blocks: 783998
A little history on the nulling of securityCallbacks, as a comment to "but I wonder if the fix is to *not* clear the security callbacks in our destructor?" in comment 28. The nulling of the security callbacks in nsMsgProtocol were set in bug 176919 in 2003, but that was replacing and even earlier call that nulled the notification callbacks directly. So this is ancient, original code. Meanwhile, in 2005 bug 285991 adds the nulling of securityCallbacks in nsSocketTransport::OnSocketDetached because of an nsHTTPS leak. So one possible interpretation is that the nulling of the securityCallbacks in the msg destructor was breaking the cycle, that subsequently was broken at a different place, hence negating then need for that nulling as Irving suggests.
This is another possible fix what we can do only in comm-central without any modifications of mozilla-central. Luckily nsSocketTransport::SetSecurityCallbacks uses NS_GetCurrentThread() instead of NS_GetMainThread() so we can call the method off the main thread. Josh, what do you think about this?
Attachment #8502913 - Flags: feedback?(josh)
The previous patch contains a horrible mistake passed always nullptr to SetSecurityCallbacks().
Attachment #8502913 - Attachment is obsolete: true
Attachment #8502913 - Flags: feedback?(josh)
Attachment #8502932 - Flags: feedback?(josh)
Comment on attachment 8502932 [details] [diff] [review] SetSecurityCallbacks on socket thread v2 Clearing feedback? flag. The patch causes an issue that exception dialog is not open.
Attachment #8502932 - Flags: feedback?(josh)
Attached patch Use nsMainThreadPtrHolder (obsolete) — Splinter Review
I did totally mistake. We can protect us from the crash if we use nsMainThreadPtrHolder in comm-central because the callbacks, which is passing to SetSecurityCallbacks, is only created in our side.
Attachment #8502932 - Attachment is obsolete: true
Attachment #8502968 - Flags: review?(kent)
Next week Hiro, rkent, and Irving will all be in Toronto together. At that point we will review this bug and decide the best course of action. Currently I am leaning to simply removing the offending call to SetSecurityCallbacks in CloseSocket, as I believe that it duplicates a call that was added later in necko code.
(In reply to Kent James (:rkent) from comment #43) > Next week Hiro, rkent, and Irving will all be in Toronto together. At that > point we will review this bug and decide the best course of action. > Currently I am leaning to simply removing the offending call to > SetSecurityCallbacks in CloseSocket, as I believe that it duplicates a call > that was added later in necko code. Perhaps the best approach is create a try build without it, and we can look for a memory leak, (If I understand the background I am reading here). I do not see the crash, but a memory leak should become apparent here in a couple of days if there were one.
(In reply to Matt from comment #44) > (In reply to Kent James (:rkent) from comment #43) > > Next week Hiro, rkent, and Irving will all be in Toronto together. At that > > point we will review this bug and decide the best course of action. > > Currently I am leaning to simply removing the offending call to > > SetSecurityCallbacks in CloseSocket, as I believe that it duplicates a call > > that was added later in necko code. > > Perhaps the best approach is create a try build without it, and we can look > for a memory leak, (If I understand the background I am reading here). I do > not see the crash, but a memory leak should become apparent here in a > couple of days if there were one. I don't think we can check all possible cases of nulling out the security callbacks because we can not reproduce the crash case yet.
(In reply to Hiroyuki Ikezoe (:hiro) from comment #45) > I don't think we can check all possible cases of nulling out the security > callbacks because we can not reproduce the crash case yet. Here is another user with plenty of these crashes. https://support.mozilla.org/en-US/questions/1024501 Example: bp-953500eb-82d1-4c9d-b03f-293ae2141010
Comment on attachment 8498466 [details] [diff] [review] 902158.patch Review of attachment 8498466 [details] [diff] [review]: ----------------------------------------------------------------- Looks like we've settled on a different approach.
Attachment #8498466 - Flags: review?(irving)
Comment on attachment 8500253 [details] [diff] [review] Possible fix Review of attachment 8500253 [details] [diff] [review]: ----------------------------------------------------------------- It's good general practice to specify what resource the lock is controlling access to. So a more descriptive name than "mLock" would be helpful. (mCallbackLock?) Then you need to acquire the lock before you do anything with the protected resource. It seems like the resource in this case is mCallbacks, so you should lock at least in GetNotificationCallbacks.
Attachment #8500253 - Flags: feedback?(rlb)
The fix we agreed on at the TB summit. Not well tested because of the broken tree.
Attachment #8507116 - Flags: review?(kent)
Attachment #8507116 - Flags: review?(hiikezoe)
Attachment #8502968 - Attachment is obsolete: true
Attachment #8502968 - Flags: review?(kent)
Attachment #8507116 - Flags: review?(hiikezoe) → review+
Attachment #8507116 - Flags: review?(kent) → review+
Given Nightlies are broken at the moment, is their a test version that those affected could try out.
https://hg.mozilla.org/comm-central/rev/1e2c95e9daf3 We'll either need to sort out the rest of the Nightly build issues or uplift this to Aurora to get a version people can test.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment on attachment 8507116 [details] [diff] [review] Don't null the socket security callbacks from the main thread [Triage Comment] Lets get this onto aurora and beta now - that'll give us a wider audience. Then we might be able to get onto esr in time for the next release.
Attachment #8507116 - Flags: approval-comm-esr31?
Attachment #8507116 - Flags: approval-comm-beta+
Attachment #8507116 - Flags: approval-comm-aurora+
as is typical*, the only way to know whether we have a good fix for this crash is A. have it ship in release, or B. directly query users who are frequently crashing. We aren't doing A. and B. will require a week of feedback from those users. I've emailed about a dozen users and expect roughly half will reply within several days, so I will report results by end of next week. * despite this being a topcrash, the crash rate (5 per week) is not sufficient in auora or nightly to get a statistically significant result. But if we were shipping a new beta today (which we currently are not) then we'd know within a 2-3 days, because the beta crash rate of 500 per week is sufficient to get quick statistical feedback.
Flags: needinfo?(vseerror)
good results - all (early) reports from about 12 users I have testing the 36.0a1 nightly build are positive - so far no crashes and no ill effects. But as expected, general statistics out of crash-stats have not been helpful. I'd prefer to see bigger testing via beta before shipping this in an ESR release, but that won't happen unless we build a new one before Nov 24. But *if* we deem the patch is low risk then I guess it doesn't matter much if it gets tested in beta.
Flags: needinfo?(vseerror)
Comment on attachment 8507116 [details] [diff] [review] Don't null the socket security callbacks from the main thread At this stage, I think we should just go for it.
Attachment #8507116 - Flags: approval-comm-esr31? → approval-comm-esr31+
Thanks irving, hiro, kent, briansmith, jdm ... v.fixed in TB31.3.0 per https://crash-stats.mozilla.com/query/?product=Thunderbird&version=Thunderbird%3A31.3.0&range_value=1&range_unit=weeks&date=12%2F03%2F2014+14%3A00%3A00&query_search=signature&query_type=contains&query=&reason=&release_channels=&build_id=&process_type=any&hang_type=any So this 10x rate crash signature is gone. But it masked to some extent bug 902158. So signature nsInterfaceRequestorAgg::GetInterface(nsID const&, void**) returns to its glory as #1 crasher, as it is in version 24 - at a rate of 2x compared to the #2 crash signature OOM | small.
Status: RESOLVED → VERIFIED
See Also: → 907096
(In reply to Wayne Mery (:wsmwk) from comment #58) > Thanks irving, hiro, kent, briansmith, jdm ... > > v.fixed in TB31.3.0 per > https://crash-stats.mozilla.com/query/ > ?product=Thunderbird&version=Thunderbird%3A31.3. > 0&range_value=1&range_unit=weeks&date=12%2F03%2F2014+14%3A00%3A00&query_searc > h=signature&query_type=contains&query=&reason=&release_channels=&build_id=&pr > ocess_type=any&hang_type=any > > So this 10x rate crash signature is gone. But it masked to some extent bug > 902158. So signature nsInterfaceRequestorAgg::GetInterface(nsID const&, > void**) returns to its glory as #1 crasher, as it is in version 24 - at a > rate of 2x compared to the #2 crash signature OOM | small. I think I meant to say, "masked by bug 907096".
Blocks: 955947
Target Milestone: --- → Thunderbird 36.0
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: