Closed
Bug 902158
Opened 12 years ago
Closed 10 years ago
crash in nsInterfaceRequestorAgg::Release / NS_ProxyRelease at 0x5a5a5a5a
Categories
(MailNews Core :: Networking, defect)
Tracking
(firefox23 affected, firefox24+ affected, firefox25 affected, thunderbird34 fixed, thunderbird35 fixed, thunderbird_esr3134+ fixed)
People
(Reporter: scoobidiver, Assigned: rkent)
References
Details
(4 keywords, Whiteboard: [tbird betatopcrash][[tbird topcrash])
Crash Data
Attachments
(3 files, 3 obsolete files)
4.25 KB,
patch
|
Details | Diff | Splinter Review | |
1.93 KB,
patch
|
Irving
:
feedback+
|
Details | Diff | Splinter Review |
1.00 KB,
patch
|
rkent
:
review+
hiro
:
review+
standard8
:
approval-comm-aurora+
standard8
:
approval-comm-beta+
standard8
:
approval-comm-esr31+
|
Details | Diff | Splinter Review |
It's #2 top crasher in TB 23.0b1.
There are mainly two kinds of stack traces:
Frame Module Signature Source
0 mozglue.dll moz_abort memory/build/jemalloc_config.c
1 mozglue.dll arena_run_reg_dalloc memory/mozjemalloc/jemalloc.c
2 mozglue.dll arena_dalloc_small memory/mozjemalloc/jemalloc.c
3 mozglue.dll arena_dalloc memory/mozjemalloc/jemalloc.c
4 mozglue.dll je_free memory/mozjemalloc/jemalloc.c
5 xul.dll nsInterfaceRequestorAgg::Release() xpcom/base/nsInterfaceRequestorAgg.cpp
6 xul.dll nsCOMPtr_base::assign_with_AddRef(nsISupports *) objdir-tb/mozilla/xpcom/build/nsCOMPtr.cpp
7 xul.dll nsNSSSocketInfo::SetNotificationCallbacks(nsIInterfaceRequestor *) security/manager/ssl/src/nsNSSIOLayer.cpp
8 xul.dll nsSocketTransport::OnSocketDetached(PRFileDesc *) netwerk/base/src/nsSocketTransport2.cpp
9 xul.dll nsSocketTransportService::DetachSocket(nsSocketTransportService::SocketContext *,nsSocketTransportService::SocketContext *) netwerk/base/src/nsSocketTransportService2.cpp
10 xul.dll nsSocketTransportService::DoPollIteration(bool) netwerk/base/src/nsSocketTransportService2.cpp
11 xul.dll nsSocketTransportService::Run() netwerk/base/src/nsSocketTransportService2.cpp
12 xul.dll nsThread::ProcessNextEvent(bool,bool *) xpcom/threads/nsThread.cpp
13 xul.dll NS_ProcessNextEvent(nsIThread *,bool) objdir-tb/mozilla/xpcom/build/nsThreadUtils.cpp
14 xul.dll nsThread::ThreadFunc(void *) xpcom/threads/nsThread.cpp
15 nspr4.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c
16 nspr4.dll pr_root nsprpub/pr/src/md/windows/w95thred.c
17 msvcr100.dll _callthreadstartex f:\\dd\\vctools\\crt_bld\\self_x86\\crt\\src\\threadex.c
18 msvcr100.dll _threadstartex f:\\dd\\vctools\\crt_bld\\self_x86\\crt\\src\\threadex.c
19 kernel32.dll BaseThreadInitThunk
20 ntdll.dll __RtlUserThreadStart
21 ntdll.dll _RtlUserThreadStart
Frame Module Signature Source
0 mozglue.dll moz_abort memory/mozjemalloc/jemalloc.c
1 mozglue.dll arena_run_reg_dalloc memory/mozjemalloc/jemalloc.c
2 mozglue.dll arena_dalloc_small memory/mozjemalloc/jemalloc.c
3 mozglue.dll arena_dalloc memory/mozjemalloc/jemalloc.c
4 mozglue.dll je_free memory/mozjemalloc/jemalloc.c
5 xul.dll nsInterfaceRequestorAgg::Release() xpcom/base/nsInterfaceRequestorAgg.cpp
6 xul.dll nsCOMPtr_base::assign_with_AddRef(nsISupports *) objdir-tb/mozilla/xpcom/build/nsCOMPtr.cpp
7 xul.dll nsNSSSocketInfo::SetNotificationCallbacks(nsIInterfaceRequestor *) security/manager/ssl/src/nsNSSIOLayer.cpp
8 xul.dll nsSocketTransport::SetSecurityCallbacks(nsIInterfaceRequestor *) netwerk/base/src/nsSocketTransport2.cpp
9 xul.dll nsMsgProtocol::CloseSocket() mailnews/base/util/nsMsgProtocol.cpp
10 xul.dll nsPop3Protocol::ProcessProtocolState(nsIURI *,nsIInputStream *,unsigned __int64,unsigned int) mailnews/local/src/nsPop3Protocol.cpp
11 xul.dll nsMsgProtocol::OnDataAvailable(nsIRequest *,nsISupports *,nsIInputStream *,unsigned __int64,unsigned int) mailnews/base/util/nsMsgProtocol.cpp
12 xul.dll nsInputStreamPump::OnStateTransfer() netwerk/base/src/nsInputStreamPump.cpp
13 xul.dll nsInputStreamPump::OnInputStreamReady(nsIAsyncInputStream *) netwerk/base/src/nsInputStreamPump.cpp
14 xul.dll nsInputStreamReadyEvent::Run() xpcom/io/nsStreamUtils.cpp
15 xul.dll nsThread::ProcessNextEvent(bool,bool *) xpcom/threads/nsThread.cpp
16 xul.dll NS_ProcessNextEvent(nsIThread *,bool) objdir-tb/mozilla/xpcom/build/nsThreadUtils.cpp
17 xul.dll mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) ipc/glue/MessagePump.cpp
18 xul.dll MessageLoop::RunHandler() ipc/chromium/src/base/message_loop.cc
19 xul.dll MessageLoop::Run() ipc/chromium/src/base/message_loop.cc
20 xul.dll nsBaseAppShell::Run() widget/xpwidgets/nsBaseAppShell.cpp
21 xul.dll nsAppShell::Run() widget/windows/nsAppShell.cpp
22 xul.dll nsAppStartup::Run() toolkit/components/startup/nsAppStartup.cpp
23 xul.dll XREMain::XRE_mainRun() toolkit/xre/nsAppRunner.cpp
24 xul.dll XREMain::XRE_main(int,char * * const,nsXREAppData const *) toolkit/xre/nsAppRunner.cpp
25 xul.dll XRE_main toolkit/xre/nsAppRunner.cpp
26 thunderbird.exe do_main mail/app/nsMailApp.cpp
27 thunderbird.exe NS_internal_main(int,char * *) mail/app/nsMailApp.cpp
28 thunderbird.exe wmain toolkit/xre/nsWindowsWMain.cpp
29 thunderbird.exe __tmainCRTStartup f:/dd/vctools/crt_bld/self_x86/crt/src/crtexe.c
30 kernel32.dll BaseThreadInitThunk
31 ntdll.dll __RtlUserThreadStart
32 ntdll.dll _RtlUserThreadStart
More reports at:
https://crash-stats.mozilla.com/report/list?product=Thunderbird&signature=moz_abort+|+arena_run_reg_dalloc+|+arena_dalloc_small+|+arena_dalloc+|+je_free+|+nsInterfaceRequestorAgg%3A%3ARelease%28%29
https://crash-stats.mozilla.com/report/list?product=Thunderbird&signature=jemalloc_crash+|+arena_dalloc+|+arena_run_reg_dalloc+|+nsInterfaceRequestorAgg%3A%3ARelease%28%29
Reporter | ||
Updated•12 years ago
|
tracking-firefox24:
--- → ?
Comment 1•12 years ago
|
||
Is this a regression from a Firefox landing ?
Updated•12 years ago
|
Keywords: regressionwindow-wanted
Reporter | ||
Comment 2•12 years ago
|
||
(In reply to bhavana bajaj [:bajaj] (on vacation until 08/14/2013) from comment #1)
> Is this a regression from a Firefox landing ?
TB 17.0 is unaffected so it's regression. In addition, the first stack trace has no TB frame so it might be a Networking regression.
The second stack trace (with TB frames) first showed up in 19.0a1/20121119. The regression range might be (this kind of signature can morph easily and no crash stats are available from the advanced search in 2012):
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=87703fb491e0&tochange=be3a0b4edebd
http://hg.mozilla.org/comm-central/pushloghtml?fromchange=1824e8daac72&tochange=7a8e57d2fbc2
It might be a regression from bug 812203.
Keywords: regressionwindow-wanted → regression
Version: Trunk → 19 Branch
Reporter | ||
Updated•12 years ago
|
Whiteboard: [tbird topcrash] → [tbird betatopcrash]
Comment 3•12 years ago
|
||
Tracking and needinfo'ing mark to have this on his radar.
Comment 5•12 years ago
|
||
stack is along is similar to bug 692981 / bug 673438
Comment 6•12 years ago
|
||
user geo frequently crashes using current daily
example bp-4b298a1d-64bf-4841-a35f-159ba2130811
Comment 7•11 years ago
|
||
Thanks!
Comment 8•11 years ago
|
||
topcrash is being replaced by more precise keywords per https://bugzilla.mozilla.org/show_bug.cgi?id=927557#c3
Updated•11 years ago
|
Keywords: topcrash-win
Comment 9•11 years ago
|
||
Odd...why does this signature not appear in TB24 release? [1]
sig appears for alpha and beta 19-25. #1 crash for TB25.0b1, just as it was topcrash for prior betas
Also, is this really an NSS crash?
Matt's crash with TB27.0a1 bp-bf69d93f-0c77-43b8-86e7-aa4dd2131025
perhaps an odd manifestation of bug 673438? [@ nsMsgProtocol::CloseSocket()] | [@ nsPop3Protocol::ProcessProtocolState]
[1] <12 crashes in 1 week of 24.x releases for any sig containing nsInterfaceRequestorAgg::Release - four are https://crash-stats.mozilla.com/report/list?signature=nsInterfaceRequestorAgg%3A%3ARelease%28%29&product=Thunderbird&query_type=contains&range_unit=weeks&process_type=any&version=Thunderbird%3A24.1.0&version=Thunderbird%3A24.0.1&version=Thunderbird%3A24.0&hang_type=any&date=2013-10-31+15%3A00%3A00&range_value=1#reports
Flags: needinfo?(brian)
Whiteboard: [tbird betatopcrash] → [tbird betatopcrash][dev-only]
Comment 10•11 years ago
|
||
Not really sure what I can do here without a regression range or STR.
Flags: needinfo?(brian)
Comment 11•11 years ago
|
||
NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool) is perhaps related https://crash-stats.mozilla.com/report/list?signature=NS_ProxyRelease%28nsIEventTarget*%2C+nsISupports*%2C+bool%29&product=Thunderbird&query_type=is_exactly&range_unit=weeks&process_type=any&hang_type=any&date=2014-04-22+17%3A00%3A00&range_value=4#tab-sigsummary
TB31.0a1 bp-870f8b8c-b384-4217-9dc6-f6a332140419
Crash Signature: [@ moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfaceRequestorAgg::Release()]
[@ jemalloc_crash | arena_dalloc | arena_run_reg_dalloc | nsInterfaceRequestorAgg::Release() ] → [@ moz_abort | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | je_free | nsInterfaceRequestorAgg::Release()]
[@ jemalloc_crash | arena_dalloc | arena_run_reg_dalloc | nsInterfaceRequestorAgg::Release() ]
[@ NS_ProxyRelease(nsIEventTarget*, ns…
Comment 12•10 years ago
|
||
#1 crash for TB31, now that version 31 is out for a month.
22% of all versoin 31 crashes
50% are startup
crashes with comment and emai address - https://crash-stats.mozilla.com/search/?signature=~NS_ProxyRelease%28nsIEventTarget*%2C+nsISupports*%2C+bool%29&email=~%40&email=!~.de&email=!~artistress%40gmail.com&email=!~.pl&user_comments=~a&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform
Summary: crash in nsInterfaceRequestorAgg::Release → crash in nsInterfaceRequestorAgg::Release / NS_ProxyRelease at 0x5a5a5a5a
Whiteboard: [tbird betatopcrash][dev-only] → [tbird betatopcrash]
Comment 13•10 years ago
|
||
Should we take this back to pop component?
at 25% of crashes I don't think we can update users to version 31.x automatically until this gets resolved - perhaps it deserves a different bug since behavior/crash rate is different in version 31 compared to version 24?
v31 crashes I sampled are all via nsPop3Protocol.
For example bp-fc740d5a-8d45-4743-add4-0df322140828
0 xul.dll NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool) xpcom/glue/nsProxyRelease.cpp
1 xul.dll nsInterfaceRequestorAgg::~nsInterfaceRequestorAgg() xpcom/base/nsInterfaceRequestorAgg.cpp
2 xul.dll nsInterfaceRequestorAgg::Release() xpcom/base/nsInterfaceRequestorAgg.cpp
3 xul.dll nsCOMPtr_base::assign_with_AddRef(nsISupports*) xpcom/glue/nsCOMPtr.cpp
4 xul.dll nsNSSSocketInfo::SetNotificationCallbacks(nsIInterfaceRequestor*) security/manager/ssl/src/nsNSSIOLayer.cpp
5 xul.dll nsSocketTransport::SetSecurityCallbacks(nsIInterfaceRequestor*) netwerk/base/src/nsSocketTransport2.cpp
6 xul.dll nsMsgProtocol::CloseSocket() mailnews/base/util/nsMsgProtocol.cpp
7 xul.dll nsPop3Protocol::ProcessProtocolState(nsIURI*, nsIInputStream*, unsigned __int64, unsigned int) mailnews/local/src/nsPop3Protocol.cpp
8 xul.dll nsMsgProtocol::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned __int64, unsigned int) mailnews/base/util/nsMsgProtocol.cpp
Flags: needinfo?(ludovic)
Whiteboard: [tbird betatopcrash] → [tbird betatopcrash][[tbird topcrash]
Comment 14•10 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #13)
> Should we take this back to pop component?
>
> at 25% of crashes I don't think we can update users to version 31.x
> automatically until this gets resolved - perhaps it deserves a different bug
> since behavior/crash rate is different in version 31 compared to version 24?
>
It' still when we ask infor to neck that we crash. What would be nice would be to find which server makes us crash.
Flags: needinfo?(ludovic)
Comment 15•10 years ago
|
||
For some users seeing NS_ProxyRelease in version 31 seems to be a continuation of version 24 OOM | Small aka bug 1028720. (And yet some OOM | Small users have completely stopped crashing in version 31)
(In reply to Ludovic Hirlimann [:Usul] from comment #14)
> (In reply to Wayne Mery (:wsmwk) from comment #13)
> > Should we take this back to pop component?
> >
> > at 25% of crashes I don't think we can update users to version 31.x
> > automatically until this gets resolved - perhaps it deserves a different bug
> > since behavior/crash rate is different in version 31 compared to version 24?
> >
>
> It' still when we ask infor to neck that we crash.
what?
> What would be nice would be to find which server makes us crash.
I'm not so sure there is a "server testcase" - are you thinking there are specific servers? Although it's all pop, so far according to users it seems not to be predictable.
Comment 16•10 years ago
|
||
in 31.1.1 NS_ProxyRelease is #1 at 40% of crashes
Unclear if user n8kpl is a representative example, but he/she started crashing in version 31.1.1
bp-d5ddc241-d5b5-4262-81b4-19e782140915
bp-36d323d9-d035-41f9-9254-04c4f2140915
bp-6a069cac-0692-41a0-bbf4-001ff2140915
bp-70f3ff00-1882-4b73-b8ee-626b22140915
bp-e54d16d9-33f4-4b6f-a81a-681e72140916
some of the users with
NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool) eg bp-f4bd6a0b-7b3a-4211-adf9-a9f372140828
are also crashing in version 31 with
nsProxyReleaseEvent::Run() eg bp-c17cdb9c-8167-49d9-915b-b79a72140828
Do we need a new bug report for what's happening/regressed in 31.1.1?
tracking-thunderbird_esr31:
--- → ?
Flags: needinfo?(irving)
Comment 17•10 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #16)
> some of the users with
> NS_ProxyRelease(nsIEventTarget*, nsISupports*, bool) eg
> bp-f4bd6a0b-7b3a-4211-adf9-a9f372140828
>
> are also crashing in version 31 with
> nsProxyReleaseEvent::Run() eg bp-c17cdb9c-8167-49d9-915b-b79a72140828
>
> Do we need a new bug report for what's happening/regressed in 31.1.1?
I don't think so. Both of the crash addresses are an address of already freed region in jemalloc. So I am guessing the reason of those crashes are the same. I have no idea how did it happen though.
Assignee | ||
Comment 18•10 years ago
|
||
The object that is being release prematurely is the nsInterfaceRequestorAgg object being created in nsPop3Protocol here as "ir":
nsCOMPtr<nsIInterfaceRequestor> aggregrateIR;
MsgNewInterfaceRequestorAggregation(notificationCallbacks, ir, getter_AddRefs(aggregrateIR));
ir = aggregrateIR;
That ir is passed to the network socket, and its local reference discarded, here:
rv = OpenNetworkSocketWithInfo(hostName.get(), port, connectionType, proxyInfo, ir);
(which calls strans->SetSecurityCallbacks(callbacks); with callbacks = ir)
We are now relying on the socket transport to hold onto the reference. Something seems to be going wrong with that release and destructor, so that the members of nsInterfaceRequestorAgg point to released memory. I can't figure out what, but one possible approach would be to add an additional reference to that object in the nsPOP3Protocol object.
I'll show a patch that shows this.
Assignee | ||
Comment 19•10 years ago
|
||
What do you think, Irving? Since we don't really understand the root cause this may just move the error, but it also is pretty low risk.
Attachment #8498466 -
Flags: review?(irving)
Comment 20•10 years ago
|
||
I think rkent is right. I could imagine a possible crash scenario.
507 nsCOMPtr<nsIInterfaceRequestor> ir;
508 if (m_socketType != nsMsgSocketType::plain)
509 {
510 nsCOMPtr<nsIMsgWindow> msgwin;
511 mailnewsUrl->GetMsgWindow(getter_AddRefs(msgwin));
512 if (!msgwin)
513 GetTopmostMsgWindow(getter_AddRefs(msgwin));
514 if (msgwin)
515 {
516 nsCOMPtr<nsIDocShell> docshell;
517 msgwin->GetRootDocShell(getter_AddRefs(docshell));
518 ir = do_QueryInterface(docshell);
519 nsCOMPtr<nsIInterfaceRequestor> notificationCallbacks;
520 msgwin->GetNotificationCallbacks(getter_AddRefs(notificationCallbacks));
521 if (notificationCallbacks)
522 {
523 nsCOMPtr<nsIInterfaceRequestor> aggregrateIR;
524 MsgNewInterfaceRequestorAggregation(notificationCallbacks, ir, getter_AddRefs(aggregrateIR));
ir is now held by aggregrateIR,
525 ir = aggregrateIR;
ir is dereferenced. Now ir is held only by aggregrateIR as mSecond in nsInterfaceRequestorAgg.
526 }
527 }
528 }
529
530 int32_t port = 0;
531 nsCString hostName;
532 aURL->GetPort(&port);
533 nsCOMPtr<nsIMsgIncomingServer> server = do_QueryInterface(m_pop3Server);
534 if (server)
535 server->GetRealHostName(hostName);
536
537 nsCOMPtr<nsIProxyInfo> proxyInfo;
538 rv = MsgExamineForProxy("pop", hostName.get(), port, getter_AddRefs(proxyInfo));
539 if (NS_FAILED(rv)) proxyInfo = nullptr;
540
541 const char *connectionType = nullptr;
542 if (m_socketType == nsMsgSocketType::SSL)
543 connectionType = "ssl";
544 else if (m_socketType == nsMsgSocketType::trySTARTTLS ||
545 m_socketType == nsMsgSocketType::alwaysSTARTTLS)
546 connectionType = "starttls";
547
548 rv = OpenNetworkSocketWithInfo(hostName.get(), port, connectionType, proxyInfo, ir);
549 if (NS_FAILED(rv) && m_socketType == nsMsgSocketType::trySTARTTLS)
550 {
551 m_socketType = nsMsgSocketType::plain;
552 rv = OpenNetworkSocketWithInfo(hostName.get(), port, nullptr, proxyInfo, ir);
553 }
554
555 if(NS_FAILED(rv))
556 return rv;
If this error occurs here, aggregrateIR is dereferenced. Then,
55 nsInterfaceRequestorAgg::~nsInterfaceRequestorAgg()
56 {
57 nsIInterfaceRequestor* iir = nullptr;
58 mFirst.swap(iir);
59 if (iir) {
60 NS_ProxyRelease(mConsumerTarget, iir);
61 }
62 iir = nullptr;
63 mSecond.swap(iir);
iir does not hold the reference here because iir is just a pointer.
Who is in charge of the reference of the original ir?
So I am guessing the address of iir is now 0x5a5a5a5a.
But I can't understand why the crash line is nsProxyRelease.cpp#l45?
45 rv = target->IsOnCurrentThread(&onCurrentThread);
target, mConsumerTarget in nsInterfaceRequestorAgg, is also dereferenced...
Comment 21•10 years ago
|
||
Hiro, thanks for the detailed analysis.
(In reply to Hiroyuki Ikezoe (:hiro) from comment #20)
> I think rkent is right. I could imagine a possible crash scenario.
> ...
> If this error occurs here, aggregrateIR is dereferenced. Then,
>
> 55 nsInterfaceRequestorAgg::~nsInterfaceRequestorAgg()
> 56 {
> 57 nsIInterfaceRequestor* iir = nullptr;
> 58 mFirst.swap(iir);
> 59 if (iir) {
> 60 NS_ProxyRelease(mConsumerTarget, iir);
> 61 }
> 62 iir = nullptr;
> 63 mSecond.swap(iir);
>
> iir does not hold the reference here because iir is just a pointer.
> Who is in charge of the reference of the original ir?
> So I am guessing the address of iir is now 0x5a5a5a5a.
nsCOMPtr<>.swap() doesn't decrement the reference count of the object (see http://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsCOMPtr.h#775), so 'iir' should point to an object that still has refcount == 1; NS_ProxyRelease decrements the refcount.
> But I can't understand why the crash line is nsProxyRelease.cpp#l45?
>
> 45 rv = target->IsOnCurrentThread(&onCurrentThread);
>
> target, mConsumerTarget in nsInterfaceRequestorAgg, is also dereferenced...
The crash signature makes it look like the whole nsInterfaceRequestorAgg object is being freed and nulled out; I thin the the pointer stored in mConsumerTarget is 0x5a..., not the memory pointed to by mConsumerTarget. That makes it look like 'this' (the nsInterfaceRequestorAgg) is being freed and overwritten with 0xA5.. between line 58 and line 65 of ~nsInterfaceRequestorAgg(). That suggests to me that we're double-freeing the nsInterfaceRequestorAgg either because of a thread race, or recursively when we free mFirst.
The constructor is public, which makes me nervous that somebody could be creating an nsInterfaceRequestorAgg on the stack unless there's some other mechanism preventing that. However, a DXR search doesn't find any suspicious uses.
Assignee: nobody → kent
Flags: needinfo?(irving)
Comment 22•10 years ago
|
||
OK, now I hate the MOZILLA_INTERNAL_API split even more.
Here's another possibility: when we create these interface aggregators in mailnews, we call msgNewInterfaceAggregation(), which is defined in nsMsgUtils.h with different definitions #ifdef MOZILLA_INTERNAL_API:
true at http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.h#337
false at http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.h#454
The external API version of the factory calls through http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.cpp#1992 to http://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgUtils.cpp#1967, which constructs a version of the object that's not compatible with the one from xpcom/base/nsInterfaceRequestorAgg.cpp. The 0x5a5a5a5a we're reading for mConsumerTarget is because that member is after then end of the memory allocated for the MsgInterfaceRequestorAgg object in nsMsgUtils.cpp.
My *strong* preference is to burn the external API version of this definition all the way to the ground, and always use the definition in xpcom/base/nsInterfaceRequestorAgg.cpp
Flags: needinfo?(josh)
Comment 23•10 years ago
|
||
(In reply to Irving Reid from comment #22)
> OK, now I hate the MOZILLA_INTERNAL_API split even more.
Well, it wouldn't have been so bad if more of the internal symbols had been exposed, but there you go, it got forked at the time because it was the only way we could build against the libxul SDK.
> Here's another possibility: when we create these interface aggregators in
> mailnews, we call msgNewInterfaceAggregation(), which is defined in
> nsMsgUtils.h with different definitions #ifdef MOZILLA_INTERNAL_API:
Indeed, but all Mozilla releases are compiled with MOZILLA_INTERNAL_API, and building without it fails on trunk (although it's possible that some Linux distros shipped xulrunner versions at the time that it did build).
> The 0x5a5a5a5a we're reading for mConsumerTarget
And of course mConsumerTarget was added when bug 804655/bug 812203 changed the behaviour of nsInterfaceRequestorAgg, as mentioned in comment #2.
Comment 24•10 years ago
|
||
The MOZILLA_INTERNAL_API distinction is still necessary because things are different if you're libxul and if you're not (if you're for example, a XPCOM binary component or an executable loading libxul with the xpcom standalone glue).
Comment 25•10 years ago
|
||
I was so proud of myself for figuring out Comment 22, and then I went off and thought about it more and convinced myself I was wrong.
1) If the mConsumerTarget field was 0x5a... because the object had been aliased to the wrong type and the code was accessing out of bounds, the *first* NS_ProxyRelease in line 60 would fail, but we're seeing the *second* call fail.
2) As far as I can tell, we're calling everything through virtual interfaces, so we wouldn't end up calling the nsInterfaceRequestorAgg destructor on an *msg*InterfaceRequestorAgg anyway.
Now I'm coming back around to what I said in comment 21 - it seems likely to me that either the NS_ProxyRelease of mFirst is somehow recursing around and causing a double free of the nsInterfaceRequestorAgg itself, or a thread race is doing it - except that unless code somewhere in that path is deliberately manipulating the reference count outside the standard implementation macros, I can't see anything that would trigger the extra free.
That said, the mailnews version of MsgInterfaceRequestorAgg could still cause us unrelated thread safety problems, since it doesn't meet xpcom's assumptions about the interface requestor object taking care of making sure things are destroyed on the correct thread.
Comment 26•10 years ago
|
||
I was racking my brains to try and figure out why it could be possible for a refcounted object to get destroyed twice, and I think I have a plausible story.
In the refcount 'Release' definition http://dxr.mozilla.org/comm-central/source/mozilla/xpcom/glue/nsISupportsImpl.h#621 (the definition of NS_IMPL_RELEASE_WITH_DESTROY), after we've decided that an object's refcount has gone to 0, we change it back to 1 to protect us against double-frees in the case that some part of the destructor briefly grabs a (counted) reference. Unfortunately, that makes us vulnerable to a double-free if someone has an extra reference to that object and only *decrements* the reference count. I don't know of a way for that to happen without there being a mistake in reference counting somewhere in the code (otherwise the first release wouldn't have changed the count to 0).
I haven't traced through code to see when and where that might happen, though based on comment 25 I'd guess the offending reference is held by whatever is in the mFirst field of the nsInterfaceRequestorAgg. At this point I really need to work on some other things; I hope someone can move forward from here...
Comment 27•10 years ago
|
||
I hope I am getting closer to the cause of the double free.
One is of course secCtrl->SetNotificationCallbacks(threadsafeCallbacks); in nsSocketTransport::SetSecurityCallbacks from nsMsgProtocol::CloseSocket().
The other is secCtrl->SetNotificationCallbacks(nullptr) at
http://mxr.mozilla.org/mozilla-esr31/source/netwerk/base/src/nsSocketTransport2.cpp#1890 from nsSocketTransportService::DoPollIteration().
I don't check all crash reports but I am guessing all of reports have DoPollIteration symbols.
A typical report is:
https://crash-stats.mozilla.com/report/index/878f5c39-14fb-4cdb-991d-7d7652140929
The stack trace of thread 3 in this reports shows us exactly the other double free position.
Attachment #8500253 -
Flags: feedback?(irving)
Updated•10 years ago
|
Attachment #8500253 -
Flags: feedback?(kent)
Comment 28•10 years ago
|
||
Comment on attachment 8500253 [details] [diff] [review]
Possible fix
Review of attachment 8500253 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks for finding the crash report evidence of the release happening on the other thread, Hiro, that's a great catch. Normally a race condition like that would happen really rarely, but the activity on the main thread is triggered by the onSocketReady calls just a few lines above where the I/O thread tries to clear the callback (http://mxr.mozilla.org/mozilla-esr31/source/netwerk/base/src/nsSocketTransport2.cpp#1882 and then http://mxr.mozilla.org/mozilla-esr31/source/netwerk/base/src/nsSocketTransport2.cpp#1890) so the two threads always try to null out the security callbacks at around the same time.
Looking at how the security callbacks are handled elsewhere in mozilla-central, nsHTTPConnection only calls nsSocketTransport.SetSecurityCallbacks on the I/O thread, see http://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnection.cpp#144 and look around that file for socketTransport->SetSecurityCallbacks. It also only seems to do so when it's doing particularly tricky things.
The MDN documentation for nsISocketTransport says that the 'securityCallbacks' attribute should not be changed after the socket is opened: https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsISocketTransport#Attributes. I'd like to get feedback from an NSS / Networking expert, but I wonder if the fix is to *not* clear the security callbacks in our destructor?
Attachment #8500253 -
Flags: feedback?(irving) → feedback+
Comment 30•10 years ago
|
||
Comment on attachment 8500253 [details] [diff] [review]
Possible fix
briansmith, can you please give us some feedback or suggest a proper person to step forward to fix this bug.
Please see comment #27 and comment #28.
Attachment #8500253 -
Flags: feedback?(brian)
Comment 31•10 years ago
|
||
Comment on attachment 8500253 [details] [diff] [review]
Possible fix
Review of attachment 8500253 [details] [diff] [review]:
-----------------------------------------------------------------
I'm going to let Richard handle this.
See also https://bugzilla.mozilla.org/buglist.cgi?quicksearch=nsNSSSocketInfo.
Usually if you need a lock in a setter, you also need the same lock in the corresponding getter.
Attachment #8500253 -
Flags: feedback?(brian) → feedback?(rlb)
Assignee | ||
Comment 32•10 years ago
|
||
Would we be able to fix this bug by requiring that MsgNewInterfaceRequestorAggregation proxy releases through the main thread? Is that what nsMainThreadPtrHandle is trying to do?
Assignee | ||
Comment 33•10 years ago
|
||
Comment on attachment 8500253 [details] [diff] [review]
Possible fix
Review of attachment 8500253 [details] [diff] [review]:
-----------------------------------------------------------------
My threading foo is not really strong enough to comment on the approach of locking versus other methods of insuring that the release does not occur on both threads. I don't believe though that we should simply " if the fix is to *not* clear the security callbacks in our destructor?" as other examples, such as nsHTTPConnection, also clear the security callbacks (at least on error).
Attachment #8500253 -
Flags: feedback?(kent)
Comment 34•10 years ago
|
||
(In reply to Kent James (:rkent) from comment #32)
> Would we be able to fix this bug by requiring that
> MsgNewInterfaceRequestorAggregation proxy releases through the main thread?
> Is that what nsMainThreadPtrHandle is trying to do?
Yes, that is the purpose of nsMainThreadPtrHandle (as well as optionally preventing any dereferencing of the pointer on non-main threads).
Comment 35•10 years ago
|
||
But we can't fix the crash on NSS side. Am I right?
See for example:
https://crash-stats.mozilla.com/report/index/b10f8b61-ce27-4af2-8397-9b2c52140929
You mean nsMainThreadPtrHandle should be used on both threads?
Comment 36•10 years ago
|
||
To be clear, I have no idea if nsMainThreadPtrHandle is a good solution here. I was answering the second question in comment 34.
Comment 37•10 years ago
|
||
Thanks briansmith and jdm. I did not know that nsNSSSocketInfo is not thread-safe actually.
For the record, I am putting bug 427948 into dependency list.
Anyway I am waiting for feedbacks from Richard.
Depends on: 427948
Assignee | ||
Comment 38•10 years ago
|
||
A little history on the nulling of securityCallbacks, as a comment to "but I wonder if the fix is to *not* clear the security callbacks in our destructor?" in comment 28.
The nulling of the security callbacks in nsMsgProtocol were set in bug 176919 in 2003, but that was replacing and even earlier call that nulled the notification callbacks directly. So this is ancient, original code.
Meanwhile, in 2005 bug 285991 adds the nulling of securityCallbacks in nsSocketTransport::OnSocketDetached because of an nsHTTPS leak.
So one possible interpretation is that the nulling of the securityCallbacks in the msg destructor was breaking the cycle, that subsequently was broken at a different place, hence negating then need for that nulling as Irving suggests.
Comment 39•10 years ago
|
||
This is another possible fix what we can do only in comm-central without any modifications of mozilla-central.
Luckily nsSocketTransport::SetSecurityCallbacks uses NS_GetCurrentThread() instead of NS_GetMainThread() so we can call the method off the main thread.
Josh, what do you think about this?
Attachment #8502913 -
Flags: feedback?(josh)
Comment 40•10 years ago
|
||
The previous patch contains a horrible mistake passed always nullptr to SetSecurityCallbacks().
Attachment #8502913 -
Attachment is obsolete: true
Attachment #8502913 -
Flags: feedback?(josh)
Attachment #8502932 -
Flags: feedback?(josh)
Comment 41•10 years ago
|
||
Comment on attachment 8502932 [details] [diff] [review]
SetSecurityCallbacks on socket thread v2
Clearing feedback? flag.
The patch causes an issue that exception dialog is not open.
Attachment #8502932 -
Flags: feedback?(josh)
Comment 42•10 years ago
|
||
I did totally mistake. We can protect us from the crash if we use nsMainThreadPtrHolder in comm-central because the callbacks, which is passing to SetSecurityCallbacks, is only created in our side.
Attachment #8502932 -
Attachment is obsolete: true
Attachment #8502968 -
Flags: review?(kent)
Assignee | ||
Comment 43•10 years ago
|
||
Next week Hiro, rkent, and Irving will all be in Toronto together. At that point we will review this bug and decide the best course of action. Currently I am leaning to simply removing the offending call to SetSecurityCallbacks in CloseSocket, as I believe that it duplicates a call that was added later in necko code.
Comment 44•10 years ago
|
||
(In reply to Kent James (:rkent) from comment #43)
> Next week Hiro, rkent, and Irving will all be in Toronto together. At that
> point we will review this bug and decide the best course of action.
> Currently I am leaning to simply removing the offending call to
> SetSecurityCallbacks in CloseSocket, as I believe that it duplicates a call
> that was added later in necko code.
Perhaps the best approach is create a try build without it, and we can look for a memory leak, (If I understand the background I am reading here). I do not see the crash, but a memory leak should become apparent here in a couple of days if there were one.
Comment 45•10 years ago
|
||
(In reply to Matt from comment #44)
> (In reply to Kent James (:rkent) from comment #43)
> > Next week Hiro, rkent, and Irving will all be in Toronto together. At that
> > point we will review this bug and decide the best course of action.
> > Currently I am leaning to simply removing the offending call to
> > SetSecurityCallbacks in CloseSocket, as I believe that it duplicates a call
> > that was added later in necko code.
>
> Perhaps the best approach is create a try build without it, and we can look
> for a memory leak, (If I understand the background I am reading here). I do
> not see the crash, but a memory leak should become apparent here in a
> couple of days if there were one.
I don't think we can check all possible cases of nulling out the security callbacks because we can not reproduce the crash case yet.
Comment 46•10 years ago
|
||
(In reply to Hiroyuki Ikezoe (:hiro) from comment #45)
> I don't think we can check all possible cases of nulling out the security
> callbacks because we can not reproduce the crash case yet.
Here is another user with plenty of these crashes.
https://support.mozilla.org/en-US/questions/1024501
Example: bp-953500eb-82d1-4c9d-b03f-293ae2141010
Comment 47•10 years ago
|
||
Comment on attachment 8498466 [details] [diff] [review]
902158.patch
Review of attachment 8498466 [details] [diff] [review]:
-----------------------------------------------------------------
Looks like we've settled on a different approach.
Attachment #8498466 -
Flags: review?(irving)
Comment 48•10 years ago
|
||
Comment on attachment 8500253 [details] [diff] [review]
Possible fix
Review of attachment 8500253 [details] [diff] [review]:
-----------------------------------------------------------------
It's good general practice to specify what resource the lock is controlling access to. So a more descriptive name than "mLock" would be helpful. (mCallbackLock?)
Then you need to acquire the lock before you do anything with the protected resource. It seems like the resource in this case is mCallbacks, so you should lock at least in GetNotificationCallbacks.
Attachment #8500253 -
Flags: feedback?(rlb)
Comment 49•10 years ago
|
||
The fix we agreed on at the TB summit. Not well tested because of the broken tree.
Attachment #8507116 -
Flags: review?(kent)
Attachment #8507116 -
Flags: review?(hiikezoe)
Updated•10 years ago
|
Attachment #8502968 -
Attachment is obsolete: true
Attachment #8502968 -
Flags: review?(kent)
Updated•10 years ago
|
Attachment #8507116 -
Flags: review?(hiikezoe) → review+
Assignee | ||
Updated•10 years ago
|
Attachment #8507116 -
Flags: review?(kent) → review+
Comment 50•10 years ago
|
||
Given Nightlies are broken at the moment, is their a test version that those affected could try out.
Comment 51•10 years ago
|
||
https://hg.mozilla.org/comm-central/rev/1e2c95e9daf3
We'll either need to sort out the rest of the Nightly build issues or uplift this to Aurora to get a version people can test.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 52•10 years ago
|
||
Comment on attachment 8507116 [details] [diff] [review]
Don't null the socket security callbacks from the main thread
[Triage Comment]
Lets get this onto aurora and beta now - that'll give us a wider audience. Then we might be able to get onto esr in time for the next release.
Attachment #8507116 -
Flags: approval-comm-esr31?
Attachment #8507116 -
Flags: approval-comm-beta+
Attachment #8507116 -
Flags: approval-comm-aurora+
Comment 53•10 years ago
|
||
https://hg.mozilla.org/releases/comm-aurora/rev/3d197173417b
https://hg.mozilla.org/releases/comm-beta/rev/0781ca74f44a
status-thunderbird34:
--- → fixed
status-thunderbird35:
--- → fixed
Comment 54•10 years ago
|
||
as is typical*, the only way to know whether we have a good fix for this crash is A. have it ship in release, or B. directly query users who are frequently crashing. We aren't doing A. and B. will require a week of feedback from those users. I've emailed about a dozen users and expect roughly half will reply within several days, so I will report results by end of next week.
* despite this being a topcrash, the crash rate (5 per week) is not sufficient in auora or nightly to get a statistically significant result. But if we were shipping a new beta today (which we currently are not) then we'd know within a 2-3 days, because the beta crash rate of 500 per week is sufficient to get quick statistical feedback.
Flags: needinfo?(vseerror)
Comment 55•10 years ago
|
||
good results - all (early) reports from about 12 users I have testing the 36.0a1 nightly build are positive - so far no crashes and no ill effects. But as expected, general statistics out of crash-stats have not been helpful.
I'd prefer to see bigger testing via beta before shipping this in an ESR release, but that won't happen unless we build a new one before Nov 24. But *if* we deem the patch is low risk then I guess it doesn't matter much if it gets tested in beta.
Flags: needinfo?(vseerror)
Comment 56•10 years ago
|
||
Comment on attachment 8507116 [details] [diff] [review]
Don't null the socket security callbacks from the main thread
At this stage, I think we should just go for it.
Attachment #8507116 -
Flags: approval-comm-esr31? → approval-comm-esr31+
Comment 57•10 years ago
|
||
Comment 58•10 years ago
|
||
Thanks irving, hiro, kent, briansmith, jdm ...
v.fixed in TB31.3.0 per https://crash-stats.mozilla.com/query/?product=Thunderbird&version=Thunderbird%3A31.3.0&range_value=1&range_unit=weeks&date=12%2F03%2F2014+14%3A00%3A00&query_search=signature&query_type=contains&query=&reason=&release_channels=&build_id=&process_type=any&hang_type=any
So this 10x rate crash signature is gone. But it masked to some extent bug 902158. So signature nsInterfaceRequestorAgg::GetInterface(nsID const&, void**) returns to its glory as #1 crasher, as it is in version 24 - at a rate of 2x compared to the #2 crash signature OOM | small.
Status: RESOLVED → VERIFIED
See Also: → 907096
Comment 59•10 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #58)
> Thanks irving, hiro, kent, briansmith, jdm ...
>
> v.fixed in TB31.3.0 per
> https://crash-stats.mozilla.com/query/
> ?product=Thunderbird&version=Thunderbird%3A31.3.
> 0&range_value=1&range_unit=weeks&date=12%2F03%2F2014+14%3A00%3A00&query_searc
> h=signature&query_type=contains&query=&reason=&release_channels=&build_id=&pr
> ocess_type=any&hang_type=any
>
> So this 10x rate crash signature is gone. But it masked to some extent bug
> 902158. So signature nsInterfaceRequestorAgg::GetInterface(nsID const&,
> void**) returns to its glory as #1 crasher, as it is in version 24 - at a
> rate of 2x compared to the #2 crash signature OOM | small.
I think I meant to say, "masked by bug 907096".
Updated•9 years ago
|
Keywords: regressionwindow-wanted
Updated•9 years ago
|
Target Milestone: --- → Thunderbird 36.0
You need to log in
before you can comment on or make changes to this bug.
Description
•