Crash [@ mozilla::DataChannelConnection::Close]

VERIFIED FIXED in Firefox 20

Status

()

defect
P1
critical
VERIFIED FIXED
7 years ago
6 years ago

People

(Reporter: florian, Assigned: jesup)

Tracking

({crash, reproducible, sec-moderate})

Trunk
mozilla20
x86
macOS
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite -

Firefox Tracking Flags

(firefox19 disabled, firefox20 fixed, firefox-esr17 unaffected, b2g18 unaffected)

Details

(Whiteboard: [WebRTC], [blocking-webrtc+][adv-main20-], crash signature, )

Attachments

(3 attachments, 1 obsolete attachment)

Crash Signature: [@ mozilla::DataChannelConnection::Close(unsigned short)]
Keywords: crash
Severity: normal → critical
Florian, do you have a testcase for it?
(In reply to Henrik Skupin (:whimboo) from comment #1)
> Florian, do you have a testcase for it?

Not really, but it seems fairly reproducible (it doesn't crash all the time, but I suspect running this under valgrind on linux would likely show errors all the time).

The code I was using is https://github.com/anantn/socialapi-demo/tree/312321224543ea0d249f697dd7c58708842100d9 with local changes (I'm attaching here the local changes I currently have).


I had established a call, and then typed something in the input box of the chat window. Restarting the browser (I'm using the restartless restart add-on for that, but I don't think it really matters, just quitting Firefox would probably give the same result).
The steps to reproduce are:
- use this version of the test social provider: https://github.com/anantn/socialapi-demo/tree/b7790d36f500c58a8278d1cd2f42efc6be7a411d
(reading the README.md file will definitely be required to understand how this works)

- Remove the "return false;" of the onsubmit handler (this line: https://github.com/anantn/socialapi-demo/commit/b7790d36f500c58a8278d1cd2f42efc6be7a411d#L0R178)

- Start a call

- Type something in the chat textbox (it will make the remote video disappear, that's "expected").

- Restart the browser.

In this situation, it crashes about 2 out of 3 times for me.
Duplicate of this bug: 805742
I'm not on facebook, and it appears I can't use this without a facebook account.  Can someone catch a crash in gdb and give me the dumps of the relevant objects and params on the stack (and line number!)

I there any way to use this without Facebook?
(In reply to Randell Jesup [:jesup] from comment #6)
> I'm not on facebook, and it appears I can't use this without a facebook
> account.

This demo provider is completely unrelated to Facebook. If you are seeing Facebook instead of this, it's probably that you haven't restarted your browser after tweaking the pref (the social provider pref seems to only be read at startup).
Note you need the fix from bug 805871 to test this today
Florian - Can you help us get accurate STR for the crash?
Flags: needinfo?(florian)
Whiteboard: [WebRTC], [blocking-webrtc+]
I strongly believe this is a dup of bug 807929, which now has a patch.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 807929
Flags: needinfo?(florian)
Duplicate of this bug: 824711
As given by bug 824711 this is not fixed yet.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Assignee: nobody → rjesup
Priority: -- → P1
Definitely still reproducible on 12/29 build. I managed to reproduce when playing around with the data channels test page.

https://crash-stats.mozilla.com/report/index/bp-1021fd0c-63b4-48e0-b271-8389b2121230
Keywords: reproducible
Actually, I think this crash has migrated onto windows as well now:

https://crash-stats.mozilla.com/report/index/bp-1021fd0c-63b4-48e0-b271-8389b2121230
https://crash-stats.mozilla.com/report/index/9e5d2c6d-33d8-40dc-811c-2a04d2130102

Example stack being:

0 	xul.dll 	mozilla::BaseAutoLock<mozilla::Mutex>::BaseAutoLock<mozilla::Mutex> 	obj-firefox/dist/include/mozilla/Mutex.h:153
1 	xul.dll 	mozilla::DataChannelConnection::Close 	netwerk/sctp/datachannel/DataChannel.cpp:2008
2 	xul.dll 	mozilla::DataChannel::Close 	netwerk/sctp/datachannel/DataChannel.cpp:2075
3 	xul.dll 	mozilla::DataChannel::~DataChannel 	netwerk/sctp/datachannel/DataChannel.cpp:2049
4 	xul.dll 	mozilla::DataChannel::`scalar deleting destructor' 	
5 	xul.dll 	mozilla::DataChannel::Release 	netwerk/sctp/datachannel/DataChannel.h:297
6 	xul.dll 	nsRefPtr<mozilla::DataChannel>::~nsRefPtr<mozilla::DataChannel> 	obj-firefox/dist/include/nsAutoPtr.h:876
7 	xul.dll 	nsDOMDataChannel::~nsDOMDataChannel 	
8 	xul.dll 	nsDOMDataChannel::`scalar deleting destructor' 	
9 	xul.dll 	nsDOMEventTargetHelper::Release 	content/events/src/nsDOMEventTargetHelper.cpp:74
10 	xul.dll 	ReleaseSliceNow 	js/xpconnect/src/XPCJSRuntime.cpp:576
11 	xul.dll 	XPCIncrementalReleaseRunnable::ReleaseNow 	js/xpconnect/src/XPCJSRuntime.cpp:631
12 	winmm.dll 	timeGetTime 	
13 	xul.dll 	XPCIncrementalReleaseRunnable::Run 	js/xpconnect/src/XPCJSRuntime.cpp:660
14 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:627
15 	xul.dll 	mozilla::ipc::MessagePump::Run 	ipc/glue/MessagePump.cpp:82
16 	xul.dll 	MessageLoop::RunHandler 	ipc/chromium/src/base/message_loop.cc:208
17 	xul.dll 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:182
18 	xul.dll 	nsBaseAppShell::Run 	widget/xpwidgets/nsBaseAppShell.cpp:163
19 	xul.dll 	nsAppShell::Run 	widget/windows/nsAppShell.cpp:232
20 	xul.dll 	nsAppStartup::Run 	toolkit/components/startup/nsAppStartup.cpp:288
21 	xul.dll 	XREMain::XRE_mainRun 	toolkit/xre/nsAppRunner.cpp:3823
22 	xul.dll 	XREMain::XRE_main 	toolkit/xre/nsAppRunner.cpp:3890
23 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp:4088
24 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:105
25 	firefox.exe 	__tmainCRTStartup 	crtexe.c:552
26 	kernel32.dll 	BaseThreadInitThunk 	
27 	ntdll.dll 	__RtlUserThreadStart 	
28 	ntdll.dll 	_RtlUserThreadStart
Comment on attachment 698279 [details] [diff] [review]
Clean up DataChannel close logic, fix creation refcount

Patrick - this is a high priority for review as the Social API guys are hitting problems with this frequently when reloading pages (bug 814882).  The *main* problem was the incorrect type on NotifyDataChannel causing a refcount imbalance.  

The rest is largely cleanup of Close(), safety checks (of mConnection), and handling the WAITING_TO_OPEN case in SCTP stream reset (which was hitting a MOZ_ASSERT).
Attachment #698279 - Flags: review?(mcmanus)
Simplified patch that just fixes the refcount issue and transport release per ekr's instructions, and one small other bug (not including a state in the incoming reset code) - for enabling review from a non-netwerk peer.  refcount issue was the some DataChannels (from ::Open) were created with an already_AddRefed<> pointer, and some (from NotifyDataChannel) weren't
Comment on attachment 698457 [details] [diff] [review]
Minimal fix for creation refcount and transport release

Asking for review from derf (for the straight XPCOM refcount issues) and ekr (for transport issues) on the simplified patch.  (state issue both/either - it's simple/obvious as are teh debugs).  If/when we get review on the main patch, we'll substitute that for this.
Attachment #698457 - Flags: review?(tterribe)
Attachment #698457 - Flags: review?(ekr)
update main patch with transportflow changes from ekr
Attachment #698279 - Attachment is obsolete: true
Attachment #698279 - Flags: review?(mcmanus)
Attachment #698459 - Flags: review?(mcmanus)
Attachment #698459 - Flags: review?(ekr)
Comment on attachment 698457 [details] [diff] [review]
Minimal fix for creation refcount and transport release

Review of attachment 698457 [details] [diff] [review]:
-----------------------------------------------------------------

I only reviewed the refcount parts.
Attachment #698457 - Flags: review?(tterribe) → review+
Comment on attachment 698457 [details] [diff] [review]
Minimal fix for creation refcount and transport release

Review of attachment 698457 [details] [diff] [review]:
-----------------------------------------------------------------

Depending on the answer to the question below, this is either an r- or an r+

::: netwerk/sctp/datachannel/DataChannel.cpp
@@ +27,5 @@
>  #ifdef MOZ_PEERCONNECTION
>  #include "mtransport/runnable_utils.h"
>  #endif
> +
> +#define DATACHANNEL_LOG(args) LOG(args)

Why redefine this here? The defn. seems to be the same as in the .h file.

@@ +191,5 @@
> +      MOZ_ASSERT(NS_IsMainThread());
> +      RUN_ON_THREAD(mSTS, WrapRunnable(nsRefPtr<DataChannelConnection>(this),
> +                                       &DataChannelConnection::disconnect_all),
> +                    NS_DISPATCH_NORMAL);
> +    }

Can you guarantee that the DCConnection will be deleted before returning to the event loop?

The point of this code is to guarantee that the signals are disconnected on the STS
thread. However, if you are on the STS thread now and you don't disconnect
the signals and then you delete the DCConnection on the main thread, you are
going to have problems. Is there a reason to have this code be conditional on IsSTSThread?

@@ +2064,5 @@
>    }
>  
>    // Clean up any pending opens for channels
>    nsRefPtr<DataChannel> channel;
> +  while (nullptr != (channel = dont_AddRef(static_cast<DataChannel *>(mPending.PopFront())))) {

Is there a reason not to just do !(channel = )

::: netwerk/sctp/datachannel/DataChannel.h
@@ +151,5 @@
>      }
>    int32_t SendBlob(uint16_t stream, nsIInputStream *aBlob);
>  
>    // Called on data reception from the SCTP library
>    // must(?) be public so my c->c++ tramploine can call it

sp. trampoline

@@ +262,5 @@
>    nsCOMPtr<nsIThread> mConnectThread;
>  };
>  
> +#define ENSURE_DATACONNECTION \
> +  do { if (!mConnection) { DATACHANNEL_LOG(("%s: %p no connection!",__FUNCTION__, this)); return; } } while (0)

Not that I care, but why not have just ENSURE_DATA_CONNECTION_RET and use /**/ as the argument?
Comment on attachment 698457 [details] [diff] [review]
Minimal fix for creation refcount and transport release

Review of attachment 698457 [details] [diff] [review]:
-----------------------------------------------------------------

Depending on the answer to the question below, this is either an r- or an r+.

::: netwerk/sctp/datachannel/DataChannel.cpp
@@ +27,5 @@
>  #ifdef MOZ_PEERCONNECTION
>  #include "mtransport/runnable_utils.h"
>  #endif
> +
> +#define DATACHANNEL_LOG(args) LOG(args)

Why redefine this here? The defn. seems to be the same as in the .h file.

@@ +191,5 @@
> +      MOZ_ASSERT(NS_IsMainThread());
> +      RUN_ON_THREAD(mSTS, WrapRunnable(nsRefPtr<DataChannelConnection>(this),
> +                                       &DataChannelConnection::disconnect_all),
> +                    NS_DISPATCH_NORMAL);
> +    }

Can you guarantee that the DCConnection will be deleted before returning to the event loop?

The point of this code is to guarantee that the signals are disconnected on the STS
thread. However, if you are on the STS thread now and you don't disconnect
the signals and then you delete the DCConnection on the main thread, you are
going to have problems. Is there a reason to have this code be conditional on IsSTSThread?

Reading the wrapper code, it appears that this is only called on the main thread, so I would
suggest removing that conditional entirely and just making this clause unconditional.

@@ +2064,5 @@
>    }
>  
>    // Clean up any pending opens for channels
>    nsRefPtr<DataChannel> channel;
> +  while (nullptr != (channel = dont_AddRef(static_cast<DataChannel *>(mPending.PopFront())))) {

Is there a reason not to just do !(channel = )

::: netwerk/sctp/datachannel/DataChannel.h
@@ +151,5 @@
>      }
>    int32_t SendBlob(uint16_t stream, nsIInputStream *aBlob);
>  
>    // Called on data reception from the SCTP library
>    // must(?) be public so my c->c++ tramploine can call it

sp. trampoline

@@ +262,5 @@
>    nsCOMPtr<nsIThread> mConnectThread;
>  };
>  
> +#define ENSURE_DATACONNECTION \
> +  do { if (!mConnection) { DATACHANNEL_LOG(("%s: %p no connection!",__FUNCTION__, this)); return; } } while (0)

Not that I care, but why not have just ENSURE_DATA_CONNECTION_RET and use /**/ as the argument?
Attachment #698457 - Flags: review?(ekr) → review+
(In reply to Eric Rescorla (:ekr) from comment #24)
> Comment on attachment 698457 [details] [diff] [review]
> Minimal fix for creation refcount and transport release
> 
> Review of attachment 698457 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Depending on the answer to the question below, this is either an r- or an r+.
> 
> ::: netwerk/sctp/datachannel/DataChannel.cpp
> @@ +27,5 @@
> >  #ifdef MOZ_PEERCONNECTION
> >  #include "mtransport/runnable_utils.h"
> >  #endif
> > +
> > +#define DATACHANNEL_LOG(args) LOG(args)
> 
> Why redefine this here? The defn. seems to be the same as in the .h file.

I'm using this #define trickery because DataChannel.h is included by things that don't have DataChannelLog.h/etc, and have their own LOG() define.

> 
> @@ +191,5 @@
> > +      MOZ_ASSERT(NS_IsMainThread());
> > +      RUN_ON_THREAD(mSTS, WrapRunnable(nsRefPtr<DataChannelConnection>(this),
> > +                                       &DataChannelConnection::disconnect_all),
> > +                    NS_DISPATCH_NORMAL);
> > +    }
> 
> Can you guarantee that the DCConnection will be deleted before returning to
> the event loop?
> 
> The point of this code is to guarantee that the signals are disconnected on
> the STS
> thread. However, if you are on the STS thread now and you don't disconnect
> the signals and then you delete the DCConnection on the main thread, you are
> going to have problems. Is there a reason to have this code be conditional
> on IsSTSThread?
> 
> Reading the wrapper code, it appears that this is only called on the main
> thread, so I would
> suggest removing that conditional entirely and just making this clause
> unconditional.

Done.

> 
> @@ +2064,5 @@
> >    }
> >  
> >    // Clean up any pending opens for channels
> >    nsRefPtr<DataChannel> channel;
> > +  while (nullptr != (channel = dont_AddRef(static_cast<DataChannel *>(mPending.PopFront())))) {
> 
> Is there a reason not to just do !(channel = )

No, not really, I think it came with some cargo-culted code, unless I was asked to do it in a review.  Since that was already here (we just added the { at the end), I'd rather leave this issue for McManus' review.

> 
> @@ +262,5 @@
> >    nsCOMPtr<nsIThread> mConnectThread;
> >  };
> >  
> > +#define ENSURE_DATACONNECTION \
> > +  do { if (!mConnection) { DATACHANNEL_LOG(("%s: %p no connection!",__FUNCTION__, this)); return; } } while (0)
> 
> Not that I care, but why not have just ENSURE_DATA_CONNECTION_RET and use
> /**/ as the argument?

Someone just went through the tree removing all those sorts of tricks. :-)
> > > +  while (nullptr != (channel = dont_AddRef(static_cast<DataChannel *>(mPending.PopFront())))) {
> > 
> > Is there a reason not to just do !(channel = )
> 
> No, not really, I think it came with some cargo-culted code, 

That's just review bait :)

if you say (nullptr != (foo = blah()) the reviewer says "don't explicitly compare against null"

if you say (foo = blah()) you get compiler warnings suggesting you can't count = characters.

if you say ((foo = blah()) (my preference) you get review comments about extra parens

if you say (!!(foo = blah())) you get review comments about obtuseness.

I personally think the author should do what they want.
Comment on attachment 698459 [details] [diff] [review]
Clean up DataChannel close logic, fix creation refcount

Review of attachment 698459 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.h
@@ +136,5 @@
>  
>    // DataConnection observers
>    void NotifyConnection();
>    void NotifyClosedConnection();
> +  void NotifyDataChannel(already_AddRefed<mozilla::DataChannel> aChannel);

this isn't a change request, because this code already uses this idiom so using it a little more is this is fine, but in general I really don't like these already_addrefed<> interfaces.. it makes using the smart pointers considerably harder imo.

this could just be a normal nsrefptr at the interface level and the implementation of NotifyDataChannel could do the forget().. That's really the logic of the NotifyDataChannel doing that work anyhow imo so it shouldn't be part of the calling interface.

Sure - it costs you an extra atomic inc/dec.
Attachment #698459 - Flags: review?(mcmanus) → review+
Thanks

The main problem with already_AddRefed (I suspect) is that it silently converts and loses the already_AddRefed property.  I think it could be modified to not allow this sort of error (at the cost of some explicit typing if you want to drop the status, not that it would be common outside of passing to a LOG() message).

I tend to automatically be wary of AddRef/Release, *especially* for threadsafe objects.  Overly so, I'm sure.  I've also found lots of spurious addref/releases make tracing any sort of leak virtually impossible.  And the original DOM code from khuey used already_AddRefed I think, as does the nsDOMMediaStream stuff which I also play with, and tons more in the tree, mostly for constructors, DOM elements and things that pull something out of a list/datastructure I think.   SVG seems quite littered with them too.

These DataChannel objects (which don't have high traffic) are not places we really need to optimize like this.  I believe however that we're losing a fair bit of perf in places from individually small silent addref/release costs (especially if they get inlined and don't show in profilers or each have an individual impl and so don't get aggregated, and doubly so for threadsafe).
(In reply to Patrick McManus [:mcmanus] from comment #27)
> > > > +  while (nullptr != (channel = dont_AddRef(static_cast<DataChannel *>(mPending.PopFront())))) {
> > > 
> > > Is there a reason not to just do !(channel = )
> > 
> > No, not really, I think it came with some cargo-culted code, 
> 
> That's just review bait :)
> 
> if you say (nullptr != (foo = blah()) the reviewer says "don't explicitly
> compare against null"
> 
> if you say (foo = blah()) you get compiler warnings suggesting you can't
> count = characters.
> 
> if you say ((foo = blah()) (my preference) you get review comments about
> extra parens
> 
> if you say (!!(foo = blah())) you get review comments about obtuseness.
> 
> I personally think the author should do what they want.

Obviously, I don't care what Jesup does here... I was just operating based on the style guidelines.

"When testing a pointer, use (!myPtr) or (myPtr); don't use myPtr != nsnull or myPtr == nsnull."
Comment on attachment 698459 [details] [diff] [review]
Clean up DataChannel close logic, fix creation refcount

Carrying forward r+ from minimal patch (note that there's a followup bug patch (caused by this code) with an r? to ekr)
Attachment #698459 - Flags: review?(ekr) → review+
I'm not sure what security rating to give this.
GC-related refcount incorrect crash, so would normally be high or crit, but it's preffed off by default.  Trying moderate.  (Remaining patch issues are not a sec issue, just need to unbitrot and land.)
Keywords: sec-moderate
Blocks: 827878
https://hg.mozilla.org/mozilla-central/rev/08f1385d7f0f

Marking for uplift as this fixes Close() functionality.  (The first half of this bug landed before the uplift)
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Whiteboard: [WebRTC], [blocking-webrtc+] → [WebRTC], [blocking-webrtc+][webrtc-uplift]
Keywords: verifyme
Needs a clearer test case in order to do in-testsuite? - although writing data channel mochitests in separate bugs might actually cover this bug.
Flags: in-testsuite-
Not getting a reproduction of this crash specifically off of the data channels page, so marking as verified.
Status: RESOLVED → VERIFIED
Keywords: verifyme
Whiteboard: [WebRTC], [blocking-webrtc+][webrtc-uplift] → [WebRTC], [blocking-webrtc+][webrtc-uplift][adv-main20-]
Whiteboard: [WebRTC], [blocking-webrtc+][webrtc-uplift][adv-main20-] → [WebRTC], [blocking-webrtc+][adv-main20-]
Group: core-security
You need to log in before you can comment on or make changes to this bug.