Closed Bug 1395246 Opened 3 years ago Closed 2 years ago

Crash in mozilla::TransportFlow::CheckThreadInt

Categories

(Core :: WebRTC: Networking, defect, P2)

49 Branch
All
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla62
Tracking Status
firefox-esr52 --- wontfix
firefox-esr60 61+ fixed
firefox60 --- wontfix
firefox61 + fixed
firefox62 + fixed

People

(Reporter: drno, Assigned: jesup)

Details

(4 keywords, Whiteboard: [adv-main61+][adv-esr60.1+][post-critsmash-triage])

Crash Data

Attachments

(1 file)

Looks like we have not completely fixed the problem in bug 1294097. I still see it happening with a lot lower frequency as recent as 54.0.1.

+++ This bug was initially created as a clone of Bug #1294097 +++

This bug was filed from the Socorro interface and is 
report bp-85bd0299-8e95-4856-ad71-e371a2160810.
=============================================================
Crashing Thread (6)
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::TransportFlow::CheckThreadInt() 	media/mtransport/transportflow.h:116
1 	xul.dll 	mozilla::TransportFlow::CheckThread() 	media/mtransport/transportflow.h:109
2 	xul.dll 	mozilla::TransportFlow::SendPacket(unsigned char const*, unsigned __int64) 	media/mtransport/transportflow.cpp:196
3 	xul.dll 	mozilla::DataChannelConnection::SendPacket(unsigned char* const, unsigned __int64, bool) 	netwerk/sctp/datachannel/DataChannel.cpp:652
4 	xul.dll 	mozilla::runnable_args_memfn<RefPtr<mozilla::DataChannelConnection>, int ( mozilla::DataChannelConnection::*)(unsigned char* const, unsigned __int64, bool), unsigned char*, unsigned __int64, bool>::Run() 	obj-firefox/dist/include/mtransport/runnable_utils.h:169
5 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1067
6 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp:290
7 	xul.dll 	mozilla::net::nsSocketTransportService::Run() 	netwerk/base/nsSocketTransportService2.cpp:911
8 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1067
9 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp:290
10 	xul.dll 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:354
11 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:228
12 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:208
13 	xul.dll 	nsThread::ThreadFunc(void*) 	xpcom/threads/nsThread.cpp:467
14 	nss3.dll 	PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397
15 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:95
16 	ucrtbase.dll 	crt_at_quick_exit 	
17 	kernel32.dll 	BaseThreadInitThunk 	
18 	ntdll.dll 	RtlUserThreadStart

crashes with this signature are regressing in number since the 49 nightly cycle. it's a rather low volume crash, currently making up 0.07% of browser crashes in 49.0b1.
Whiteboard: [adv-main49-]
Group: core-security → media-core-security
sec-high, clear UAF.  callback after the transport was released it appears.
Rank: 10
Mass change P1->P2 to align with new Mozilla triage process.
Priority: P1 → P2
Assignee: nobody → drno
From looking at this one recent crash https://crash-stats.mozilla.com/report/index/5823d540-77fb-402b-a40b-de06d0170928#allthreads

It looks like one unidentified webrtc thread is still hanging around #100, the SCTP timer thread #66, the SCTP interator thread #65 and then STS thread #7 executes a runable to send outgoing data channel data although the underlying transport flow has been freed already.
This one https://crash-stats.mozilla.com/report/index/eabccd8d-9c51-453a-8de6-c59ef0170509#tab-details actually encountered the same error while trying to send out an RT(C)P packet instead of a data channel.
(In reply to Nils Ohlmeier [:drno] from comment #4)
> This one
> https://crash-stats.mozilla.com/report/index/eabccd8d-9c51-453a-8de6-
> c59ef0170509#tab-details actually encountered the same error while trying to
> send out an RT(C)P packet instead of a data channel.

Useful - this points clearly at the mtransport/TransportFlow code, or whatever is supposed to keep it alive.  Detach seems to always occur on STS thread. 

However, that's a 53.0.2 crash... while this code hasn't changed that much, it's increasingly suspect.  Likely the same flaw, though.


DataChannels (for those crashes) always holds a RefPtr to the transportflow.  All but 2 crashes are from DataChannels, and most have clear UAF signatures.

I wonder if the DataChannels crashes are the RUN_ON_THREAD thing...
Note that DataChannel cleanup is using RUN_ON_THREAD which queue-jumps, but SendPacket is using Dispatch() which doesn't (even though it is being called on STS).  In fact, it's dispatching to the same thread on purpose to ensure it executes without a lock held; queue-jumping here would cause it to be run with the lock held.

However, the RUN_ON_THREAD is run from MainThread, and dispatches to STS, so it shouldn't be jumping.  So likely this is unrelated.
Nils, any news on this one?
Flags: needinfo?(drno)
Nils seems busy.
Randell, I notice you fixed bug 1294097, which seems related. Can you drive this further?
Flags: needinfo?(rjesup)
So the bad news is that this happens on 60 where we have a new version of libsctp and RUN_ON_THREAD is no longer queue jumping.

E.g.
https://crash-stats.mozilla.com/report/index/0bac552f-8b8a-4c73-b1dc-33b540180213#allthreads
https://crash-stats.mozilla.com/report/index/ef958152-c66a-401e-bf57-afcb30180215#allthreads

I thought the presence of the sctp timer and iterater threads would indicate that SCTP didn't get shutdown properly. But it looks like once these threads are started they are kept around until Firefox shuts down:
https://searchfox.org/mozilla-central/rev/74b7ffee403c7ffd05b8b476c411cbf11d134eb9/netwerk/sctp/datachannel/DataChannel.cpp#138

Question 1: how can the transportflow shut down while the data channel holds a refptr?
Question 2: why does the data channel keep sending, when the peerconnection is gone already?
Flags: needinfo?(drno)
1. In RAWRTC, I use a circular reference (e.g. the transport has an array of channels but the channels have a reference to the transport as well) and carefully tear that down by flushing this array and closing each channel when the transport is being stopped. When the channel is closed, it removes itself from the transport. I guess this would work here as well (if it's not already the case).
2. Not sure what you mean by 'gone' but theoretically it should send all pending data (even though there is bug 	1413198), so this is to be expected.
One big change:  in 57 and later, every crash has been a null-deref.  The one 52esr crash is a UAF.  Hard to be sure without looking at the ASM, but I think this means that mTransportFlow in DataChannel.cpp is null.

This implies strongly that a change I landed and uplifted for 57 as guess has converted the UAF into a nullptr crash.  This also means we can patch 52esr with that patch to make this particular crash safe.  The patch that changed it appears to be

Bug 1400563: more DataChannelConnection shutdown cleanup r=drno
Flags: needinfo?(rjesup)
(In reply to Randell Jesup [:jesup] from comment #11)
> One big change:  in 57 and later, every crash has been a null-deref.  The
> one 52esr crash is a UAF.  Hard to be sure without looking at the ASM, but I
> think this means that mTransportFlow in DataChannel.cpp is null.
> 
> This implies strongly that a change I landed and uplifted for 57 as guess
> has converted the UAF into a nullptr crash.  This also means we can patch
> 52esr with that patch to make this particular crash safe.  The patch that
> changed it appears to be
> 
> Bug 1400563: more DataChannelConnection shutdown cleanup r=drno


So, do we want to add a null check?
Flags: needinfo?(drno)
(In reply to Frederik Braun [:freddyb] from comment #12)
> So, do we want to add a null check?

While Randell is right that the pattern has changed, the crash address is unfortunately not always zero. So I think a null check wouldn't solve the problem completely.
I'm trying to figure out by looking at the code how we can get into this situation.
Flags: needinfo?(drno)
(In reply to Nils Ohlmeier [:drno] from comment #13)
> (In reply to Frederik Braun [:freddyb] from comment #12)
> > So, do we want to add a null check?
> 
> While Randell is right that the pattern has changed, the crash address is
> unfortunately not always zero. So I think a null check wouldn't solve the
> problem completely.
> I'm trying to figure out by looking at the code how we can get into this
> situation.

It's been a while, so I'm nudging you again :)
Flags: needinfo?(drno)
(In reply to Nils Ohlmeier [:drno] from comment #13)
> (In reply to Frederik Braun [:freddyb] from comment #12)
> > So, do we want to add a null check?
> 
> While Randell is right that the pattern has changed, the crash address is
> unfortunately not always zero. So I think a null check wouldn't solve the
> problem completely.
> I'm trying to figure out by looking at the code how we can get into this
> situation.

There is only a single non-null-deref crash post-57: 
https://crash-stats.mozilla.com/report/index/6074cf2d-385e-498c-9e60-5dc630180304

And that stack has an impossible transition to the transport code - mozilla::gfx::CaptureCommandList::~CaptureCommandList() isn't going to call ~TransportFlow

So I think we're still at "this is a nullptr crash in 57 and later".  Non-nullptr would still affect 52ESR (for a little while)
I pulled a single 61.0bN crash.  'this' is null, which implies that peer->mTransportFlow in the DataChannelConnection is null, which happens in DestroyOnSTSFinal (and I added that nulling in bug 1400563 in shutdown-cleanup work, which fits with a 57 timeframe).

This also means that the runnable must have been queued after we queued DestroyOnSTSFinal(), which we explicitly do before nulling to ensure packets in-flight on STS are processed before nulling it.  And we've called usrsctp_close() and usrsctp_deregister_address() before queuing the DestroyOnSTSFinal().  Note we're already on the STS thread when we call those and then queue the final destruction - we're just trying to ensure after calling close and deregister that we finish sending any packets that were queued before we did that.

This implies that those functions don't synchronize all the internals of sctp (timers?) if we can still get a callback after calling them.  This concerns me, since we need to know when we can release the DataChannelConnection itself - but we do (because of paranoia about this) actively defer that destruction after DestroyOnSTSFinal().

I think we can wallpaper this safely, but it has implications for other issues we still have if these calls can occur after close/deregister.
Assignee: drno → rjesup
Status: NEW → ASSIGNED
Comment on attachment 8980626 [details] [diff] [review]
nullcheck DataChannel SendPacket calls, add some diagnostics

Review of attachment 8980626 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8980626 - Flags: review?(drno) → review+
Comment on attachment 8980626 [details] [diff] [review]
nullcheck DataChannel SendPacket calls, add some diagnostics

[Security approval request comment]
How easily could an exploit be constructed based on the patch? Can't be

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? no

Which older supported branches are affected by this flaw? all

If not all supported branches, which bug introduced the flaw? n/a

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? trivial

How likely is this patch to cause regressions; how much testing does it need? no likelihood - adds a null check, and a diagnostic assert to show if we're getting invoked after shutdown (and get a stack showing why, hopefully)
Attachment #8980626 - Flags: sec-approval?
sec-approval+ for trunk. I'd like to get a Beta patch and ones for both ESR branches made and nominated as well.
Attachment #8980626 - Flags: sec-approval? → sec-approval+
https://hg.mozilla.org/mozilla-central/rev/4d7a8d001f08

Please request Beta/ESR60 approval on this when you get a chance. It grafts cleanly as-landed to both. It'll also need a rebased patch and approval request for ESR52.
Group: media-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(rjesup)
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
The patch applies perfectly to beta.  esr52 is behind several patches in this area; if we want to take it we'd want to take at least two patches that cleaned up shutdown post-52.  The patch should apply trivially to ESR60 as well.
Comment on attachment 8980626 [details] [diff] [review]
nullcheck DataChannel SendPacket calls, add some diagnostics

Note: I'm putting this up for beta/esr60 since I was asked to, but I don't think there's much reason to do so.  This isn't a stability issue (very low crash rate); it's a nullptr crash almost without exception (though it implies internal issues in the SCTP library).  The UAF aspect of this went away after 57 (due to other changes in the DataChannel shutdown code that made it more predictable).

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: N/A

User impact if declined: low impact - null deref possible

Fix Landed on Version: 62

Risk to taking this patch (and alternatives if risky): zero - in release builds it's just a nullptr check and a MOZ_ASSERT.

String or UUID changes made by this patch: none

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.

Approval Request Comment
[Feature/Bug causing the regression]: N/A

[User impact if declined]: null deref

[Is this code covered by automated tests?]: yes, but they don't hit this race

[Has the fix been verified in Nightly?]: no STR

[Needs manual test from QE? If yes, steps to reproduce]:  no

[List of other uplifts needed for the feature/fix]: none

[Is the change risky?]: no 

[Why is the change risky/not risky?]: null-check and a diagnostic assert

[String changes made/needed]: none
Flags: needinfo?(rjesup)
Attachment #8980626 - Flags: approval-mozilla-esr60?
Attachment #8980626 - Flags: approval-mozilla-beta?
(In reply to Randell Jesup [:jesup] from comment #23)
> if we want to take it we'd want to take at least two patches that
> cleaned up shutdown post-52.

Based on comment 24, isn't ESR52 the most critical branch to take this on since that's where we're still UAFing?
Flags: needinfo?(rjesup)
Comment on attachment 8980626 [details] [diff] [review]
nullcheck DataChannel SendPacket calls, add some diagnostics

Taking for 61.0b13 and ESR 60.1, need an answer on ESR52 still.
Attachment #8980626 - Flags: approval-mozilla-esr60?
Attachment #8980626 - Flags: approval-mozilla-esr60+
Attachment #8980626 - Flags: approval-mozilla-beta?
Attachment #8980626 - Flags: approval-mozilla-beta+
(In reply to Ryan VanderMeulen [:RyanVM] from comment #26)
> Comment on attachment 8980626 [details] [diff] [review]
> nullcheck DataChannel SendPacket calls, add some diagnostics
> 
> Taking for 61.0b13 and ESR 60.1, need an answer on ESR52 still.

We're running out of time here. Randell?
On discussions Randell at the All Hands, I'm going to won't fix this for ESR52. It needs more patches there and could be destabilized.
Whiteboard: [adv-main61+][adv-esr60.1+]
Flags: qe-verify-
Whiteboard: [adv-main61+][adv-esr60.1+] → [adv-main61+][adv-esr60.1+][post-critsmash-triage]
Flags: needinfo?(drno)
Flags: needinfo?(rjesup)
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.