Closed
Bug 1395246
Opened 8 years ago
Closed 7 years ago
Crash in mozilla::TransportFlow::CheckThreadInt
Categories
(Core :: WebRTC: Networking, defect, P2)
Tracking
()
People
(Reporter: drno, Assigned: jesup)
Details
(4 keywords, Whiteboard: [adv-main61+][adv-esr60.1+][post-critsmash-triage])
Crash Data
Attachments
(1 file)
|
4.52 KB,
patch
|
drno
:
review+
RyanVM
:
approval-mozilla-beta+
RyanVM
:
approval-mozilla-esr60+
abillings
:
sec-approval+
|
Details | Diff | Splinter Review |
Looks like we have not completely fixed the problem in bug 1294097. I still see it happening with a lot lower frequency as recent as 54.0.1.
+++ This bug was initially created as a clone of Bug #1294097 +++
This bug was filed from the Socorro interface and is
report bp-85bd0299-8e95-4856-ad71-e371a2160810.
=============================================================
Crashing Thread (6)
Frame Module Signature Source
0 xul.dll mozilla::TransportFlow::CheckThreadInt() media/mtransport/transportflow.h:116
1 xul.dll mozilla::TransportFlow::CheckThread() media/mtransport/transportflow.h:109
2 xul.dll mozilla::TransportFlow::SendPacket(unsigned char const*, unsigned __int64) media/mtransport/transportflow.cpp:196
3 xul.dll mozilla::DataChannelConnection::SendPacket(unsigned char* const, unsigned __int64, bool) netwerk/sctp/datachannel/DataChannel.cpp:652
4 xul.dll mozilla::runnable_args_memfn<RefPtr<mozilla::DataChannelConnection>, int ( mozilla::DataChannelConnection::*)(unsigned char* const, unsigned __int64, bool), unsigned char*, unsigned __int64, bool>::Run() obj-firefox/dist/include/mtransport/runnable_utils.h:169
5 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1067
6 xul.dll NS_ProcessNextEvent(nsIThread*, bool) xpcom/glue/nsThreadUtils.cpp:290
7 xul.dll mozilla::net::nsSocketTransportService::Run() netwerk/base/nsSocketTransportService2.cpp:911
8 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1067
9 xul.dll NS_ProcessNextEvent(nsIThread*, bool) xpcom/glue/nsThreadUtils.cpp:290
10 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp:354
11 xul.dll MessageLoop::RunHandler() ipc/chromium/src/base/message_loop.cc:228
12 xul.dll MessageLoop::Run() ipc/chromium/src/base/message_loop.cc:208
13 xul.dll nsThread::ThreadFunc(void*) xpcom/threads/nsThread.cpp:467
14 nss3.dll PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c:397
15 nss3.dll pr_root nsprpub/pr/src/md/windows/w95thred.c:95
16 ucrtbase.dll crt_at_quick_exit
17 kernel32.dll BaseThreadInitThunk
18 ntdll.dll RtlUserThreadStart
crashes with this signature are regressing in number since the 49 nightly cycle. it's a rather low volume crash, currently making up 0.07% of browser crashes in 49.0b1.
Updated•8 years ago
|
Whiteboard: [adv-main49-]
Updated•8 years ago
|
Group: core-security → media-core-security
| Assignee | ||
Comment 1•8 years ago
|
||
sec-high, clear UAF. callback after the transport was released it appears.
Rank: 10
| Assignee | ||
Comment 2•8 years ago
|
||
Mass change P1->P2 to align with new Mozilla triage process.
Priority: P1 → P2
| Reporter | ||
Updated•8 years ago
|
Assignee: nobody → drno
| Reporter | ||
Comment 3•8 years ago
|
||
From looking at this one recent crash https://crash-stats.mozilla.com/report/index/5823d540-77fb-402b-a40b-de06d0170928#allthreads
It looks like one unidentified webrtc thread is still hanging around #100, the SCTP timer thread #66, the SCTP interator thread #65 and then STS thread #7 executes a runable to send outgoing data channel data although the underlying transport flow has been freed already.
| Reporter | ||
Comment 4•8 years ago
|
||
This one https://crash-stats.mozilla.com/report/index/eabccd8d-9c51-453a-8de6-c59ef0170509#tab-details actually encountered the same error while trying to send out an RT(C)P packet instead of a data channel.
| Assignee | ||
Comment 5•8 years ago
|
||
(In reply to Nils Ohlmeier [:drno] from comment #4)
> This one
> https://crash-stats.mozilla.com/report/index/eabccd8d-9c51-453a-8de6-
> c59ef0170509#tab-details actually encountered the same error while trying to
> send out an RT(C)P packet instead of a data channel.
Useful - this points clearly at the mtransport/TransportFlow code, or whatever is supposed to keep it alive. Detach seems to always occur on STS thread.
However, that's a 53.0.2 crash... while this code hasn't changed that much, it's increasingly suspect. Likely the same flaw, though.
DataChannels (for those crashes) always holds a RefPtr to the transportflow. All but 2 crashes are from DataChannels, and most have clear UAF signatures.
I wonder if the DataChannels crashes are the RUN_ON_THREAD thing...
| Assignee | ||
Comment 6•8 years ago
|
||
Note that DataChannel cleanup is using RUN_ON_THREAD which queue-jumps, but SendPacket is using Dispatch() which doesn't (even though it is being called on STS). In fact, it's dispatching to the same thread on purpose to ensure it executes without a lock held; queue-jumping here would cause it to be run with the lock held.
However, the RUN_ON_THREAD is run from MainThread, and dispatches to STS, so it shouldn't be jumping. So likely this is unrelated.
Comment 8•7 years ago
|
||
Nils seems busy.
Randell, I notice you fixed bug 1294097, which seems related. Can you drive this further?
Flags: needinfo?(rjesup)
| Reporter | ||
Comment 9•7 years ago
|
||
So the bad news is that this happens on 60 where we have a new version of libsctp and RUN_ON_THREAD is no longer queue jumping.
E.g.
https://crash-stats.mozilla.com/report/index/0bac552f-8b8a-4c73-b1dc-33b540180213#allthreads
https://crash-stats.mozilla.com/report/index/ef958152-c66a-401e-bf57-afcb30180215#allthreads
I thought the presence of the sctp timer and iterater threads would indicate that SCTP didn't get shutdown properly. But it looks like once these threads are started they are kept around until Firefox shuts down:
https://searchfox.org/mozilla-central/rev/74b7ffee403c7ffd05b8b476c411cbf11d134eb9/netwerk/sctp/datachannel/DataChannel.cpp#138
Question 1: how can the transportflow shut down while the data channel holds a refptr?
Question 2: why does the data channel keep sending, when the peerconnection is gone already?
Flags: needinfo?(drno)
Comment 10•7 years ago
|
||
1. In RAWRTC, I use a circular reference (e.g. the transport has an array of channels but the channels have a reference to the transport as well) and carefully tear that down by flushing this array and closing each channel when the transport is being stopped. When the channel is closed, it removes itself from the transport. I guess this would work here as well (if it's not already the case).
2. Not sure what you mean by 'gone' but theoretically it should send all pending data (even though there is bug 1413198), so this is to be expected.
| Assignee | ||
Comment 11•7 years ago
|
||
One big change: in 57 and later, every crash has been a null-deref. The one 52esr crash is a UAF. Hard to be sure without looking at the ASM, but I think this means that mTransportFlow in DataChannel.cpp is null.
This implies strongly that a change I landed and uplifted for 57 as guess has converted the UAF into a nullptr crash. This also means we can patch 52esr with that patch to make this particular crash safe. The patch that changed it appears to be
Bug 1400563: more DataChannelConnection shutdown cleanup r=drno
Flags: needinfo?(rjesup)
Comment 12•7 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #11)
> One big change: in 57 and later, every crash has been a null-deref. The
> one 52esr crash is a UAF. Hard to be sure without looking at the ASM, but I
> think this means that mTransportFlow in DataChannel.cpp is null.
>
> This implies strongly that a change I landed and uplifted for 57 as guess
> has converted the UAF into a nullptr crash. This also means we can patch
> 52esr with that patch to make this particular crash safe. The patch that
> changed it appears to be
>
> Bug 1400563: more DataChannelConnection shutdown cleanup r=drno
So, do we want to add a null check?
Flags: needinfo?(drno)
| Reporter | ||
Comment 13•7 years ago
|
||
(In reply to Frederik Braun [:freddyb] from comment #12)
> So, do we want to add a null check?
While Randell is right that the pattern has changed, the crash address is unfortunately not always zero. So I think a null check wouldn't solve the problem completely.
I'm trying to figure out by looking at the code how we can get into this situation.
Flags: needinfo?(drno)
Comment 14•7 years ago
|
||
(In reply to Nils Ohlmeier [:drno] from comment #13)
> (In reply to Frederik Braun [:freddyb] from comment #12)
> > So, do we want to add a null check?
>
> While Randell is right that the pattern has changed, the crash address is
> unfortunately not always zero. So I think a null check wouldn't solve the
> problem completely.
> I'm trying to figure out by looking at the code how we can get into this
> situation.
It's been a while, so I'm nudging you again :)
Flags: needinfo?(drno)
| Assignee | ||
Comment 15•7 years ago
|
||
(In reply to Nils Ohlmeier [:drno] from comment #13)
> (In reply to Frederik Braun [:freddyb] from comment #12)
> > So, do we want to add a null check?
>
> While Randell is right that the pattern has changed, the crash address is
> unfortunately not always zero. So I think a null check wouldn't solve the
> problem completely.
> I'm trying to figure out by looking at the code how we can get into this
> situation.
There is only a single non-null-deref crash post-57:
https://crash-stats.mozilla.com/report/index/6074cf2d-385e-498c-9e60-5dc630180304
And that stack has an impossible transition to the transport code - mozilla::gfx::CaptureCommandList::~CaptureCommandList() isn't going to call ~TransportFlow
So I think we're still at "this is a nullptr crash in 57 and later". Non-nullptr would still affect 52ESR (for a little while)
| Assignee | ||
Comment 16•7 years ago
|
||
I pulled a single 61.0bN crash. 'this' is null, which implies that peer->mTransportFlow in the DataChannelConnection is null, which happens in DestroyOnSTSFinal (and I added that nulling in bug 1400563 in shutdown-cleanup work, which fits with a 57 timeframe).
This also means that the runnable must have been queued after we queued DestroyOnSTSFinal(), which we explicitly do before nulling to ensure packets in-flight on STS are processed before nulling it. And we've called usrsctp_close() and usrsctp_deregister_address() before queuing the DestroyOnSTSFinal(). Note we're already on the STS thread when we call those and then queue the final destruction - we're just trying to ensure after calling close and deregister that we finish sending any packets that were queued before we did that.
This implies that those functions don't synchronize all the internals of sctp (timers?) if we can still get a callback after calling them. This concerns me, since we need to know when we can release the DataChannelConnection itself - but we do (because of paranoia about this) actively defer that destruction after DestroyOnSTSFinal().
I think we can wallpaper this safely, but it has implications for other issues we still have if these calls can occur after close/deregister.
| Assignee | ||
Comment 17•7 years ago
|
||
Attachment #8980626 -
Flags: review?(drno)
| Assignee | ||
Updated•7 years ago
|
Assignee: drno → rjesup
Status: NEW → ASSIGNED
| Reporter | ||
Comment 18•7 years ago
|
||
Comment on attachment 8980626 [details] [diff] [review]
nullcheck DataChannel SendPacket calls, add some diagnostics
Review of attachment 8980626 [details] [diff] [review]:
-----------------------------------------------------------------
LGTM
Attachment #8980626 -
Flags: review?(drno) → review+
| Assignee | ||
Comment 19•7 years ago
|
||
Comment on attachment 8980626 [details] [diff] [review]
nullcheck DataChannel SendPacket calls, add some diagnostics
[Security approval request comment]
How easily could an exploit be constructed based on the patch? Can't be
Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? no
Which older supported branches are affected by this flaw? all
If not all supported branches, which bug introduced the flaw? n/a
Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? trivial
How likely is this patch to cause regressions; how much testing does it need? no likelihood - adds a null check, and a diagnostic assert to show if we're getting invoked after shutdown (and get a stack showing why, hopefully)
Attachment #8980626 -
Flags: sec-approval?
Comment 20•7 years ago
|
||
sec-approval+ for trunk. I'd like to get a Beta patch and ones for both ESR branches made and nominated as well.
status-firefox60:
--- → wontfix
status-firefox61:
--- → affected
status-firefox62:
--- → affected
status-firefox-esr52:
--- → affected
status-firefox-esr60:
--- → affected
tracking-firefox61:
--- → +
tracking-firefox62:
--- → +
tracking-firefox-esr52:
--- → 61+
tracking-firefox-esr60:
--- → 61+
Updated•7 years ago
|
Attachment #8980626 -
Flags: sec-approval? → sec-approval+
| Assignee | ||
Comment 21•7 years ago
|
||
Comment 22•7 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/4d7a8d001f08
Please request Beta/ESR60 approval on this when you get a chance. It grafts cleanly as-landed to both. It'll also need a rebased patch and approval request for ESR52.
Group: media-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: needinfo?(rjesup)
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
| Assignee | ||
Comment 23•7 years ago
|
||
The patch applies perfectly to beta. esr52 is behind several patches in this area; if we want to take it we'd want to take at least two patches that cleaned up shutdown post-52. The patch should apply trivially to ESR60 as well.
| Assignee | ||
Comment 24•7 years ago
|
||
Comment on attachment 8980626 [details] [diff] [review]
nullcheck DataChannel SendPacket calls, add some diagnostics
Note: I'm putting this up for beta/esr60 since I was asked to, but I don't think there's much reason to do so. This isn't a stability issue (very low crash rate); it's a nullptr crash almost without exception (though it implies internal issues in the SCTP library). The UAF aspect of this went away after 57 (due to other changes in the DataChannel shutdown code that made it more predictable).
[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: N/A
User impact if declined: low impact - null deref possible
Fix Landed on Version: 62
Risk to taking this patch (and alternatives if risky): zero - in release builds it's just a nullptr check and a MOZ_ASSERT.
String or UUID changes made by this patch: none
See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Approval Request Comment
[Feature/Bug causing the regression]: N/A
[User impact if declined]: null deref
[Is this code covered by automated tests?]: yes, but they don't hit this race
[Has the fix been verified in Nightly?]: no STR
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: null-check and a diagnostic assert
[String changes made/needed]: none
Flags: needinfo?(rjesup)
Attachment #8980626 -
Flags: approval-mozilla-esr60?
Attachment #8980626 -
Flags: approval-mozilla-beta?
Comment 25•7 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #23)
> if we want to take it we'd want to take at least two patches that
> cleaned up shutdown post-52.
Based on comment 24, isn't ESR52 the most critical branch to take this on since that's where we're still UAFing?
Flags: needinfo?(rjesup)
Comment 26•7 years ago
|
||
Comment on attachment 8980626 [details] [diff] [review]
nullcheck DataChannel SendPacket calls, add some diagnostics
Taking for 61.0b13 and ESR 60.1, need an answer on ESR52 still.
Attachment #8980626 -
Flags: approval-mozilla-esr60?
Attachment #8980626 -
Flags: approval-mozilla-esr60+
Attachment #8980626 -
Flags: approval-mozilla-beta?
Attachment #8980626 -
Flags: approval-mozilla-beta+
Comment 27•7 years ago
|
||
| uplift | ||
Comment 28•7 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #26)
> Comment on attachment 8980626 [details] [diff] [review]
> nullcheck DataChannel SendPacket calls, add some diagnostics
>
> Taking for 61.0b13 and ESR 60.1, need an answer on ESR52 still.
We're running out of time here. Randell?
Comment 29•7 years ago
|
||
On discussions Randell at the All Hands, I'm going to won't fix this for ESR52. It needs more patches there and could be destabilized.
tracking-firefox-esr52:
61+ → ---
Whiteboard: [adv-main61+][adv-esr60.1+]
Updated•7 years ago
|
Flags: qe-verify-
Whiteboard: [adv-main61+][adv-esr60.1+] → [adv-main61+][adv-esr60.1+][post-critsmash-triage]
| Reporter | ||
Updated•7 years ago
|
Flags: needinfo?(drno)
Updated•7 years ago
|
Flags: needinfo?(rjesup)
Updated•6 years ago
|
Group: core-security-release
You need to log in
before you can comment on or make changes to this bug.
Description
•