Closed Bug 1400563 Opened 4 years ago Closed 2 years ago

Crash in WinSqmSetIfMaxDWORD called from SCTP sowakeup() on association abort

Categories

(Core :: WebRTC: Networking, defect, P1)

52 Branch
x86
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla68
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox-esr68 --- fixed
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 + wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- fixed
firefox69 --- fixed

People

(Reporter: jesup, Assigned: ng)

References

Details

(5 keywords, Whiteboard: [adv-main68+])

Crash Data

Attachments

(7 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-f42d248e-262a-4f9a-ac7e-c546c0170916.
=============================================================

UAF (use after free) crash in sowakeup on association abort.
Crashes start in Firefox 54 (there are others with this signature from earlier, but they have very different stacks).

It looks like either sb_cond was 0xe5e5... or the data pointed to by sb_cond was freed (the condition var), which seems more likely.
Michael - is this familiar?
Flags: needinfo?(tuexen)
Unfortunately not...
Flags: needinfo?(tuexen)
URLs are all over the map, but appear to point to usage for ad serving to bypass ad blocking, and I suspect this is happening when the user navigates away from the page with the ad.  One domain I see in the list is serve.popads.net, and the overall pattern is of sites that would not use datachannels for any reason themselves.

This isn't IP address harvesting, since the association has been created.
Crash Signature: [@ WinSqmSetIfMaxDWORD] → [@ WinSqmSetIfMaxDWORD] [@ zzz_AsmCodeRange_End | sowakeup]
Crash Signature: [@ WinSqmSetIfMaxDWORD] [@ zzz_AsmCodeRange_End | sowakeup] → [@ WinSqmSetIfMaxDWORD] [@ zzz_AsmCodeRange_End | sowakeup] [@ RtlWakeAllConditionVariable | sowakeup ]
[@ ?? ?? ::FNODOBFM::`string''] and [@ ?? ::FNODOBFM::`string'' ] are also hit from sowakeup, but that's mixed with other bugs.  Those are all reported as 0xfffffffff errors, so it might or might not be a UAF (likely is)

There's a tail of other crashes with sowakeup which are all likely the same issue.
Crash Signature: [@ WinSqmSetIfMaxDWORD] [@ zzz_AsmCodeRange_End | sowakeup] [@ RtlWakeAllConditionVariable | sowakeup ] → [@ WinSqmSetIfMaxDWORD] [@ WinSqmSetIfMinDWORD] [@ zzz_AsmCodeRange_End | sowakeup] [@ RtlWakeAllConditionVariable | sowakeup] [@ RtlpResetDriveEnvironment | sowakeup] [@ RtlUlonglongByteSwap | sowakeup]
[Tracking Requested - why for this release]:
sec-high rising in frequency
Rank: 8
Priority: -- → P1
Keywords: regression
tracking as sec-high
If we can get a clean fix for this, I'd consider it for a ride-along on 56 point releases
Group: core-security → media-core-security
After checking all of the crash signatures Randell linked to this bug for the last 6 months, all of these crashes appear to be Windows only. So might be a problem in the platform specific code.
OS: Windows 7 → Windows
This seems to appear in Fx54 from the crashes.

Looking at the diffs/changesets for 53->54 for netwerk/sctp, there are really only 3:
1) collapsing else blocks where the if-portion returns/etc
2) %d/u -> %" PRIUINT " (and the like)
3) change to constent ordering of SCTP shutdown (when deregister occurs was dependent on timing before): bug 1311380

This would point moderately strongly at bug 1311380.  Speaking strongly against this analysis is that this patch was uplifted to 52 and 53 (and 45esr).  It would be very odd for this to provoke 0 crashes in 53 and 52 if the patch is there.  That would speak to it being a timing hole triggered by changes elsewhere.
OK, I can track this for 56, for now, and try to help move it along (I can't track every sec-high bug though). 

Maire, can you help find someone to investigate further?
OK, it seems to be a problem only on Windows (as indicated by Nils) and the traces show
that it happens during the ungraceful termination of the association.

Can I assume that this is not during the termination of Firefox?

If that is the case I can try to test this with a small test program on Windows...
Flags: needinfo?(rjesup)
(In reply to Michael Tüxen from comment #11)
> OK, it seems to be a problem only on Windows (as indicated by Nils) and the
> traces show
> that it happens during the ungraceful termination of the association.

The function it's crashing in is a Windows function, so anything that shows up here would be Windows.  It's possible we might get crashes on other platforms under different signatures, though I haven't seen them.  Search crash-stats for a "proto signature" that "contains" sowakeup for example

> 
> Can I assume that this is not during the termination of Firefox?

We are not in shutdown.
Flags: needinfo?(rjesup)
This is a search for all non-Windows crashes (in the last week) with "sctp" in the proto signature (aka the crashing thread's stack).  Note that this will get some false positives.
OK, lets focus on this Windows specific behaviour when the association is terminated by excessive retransmissions. Let me try to test this outside of Firefox with a test app. Will look how firefox handles this and try to imitate this.
It may be a race between abort-due-to-retransmissions and the PeerConnection being torn down.  I.e. after deregister_adddress or after close_socket (or while those are happening, on another thread).
attempt to stop the UAFs - move deregister to the same thread as close().  Is similar to how the webrtc.org code does it now.
Attachment #8916623 - Flags: review?(tuexen)
Attachment #8916623 - Flags: review?(drno)
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
Comment on attachment 8916623 [details] [diff] [review]
clean up SCTP shutdown

Review of attachment 8916623 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8916623 - Flags: review?(drno) → review+
Comment on attachment 8916623 [details] [diff] [review]
clean up SCTP shutdown

[Security approval request comment]
How easily could an exploit be constructed based on the patch? Very hard based on the patch.  On top of that, this is an unproven attempt to bypass the bug - we haven't been able to determine the actual issue for sure.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?  Not particularly, beyond that it would land on a sec bug presumably, and moves code calls from one thread to another. 

Which older supported branches are affected by this flaw? at least back to 53, though maybe worse in 55+.  Or it's getting used more.

If not all supported branches, which bug introduced the flaw?

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?   Trivial backport.  Should apply cleanly

How likely is this patch to cause regressions; how much testing does it need?  Very unlikely.  Plan to land on 58, then request 57 approval if it seems stable.
Attachment #8916623 - Flags: sec-approval?
Attachment #8916623 - Flags: review?(tuexen)
Attachment #8916623 - Flags: feedback?(tuexen)
Just as a note:
* We tried to reproduce the issue with a test app using the SCTP code being used in FF. We assumed that it is related to the association being terminated due to excessive retransmissions. We tested on Windows
* We tested FF on Windows under the situation that an existing association is terminated due to excessive retransmissions.

In both cases we could NOT reproduce the issue.
I suspect it's related to timing of shutdowns/deallocation, thread run order, and association termination.  So hard to repro in an isolated test, even if you duplicate the general threading model here.

That said: running it under rr in chaos mode (varying thread timing) might catch it: rr record -h ....
Comment on attachment 8916623 [details] [diff] [review]
clean up SCTP shutdown

sec-approval=dveditz
Attachment #8916623 - Flags: sec-approval? → sec-approval+
Attachment #8916623 - Flags: feedback?(tuexen)
If we see a drop-off in UAFs after this ships to m-c, I'll probably ask for uplift to 57
https://hg.mozilla.org/mozilla-central/rev/955cbee8d3d7

Please request Beta approval when you're comfortable doing so.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(rjesup)
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Group: media-core-security → core-security-release
Randel: did you wanted to leave this open, since we don't know yet if it fixes the issue?
At least we should not open up this bug until we know if it's fixed, or?
leaving open for now
Status: RESOLVED → REOPENED
Flags: needinfo?(rjesup)
Resolution: FIXED → ---
Note: since we rarely get any hits on this in Nightly, I'm going to ask for beta approval shortly assuming no regressions have appeared (and I expect none).
There are a total of ~20 crashes ever on 58, all before this patch landed. There are only 2 other crashes involving sctp, and those are both a non-shutdown null-deref (not related to this).
Comment on attachment 8916623 [details] [diff] [review]
clean up SCTP shutdown

Approval Request Comment
[Feature/Bug causing the regression]: UNsure; dates back to 53 but has been spiking in frequency (perhaps because of use by ad networks to bypass ublock/etc).

[User impact if declined]: dangerous sec bug may not be fixed -- this is not guaranteed to fix it; we hope it will.

[Is this code covered by automated tests?]:  Yes (though they don't hit this bug)

[Has the fix been verified in Nightly?]: No (but no immediately-apparent negative impacts)

[Needs manual test from QE? If yes, steps to reproduce]: No

[List of other uplifts needed for the feature/fix]: None

[Is the change risky?]:  No - moves all the sctp library shutdown code to happen on a single thread instead of split across two.

[Why is the change risky/not risky?]:  See above.

[String changes made/needed]: none
Attachment #8916623 - Flags: approval-mozilla-beta?
Given the rate of crashes in beta, we should know fairly quickly once in beta if this helps/solves the problem
Comment on attachment 8916623 [details] [diff] [review]
clean up SCTP shutdown

Sec-high, beta57+
Attachment #8916623 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Since this landed and uplifted I may see fewer crashes, though depending on when the beta build with this fix was done I don't believe I'd see beta crashes with this fix yet (I don't think it made b8).  There is one post-landing 58 crash in zzz_AsmCodeRange_End | sowakeup:
https://crash-stats.mozilla.com/report/index/608806ed-bc1f-4be6-82e8-4e3980171015

However, I'm still seeing a few (3) crashes in 58 in RtlWakeAllConditionVariable | sowakeup

It's hard to tell since there are few Nightly crashes in general, but a few post-landing 58 crashes implies that this (while probably a good idea) has not solved the base problem.  We'll see soon in 57b9.
mitigation attempt - avoid UAF by simply delaying connection free until any possible timer has fired.  To land on inbound, and uplift (potentially to release) if the UAFs disappear, while we try to solve the actual problem.  Alternatively we could delay the deletion, but MOZ_RELEASE_ASSERT on a packet send with mTransportFlow of null.  (with this patch, mTransportFlow will be null while in the delay, and this will cause a null-deref crash - and thus we'll know the fix is working to get rid of UAFs, and when we find/land a real fix we can know it fixed the bug before removing the delay)
Attachment #8921456 - Flags: review?(drno)
Comment on attachment 8921456 [details] [diff] [review]
Ensure we wait before shutting down DataChannelConnection

Review of attachment 8921456 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM with the one exception for my one question regarding where to null the transport.

::: netwerk/sctp/datachannel/DataChannel.cpp
@@ +285,5 @@
>    ASSERT_WEBRTC(mState == CLOSED);
>    MOZ_ASSERT(!mMasterSocket);
>    MOZ_ASSERT(mPending.GetSize() == 0);
> +  MOZ_ASSERT(!mTransportFlow);
> +  

Remove space.

@@ +365,5 @@
> +}
> +
> +void DataChannelConnection::DestroyOnSTSFinal()
> +{
> +  mTransportFlow = nullptr;

Isn't the transport flow still being used by the queued runnable which tries to send a packet?
I would have expected that you null this after the delay in the DataChannelConnectionShutdown().
Attachment #8921456 - Flags: review?(drno) → review+
> > +void DataChannelConnection::DestroyOnSTSFinal()
> > +{
> > +  mTransportFlow = nullptr;
> 
> Isn't the transport flow still being used by the queued runnable which tries
> to send a packet?
> I would have expected that you null this after the delay in the
> DataChannelConnectionShutdown().

Yes.... So that would actually send the (late) packet.  Nulling it here means that any attempt to SendPacket after we supposedly shut down SCTP would (safely) crash, either null-deref or an assertion (in debug).  Which as my comment 35 says will ensure that we know it's blocking UAFs, and when we actually fix the bug, the nullptr crashes going away will tell us it worked.

If the UAFs go away *and* we get no crashes, then we know it must have been some form of "SendPacket runnable already in the queue and the cleanup jumped the queue via RUN_ON_THREAD" bug.

If we care more about stability, I'd do as you say, but the absolute crash volume here is generally small, and they're already crashing (and have been).
Comment on attachment 8921456 [details] [diff] [review]
Ensure we wait before shutting down DataChannelConnection

[Security approval request comment]
How easily could an exploit be constructed based on the patch?  Not easily

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?  Perhaps a bit - a hold-a-ref-for-30-seconds timer on a sec bug is a bit of a giveaway not only that there's a refcount bug there, but that it's not really fixed.

My plan is if this works to request uplift to 57, or as a ride-along on a 57 point release.  It may be hard to know if it works until beta given crash volumes.

Which older supported branches are affected by this flaw? all

If not all supported branches, which bug introduced the flaw?

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? trivial

How likely is this patch to cause regressions; how much testing does it need? unlikely to cause regressions.  Worst would be some type of shutdown issue (DataChannelConnection closed "late" in shutdown).  I've tested locally with no problems.
Attachment #8921456 - Flags: sec-approval?
sec-approval+ for the new patch on trunk.

Given the new patch, shouldn't we be marking 57 and 58 as "affected" again?
Attachment #8921456 - Flags: sec-approval? → sec-approval+
yeah, they shouldn't have been marked fixed ever.
Oh, not just non-stylo, that just happens to be the only debug mochitest-chrome we run every push.
so this shutdown-leaks (which I was a bit concerned about) if the timer doesn't fire soon enough, and it's a long timer.  Wouldn't hurt real-life use, but breaks tests.  This moves the pattern to be a static list of connections in cleanup, so we can Clear() them in xpcom-will-shutdown.  Verified in timer-expires and quit-before-timer cases using GDB
Attachment #8922047 - Flags: review?(drno)
Comment on attachment 8922047 [details] [diff] [review]
more DataChannelConnection shutdown cleanup

Review of attachment 8922047 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM with the one little renaming for consistency.

::: netwerk/sctp/datachannel/DataChannel.cpp
@@ +156,5 @@
>      }
>      return NS_OK;
>    }
> +
> +  void AddConnection(DataChannelConnection* aConnection)

I find it confusing why this is called AddConnection(), but it's counter part RemoveConnectionShutdown(). I think it would make sense to name them in the same way. No strong preference which one is better.
Attachment #8922047 - Flags: review?(drno) → review+
> LGTM with the one little renaming for consistency.
> 

> > +  void AddConnection(DataChannelConnection* aConnection)
> 
> I find it confusing why this is called AddConnection(), but it's counter
> part RemoveConnectionShutdown(). I think it would make sense to name them in
> the same way. No strong preference which one is better.

Well, that's because one takes a DataChannelConnection, and the other takes and removes a DataChannelConnectionShutdown object (which was created by AddConnection())

Perhaps CreateConnectionShutdown() would resolve this without making it less clear.
fix nsTArray_base leaks
Attachment #8922369 - Flags: review?(nfroyd)
Comment on attachment 8922369 [details] [diff] [review]
fix leaked nsTArray

Review of attachment 8922369 [details] [diff] [review]:
-----------------------------------------------------------------

This works for fixing the leak and eliminating a static constructor to boot, thanks.  I assume we're not concerned about RemoveConnectionShutdown being called after we've cleared sConnections?
Attachment #8922369 - Flags: review?(nfroyd) → review+
Target Milestone: mozilla58 → ---
Crash Signature: [@ WinSqmSetIfMaxDWORD] [@ WinSqmSetIfMinDWORD] [@ zzz_AsmCodeRange_End | sowakeup] [@ RtlWakeAllConditionVariable | sowakeup] [@ RtlpResetDriveEnvironment | sowakeup] [@ RtlUlonglongByteSwap | sowakeup] → [@ WinSqmSetIfMaxDWORD] [@ WinSqmSetIfMinDWORD] [@ zzz_AsmCodeRange_End | sowakeup] [@ RtlWakeAllConditionVariable | sowakeup] [@ RtlpResetDriveEnvironment | sowakeup] [@ RtlUlonglongByteSwap | sowakeup] [@ sowakeup ]
Comment on attachment 8922047 [details] [diff] [review]
more DataChannelConnection shutdown cleanup

Approval Request Comment
[Feature/Bug causing the regression]: existing UAF

[User impact if declined]: UAF crashes continue

[Is this code covered by automated tests?]: yes, but they don't catch this bug

[Has the fix been verified in Nightly?]: It's been on nightly for 3 days with no new crashes

[Needs manual test from QE? If yes, steps to reproduce]:  no

[List of other uplifts needed for the feature/fix]: just the other leak-fix patch in this bug

[Is the change risky?]: no

[Why is the change risky/not risky?]:  
This has been on Nightly over the weekend, with no new crashes that I can find.  I see no crashes on nightly since it landed, but the crash rate on Nightly isn't high enough to know anything for weeks.  Give the overall rate of crashes with UAFs, and that this should mitigate all or at least most of them (by delaying releasing the object), I think this is worthwhile uplifting, and avoid a last-second uplift or trying to have it ride-along in a point release.  We can wait a few more days before uplifting to watch Nightly (both for new crashes, and for any sign that this is working/not-working - former UAFs should still crash as safe nullptr crashes (or assertions in debug builds)

[String changes made/needed]: none
Attachment #8922047 - Flags: approval-mozilla-beta?
Attachment #8922369 - Flags: approval-mozilla-beta?
Hi Dan, do these new patches require sec-approval+? Thanks!
Flags: needinfo?(dveditz)
Flags: needinfo?(dveditz)
Attachment #8922047 - Flags: sec-approval+
Attachment #8922369 - Flags: sec-approval+
So far on Nightly since we landed (on 10/27 in m-c) we have 1 crash with an 0xffffffffff address: 
https://crash-stats.mozilla.com/report/index/1b51928f-5365-443e-8991-23c6d0171031

As such, it's unclear if this is a UAF that windows crashreporter couldn't get an address for, or if this mitigation is working.  However, about 50% of crashes with this signature have been showing up as 0xfffffffff, and the patch should convert any UAF crashes into either null-derefs or assertions (on debug builds).  it's possible that the assumption about how a UAF occurs here is simply totally wrong.  The crash volume is so low on nightly it's hard to tell.
Still only the one 58-with-patch crash, and as per before, it's not clear if that's a nullptr or a UAF ptr.
There are no new 58a1 crashes with sctp anywhere on the stack since this landed, so there's no apparent regressions from the landing. (Though nightly doesn't get hit as much with anything as beta)

Let's go ahead with uplift to 57.  At worst it won't fix the problem (and it may not... but even that's info).
Flags: needinfo?(rkothari)
Comment on attachment 8922047 [details] [diff] [review]
more DataChannelConnection shutdown cleanup

Thanks for the due diligence Randell. This has baked on Nightly, sec-high, beta57+
Flags: needinfo?(rkothari)
Attachment #8922047 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8922369 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/b729ddecf1d1
https://hg.mozilla.org/releases/mozilla-beta/rev/ba08d4e02e0c

Are we ready to call this fixed yet? Given where we are in the cycle, it might not be a bad idea to move any follow-up work to a new bug at this point.
Flags: needinfo?(rjesup)
Flags: needinfo?(rjesup)
trivial fix to make this apply to beta
Attachment #8924447 - Flags: review?(jib)
Attachment #8924447 - Flags: review?(padenot)
Attachment #8924447 - Flags: review?(padenot)
Attachment #8924447 - Flags: review?(jib)
Attachment #8924447 - Flags: review+
There are about 6 58a1 crashes so far.  All have 0xffffffff as the address, so no negative or positive confirmation yet, though 6 0xffffffff's is starting to look good, given that some of the signatures hit have a 50% UAF ptr rate.
It appears there's no joy - https://crash-stats.mozilla.com/report/index/3c5c8bbd-ed2f-45b4-8518-3516e0171104 shows we still UAF even with a 30-second timeout on Connection destruction.  Also https://crash-stats.mozilla.com/report/index/7e10d4b3-0472-484c-89ca-762650171104 and https://crash-stats.mozilla.com/report/index/700f4769-7d6a-49df-acdd-3ee3b0171104 (finally hit a non-0xfffffffff crash on that signature).

We could back out the patch, since it doesn't seem to have helped.  OTOH is also doesn't seem to have hurt, and I don't know if the churn is worth it
Ritu, fyi since we may have an RC3 build.
Flags: needinfo?(rkothari)
I chatted with Dan and the crash volume seems to have down a bit so it may have helped some. I don't believe this is worth the churn. Let's re-review post 57 launch if this is still a concern.
Flags: needinfo?(rkothari)
(In reply to Randell Jesup [:jesup] from comment #63)
> We could back out the patch, since it doesn't seem to have helped.

It may have helped a tiny bit. Searching on sowakeup in the proto signature each of the top two signatures were getting about 50 crashes a week on 57beta, and in the last week got only 40 each. Some of the other signatures didn't budge much. Maybe it's just noise? More time will tell.

> OTOH italso doesn't seem to have hurt, and I don't know if the churn is worth it

Churn is never worth it in Release Candidates. The criteria is whether it's point-release-bad hurting us.
Blocks: 1408485
Taking off tracking, but if we get a fix I'd consider it for any point release.
Relanded the SCTP library update, with a patch that I think fixes this bug from bug 1421963.  I plan to uplift that second patch (which would also apply without the SCTP library update from bug 1297418).  That issue would explain these crashes that all seem related to channel resets.
See Also: → 1421963
Depends on: 1421963
Duplicate of this bug: 1426070
From comment 68, looks like this was fixed in 59 by the patch in bug 1421963. 
jesup, is that accurate? Should this be marked fixed in 59, wontfix for 58? Or is there some remaining issue?  Is there a way to check the fixes for this bug and the duplicates?
Flags: needinfo?(rjesup)
No, that fix appears related but did not fix this signature -- this has a different pathway to cause a problem (timer-kicked-off)
Flags: needinfo?(rjesup)
Duplicate of this bug: 1406730
Wontfix for 59 after discussion with jesup on irc. We could still take a patch for 60 though.
Duplicate of this bug: 1426070
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #73)
> Wontfix for 59 after discussion with jesup on irc. We could still take a
> patch for 60 though.

That's becoming more and more unlikely. Any updates, Randell? :)
Flags: needinfo?(rjesup)
No.  WONTFIX for 60 I think.  Sigh.
Flags: needinfo?(rjesup)
Keywords: stalled
Updating tracking flags as we get closer to the 64 release.

Very probably caused Bug 1521304

Keywords: stalled

Does not appear fixed by bug 1521304, as we have no drop in crashes in beta 8 or 9.

This may well be a race condition within the SCTP library, where we're tearing down the stack from socketthread or mainthread (with our lock), and it's racing against an SCTP timer-thread abort.

The e5e5 crashes on accessing sb_cond imply it was deleted. sodealloc() deletes the sb_cond, and also destroys the lock (DeleteCriticalSection, which requires the lock not be owned), so we know sodealloc assumes all access is ended.

in sofree():
/*
* From this point on, we assume that no other references to this
* socket exist anywhere else in the stack. Therefore, no locks need
* to be acquired or held.
*
* We used to do a lot of socket buffer and socket locking here, as
* well as invoke sorflush() and perform wakeups. The direct call to
* dom_dispose() and sbrelease_internal() are an inlining of what was
* necessary from sorflush().
*
* Notice that the socket buffer and kqueue state are torn down
* before calling pru_detach. This means that protocols shold not
* assume they can perform socket wakeups, etc, in their detach code.
*/
sodealloc(so);

When a DataChannel closes, we call usrsctp_close()->soabort(so) (locks) ->sofree(so)->sctp_close(so) (not under lock anymore), which calls SCTP_OS_TIMER_STOP(hb_timer).

sofree() then calls sodealloc(), which frees the sb_cond and then frees so.

SCTP_OS_TIMER_STOP does:
SCTP_TIMERQ_LOCK();
/*
* Don't attempt to delete a callout that's not on the queue.
*/
if (!(c->c_flags & SCTP_CALLOUT_PENDING)) {
c->c_flags &= ~SCTP_CALLOUT_ACTIVE;
SCTP_TIMERQ_UNLOCK();
return (0);
}
c->c_flags &= ~(SCTP_CALLOUT_ACTIVE | SCTP_CALLOUT_PENDING);
if (c == sctp_os_timer_next) {
sctp_os_timer_next = TAILQ_NEXT(c, tqe);
}
TAILQ_REMOVE(&SCTP_BASE_INFO(callqueue), c, tqe);
SCTP_TIMERQ_UNLOCK();

sctp_handle_tick() does:
c->c_flags &= ~SCTP_CALLOUT_PENDING;
SCTP_TIMERQ_UNLOCK();
c_func(c_arg);
SCTP_TIMERQ_LOCK();

so if the timer is sitting in handle_tick() at c_func(), and on another thread we call SCTP_OS_TIMER_STOP(), the OS_TIMER_STOP can get the lock, and finds that the timer is no longer pending, and so unlocks returns (and per above, we can then call sodealloc() and free the cond var and the so. And then the timer thread wakes up, and calls c_func, which is sctp_timeout_handler() (etc...)

We cannot delete a socket (and free it's condvars) until we have positive confirmation that the timer has been removed and cannot be called - including being in the process of being called.

This could be a lock (though I'm sure there are reasons why the timer code unlocks before calling the callbacks; likely because that can manipulate the timer list). There are other ways you could synchronize, but they all amount to "if it's currently processing the timer you're removing, you have to wait".

It's possible I have some details wrong but this seems like a clear hole to me, and almost certainly the cause here.

NI various, including dminor and drno as FYI, since this is a top-UAF bug.

Flags: needinfo?(tuexen)
Flags: needinfo?(lennart.grahl)
Flags: needinfo?(drno)
Flags: needinfo?(dminor)

Nico has been looking at DataChannels lately.

Flags: needinfo?(dminor) → needinfo?(na-g)

Would love to help but I don't fully comprehend the threading model in Firefox.

Flags: needinfo?(lennart.grahl)

This problem shouldn't depend on the threading model; any case where these occur on different threads and aren't covered by locks can cause this problem. The timer code runs on a different thread. Hitting this bug requires unfortunate timing, as is the norm with this sort of bug.

reiterating:
"We cannot delete a socket (and free it's condvars) until we have positive confirmation that the timer has been removed and cannot be called - including being in the process of being called.

This could be a lock (though I'm sure there are reasons why the timer code unlocks before calling the callbacks; likely because that can manipulate the timer list). There are other ways you could synchronize, but they all amount to "if it's currently processing the timer you're removing, you have to wait"."

Flags: needinfo?(peter.lei)

To give a status update: a patch for the problem (which we should be able to cheery pick) is currently actively being worked on by our SCTP friends.

Flags: needinfo?(drno)

... to be more precise: Peter Lei (he is on this bug report) is working on it and testing to make sure it doesn't break anything.

Once that is finished, he will provide the link to the patches required to integrate.

Flags: needinfo?(tuexen)

The latest sctp library code should hopefully fix this- if you're not taking the full library update, you'll want to cherry pick the following commits (in "git log" order, so apply bottom up):

   21943f4 Fix callout stop deadlock
   faf6330 Fix for missing timer wait wakeups
   d225e88 Fix for rescheduling a timer that is currently in progress.
   164ca60 Pull in missing guard for the currently executing callout (v2)
   93fe1b3 Fix windows build break
   a7c2902 Pull in missing guard for the currently executing callout
Flags: needinfo?(peter.lei)

Peter - thanks for diving in on this!

One typo:
+#define SCTP_TIMERWIT_UNLOCK() KASSERT(pthread_mutex_unlock(&sctp_os_timerwait_mtx) == 0, ("%s: sctp_os_timerwait_mtx not locked", func))

I think "TIMERWAIT"

Flags: needinfo?(peter.lei)
Attached patch Apply under cherry pick queue (obsolete) — Splinter Review

This patch can be applied under the cherry pick patch set to get it to cleanly apply (but not compile).

Attached patch Apply under cherry pick queue (obsolete) — Splinter Review

Apply under the cherry pick patch queue.

Attachment #9054599 - Attachment is obsolete: true

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #86)

Peter - thanks for diving in on this!

One typo:
+#define SCTP_TIMERWIT_UNLOCK() KASSERT(pthread_mutex_unlock(&sctp_os_timerwait_mtx) == 0, ("%s: sctp_os_timerwait_mtx not locked", func))

I think "TIMERWAIT"

Ah, yes thanks! I don't think we normally build with INVARIANTS (hence missed that) but will fix that.

Flags: needinfo?(peter.lei)

Peter, is the only additional patch that needs to be pulled in dfe89db Fix compile breakage under INVARIANTS?

Flags: needinfo?(peter.lei)

Nico, I think that is all you need.

We are currently improving our continuous integration tests for usrsctp to catch such kind of compile errors and adding other tests. That is why the status of the tests might be reported as failed on our website.

(In reply to Nico Grunbaum [:ng] from comment #90)

Peter, is the only additional patch that needs to be pulled in dfe89db Fix compile breakage under INVARIANTS?

Not sure if you absolutely require it; if INVARIANTS is defined, you'd need it as the build would be currently breaking. But yes, this is the only other commit that I know of w.r.t. the thread race fixes beyond the original set I posted.

Flags: needinfo?(peter.lei)

One of the patches in this set mentions fixing a timer deadlock, we could perhaps use that as a cover patch.

Attachment #9054624 - Attachment is obsolete: true

timer deadlock fix

Comment on attachment 9059326 [details]
Bug 1400563 - timer deadlock fix

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: It is probably difficult to create an exploit with knowledge of the bug. There are a number of comments in the patch that indicate that there was a timer shutdown race. Removing those comments from the patch wouldn't have much effect as the patch comes from the upstream 3rd party library, and one can see them on GitHub.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Yes
  • Which older supported branches are affected by this flaw?: All
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: No
  • If not, how different, hard to create, and risky will they be?: It should be easy as the effected code is altered vary rarely.
  • How likely is this patch to cause regressions; how much testing does it need?: It is possible that it may cause regressions, it needs time to bake on Nightly.
Flags: needinfo?(na-g)
Attachment #9059326 - Flags: sec-approval?
Attachment #9058838 - Flags: sec-approval?
Attachment #9058838 - Flags: sec-approval?

Comment on attachment 9059326 [details]
Bug 1400563 - timer deadlock fix

sec-approval+ for trunk so it can get some bake time.

Attachment #9059326 - Flags: sec-approval? → sec-approval+
Assignee: rjesup → na-g

No crashes in 68 since this landed:
https://crash-stats.mozilla.com/search/?signature=%3DWinSqmSetIfMaxDWORD&signature=%3DWinSqmSetIfMinDWORD&signature=%3Dzzz_AsmCodeRange_End%20%7C%20sowakeup&signature=%3DRtlWakeAllConditionVariable%20%7C%20sowakeup&signature=%3DRtlpResetDriveEnvironment%20%7C%20sowakeup&signature=%3DRtlUlonglongByteSwap%20%7C%20sowakeup&signature=%3Dsowakeup&build_id=%3E20190427000000&version=68.0a1&date=%3E%3D2019-04-13T20%3A53%3A00.000Z&date=%3C2019-05-13T20%3A53%3A00.000Z&_facets=signature&page=1&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

No new SCTP crashes since it landed:
https://crash-stats.mozilla.com/search/?proto_signature=~sctp&build_id=%3E20190427000000&product=Firefox&version=68.0a1&date=%3E%3D2019-04-13T20%3A55%3A00.000Z&date=%3C2019-05-13T20%3A55%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Note, however, that we see few of these in Nightly; many more in beta:

67bN and 68a1 for the last month:
https://crash-stats.mozilla.com/search/?proto_signature=~sctp&product=Firefox&version=68.0a1&version=67.0b&date=%3E%3D2019-04-13T20%3A55%3A00.000Z&date=%3C2019-05-13T20%3A55%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

But if you restrict to 68, there's only 1 crash other than OOM.

With not a single crash in any of the 68 beta builds I'm tempted to close this. But we probably want to wait until 68 becomes release before we know for sure.

Not a single reported crash since 68 (which contains the patches) went into release. Looks like we are getting pretty close to calling this fixed!

Target Milestone: --- → mozilla68

I think we can now call this fixed, checking all the signatures above, no crashes in 68 or beyond for the last month.

Status: REOPENED → RESOLVED
Closed: 4 years ago2 years ago
Resolution: --- → FIXED

Al did this ever get a security advisory? I just looked in https://www.mozilla.org/en-US/security/advisories/mfsa2019-21/ and didn't see it there.

Flags: needinfo?(abillings)

(In reply to Liz Henry (:lizzard) from comment #104)

Al did this ever get a security advisory? I just looked in https://www.mozilla.org/en-US/security/advisories/mfsa2019-21/ and didn't see it there.

It did not because the fields were all changed after we shipped on July 9. I've added it to the roll up, where it should go.

Flags: needinfo?(abillings)
Whiteboard: [adv-main68+]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.