Crash in WinSqmSetIfMaxDWORD called from SCTP sowakeup() on association abort
Categories
(Core :: WebRTC: Networking, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox-esr60 | --- | wontfix |
firefox-esr68 | --- | fixed |
firefox55 | --- | wontfix |
firefox56 | --- | wontfix |
firefox57 | --- | wontfix |
firefox58 | --- | wontfix |
firefox59 | + | wontfix |
firefox60 | --- | wontfix |
firefox61 | --- | wontfix |
firefox62 | --- | wontfix |
firefox63 | --- | wontfix |
firefox64 | --- | wontfix |
firefox65 | --- | wontfix |
firefox66 | --- | wontfix |
firefox67 | --- | wontfix |
firefox68 | --- | fixed |
firefox69 | --- | fixed |
People
(Reporter: jesup, Assigned: ng)
References
Details
(5 keywords, Whiteboard: [adv-main68+])
Crash Data
Attachments
(7 files, 2 obsolete files)
2.80 KB,
patch
|
drno
:
review+
ritu
:
approval-mozilla-beta+
dveditz
:
sec-approval+
|
Details | Diff | Splinter Review |
6.26 KB,
patch
|
drno
:
review+
abillings
:
sec-approval+
|
Details | Diff | Splinter Review |
8.73 KB,
patch
|
drno
:
review+
ritu
:
approval-mozilla-beta+
dveditz
:
sec-approval+
|
Details | Diff | Splinter Review |
2.33 KB,
patch
|
froydnj
:
review+
ritu
:
approval-mozilla-beta+
dveditz
:
sec-approval+
|
Details | Diff | Splinter Review |
1.14 KB,
patch
|
padenot
:
review+
|
Details | Diff | Splinter Review |
20.23 KB,
patch
|
Details | Diff | Splinter Review | |
47 bytes,
text/x-phabricator-request
|
abillings
:
sec-approval+
|
Details | Review |
This bug was filed from the Socorro interface and is report bp-f42d248e-262a-4f9a-ac7e-c546c0170916. ============================================================= UAF (use after free) crash in sowakeup on association abort. Crashes start in Firefox 54 (there are others with this signature from earlier, but they have very different stacks). It looks like either sb_cond was 0xe5e5... or the data pointed to by sb_cond was freed (the condition var), which seems more likely.
Reporter | ||
Comment 3•4 years ago
|
||
URLs are all over the map, but appear to point to usage for ad serving to bypass ad blocking, and I suspect this is happening when the user navigates away from the page with the ad. One domain I see in the list is serve.popads.net, and the overall pattern is of sites that would not use datachannels for any reason themselves. This isn't IP address harvesting, since the association has been created.
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Comment 4•4 years ago
|
||
[@ ?? ?? ::FNODOBFM::`string''] and [@ ?? ::FNODOBFM::`string'' ] are also hit from sowakeup, but that's mixed with other bugs. Those are all reported as 0xfffffffff errors, so it might or might not be a UAF (likely is) There's a tail of other crashes with sowakeup which are all likely the same issue.
Reporter | ||
Comment 5•4 years ago
|
||
[Tracking Requested - why for this release]: sec-high rising in frequency
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Comment 7•4 years ago
|
||
If we can get a clean fix for this, I'd consider it for a ride-along on 56 point releases
Updated•4 years ago
|
Comment 8•4 years ago
|
||
After checking all of the crash signatures Randell linked to this bug for the last 6 months, all of these crashes appear to be Windows only. So might be a problem in the platform specific code.
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Comment 9•4 years ago
|
||
This seems to appear in Fx54 from the crashes. Looking at the diffs/changesets for 53->54 for netwerk/sctp, there are really only 3: 1) collapsing else blocks where the if-portion returns/etc 2) %d/u -> %" PRIUINT " (and the like) 3) change to constent ordering of SCTP shutdown (when deregister occurs was dependent on timing before): bug 1311380 This would point moderately strongly at bug 1311380. Speaking strongly against this analysis is that this patch was uplifted to 52 and 53 (and 45esr). It would be very odd for this to provoke 0 crashes in 53 and 52 if the patch is there. That would speak to it being a timing hole triggered by changes elsewhere.
Updated•4 years ago
|
OK, I can track this for 56, for now, and try to help move it along (I can't track every sec-high bug though). Maire, can you help find someone to investigate further?
Comment 11•4 years ago
|
||
OK, it seems to be a problem only on Windows (as indicated by Nils) and the traces show that it happens during the ungraceful termination of the association. Can I assume that this is not during the termination of Firefox? If that is the case I can try to test this with a small test program on Windows...
Reporter | ||
Comment 12•4 years ago
|
||
(In reply to Michael Tüxen from comment #11) > OK, it seems to be a problem only on Windows (as indicated by Nils) and the > traces show > that it happens during the ungraceful termination of the association. The function it's crashing in is a Windows function, so anything that shows up here would be Windows. It's possible we might get crashes on other platforms under different signatures, though I haven't seen them. Search crash-stats for a "proto signature" that "contains" sowakeup for example > > Can I assume that this is not during the termination of Firefox? We are not in shutdown.
Reporter | ||
Comment 13•4 years ago
|
||
This is a search for all non-Windows crashes (in the last week) with "sctp" in the proto signature (aka the crashing thread's stack). Note that this will get some false positives.
Reporter | ||
Comment 14•4 years ago
|
||
https://crash-stats.mozilla.com/search/?proto_signature=~sctp&platform=%21Windows&product=Firefox&date=%3E%3D2017-06-29T19%3A50%3A31.000Z&date=%3C2017-09-29T19%3A50%3A31.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature Sorry, that's 3 months, not 1 week. easily changed
Comment 15•4 years ago
|
||
OK, lets focus on this Windows specific behaviour when the association is terminated by excessive retransmissions. Let me try to test this outside of Firefox with a test app. Will look how firefox handles this and try to imitate this.
Reporter | ||
Comment 16•4 years ago
|
||
It may be a race between abort-due-to-retransmissions and the PeerConnection being torn down. I.e. after deregister_adddress or after close_socket (or while those are happening, on another thread).
Reporter | ||
Comment 17•4 years ago
|
||
attempt to stop the UAFs - move deregister to the same thread as close(). Is similar to how the webrtc.org code does it now.
Reporter | ||
Updated•4 years ago
|
Comment 18•4 years ago
|
||
Comment on attachment 8916623 [details] [diff] [review] clean up SCTP shutdown Review of attachment 8916623 [details] [diff] [review]: ----------------------------------------------------------------- LGTM
Reporter | ||
Comment 19•4 years ago
|
||
Comment on attachment 8916623 [details] [diff] [review] clean up SCTP shutdown [Security approval request comment] How easily could an exploit be constructed based on the patch? Very hard based on the patch. On top of that, this is an unproven attempt to bypass the bug - we haven't been able to determine the actual issue for sure. Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? Not particularly, beyond that it would land on a sec bug presumably, and moves code calls from one thread to another. Which older supported branches are affected by this flaw? at least back to 53, though maybe worse in 55+. Or it's getting used more. If not all supported branches, which bug introduced the flaw? Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? Trivial backport. Should apply cleanly How likely is this patch to cause regressions; how much testing does it need? Very unlikely. Plan to land on 58, then request 57 approval if it seems stable.
Comment 20•4 years ago
|
||
Just as a note: * We tried to reproduce the issue with a test app using the SCTP code being used in FF. We assumed that it is related to the association being terminated due to excessive retransmissions. We tested on Windows * We tested FF on Windows under the situation that an existing association is terminated due to excessive retransmissions. In both cases we could NOT reproduce the issue.
Reporter | ||
Comment 21•4 years ago
|
||
I suspect it's related to timing of shutdowns/deallocation, thread run order, and association termination. So hard to repro in an isolated test, even if you duplicate the general threading model here. That said: running it under rr in chaos mode (varying thread timing) might catch it: rr record -h ....
Updated•4 years ago
|
Comment 22•4 years ago
|
||
Comment on attachment 8916623 [details] [diff] [review] clean up SCTP shutdown sec-approval=dveditz
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Comment 23•4 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/955cbee8d3d74868e545e35180776774f24d2834
Reporter | ||
Comment 24•4 years ago
|
||
If we see a drop-off in UAFs after this ships to m-c, I'll probably ask for uplift to 57
Comment 25•4 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/955cbee8d3d7 Please request Beta approval when you're comfortable doing so.
Updated•3 years ago
|
Comment 26•3 years ago
|
||
Randel: did you wanted to leave this open, since we don't know yet if it fixes the issue? At least we should not open up this bug until we know if it's fixed, or?
Reporter | ||
Comment 27•3 years ago
|
||
leaving open for now
Reporter | ||
Comment 28•3 years ago
|
||
Note: since we rarely get any hits on this in Nightly, I'm going to ask for beta approval shortly assuming no regressions have appeared (and I expect none).
Reporter | ||
Comment 29•3 years ago
|
||
There are a total of ~20 crashes ever on 58, all before this patch landed. There are only 2 other crashes involving sctp, and those are both a non-shutdown null-deref (not related to this).
Reporter | ||
Comment 30•3 years ago
|
||
Comment on attachment 8916623 [details] [diff] [review] clean up SCTP shutdown Approval Request Comment [Feature/Bug causing the regression]: UNsure; dates back to 53 but has been spiking in frequency (perhaps because of use by ad networks to bypass ublock/etc). [User impact if declined]: dangerous sec bug may not be fixed -- this is not guaranteed to fix it; we hope it will. [Is this code covered by automated tests?]: Yes (though they don't hit this bug) [Has the fix been verified in Nightly?]: No (but no immediately-apparent negative impacts) [Needs manual test from QE? If yes, steps to reproduce]: No [List of other uplifts needed for the feature/fix]: None [Is the change risky?]: No - moves all the sctp library shutdown code to happen on a single thread instead of split across two. [Why is the change risky/not risky?]: See above. [String changes made/needed]: none
Reporter | ||
Comment 31•3 years ago
|
||
Given the rate of crashes in beta, we should know fairly quickly once in beta if this helps/solves the problem
Comment on attachment 8916623 [details] [diff] [review] clean up SCTP shutdown Sec-high, beta57+
Comment 33•3 years ago
|
||
uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/e38d82884ae8
Reporter | ||
Comment 34•3 years ago
|
||
Since this landed and uplifted I may see fewer crashes, though depending on when the beta build with this fix was done I don't believe I'd see beta crashes with this fix yet (I don't think it made b8). There is one post-landing 58 crash in zzz_AsmCodeRange_End | sowakeup: https://crash-stats.mozilla.com/report/index/608806ed-bc1f-4be6-82e8-4e3980171015 However, I'm still seeing a few (3) crashes in 58 in RtlWakeAllConditionVariable | sowakeup It's hard to tell since there are few Nightly crashes in general, but a few post-landing 58 crashes implies that this (while probably a good idea) has not solved the base problem. We'll see soon in 57b9.
Reporter | ||
Comment 35•3 years ago
|
||
mitigation attempt - avoid UAF by simply delaying connection free until any possible timer has fired. To land on inbound, and uplift (potentially to release) if the UAFs disappear, while we try to solve the actual problem. Alternatively we could delay the deletion, but MOZ_RELEASE_ASSERT on a packet send with mTransportFlow of null. (with this patch, mTransportFlow will be null while in the delay, and this will cause a null-deref crash - and thus we'll know the fix is working to get rid of UAFs, and when we find/land a real fix we can know it fixed the bug before removing the delay)
Comment 36•3 years ago
|
||
Comment on attachment 8921456 [details] [diff] [review] Ensure we wait before shutting down DataChannelConnection Review of attachment 8921456 [details] [diff] [review]: ----------------------------------------------------------------- LGTM with the one exception for my one question regarding where to null the transport. ::: netwerk/sctp/datachannel/DataChannel.cpp @@ +285,5 @@ > ASSERT_WEBRTC(mState == CLOSED); > MOZ_ASSERT(!mMasterSocket); > MOZ_ASSERT(mPending.GetSize() == 0); > + MOZ_ASSERT(!mTransportFlow); > + Remove space. @@ +365,5 @@ > +} > + > +void DataChannelConnection::DestroyOnSTSFinal() > +{ > + mTransportFlow = nullptr; Isn't the transport flow still being used by the queued runnable which tries to send a packet? I would have expected that you null this after the delay in the DataChannelConnectionShutdown().
Reporter | ||
Comment 37•3 years ago
|
||
> > +void DataChannelConnection::DestroyOnSTSFinal() > > +{ > > + mTransportFlow = nullptr; > > Isn't the transport flow still being used by the queued runnable which tries > to send a packet? > I would have expected that you null this after the delay in the > DataChannelConnectionShutdown(). Yes.... So that would actually send the (late) packet. Nulling it here means that any attempt to SendPacket after we supposedly shut down SCTP would (safely) crash, either null-deref or an assertion (in debug). Which as my comment 35 says will ensure that we know it's blocking UAFs, and when we actually fix the bug, the nullptr crashes going away will tell us it worked. If the UAFs go away *and* we get no crashes, then we know it must have been some form of "SendPacket runnable already in the queue and the cleanup jumped the queue via RUN_ON_THREAD" bug. If we care more about stability, I'd do as you say, but the absolute crash volume here is generally small, and they're already crashing (and have been).
Reporter | ||
Comment 38•3 years ago
|
||
Comment on attachment 8921456 [details] [diff] [review] Ensure we wait before shutting down DataChannelConnection [Security approval request comment] How easily could an exploit be constructed based on the patch? Not easily Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? Perhaps a bit - a hold-a-ref-for-30-seconds timer on a sec bug is a bit of a giveaway not only that there's a refcount bug there, but that it's not really fixed. My plan is if this works to request uplift to 57, or as a ride-along on a 57 point release. It may be hard to know if it works until beta given crash volumes. Which older supported branches are affected by this flaw? all If not all supported branches, which bug introduced the flaw? Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? trivial How likely is this patch to cause regressions; how much testing does it need? unlikely to cause regressions. Worst would be some type of shutdown issue (DataChannelConnection closed "late" in shutdown). I've tested locally with no problems.
Comment 39•3 years ago
|
||
sec-approval+ for the new patch on trunk. Given the new patch, shouldn't we be marking 57 and 58 as "affected" again?
Updated•3 years ago
|
Reporter | ||
Comment 40•3 years ago
|
||
yeah, they shouldn't have been marked fixed ever.
Reporter | ||
Comment 41•3 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/a3d2956de12afdf42ebab1cde8b7a35b1e3cc068
Reporter | ||
Comment 42•3 years ago
|
||
Fix missing 'override' on Notify(): https://hg.mozilla.org/integration/mozilla-inbound/rev/c03ca22ae138e0392e158d35a496ebb2ccd5cde7
Comment 43•3 years ago
|
||
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/7c385b35c6f4 for a non-stylo chrome leak, https://treeherder.mozilla.org/logviewer.html#?job_id=139502449&repo=mozilla-inbound
Comment 44•3 years ago
|
||
Oh, not just non-stylo, that just happens to be the only debug mochitest-chrome we run every push.
Reporter | ||
Comment 45•3 years ago
|
||
so this shutdown-leaks (which I was a bit concerned about) if the timer doesn't fire soon enough, and it's a long timer. Wouldn't hurt real-life use, but breaks tests. This moves the pattern to be a static list of connections in cleanup, so we can Clear() them in xpcom-will-shutdown. Verified in timer-expires and quit-before-timer cases using GDB
Comment 46•3 years ago
|
||
Comment on attachment 8922047 [details] [diff] [review] more DataChannelConnection shutdown cleanup Review of attachment 8922047 [details] [diff] [review]: ----------------------------------------------------------------- LGTM with the one little renaming for consistency. ::: netwerk/sctp/datachannel/DataChannel.cpp @@ +156,5 @@ > } > return NS_OK; > } > + > + void AddConnection(DataChannelConnection* aConnection) I find it confusing why this is called AddConnection(), but it's counter part RemoveConnectionShutdown(). I think it would make sense to name them in the same way. No strong preference which one is better.
Reporter | ||
Comment 47•3 years ago
|
||
> LGTM with the one little renaming for consistency. > > > + void AddConnection(DataChannelConnection* aConnection) > > I find it confusing why this is called AddConnection(), but it's counter > part RemoveConnectionShutdown(). I think it would make sense to name them in > the same way. No strong preference which one is better. Well, that's because one takes a DataChannelConnection, and the other takes and removes a DataChannelConnectionShutdown object (which was created by AddConnection()) Perhaps CreateConnectionShutdown() would resolve this without making it less clear.
Reporter | ||
Comment 48•3 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/ac78ac3218b1cfb82ef9aaf003ca7ea1484f5c77
Reporter | ||
Comment 49•3 years ago
|
||
fix nsTArray_base leaks
![]() |
||
Comment 50•3 years ago
|
||
Comment on attachment 8922369 [details] [diff] [review] fix leaked nsTArray Review of attachment 8922369 [details] [diff] [review]: ----------------------------------------------------------------- This works for fixing the leak and eliminating a static constructor to boot, thanks. I assume we're not concerned about RemoveConnectionShutdown being called after we've cleared sConnections?
Reporter | ||
Comment 51•3 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/29e0b65615c7945a62dd7e856f2413fa3dea987e with a few tweaks discussed in IRC
Comment 52•3 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/ac78ac3218b1 https://hg.mozilla.org/mozilla-central/rev/29e0b65615c7
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Comment 53•3 years ago
|
||
Comment on attachment 8922047 [details] [diff] [review] more DataChannelConnection shutdown cleanup Approval Request Comment [Feature/Bug causing the regression]: existing UAF [User impact if declined]: UAF crashes continue [Is this code covered by automated tests?]: yes, but they don't catch this bug [Has the fix been verified in Nightly?]: It's been on nightly for 3 days with no new crashes [Needs manual test from QE? If yes, steps to reproduce]: no [List of other uplifts needed for the feature/fix]: just the other leak-fix patch in this bug [Is the change risky?]: no [Why is the change risky/not risky?]: This has been on Nightly over the weekend, with no new crashes that I can find. I see no crashes on nightly since it landed, but the crash rate on Nightly isn't high enough to know anything for weeks. Give the overall rate of crashes with UAFs, and that this should mitigate all or at least most of them (by delaying releasing the object), I think this is worthwhile uplifting, and avoid a last-second uplift or trying to have it ride-along in a point release. We can wait a few more days before uplifting to watch Nightly (both for new crashes, and for any sign that this is working/not-working - former UAFs should still crash as safe nullptr crashes (or assertions in debug builds) [String changes made/needed]: none
Reporter | ||
Updated•3 years ago
|
Hi Dan, do these new patches require sec-approval+? Thanks!
Updated•3 years ago
|
Updated•3 years ago
|
Reporter | ||
Comment 55•3 years ago
|
||
So far on Nightly since we landed (on 10/27 in m-c) we have 1 crash with an 0xffffffffff address: https://crash-stats.mozilla.com/report/index/1b51928f-5365-443e-8991-23c6d0171031 As such, it's unclear if this is a UAF that windows crashreporter couldn't get an address for, or if this mitigation is working. However, about 50% of crashes with this signature have been showing up as 0xfffffffff, and the patch should convert any UAF crashes into either null-derefs or assertions (on debug builds). it's possible that the assumption about how a UAF occurs here is simply totally wrong. The crash volume is so low on nightly it's hard to tell.
Reporter | ||
Comment 56•3 years ago
|
||
Still only the one 58-with-patch crash, and as per before, it's not clear if that's a nullptr or a UAF ptr. There are no new 58a1 crashes with sctp anywhere on the stack since this landed, so there's no apparent regressions from the landing. (Though nightly doesn't get hit as much with anything as beta) Let's go ahead with uplift to 57. At worst it won't fix the problem (and it may not... but even that's info).
Comment on attachment 8922047 [details] [diff] [review] more DataChannelConnection shutdown cleanup Thanks for the due diligence Randell. This has baked on Nightly, sec-high, beta57+
Comment 58•3 years ago
|
||
uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/b729ddecf1d1 https://hg.mozilla.org/releases/mozilla-beta/rev/ba08d4e02e0c Are we ready to call this fixed yet? Given where we are in the cycle, it might not be a bad idea to move any follow-up work to a new bug at this point.
Comment 59•3 years ago
|
||
Backed out from Beta for bustage. Please post a rebased patch when you get a chance. https://hg.mozilla.org/releases/mozilla-beta/rev/14581cf06e7f https://treeherder.mozilla.org/logviewer.html#?job_id=141530546&repo=mozilla-beta
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Comment 60•3 years ago
|
||
trivial fix to make this apply to beta
Reporter | ||
Updated•3 years ago
|
Updated•3 years ago
|
Comment 61•3 years ago
|
||
uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/d1ee467027fd https://hg.mozilla.org/releases/mozilla-release/rev/589136f15b93
Reporter | ||
Comment 62•3 years ago
|
||
There are about 6 58a1 crashes so far. All have 0xffffffff as the address, so no negative or positive confirmation yet, though 6 0xffffffff's is starting to look good, given that some of the signatures hit have a 50% UAF ptr rate.
Reporter | ||
Comment 63•3 years ago
|
||
It appears there's no joy - https://crash-stats.mozilla.com/report/index/3c5c8bbd-ed2f-45b4-8518-3516e0171104 shows we still UAF even with a 30-second timeout on Connection destruction. Also https://crash-stats.mozilla.com/report/index/7e10d4b3-0472-484c-89ca-762650171104 and https://crash-stats.mozilla.com/report/index/700f4769-7d6a-49df-acdd-3ee3b0171104 (finally hit a non-0xfffffffff crash on that signature). We could back out the patch, since it doesn't seem to have helped. OTOH is also doesn't seem to have hurt, and I don't know if the churn is worth it
Ritu, fyi since we may have an RC3 build.
I chatted with Dan and the crash volume seems to have down a bit so it may have helped some. I don't believe this is worth the churn. Let's re-review post 57 launch if this is still a concern.
Comment 66•3 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #63) > We could back out the patch, since it doesn't seem to have helped. It may have helped a tiny bit. Searching on sowakeup in the proto signature each of the top two signatures were getting about 50 crashes a week on 57beta, and in the last week got only 40 each. Some of the other signatures didn't budge much. Maybe it's just noise? More time will tell. > OTOH italso doesn't seem to have hurt, and I don't know if the churn is worth it Churn is never worth it in Release Candidates. The criteria is whether it's point-release-bad hurting us.
Reporter | ||
Comment 67•3 years ago
|
||
Taking off tracking, but if we get a fix I'd consider it for any point release.
Reporter | ||
Comment 68•3 years ago
|
||
Relanded the SCTP library update, with a patch that I think fixes this bug from bug 1421963. I plan to uplift that second patch (which would also apply without the SCTP library update from bug 1297418). That issue would explain these crashes that all seem related to channel resets.
Updated•3 years ago
|
From comment 68, looks like this was fixed in 59 by the patch in bug 1421963. jesup, is that accurate? Should this be marked fixed in 59, wontfix for 58? Or is there some remaining issue? Is there a way to check the fixes for this bug and the duplicates?
Reporter | ||
Comment 71•3 years ago
|
||
No, that fix appears related but did not fix this signature -- this has a different pathway to cause a problem (timer-kicked-off)
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Updated•3 years ago
|
Wontfix for 59 after discussion with jesup on irc. We could still take a patch for 60 though.
Comment 75•3 years ago
|
||
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #73) > Wontfix for 59 after discussion with jesup on irc. We could still take a > patch for 60 though. That's becoming more and more unlikely. Any updates, Randell? :)
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 77•2 years ago
|
||
Updating tracking flags as we get closer to the 64 release.
Updated•2 years ago
|
Reporter | ||
Comment 78•2 years ago
|
||
Very probably caused Bug 1521304
Reporter | ||
Updated•2 years ago
|
Updated•2 years ago
|
Reporter | ||
Comment 79•2 years ago
|
||
Does not appear fixed by bug 1521304, as we have no drop in crashes in beta 8 or 9.
This may well be a race condition within the SCTP library, where we're tearing down the stack from socketthread or mainthread (with our lock), and it's racing against an SCTP timer-thread abort.
The e5e5 crashes on accessing sb_cond imply it was deleted. sodealloc() deletes the sb_cond, and also destroys the lock (DeleteCriticalSection, which requires the lock not be owned), so we know sodealloc assumes all access is ended.
in sofree():
/*
* From this point on, we assume that no other references to this
* socket exist anywhere else in the stack. Therefore, no locks need
* to be acquired or held.
*
* We used to do a lot of socket buffer and socket locking here, as
* well as invoke sorflush() and perform wakeups. The direct call to
* dom_dispose() and sbrelease_internal() are an inlining of what was
* necessary from sorflush().
*
* Notice that the socket buffer and kqueue state are torn down
* before calling pru_detach. This means that protocols shold not
* assume they can perform socket wakeups, etc, in their detach code.
*/
sodealloc(so);
When a DataChannel closes, we call usrsctp_close()->soabort(so) (locks) ->sofree(so)->sctp_close(so) (not under lock anymore), which calls SCTP_OS_TIMER_STOP(hb_timer).
sofree() then calls sodealloc(), which frees the sb_cond and then frees so.
SCTP_OS_TIMER_STOP does:
SCTP_TIMERQ_LOCK();
/*
* Don't attempt to delete a callout that's not on the queue.
*/
if (!(c->c_flags & SCTP_CALLOUT_PENDING)) {
c->c_flags &= ~SCTP_CALLOUT_ACTIVE;
SCTP_TIMERQ_UNLOCK();
return (0);
}
c->c_flags &= ~(SCTP_CALLOUT_ACTIVE | SCTP_CALLOUT_PENDING);
if (c == sctp_os_timer_next) {
sctp_os_timer_next = TAILQ_NEXT(c, tqe);
}
TAILQ_REMOVE(&SCTP_BASE_INFO(callqueue), c, tqe);
SCTP_TIMERQ_UNLOCK();
sctp_handle_tick() does:
c->c_flags &= ~SCTP_CALLOUT_PENDING;
SCTP_TIMERQ_UNLOCK();
c_func(c_arg);
SCTP_TIMERQ_LOCK();
so if the timer is sitting in handle_tick() at c_func(), and on another thread we call SCTP_OS_TIMER_STOP(), the OS_TIMER_STOP can get the lock, and finds that the timer is no longer pending, and so unlocks returns (and per above, we can then call sodealloc() and free the cond var and the so. And then the timer thread wakes up, and calls c_func, which is sctp_timeout_handler() (etc...)
We cannot delete a socket (and free it's condvars) until we have positive confirmation that the timer has been removed and cannot be called - including being in the process of being called.
This could be a lock (though I'm sure there are reasons why the timer code unlocks before calling the callbacks; likely because that can manipulate the timer list). There are other ways you could synchronize, but they all amount to "if it's currently processing the timer you're removing, you have to wait".
It's possible I have some details wrong but this seems like a clear hole to me, and almost certainly the cause here.
NI various, including dminor and drno as FYI, since this is a top-UAF bug.
Comment 80•2 years ago
|
||
Nico has been looking at DataChannels lately.
Comment 81•2 years ago
|
||
Would love to help but I don't fully comprehend the threading model in Firefox.
Reporter | ||
Comment 82•2 years ago
|
||
This problem shouldn't depend on the threading model; any case where these occur on different threads and aren't covered by locks can cause this problem. The timer code runs on a different thread. Hitting this bug requires unfortunate timing, as is the norm with this sort of bug.
reiterating:
"We cannot delete a socket (and free it's condvars) until we have positive confirmation that the timer has been removed and cannot be called - including being in the process of being called.
This could be a lock (though I'm sure there are reasons why the timer code unlocks before calling the callbacks; likely because that can manipulate the timer list). There are other ways you could synchronize, but they all amount to "if it's currently processing the timer you're removing, you have to wait"."
Updated•2 years ago
|
Comment 83•2 years ago
|
||
To give a status update: a patch for the problem (which we should be able to cheery pick) is currently actively being worked on by our SCTP friends.
Comment 84•2 years ago
|
||
... to be more precise: Peter Lei (he is on this bug report) is working on it and testing to make sure it doesn't break anything.
Once that is finished, he will provide the link to the patches required to integrate.
Comment 85•2 years ago
|
||
The latest sctp library code should hopefully fix this- if you're not taking the full library update, you'll want to cherry pick the following commits (in "git log" order, so apply bottom up):
21943f4 Fix callout stop deadlock
faf6330 Fix for missing timer wait wakeups
d225e88 Fix for rescheduling a timer that is currently in progress.
164ca60 Pull in missing guard for the currently executing callout (v2)
93fe1b3 Fix windows build break
a7c2902 Pull in missing guard for the currently executing callout
Reporter | ||
Comment 86•2 years ago
|
||
Peter - thanks for diving in on this!
One typo:
+#define SCTP_TIMERWIT_UNLOCK() KASSERT(pthread_mutex_unlock(&sctp_os_timerwait_mtx) == 0, ("%s: sctp_os_timerwait_mtx not locked", func))
I think "TIMERWAIT"
Assignee | ||
Comment 87•2 years ago
|
||
This patch can be applied under the cherry pick patch set to get it to cleanly apply (but not compile).
Assignee | ||
Comment 88•2 years ago
|
||
Apply under the cherry pick patch queue.
Comment 89•2 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #86)
Peter - thanks for diving in on this!
One typo:
+#define SCTP_TIMERWIT_UNLOCK() KASSERT(pthread_mutex_unlock(&sctp_os_timerwait_mtx) == 0, ("%s: sctp_os_timerwait_mtx not locked", func))I think "TIMERWAIT"
Ah, yes thanks! I don't think we normally build with INVARIANTS (hence missed that) but will fix that.
Assignee | ||
Comment 90•2 years ago
|
||
Peter, is the only additional patch that needs to be pulled in dfe89db Fix compile breakage under INVARIANTS
?
Comment 91•2 years ago
|
||
Nico, I think that is all you need.
We are currently improving our continuous integration tests for usrsctp to catch such kind of compile errors and adding other tests. That is why the status of the tests might be reported as failed on our website.
Comment 92•2 years ago
|
||
(In reply to Nico Grunbaum [:ng] from comment #90)
Peter, is the only additional patch that needs to be pulled in
dfe89db Fix compile breakage under INVARIANTS
?
Not sure if you absolutely require it; if INVARIANTS is defined, you'd need it as the build would be currently breaking. But yes, this is the only other commit that I know of w.r.t. the thread race fixes beyond the original set I posted.
Assignee | ||
Comment 93•2 years ago
|
||
Assignee | ||
Comment 94•2 years ago
|
||
One of the patches in this set mentions fixing a timer deadlock, we could perhaps use that as a cover patch.
Assignee | ||
Comment 95•2 years ago
|
||
Assignee | ||
Comment 96•2 years ago
|
||
timer deadlock fix
Assignee | ||
Comment 97•2 years ago
|
||
Comment on attachment 9059326 [details]
Bug 1400563 - timer deadlock fix
Security Approval Request
- How easily could an exploit be constructed based on the patch?: It is probably difficult to create an exploit with knowledge of the bug. There are a number of comments in the patch that indicate that there was a timer shutdown race. Removing those comments from the patch wouldn't have much effect as the patch comes from the upstream 3rd party library, and one can see them on GitHub.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Yes
- Which older supported branches are affected by this flaw?: All
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: No
- If not, how different, hard to create, and risky will they be?: It should be easy as the effected code is altered vary rarely.
- How likely is this patch to cause regressions; how much testing does it need?: It is possible that it may cause regressions, it needs time to bake on Nightly.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Comment 98•2 years ago
|
||
Comment on attachment 9059326 [details]
Bug 1400563 - timer deadlock fix
sec-approval+ for trunk so it can get some bake time.
Updated•2 years ago
|
Reporter | ||
Comment 100•2 years ago
|
||
Note, however, that we see few of these in Nightly; many more in beta:
But if you restrict to 68, there's only 1 crash other than OOM.
Comment 101•2 years ago
|
||
With not a single crash in any of the 68 beta builds I'm tempted to close this. But we probably want to wait until 68 becomes release before we know for sure.
Updated•2 years ago
|
Comment 102•2 years ago
|
||
Not a single reported crash since 68 (which contains the patches) went into release. Looks like we are getting pretty close to calling this fixed!
I think we can now call this fixed, checking all the signatures above, no crashes in 68 or beyond for the last month.
Al did this ever get a security advisory? I just looked in https://www.mozilla.org/en-US/security/advisories/mfsa2019-21/ and didn't see it there.
Comment 105•2 years ago
|
||
(In reply to Liz Henry (:lizzard) from comment #104)
Al did this ever get a security advisory? I just looked in https://www.mozilla.org/en-US/security/advisories/mfsa2019-21/ and didn't see it there.
It did not because the fields were all changed after we shipped on July 9. I've added it to the roll up, where it should go.
Updated•2 years ago
|
Updated•9 months ago
|
Description
•