1151046 - UAF in nr_ice_peer_ctx_destroy_cb

Reporter

Description

•

9 years ago

From crash-stats:

https://crash-stats.mozilla.com/signature/?signature=nr_ice_peer_ctx_destroy_cb&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1

Looking at the track, the ice_ctx seems to have been freed.

There is only one place that leads to a call to this function:

https://dxr.mozilla.org/mozilla-central/source/media/mtransport/nricectx.cpp?from=~NrIceCtx&case=true#524

Here we end up scheduling two timers, the first to destroy the ice_peer_ctx, and the second to free the ice_ctx. In this order, we're fine, but if somehow these timers execute in the opposite order we end up seeing a crash like the ones above. 

Given that I trust the timer stuff as far as I can throw it (which has not proven to be very far, since my trash can is a scant 2 feet from me), we should probably not trust these to happen in the order we schedule them in.

Byron Campen [:bwc]

Reporter

Comment 1

•

9 years ago

It is also possible that we're messing up our refcount on NrIceCtx somehow, resulting the the d'tor running twice. This would explain some other UAFs we're seeing where the "free" only happens during teardown of the whole NrIceCtx.

Eric Rescorla (:ekr)

Comment 2

•

9 years ago

FWIW, I have repeatedly assumed that things do happen in the right order.
Should we rewrite NR_async_timer_set() to enforce that? It wouldn't be
hard.

Flags: needinfo?(docfaraday)

Byron Campen [:bwc]

Reporter

Comment 3

•

9 years ago

Clearing ni per discussion with ekr.

Flags: needinfo?(docfaraday)

Byron Campen [:bwc]

Reporter

Comment 4

•

9 years ago

So, reading and thinking some more, since we're crashing because of poisoned memory down in nr_ice_peer_ctx_destroy_cb, if ~NrIceCtx() is running for the second time, I would expect us to crash due to poisoned memory here:

https://dxr.mozilla.org/mozilla-central/source/media/mtransport/third_party/nICEr/src/ice/ice_peer_ctx.c#497

Also, given that NrIceCtx is virtual, I would expect the vptr to be poisoned as well, which means we probably wouldn't even be able to invoke the d'tor (I don't see how the compiler would know the static type, and be able to avoid following the vptr, anywhere other than in NrIceCtx::Create()).

Byron Campen [:bwc]

Reporter

Comment 5

•

9 years ago

Actually, there's a third possibility here: we're crashing here.

      STAILQ_REMOVE(&pctx->ctx->peers, pctx, nr_ice_peer_ctx_, entry);

It is possible that one of the peers in the tailq has been destroyed, causing its next ptr to have a crashy value.

Byron Campen [:bwc]

Reporter

Comment 6

•

9 years ago

More reading, it seems that the scenario in comment 5 isn't going to arise, since I can't see how we would create more than one ice_peer_ctx; if the only ice_peer_ctx is freed, we would crash earlier.

Randell Jesup [:jesup] (needinfo me)

Comment 7

•

9 years ago

Regarding timer ordering:
* These are both scheduled on this thread with a 0 delta timeout
* They may both have the same timeout target, or the second may be larger than the first
* Timers are inserted using the TimerAdditionComparator, which returns LessThan() == true for any timer with a lower *or* equal timeout (which should ensure that two calls to schedule a timer with target == N will be inserted in the 'correct' order)
** Timers inserted with target N when the current time is N+M will be inserted *after* all timers with time <= N+M (i.e. inserted as if the target was N+M, not N).  This would not change the ordering of timers inserted with a target of N; it does mean that scheduling of timers with *different* timeouts can execute in an order different than was intended.  I.e. InsertTimer(N+X); InsertTimer(N) may execute with N+X first in the case where InsertTimer(N) happens after or at N+X.
* I see nothing in the IDL for nsITimer that makes in-order firing of timers started with the same target tick a property of the API contract, though it's likely that other bits of code rely on this property as well (and it's a reasonable property to provide).  The change here would be to add a comment to the IDL specifying this behavior.
** I don't know why the "insert-after-expired-timers" bit was done, though I have some guesses.  We should detail why, and perhaps expose that in the API contract as well.
** If we do this, we should (in DEBUG) MOZ_ASSERT that the timer following the insertion point is > than the inserted time (MOZ_ASSERT(mTimers[insert_point+1].timeout > timeout) roughly).
* I see nothing in the TimerThread impl that would change ordering

tldr: From inspection, the timers should execute in the correct order -- but this is purely inspection

NI froyd/ehsan about the timer API issues above for any insight (bz has also delved into this code IIRC) - especially about the insert-after-expired-timers bit, which *can* cause inversion of out-of-order insertion with different timeouts (not this case)

Flags: needinfo?(nfroyd)

Flags: needinfo?(ehsan)

Eric Rescorla (:ekr)

Comment 8

•

9 years ago

I certainly expected timers to fire in order, based on your point #3.

However, Byron and i have decided out of an abundance of caution to
special-case the NR_async_timer_set(0) code to guarantee in-order
execution. See issue 1151080

(no longer active)

Comment 9

•

9 years ago

I'm not really familiar with the history of why we treat insert after expired timers this way.

Flags: needinfo?(ehsan)

Al Billings [:abillings - ex-MoCo]

Updated

•

9 years ago

Keywords: csectype-uaf, sec-high

Byron Campen [:bwc]

Reporter

Updated

•

9 years ago

Crash Signature: [@ nr_ice_peer_ctx_destroy_cb]

Nathan Froyd [:froydnj]

Comment 10

•

9 years ago

After some spelunking, it looks like we started inserting timers after expired timers in bug 252324, specifically bug 252324 comment 1 and comment 2.

Documenting the raft of things in comment 7 seem perfectly reasonable to me; it's silly to have to go look at logs and blame for a decade ago every time we try to figure out things about timers.

Flags: needinfo?(nfroyd)

Randell Jesup [:jesup] (needinfo me)

Comment 11

•

9 years ago

(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #10)
> After some spelunking, it looks like we started inserting timers after
> expired timers in bug 252324, specifically bug 252324 comment 1 and comment 2.

It seems like this was mostly an avoid-overflow/etc fix, not a "we want to explicitly have it fire after the expired timers" reason.  There are other ways to deal with the numerical issues here (even if simply a test for overflow).  Likely this should spin into a separate bug

> Documenting the raft of things in comment 7 seem perfectly reasonable to me;
> it's silly to have to go look at logs and blame for a decade ago every time
> we try to figure out things about timers.

Absolutely.  And it should spin into a bug.

Maire Reavy [:mreavy]

Comment 12

•

9 years ago

Byron's back from PTO on Monday, and he's the best person to work this.

Assignee: nobody → docfaraday

Rank: 10

Priority: -- → P1

Eric Rescorla (:ekr)

Comment 13

•

9 years ago

Note: I just landed the timer ordering patch in bug 1151080, so we should now have any timeout(0) events happening in order.

Randell Jesup [:jesup] (needinfo me)

Comment 14

•

9 years ago

(In reply to Eric Rescorla (:ekr) from comment #13)
> Note: I just landed the timer ordering patch in bug 1151080, so we should
> now have any timeout(0) events happening in order.

Per comment 7 (if the analysis is correct - I just rechecked it, and it still seems correct), they should have been in order even without this.  But, as you mention, this is not specified in the API, so better safe for now.  So we're still trying to figure out what's up here, assuming we have no change in the crash stats after this landing.

Eric Rescorla (:ekr)

Comment 15

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #14)
> (In reply to Eric Rescorla (:ekr) from comment #13)
> > Note: I just landed the timer ordering patch in bug 1151080, so we should
> > now have any timeout(0) events happening in order.
> 
> Per comment 7 (if the analysis is correct - I just rechecked it, and it
> still seems correct), they should have been in order even without this.

Yes, I generally agree with this. My point was simply that if we were
wrong in that analysis, then it should be fixed now. 


> But, as you mention, this is not specified in the API, so better safe for
> now.  So we're still trying to figure out what's up here, assuming we have
> no change in the crash stats after this landing.

Yes, I agree.

Daniel Veditz [:dveditz]

Updated

•

9 years ago

Group: media-core-security

Byron Campen [:bwc]

Reporter

Comment 16

•

9 years ago

No sign of this on crash-stats for 39 or later. Might be fixed. I guess we keep an eye on it when 39 is in release.

Randell Jesup [:jesup] (needinfo me)

Comment 17

•

9 years ago

I see crashes for 39b1, b2, and b3.  Example:

https://crash-stats.mozilla.com/report/index/b8a8ba57-ab8d-4567-b327-aa09d2150608

Crashed at ice_peer_ctx.c:480 doing STAILQ_REMOVE with a 0x5a signature again:
    if (pctx->ctx)
      STAILQ_REMOVE(&pctx->ctx->peers, pctx, nr_ice_peer_ctx_, entry);

So, not fixed I think.

Flags: needinfo?(docfaraday)

Byron Campen [:bwc]

Reporter

Comment 18

•

9 years ago

Ok, so at least we didn't fix it on accident, which would have been scary. I guess we'll have to see whether bug 1151080 helps in 40...

Flags: needinfo?(docfaraday)

Randell Jesup [:jesup] (needinfo me)

Comment 19

•

9 years ago

(In reply to Byron Campen [:bwc] from comment #18)
> Ok, so at least we didn't fix it on accident, which would have been scary. I
> guess we'll have to see whether bug 1151080 helps in 40...

While it might change the timing (and might make this easier or harder to hit), the previous analysis here would indicate that bug 1151080 shouldn't fundamentally change anything and shouldn't change the actual ordering.  Of course, if we have no ideas on how to resolve this...  Perhaps we should rewrite it or litter it with beartraps to find or at least make-safe-crash the issue.

Maire Reavy [:mreavy]

Updated

•

9 years ago

backlog: --- → webRTC+

Maire Reavy [:mreavy]

Comment 20

•

9 years ago

Nils -- to rebalance the "P1" bug load, I'm moving this one over to you.

Assignee: docfaraday → drno

Byron Campen [:bwc]

Reporter

Updated

•

9 years ago

Comment 21

•

9 years ago

Nils, might this be incidentally fixed by the other work?

Flags: needinfo?(drno)

Byron Campen [:bwc]

Reporter

Comment 22

•

9 years ago

No crashes on 40 release as of today. I'm betting this is fixed.

BMO Automation

Updated

•

9 years ago

Group: core-security

Randell Jesup [:jesup] (needinfo me)

Comment 23

•

9 years ago

No crashes on 40+ in the last month.  It's possible though very unlikely that bug 1178890 also contributed to a fix here, in that that bug *could* have caused reorderings -- though only in the rare case of sleep wakeup, which would be especially odd in the case of an active ICE gathering.

I'm closing this for now.  We may still want to consider adding assertions as to timer firing ordering somewhere (either in mtransport, or timers)

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → WORKSFORME

Daniel Veditz [:dveditz]

Updated

•

9 years ago

Group: media-core-security

Nils Ohlmeier [:drno]

Assignee

Updated

•

8 years ago

Flags: needinfo?(drno)