From crash-stats: https://crash-stats.mozilla.com/signature/?signature=nr_ice_peer_ctx_destroy_cb&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1

Looking at the stack, the ice_ctx seems to have been freed. There is only one place that leads to a call to this function: https://dxr.mozilla.org/mozilla-central/source/media/mtransport/nricectx.cpp?from=~NrIceCtx&case=true#524

Here we end up scheduling two timers: the first to destroy the ice_peer_ctx, and the second to free the ice_ctx. In that order, we're fine, but if these timers somehow execute in the opposite order, we end up seeing crashes like the ones above. Given that I trust the timer stuff about as far as I can throw it (which has not proven to be very far, since my trash can is a scant 2 feet from me), we should probably not trust these to happen in the order we schedule them.
It is also possible that we're messing up our refcount on NrIceCtx somehow, resulting in the d'tor running twice. This would explain some other UAFs we're seeing where the "free" only happens during teardown of the whole NrIceCtx.
FWIW, I have repeatedly assumed that things do happen in the right order. Should we rewrite NR_async_timer_set() to enforce that? It wouldn't be hard.
Clearing ni per discussion with ekr.
So, reading and thinking some more: since we're crashing on poisoned memory down in nr_ice_peer_ctx_destroy_cb, if ~NrIceCtx() is running for the second time, I would expect us to crash due to poisoned memory here: https://dxr.mozilla.org/mozilla-central/source/media/mtransport/third_party/nICEr/src/ice/ice_peer_ctx.c#497

Also, given that NrIceCtx has virtual functions, I would expect the vptr to be poisoned as well, which means we probably wouldn't even be able to invoke the d'tor (I don't see how the compiler could know the static type, and thereby avoid following the vptr, anywhere other than in NrIceCtx::Create()).
Actually, there's a third possibility: we're crashing here:

STAILQ_REMOVE(&pctx->ctx->peers, pctx, nr_ice_peer_ctx_, entry);

It is possible that one of the peers in the tailq has been destroyed, causing its next ptr to have a crashy value.
More reading: it seems that the scenario in comment 5 isn't going to arise, since I can't see how we would create more than one ice_peer_ctx; if the only ice_peer_ctx were freed, we would crash earlier.
Regarding timer ordering:

* These are both scheduled on this thread with a 0-delta timeout.
* They may both have the same timeout target, or the second may be larger than the first.
* Timers are inserted using the TimerAdditionComparator, which returns LessThan() == true for any timer with a lower *or* equal timeout (which should ensure that two calls to schedule a timer with target == N will be inserted in the 'correct' order).
** Timers inserted with target N when the current time is N+M will be inserted *after* all timers with time <= N+M (i.e. inserted as if the target was N+M, not N). This would not change the ordering of timers inserted with a target of N; it does mean that timers scheduled with *different* timeouts can execute in an order other than the one intended. I.e. InsertTimer(N+X); InsertTimer(N) may execute with N+X first in the case where InsertTimer(N) happens at or after N+X.
* I see nothing in the IDL for nsITimer that makes in-order firing of timers started with the same target tick a property of the API contract, though it's likely that other bits of code rely on this property as well (and it's a reasonable property to provide). The change here would be to add a comment to the IDL specifying this behavior.
** I don't know why the "insert-after-expired-timers" bit was done, though I have some guesses. We should detail why, and perhaps expose that in the API contract as well.
** If we do this, we should (in DEBUG) MOZ_ASSERT that the timer following the insertion point has a timeout greater than the inserted one (roughly MOZ_ASSERT(mTimers[insert_point+1].timeout > timeout)).
* I see nothing in the TimerThread impl that would change ordering.

tldr: From inspection, the timers should execute in the correct order -- but this is purely inspection.

NI froyd/ehsan about the timer API issues above for any insight (bz has also delved into this code IIRC) - especially about the insert-after-expired-timers bit, which *can* cause inversion of out-of-order insertion with different timeouts (not this case).
I certainly expected timers to fire in order, based on your point #3. However, Byron and I have decided, out of an abundance of caution, to special-case the NR_async_timer_set(0) code to guarantee in-order execution. See bug 1151080.
I'm not really familiar with the history of why we treat insert after expired timers this way.
Crash Signature: [@ nr_ice_peer_ctx_destroy_cb]
After some spelunking, it looks like we started inserting timers after expired timers in bug 252324, specifically bug 252324 comment 1 and comment 2. Documenting the raft of things in comment 7 seems perfectly reasonable to me; it's silly to have to go look at logs and blame from a decade ago every time we try to figure out things about timers.
(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #10)
> After some spelunking, it looks like we started inserting timers after
> expired timers in bug 252324, specifically bug 252324 comment 1 and comment 2.

It seems like this was mostly an avoid-overflow/etc fix, not a "we want to explicitly have it fire after the expired timers" reason. There are other ways to deal with the numerical issues here (even if simply a test for overflow). Likely this should spin into a separate bug.

> Documenting the raft of things in comment 7 seem perfectly reasonable to me;
> it's silly to have to go look at logs and blame for a decade ago every time
> we try to figure out things about timers.

Absolutely. And it should spin into a bug.
Byron's back from PTO on Monday, and he's the best person to work this.
Assignee: nobody → docfaraday
Priority: -- → P1
Note: I just landed the timer ordering patch in bug 1151080, so we should now have any timeout(0) events happening in order.
(In reply to Eric Rescorla (:ekr) from comment #13)
> Note: I just landed the timer ordering patch in bug 1151080, so we should
> now have any timeout(0) events happening in order.

Per comment 7 (if the analysis is correct - I just rechecked it, and it still seems correct), they should have been in order even without this. But, as you mention, this is not specified in the API, so better safe for now.

So we're still trying to figure out what's up here, assuming we have no change in the crash stats after this landing.
(In reply to Randell Jesup [:jesup] from comment #14)
> (In reply to Eric Rescorla (:ekr) from comment #13)
> > Note: I just landed the timer ordering patch in bug 1151080, so we should
> > now have any timeout(0) events happening in order.
>
> Per comment 7 (if the analysis is correct - I just rechecked it, and it
> still seems correct), they should have been in order even without this.

Yes, I generally agree with this. My point was simply that if we were wrong in that analysis, then it should be fixed now.

> But, as you mention, this is not specified in the API, so better safe for
> now. So we're still trying to figure out what's up here, assuming we have
> no change in the crash stats after this landing.

Yes, I agree.
No sign of this on crash-stats for 39 or later. Might be fixed. I guess we keep an eye on it when 39 is in release.
I see crashes for 39b1, b2, and b3. Example: https://crash-stats.mozilla.com/report/index/b8a8ba57-ab8d-4567-b327-aa09d2150608

Crashed at ice_peer_ctx.c:480 doing STAILQ_REMOVE with a 0x5a signature again:

if (pctx->ctx)
    STAILQ_REMOVE(&pctx->ctx->peers, pctx, nr_ice_peer_ctx_, entry);

So, not fixed, I think.
Ok, so at least we didn't fix it by accident, which would have been scary. I guess we'll have to see whether bug 1151080 helps in 40...
(In reply to Byron Campen [:bwc] from comment #18)
> Ok, so at least we didn't fix it on accident, which would have been scary. I
> guess we'll have to see whether bug 1151080 helps in 40...

While it might change the timing (and might make this easier or harder to hit), the previous analysis here indicates that bug 1151080 shouldn't fundamentally change anything, and shouldn't change the actual ordering. Of course, if we have no ideas on how to resolve this... Perhaps we should rewrite it, or litter it with beartraps, to find or at least make-safe-crash the issue.
Nils -- to rebalance the "P1" bug load, I'm moving this one over to you.
Assignee: docfaraday → drno
Nils, might this be incidentally fixed by the other work?
No crashes on 40 release as of today. I'm betting this is fixed.
No crashes on 40+ in the last month. It's possible, though very unlikely, that bug 1178890 also contributed to a fix here, in that that bug *could* have caused reorderings -- though only in the rare case of sleep wakeup, which would be especially odd in the case of an active ICE gathering.

I'm closing this for now. We may still want to consider adding assertions about timer firing order somewhere (either in mtransport, or in timers).
Status: NEW → RESOLVED
Resolution: --- → WORKSFORME