Closed Bug 1147857 Opened 8 years ago Closed 8 years ago

crash in EverySecondTelemetryCallback_s on NSFW video site

Categories

(Core :: WebRTC: Signaling, defect, P1)

All
Android
defect

Tracking


VERIFIED FIXED
mozilla39
Tracking Status
firefox37 + verified
firefox38 + verified
firefox39 --- verified
relnote-firefox --- 37+

People

(Reporter: kairo, Assigned: jesup)

References

Details

(Keywords: crash, csectype-nullptr)

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-13191f00-053e-492e-97e0-deb302150323.
=============================================================

I feel dirty even filing this bug (see below) but it is the #1 crash on Firefox 36 for Android right now and #2 in Firefox 37 Beta for Android. That said, it affects desktop as well.

The unfortunate (and dirty) piece is that while this seems pretty isolated to specific websites, those sites are heavy on NSFW videos and imagery, and I will not post their URLs here.

Click the links in the Crash Signature field in this bug for more stats and reports.
It really is using WebRTC: thread 11 is in sipcc::SendLocalIceCandidateToContentImpl/OnIceCandidate, and is converting a string.
I don't see any other webrtc threads, though.

Crash appears to be a null-deref in PeerConnectionCtx.cpp:262
 auto& r = *(*q)->report;
So the next line uses r. I don't see how q could be null, but I can see how report could be null:

BuildStatsQuery_m sets ->report to new RTCStatsReportInternalConstruct(...). However, before that there are three kickouts: (1) if (!HasMedia()) { return NS_OK; }, which can't happen since we call HasMedia() before building the stats report; (2) a check for !mMedia->ice_ctx || !mThread, possible if this is some type of startup race (my bet as to the cause); (3) nsGlobalWindow::GetPerformance() failing, though I'm not sure exactly what can cause this; perhaps mDoc->GetNavigationTiming() failing.

The two possible failures both return non-OK rv's, so it should be testable, but right now we don't check the result of BuildStatsQuery_m() in EverySecondTelemetryCallback_m().

Obvious solution (if this is indeed the bug) is to do roughly:
if (NS_FAILED(p->second->BuildStatsQuery_m(nullptr, queries->back()))) {
  queries->pop_back(); // nsAutoPtr<>, so it should delete it
}

The second question would be "is it reasonable that this fail due to a race, and if so does the race imply any other issues, or does it just say 'too early to look for stats'?"

I'm afraid verifying this might require testing with signaling logs active to see what CSFLogError() pops out.
Rank: 10
Flags: needinfo?(jib)
Flags: needinfo?(docfaraday)
Flags: firefox-backlog+
Priority: -- → P1
Attached patch be careful about stats query creation (obsolete) — — Splinter Review
Possible solution
Attachment #8584589 - Flags: review?(jib)
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
Comment on attachment 8584589 [details] [diff] [review]
be careful about stats query creation

Review of attachment 8584589 [details] [diff] [review]:
-----------------------------------------------------------------

Improved error handling is always r=me.

But mIceCtx is nulled on the socket transport thread on teardown, suggesting a race, so more patches may be needed before closing this issue.

Some locking seems needed unless other events guarantee order here. Still investigating.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.cpp
@@ +289,5 @@
>          p != ctx->mPeerConnections.end(); ++p) {
>      if (p->second->HasMedia()) {
>        queries->append(nsAutoPtr<RTCStatsQuery>(new RTCStatsQuery(true)));
> +      if (NS_WARN_IF(NS_FAILED(p->second->BuildStatsQuery_m(nullptr, // all tracks
> +                                                            queries->back())))) {

I always shudder when I see code inside assertion macros, as I have to double-check whether they're no-ops in release or still evaluate their condition.

It should really be called NS_RELEASE_WARN_IF imho. Could we do an nsresult rv temporary here, just for extra sanity?

Nit: As mentioned on IRC, I'd also prefer the rarely-taken codepath to do as little as possible, so I'd rather it skip the push than do a pop_back; then only good queries are ever on the list. But either way.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +2811,5 @@
>  
>    if (!mMedia->ice_ctx() || !mThread) {
>      CSFLogError(logTag, "Could not build stats query, critical components of "
> +                "PeerConnectionImpl not set (ice_ctx=%p, mThread=%p)",
> +                mMedia->ice_ctx().get(), mThread.get());

If this is a race then pointers may have changed by the time they're read the second time in the CSFLogError call. Since it seems we're only emitting pointers here to tell which one failed, I think I'd prefer separate ifs and CSFLogErrors without pointers.
Attachment #8584589 - Flags: review?(jib) → review+
now with extra safety
Attachment #8584589 - Attachment is obsolete: true
Attachment #8584643 - Flags: review?(jib)
Comment on attachment 8584643 [details] [diff] [review]
be careful about stats query creation

Review of attachment 8584643 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with bail on empty queries.

EverySecondTelemetryCallback_m will often come in after things are shut down, because its timer is cancelled quite late, but if its checks on entry are sound then it should just exist fine. It checks PeerConnectionCtx::IsActive() and mPeerConnections.empty() on entry so as far as main thread goes it should be good. The one race seems to be with mIceCtx on sts, but I think with your change to hold a threadsafe ref it should be ok to proceed for what it does with it (extracting stats for building a request list for sts).

How sts will react when it gets this stats request this late I don't know, but since we support getting stats after close we have to deal with requests late, so presumably we do. In any case the fix here should fix the immediate crash.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.cpp
@@ +298,3 @@
>      }
>    }
>    rv = RUN_ON_THREAD(stsThread,

With this change, queries may now be empty here, whereas before this wasn't possible thanks to the !mPeerConnections.empty() check earlier in this function.

STS can probably handle an empty list, but it hasn't had to before, so it may be more conservative to bail here if queries is empty and not dispatch to STS in that case.
Attachment #8584643 - Flags: review?(jib) → review+
urr s/exist/exit/
Flags: needinfo?(jib)
Comment on attachment 8584643 [details] [diff] [review]
be careful about stats query creation

Review of attachment 8584643 [details] [diff] [review]:
-----------------------------------------------------------------

Drive-by review, just wanted to point out a couple of things.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.cpp
@@ +288,5 @@
>    for (auto p = ctx->mPeerConnections.begin();
>          p != ctx->mPeerConnections.end(); ++p) {
>      if (p->second->HasMedia()) {
>        queries->append(nsAutoPtr<RTCStatsQuery>(new RTCStatsQuery(true)));
> +      if (NS_WARN_IF(NS_FAILED(p->second->BuildStatsQuery_m(nullptr, // all tracks

We probably want to also check in WebrtcGlobalInformation::GetAllStats

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +2810,5 @@
>    }
>  
> +  // Note: mMedia->ice_ctx() is deleted on STS thread; so make sure we grab and hold
> +  // a ref instead of making multiple calls.  NrIceCtx uses threadsafe refcounting.
> +  query->iceCtx = mMedia->ice_ctx();

mMedia->ice_ctx() is deleted on STS after mMedia is nulled out on main; if mMedia is still set, the ice ctx is still there. Even if there was a risk of the ice ctx disappearing during this function call, we'd end up releasing the last ref on main, which is also bad.
(In reply to Byron Campen [:bwc] from comment #8)
> > +  // Note: mMedia->ice_ctx() is deleted on STS thread; so make sure we grab and hold
> > +  // a ref instead of making multiple calls.  NrIceCtx uses threadsafe refcounting.
> > +  query->iceCtx = mMedia->ice_ctx();
> 
> mMedia->ice_ctx() is deleted on STS after mMedia is nulled out on main; if
> mMedia is still set, the ice ctx is still there. Even if there was a risk of
> the ice ctx disappearing during this function call, we'd end up releasing
> the last ref on main, which is also bad.

Not necessarily. Provided Randell disregards my "bail on empty queries" in comment 6, then even if this ends up being the last ice_ctx reference then it'll get promptly dispatched to STS, where we ... hmm, I thought we had some code to release the iceCtx held by QueryRequest on sts rather than FreeBackOnMain. Have I been imagining things?
(In reply to Jan-Ivar Bruaroey [:jib] from comment #9)
> (In reply to Byron Campen [:bwc] from comment #8)
> > > +  // Note: mMedia->ice_ctx() is deleted on STS thread; so make sure we grab and hold
> > > +  // a ref instead of making multiple calls.  NrIceCtx uses threadsafe refcounting.
> > > +  query->iceCtx = mMedia->ice_ctx();
> > 
> > mMedia->ice_ctx() is deleted on STS after mMedia is nulled out on main; if
> > mMedia is still set, the ice ctx is still there. Even if there was a risk of
> > the ice ctx disappearing during this function call, we'd end up releasing
> > the last ref on main, which is also bad.
> 
> Not necessarily. Provided Randell disregards my "bail on empty queries" in
> comment 6, then even if this ends up being the last ice_ctx reference then
> it'll get promptly dispatched to STS, where we ... hmm, I thought we had
> some code to release the iceCtx held by QueryRequest on sts rather than
> FreeBackOnMain. Have I been imagining things?

I mis-typed: we could release the last ref on main if we hit an error.
Flags: needinfo?(docfaraday)
OK good, otherwise we should reorder the bail tests to obtain ice_ctx() last so it's safely shepherded to STS.
updates in response to additional comments in the bug
Attachment #8585029 - Flags: review?(jib)
Comment on attachment 8585029 [details] [diff] [review]
followup patch to continue BuildStats cleanup

Review of attachment 8585029 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +2824,4 @@
>    // Note: mMedia->ice_ctx() is deleted on STS thread; so make sure we grab and hold
>    // a ref instead of making multiple calls.  NrIceCtx uses threadsafe refcounting.
> +  // NOTE: Do this after all other failure tests, to ensure we don't
> +  // accidentally release the Ctx on Mainthread.

I don't know the answer, but this NOTE: comment seems to contradict Byron's comment 11.

Code looks good though so no biggie.
Attachment #8585029 - Flags: review?(jib) → review+
https://hg.mozilla.org/mozilla-central/rev/d56c2eb468f6
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
[Tracking Requested - why for this release]:
From what I see in crash stats, we'll need to get this onto beta to actually see if it worked (and I do not think anyone should do manual QA to try to reproduce on those NSFW sites; we should let the data speak). So I'm nominating for 38 so we track this on beta.

Also nominating for 37 as it's the #1 crash on Android beta right now and if we can verify by data on 38, it's probably something we'd want to take as a ride-along if we'd end up with a point release.
(In reply to Jan-Ivar Bruaroey [:jib] from comment #6)
> With this change, queries may now be empty here, whereas before this wasn't
> possible thanks to the !mPeerConnections.empty() check earlier in this
> function.
> 
> STS can probably handle an empty list, but it hasn't had to before, so it
> may be more conservative to bail here if queries is empty and not dispatch
> to STS in that case.

Bailing on queries->empty() in WebrtcGlobalInformation::GetAllStats, returning NS_OK but never invoking the callback with an empty report, may hang a caller. This does not happen with aboutWebrtc.js, which renders its controls even if the callbacks from GetAllStats are never triggered, but I am concerned that another consumer may get tripped up by this. So I think this does not break the current code set, but I may update this section as part of the e10s work I am doing in this module.
Good point. There's also no initial check on ctx->mPeerConnections.empty() in WebrtcGlobalInformation, so an empty list may happen here even without an error. That tells me sts is able to handle empty lists, so we'd probably be better off always firing the dispatch after all.
Tracking for 38 (what Kairo said in comment #19)
Jan-Ivar, as Randell is on PTO, could you file the uplift request to beta? Thanks
Flags: needinfo?(jib)
Comment on attachment 8584643 [details] [diff] [review]
be careful about stats query creation

Approval Request Comment
[Feature/regressing bug #]: Bug 970685
[User impact if declined]:
#1 crash on Firefox 36 for Android right now and #2 in Firefox 37 Beta for Android. Affects desktop as well.
[Describe test coverage new/current, TreeHerder]: In 39. See comment 19.
[Risks and why]: 
It's a shutdown crash, so low risk of making things worse. Change is small and conservative (obtain reference once instead of twice to avoid race on check, and check for errors and bail properly).
[String/UUID change made/needed]: None
Flags: needinfo?(jib)
Attachment #8584643 - Flags: approval-mozilla-beta?
Comment on attachment 8585029 [details] [diff] [review]
followup patch to continue BuildStats cleanup

Applies to both patches.
Attachment #8585029 - Flags: approval-mozilla-beta?
Attachment #8585029 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8584643 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 8584643 [details] [diff] [review]
be careful about stats query creation

Approval Request Comment
[Feature/regressing bug #]: Bug 970685
[User impact if declined]:
#1 crash on Firefox 36 for Android right now and #2 in Firefox 37 Beta for Android. Affects desktop as well on current release.
[Describe test coverage new/current, TreeHerder]: In 39. See comment 19.
[Risks and why]: 
It's a shutdown crash, so low risk of making things worse. Change is small and conservative (obtain reference once instead of twice to avoid race on check, and check for errors and bail properly).
[String/UUID change made/needed]: None
Attachment #8584643 - Flags: approval-mozilla-release?
Comment on attachment 8585029 [details] [diff] [review]
followup patch to continue BuildStats cleanup

Approval Request Comment
[Feature/regressing bug #]: Bug 970685
[User impact if declined]:
#1 crash on Firefox 36 for Android right now and #2 in Firefox 37 Beta for Android. Affects desktop as well on current release.
[Describe test coverage new/current, TreeHerder]: In 39. See comment 19.
[Risks and why]: 
It's a shutdown crash, so low risk of making things worse. Change is small and conservative (obtain reference once instead of twice to avoid race on check, and check for errors and bail properly).
[String/UUID change made/needed]: None
Attachment #8585029 - Flags: approval-mozilla-release?
Comment on attachment 8585029 [details] [diff] [review]
followup patch to continue BuildStats cleanup

Thanks for stepping in Paul. As Ryan said, I requested that this fix be landed before the approval was on the bug as I knew that it was coming. This is the top crash for Fennec right now. We're going to fix this in 37.0.1. Release+
Attachment #8585029 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #8584643 - Flags: approval-mozilla-release? → approval-mozilla-release+
relnoted as "Fixed: Crash due to WebRTC usage on certain web sites".
We will go with what Kairo mentioned above and will verify this as fixed via crash-stats once the build is in Google Play (we confirmed on IRC with Kairo that this is OK).
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #19)
> [Tracking Requested - why for this release]:
> From what I see in crash stats, we'll need to get this onto beta to actually
> see if it worked (and I do not think anyone should do manual QA to try re
> reproduce on those NSFW sites, we should let the data speak). So I'm
> nominating for 38 so we track this on beta.
> 
> Also nominating for 37 as it's the #1 crash on Android beta right now and if
> we can verify by data on 38, it's probably something we'd want to take as a
> ride-along if we'd end up with a point release.


Is there a reason someone can't write a test that reproduces this crash
based on whatever theory there is about the problem and then verify that
this fix fixes it?
(In reply to Eric Rescorla (:ekr) from comment #33)
> Is there a reason someone can't write a test that reproduces this crash
> based on whatever theory there is about the problem and then verify that
> this fix fixes it?

I don't know if there is, but I'm unable to devise such a test case; we'd need to involve the devs to find one.
FWIW I've just tried and failed to reproduce locally on desktop with EverySecondTelemetryCallback_s timer changed from 1 second to 1 millisecond and with the two fixes here reverted. No crashes on shutdown with normal WebRTC use (tried 20-30 times with opt build).

I've pushed an android build of same to try[1]. If someone is able to reproduce reliably with it, then I can push a second with the fix in it.

I don't know the conditions under which this crashes on shutdown, perhaps these sites just use ICE for fingerprinting or some-such rather than using the whole webrtc stack? Shouldn't matter I suppose but then everything may matter with races.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=d007df77fdeb
> I don't know the conditions under which this crashes on shutdown, perhaps
> these sites just use ICE for fingerprinting or some-such rather than using
> the whole webrtc stack? Shouldn't matter I suppose but then everything may
> matter with races.

*) It's hard to test the fix, as the URLs for the sites aren't public (for understandable reasons). Perhaps someone willing to view these (icky) sites (or willing to cover their screen) could see whether the crash reproduces reliably enough to test there, or could give a pointer to the JS source or a snippet.

*) perhaps they're using it for fingerprinting, or perhaps they're using it as a p2p CDN to cut their bandwidth costs (which are likely considerable). Their JS may be doing something unusual (or stupid).

*) we certainly plugged a hole that would lead to this sort of failure; that doesn't guarantee we've fixed this in-the-field failure. We may have to deploy it and see what happens, unfortunately.
I can confirm that 37.0.1 does not have crashes with this signature any more. We didn't see it in any significant frequency in Nightly or Aurora, so I'd guess it's OK there as well. We have not shipped a beta with the fix yet, that should go out tomorrow.
Note that the 39 "verified" is partly a guess: we haven't seen any crashes on Aurora since 39 was uplifted there, but frequency was low on that channel even before the patch.
OK, we now have a beta with the patch and it's not seeing the crash.