Closed Bug 1147857 Opened 8 years ago Closed 8 years ago

crash in EverySecondTelemetryCallback_s on NSFW video site

Categories

(Core :: WebRTC: Signaling, defect, P1)

All
Android
defect

Tracking


VERIFIED FIXED
mozilla39
Tracking Status
firefox37 + verified
firefox38 + verified
firefox39 --- verified
relnote-firefox --- 37+

People

(Reporter: kairo, Assigned: jesup)

References

Details

(Keywords: crash, csectype-nullptr)

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-13191f00-053e-492e-97e0-deb302150323.
=============================================================

I feel dirty even filing this bug (see below) but it is the #1 crash on Firefox 36 for Android right now and #2 in Firefox 37 Beta for Android. That said, it affects desktop as well.

The unfortunate (and dirty) piece is that while this seems pretty isolated to specific websites, those sites are heavy on NSFW videos and imagery, and I will not post their URLs here.

Click the links in the Crash Signature field in this bug for more stats and reports.
It really is using WebRTC: thread 11 is in sipcc::SendLocalIceCandidateToContentImpl/OnIceCandidate, and is converting a string.
I don't see any other webrtc threads, though.

Crash appears to be a null-deref in PeerConnectionCtx.cpp:262
 auto& r = *(*q)->report;
So the next line uses r. I don't see how q could be null, but I can see how report could be null:

BuildStatsQuery_m sets ->report to new RTCStatsReportInternalConstruct(...). However, before that there are three kickouts: (1) if (!HasMedia()) { return NS_OK; }, which can't happen since we call HasMedia() before building the stats report; (2) a check for !mMedia->ice_ctx || !mThread, possible if this is some type of startup race (my bet as to the cause); (3) nsGlobalWindow::GetPerformance() failing, though I'm not sure exactly what can cause this; perhaps mDoc->GetNavigationTiming() failing.

The two possible failures both return non-OK rv's, so it should be testable, but right now we don't check the result of BuildStatsQuery_m() in EverySecondTelemetryCallback_m().

Obvious solution (if this is indeed the bug) is to do roughly:
if (NS_FAILED(p->second->BuildStatsQuery_m(nullptr, queries->back()))) {
  queries->pop_back(); // nsAutoPtr<>, so it should delete it
}

The second question would be "is it reasonable that this fail due to a race, and if so does the race imply any other issues, or does it just say 'too early to look for stats'?"

I'm afraid verifying this might require testing with signaling logs active to see what CSFLogError() pops out.
Rank: 10
Flags: needinfo?(jib)
Flags: needinfo?(docfaraday)
Flags: firefox-backlog+
Priority: -- → P1
Attached patch be careful about stats query creation (obsolete) — — Splinter Review
Possible solution
Attachment #8584589 - Flags: review?(jib)
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
Comment on attachment 8584589 [details] [diff] [review]
be careful about stats query creation

Review of attachment 8584589 [details] [diff] [review]:
-----------------------------------------------------------------

Improved error handling is always r=me.

But mIceCtx is nulled on the socket transport thread on teardown, suggesting a race, so more patches may be needed before closing this issue.

Some locking seems needed unless other events guarantee order here. Still investigating.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.cpp
@@ +289,5 @@
>          p != ctx->mPeerConnections.end(); ++p) {
>      if (p->second->HasMedia()) {
>        queries->append(nsAutoPtr<RTCStatsQuery>(new RTCStatsQuery(true)));
> +      if (NS_WARN_IF(NS_FAILED(p->second->BuildStatsQuery_m(nullptr, // all tracks
> +                                                            queries->back())))) {

I always shudder when I see code inside assertion macros, as I have to double-check whether they're no-ops in release or still evaluate their condition.

It should really be called NS_RELEASE_WARN_IF imho. Could we do an nsresult rv temporary here, just for extra sanity?

Nit: As mentioned on IRC, I'd also prefer the rarely-taken codepath to do as little as possible, so I'd rather it skip the push than do a pop_back; then only good queries are ever on the list. But either way.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +2811,5 @@
>  
>    if (!mMedia->ice_ctx() || !mThread) {
>      CSFLogError(logTag, "Could not build stats query, critical components of "
> +                "PeerConnectionImpl not set (ice_ctx=%p, mThread=%p)",
> +                mMedia->ice_ctx().get(), mThread.get());

If this is a race then pointers may have changed by the time they're read the second time in the CSFLogError call. Since it seems we're only emitting pointers here to tell which one failed, I think I'd prefer separate ifs and CSFLogErrors without pointers.
Attachment #8584589 - Flags: review?(jib) → review+
now with extra safety
Attachment #8584589 - Attachment is obsolete: true
Attachment #8584643 - Flags: review?(jib)
Comment on attachment 8584643 [details] [diff] [review]
be careful about stats query creation

Review of attachment 8584643 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with bail on empty queries.

EverySecondTelemetryCallback_m will often come in after things are shut down, because its timer is cancelled quite late, but if its checks on entry are sound then it should just exist fine. It checks PeerConnectionCtx::IsActive() and mPeerConnections.empty() on entry so as far as main thread goes it should be good. The one race seems to be with mIceCtx on sts, but I think with your change to hold a threadsafe ref it should be ok to proceed for what it does with it (extracting stats for building a request list for sts).

How sts will react when it gets this stats request this late I don't know, but since we support getting stats after close we have to deal with requests late, so presumably we do. In any case the fix here should fix the immediate crash.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.cpp
@@ +298,3 @@
>      }
>    }
>    rv = RUN_ON_THREAD(stsThread,

With this change, queries may now be empty here, whereas before this wasn't possible thanks to the !mPeerConnections.empty() check earlier in this function.

STS can probably handle an empty list, but it hasn't had to before, so it may be more conservative to bail here if queries is empty and not dispatch to STS in that case.
Attachment #8584643 - Flags: review?(jib) → review+
urr s/exist/exit/
Flags: needinfo?(jib)
Comment on attachment 8584643 [details] [diff] [review]
be careful about stats query creation

Review of attachment 8584643 [details] [diff] [review]:
-----------------------------------------------------------------

Drive-by review, just wanted to point out a couple of things.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.cpp
@@ +288,5 @@
>    for (auto p = ctx->mPeerConnections.begin();
>          p != ctx->mPeerConnections.end(); ++p) {
>      if (p->second->HasMedia()) {
>        queries->append(nsAutoPtr<RTCStatsQuery>(new RTCStatsQuery(true)));
> +      if (NS_WARN_IF(NS_FAILED(p->second->BuildStatsQuery_m(nullptr, // all tracks

We probably want to also check in WebrtcGlobalInformation::GetAllStats

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +2810,5 @@
>    }
>  
> +  // Note: mMedia->ice_ctx() is deleted on STS thread; so make sure we grab and hold
> +  // a ref instead of making multiple calls.  NrIceCtx uses threadsafe refcounting.
> +  query->iceCtx = mMedia->ice_ctx();

mMedia->ice_ctx() is deleted on STS after mMedia is nulled out on main; if mMedia is still set, the ice ctx is still there. Even if there was a risk of the ice ctx disappearing during this function call, we'd end up releasing the last ref on main, which is also bad.
(In reply to Byron Campen [:bwc] from comment #8)
> > +  // Note: mMedia->ice_ctx() is deleted on STS thread; so make sure we grab and hold
> > +  // a ref instead of making multiple calls.  NrIceCtx uses threadsafe refcounting.
> > +  query->iceCtx = mMedia->ice_ctx();
> 
> mMedia->ice_ctx() is deleted on STS after mMedia is nulled out on main; if
> mMedia is still set, the ice ctx is still there. Even if there was a risk of
> the ice ctx disappearing during this function call, we'd end up releasing
> the last ref on main, which is also bad.

Not necessarily. Provided Randell disregards my "bail on empty queries" in comment 6, then even if this ends up being the last ice_ctx reference then it'll get promptly dispatched to STS, where we ... hmm, I thought we had some code to release the iceCtx held by QueryRequest on sts rather than FreeBackOnMain. Have I been imagining things?
(In reply to Jan-Ivar Bruaroey [:jib] from comment #9)
> (In reply to Byron Campen [:bwc] from comment #8)
> > > +  // Note: mMedia->ice_ctx() is deleted on STS thread; so make sure we grab and hold
> > > +  // a ref instead of making multiple calls.  NrIceCtx uses threadsafe refcounting.
> > > +  query->iceCtx = mMedia->ice_ctx();
> > 
> > mMedia->ice_ctx() is deleted on STS after mMedia is nulled out on main; if
> > mMedia is still set, the ice ctx is still there. Even if there was a risk of
> > the ice ctx disappearing during this function call, we'd end up releasing
> > the last ref on main, which is also bad.
> 
> Not necessarily. Provided Randell disregards my "bail on empty queries" in
> comment 6, then even if this ends up being the last ice_ctx reference then
> it'll get promptly dispatched to STS, where we ... hmm, I thought we had
> some code to release the iceCtx held by QueryRequest on sts rather than
> FreeBackOnMain. Have I been imagining things?

I mis-typed: we could release the last ref on main if we hit an error.
Flags: needinfo?(docfaraday)
OK good, otherwise we should reorder the bail tests to obtain ice_ctx() last so it's safely shepherded to STS.
updates in response to additional comments in the bug
Attachment #8585029 - Flags: review?(jib)
Comment on attachment 8585029 [details] [diff] [review]
followup patch to continue BuildStats cleanup

Review of attachment 8585029 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +2824,4 @@
>    // Note: mMedia->ice_ctx() is deleted on STS thread; so make sure we grab and hold
>    // a ref instead of making multiple calls.  NrIceCtx uses threadsafe refcounting.
> +  // NOTE: Do this after all other failure tests, to ensure we don't
> +  // accidentally release the Ctx on Mainthread.

I don't know the answer, but this NOTE: comment seems to contradict Byron's comment 11.

Code looks good though so no biggie.
Attachment #8585029 - Flags: review?(jib) → review+
https://hg.mozilla.org/mozilla-central/rev/d56c2eb468f6
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
[Tracking Requested - why for this release]:
From what I see in crash stats, we'll need to get this onto beta to actually see if it worked (and I do not think anyone should do manual QA to try to reproduce on those NSFW sites; we should let the data speak). So I'm nominating for 38 so we track this on beta.

Also nominating for 37 as it's the #1 crash on Android beta right now and if we can verify by data on 38, it's probably something we'd want to take as a ride-along if we'd end up with a point release.
(In reply to Jan-Ivar Bruaroey [:jib] from comment #6)
> With this change, queries may now be empty here, whereas before this wasn't
> possible thanks to the !mPeerConnections.empty() check earlier in this
> function.
> 
> STS can probably handle an empty list, but it hasn't had to before, so it
> may be more conservative to bail here if queries is empty and not dispatch
> to STS in that case.

Bailing on queries->empty() in WebrtcGlobalInformation::GetAllStats, returning NS_OK but never invoking the callback with an empty report, may hang a caller. This does not happen with aboutWebrtc.js, which renders its controls even if the callbacks from GetAllStats are never triggered, but I am concerned that another consumer may get tripped up by this. So I think this does not break the current code set, but I may update this section as part of the e10s work I am doing in this module.
Good point. There's also no initial check on ctx->mPeerConnections.empty() in WebrtcGlobalInformation, so an empty list may happen here even without an error. That tells me sts is able to handle empty lists, so we'd probably be better off always firing the dispatch after all.
Tracking for 38 (what Kairo said in comment #19)
Jan-Ivar, as Randell is on PTO, could you file the uplift request to beta? Thanks
Flags: needinfo?(jib)
Comment on attachment 8584643 [details] [diff] [review]
be careful about stats query creation

Approval Request Comment
[Feature/regressing bug #]: Bug 970685
[User impact if declined]:
#1 crash on Firefox 36 for Android right now and #2 in Firefox 37 Beta for Android. Affects desktop as well.
[Describe test coverage new/current, TreeHerder]: In 39. See comment 19.
[Risks and why]: 
It's a shutdown crash, so low risk of making things worse. Change is small and conservative (obtain reference once instead of twice to avoid race on check, and check for errors and bail properly).
[String/UUID change made/needed]: None
Flags: needinfo?(jib)
Attachment #8584643 - Flags: approval-mozilla-beta?
Comment on attachment 8585029 [details] [diff] [review]
followup patch to continue BuildStats cleanup

Applies to both patches.
Attachment #8585029 - Flags: approval-mozilla-beta?
Attachment #8585029 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8584643 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 8584643 [details] [diff] [review]
be careful about stats query creation

Approval Request Comment
[Feature/regressing bug #]: Bug 970685
[User impact if declined]:
#1 crash on Firefox 36 for Android right now and #2 in Firefox 37 Beta for Android. Affects desktop as well on current release.
[Describe test coverage new/current, TreeHerder]: In 39. See comment 19.
[Risks and why]: 
It's a shutdown crash, so low risk of making things worse. Change is small and conservative (obtain reference once instead of twice to avoid race on check, and check for errors and bail properly).
[String/UUID change made/needed]: None
Attachment #8584643 - Flags: approval-mozilla-release?
Comment on attachment 8585029 [details] [diff] [review]
followup patch to continue BuildStats cleanup

Approval Request Comment
[Feature/regressing bug #]: Bug 970685
[User impact if declined]:
#1 crash on Firefox 36 for Android right now and #2 in Firefox 37 Beta for Android. Affects desktop as well on current release.
[Describe test coverage new/current, TreeHerder]: In 39. See comment 19.
[Risks and why]: 
It's a shutdown crash, so low risk of making things worse. Change is small and conservative (obtain reference once instead of twice to avoid race on check, and check for errors and bail properly).
[String/UUID change made/needed]: None
Attachment #8585029 - Flags: approval-mozilla-release?
Comment on attachment 8585029 [details] [diff] [review]
followup patch to continue BuildStats cleanup

Thanks for stepping in Paul. As Ryan said, I requested that this fix be landed before the approval was on the bug as I knew that it was coming. This is the top crash for Fennec right now. We're going to fix this in 37.0.1. Release+
Attachment #8585029 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #8584643 - Flags: approval-mozilla-release? → approval-mozilla-release+
relnoted as "Fixed: Crash due to WebRTC usage on certain web sites".
We will go with what Kairo mentioned above and will verify this as fixed via crash-stats once the build is in Google Play (we confirmed on IRC with Kairo that this is OK).
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #19)
> [Tracking Requested - why for this release]:
> From what I see in crash stats, we'll need to get this onto beta to actually
> see if it worked (and I do not think anyone should do manual QA to try re
> reproduce on those NSFW sites, we should let the data speak). So I'm
> nominating for 38 so we track this on beta.
> 
> Also nominating for 37 as it's the #1 crash on Android beta right now and if
> we can verify by data on 38, it's probably something we'd want to take as a
> ride-along if we'd end up with a point release.


Is there a reason someone can't write a test that reproduces this crash
based on whatever theory there is about the problem and then verify that
this fix fixes it?
(In reply to Eric Rescorla (:ekr) from comment #33)
> Is there a reason someone can't write a test that reproduces this crash
> based on whatever theory there is about the problem and then verify that
> this fix fixes it?

I don't know if there is, but I'm unable to devise such a test case; we'd need to involve the devs to find one.
FWIW I've just tried and failed to reproduce locally on desktop with EverySecondTelemetryCallback_s timer changed from 1 second to 1 millisecond and with the two fixes here reverted. No crashes on shutdown with normal WebRTC use (tried 20-30 times with opt build).

I've pushed an android build of same to try[1]. If someone is able to reproduce reliably with it, then I can push a second with the fix in it.

I don't know the conditions under which this crashes on shutdown, perhaps these sites just use ICE for fingerprinting or some-such rather than using the whole webrtc stack? Shouldn't matter I suppose but then everything may matter with races.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=d007df77fdeb
> I don't know the conditions under which this crashes on shutdown, perhaps
> these sites just use ICE for fingerprinting or some-such rather than using
> the whole webrtc stack? Shouldn't matter I suppose but then everything may
> matter with races.

*) It's hard to test the fix, as the URLs for the sites aren't public (for understandable reasons). Perhaps someone willing to view these (icky) sites (or willing to cover their screen) could see whether the crash reproduces reliably enough to test there, or could give a pointer to the JS source or a snippet.

*) perhaps they're using it for fingerprinting, or perhaps they're using it as a p2p CDN to cut their bandwidth costs (which are likely considerable). Their JS may be doing something unusual (or stupid).

*) we certainly plugged a hole that would lead to this sort of failure; that doesn't guarantee we've fixed this in-the-field failure. We may have to deploy it and see what happens, unfortunately.
I can confirm that 37.0.1 does not have crashes with this signature any more. We didn't see it in any significant frequency in Nightly or Aurora, so I'd guess it's OK there as well. We have not shipped a beta with the fix yet, that should go out tomorrow.
Note that the 39 "verified" is partly a guess: we haven't seen any crashes on Aurora since 39 was uplifted there, but frequency was low on that channel even before the patch.
OK, we now have a beta with the patch and it's not seeing the crash.