Closed Bug 1188391 Opened 4 years ago Closed Last month

Investigate ICE success/failure rates in Telemetry

Categories

(Core :: WebRTC, defect, P2)

Tracking

Status: RESOLVED INCOMPLETE
Tracking Status: firefox42 --- affected

People

(Reporter: jesup, Unassigned, NeedInfo)

References

(Depends on 3 open bugs)

Attachments

(1 file)

The ICE success/failure rates in Telemetry seem very surprising; perhaps they're being misreported since trickle ICE became the default. Needs investigation.
So, looking at the candidate types given failure, it seems that the lion's share happens when neither end had any relay candidates.

Also, looking at trickle statistics, we see a staggering amount of late trickle happening. For release 38, about 34M trickle candidates took more than 10 seconds to come in, while only about 2M trickle candidates made it before 10 seconds. The difference between release 37 and 38 is stark:

http://telemetry.mozilla.org/dist.html#measure=WEBRTC_ICE_ON_TIME_TRICKLE_ARRIVAL_TIME&max_channel_version=release%2F37&min_channel_version=nightly%2F38&product=Firefox&cumulative=0&use_submission_date=0&start_date=2015-03-04&end_date=2015-04-14&sanitize=1&os=WINNT!Darwin!Linux

http://telemetry.mozilla.org/dist.html#measure=WEBRTC_ICE_ON_TIME_TRICKLE_ARRIVAL_TIME&max_channel_version=release%2F38&min_channel_version=nightly%2F38&product=Firefox&cumulative=0&use_submission_date=0&start_date=2015-04-05&end_date=2015-06-04&sanitize=1&os=WINNT!Darwin!Linux

When looking at trickle arrival times by platform, Windows and OS X fare extremely poorly, while Linux functions properly.

In 38 release, we see the first case where the STUN rate limit is breached, although it does not explain a large percentage of the failures:

http://telemetry.mozilla.org/dist.html#measure=WEBRTC_STUN_RATE_LIMIT_EXCEEDED_BY_TYPE_GIVEN_FAILURE&max_channel_version=release%2F38&min_channel_version=nightly%2F38&product=Firefox&cumulative=0&use_submission_date=0&start_date=2015-04-26&end_date=2015-06-04&sanitize=1&os=WINNT!Darwin!Linux

Looking at failure rates by platform, it is very clear that Windows is the primary problem. Rates are much more reasonable on OS X, and extremely good on Linux. Most versions of Windows have bad success rates, but the worst are Vista, 7, and 8.
The trickle arrival time data paints a weird picture. An extremely large share (about 50%) of the arrival times falls between 10 and 11 seconds, as if something is purposefully trickling candidates at the 10 second mark.
Some more information: when looking at 38 across its ride on the trains, we see the same stark difference in trickle arrival times between when it was in aurora and when it was in beta.

http://telemetry.mozilla.org/dist.html#measure=WEBRTC_ICE_ON_TIME_TRICKLE_ARRIVAL_TIME&max_channel_version=nightly%2F38&min_channel_version=nightly%2F38&product=Firefox&cumulative=0&use_submission_date=0&start_date=2015-01-11&end_date=2015-02-22&sanitize=1&os=WINNT!Darwin!Linux

http://telemetry.mozilla.org/dist.html#measure=WEBRTC_ICE_ON_TIME_TRICKLE_ARRIVAL_TIME&max_channel_version=aurora%2F38&min_channel_version=nightly%2F38&product=Firefox&cumulative=0&use_submission_date=0&start_date=2015-02-22&end_date=2015-03-29&sanitize=1&os=WINNT!Darwin!Linux

http://telemetry.mozilla.org/dist.html#measure=WEBRTC_ICE_ON_TIME_TRICKLE_ARRIVAL_TIME&max_channel_version=beta%2F38&min_channel_version=nightly%2F38&product=Firefox&cumulative=0&use_submission_date=0&start_date=2015-03-29&end_date=2015-05-24&sanitize=1&os=WINNT!Darwin!Linux

http://telemetry.mozilla.org/dist.html#measure=WEBRTC_ICE_ON_TIME_TRICKLE_ARRIVAL_TIME&max_channel_version=release%2F38&min_channel_version=nightly%2F38&product=Firefox&cumulative=0&use_submission_date=0&start_date=2015-04-05&end_date=2015-06-04&sanitize=1&os=WINNT!Darwin!Linux

Looking more closely at 38 on beta, the difference between the data from 03/30 to 05/15 and the data from 05/16 to 05/24 is huge. Something big changed in the middle of May.
The change seems to be in the 5/11 build on 38 beta. The previous build (5/8) seems "normal".
When looking day-by-day on release 38, most of the late trickle happens on just a handful of days. However, these days do not see unusual failure rates.
Depends on: 1189099
backlog: --- → webRTC+
Rank: 18
Priority: -- → P1
So looking at this in PeerConnection.js, I see that it does this:

    if (iceConnectionState === 'failed') {
      histogram.add(false);
      this._dompc.logError("ICE failed, see about:webrtc for more details", null, 0);
    }
    if (this._dompc.iceConnectionState === 'checking' &&
        (iceConnectionState === 'completed' ||
         iceConnectionState === 'connected')) {
          histogram.add(true);
    }
So, my concern is that we might somehow get to state 'failed' for "calls" that never actually had a chance of succeeding.  I'd like to verify that's the case (that the NYTimes thing isn't warping our data, for example), and if other variants can get to 'failed' (or 'success', though I'm much less worried that's possible).  Perhaps we need a third bucket, or just exclude some from the two here.

We also need to key off Loop vs non-Loop for reporting and update Histograms.json; see PeerConnectionImpl::isLoop() to know if it's a loop call or not.
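A rough sketch of the kind of split being suggested (not an actual patch), assuming the same state-change handler and Services import PeerConnection.js already has. The "_LOOP"/"_WEBRTC" suffixed histogram names are hypothetical, and whether isLoop() is reachable from JS like this is an assumption; the comment above only points at PeerConnectionImpl::isLoop() on the C++ side.

    // Hypothetical sketch only: pick a Loop vs. non-Loop histogram before reporting.
    // The suffixed histogram names below do not exist; they just illustrate the split,
    // and isLoop() being exposed to JS this way is an assumption.
    let suffix = this._dompc.isLoop() ? "_LOOP" : "_WEBRTC";
    let histogram =
      Services.telemetry.getHistogramById("WEBRTC_ICE_SUCCESS_RATE" + suffix);

    if (iceConnectionState === 'failed') {
      histogram.add(false);
    } else if (this._dompc.iceConnectionState === 'checking' &&
               (iceConnectionState === 'completed' ||
                iceConnectionState === 'connected')) {
      histogram.add(true);
    }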
Flags: needinfo?(drno)
Flags: needinfo?(docfaraday)
(In reply to Randell Jesup [:jesup] from comment #6)
> So looking at this in PeerConnection.js, I see that it does this:
> 
>     if (iceConnectionState === 'failed') {
>       histogram.add(false);
>       this._dompc.logError("ICE failed, see about:webrtc for more details",
> null, 0);
>     }
>     if (this._dompc.iceConnectionState === 'checking' &&
>         (iceConnectionState === 'completed' ||
>          iceConnectionState === 'connected')) {
>           histogram.add(true);
>     }
> So, my concern is that we might somehow get to state 'failed' for "calls"
> that never actually had a chance of succeeding.  I'd like to verify that's
> the case (that the NYTimes thing isn't warping our data, for example), and
> if other variants can get to 'failed' (or 'success', though I'm much less
> worried that's possible).  Perhaps we need a third bucket, or just exclude
> some from the two here.
> 
> We also need to key off Loop vs non-Loop for reporting and update
> Histogram.json; see PeerConnectionImpl::isLoop() to know if it's a loop call
> or not.

It depends on what we mean by "never actually had a chance of succeeding". We won't hit this unless offer/answer actually concludes, so unused PCs won't trip this. We have some amount of visibility into what could have worked, but didn't, from the candidate pair telemetry but probably not enough.
Flags: needinfo?(docfaraday)
Depends on: 1198883
(In reply to Byron Campen [:bwc] from comment #7)
> We won't hit this unless offer/answer actually concludes, so unused PCs won't trip this.

That may be true historically, but not in nightly with relay-only; FYI: http://jsfiddle.net/Lcg9ghtg
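The fiddle's contents aren't reproduced here, but a minimal loop-back sketch along these lines (written with current API names; the behavior described is the observation above, not something re-verified here) shows the concern: with iceTransportPolicy 'relay' and no TURN server configured, the call never had a chance, yet ICE can still end up in 'failed' and get counted as a failure.

    // Hypothetical repro sketch (not the actual fiddle): relay-only with no TURN
    // server, looped back between two local PeerConnections.
    const cfg = { iceServers: [], iceTransportPolicy: 'relay' };
    const pc1 = new RTCPeerConnection(cfg);
    const pc2 = new RTCPeerConnection(cfg);

    pc1.oniceconnectionstatechange = () =>
      console.log('pc1 iceConnectionState:', pc1.iceConnectionState);

    // Trickle candidates across; with relay-only and no TURN there should be none.
    pc1.onicecandidate = e => e.candidate && pc2.addIceCandidate(e.candidate);
    pc2.onicecandidate = e => e.candidate && pc1.addIceCandidate(e.candidate);

    pc1.createDataChannel('probe');
    pc1.createOffer()
      .then(offer => pc1.setLocalDescription(offer))
      .then(() => pc2.setRemoteDescription(pc1.localDescription))
      .then(() => pc2.createAnswer())
      .then(answer => pc2.setLocalDescription(answer))
      .then(() => pc1.setRemoteDescription(pc2.localDescription));
    // Under the observation above, pc1 eventually reports 'failed' even though
    // no relay candidates could ever have been gathered.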
Hmm. On the flip side, there are probably other fatal errors (in gathering) that will cause something similar and are legitimately our fault.
Nils - do any of your updated stats help investigate this?
Nils -- Based on our discussion in triage yesterday, I'd like to assign this to you to dig down roughly one layer deeper on the ICE failure rates we're seeing.  Depending on what you find, we'll either make this a meta bug or reassess who should fix this.
Assignee: nobody → drno
Flags: needinfo?(drno)
So, after looking around in the nICEr code, switching the ICE connection state to 'failed' basically comes down to this list:
1) After media streams have timed out (this is the normal case for an ICE failure)
2) When gathering fails, which includes failure to gather local addresses (to answer the question from comment #10: the updated stats should at least tell us how often we fail to gather local addresses)
3) No streams are assigned/requested. As Byron pointed out, a PeerConnection which never went through offer/answer should not leave the 'new' connection state. I guess that leaves us with codec mismatches and other scenarios where no usable stream is left after offer/answer concludes.
4) All kinds of internal errors, like memory allocation failures etc., which should only happen rarely.

The open question is how likely/often we encounter #2 and #3.

The two options I see right now are:
A) Wait for the improved ICE candidate stats to give us an idea of how often we encounter #2, and then take it from there.
B) Change the code jesup pointed out in comment #6 to only report failures in the stats if the connection state 'checking' had been reached before 'failed' (similar to what we already do for the success case); a rough sketch follows below.

The disadvantage of option B) is that it would leave us "blind" on how often we encounter problems #2, #3 and #4. But maybe we could mitigate that through new/other telemetry probes.
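A minimal sketch of option B (assumed shape, not the patch that eventually landed), reusing the handler and histogram variable from comment #6: a failure is only recorded if the previous state was 'checking', mirroring the existing success condition.

    // Sketch of option B (assumption, not the landed change): only record a
    // failure if connectivity checks had actually started, i.e. the previous
    // state was 'checking'.
    if (this._dompc.iceConnectionState === 'checking' &&
        iceConnectionState === 'failed') {
      histogram.add(false);
      this._dompc.logError("ICE failed, see about:webrtc for more details", null, 0);
    }
    if (this._dompc.iceConnectionState === 'checking' &&
        (iceConnectionState === 'completed' ||
         iceConnectionState === 'connected')) {
      histogram.add(true);
    }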
Bug 1188391: separate immediate ICE failures from failures after checking. r?bwc
Attachment #8688072 - Flags: review?(docfaraday)
Comment on attachment 8688072 [details]
MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking.  r+bwc,r?vladan

https://reviewboard.mozilla.org/r/25281/#review22769

::: dom/media/PeerConnection.js:1326
(Diff revision 1)
> -    if (iceConnectionState === 'failed') {
> -      histogram.add(false);
> +    if (iceConnectionState === 'checking') {
> +      checking_histogram.add(true);
> +    } else if (iceConnectionState === 'failed') {
> +      checking_histogram.add(false);
>        this._dompc.logError("ICE failed, see about:webrtc for more details", null, 0);
>      }

If we see new->checking->failed, this will hit this histogram twice. This isn't what we want, right?
Attachment #8688072 - Flags: review?(docfaraday)
Just for reference: bug 1221837 is an example of case 3) from comment #12 (but we still have no idea how often this happens out there).
See Also: → 1221837
Attachment #8688072 - Flags: review?(docfaraday)
Comment on attachment 8688072 [details]
MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking.  r+bwc,r?vladan

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/25281/diff/1-2/
Comment on attachment 8688072 [details]
MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking.  r+bwc,r?vladan

https://reviewboard.mozilla.org/r/25281/#review22781

::: dom/media/PeerConnection.js:1334
(Diff revisions 1 - 2)
> +    } else if (iceConnectionState === 'checking') {
> +      checking_histogram.add(true);
> +    } else if (iceConnectionState === 'failed') {
> +      checking_histogram.add(false);
> +      this._dompc.logError("ICE failed, see about:webrtc for more details", null, 0);

This is probably going to do weird things when renegotiation puts us back to checking (from connected/completed). We probably need to check if the previous state was new.
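A hedged sketch of the guard being asked for here (the interdiff itself isn't reproduced, so the exact shape is an assumption; error logging is elided):

    // Sketch only: gate both buckets on the previous state being 'new', so an
    // ICE restart that re-enters 'checking' from 'connected'/'completed' is not
    // counted again, and a 'failed' reached after 'checking' does not also get
    // recorded as an immediate failure.
    if (this._dompc.iceConnectionState === 'new') {
      if (iceConnectionState === 'checking') {
        checking_histogram.add(true);
      } else if (iceConnectionState === 'failed') {
        checking_histogram.add(false);
      }
    }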
Attachment #8688072 - Flags: review?(docfaraday)
Comment on attachment 8688072 [details]
MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking.  r+bwc,r?vladan

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/25281/diff/2-3/
Attachment #8688072 - Flags: review?(docfaraday)
Comment on attachment 8688072 [details]
MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking.  r+bwc,r?vladan

https://reviewboard.mozilla.org/r/25281/#review22807

This looks right to me.
Attachment #8688072 - Flags: review?(docfaraday) → review+
Comment on attachment 8688072 [details]
MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking.  r+bwc,r?vladan

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/25281/diff/3-4/
Attachment #8688072 - Flags: review?(vladan.bugzilla)
Vladan, can you or someone else give us a quick review for a new telemetry probe?
We are simply trying to split the existing probe into two, to get a better understanding of the different failure scenarios.
Attachment #8688072 - Flags: review?(vladan.bugzilla)
Comment on attachment 8688072 [details]
MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking.  r+bwc,r?vladan

https://reviewboard.mozilla.org/r/25281/#review22905

::: toolkit/components/telemetry/Histograms.json:10039
(Diff revision 3)
> +  "WEBRTC_ICE_CHECKING_RATE": {

* add an alert_emails and bug_numbers field
* make it expire in 8 versions (~1 year); you can easily renew it then, and you'll be notified by an automated email.
* It would be good to add these fields & update expiries for all the existing WEBRTC_ICE probes
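For reference, a sketch of what such a Histograms.json entry might look like. The alert address and expiry version below are placeholders (JSON does not allow comments, so the assumptions are called out here instead); the histogram name, the boolean kind, and the bug number come from this review and bug.

    "WEBRTC_ICE_CHECKING_RATE": {
      "alert_emails": ["webrtc-telemetry-alerts@example.com"],
      "bug_numbers": [1188391],
      "expires_in_version": "53",
      "kind": "boolean",
      "description": "Whether an ICE session reached the checking state at all, as opposed to failing before any connectivity checks started"
    }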
Comment on attachment 8688072 [details]
MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking.  r+bwc,r?vladan

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/25281/diff/3-4/
Attachment #8688072 - Attachment description: MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking. r?bwc → MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking. r+bwc,r?vladan
Attachment #8688072 - Flags: review?(vladan.bugzilla)
Attachment #8688072 - Flags: review?(vladan.bugzilla) → review+
Comment on attachment 8688072 [details]
MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking.  r+bwc,r?vladan

https://reviewboard.mozilla.org/r/25281/#review23387

::: toolkit/components/telemetry/Histograms.json:10056
(Diff revisions 3 - 4)
>      "kind": "boolean",

Sorry, I forgot to mention: can you also add a bug_numbers field to this new histogram?
Comment on attachment 8688072 [details]
MozReview Request: Bug 1188391: separate immediate ICE failures from failures after checking.  r+bwc,r?vladan

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/25281/diff/4-5/
Try run results look sane.

I suggest leaving this open, so that we can check whether this change is enough to bring the ICE success rate into an acceptable range.
Reminder to myself to check back on the new WEBRTC_ICE_CHECKING_RATE and old WEBRTC_ICE_SUCCESS_RATE once we have gathered enough data in Telemetry.
Flags: needinfo?(drno)
Depends on: 933986
Flags: needinfo?(drno)
Depends on: 1257004
Depends on: 1257006
Mass change P1->P2 to align with new Mozilla triage process
Priority: P1 → P2
I don't have the capacity any more to work on this any time soon.
Assignee: drno → nobody

The leave-open keyword is there and there is no activity for 6 months.
:drno, maybe it's time to close this bug?

Flags: needinfo?(drno)

Status: NEW → RESOLVED
Closed: Last month
Resolution: --- → INCOMPLETE