Add ice telemetry to begin replacing stats provided by Hello calls

RESOLVED FIXED in Firefox 54

Status

()

Core
WebRTC: Networking
P2
normal
Rank:
25
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: mjf, Assigned: mjf)

Tracking

(Blocks: 3 bugs)

unspecified
mozilla54
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox54 fixed)

Details

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment)

59 bytes, text/x-review-board-request
drno
: review+
chutten
: review+
Details | Review
(Assignee)

Description

a year ago
We're missing valuable stats that were provided by Hello.
(Assignee)

Updated

a year ago
Rank: 25
Comment hidden (mozreview-request)

Comment 2

a year ago
mozreview-review
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review101574

In general looks good.
My biggest concern is (and that's the reason for the r-) is that the accumulation code looks like it's only accumulating STUN retransmissions for relayed candidates. So I would like to see it again with the fixes.

::: media/mtransport/nricectx.h:194
(Diff revision 1)
>    std::string alpn_;
>  };
>  
>  class TestNat;
>  
> +class NrIceTelemetry {

Since Telemetry is a Mozilla specific term I would prefer if at least the parts down in .../third_party/nICEr/* would use 'stats' rather than 'telemetry'. Not sure if it easier to also rename everything up in media/mtransport/* as well to keep it consistent.

::: media/mtransport/nricectxhandler.cpp:119
(Diff revision 1)
> +  new_ctx->ctx_->telemetry.stun_retransmits += telemetry.stun_retransmits;
> +  new_ctx->ctx_->telemetry.turn_401s += telemetry.turn_401s;
> +  new_ctx->ctx_->telemetry.turn_403s += telemetry.turn_403s;
> +  new_ctx->ctx_->telemetry.turn_438s += telemetry.turn_438s;

All the += operations in here make me wonder if need to wrap these in some common operation which prevents overflows of these (with the assumption that reporting the max value makes more sense than the smaller value after an overflow).

::: media/mtransport/third_party/nICEr/src/ice/ice_candidate.c:824
(Diff revision 1)
>        cand->u.srvrflx.stun_handle=0;
>      }
>  
> +    // record retransmits back to the ice ctx
> +    if (cand->u.srvrflx.stun->retransmit_ct) {
> +      cand->ctx->telemetry.stun_retransmits += cand->u.srvrflx.stun->retransmit_ct;

overflow protection?

::: media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c:207
(Diff revision 1)
>      if(pair->state==NR_ICE_PAIR_STATE_SUCCEEDED)
>        goto done;
>  
> +    // record retransmits back to the ice ctx
> +    if (pair->stun_client && pair->stun_client->retransmit_ct) {
> +      pair->local->ctx->telemetry.stun_retransmits += pair->stun_client->retransmit_ct;

overflow protection?

::: media/mtransport/third_party/nICEr/src/ice/ice_component.c:1284
(Diff revision 1)
> +    // record retransmits back to the ice ctx
> +    if (comp->consent_ctx->retransmit_ct) {
> +      comp->ctx->telemetry.stun_retransmits += comp->consent_ctx->retransmit_ct;
> +    }
> +

Consent is actually kind of different beast. It has per definition no retransmissions, but uses new STUN transactions for every re-try and always sends only a single request.
Hence this would drive our retrnamissions counters through the roof for ever longer call, as this would count every successful consent refresh as a STUN retransmission.

So lets remove this here. I think we should have follow up ticket to add stats for consent time outs (and maybe for the failures).

::: media/mtransport/third_party/nICEr/src/ice/ice_ctx.h:114
(Diff revision 1)
> +typedef struct nr_ice_telemetry_ {
> +  UINT2 stun_retransmits;
> +  UINT2 turn_401s;
> +  UINT2 turn_403s;
> +  UINT2 turn_438s;
> +} nr_ice_telemetry;

Lets call these nr_ice_stats.

::: media/mtransport/third_party/nICEr/src/ice/ice_ctx.c:1073
(Diff revision 1)
> +          if (cand->type == RELAYED) {
> +            ctx->telemetry.turn_401s += cand->u.relayed.turn->cnt_401s;
> +            ctx->telemetry.turn_403s += cand->u.relayed.turn->cnt_403s;
> +            ctx->telemetry.turn_438s += cand->u.relayed.turn->cnt_438s;
> +
> +            stun_ctx = STAILQ_FIRST(&cand->u.relayed.turn->stun_ctxs);
> +            while (stun_ctx) {
> +              ctx->telemetry.stun_retransmits += stun_ctx->stun->retransmit_ct;
> +
> +              stun_ctx = STAILQ_NEXT(stun_ctx, entry);
> +            }
> +          }

This looks to me like it's accumulating |stun_retransmits| only for relayed candidates and the stats from non relay candidates are ignored. Or do I miss something here?

::: media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c:256
(Diff revision 1)
>  
>      if (ctx->state != NR_STUN_CLIENT_STATE_RUNNING)
>          ABORT(R_NOT_PERMITTED);
>  
> +    // track retransmits for ice telemetry
> +    ++ctx->retransmit_ct;

overflow protection?

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c:258
(Diff revision 1)
> +      if (ctx->stun->error_code == 401) {
> +        // track 401s for ice telemetry
> +        ++ctx->tctx->cnt_401s;

As an initial 401 is expected from the TURN server I think we should not count the initial 401's here, but further down in the |if (ctx->retry_ct >0)| block.

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c:300
(Diff revision 1)
>          ABORT(R_FAILED);
>        }
>        break;
>  
>      case NR_STUN_CLIENT_STATE_TIMED_OUT:
>        ABORT(R_FAILED);

Turn timeouts are probably also very interesting for a follow up bug.

::: toolkit/components/telemetry/Histograms.json:6941
(Diff revision 1)
>      "n_buckets": 20,
>      "description": "The length of time (in milliseconds) that a trickle candidate took to arrive after the start of ICE, given that it arrived after ICE failed."
>    },
>    "WEBRTC_ICE_SUCCESS_TIME": {
>      "alert_emails": ["webrtc-ice-telemetry-alerts@mozilla.com"],
> -    "expires_in_version": "53",
> +    "expires_in_version": "55",

I think these 53 values should be taken care of in bug 1319268

::: toolkit/components/telemetry/Histograms.json:7255
(Diff revision 1)
>      "high": 30,
>      "n_buckets": 29,
>      "description": "The number of times AddIceCandidate failed on a given PeerConnection, given that ICE failed."
>    },
> +  "WEBRTC_TEST_ICE_NICER": {
> +    "expires_in_version": "55",

55 is pretty low. I would at least go for 58 as the inital version, if not |never|

::: toolkit/components/telemetry/Histograms.json:7256
(Diff revision 1)
>      "n_buckets": 29,
>      "description": "The number of times AddIceCandidate failed on a given PeerConnection, given that ICE failed."
>    },
> +  "WEBRTC_TEST_ICE_NICER": {
> +    "expires_in_version": "55",
> +    "alert_emails": ["webrtc-telemetry-alerts@mozilla.com"],

The email should probably be webrtc-ice-telemetry-alerts@mozilla.com

::: toolkit/components/telemetry/histogram-whitelists.json:1705
(Diff revision 1)
>      "WEBRTC_AVSYNC_WHEN_VIDEO_LAGS_AUDIO_MS",
>      "WEBRTC_CALL_COUNT",
>      "WEBRTC_CALL_DURATION",
>      "WEBRTC_CALL_TYPE",
>      "WEBRTC_DATACHANNEL_NEGOTIATED",
> +    "WEBRTC_TEST_ICE_NICER",

As I haven't seen/used this file before do you know/understand what the purpose is of this white list?
Attachment #8821424 - Flags: review?(drno) → review-
(Assignee)

Comment 3

a year ago
mozreview-review-reply
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review101574

> overflow protection?

done

> overflow protection?

done

> This looks to me like it's accumulating |stun_retransmits| only for relayed candidates and the stats from non relay candidates are ignored. Or do I miss something here?

During testing, I never once saw u.srvrflx.stun != 0, which is why I wasn't capturing stun retransmits here.  I catch them in ice_candidate_pair.c, ice_candidate.c, and ice_component.c.  I can easily check for non-null u.srvrflx.stun and do the accumulation here too if you'd like.  My concern is that it might double count.  I'm going to instrument and do some more testing.

> overflow protection?

done

> As I haven't seen/used this file before do you know/understand what the purpose is of this white list?

After an IRC tip from dmajor pointing to Bug 1219768, I've removed our entry from this file.  It is grandfathering probes that don't include bug_numbers and alert_emails fields.

Comment 4

a year ago
mozreview-review-reply
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review101574

> During testing, I never once saw u.srvrflx.stun != 0, which is why I wasn't capturing stun retransmits here.  I catch them in ice_candidate_pair.c, ice_candidate.c, and ice_component.c.  I can easily check for non-null u.srvrflx.stun and do the accumulation here too if you'd like.  My concern is that it might double count.  I'm going to instrument and do some more testing.

Never mind. Your reply made me have a look again and realized that I must have mis-read the code the first time. I though all the other places only do retrnasmit_ct++, where in fact they add to the telemetry struct.
Comment hidden (mozreview-request)
(Assignee)

Updated

a year ago
Blocks: 1331068

Comment 6

a year ago
mozreview-review
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review105400

::: media/webrtc/signaling/src/peerconnection/PeerConnectionMedia.cpp:1115
(Diff revision 2)
> +  CSFLogDebug(logTag, "Ice Telemetry: stun_retransmits: %d",
> +              stats.stun_retransmits);
> +  CSFLogDebug(logTag, "Ice Telemetry: turn_401s: %d", stats.turn_401s);
> +  CSFLogDebug(logTag, "Ice Telemetry: turn_403s: %d", stats.turn_403s);
> +  CSFLogDebug(logTag, "Ice Telemetry: turn_438s: %d", stats.turn_438s);

Can we combine this into one message?
Attachment #8821424 - Flags: review?(drno) → review+
Blocks: 1331130

Comment 7

a year ago
mozreview-review
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review105422

I'm not sure what i should look at here, so i just looked at the Telemetry / Histogram usage.
I think you don't need to use keyed histograms here as you have a well known set of values to measure - see the comment below.

I'll be out on PTO, please flag :dexter or :chutten if you need more feedback.

Note that after code review, this will still need data collection review:
https://wiki.mozilla.org/Firefox/Data_Collection

::: toolkit/components/telemetry/Histograms.json:7307
(Diff revision 2)
>    },
> +  "WEBRTC_ICE_NICER_STATS": {
> +    "expires_in_version": "never",
> +    "alert_emails": ["webrtc-ice-telemetry-alerts@mozilla.com"],
> +    "bug_numbers": [1325536],
> +    "kind": "count",

I think this should just a categorical histogram, see this example:
https://dxr.mozilla.org/mozilla-central/rev/de67fccc4c64a49f261aea29141357b94c7b3b9c/toolkit/components/telemetry/Histograms.json#5313

This is the API call:
https://dxr.mozilla.org/mozilla-central/rev/de67fccc4c64a49f261aea29141357b94c7b3b9c/toolkit/components/telemetry/Telemetry.h#105
Attachment #8821424 - Flags: review?(gfritzsche) → review-
Blocks: 1247542
(Assignee)

Comment 8

a year ago
(In reply to Georg Fritzsche [:gfritzsche] [away Jan 14 - 24] from comment #7)
> I think you don't need to use keyed histograms here as you have a well known
> set of values to measure - see the comment below.
The main reason we went with the keyed histogram solution is that we will be adding additional ICE telemetry over time.  We picked these 4 just to get started.  Does that change Georg's suggestion to use categorical histogram?
Flags: needinfo?(chutten)
(Assignee)

Comment 9

a year ago
mozreview-review-reply
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review105400

> Can we combine this into one message?

Done.
(Assignee)

Updated

a year ago
Attachment #8821424 - Flags: review?(benjamin)

Comment 10

a year ago
mozreview-review
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review106006

::: toolkit/components/telemetry/Histograms.json:7309
(Diff revision 2)
> +    "expires_in_version": "never",
> +    "alert_emails": ["webrtc-ice-telemetry-alerts@mozilla.com"],
> +    "bug_numbers": [1325536],
> +    "kind": "count",
> +    "keyed": true,
> +    "description": "nICEr stats by key (stun_retransmits, turn_401s, turn_403s, turn_438s)."

Georg is right, this shouldn't be a keyed histogram. They are expensive to process and unnecessary in this case. Categorical or enumerated would do fine.

I'd really like this comment to be more descriptive. Imagine somebody who doesn't know anything about webrtc and is reviewing our data collection docs. They don't know what ICE is and can't figure out when this might be collected. Maybe link to the webrtc spec or other documentation?

And although I know now what ICE is, I'm not sure exactly what each of these is counting. When there is an ICE, does it record exactly one of these reasons, or can it record more than one reason per connection problem? It's good to document the invariants.

You should start collecting this temporarily to make sure that it's measuring something useful, before we decide whether it's worth making permanent. Please expire in 57 to start.
Attachment #8821424 - Flags: review?(benjamin) → review-
I agree with bsmedberg: Categorical is an excellent type for this, allowing you to use the labels conveniently. Plus, with bug 1312806, adding new category labels is simplicity itself.
Flags: needinfo?(chutten)
(Assignee)

Comment 12

a year ago
(In reply to Chris H-C :chutten from comment #11)
> I agree with bsmedberg: Categorical is an excellent type for this, allowing
> you to use the labels conveniently. Plus, with bug 1312806, adding new
> category labels is simplicity itself.
The API for AccumulateCategorical appears to require N calls for a counter value of N.  These counts are accumulated deep in the nICEr stack which does not have access to our c++ Telemetry calls.  I could not find any c++ usage of a categorical histogram to confirm this.  If this is the expected usage, it seems like categorical does not really fit our needs.
Flags: needinfo?(chutten)
Maybe I need to take a step back, as I'm not sure I properly understand what your data reqs are here. Would you be willing to find me on #telemetry so we can chat about what you want to measure and I can flail wildly until I hit upon a solution that satisfies your data needs?
Flags: needinfo?(chutten)
(Assignee)

Comment 14

a year ago
After more conversation with chutten and dexter, it looks like Telemetry Scalars are more appropriate for our usage.  From chutten and dexter:
TelemetryScalar.h contains the API: Add(id, value) Set(id, value) and SetMaximum(id, value)
If you take a look at toolkit/components/telemetry/Scalars.yaml you can see existing scalars
You can organize them by namespaces, which seems directly relevant to your interests as you want to store a bunch of stuff that isn't just different options of the same one thing, but is in the same "domain"
here's how to set a scalar from JS -> http://searchfox.org/mozilla-central/source/browser/modules/BrowserUsageTelemetry.jsm#148
how to do the same in C++ -> http://searchfox.org/mozilla-central/rev/790b2cb423ea1ecb5746722d51633caec9bab95a/toolkit/components/telemetry/tests/gtest/TestScalars.cpp#146
Comment hidden (mozreview-request)

Comment 16

a year ago
mozreview-review
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review106796

The use of scalars seems correct to me.

You'll need a Data Review on this. And if you don't already, a technical review of the actual counting and transfer mechanism as I'm not a domain expert and can't tell if you are counting the right things at the right time or not, and are passing them up and down the tech stack properly or not.
Attachment #8821424 - Flags: review?(chutten) → review+

Comment 17

a year ago
mozreview-review
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review107092

data-r=me with as much of that is possible.

::: toolkit/components/telemetry/Scalars.yaml:288
(Diff revision 3)
>      kind: boolean
>      keyed: true
>      notification_emails:
>        - telemetry-client-dev@mozilla.com
> +
> +# The following section contains WebRTC nICEr scalars

As comments, these are readable in this file but won't show up in the autogenerated docs. Could you add these links to the `description` field of each of these metrics? Repeating isn't bad ;-)
Attachment #8821424 - Flags: review+
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Updated

a year ago
Keywords: checkin-needed

Comment 20

a year ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/95d09880ddf6
ice telemetry. r=bsmedberg,chutten,drno
Keywords: checkin-needed
Depends on: 1333578

Comment 22

a year ago
mozreview-review
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review108178

::: toolkit/components/telemetry/Scalars.yaml:328
(Diff revision 5)
> +
> +# The following section contains WebRTC nICEr scalars
> +# For more info on ICE, see https://tools.ietf.org/html/rfc5245
> +# For more info on STUN, see https://tools.ietf.org/html/rfc5389
> +# For more info on TURN, see https://tools.ietf.org/html/rfc5766
> +webrtc.nicer:

Are these measurements collected on the main process or on a different process? If no 'record_in_processes' property is specify, the measure is assumed to be collected on the main process: collection on other processes is discarded.
(Assignee)

Comment 23

a year ago
mozreview-review-reply
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.

https://reviewboard.mozilla.org/r/100710/#review108178

> Are these measurements collected on the main process or on a different process? If no 'record_in_processes' property is specify, the measure is assumed to be collected on the main process: collection on other processes is discarded.

These measurements are collected in the content process.  I was seeing results in {profile}/datareporting/archived/yyyy-mm/*.jsonlz4 when running a non-debug build.
Michael, thank you for your patience! Bug 1333578 just landed on central: you can rebase your patch on it and it shouldn't crash anymore.
Flags: needinfo?(mfroman)
(Assignee)

Comment 25

a year ago
Thank you!  Just did a quick check here on local build, and it looks good.  I'm going to rebase and make sure I get a good try run.
Flags: needinfo?(mfroman)
Ah, one last thing! There was a conversation on fhr-dev [1] about this probe. You should add the "main" process to the 'record_in_processes' list, along with the "content" process. By doing this you make sure that you're able to collect data on non-e10s enabled builds.

[1] - https://mail.mozilla.org/pipermail/fhr-dev/2017-January/001128.html
Flags: needinfo?(mfroman)
Comment hidden (mozreview-request)
(Assignee)

Comment 28

a year ago
Done! Thank you!
Flags: needinfo?(mfroman)
(Assignee)

Updated

a year ago
Keywords: checkin-needed

Comment 29

a year ago
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e8805c2f775a
ice telemetry. r=bsmedberg,chutten,drno
Keywords: checkin-needed

Comment 30

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e8805c2f775a
Status: NEW → RESOLVED
Last Resolved: a year ago
status-firefox54: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Depends on: 1333024
Blocks: 1338273
You need to log in before you can comment on or make changes to this bug.