Closed
Bug 1325536
Opened 8 years ago
Closed 8 years ago
Add ice telemetry to begin replacing stats provided by Hello calls
Categories
(Core :: WebRTC: Networking, defect, P2)
Core
WebRTC: Networking
Tracking
()
RESOLVED
FIXED
mozilla54
Tracking | Status | |
---|---|---|
firefox54 | --- | fixed |
People
(Reporter: mjf, Assigned: mjf)
References
(Blocks 3 open bugs)
Details
Attachments
(1 file)
We're missing valuable stats that were provided by Hello.
Assignee | ||
Updated•8 years ago
|
Rank: 25
Comment hidden (mozreview-request) |
Comment 2•8 years ago
|
||
mozreview-review |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review101574
In general looks good.
My biggest concern is (and that's the reason for the r-) is that the accumulation code looks like it's only accumulating STUN retransmissions for relayed candidates. So I would like to see it again with the fixes.
::: media/mtransport/nricectx.h:194
(Diff revision 1)
> std::string alpn_;
> };
>
> class TestNat;
>
> +class NrIceTelemetry {
Since Telemetry is a Mozilla specific term I would prefer if at least the parts down in .../third_party/nICEr/* would use 'stats' rather than 'telemetry'. Not sure if it easier to also rename everything up in media/mtransport/* as well to keep it consistent.
::: media/mtransport/nricectxhandler.cpp:119
(Diff revision 1)
> + new_ctx->ctx_->telemetry.stun_retransmits += telemetry.stun_retransmits;
> + new_ctx->ctx_->telemetry.turn_401s += telemetry.turn_401s;
> + new_ctx->ctx_->telemetry.turn_403s += telemetry.turn_403s;
> + new_ctx->ctx_->telemetry.turn_438s += telemetry.turn_438s;
All the += operations in here make me wonder if need to wrap these in some common operation which prevents overflows of these (with the assumption that reporting the max value makes more sense than the smaller value after an overflow).
::: media/mtransport/third_party/nICEr/src/ice/ice_candidate.c:824
(Diff revision 1)
> cand->u.srvrflx.stun_handle=0;
> }
>
> + // record retransmits back to the ice ctx
> + if (cand->u.srvrflx.stun->retransmit_ct) {
> + cand->ctx->telemetry.stun_retransmits += cand->u.srvrflx.stun->retransmit_ct;
overflow protection?
::: media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c:207
(Diff revision 1)
> if(pair->state==NR_ICE_PAIR_STATE_SUCCEEDED)
> goto done;
>
> + // record retransmits back to the ice ctx
> + if (pair->stun_client && pair->stun_client->retransmit_ct) {
> + pair->local->ctx->telemetry.stun_retransmits += pair->stun_client->retransmit_ct;
overflow protection?
::: media/mtransport/third_party/nICEr/src/ice/ice_component.c:1284
(Diff revision 1)
> + // record retransmits back to the ice ctx
> + if (comp->consent_ctx->retransmit_ct) {
> + comp->ctx->telemetry.stun_retransmits += comp->consent_ctx->retransmit_ct;
> + }
> +
Consent is actually kind of different beast. It has per definition no retransmissions, but uses new STUN transactions for every re-try and always sends only a single request.
Hence this would drive our retrnamissions counters through the roof for ever longer call, as this would count every successful consent refresh as a STUN retransmission.
So lets remove this here. I think we should have follow up ticket to add stats for consent time outs (and maybe for the failures).
::: media/mtransport/third_party/nICEr/src/ice/ice_ctx.h:114
(Diff revision 1)
> +typedef struct nr_ice_telemetry_ {
> + UINT2 stun_retransmits;
> + UINT2 turn_401s;
> + UINT2 turn_403s;
> + UINT2 turn_438s;
> +} nr_ice_telemetry;
Lets call these nr_ice_stats.
::: media/mtransport/third_party/nICEr/src/ice/ice_ctx.c:1073
(Diff revision 1)
> + if (cand->type == RELAYED) {
> + ctx->telemetry.turn_401s += cand->u.relayed.turn->cnt_401s;
> + ctx->telemetry.turn_403s += cand->u.relayed.turn->cnt_403s;
> + ctx->telemetry.turn_438s += cand->u.relayed.turn->cnt_438s;
> +
> + stun_ctx = STAILQ_FIRST(&cand->u.relayed.turn->stun_ctxs);
> + while (stun_ctx) {
> + ctx->telemetry.stun_retransmits += stun_ctx->stun->retransmit_ct;
> +
> + stun_ctx = STAILQ_NEXT(stun_ctx, entry);
> + }
> + }
This looks to me like it's accumulating |stun_retransmits| only for relayed candidates and the stats from non relay candidates are ignored. Or do I miss something here?
::: media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c:256
(Diff revision 1)
>
> if (ctx->state != NR_STUN_CLIENT_STATE_RUNNING)
> ABORT(R_NOT_PERMITTED);
>
> + // track retransmits for ice telemetry
> + ++ctx->retransmit_ct;
overflow protection?
::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c:258
(Diff revision 1)
> + if (ctx->stun->error_code == 401) {
> + // track 401s for ice telemetry
> + ++ctx->tctx->cnt_401s;
As an initial 401 is expected from the TURN server I think we should not count the initial 401's here, but further down in the |if (ctx->retry_ct >0)| block.
::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c:300
(Diff revision 1)
> ABORT(R_FAILED);
> }
> break;
>
> case NR_STUN_CLIENT_STATE_TIMED_OUT:
> ABORT(R_FAILED);
Turn timeouts are probably also very interesting for a follow up bug.
::: toolkit/components/telemetry/Histograms.json:6941
(Diff revision 1)
> "n_buckets": 20,
> "description": "The length of time (in milliseconds) that a trickle candidate took to arrive after the start of ICE, given that it arrived after ICE failed."
> },
> "WEBRTC_ICE_SUCCESS_TIME": {
> "alert_emails": ["webrtc-ice-telemetry-alerts@mozilla.com"],
> - "expires_in_version": "53",
> + "expires_in_version": "55",
I think these 53 values should be taken care of in bug 1319268
::: toolkit/components/telemetry/Histograms.json:7255
(Diff revision 1)
> "high": 30,
> "n_buckets": 29,
> "description": "The number of times AddIceCandidate failed on a given PeerConnection, given that ICE failed."
> },
> + "WEBRTC_TEST_ICE_NICER": {
> + "expires_in_version": "55",
55 is pretty low. I would at least go for 58 as the inital version, if not |never|
::: toolkit/components/telemetry/Histograms.json:7256
(Diff revision 1)
> "n_buckets": 29,
> "description": "The number of times AddIceCandidate failed on a given PeerConnection, given that ICE failed."
> },
> + "WEBRTC_TEST_ICE_NICER": {
> + "expires_in_version": "55",
> + "alert_emails": ["webrtc-telemetry-alerts@mozilla.com"],
The email should probably be webrtc-ice-telemetry-alerts@mozilla.com
::: toolkit/components/telemetry/histogram-whitelists.json:1705
(Diff revision 1)
> "WEBRTC_AVSYNC_WHEN_VIDEO_LAGS_AUDIO_MS",
> "WEBRTC_CALL_COUNT",
> "WEBRTC_CALL_DURATION",
> "WEBRTC_CALL_TYPE",
> "WEBRTC_DATACHANNEL_NEGOTIATED",
> + "WEBRTC_TEST_ICE_NICER",
As I haven't seen/used this file before do you know/understand what the purpose is of this white list?
Attachment #8821424 -
Flags: review?(drno) → review-
Assignee | ||
Comment 3•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review101574
> overflow protection?
done
> overflow protection?
done
> This looks to me like it's accumulating |stun_retransmits| only for relayed candidates and the stats from non relay candidates are ignored. Or do I miss something here?
During testing, I never once saw u.srvrflx.stun != 0, which is why I wasn't capturing stun retransmits here. I catch them in ice_candidate_pair.c, ice_candidate.c, and ice_component.c. I can easily check for non-null u.srvrflx.stun and do the accumulation here too if you'd like. My concern is that it might double count. I'm going to instrument and do some more testing.
> overflow protection?
done
> As I haven't seen/used this file before do you know/understand what the purpose is of this white list?
After an IRC tip from dmajor pointing to Bug 1219768, I've removed our entry from this file. It is grandfathering probes that don't include bug_numbers and alert_emails fields.
Comment 4•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review101574
> During testing, I never once saw u.srvrflx.stun != 0, which is why I wasn't capturing stun retransmits here. I catch them in ice_candidate_pair.c, ice_candidate.c, and ice_component.c. I can easily check for non-null u.srvrflx.stun and do the accumulation here too if you'd like. My concern is that it might double count. I'm going to instrument and do some more testing.
Never mind. Your reply made me have a look again and realized that I must have mis-read the code the first time. I though all the other places only do retrnasmit_ct++, where in fact they add to the telemetry struct.
Comment hidden (mozreview-request) |
Comment 6•8 years ago
|
||
mozreview-review |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review105400
::: media/webrtc/signaling/src/peerconnection/PeerConnectionMedia.cpp:1115
(Diff revision 2)
> + CSFLogDebug(logTag, "Ice Telemetry: stun_retransmits: %d",
> + stats.stun_retransmits);
> + CSFLogDebug(logTag, "Ice Telemetry: turn_401s: %d", stats.turn_401s);
> + CSFLogDebug(logTag, "Ice Telemetry: turn_403s: %d", stats.turn_403s);
> + CSFLogDebug(logTag, "Ice Telemetry: turn_438s: %d", stats.turn_438s);
Can we combine this into one message?
Attachment #8821424 -
Flags: review?(drno) → review+
![]() |
||
Comment 7•8 years ago
|
||
mozreview-review |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review105422
I'm not sure what i should look at here, so i just looked at the Telemetry / Histogram usage.
I think you don't need to use keyed histograms here as you have a well known set of values to measure - see the comment below.
I'll be out on PTO, please flag :dexter or :chutten if you need more feedback.
Note that after code review, this will still need data collection review:
https://wiki.mozilla.org/Firefox/Data_Collection
::: toolkit/components/telemetry/Histograms.json:7307
(Diff revision 2)
> },
> + "WEBRTC_ICE_NICER_STATS": {
> + "expires_in_version": "never",
> + "alert_emails": ["webrtc-ice-telemetry-alerts@mozilla.com"],
> + "bug_numbers": [1325536],
> + "kind": "count",
I think this should just a categorical histogram, see this example:
https://dxr.mozilla.org/mozilla-central/rev/de67fccc4c64a49f261aea29141357b94c7b3b9c/toolkit/components/telemetry/Histograms.json#5313
This is the API call:
https://dxr.mozilla.org/mozilla-central/rev/de67fccc4c64a49f261aea29141357b94c7b3b9c/toolkit/components/telemetry/Telemetry.h#105
Attachment #8821424 -
Flags: review?(gfritzsche) → review-
Assignee | ||
Comment 8•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] [away Jan 14 - 24] from comment #7)
> I think you don't need to use keyed histograms here as you have a well known
> set of values to measure - see the comment below.
The main reason we went with the keyed histogram solution is that we will be adding additional ICE telemetry over time. We picked these 4 just to get started. Does that change Georg's suggestion to use categorical histogram?
Flags: needinfo?(chutten)
Assignee | ||
Comment 9•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review105400
> Can we combine this into one message?
Done.
Assignee | ||
Updated•8 years ago
|
Attachment #8821424 -
Flags: review?(benjamin)
Comment 10•8 years ago
|
||
mozreview-review |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review106006
::: toolkit/components/telemetry/Histograms.json:7309
(Diff revision 2)
> + "expires_in_version": "never",
> + "alert_emails": ["webrtc-ice-telemetry-alerts@mozilla.com"],
> + "bug_numbers": [1325536],
> + "kind": "count",
> + "keyed": true,
> + "description": "nICEr stats by key (stun_retransmits, turn_401s, turn_403s, turn_438s)."
Georg is right, this shouldn't be a keyed histogram. They are expensive to process and unnecessary in this case. Categorical or enumerated would do fine.
I'd really like this comment to be more descriptive. Imagine somebody who doesn't know anything about webrtc and is reviewing our data collection docs. They don't know what ICE is and can't figure out when this might be collected. Maybe link to the webrtc spec or other documentation?
And although I know now what ICE is, I'm not sure exactly what each of these is counting. When there is an ICE, does it record exactly one of these reasons, or can it record more than one reason per connection problem? It's good to document the invariants.
You should start collecting this temporarily to make sure that it's measuring something useful, before we decide whether it's worth making permanent. Please expire in 57 to start.
Attachment #8821424 -
Flags: review?(benjamin) → review-
Comment 11•8 years ago
|
||
I agree with bsmedberg: Categorical is an excellent type for this, allowing you to use the labels conveniently. Plus, with bug 1312806, adding new category labels is simplicity itself.
Flags: needinfo?(chutten)
Assignee | ||
Comment 12•8 years ago
|
||
(In reply to Chris H-C :chutten from comment #11)
> I agree with bsmedberg: Categorical is an excellent type for this, allowing
> you to use the labels conveniently. Plus, with bug 1312806, adding new
> category labels is simplicity itself.
The API for AccumulateCategorical appears to require N calls for a counter value of N. These counts are accumulated deep in the nICEr stack which does not have access to our c++ Telemetry calls. I could not find any c++ usage of a categorical histogram to confirm this. If this is the expected usage, it seems like categorical does not really fit our needs.
Flags: needinfo?(chutten)
Comment 13•8 years ago
|
||
Maybe I need to take a step back, as I'm not sure I properly understand what your data reqs are here. Would you be willing to find me on #telemetry so we can chat about what you want to measure and I can flail wildly until I hit upon a solution that satisfies your data needs?
Flags: needinfo?(chutten)
Assignee | ||
Comment 14•8 years ago
|
||
After more conversation with chutten and dexter, it looks like Telemetry Scalars are more appropriate for our usage. From chutten and dexter:
TelemetryScalar.h contains the API: Add(id, value) Set(id, value) and SetMaximum(id, value)
If you take a look at toolkit/components/telemetry/Scalars.yaml you can see existing scalars
You can organize them by namespaces, which seems directly relevant to your interests as you want to store a bunch of stuff that isn't just different options of the same one thing, but is in the same "domain"
here's how to set a scalar from JS -> http://searchfox.org/mozilla-central/source/browser/modules/BrowserUsageTelemetry.jsm#148
how to do the same in C++ -> http://searchfox.org/mozilla-central/rev/790b2cb423ea1ecb5746722d51633caec9bab95a/toolkit/components/telemetry/tests/gtest/TestScalars.cpp#146
Comment hidden (mozreview-request) |
Comment 16•8 years ago
|
||
mozreview-review |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review106796
The use of scalars seems correct to me.
You'll need a Data Review on this. And if you don't already, a technical review of the actual counting and transfer mechanism as I'm not a domain expert and can't tell if you are counting the right things at the right time or not, and are passing them up and down the tech stack properly or not.
Attachment #8821424 -
Flags: review?(chutten) → review+
Comment 17•8 years ago
|
||
mozreview-review |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review107092
data-r=me with as much of that is possible.
::: toolkit/components/telemetry/Scalars.yaml:288
(Diff revision 3)
> kind: boolean
> keyed: true
> notification_emails:
> - telemetry-client-dev@mozilla.com
> +
> +# The following section contains WebRTC nICEr scalars
As comments, these are readable in this file but won't show up in the autogenerated docs. Could you add these links to the `description` field of each of these metrics? Repeating isn't bad ;-)
Attachment #8821424 -
Flags: review+
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Updated•8 years ago
|
Keywords: checkin-needed
Comment 20•8 years ago
|
||
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/95d09880ddf6
ice telemetry. r=bsmedberg,chutten,drno
Keywords: checkin-needed
Comment 21•8 years ago
|
||
Comment 22•8 years ago
|
||
mozreview-review |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review108178
::: toolkit/components/telemetry/Scalars.yaml:328
(Diff revision 5)
> +
> +# The following section contains WebRTC nICEr scalars
> +# For more info on ICE, see https://tools.ietf.org/html/rfc5245
> +# For more info on STUN, see https://tools.ietf.org/html/rfc5389
> +# For more info on TURN, see https://tools.ietf.org/html/rfc5766
> +webrtc.nicer:
Are these measurements collected on the main process or on a different process? If no 'record_in_processes' property is specify, the measure is assumed to be collected on the main process: collection on other processes is discarded.
Assignee | ||
Comment 23•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8821424 [details]
Bug 1325536 - ice telemetry.
https://reviewboard.mozilla.org/r/100710/#review108178
> Are these measurements collected on the main process or on a different process? If no 'record_in_processes' property is specify, the measure is assumed to be collected on the main process: collection on other processes is discarded.
These measurements are collected in the content process. I was seeing results in {profile}/datareporting/archived/yyyy-mm/*.jsonlz4 when running a non-debug build.
Comment 24•8 years ago
|
||
Michael, thank you for your patience! Bug 1333578 just landed on central: you can rebase your patch on it and it shouldn't crash anymore.
Flags: needinfo?(mfroman)
Assignee | ||
Comment 25•8 years ago
|
||
Thank you! Just did a quick check here on local build, and it looks good. I'm going to rebase and make sure I get a good try run.
Flags: needinfo?(mfroman)
Comment 26•8 years ago
|
||
Ah, one last thing! There was a conversation on fhr-dev [1] about this probe. You should add the "main" process to the 'record_in_processes' list, along with the "content" process. By doing this you make sure that you're able to collect data on non-e10s enabled builds.
[1] - https://mail.mozilla.org/pipermail/fhr-dev/2017-January/001128.html
Flags: needinfo?(mfroman)
Comment hidden (mozreview-request) |
Assignee | ||
Updated•8 years ago
|
Keywords: checkin-needed
Comment 29•8 years ago
|
||
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e8805c2f775a
ice telemetry. r=bsmedberg,chutten,drno
Keywords: checkin-needed
![]() |
||
Comment 30•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox54:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
You need to log in
before you can comment on or make changes to this bug.
Description
•