1325536 - Add ice telemetry to begin replacing stats provided by Hello calls

Michael Froman [:mjf]

Assignee

Description

•

8 years ago

We're missing valuable stats that were provided by Hello.

Michael Froman [:mjf]

Assignee

Updated

•

8 years ago

Rank: 25

Comment hidden (mozreview-request)

Nils Ohlmeier [:drno]

Comment 2

•

8 years ago

mozreview-review

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review101574 In general looks good. My biggest concern is (and that's the reason for the r-) is that the accumulation code looks like it's only accumulating STUN retransmissions for relayed candidates. So I would like to see it again with the fixes. ::: media/mtransport/nricectx.h:194 (Diff revision 1) > std::string alpn_; > }; > > class TestNat; > > +class NrIceTelemetry { Since Telemetry is a Mozilla specific term I would prefer if at least the parts down in .../third_party/nICEr/* would use 'stats' rather than 'telemetry'. Not sure if it easier to also rename everything up in media/mtransport/* as well to keep it consistent. ::: media/mtransport/nricectxhandler.cpp:119 (Diff revision 1) > + new_ctx->ctx_->telemetry.stun_retransmits += telemetry.stun_retransmits; > + new_ctx->ctx_->telemetry.turn_401s += telemetry.turn_401s; > + new_ctx->ctx_->telemetry.turn_403s += telemetry.turn_403s; > + new_ctx->ctx_->telemetry.turn_438s += telemetry.turn_438s; All the += operations in here make me wonder if need to wrap these in some common operation which prevents overflows of these (with the assumption that reporting the max value makes more sense than the smaller value after an overflow). ::: media/mtransport/third_party/nICEr/src/ice/ice_candidate.c:824 (Diff revision 1) > cand->u.srvrflx.stun_handle=0; > } > > + // record retransmits back to the ice ctx > + if (cand->u.srvrflx.stun->retransmit_ct) { > + cand->ctx->telemetry.stun_retransmits += cand->u.srvrflx.stun->retransmit_ct; overflow protection? ::: media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c:207 (Diff revision 1) > if(pair->state==NR_ICE_PAIR_STATE_SUCCEEDED) > goto done; > > + // record retransmits back to the ice ctx > + if (pair->stun_client && pair->stun_client->retransmit_ct) { > + pair->local->ctx->telemetry.stun_retransmits += pair->stun_client->retransmit_ct; overflow protection? ::: media/mtransport/third_party/nICEr/src/ice/ice_component.c:1284 (Diff revision 1) > + // record retransmits back to the ice ctx > + if (comp->consent_ctx->retransmit_ct) { > + comp->ctx->telemetry.stun_retransmits += comp->consent_ctx->retransmit_ct; > + } > + Consent is actually kind of different beast. It has per definition no retransmissions, but uses new STUN transactions for every re-try and always sends only a single request. Hence this would drive our retrnamissions counters through the roof for ever longer call, as this would count every successful consent refresh as a STUN retransmission. So lets remove this here. I think we should have follow up ticket to add stats for consent time outs (and maybe for the failures). ::: media/mtransport/third_party/nICEr/src/ice/ice_ctx.h:114 (Diff revision 1) > +typedef struct nr_ice_telemetry_ { > + UINT2 stun_retransmits; > + UINT2 turn_401s; > + UINT2 turn_403s; > + UINT2 turn_438s; > +} nr_ice_telemetry; Lets call these nr_ice_stats. ::: media/mtransport/third_party/nICEr/src/ice/ice_ctx.c:1073 (Diff revision 1) > + if (cand->type == RELAYED) { > + ctx->telemetry.turn_401s += cand->u.relayed.turn->cnt_401s; > + ctx->telemetry.turn_403s += cand->u.relayed.turn->cnt_403s; > + ctx->telemetry.turn_438s += cand->u.relayed.turn->cnt_438s; > + > + stun_ctx = STAILQ_FIRST(&cand->u.relayed.turn->stun_ctxs); > + while (stun_ctx) { > + ctx->telemetry.stun_retransmits += stun_ctx->stun->retransmit_ct; > + > + stun_ctx = STAILQ_NEXT(stun_ctx, entry); > + } > + } This looks to me like it's accumulating |stun_retransmits| only for relayed candidates and the stats from non relay candidates are ignored. Or do I miss something here? ::: media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c:256 (Diff revision 1) > > if (ctx->state != NR_STUN_CLIENT_STATE_RUNNING) > ABORT(R_NOT_PERMITTED); > > + // track retransmits for ice telemetry > + ++ctx->retransmit_ct; overflow protection? ::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c:258 (Diff revision 1) > + if (ctx->stun->error_code == 401) { > + // track 401s for ice telemetry > + ++ctx->tctx->cnt_401s; As an initial 401 is expected from the TURN server I think we should not count the initial 401's here, but further down in the |if (ctx->retry_ct >0)| block. ::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c:300 (Diff revision 1) > ABORT(R_FAILED); > } > break; > > case NR_STUN_CLIENT_STATE_TIMED_OUT: > ABORT(R_FAILED); Turn timeouts are probably also very interesting for a follow up bug. ::: toolkit/components/telemetry/Histograms.json:6941 (Diff revision 1) > "n_buckets": 20, > "description": "The length of time (in milliseconds) that a trickle candidate took to arrive after the start of ICE, given that it arrived after ICE failed." > }, > "WEBRTC_ICE_SUCCESS_TIME": { > "alert_emails": ["webrtc-ice-telemetry-alerts@mozilla.com"], > - "expires_in_version": "53", > + "expires_in_version": "55", I think these 53 values should be taken care of in bug 1319268 ::: toolkit/components/telemetry/Histograms.json:7255 (Diff revision 1) > "high": 30, > "n_buckets": 29, > "description": "The number of times AddIceCandidate failed on a given PeerConnection, given that ICE failed." > }, > + "WEBRTC_TEST_ICE_NICER": { > + "expires_in_version": "55", 55 is pretty low. I would at least go for 58 as the inital version, if not |never| ::: toolkit/components/telemetry/Histograms.json:7256 (Diff revision 1) > "n_buckets": 29, > "description": "The number of times AddIceCandidate failed on a given PeerConnection, given that ICE failed." > }, > + "WEBRTC_TEST_ICE_NICER": { > + "expires_in_version": "55", > + "alert_emails": ["webrtc-telemetry-alerts@mozilla.com"], The email should probably be webrtc-ice-telemetry-alerts@mozilla.com ::: toolkit/components/telemetry/histogram-whitelists.json:1705 (Diff revision 1) > "WEBRTC_AVSYNC_WHEN_VIDEO_LAGS_AUDIO_MS", > "WEBRTC_CALL_COUNT", > "WEBRTC_CALL_DURATION", > "WEBRTC_CALL_TYPE", > "WEBRTC_DATACHANNEL_NEGOTIATED", > + "WEBRTC_TEST_ICE_NICER", As I haven't seen/used this file before do you know/understand what the purpose is of this white list?

Attachment #8821424 - Flags: review?(drno) → review-

Michael Froman [:mjf]

Assignee

Comment 3

•

8 years ago

mozreview-review-reply

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review101574 > overflow protection? done > overflow protection? done > This looks to me like it's accumulating |stun_retransmits| only for relayed candidates and the stats from non relay candidates are ignored. Or do I miss something here? During testing, I never once saw u.srvrflx.stun != 0, which is why I wasn't capturing stun retransmits here. I catch them in ice_candidate_pair.c, ice_candidate.c, and ice_component.c. I can easily check for non-null u.srvrflx.stun and do the accumulation here too if you'd like. My concern is that it might double count. I'm going to instrument and do some more testing. > overflow protection? done > As I haven't seen/used this file before do you know/understand what the purpose is of this white list? After an IRC tip from dmajor pointing to Bug 1219768, I've removed our entry from this file. It is grandfathering probes that don't include bug_numbers and alert_emails fields.

Nils Ohlmeier [:drno]

Comment 4

•

8 years ago

mozreview-review-reply

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review101574 > During testing, I never once saw u.srvrflx.stun != 0, which is why I wasn't capturing stun retransmits here. I catch them in ice_candidate_pair.c, ice_candidate.c, and ice_component.c. I can easily check for non-null u.srvrflx.stun and do the accumulation here too if you'd like. My concern is that it might double count. I'm going to instrument and do some more testing. Never mind. Your reply made me have a look again and realized that I must have mis-read the code the first time. I though all the other places only do retrnasmit_ct++, where in fact they add to the telemetry struct.

Comment hidden (mozreview-request)

Michael Froman [:mjf]

Assignee

Updated

•

8 years ago

Blocks: 1331068

Nils Ohlmeier [:drno]

Comment 6

•

8 years ago

mozreview-review

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review105400 ::: media/webrtc/signaling/src/peerconnection/PeerConnectionMedia.cpp:1115 (Diff revision 2) > + CSFLogDebug(logTag, "Ice Telemetry: stun_retransmits: %d", > + stats.stun_retransmits); > + CSFLogDebug(logTag, "Ice Telemetry: turn_401s: %d", stats.turn_401s); > + CSFLogDebug(logTag, "Ice Telemetry: turn_403s: %d", stats.turn_403s); > + CSFLogDebug(logTag, "Ice Telemetry: turn_438s: %d", stats.turn_438s); Can we combine this into one message?

Attachment #8821424 - Flags: review?(drno) → review+

Nils Ohlmeier [:drno]

Updated

•

8 years ago

Blocks: 1331130

Georg Fritzsche [:gfritzsche]

Comment 7

•

8 years ago

mozreview-review

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review105422 I'm not sure what i should look at here, so i just looked at the Telemetry / Histogram usage. I think you don't need to use keyed histograms here as you have a well known set of values to measure - see the comment below. I'll be out on PTO, please flag :dexter or :chutten if you need more feedback. Note that after code review, this will still need data collection review: https://wiki.mozilla.org/Firefox/Data_Collection ::: toolkit/components/telemetry/Histograms.json:7307 (Diff revision 2) > }, > + "WEBRTC_ICE_NICER_STATS": { > + "expires_in_version": "never", > + "alert_emails": ["webrtc-ice-telemetry-alerts@mozilla.com"], > + "bug_numbers": [1325536], > + "kind": "count", I think this should just a categorical histogram, see this example: https://dxr.mozilla.org/mozilla-central/rev/de67fccc4c64a49f261aea29141357b94c7b3b9c/toolkit/components/telemetry/Histograms.json#5313 This is the API call: https://dxr.mozilla.org/mozilla-central/rev/de67fccc4c64a49f261aea29141357b94c7b3b9c/toolkit/components/telemetry/Telemetry.h#105

Attachment #8821424 - Flags: review?(gfritzsche) → review-

Nils Ohlmeier [:drno]

Updated

•

8 years ago

Blocks: 1247542

Michael Froman [:mjf]

Assignee

Comment 8

•

8 years ago

(In reply to Georg Fritzsche [:gfritzsche] [away Jan 14 - 24] from comment #7) > I think you don't need to use keyed histograms here as you have a well known > set of values to measure - see the comment below. The main reason we went with the keyed histogram solution is that we will be adding additional ICE telemetry over time. We picked these 4 just to get started. Does that change Georg's suggestion to use categorical histogram?

Flags: needinfo?(chutten)

Michael Froman [:mjf]

Assignee

Comment 9

•

8 years ago

mozreview-review-reply

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review105400 > Can we combine this into one message? Done.

Michael Froman [:mjf]

Assignee

Updated

•

8 years ago

Attachment #8821424 - Flags: review?(benjamin)

Benjamin Smedberg

Comment 10

•

8 years ago

mozreview-review

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review106006 ::: toolkit/components/telemetry/Histograms.json:7309 (Diff revision 2) > + "expires_in_version": "never", > + "alert_emails": ["webrtc-ice-telemetry-alerts@mozilla.com"], > + "bug_numbers": [1325536], > + "kind": "count", > + "keyed": true, > + "description": "nICEr stats by key (stun_retransmits, turn_401s, turn_403s, turn_438s)." Georg is right, this shouldn't be a keyed histogram. They are expensive to process and unnecessary in this case. Categorical or enumerated would do fine. I'd really like this comment to be more descriptive. Imagine somebody who doesn't know anything about webrtc and is reviewing our data collection docs. They don't know what ICE is and can't figure out when this might be collected. Maybe link to the webrtc spec or other documentation? And although I know now what ICE is, I'm not sure exactly what each of these is counting. When there is an ICE, does it record exactly one of these reasons, or can it record more than one reason per connection problem? It's good to document the invariants. You should start collecting this temporarily to make sure that it's measuring something useful, before we decide whether it's worth making permanent. Please expire in 57 to start.

Attachment #8821424 - Flags: review?(benjamin) → review-

Chris H-C :chutten|PTO (back Oct 23)

Comment 11

•

8 years ago

I agree with bsmedberg: Categorical is an excellent type for this, allowing you to use the labels conveniently. Plus, with bug 1312806, adding new category labels is simplicity itself.

Flags: needinfo?(chutten)

Michael Froman [:mjf]

Assignee

Comment 12

•

8 years ago

(In reply to Chris H-C :chutten from comment #11) > I agree with bsmedberg: Categorical is an excellent type for this, allowing > you to use the labels conveniently. Plus, with bug 1312806, adding new > category labels is simplicity itself. The API for AccumulateCategorical appears to require N calls for a counter value of N. These counts are accumulated deep in the nICEr stack which does not have access to our c++ Telemetry calls. I could not find any c++ usage of a categorical histogram to confirm this. If this is the expected usage, it seems like categorical does not really fit our needs.

Flags: needinfo?(chutten)

Chris H-C :chutten|PTO (back Oct 23)

Comment 13

•

8 years ago

Maybe I need to take a step back, as I'm not sure I properly understand what your data reqs are here. Would you be willing to find me on #telemetry so we can chat about what you want to measure and I can flail wildly until I hit upon a solution that satisfies your data needs?

Flags: needinfo?(chutten)

Michael Froman [:mjf]

Assignee

Comment 14

•

8 years ago

After more conversation with chutten and dexter, it looks like Telemetry Scalars are more appropriate for our usage. From chutten and dexter: TelemetryScalar.h contains the API: Add(id, value) Set(id, value) and SetMaximum(id, value) If you take a look at toolkit/components/telemetry/Scalars.yaml you can see existing scalars You can organize them by namespaces, which seems directly relevant to your interests as you want to store a bunch of stuff that isn't just different options of the same one thing, but is in the same "domain" here's how to set a scalar from JS -> http://searchfox.org/mozilla-central/source/browser/modules/BrowserUsageTelemetry.jsm#148 how to do the same in C++ -> http://searchfox.org/mozilla-central/rev/790b2cb423ea1ecb5746722d51633caec9bab95a/toolkit/components/telemetry/tests/gtest/TestScalars.cpp#146

Comment hidden (mozreview-request)

Chris H-C :chutten|PTO (back Oct 23)

Comment 16

•

8 years ago

mozreview-review

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review106796 The use of scalars seems correct to me. You'll need a Data Review on this. And if you don't already, a technical review of the actual counting and transfer mechanism as I'm not a domain expert and can't tell if you are counting the right things at the right time or not, and are passing them up and down the tech stack properly or not.

Attachment #8821424 - Flags: review?(chutten) → review+

Benjamin Smedberg

Comment 17

•

8 years ago

mozreview-review

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review107092 data-r=me with as much of that is possible. ::: toolkit/components/telemetry/Scalars.yaml:288 (Diff revision 3) > kind: boolean > keyed: true > notification_emails: > - telemetry-client-dev@mozilla.com > + > +# The following section contains WebRTC nICEr scalars As comments, these are readable in this file but won't show up in the autogenerated docs. Could you add these links to the `description` field of each of these metrics? Repeating isn't bad ;-)

Attachment #8821424 - Flags: review+

Comment hidden (mozreview-request)

Michael Froman [:mjf]

Assignee

Updated

•

8 years ago

Keywords: checkin-needed

Pulsebot

Comment 20

•

8 years ago

Pushed by ryanvm@gmail.com: https://hg.mozilla.org/integration/autoland/rev/95d09880ddf6 ice telemetry. r=bsmedberg,chutten,drno

Keywords: checkin-needed

Phil Ringnalda (:philor)

Comment 21

•

8 years ago

Backed out in https://hg.mozilla.org/integration/autoland/rev/706dc77f1554 for https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=95d09880ddf6cdcbf6803ad93eaccd61bf75b655&group_state=expanded&filter-failure_classification_id=2

Alessio Placitelli [:Dexter]

Updated

•

8 years ago

Depends on: 1333578

Alessio Placitelli [:Dexter]

Comment 22

•

8 years ago

mozreview-review

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review108178 ::: toolkit/components/telemetry/Scalars.yaml:328 (Diff revision 5) > + > +# The following section contains WebRTC nICEr scalars > +# For more info on ICE, see https://tools.ietf.org/html/rfc5245 > +# For more info on STUN, see https://tools.ietf.org/html/rfc5389 > +# For more info on TURN, see https://tools.ietf.org/html/rfc5766 > +webrtc.nicer: Are these measurements collected on the main process or on a different process? If no 'record_in_processes' property is specify, the measure is assumed to be collected on the main process: collection on other processes is discarded.

Michael Froman [:mjf]

Assignee

Comment 23

•

8 years ago

mozreview-review-reply

Comment on attachment 8821424 [details] Bug 1325536 - ice telemetry. https://reviewboard.mozilla.org/r/100710/#review108178 > Are these measurements collected on the main process or on a different process? If no 'record_in_processes' property is specify, the measure is assumed to be collected on the main process: collection on other processes is discarded. These measurements are collected in the content process. I was seeing results in {profile}/datareporting/archived/yyyy-mm/*.jsonlz4 when running a non-debug build.

Alessio Placitelli [:Dexter]

Comment 24

•

8 years ago

Michael, thank you for your patience! Bug 1333578 just landed on central: you can rebase your patch on it and it shouldn't crash anymore.

Flags: needinfo?(mfroman)

Michael Froman [:mjf]

Assignee

Comment 25

•

8 years ago

Thank you! Just did a quick check here on local build, and it looks good. I'm going to rebase and make sure I get a good try run.

Flags: needinfo?(mfroman)

Alessio Placitelli [:Dexter]

Comment 26

•

8 years ago

Ah, one last thing! There was a conversation on fhr-dev [1] about this probe. You should add the "main" process to the 'record_in_processes' list, along with the "content" process. By doing this you make sure that you're able to collect data on non-e10s enabled builds. [1] - https://mail.mozilla.org/pipermail/fhr-dev/2017-January/001128.html

Flags: needinfo?(mfroman)

Comment hidden (mozreview-request)

Michael Froman [:mjf]

Assignee

Comment 28

•

8 years ago

Done! Thank you!

Flags: needinfo?(mfroman)

Michael Froman [:mjf]

Assignee

Updated

•

8 years ago

Keywords: checkin-needed

Pulsebot

Comment 29

•

8 years ago

Pushed by cbook@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e8805c2f775a ice telemetry. r=bsmedberg,chutten,drno

Keywords: checkin-needed

Carsten Book [:Tomcat]

Comment 30

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/e8805c2f775a

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox54: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla54

Alessio Placitelli [:Dexter]

Updated

•

8 years ago

Depends on: 1333024

Nils Ohlmeier [:drno]

Updated

•

8 years ago

Blocks: 1338273