Can we now use standard telemetry for webrtc stats?
Categories
(Core :: WebRTC: Signaling, enhancement, P3)
Tracking
()
| | Tracking | Status |
|---|---|---|
| firefox72 | --- | fixed |
People
(Reporter: chutten, Assigned: dminor)
Details
(Whiteboard: [measurement:client:tracking])
Attachments
(2 files)
webrtc stats are one of the few remaining pieces of telemetry captured on childPayloads instead of being aggregated to the parent. Unfortunately, webrtc stats are by their own nature quite complicated.

We first introduced a pair of probes in bug 970690. Then we introduced a custom struct in bug 1198883. Luckily, no one's using that custom struct yet, so we can still change it further if we're clever.

If we can use standard telemetry primitives (Histograms, Scalars, whatever) we can remove the custom webrtc handling and get client child aggregation for free.

The trick is that we need to record two 2^11 (2048) bitstrings' worth of information[1].

Anyone have a brilliant idea?

[1]: http://searchfox.org/mozilla-central/source/media/webrtc/signaling/src/peerconnection/WebrtcGlobalInformation.cpp#1053-1080
Reporter | ||
Comment 1•7 years ago
So we could have a 2048-bucket categorical histogram.

Pros:
* Excellent tooling support.
* telemetry.mozilla.org might display this exactly the way we want.
* String buckets instead of bitstrings sound easier to read to me.

Cons:
* 2048 buckets... this would be the first histogram quite this wide. Might stress something.
* Have to call Accumulate multiple times per collection (once for each flipped bit). This might not be a big deal, as it could replace the bitstring creation code.

If telemetry.mozilla.org won't satisfy the analysis needs, we're looking at custom analysis. At that point it doesn't really matter what format we use so long as we can munge it in Python (or SQL) later. That opens things up like keyed boolean histograms (one histogram each for success/failure, keys are the bitstrings), keyed uint scalars (ditto), and probably other ideas.

Questions:
1) :drno - what analysis would you like to perform on this data when you get it?
2) :gfritzsche, :Dexter - have any brilliant ideas for storing this data? Any knowledge about internal bucket limits?
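To make the keyed option above a little more concrete, here is a minimal sketch in plain Python. It is not Telemetry code: `success_counts`, `failure_counts`, and `record` are hypothetical stand-ins for a pair of keyed probes, and the 11-bit key width is an assumption taken from the bug description.

```python
from collections import Counter

# Hypothetical stand-ins for two keyed probes (one for success, one for
# failure). The key is the capability bitstring rendered as text.
success_counts: Counter = Counter()
failure_counts: Counter = Counter()


def record(bitmask: int, success: bool, width: int = 11) -> None:
    """Count one connection attempt under its zero-padded bitstring key."""
    key = format(bitmask, f"0{width}b")
    target = success_counts if success else failure_counts
    target[key] += 1


record(0b101, True)
record(0b101, True)
record(0b100, False)

print(dict(success_counts))
print(dict(failure_counts))
```

Only keys that were actually observed are stored, so the payload grows with the number of distinct bitstrings seen rather than with the 2048 possible values.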
Comment 2•7 years ago
(In reply to Chris H-C :chutten from comment #1)
> So we could have a 2048-bucket categorical histogram.

That sounds like a good idea to me, given that it's an exceptional measurement and that we don't want to add 2048-bucket histograms every other day. I'm a bit concerned about the impact on the ping size: our serialization format would basically force 2048 keys to be dumped in the "values" section of this histogram. Is that correct?

> 2) :gfritzsche, :Dexter - have any brilliant ideas for storing this data?
> Any knowledge about internal bucket limits?

As far as I can tell/remember, we only require the histogram to be in a whitelist if more than 100 buckets are needed. We don't seem to enforce any other limit (other than the minimum/default number of 50 buckets for categoricals).
Comment 3•7 years ago
To give a bit of history: the reason this is in custom code is that the default Histograms could not carry the 27 bits we are using. Or there were at least concerns about the size of data to be transferred as each of the 2^27 representations would get transferred (?).

I think the default analyses I would want to perform on this data would be something like:
- show me success vs failure percentages in cases where both sides of the call had IPv6 UDP
- show me success vs failure percentages where only one side had TCP available
- show me how many Windows clients had TCP locally available
...

And obviously ;-) all of that per Firefox version, channel and OS :-)

Ideally we would have some kind of interface similar to the standard Telemetry interface where people can change OS, version etc. in drop-downs. But obviously an initial version with hard-coded queries would be a good start as well.
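The kind of hard-coded query described above could be sketched as follows. This is only an illustration: the bit positions `IPV6_UDP` and `TCP`, the `Call` record, and the helper names are all hypothetical, and the sample data is made up to exercise the functions, not real telemetry.

```python
from typing import Callable, Iterable, NamedTuple, Optional

# Hypothetical bit positions; the real candidate bitmask layout is not
# given in this bug.
IPV6_UDP = 1 << 0
TCP = 1 << 1


class Call(NamedTuple):
    local: int     # capability bitmask of the local side
    remote: int    # capability bitmask of the remote side
    success: bool  # whether ICE succeeded


Predicate = Callable[[Call], bool]


def both_have(bit: int) -> Predicate:
    """Match calls where both sides had the given capability."""
    return lambda call: bool(call.local & bit) and bool(call.remote & bit)


def exactly_one_has(bit: int) -> Predicate:
    """Match calls where only one side had the given capability."""
    return lambda call: bool(call.local & bit) != bool(call.remote & bit)


def success_rate(calls: Iterable[Call], predicate: Predicate) -> Optional[float]:
    """Fraction of matching calls that succeeded, or None if none match."""
    matching = [call for call in calls if predicate(call)]
    if not matching:
        return None
    return sum(call.success for call in matching) / len(matching)


# Made-up sample records, for illustration only.
calls = [
    Call(IPV6_UDP | TCP, IPV6_UDP, True),
    Call(IPV6_UDP, IPV6_UDP, True),
    Call(IPV6_UDP, IPV6_UDP, False),
    Call(TCP, 0, False),
]

print(success_rate(calls, both_have(IPV6_UDP)))
print(success_rate(calls, exactly_one_has(TCP)))
```

Queries like these need the full local and remote bitmasks per call, which is why the storage format has to preserve combinations of capabilities rather than per-capability totals.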
Comment 4•7 years ago
(In reply to Nils Ohlmeier [:drno] from comment #3)
> To give a bit of history: the reason this is in custom code is that the default
> Histograms could not carry the 27 bits we are using. Or there were at least
> concerns about the size of data to be transferred as each of the 2^27
> representations would get transferred (?).

The serialization/transfer of histogram data is sparse, but if you record many of those representations in a single session, that is a concern. AFAIU, performance becomes a concern for the aggregator though for high bucket counts.

> I think the default analyses I would want to perform on this data would
> be something like:
> - show me success vs failure percentages in cases where both sides of the
> call had IPv6 UDP
> - show me success vs failure percentages where only one side had TCP
> available
> - show me how many Windows clients had TCP locally available
> ...

Can you enumerate the standard questions and add standard scalars or histograms for them? (e.g. boolean scalar for "tcp available", boolean histogram for "success/failure with both sides having udp") Then you would have them show up automatically in e.g. the TMO dashboard without further work.
Reporter | ||
Comment 5•7 years ago
Hey :frank, know of any perf/stability concerns for aggregating particularly wide (~2K buckets) histograms?

...but you know what, maybe we can be more clever than this.

We could represent this as four (success/failure and local/remote) 11-bucket categorical histograms. Then if a bit is flipped in the bitstring, we accumulate to that bit's bucket in the categorical histogram.

For example, a bitstring of 4 and a bitstring of 5 (both local, success) would result in values of [1, 0, 2] for the three lowest bits.

So we'd know what proportion of all connections use which features over time, but not pairs of features (that information goes missing at the client).
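The per-bit accumulation described above can be sketched like this. It is plain Python rather than Telemetry code: `accumulate_bits` is a hypothetical helper that models one of the four 11-bucket histograms as a list of counts, with bucket index equal to bit index.

```python
from typing import Iterable, List

NUM_BITS = 11  # one bucket per bit of the capability bitstring


def accumulate_bits(bitstrings: Iterable[int], num_bits: int = NUM_BITS) -> List[int]:
    """Add one to bucket i for every bitstring that has bit i set."""
    buckets = [0] * num_bits
    for value in bitstrings:
        for bit in range(num_bits):
            if value & (1 << bit):
                buckets[bit] += 1
    return buckets


# The example from the comment: bitstrings 4 (0b100) and 5 (0b101).
local_success = accumulate_bits([4, 5])
print(local_success[:3])

# Pair information is lost: these two different sets of connections
# produce identical per-bit counts.
print(accumulate_bits([0b11, 0b00]) == accumulate_bits([0b01, 0b10]))
```

The second comparison shows the trade-off named in the comment: the histograms say how often each feature appears, but cannot tell whether two features appeared together on the same connection.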
Comment 6•7 years ago
(In reply to Chris H-C :chutten from comment #5)
> Hey :frank, know of any perf/stability concerns for aggregating particularly
> wide (~2K buckets) histograms?

Nope, we have a bunch that are 1K wide, and two that are 10K wide. See one here: https://mzl.la/2tc1MWf.
Comment 7•7 years ago
Mass change P2->P3 to align with new Mozilla triage process.
Comment 8•5 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> (In reply to Nils Ohlmeier [:drno] from comment #3)
> > To give a bit of history: the reason this is in custom code is that the default
> > Histograms could not carry the 27 bits we are using. Or there were at least
> > concerns about the size of data to be transferred as each of the 2^27
> > representations would get transferred (?).
>
> The serialization/transfer of histogram data is sparse, but if you record
> many of those representations in a single session, that is a concern.
> AFAIU, performance becomes a concern for the aggregator though for high
> bucket counts.
>
> > I think the default analyses I would want to perform on this data would
> > be something like:
> > - show me success vs failure percentages in cases where both sides of the
> > call had IPv6 UDP
> > - show me success vs failure percentages where only one side had TCP
> > available
> > - show me how many Windows clients had TCP locally available
> > ...
>
> Can you enumerate the standard questions and add standard scalars or
> histograms for them?
> (e.g. boolean scalar for "tcp available", boolean histogram for
> "success/failure with both sides having udp")
> Then you would have them show up automatically in e.g. the TMO dashboard
> without further work.

Looks like this fell between the cracks. The original intent for this telemetry was to answer the question "Why did our ICE success rate go down?". As such, we did not have a small number of things we wanted to monitor. We really did want every combination of capabilities (local and remote).
That said, I don't think anyone has looked at this telemetry in a really long time. I can't even figure out how to find this data anymore. We don't have telemetry for the overall ICE success rate, either.
Has anybody actually used this ICE candidate telemetry in the last year? We may just need to remove this.
Assignee | ||
Comment 11•5 years ago
I'm not using it. I can take care of removing it.
Reporter | ||
Comment 12•5 years ago
Please let me know if I can be of any assistance in its removal.
Assignee | ||
Comment 13•5 years ago
This ICE candidate telemetry has not been used in a long time and in
addition requires special handling by the telemetry code. It is best
removed.
Assignee | ||
Comment 14•5 years ago
The ICE candidate telemetry recorded using this is no longer useful,
and so this code can be safely removed.
Depends on D50656
Comment 15•5 years ago
Pushed by dminor@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e2536fbffa15
Remove ICE candidate telemetry; r=bwc
https://hg.mozilla.org/integration/autoland/rev/bbd49f460213
Remove WebrtcTelemetry and associated code; r=chutten
Comment 16•5 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e2536fbffa15
https://hg.mozilla.org/mozilla-central/rev/bbd49f460213