Closed
Bug 994184
Opened 11 years ago
Closed 10 years ago
[meta] Loop needs to upload ICE success information, logs
Categories
(Hello (Loop) :: General, defect, P5)
Tracking
(firefox34 fixed)
People
(Reporter: abr, Unassigned)
References
Details
(Whiteboard: [tech-risk][feedback][est:1d][p=1, test])
When we initially launch the MLP, it is quite likely that we will see non-trivial connection failures, and we will want to be in a position to rapidly diagnose (and, where possible, fix) the causes of these failures. To that end, we need to be able to get to the ICE statistics; and, if a call fails to set up (i.e., does not transition to "connected" in a reasonable amount of time), upload ICE logging information.
The ICE logging information is already captured in a ring buffer for the purposes of populating the "about:webrtc" panel. The client should be capable of accessing this same ring buffer, and posting it to a server for collection.
Minimally, the server should accept and store the logs and make them available to authorized individuals.
(this is a meta bug -- we need sub-bugs for client and server behaviors)
Comment 1•11 years ago
Is this related to the loop server, or do we have a telemetry server that's already able to handle that?
Comment 2•11 years ago
Moving to mvp for now, but we'll want this soon after mlp if not before.
Updated•11 years ago
Whiteboard: [tech-risk][feedback]
Reporter
Comment 3•11 years ago
(In reply to Mark Banner (:standard8) from comment #2)
> Moving to mvp for now, but we'll want this soon after mlp if not before.
This really needs to be part of the system before it becomes user-discoverable, or we'll be flying blind. If we get out and start having lots of failures without any way of determining why, we're going to be in a really uncomfortable position. In other words, this needs to block MLP release; I'm putting it back into the list. The work on the client side should be relatively small, so I'm not too worried about it impacting our desktop delivery.
(In reply to Alexis Metaireau (:alexis) from comment #1)
> Is this related to the loop server, or do we have a telemetry server that's
> already able to handle that?
I need to touch base with you to determine what is already available from the services team. If we can leverage something already deployed, then that's really the way to do this -- I don't want us reinventing any wheels here. I'll catch you on IRC.
Comment 4•11 years ago
Naive question: is it worth considering (either now or later) using RTCPeerConnection.getStats() to do this in the shared code so that we could easily upload/correlate-but-anonymize stats from both sides of the same call?
Reporter
Comment 5•11 years ago
(In reply to Dan Mosedale (:dmose) from comment #4)
> Naive question: is it worth considering (either now or later) using
> RTCPeerConnection.getStats() to do this in the shared code so that we could
> easily upload/correlate-but-anonymize stats from both sides of the same call?
I'm not sure what in getStats you think we should use to provide this correlation; but it seems it would be easier (and probably more foolproof) to include the room token as part of the log meta-information.
Comment 6•11 years ago
Sorry, I wasn't clear. I meant "use getStats() on both sides purely for statistics collection instead of the raw ringbuffer stuff on the client side only, and, in addition, send necessary data so that the statistics data can be correlated." And yeah, the room token might be just the thing.
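For illustration only (not part of the original comment), a minimal sketch of what "use getStats() on both sides" could look like, assuming the promise-based getStats() and tagging the result with the room token for correlation; the function and parameter names are hypothetical:

// Hypothetical helper: poll getStats() on an RTCPeerConnection and tag
// the result with the room token so both sides of a call can be matched.
function collectCallStats(pc, roomToken) {
  return pc.getStats(null).then(function(report) {
    var stats = [];
    report.forEach(function(entry) {
      stats.push(entry);
    });
    return {
      roomToken: roomToken,   // correlation key shared by both sides of the call
      timestamp: Date.now(),
      stats: stats
    };
  });
}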
Reporter
Comment 7•11 years ago
After talking with Alexis, it would appear that the infrastructure for doing this is most likely in performance and metrics rather than services. I'll be reaching out to Mark Reid to get his take on the best way forward from the server side.
Reporter
Comment 8•11 years ago
Mark Reid pointed me to Bagheera (see https://intranet.mozilla.org/Metrics/bagheera) as the most likely candidate for catching our failure logs. Still researching...
Reporter
Comment 9•11 years ago
I've looked into the client side of what needs to happen here, to confirm that it should be very straightforward. The steps appear to be:
- Acquire the PC (this may take some digging around, and is likely to be the trickiest part).
- Monitor for ICE state changes
- If the state transitions to "failed" (we may also want to set a timer and treat the call as failed if it doesn't connect within a reasonable timeframe), submit a report as follows:
- Instantiate a WebrtcGlobalInformation object (see http://dxr.mozilla.org/mozilla-central/source/dom/webidl/WebrtcGlobalInformation.webidl)
- Call "getLogging" on the WGI object
- In the callback from getLogging (which takes an array of log lines as its argument), send the log to Bagheera (see http://dxr.mozilla.org/mozilla-central/source/services/common/bagheeraclient.js for the API and http://dxr.mozilla.org/mozilla-central/source/services/healthreport/healthreporter.jsm#1447 for an example):
- import("resource://services-common/bagheeraclient.js");
- Instantiate a new BagheeraClient with a URL pointing to the Bagheera server
- Call client.uploadJSON with our designated namespace, a fresh UUID, and an object containing the log file array and other metainformation (e.g., call room token, client version).
This also necessitates having a configuration object that indicates the name of the server to upload to. I'm not meaning to gloss over the UX here, mind you -- we'll need informed user consent, so there is a user interaction step involved as well. The description above is merely intended to spell out the technical bits that aren't as well-worn as things like user dialog boxes.
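A rough sketch of the steps above (illustrative only; the server URL and the "loop" namespace are placeholders, and the exact BagheeraClient API should be checked against bagheeraclient.js):

// Chrome-privileged code; assumes "pc" is the call's PeerConnection
// and "roomToken" identifies the call.
Components.utils.import("resource://services-common/bagheeraclient.js");
Components.utils.import("resource://gre/modules/Services.jsm");

function watchForIceFailure(pc, roomToken) {
  pc.oniceconnectionstatechange = function() {
    if (pc.iceConnectionState === "failed") {
      // getLogging passes the accumulated log lines to its callback.
      WebrtcGlobalInformation.getLogging("", function(logLines) {
        submitFailureLog(logLines, roomToken);
      });
    }
  };
}

function submitFailureLog(logLines, roomToken) {
  var client = new BagheeraClient("https://bagheera.example.com/");  // placeholder URL
  var uuid = Components.classes["@mozilla.org/uuid-generator;1"]
               .getService(Components.interfaces.nsIUUIDGenerator)
               .generateUUID().toString().slice(1, -1);  // strip the surrounding braces
  client.uploadJSON("loop", uuid, {
    roomToken: roomToken,
    clientVersion: Services.appinfo.version,
    log: logLines
  });
}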
Reporter
Comment 10•11 years ago
(In reply to Adam Roach [:abr] from comment #8)
> Mark Reid pointed me to Bagheera (see
> https://intranet.mozilla.org/Metrics/bagheera) as the most likely candidate
> for catching our failure logs. Still researching...
On further discussion, we may be taking this down a different path. I'm going to talk to Mark more when he gets back into the office next week.
Updated•11 years ago
Whiteboard: [tech-risk][feedback] → [tech-risk][feedback][est:1d]
Updated•11 years ago
backlog: --- → mlp+
Reporter
Comment 11•11 years ago
Okay, the more likely final answer is that we're going to use the telemetry servers directly to collect these. This is the same interface as is used by the FxOS FTU pings; you can see the general approach described here: https://wiki.mozilla.org/FirefoxOS/Metrics#Details
The path components they describe are actually wired into the telemetry schema, so we'll be using the same URL format, but with "loop" instead of "ftu" and "fxos": https://github.com/mozilla/telemetry-server/blob/master/telemetry/telemetry_schema.json
When we detect a call failure, the general sequence of events will be to create a JSON object, gzip it, and then use an XHR POST to send it to a URL of the format described above. The server takes this compressed blob, and (from Mark Reid): "uncompresses, validates, converts, recompresses, and stores" it.
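For illustration, a minimal sketch of that sequence (the host and path format are placeholders following the FTU-ping pattern, not final decisions from this bug; compression is reduced to a comment since the exact gzip mechanism isn't specified here):

// Hypothetical submission helper: build the failure report, then POST it
// to the telemetry endpoint. In the real client the JSON body would be
// gzip-compressed before sending.
function uploadFailureReport(report, docId) {
  var url = "https://telemetry.example.org/submit/loop/" + docId;  // placeholder URL format
  var xhr = new XMLHttpRequest();
  xhr.open("POST", url, true);
  xhr.setRequestHeader("Content-Type", "application/json");
  xhr.send(JSON.stringify(report));
}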
Log processing is performed as described here: http://mreid-moz.github.io/blog/2013/11/06/current-state-of-telemetry-analysis/
An example of the processing that we do for the FTU ping logs can be found here: https://github.com/mozilla/telemetry-server/tree/master/mapreduce/fxosping
I'm pretty sure that's the final answer. So, this meta-bug needs three sub-bugs: a small bit of code for the client side; a small bit of code for the telemetry processing; and a simple bug to stand up the telemetry server that will take "a day or two" to provision. I will be adding these new bugs shortly.
Updated•11 years ago
Whiteboard: [tech-risk][feedback][est:1d] → [tech-risk][feedback][est:1d][p=1]
Updated•11 years ago
Priority: -- → P2
Target Milestone: --- → mozilla32
Comment 12•11 years ago
This will likely be the first bug to land post-MLP (later this week).
Updated•11 years ago
Target Milestone: mozilla33 → mozilla34
Comment 13•11 years ago
Leaving the meta bug open so we don't lose track of Bug 999028, which is under telemetry.
Whiteboard: [tech-risk][feedback][est:1d][p=1] → [tech-risk][feedback][est:1d][p=1, test]
Comment 14•11 years ago
Asked ABR for the status of what we need to do for bug 999028, or who we need to contact.
Priority: P2 → P5
Comment 15•10 years ago
Adam -- Where does this stand? I believe it depends on Telemetry (bug 999028) -- anything else?
Flags: needinfo?(adam)
Reporter
Comment 16•10 years ago
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #15)
> Adam -- Where does this stand? I believe it depends on Telemetry (bug
> 999028) -- anything else?
As far as I know, the only remaining component here is the telemetry analysis bug. Although I have no easy way to verify this, I believe we've been collecting ICE failure information for some time now, and simply need a means to get at it. That's what Bug 999028 is supposed to provide.
Flags: needinfo?(adam)
Comment 17•10 years ago
Resolving the meta bug now that it has been broken down; the only open item is analyzing the collected data, and that is in progress.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 18•10 years ago
James, is this something you can help test when you're back from PTO? If it's unnecessary at this point feel free to flag qe-verify-.
Flags: qe-verify+
Flags: needinfo?(jbonacci)
QA Contact: jbonacci
Comment 19•10 years ago
There is nothing apparent/obvious here for Services QA.
Flags: qe-verify-
Flags: qe-verify+
Flags: needinfo?(jbonacci)
status-firefox34: --- → fixed