Closed Bug 994184 Opened 10 years ago Closed 10 years ago

[meta] Loop needs to upload ICE success information, logs

Categories

(Hello (Loop) :: General, defect, P5)

x86
macOS
defect

Tracking

(firefox34 fixed)

RESOLVED FIXED
mozilla34
Tracking Status
firefox34 --- fixed
backlog mlp+

People

(Reporter: abr, Unassigned)

References

Details

(Whiteboard: [tech-risk][feedback][est:1d][p=1, test])

When we initially launch the MLP, it is quite likely that we will see non-trivial connection failures, and we will want to be in a position to rapidly diagnose (and, where possible, fix) the causes of these failures. To that end, we need to be able to get to the ICE statistics; and, if a call fails to set up (i.e., does not transition to "connected" in a reasonable amount of time), upload ICE logging information.

The ICE logging information is already captured in a ring buffer for the purposes of populating the "about:webrtc" panel. The client should be capable of accessing this same ring buffer, and posting it to a server for collection.

Minimally, the server should accept and store the logs and make them available to authorized individuals.

(this is a meta bug -- we need sub-bugs for client and server behaviors)
Is this related to the loop server, or do we have a telemetry server that's already able to handle that?
Moving to mvp for now, but we'll want this soon after mlp if not before.
Blocks: loop_mvp
No longer blocks: loop_mlp
Whiteboard: [tech-risk][feedback]
(In reply to Mark Banner (:standard8) from comment #2)
> Moving to mvp for now, but we'll want this soon after mlp if not before.

This really needs to be part of the system before it becomes user-discoverable, or we'll be flying blind. If we get out and start having lots of failures without any way of determining why, we're going to be in a really uncomfortable position. In other words, this needs to block MLP release; I'm putting it back into the list. The work on the client side should be relatively small, so I'm not too worried about it impacting our desktop delivery.


(In reply to Alexis Metaireau (:alexis) from comment #1)
> Is this related to the loop server, or do we have a telemetry server that's
> already able to handle that?

I need to touch base with you to determine what is already available from the services team. If we can leverage something already deployed, then that's really the way to do this -- I don't want us reinventing any wheels here. I'll catch you on IRC.
Blocks: loop_mlp
No longer blocks: loop_mvp
Naive question: is it worth considering (either now or later) using RTCPeerConnection.getStats() to do this in the shared code so that we could easily upload/correlate-but-anonymize stats from both sides of the same call?
(In reply to Dan Mosedale (:dmose) from comment #4)
> Naive question: is it worth considering (either now or later) using
> RTCPeerConnection.getStats() to do this in the shared code so that we could
> easily upload/correlate-but-anonymize stats from both sides of the same call?

I'm not sure what in getStats you think we should use to provide this correlation; but it seems it would be easier (and probably more foolproof) to include the room token as part of the log meta-information.
Sorry, I wasn't clear.  I meant "use getStats() on both sides purely for statistics collection instead of the raw ringbuffer stuff on the client side only, and, in addition, send necessary data so that the statistics data can be correlated."  And yeah, the room token might be just the thing.
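The getStats()-plus-room-token idea above can be sketched as follows. This is a hypothetical illustration, not code from the bug: `collectCallStats` is an invented helper, and it uses the modern promise-based `getStats()` (the API at the time was callback-based). A stand-in object is used in place of a real `RTCPeerConnection` so the payload assembly is visible on its own.

```javascript
// Hypothetical sketch: gather stats on one side of a call and tag them with
// the room token so the server can correlate the two halves of the same call.
async function collectCallStats(pc, roomToken) {
  const report = await pc.getStats(); // RTCStatsReport is map-like
  const stats = [];
  report.forEach(entry => stats.push(entry));
  return { roomToken, stats };
}

// Usage with a stand-in object; in Loop code, `pc` would be the call's
// actual RTCPeerConnection.
const fakePc = {
  getStats: async () => new Map([["outbound-rtp-1", { type: "outbound-rtp" }]]),
};
collectCallStats(fakePc, "abc123").then(result => {
  // result.roomToken lets the server match this report with the peer's.
});
```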
After talking with Alexis, it would appear that the infrastructure for doing this is most likely in performance and metrics rather than services. I'll be reaching out to Mark Reid to get his take on the best way forward from the server side.
Mark Reid pointed me to Bagheera (see https://intranet.mozilla.org/Metrics/bagheera) as the most likely candidate for catching our failure logs. Still researching...
I've looked into the client side of what needs to happen here, to confirm that it should be very straightforward. The steps appear to be:

- Acquire the PC (this may take some digging around, and is likely to be the trickiest part).
- Monitor for ICE state changes
- If the state transitions to "failed" (we may also want to set a timer and consider the call failed if it doesn't connect in a reasonable timeframe), submit a report as follows:
- Instantiate a WebrtcGlobalInformation object (see http://dxr.mozilla.org/mozilla-central/source/dom/webidl/WebrtcGlobalInformation.webidl)
- Call "getLogging" on the WGI object
- In the callback from getLogging (which takes an array of log lines as its argument), send the log to Bagheera (see http://dxr.mozilla.org/mozilla-central/source/services/common/bagheeraclient.js for the API and http://dxr.mozilla.org/mozilla-central/source/services/healthreport/healthreporter.jsm#1447 for an example):
- import("resource://services-common/bagheeraclient.js");
- Instantiate a new BagheeraClient with a URL pointing to the Bagheera server
- Call client.uploadJSON with our designated namespace, a fresh UUID, and an object containing the log file array and other metainformation (e.g., call room token, client version).
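The payload-assembly part of the steps above can be sketched in plain JavaScript. Note that `WebrtcGlobalInformation` and `BagheeraClient` are privileged Firefox interfaces, so only the report object is built here; `buildFailureReport` and its field names are illustrative assumptions, not a confirmed schema.

```javascript
// Hypothetical sketch of the object passed to client.uploadJSON in the last
// step above. Field names are assumptions based on the comment's description
// ("the log file array and other metainformation").
function buildFailureReport(logLines, roomToken, clientVersion) {
  return {
    roomToken: roomToken,         // correlates reports from both ends of a call
    clientVersion: clientVersion, // e.g. the Firefox version string
    iceLog: logLines,             // array of lines from getLogging's callback
  };
}

// In privileged code, logLines would arrive via the getLogging callback:
const report = buildFailureReport(
  ["ICE(PC:1234): state failed"], "room-token-abc", "34.0"
);
```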

This also necessitates having a configuration object that indicates the name of the server to upload to. I'm not meaning to gloss over the UX here, mind you -- we'll need informed user consent, so there is a user interaction step involved as well. The description above is merely intended to spell out the technical bits that aren't as well-worn as things like user dialog boxes.
(In reply to Adam Roach [:abr] from comment #8)
> Mark Reid pointed me to Bagheera (see
> https://intranet.mozilla.org/Metrics/bagheera) as the most likely candidate
> for catching our failure logs. Still researching...

On further discussion, we may be taking this down a different path. I'm going to talk to Mark more when he gets back into the office next week.
Whiteboard: [tech-risk][feedback] → [tech-risk][feedback][est:1d]
backlog: --- → mlp+
Okay, probably the more final answer is that we're going to use the telemetry servers straight up to collect these. This is the same interface as is used by the FFxOS FTU pings; you can see the general approach described here: https://wiki.mozilla.org/FirefoxOS/Metrics#Details

The path components they describe are actually wired into the telemetry schema, so we'll be using the same URL format, but with "loop" instead of "ftu" and "fxos": https://github.com/mozilla/telemetry-server/blob/master/telemetry/telemetry_schema.json

When we detect a call failure, the general sequence of events will be to create a JSON object, gzip it, and then use an XHR POST to send it to a URL of the format described above. The server takes this compressed blob, and (from Mark Reid): "uncompresses, validates, converts, recompresses, and stores" it.
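Following the FTU-ping pattern, the submission URL can be sketched as below. The exact path components are an assumption extrapolated from the wiki description and the comment's note about substituting "loop" for "ftu"/"fxos", not a confirmed endpoint; `buildLoopSubmissionUrl` is an invented helper name.

```javascript
// Hypothetical sketch of the telemetry submission URL, modeled on the
// FFxOS FTU ping format with "loop" as the reason component. The segment
// order (id/reason/appName/appVersion/appUpdateChannel/appBuildID) is an
// assumption based on the wiki page linked above.
function buildLoopSubmissionUrl(base, docId, appVersion, channel, buildId) {
  return [base, "submit", "telemetry", docId,
          "loop", "Firefox", appVersion, channel, buildId].join("/");
}

// The JSON report would then be gzipped and sent via XHR POST to this URL.
const url = buildLoopSubmissionUrl(
  "https://example.invalid", "uuid-1234", "34.0", "nightly", "20140901000000"
);
```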

Log processing is performed as described here: http://mreid-moz.github.io/blog/2013/11/06/current-state-of-telemetry-analysis/

An example of the processing that we do for the FTU ping logs can be found here: https://github.com/mozilla/telemetry-server/tree/master/mapreduce/fxosping

I'm pretty sure that's the final answer. So, this meta-bug needs three sub-bugs: a small bit of code for the client side; a small bit of code for the telemetry processing; and a simple bug to stand up the telemetry server that will take "a day or two" to provision. I will be adding these new bugs shortly.
Depends on: 998989
Depends on: 998996
Depends on: 999028
Whiteboard: [tech-risk][feedback][est:1d] → [tech-risk][feedback][est:1d][p=1]
Blocks: 1005175
Priority: -- → P2
Target Milestone: --- → mozilla32
This will likely be the first bug to land post-MLP (later this week).
Blocks: loop_mvp
No longer blocks: loop_mlp
Target Milestone: mozilla32 → mozilla33
Target Milestone: mozilla33 → mozilla34
Leaving the meta open so we don't lose track of Bug 999028, which is under Telemetry.
Whiteboard: [tech-risk][feedback][est:1d][p=1] → [tech-risk][feedback][est:1d][p=1, test]
Asked ABR for the status of what we need to do for bug 999028, or who we need to contact.
Priority: P2 → P5
Adam -- Where does this stand? I believe it depends on Telemetry (bug 999028) -- anything else?
Flags: needinfo?(adam)
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #15)
> Adam -- Where does this stand? I believe it depends on Telemetry (bug
> 999028) -- anything else?

As far as I know, the only remaining component here is the telemetry analysis bug. Although I have no easy way to verify this, I believe we've been collecting ICE failure information for some time now, and simply need a means to get at it. That's what Bug 999028 is supposed to do.
Flags: needinfo?(adam)
Resolving the meta bug now that it has been broken down -- the only open item is analyzing the collected data, and that is in progress.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
James, is this something you can help test when you're back from PTO? If it's unnecessary at this point feel free to flag qe-verify-.
Flags: qe-verify+
Flags: needinfo?(jbonacci)
QA Contact: jbonacci
There is nothing apparent/obvious here for Services QA.
Flags: qe-verify-
Flags: qe-verify+
Flags: needinfo?(jbonacci)