Closed Bug 994184 Opened 10 years ago Closed 10 years ago

[meta] Loop needs to upload ICE success information, logs

Categories

(Hello (Loop) :: General, defect, P5)

x86
macOS
defect

Tracking

(firefox34 fixed)

RESOLVED FIXED
mozilla34
Tracking Status
firefox34 --- fixed
backlog mlp+

People

(Reporter: abr, Unassigned)

References

Details

(Whiteboard: [tech-risk][feedback][est:1d][p=1, test])

When we initially launch the MLP, it is quite likely that we will see non-trivial connection failures, and we will want to be in a position to rapidly diagnose (and, where possible, fix) the causes of these failures. To that end, we need to be able to get to the ICE statistics; and, if a call fails to set up (i.e., does not transition to "connected" in a reasonable amount of time), upload ICE logging information.

The ICE logging information is already captured in a ring buffer for the purposes of populating the "about:webrtc" panel. The client should be capable of accessing this same ring buffer, and posting it to a server for collection.

Minimally, the server should accept and store the logs and make them available to authorized individuals.

(this is a meta bug -- we need sub-bugs for client and server behaviors)
Is this related to the loop server, or do we have a telemetry server that's already able to handle that?
Moving to mvp for now, but we'll want this soon after mlp if not before.
Blocks: loop_mvp
No longer blocks: loop_mlp
Whiteboard: [tech-risk][feedback]
(In reply to Mark Banner (:standard8) from comment #2)
> Moving to mvp for now, but we'll want this soon after mlp if not before.

This really needs to be part of the system before it becomes user-discoverable, or we'll be flying blind. If we get out and start having lots of failures without any way of determining why, we're going to be in a really uncomfortable position. In other words, this needs to block MLP release; I'm putting it back into the list. The work on the client side should be relatively small, so I'm not too worried about it impacting our desktop delivery.


(In reply to Alexis Metaireau (:alexis) from comment #1)
> Is this related to the loop server, or do we have a telemetry server that's
> already able to handle that?

I need to touch base with you to determine what is already available from the services team. If we can leverage something already deployed, then that's really the way to do this -- I don't want us reinventing any wheels here. I'll catch you on IRC.
Blocks: loop_mlp
No longer blocks: loop_mvp
Naive question: is it worth considering (either now or later) using RTCPeerConnection.getStats() to do this in the shared code so that we could easily upload/correlate-but-anonymize stats from both sides of the same call?
(In reply to Dan Mosedale (:dmose) from comment #4)
> Naive question: is it worth considering (either now or later) using
> RTCPeerConnection.getStats() to do this in the shared code so that we could
> easily upload/correlate-but-anonymize stats from both sides of the same call?

I'm not sure what in getStats you think we should use to provide this correlation; but it seems it would be easier (and probably more foolproof) to include the room token as part of the log meta-information.
Sorry, I wasn't clear.  I meant "use getStats() on both sides purely for statistics collection instead of the raw ringbuffer stuff on the client side only, and, in addition, send necessary data so that the statistics data can be correlated."  And yeah, the room token might be just the thing.
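The getStats()-plus-room-token idea above can be sketched as follows. This is a hypothetical illustration, not code from the bug: `collectCallStats` is an invented helper, and it uses the modern promise-based `getStats()` (the API at the time was callback-based). A stand-in object is used in place of a real `RTCPeerConnection` so the payload assembly is visible on its own.

```javascript
// Hypothetical sketch: gather stats on one side of a call and tag them with
// the room token so the server can correlate the two halves of the same call.
async function collectCallStats(pc, roomToken) {
  const report = await pc.getStats(); // RTCStatsReport is map-like
  const stats = [];
  report.forEach(entry => stats.push(entry));
  return { roomToken, stats };
}

// Usage with a stand-in object; in Loop code, `pc` would be the call's
// actual RTCPeerConnection.
const fakePc = {
  getStats: async () => new Map([["outbound-rtp-1", { type: "outbound-rtp" }]]),
};
collectCallStats(fakePc, "abc123").then(result => {
  // result.roomToken lets the server match this report with the peer's.
});
```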
After talking with Alexis, it would appear that the infrastructure for doing this is most likely in performance and metrics rather than services. I'll be reaching out to Mark Reid to get his take on the best way forward from the server side.
Mark Reid pointed me to Bagheera (see https://intranet.mozilla.org/Metrics/bagheera) as the most likely candidate for catching our failure logs. Still researching...
I've looked into the client side of what needs to happen here, to confirm that it should be very straightforward. The steps appear to be:

- Acquire the PC (this may take some digging around, and is likely to be the trickiest part).
- Monitor for ICE state changes
- If the state transitions to "failed" (we may also want to set a timer and consider the call failed if it doesn't connect in a reasonable timeframe), submit a report as follows:
- Instantiate a WebrtcGlobalInformation object (see http://dxr.mozilla.org/mozilla-central/source/dom/webidl/WebrtcGlobalInformation.webidl)
- Call "getLogging" on the WGI object
- In the callback from getLogging (which takes an array of log lines as its argument), send the log to Bagheera (see http://dxr.mozilla.org/mozilla-central/source/services/common/bagheeraclient.js for the API and http://dxr.mozilla.org/mozilla-central/source/services/healthreport/healthreporter.jsm#1447 for an example):
- import("resource://services-common/bagheeraclient.js");
- Instantiate a new BagheeraClient with a URL pointing to the Bagheera server
- Call client.uploadJSON with our designated namespace, a fresh UUID, and an object containing the log file array and other metainformation (e.g., call room token, client version).
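The payload-assembly part of the steps above can be sketched in plain JavaScript. Note that `WebrtcGlobalInformation` and `BagheeraClient` are privileged Firefox interfaces, so only the report object is built here; `buildFailureReport` and its field names are illustrative assumptions, not a confirmed schema.

```javascript
// Hypothetical sketch of the object passed to client.uploadJSON in the last
// step above. Field names are assumptions based on the comment's description
// ("the log file array and other metainformation").
function buildFailureReport(logLines, roomToken, clientVersion) {
  return {
    roomToken: roomToken,         // correlates reports from both ends of a call
    clientVersion: clientVersion, // e.g. the Firefox version string
    iceLog: logLines,             // array of lines from getLogging's callback
  };
}

// In privileged code, logLines would arrive via the getLogging callback:
const report = buildFailureReport(
  ["ICE(PC:1234): state failed"], "room-token-abc", "34.0"
);
```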

This also necessitates having a configuration object that indicates the name of the server to upload to. I'm not meaning to gloss over the UX here, mind you -- we'll need informed user consent, so there is a user interaction step involved as well. The description above is merely intended to spell out the technical bits that aren't as well-worn as things like user dialog boxes.
(In reply to Adam Roach [:abr] from comment #8)
> Mark Reid pointed me to Bagheera (see
> https://intranet.mozilla.org/Metrics/bagheera) as the most likely candidate
> for catching our failure logs. Still researching...

On further discussion, we may be taking this down a different path. I'm going to talk to Mark more when he gets back into the office next week.
Whiteboard: [tech-risk][feedback] → [tech-risk][feedback][est:1d]
backlog: --- → mlp+
Okay, probably the more final answer is that we're going to use the telemetry servers straight up to collect these. This is the same interface as is used by the FFxOS FTU pings; you can see the general approach described here: https://wiki.mozilla.org/FirefoxOS/Metrics#Details

The path components they describe are actually wired into the telemetry schema, so we'll be using the same URL format, but with "loop" instead of "ftu" and "fxos": https://github.com/mozilla/telemetry-server/blob/master/telemetry/telemetry_schema.json

When we detect a call failure, the general sequence of events will be to create a JSON object, gzip it, and then use an XHR POST to send it to a URL of the format described above. The server takes this compressed blob, and (from Mark Reid): "uncompresses, validates, converts, recompresses, and stores" it.
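Following the FTU-ping pattern, the submission URL can be sketched as below. The exact path components are an assumption extrapolated from the wiki description and the comment's note about substituting "loop" for "ftu"/"fxos", not a confirmed endpoint; `buildLoopSubmissionUrl` is an invented helper name.

```javascript
// Hypothetical sketch of the telemetry submission URL, modeled on the
// FFxOS FTU ping format with "loop" as the reason component. The segment
// order (id/reason/appName/appVersion/appUpdateChannel/appBuildID) is an
// assumption based on the wiki page linked above.
function buildLoopSubmissionUrl(base, docId, appVersion, channel, buildId) {
  return [base, "submit", "telemetry", docId,
          "loop", "Firefox", appVersion, channel, buildId].join("/");
}

// The JSON report would then be gzipped and sent via XHR POST to this URL.
const url = buildLoopSubmissionUrl(
  "https://example.invalid", "uuid-1234", "34.0", "nightly", "20140901000000"
);
```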

Log processing is performed as described here: http://mreid-moz.github.io/blog/2013/11/06/current-state-of-telemetry-analysis/

An example of the processing that we do for the FTU ping logs can be found here: https://github.com/mozilla/telemetry-server/tree/master/mapreduce/fxosping

I'm pretty sure that's the final answer. So, this meta-bug needs three sub-bugs: a small bit of code for the client side; a small bit of code for the telemetry processing; and a simple bug to stand up the telemetry server that will take "a day or two" to provision. I will be adding these new bugs shortly.
Depends on: 998989
Depends on: 998996
Depends on: 999028
Whiteboard: [tech-risk][feedback][est:1d] → [tech-risk][feedback][est:1d][p=1]
Blocks: 1005175
Priority: -- → P2
Target Milestone: --- → mozilla32
This will likely be the first bug to land post-MLP (later this week).
Blocks: loop_mvp
No longer blocks: loop_mlp
Target Milestone: mozilla32 → mozilla33
Target Milestone: mozilla33 → mozilla34
Leaving the meta open so we don't lose track of Bug 999028, which is under Telemetry.
Whiteboard: [tech-risk][feedback][est:1d][p=1] → [tech-risk][feedback][est:1d][p=1, test]
Asked ABR for the status of what we need to do for bug 999028, or who we need to contact.
Priority: P2 → P5
Adam -- Where does this stand? I believe it depends on Telemetry (bug 999028) -- anything else?
Flags: needinfo?(adam)
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #15)
> Adam -- Where does this stand? I believe it depends on Telemetry (bug
> 999028) -- anything else?

As far as I know, the only remaining component here is the telemetry analysis bug. Although I have no easy way to verify this, I believe we've been collecting ICE failure information for some time now, and simply need a means to get at it. That's what Bug 999028 is supposed to do.
Flags: needinfo?(adam)
Resolving the meta bug now that it has been broken down -- the only open item is analyzing the collected data, and that is in progress.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
James, is this something you can help test when you're back from PTO? If it's unnecessary at this point feel free to flag qe-verify-.
Flags: qe-verify+
Flags: needinfo?(jbonacci)
QA Contact: jbonacci
There is nothing apparent/obvious here for Services QA.
Flags: qe-verify-
Flags: qe-verify+
Flags: needinfo?(jbonacci)