999028 - Telemetry Analysis Job for Loop ICE Reports

Reporter

Description

•

10 years ago

We need to write a telemetry analysis module for aggregating information about ICE failures in the Loop client. See http://mreid-moz.github.io/blog/2013/11/06/current-state-of-telemetry-analysis/ for a description of the analysis tools.

The format of the data is described at https://wiki.mozilla.org/Loop/Telemetry#ICE_Failures (this report generation should only handle records with "report":"ice failure"), and the initial data to extract is described in the final bullet under "Nature of Data":

> For initial analysis, we could probably do with something as simple as a
> report that says "on date, there were x failures, broken down as follows:
> failed: failed count, disconnected: disconnected count," and then lets us list
> all the failures for a given date/reason pair, ultimately allowing us to
> download the log to analyze. As we get experience with how things tend to
> break, we might want to refine this some, but it's a good start.

Note that the actual contents of, e.g., the statistics field, the SDP, and the log files need to be treated as confidential.
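As a rough illustration of the initial report described above, a minimal Python sketch might look like the following. The `date` and `reason` keys are assumptions made for illustration only; the real record layout is the one documented on the wiki page. The function `summarize_ice_failures` is hypothetical and is not the analysis module itself.

```python
import json
from collections import Counter, defaultdict


def summarize_ice_failures(lines):
    """Count ICE failure records per date and per reason.

    `lines` is an iterable of JSON strings, one record per line.
    The "date" and "reason" keys are hypothetical placeholders.
    """
    counts = defaultdict(Counter)
    for line in lines:
        record = json.loads(line)
        # Only handle ICE failure reports, as requested in the description.
        if record.get("report") != "ice failure":
            continue
        counts[record["date"]][record["reason"]] += 1
    # Convert to plain dicts so the result serializes cleanly to JSON.
    return {day: dict(by_reason) for day, by_reason in counts.items()}
```

Only counts are kept in the result, so the confidential fields (statistics, SDP, logs) never reach the summary.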

:shell escalante

Updated

•

10 years ago

Whiteboard: [est:4h] → [est:4h]p=.5

:shell escalante

Updated

•

10 years ago

Blocks: 1005175

:shell escalante

Updated

•

10 years ago

Whiteboard: [est:4h]p=.5 → [est:4h][p=.5, s=mlpnightly2, c=loop-general]

:shell escalante

Comment 1

•

10 years ago

Is there an update on who you were working on this with? We don't want to lose this bug, but we aren't sure where it stands.

Flags: needinfo?(adam)

Adam Roach [:abr]

Reporter

Comment 2

•

10 years ago

I believe that EKR was working with Ben Brittain to do a first cut at this work. Ben -- is that correct? Should we assign this bug to you?

Flags: needinfo?(adam) → needinfo?(ben)

Maire Reavy [:mreavy]

Comment 3

•

10 years ago

Hey Adam -- Do we have an owner for this?  Or do I need to find one?

Flags: needinfo?(adam)

Adam Roach [:abr]

Reporter

Comment 4

•

10 years ago

(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #3)
> Hey Adam -- Do we have an owner for this?  Or do I need to find one?

We don't have an owner any more. The original plan was to have Ben Brittain do this, although I believe he's gone back to school now.

Flags: needinfo?(ben)

Flags: needinfo?(adam)

:shell escalante

Comment 5

•

10 years ago

Would this be something that the Metrics teams could do - or do you know if we have a specific team for Telemetry?  The ICE data is being gathered and needs telemetry work now to generate reports.  

The user describes the initial reports needed and has the write-up of how it's formatted.

Flags: needinfo?(sguha)

Flags: needinfo?(kparlante)

"Saptarshi Guha[:joy]"

Comment 6

•

10 years ago

There is no one telemetry team. Our experience with Telemetry reporting is limited.
I'm roping in Ali here, as he has some experience with the Telemetry JS API. He can help Katie design and implement the dashboards.

Flags: needinfo?(sguha)

Flags: needinfo?(kparlante)

"Saptarshi Guha[:joy]"

Comment 7

•

10 years ago

Ali, would you be able to help Katie with this?

Flags: needinfo?(aalmossawi)

Ali Almossawi

Comment 8

•

10 years ago

Sure, I can work with Katie on this.

Flags: needinfo?(aalmossawi)

Katie Parlante

Comment 9

•

10 years ago

Mark, I'm sending this one your way.

If I understand correctly we have two tasks:
- Create a telemetry analysis job for aggregating this data (via http://telemetry-dash.mozilla.org/)
- Make the aggregated data available to our custom dashboard via the Telemetry JS API (which we need to do to make the data available to our partners)

The second one is sorta captured here: https://bugzilla.mozilla.org/show_bug.cgi?id=1073516

Assignee: nobody → mreid

Flags: needinfo?(mreid)

Katie Parlante

Updated

•

10 years ago

Flags: needinfo?(mreid)

Mark Reid [:mreid]

Comment 10

•

10 years ago

Can the aggregate data be public (i.e., web-facing)?

If so, then we can publish the results to a web-facing S3 bucket (per the usual for a telemetry analysis job).

If not, we typically put the results in a private bucket, and we'll need to sort out a mechanism for sync'ing it over to the dashboard.

Flags: needinfo?(kparlante)

Katie Parlante

Comment 11

•

10 years ago

Let's treat this as private/confidential. :whd can handle the access control/mechanism for syncing it to the dashboard.

Flags: needinfo?(kparlante)

Mark Reid [:mreid]

Comment 12

•

10 years ago

What format would be most convenient for use by the Loop dashboard?  I currently have a job that outputs a small json file for each day containing a summary of the failures by type, as well as a gzip'd tsv file with the full payloads for detailed inspection.

Katie Parlante

Comment 13

•

10 years ago

It would be great if we could have one json file instead of one per day, something like:
[
  {
     "date":"2014-10-16",
     "failureA":0,
     "failureB":0,
     "failureC":0
  },
  {
     "date":"2014-10-17",
     "failureA":0,
     "failureB":0,
     "failureC":0
  }
]

We're not going to make the full payloads available via the dashboard; we can give access to specific devs (:abr, others?).
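One way the per-day summaries could be folded into a single list of this shape is sketched below. This is an assumption about the approach, not the code that the job actually uses; `combine_daily_summaries` and its input layout (a mapping of date string to failure counts) are hypothetical.

```python
def combine_daily_summaries(daily):
    """Build one date-sorted list from per-day failure counts.

    `daily` maps a date string to a dict of failure-type counts
    (a hypothetical in-memory form of the per-day json files).
    """
    combined = []
    for day in sorted(daily):
        entry = {"date": day}
        entry.update(daily[day])
        combined.append(entry)
    return combined
```

Serializing the result with `json.dumps` produces strict JSON, which avoids hand-editing pitfalls such as trailing commas.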

Mark Reid [:mreid]

Comment 14

•

10 years ago

Ok, I'll update the format. How many days' history should I include? It should be pretty small, but I don't like to generate files that grow forever.

Katie Parlante

Comment 15

•

10 years ago

180 days
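A retention window like this could be applied each time the combined file is regenerated. The sketch below is only an illustration under stated assumptions: `trim_history` and its parameters are hypothetical, and it assumes dashed dates as in the example format from comment 13.

```python
from datetime import date, timedelta


def trim_history(entries, today, max_days=180):
    """Keep only entries dated within the last `max_days` days.

    Each entry is a dict with a "date" key in YYYY-MM-DD form.
    """
    cutoff = today - timedelta(days=max_days)
    return [e for e in entries if date.fromisoformat(e["date"]) > cutoff]
```

Passing `today` explicitly keeps the function deterministic, which makes it easier to test than one that reads the clock itself.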

Mark Reid [:mreid]

Comment 16

•

10 years ago

Ok, data is now being saved to a single combined file:
s3://telemetry-private-analysis/loop_failures/data/failures_by_type.json

This will require AWS credentials to copy it over to the dashboard web server.  The job is currently scheduled to run at 14:00UTC to populate data for the previous day.  It should take far less than 1 hour to run, so fetching it at 15:00UTC should be safe.

Note that the full per-day detail will still be generated in case you want to make the full payloads available later on.
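For the copy step, a hedged sketch of a fetch helper follows. It assumes a client object exposing `download_file(bucket, key, filename)`, which is the shape of the boto3 S3 client method; boto3 itself is an assumption here and is not the tooling named in this bug. The helper names `split_s3_url` and `fetch_summary` are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError("expected an s3:// URL, got: " + url)
    return parsed.netloc, parsed.path.lstrip("/")


def fetch_summary(s3_client, url, dest_path):
    """Download the object at `url` to `dest_path` using `s3_client`.

    `s3_client` is assumed to provide download_file(bucket, key, filename),
    e.g. a boto3 S3 client created with suitable AWS credentials.
    """
    bucket, key = split_s3_url(url)
    s3_client.download_file(bucket, key, dest_path)
    return dest_path
```

Scheduling the call for 15:00 UTC, an hour after the job starts, matches the timing suggested above.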

Mark Reid [:mreid]

Comment 17

•

10 years ago

I created (and merged) a PR for the analysis code here:
https://github.com/mozilla/telemetry-server/pull/86

Katie Parlante

Comment 18

•

10 years ago

Excellent! Thanks for including the PR. Assigning to :whd to move to the metrics box so the dashboard can access it.

Assignee: mreid → whd

Wesley Dawson [:whd]

Assignee

Comment 19

•

10 years ago

:mreid, minor issue with the data: the date looks like "20141016" instead of "2014-10-16".

:relud has sorted out the cross-IAM stuff for me, so I'm now setting up the metrics box to pull the data down at 15:00UTC and make it available via the dashboard.

Wesley Dawson [:whd]

Assignee

Comment 20

•

10 years ago

https://github.com/mozilla-services/puppet-config/pull/981
https://github.com/mozilla-services/svcops/pull/313

Available at: https://metrics.services.mozilla.com/loop-server-dashboard/data/loop_failures_by_type.json

Mark Reid [:mreid]

Comment 21

•

10 years ago

(In reply to Wesley Dawson [:whd] from comment #19)
> :mreid minor issue with the data, the date looks like "20141016" instead of
> "2014-10-16".
Right - the data is stored without the dashes. If it's a problem let me know and I'll add them before exporting.
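If the dashes are wanted, the conversion is small enough to do on either side of the export. A minimal sketch, with `add_dashes` as a hypothetical helper name:

```python
from datetime import datetime


def add_dashes(compact_date):
    """Convert a date like "20141016" to "2014-10-16".

    Parsing through datetime also rejects strings that are not valid dates.
    """
    return datetime.strptime(compact_date, "%Y%m%d").strftime("%Y-%m-%d")
```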

Katie Parlante

Comment 22

•

10 years ago

A graph of ICE failures is now on the dashboard: https://metrics.services.mozilla.com/loop-server-dashboard/

https://github.com/mozilla/loop-server-dashboard/pull/10

:abr, the log data is in gzip'd tsv files in s3; :whd is going to give you credentials to access them as a short-term solution.

Instead of building a bespoke dashboard for accessing the log files, we should route this data to kibana or sentry (probably sentry), but we may need to wait for the transition to the new pipeline.
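Until then, the gzip'd tsv files can be read with a few lines of Python once downloaded. The sketch below makes no assumption about the column layout, which is not described in this bug; `iter_tsv_rows` is a hypothetical helper name.

```python
import gzip


def iter_tsv_rows(path):
    """Yield each line of a gzip'd TSV file as a list of column strings.

    Lines are split on tabs directly, so quote characters inside a
    payload column are left untouched.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")
```

Because the payloads are confidential, any local copies should be handled under the same access restrictions as the S3 bucket.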

Jason Thomas [:jason]

Updated

•

7 years ago

Whiteboard: [est:4h][p=.5, s=mlpnightly2, c=loop-general] → [est:4h][p=.5, s=mlpnightly2, c=loop-general] [SvcOps]

Jason Thomas [:jason]

Comment 23

•

7 years ago

Loop infrastructure was decommissioned in bug 1307378. I don't think this is still needed. Please reopen if it is.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → INCOMPLETE

BMO Automation

Updated

•

6 years ago

Product: Webtools → Data Platform and Tools

Nobody; OK to take it and work on it

Updated

•

1 year ago

Component: Telemetry Dashboards (TMO) → General