Closed Bug 999028 Opened 10 years ago Closed 7 years ago

Telemetry Analysis Job for Loop ICE Reports

Categories

(Data Platform and Tools :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: abr, Assigned: whd)

References

Details

(Whiteboard: [est:4h][p=.5, s=mlpnightly2, c=loop-general] [SvcOps])

We need to write a telemetry analysis module for aggregating information about ICE failures in the Loop client. See http://mreid-moz.github.io/blog/2013/11/06/current-state-of-telemetry-analysis/ for a description of the analysis tools.

The format of the data is described at https://wiki.mozilla.org/Loop/Telemetry#ICE_Failures (this report generation should only handle records with "report":"ice failure"), and the initial data to extract is described in the final bullet under "Nature of Data":

> For initial analysis, we could probably do with something as simple as a
> report that says "on date, there were x failures, broken down as follows:
> failed: failed count, disconnected: disconnected count," and then lets us list
> all the failures for a given date/reason pair, ultimately allowing us to
> download the log to analyze. As we get experience with how things tend to
> break, we might want to refine this some, but it's a good start.

Note that the actual contents of, e.g., the statistics field, the SDP, and the log files need to be treated as confidential.
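
A minimal sketch of the per-date, per-reason count described in the quote above. It is not the analysis code that was later merged. Only the "report":"ice failure" filter comes from this bug; the `date` and `reason` field names and the function name are hypothetical and would need to be matched to the real payload format on the wiki.

```python
from collections import Counter, defaultdict


def summarize_ice_failures(records):
    """Count ICE failure records per date, broken down by reason.

    `records` is an iterable of dicts. The "date" and "reason" keys are
    assumed names for illustration, not the documented payload fields.
    """
    summary = defaultdict(Counter)
    for record in records:
        # Only ICE failure reports belong in this summary.
        if record.get("report") != "ice failure":
            continue
        summary[record["date"]][record["reason"]] += 1
    # Convert to plain dicts so the result serializes cleanly to JSON.
    return {day: dict(counts) for day, counts in summary.items()}
```

Only the counts leave this function, so the confidential fields (statistics, SDP, logs) stay out of the summary.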
Whiteboard: [est:4h] → [est:4h]p=.5
Blocks: 1005175
Whiteboard: [est:4h]p=.5 → [est:4h][p=.5, s=mlpnightly2, c=loop-general]
Is there an update on who you were working on this with? We don't want to lose this bug, but we aren't sure where it stands.
Flags: needinfo?(adam)
I believe that EKR was working with Ben Brittain to do a first cut at this work. Ben -- is that correct? Should we assign this bug to you?
Flags: needinfo?(adam) → needinfo?(ben)
Hey Adam -- Do we have an owner for this?  Or do I need to find one?
Flags: needinfo?(adam)
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #3)
> Hey Adam -- Do we have an owner for this?  Or do I need to find one?

We don't have an owner any more. The original plan was to have Ben Brittain do this, although I believe he's gone back to school now.
Flags: needinfo?(ben)
Flags: needinfo?(adam)
Would this be something that the Metrics team could do, or do you know if we have a specific team for Telemetry? The ICE data is being gathered and now needs telemetry work to generate reports.

The user describes the initial reports needed and has the write-up of how it's formatted.
Flags: needinfo?(sguha)
Flags: needinfo?(kparlante)
There is no one telemetry team. Our experience with Telemetry reporting is limited.
I'm roping in Ali here as he has some experience with the Telemetry JS API. He can help Katie design and implement the dashboards.
Flags: needinfo?(sguha)
Flags: needinfo?(kparlante)
Ali, would you be able to help Katie with this?
Flags: needinfo?(aalmossawi)
Sure, I can work with Katie on this.
Flags: needinfo?(aalmossawi)
Mark, I'm sending this one your way.

If I understand correctly we have two tasks:
- Create a telemetry analysis job for aggregating this data (via http://telemetry-dash.mozilla.org/)
- Make the aggregated data available to our custom dashboard via the Telemetry JS API (which we need to do to make the data available to our partners)

The second one is sorta captured here: https://bugzilla.mozilla.org/show_bug.cgi?id=1073516
Assignee: nobody → mreid
Flags: needinfo?(mreid)
Flags: needinfo?(mreid)
Can the aggregate data be public (i.e., web-facing)?

If so, then we can publish the results to a web-facing S3 bucket (per the usual for a telemetry analysis job).

If not, we typically put the results in a private bucket, and we'll need to sort out a mechanism for sync'ing it over to the dashboard.
Flags: needinfo?(kparlante)
Let's treat this as private/confidential. :whd can handle the access control/mechanism for syncing it to the dashboard.
Flags: needinfo?(kparlante)
What format would be most convenient for use by the Loop dashboard?  I currently have a job that outputs a small json file for each day containing a summary of the failures by type, as well as a gzip'd tsv file with the full payloads for detailed inspection.
It would be great if we could have one json file instead of one per day, something like:
[
  {
     "date":"2014-10-16",
     "failureA":0,
     "failureB":0,
     "failureC":0
  },
  {
     "date":"2014-10-17",
     "failureA":0,
     "failureB":0,
     "failureC":0
  }
]

We're not going to make the full payloads available via the dashboard; we can give access to specific devs (:abr, others?).
Ok, I'll update the format. How many days' history should I include? It should be pretty small, but I don't like to generate files that grow forever.
180 days
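
For illustration, one way the single combined file with a bounded history could be assembled from per-day summaries. This is a sketch under assumptions, not the job's actual code: the function name, the input shape (a dict keyed by `datetime.date`), and the `history_days` parameter are all hypothetical.

```python
from datetime import date, timedelta


def combine_daily_summaries(daily_counts, today, history_days=180):
    """Merge per-day failure counts into one list, dropping old days.

    daily_counts maps a datetime.date to a dict of {failure_type: count}.
    Days more than history_days before `today` are discarded so the
    combined file stays bounded in size.
    """
    cutoff = today - timedelta(days=history_days)
    combined = []
    for day in sorted(daily_counts):
        if day < cutoff:
            continue
        # One flat object per day, with the date in ISO form (YYYY-MM-DD).
        row = {"date": day.isoformat()}
        row.update(daily_counts[day])
        combined.append(row)
    return combined
```

The returned list can be written out with `json.dump` to produce a file shaped like the example above.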
Ok, data is now being saved to a single combined file:
s3://telemetry-private-analysis/loop_failures/data/failures_by_type.json

This will require AWS credentials to copy it over to the dashboard web server.  The job is currently scheduled to run at 14:00UTC to populate data for the previous day.  It should take far less than 1 hour to run, so fetching it at 15:00UTC should be safe.

Note that the full per-day detail will still be generated in case you want to make the full payloads available later on.
I created (and merged) a PR for the analysis code here:
https://github.com/mozilla/telemetry-server/pull/86
Excellent! Thanks for including the PR. Assigning to :whd to move to the metrics box so the dashboard can access it.
Assignee: mreid → whd
:mreid minor issue with the data, the date looks like "20141016" instead of "2014-10-16".

:relud has sorted out the cross-IAM stuff for me, so I'm now setting up the metrics box to pull the data down at 15:00UTC and make it available via the dashboard.
(In reply to Wesley Dawson [:whd] from comment #19)
> :mreid minor issue with the data, the date looks like "20141016" instead of
> "2014-10-16".
Right - the data is stored without the dashes. If it's a problem let me know and I'll add them before exporting.
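
If the dashes do need adding at export time, a small conversion along these lines would do it. This is a sketch only; the function name is made up for illustration and it assumes the stored value is always an eight-digit YYYYMMDD string.

```python
from datetime import datetime


def add_date_dashes(compact_date):
    """Convert a compact date such as "20141016" to "2014-10-16".

    Parsing through datetime (rather than slicing the string) also
    rejects values that are not real calendar dates.
    """
    return datetime.strptime(compact_date, "%Y%m%d").strftime("%Y-%m-%d")
```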
A graph of ICE failures is now on the dashboard: https://metrics.services.mozilla.com/loop-server-dashboard/

https://github.com/mozilla/loop-server-dashboard/pull/10

:abr, the log data is in gzip'd tsv files in S3; :whd is going to give you credentials to access them as a short-term solution.

Instead of building a bespoke dashboard for accessing the log files, we should route this data to kibana or sentry (probably sentry), but we may need to wait for the transition to the new pipeline.
Whiteboard: [est:4h][p=.5, s=mlpnightly2, c=loop-general] → [est:4h][p=.5, s=mlpnightly2, c=loop-general] [SvcOps]
Loop infrastructure was decommissioned in bug 1307378. I don't think this is still needed. Please reopen if it is.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Product: Webtools → Data Platform and Tools
Component: Telemetry Dashboards (TMO) → General