Open Bug 1177737 Opened 9 years ago Updated 2 years ago

[meta] Create Telemetry client health monitoring page

Categories

(Toolkit :: Telemetry, defect, P3)

People

(Reporter: gfritzsche, Unassigned)

References

(Depends on 2 open bugs)

Details

(Keywords: meta, Whiteboard: [measurement:client:tracking][measurement:client:project])

We have several Telemetry health metrics submitted and available in the dashboard.
However, monitoring those measures is tedious and time-consuming because they are spread across several places.

We should have a central monitoring page that at least links to the core measures.
Ideally I'd love to show those core values too, and maybe contrast them with expected ranges.
Note that https://bugzilla.mozilla.org/show_bug.cgi?id=1171630 is a similar bug about alerts.
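A minimal sketch of what "contrast them with expected ranges" could mean in practice. The measure names and thresholds below are placeholders made up for illustration, not agreed values:

```python
from typing import Dict, List, Tuple

# Hypothetical expected ranges (low, high) per core measure.
# Both the names and the numbers are assumptions for illustration only.
EXPECTED_RANGES: Dict[str, Tuple[float, float]] = {
    "submission_lag_hours": (0.0, 24.0),
    "duplicate_rate": (0.0, 0.01),
}


def out_of_range(values: Dict[str, float],
                 expected: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the sorted names of measures whose value is outside its range."""
    flagged = []
    for name, value in values.items():
        if name not in expected:
            continue  # no range defined, so there is nothing to contrast against
        low, high = expected[name]
        if not (low <= value <= high):
            flagged.append(name)
    return sorted(flagged)
```

A monitoring page could highlight whatever `out_of_range` returns and just link to the rest.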

At-a-glance monitoring that has been discussed:
- submission volume
  - ping type
  - channel
  - size of ping
  - submission lag (how long it takes a ping to reach the server)
  - request duration (how long it takes to process the message through the pipeline)
- error types
- duplicates
- number of profiles on a given channel/time period
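To make the list above concrete, here is a rough sketch of how a few of these numbers could be derived from a batch of received pings. The record layout (`id`, `type`, `channel`, `client_id`, `size`, `created`, `received`) is an assumption for illustration, not the actual pipeline schema; request duration and error types are left out because they depend on pipeline internals.

```python
from collections import Counter
from statistics import median
from typing import Any, Dict, Iterable


def summarize(pings: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute at-a-glance health numbers for a batch of ping records.

    Each record is assumed (hypothetically) to carry: id, type, channel,
    client_id, size (bytes), and created / received as datetime objects.
    """
    pings = list(pings)
    # Submission volume, broken down by ping type and channel.
    volume = Counter((p["type"], p["channel"]) for p in pings)
    # A ping id seen n times contributes n - 1 duplicates.
    id_counts = Counter(p["id"] for p in pings)
    duplicates = sum(n - 1 for n in id_counts.values())
    # Submission lag: time between client-side creation and server receipt.
    lags = [(p["received"] - p["created"]).total_seconds() for p in pings]
    return {
        "volume": dict(volume),
        "duplicates": duplicates,
        "median_lag_seconds": median(lags) if lags else None,
        "median_size_bytes": median(p["size"] for p in pings) if pings else None,
        "profiles": len({p["client_id"] for p in pings}),
    }
```

Medians are used here rather than means so that a few very late or very large pings do not dominate the summary; that is a design choice of this sketch, not something decided in this bug.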

Some of this exists already; it would be good to fill in the gaps and make sure it's checked in:
https://pipeline-prototype-cep.prod.mozaws.net/
https://github.com/mozilla-services/data-pipeline/tree/master/heka/sandbox/filters


Georg, please add any histograms or other "core" measures that you'd like to see on this page.
Flags: needinfo?(gfritzsche)
See Also: → 1171630
Assigning to Rob to set up the filters.
Assignee: nobody → rmiller
This already has good info we should have in there:
https://kibana.shared.us-west-2.prod.mozaws.net/#/dashboard/elasticsearch/Telemetry%20Nightly%20By%20Reason

The list in comment 2 looks good already.
Might be good to also track "release"/"base" vs. "extended" data submission rate per channel.

For client measures we want all the TELEMETRY_* probes (except TELEMETRY_TEST_*), ideally grouped:
* archiving: TELEMETRY_ARCHIVE_*
* pending pings:
  * simpleMeasurements.savedPings (current pending ping count)
  * simpleMeasurements.pingsOverdue (pings found at startup that are more than 2 weeks old)
  * TELEMETRY_PENDING_* or similar, coming with bug 1168835
  * bug 1168835 drops TELEMETRY_FILES_EVICTED and simpleMeasurements.pingsDiscarded
* submission: TELEMETRY_STRINGIFY, TELEMETRY_COMPRESS, TELEMETRY_SUCCESS, TELEMETRY_PING, TELEMETRY_SEND
* other: TELEMETRY_INVALID_PING_TYPE_SUBMITTED, TELEMETRY_DISCARDED_CONTENT_PINGS_COUNT, TELEMETRY_MEMORY_REPORTER_MS
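As a sketch, the grouping above could be applied to a list of histogram names like this. The group labels and the `group_probes` helper are made up for illustration, and the `TELEMETRY_PENDING_*` pattern is tentative until bug 1168835 lands:

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

# Group label -> name patterns, following the grouping in this comment.
# TELEMETRY_PENDING_* is tentative (see bug 1168835).
GROUPS = [
    ("archiving", ["TELEMETRY_ARCHIVE_*"]),
    ("pending pings", ["TELEMETRY_PENDING_*"]),
    ("submission", ["TELEMETRY_STRINGIFY", "TELEMETRY_COMPRESS",
                    "TELEMETRY_SUCCESS", "TELEMETRY_PING", "TELEMETRY_SEND"]),
]


def group_probes(names: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket TELEMETRY_* probe names by group, skipping TELEMETRY_TEST_*."""
    groups: Dict[str, List[str]] = {}
    for name in names:
        if not name.startswith("TELEMETRY_") or name.startswith("TELEMETRY_TEST_"):
            continue  # not a client health probe we want on the page
        label = next(
            (group for group, patterns in GROUPS
             if any(fnmatchcase(name, pattern) for pattern in patterns)),
            "other",  # anything unmatched lands in the catch-all group
        )
        groups.setdefault(label, []).append(name)
    return groups
```

The simpleMeasurements fields are not histograms, so they would need to be added to the "pending pings" group separately.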
Flags: needinfo?(gfritzsche)
Blocks: 1122482
No longer blocks: 1120356
Whiteboard: [rC] [unifiedTelemetry] → [unifiedTelemetry]
Priority: -- → P2
No longer blocks: 1122482
Rob, are you still working on this?

I've filed bug 1270798 about setting up a client health dashboard based on the histogram aggregate data for now.
Flags: needinfo?(rmiller)
Whiteboard: [unifiedTelemetry] → [measurement:client:tracking]
No, I haven't been working on this, I've removed myself from being assigned.
Assignee: rmiller → nobody
Flags: needinfo?(rmiller)
Assignee: nobody → yarik.sheptykin
Assignee: yarik.sheptykin → nobody
Whiteboard: [measurement:client:tracking]
Depends on: 1344235
Priority: P2 → P4
Assignee: nobody → chutten
Depends on: 1399153, 1400351
Priority: P4 → P1
Depends on: 1407608
Assignee: chutten → nobody
Priority: P1 → P3
Summary: Create Telemetry client health monitoring page → [meta] Create Telemetry client health monitoring page
Whiteboard: [measurement:client:tracking] → [measurement:client:tracking][measurement:client:project]
Keywords: meta
Severity: normal → S3