Automatic e-mail notifications of Telemetry submission rate spikes & drops

RESOLVED FIXED

Status

Webtools
Telemetry Server
RESOLVED FIXED
4 years ago
2 years ago

People

(Reporter: vladan, Assigned: mreid)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

4 years ago
We should set up automatic e-mail notifications for significant changes in Telemetry submission rates. 

perf@mozilla.com should probably be on the recipient list
CC'ing trink, as we already have these measures in heka, so I'm hoping this is just a small heka hack :)
(Assignee)

Updated

3 years ago
Assignee: nobody → mtrinkala
(Assignee)

Updated

3 years ago
Blocks: 972889
(Assignee)

Updated

3 years ago
No longer blocks: 972889
(Assignee)

Comment 2

3 years ago
See also: https://github.com/mozilla-services/heka/issues/677
(Assignee)

Updated

3 years ago
See Also: → bug 972889
(Assignee)

Comment 3

3 years ago
In addition to the Standard-Deviation-based alerting implemented by :trink, I've added a cron job to monitor the submission rates externally using :rvitillo's suggestion of a predictor based on Mann-Whitney's U test from https://gist.github.com/vitillo/9023560/.

It was added to the telemetry-server project in this commit: https://github.com/mozilla/telemetry-server/commit/abc644b0b70777889b8c29d45f05fb2eae69b302
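
For readers unfamiliar with the approach, here is a minimal sketch of a Mann-Whitney U check on submission counts. It is not the code from the commit or the gist above. The function names, the sample data, and the 0.01 significance level are all assumptions made for illustration. It uses the normal approximation with tie and continuity corrections, so it is only reasonable for samples of roughly 10 or more points each.

```python
import math


def mann_whitney_u(reference, recent):
    """Return (U, two-sided p-value) via the normal approximation.

    Hypothetical helper for illustration; not taken from telemetry-server.
    """
    n1, n2 = len(reference), len(recent)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Pool the samples, remembering which group each value came from.
    combined = sorted([(v, 0) for v in reference] + [(v, 1) for v in recent])

    # Assign average ranks to tied values and accumulate the tie correction.
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_ref = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u1 = rank_sum_ref - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)

    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0

    # Continuity-corrected z score; clamp at 0 so p never exceeds 1.
    z = max((abs(u - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p = math.erfc(z / math.sqrt(2))
    return u, p


def rate_changed(reference, recent, alpha=0.01):
    """True if the recent counts differ significantly from the reference counts."""
    _, p = mann_whitney_u(reference, recent)
    return p < alpha


# Made-up hourly submission counts: a normal period and a sharp drop.
reference_counts = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
dropped_counts = [60, 62, 58, 61, 59, 63, 57, 60, 61, 59]
print(rate_changed(reference_counts, dropped_counts))
```

A rank-based test like this makes no assumption that the counts are normally distributed, which is one reason it can complement a standard-deviation cutoff.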
Any idea what causes the 4am PST interruption to the data stream every Saturday (consistently generating the 4:05 alert)?
(Assignee)

Comment 5

3 years ago
I think it's just due to Saturdays being consistently ~15-20% lower volume than Fridays, so it hits the normal stddev cutoff...
Mike, what are you using as the reference distribution for your stddev approach?
The data as-is will not cause an alert.  There must have been an interruption in the data stream and when it was over the old data was backfilled correcting the graph.
(In reply to Mike Trinkala [:trink] from comment #7)
> The data as-is will not cause an alert.  There must have been an
> interruption in the data stream and when it was over the old data was
> backfilled correcting the graph.

Ignore the comment above... It helps if I look at the right graph (it just clips the threshold)

http://ec2-50-112-66-71.us-west-2.compute.amazonaws.com:4352/alert_threshold.html?win=15&col=1&sd=1.5&file=TelemetryChannelMetrics60DaysAggregatorAlerting.ALL.cbuf
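
To make the threshold discussion concrete, here is a rough sketch of a rolling standard-deviation cutoff. It is not the Heka plugin. The defaults of 15 and 1.5 are borrowed from the win=15 and sd=1.5 query parameters in the URL above, and interpreting them as "window length" and "number of standard deviations" is an assumption. The function name and the data are made up.

```python
import statistics


def stddev_alerts(counts, window=15, num_sd=1.5):
    """Return (index, value, lower, upper) for each point outside
    mean +/- num_sd * stddev of the preceding `window` points.

    Hypothetical helper for illustration only.
    """
    alerts = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mean = statistics.mean(history)
        sd = statistics.pstdev(history)
        lower = mean - num_sd * sd
        upper = mean + num_sd * sd
        if counts[i] < lower or counts[i] > upper:
            alerts.append((i, counts[i], lower, upper))
    return alerts


# Made-up data: a stable rate followed by a drop of about 18%.
steady = [1000, 1010, 990] * 5
for index, value, lower, upper in stddev_alerts(steady + [820]):
    print(f"point {index}: {value} outside [{lower:.1f}, {upper:.1f}]")
```

With a cutoff this tight, a regular weekly dip of 15-20% lands outside the band whenever the preceding window is stable, which matches the Saturday behaviour described earlier in the thread.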
All of this is in place and has been running on the Ops-supported shared Heka; we just need to get the telemetry edge nodes updated to send the data there: https://heka.shared.us-west-2.prod.mozaws.net/ (you will probably need to ask whd for access to the dashboard).
Assignee: mtrinkala → mreid
(Assignee)

Comment 10

2 years ago
The automated alerting has been running for a long time.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED