Closed Bug 962811 Opened 10 years ago Closed 9 years ago

Automatic e-mail notifications of Telemetry submission rate spikes & drops

Categories

(Webtools Graveyard :: Telemetry Server, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: vladan, Assigned: mreid)

References

Details

We should set up automatic e-mail notifications for significant changes in Telemetry submission rates. 

perf@mozilla.com should probably be on the recipient list.
CC'ing trink, as we already have these measures in heka, so I'm hoping this is just a small heka hack :)
Assignee: nobody → mtrinkala
Blocks: 972889
No longer blocks: 972889
See Also: → 972889
In addition to the Standard-Deviation-based alerting implemented by :trink, I've added a cron job to monitor the submission rates externally using :rvitillo's suggestion of a predictor based on Mann-Whitney's U test from https://gist.github.com/vitillo/9023560/.

It was added to the telemetry-server project in this commit: https://github.com/mozilla/telemetry-server/commit/abc644b0b70777889b8c29d45f05fb2eae69b302
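
For readers unfamiliar with the approach, here is a minimal sketch of a Mann-Whitney U based check on submission rates. It is not the code from the gist or the commit above. The function names, the sample values, and the `alpha` cutoff are hypothetical, and the p-value uses the normal approximation without a tie correction, which is only reasonable for samples of roughly ten or more points each.

```python
import math


def mann_whitney_u(xs, ys):
    """Return (U, two-sided p-value) for two samples.

    Uses the normal approximation without tie correction, so treat the
    p-value as approximate for small or heavily tied samples.
    """
    n1, n2 = len(xs), len(ys)
    # Tag each value with its sample (0 = xs, 1 = ys) and sort by value.
    combined = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])

    # Assign 1-based ranks, giving tied values the average of their ranks.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    rank_sum_xs = sum(r for r, (_, grp) in zip(ranks, combined) if grp == 0)
    u_xs = rank_sum_xs - n1 * (n1 + 1) / 2.0
    u = min(u_xs, n1 * n2 - u_xs)

    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


def rate_changed(reference, recent, alpha=0.01):
    """Hypothetical alert rule: flag when the two samples differ significantly."""
    _, p = mann_whitney_u(reference, recent)
    return p < alpha


# Made-up submission counts per interval, for illustration only.
reference = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
dropped = [60, 62, 58, 61, 59, 63, 57, 60, 61, 59]
print(rate_changed(reference, dropped))    # a clear drop is flagged
print(rate_changed(reference, reference))  # identical samples are not
```

A rank-based test like this compares whole distributions rather than a single point against a mean, so one noisy interval is less likely to trigger it than a fixed standard-deviation cutoff.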
Any idea what causes the 4am PST interruption to the data stream every Saturday (consistently generating the 4:05 alert)?
I think it's just due to Saturdays being consistently ~15-20% lower volume than Fridays, so it hits the normal stddev cutoff...
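
To illustrate how a regular 15-20% weekend dip could trip a standard-deviation cutoff, here is a rough sketch. It is not the Heka implementation. The function name, the sample counts, and the 1.5 multiplier are assumptions chosen for illustration.

```python
import statistics


def stddev_alert(history, current, num_sd=1.5):
    """Return (alert, (lower, upper)) for a simple mean +/- num_sd * stddev band.

    `history` is a list of recent per-interval submission counts and
    `current` is the newest count. Both are hypothetical inputs.
    """
    mean = statistics.mean(history)
    sd = statistics.pstdev(history)  # population standard deviation
    lower = mean - num_sd * sd
    upper = mean + num_sd * sd
    return not (lower <= current <= upper), (lower, upper)


# Made-up weekday counts that vary by only a few percent.
weekday_counts = [100, 105, 95, 102, 98]

# A value about 18% below the weekday mean falls outside the band.
alert, band = stddev_alert(weekday_counts, 82)
print(alert, band)

# A typical weekday value stays inside the band.
alert, band = stddev_alert(weekday_counts, 101)
print(alert, band)
```

When weekday volume is very stable, the band is narrow, so a routine weekly drop looks like an anomaly unless the reference window includes earlier weekends.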
Mike, what are you using as the reference distribution for your stddev approach?
The data as-is will not cause an alert.  There must have been an interruption in the data stream and when it was over the old data was backfilled correcting the graph.
(In reply to Mike Trinkala [:trink] from comment #7)
> The data as-is will not cause an alert.  There must have been an
> interruption in the data stream and when it was over the old data was
> backfilled correcting the graph.

Ignore the comment above... It helps if I look at the right graph (it just clips the threshold)

http://ec2-50-112-66-71.us-west-2.compute.amazonaws.com:4352/alert_threshold.html?win=15&col=1&sd=1.5&file=TelemetryChannelMetrics60DaysAggregatorAlerting.ALL.cbuf
All of this is in place and has been running on the Ops-supported shared Heka. We just need to get the telemetry edge nodes updated to send the data there: https://heka.shared.us-west-2.prod.mozaws.net/ (you will probably need to ask whd for access to the dashboard).
Assignee: mtrinkala → mreid
The automated alerting has been running for a long time.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard