We should set up automatic e-mail notifications for significant changes in Telemetry submission rates. email@example.com should probably be on the recipient list.
CC'ing trink, as we already have these measures in heka, so I'm hoping this is just a small heka hack :)
In addition to the Standard-Deviation-based alerting implemented by :trink, I've added a cron job to monitor the submission rates externally using :rvitillo's suggestion of a predictor based on Mann-Whitney's U test from https://gist.github.com/vitillo/9023560/. It was added to the telemetry-server project in this commit: https://github.com/mozilla/telemetry-server/commit/abc644b0b70777889b8c29d45f05fb2eae69b302
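For readers unfamiliar with the technique, here is a minimal sketch of a Mann-Whitney U comparison between a reference window of submission counts and a recent window. It is not the code from the gist or the commit above. The function names, the normal approximation without tie or continuity correction, and the `alpha` cutoff are all assumptions made for illustration.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks for values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(reference, recent):
    """Return (U, two-sided p-value) using the normal approximation.

    Assumption: no tie correction and no continuity correction, so the
    p-value is only approximate, especially for small samples.
    """
    n1, n2 = len(reference), len(recent)
    ranks = rank_with_ties(list(reference) + list(recent))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p


def submission_rate_changed(reference, recent, alpha=0.01):
    """Hypothetical helper: flag a change when the p-value drops below alpha."""
    _, p = mann_whitney_u(reference, recent)
    return p < alpha
```

A cron job could call `submission_rate_changed` with, say, counts from earlier periods as `reference` and the latest counts as `recent`, and send mail when it returns `True`. How the windows are actually chosen is not stated in this thread.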
Any idea what causes the 4am PST interruption to the data stream every Saturday (consistently generating the 4:05 alert)?
I think it's just due to Saturdays being consistently ~15-20% lower volume than Fridays, so it hits the normal stddev cutoff...
Mike, what are you using as the reference distribution for your stddev approach?
The data as-is will not cause an alert. There must have been an interruption in the data stream and when it was over the old data was backfilled correcting the graph.
(In reply to Mike Trinkala [:trink] from comment #7)
> The data as-is will not cause an alert. There must have been an
> interruption in the data stream and when it was over the old data was
> backfilled correcting the graph.

Ignore the comment above... It helps if I look at the right graph (it just clips the threshold):
http://ec2-50-112-66-71.us-west-2.compute.amazonaws.com:4352/alert_threshold.html?win=15&col=1&sd=1.5&file=TelemetryChannelMetrics60DaysAggregatorAlerting.ALL.cbuf
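To make the "clips the threshold" point concrete, here is a rough sketch of a standard-deviation cutoff. It is not the Heka plugin's actual logic. It assumes that `win=15` and `sd=1.5` in the URL above mean a 15-sample history window and a 1.5 standard deviation multiplier, and the function name and the sample values are made up for illustration.

```python
import statistics


def below_threshold(history, current, sd_multiplier=1.5):
    """Return True when current falls below mean - sd_multiplier * stddev.

    Assumption: history is the trailing window of submission counts and
    the population standard deviation is used.
    """
    mean = statistics.mean(history)
    sd = statistics.pstdev(history)
    return current < mean - sd_multiplier * sd


# Hypothetical 15-sample window hovering around 100 submissions per interval.
window = [100, 104, 96, 102, 98] * 3

print(below_threshold(window, 82))  # a drop of roughly 18%
print(below_threshold(window, 99))  # ordinary fluctuation
```

With a tight window like this one, a 15-20% drop lands well below the cutoff, which fits the explanation that a normal Friday-to-Saturday dip can trip the alert without any interruption in the data stream.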
All this is in and has been running on the Ops-supported shared Heka; we just need to get the telemetry edge nodes updated to send the data there. https://heka.shared.us-west-2.prod.mozaws.net/ (you will probably need to ask whd for access to the dashboard)
The automated alerting has been running for a long time.