Closed
Bug 962811
Opened 11 years ago
Closed 10 years ago
Automatic e-mail notifications of Telemetry submission rate spikes & drops
Categories
(Webtools Graveyard :: Telemetry Server, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: vladan, Assigned: mreid)
Details
We should set up automatic e-mail notifications for significant changes in Telemetry submission rates.
perf@mozilla.com should probably be on the recipient list.
Comment 1•11 years ago
CC'ing trink, as we already have these measures in heka, so I'm hoping this is just a small heka hack :)
Updated•11 years ago
Assignee: nobody → mtrinkala
Comment 2•11 years ago
Comment 3•11 years ago
In addition to the standard-deviation-based alerting implemented by :trink, I've added a cron job that monitors the submission rates externally, using :rvitillo's suggested predictor based on the Mann-Whitney U test (https://gist.github.com/vitillo/9023560/).
It was added to the telemetry-server project in this commit: https://github.com/mozilla/telemetry-server/commit/abc644b0b70777889b8c29d45f05fb2eae69b302
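The gist linked above is not reproduced here. As a rough illustration of the idea only, the sketch below compares a window of recent submission counts against a reference window using a two-sided Mann-Whitney U test (normal approximation, standard library only). The helper names (`mann_whitney_u`, `detect_rate_change`), the `alpha=0.01` cutoff, and the choice of windows are all hypothetical assumptions, not what the telemetry-server cron job actually does.

```python
import math
from typing import Optional, Sequence


def _average_ranks(values: Sequence[float]) -> list:
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def mann_whitney_u(x: Sequence[float], y: Sequence[float]):
    """Return (U for x, two-sided p-value) via the normal approximation.

    No tie correction is applied to the variance, so p is slightly
    conservative (too large) when there are many ties.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    ranks = _average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p


def detect_rate_change(reference: Sequence[float], recent: Sequence[float],
                       alpha: float = 0.01) -> Optional[str]:
    """Hypothetical helper: return 'drop', 'spike', or None if no significant change."""
    u, p = mann_whitney_u(recent, reference)
    if p >= alpha:
        return None
    # U below its null mean means the recent counts tend to rank lower.
    return "drop" if u < len(recent) * len(reference) / 2 else "spike"
```

A caller might, for example, pass the last few hours of counts as `recent` and the same hours one week earlier as `reference`, so that an ordinary Saturday is not judged against Friday traffic; that pairing is likewise an assumption. A rank-based test like this makes no normality assumption about the counts, which is one plausible reason to run it alongside a standard-deviation threshold.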
Comment 4•11 years ago
Any idea what causes the 4am PST interruption to the data stream every Saturday (consistently generating the 4:05 alert)?
Comment 5•11 years ago
I think it's just because Saturdays are consistently ~15-20% lower in volume than Fridays, so the drop hits the normal stddev cutoff...
Comment 6•11 years ago
Mike, what are you using as reference distribution for your stddev approach?
Comment 7•11 years ago
The data as-is will not cause an alert. There must have been an interruption in the data stream and when it was over the old data was backfilled correcting the graph.
Comment 8•11 years ago
(In reply to Mike Trinkala [:trink] from comment #7)
> The data as-is will not cause an alert. There must have been an
> interruption in the data stream and when it was over the old data was
> backfilled correcting the graph.
Ignore the comment above... It helps if I look at the right graph (it just clips the threshold).
http://ec2-50-112-66-71.us-west-2.compute.amazonaws.com:4352/alert_threshold.html?win=15&col=1&sd=1.5&file=TelemetryChannelMetrics60DaysAggregatorAlerting.ALL.cbuf
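The threshold view linked above takes `win=15` and `sd=1.5` parameters. Assuming these mean "compare each point against the preceding 15 points" and "alert beyond 1.5 standard deviations" (an interpretation of the URL, not taken from Heka's documentation or source), a rolling check could be sketched as below. The function names `stddev_alert` and `alert_indices` are hypothetical. The sketch also illustrates the point made in comment 5: when the preceding window is fairly steady, a dip of roughly 15-20% lands far outside a 1.5-sigma band.

```python
import statistics
from typing import List, Sequence


def stddev_alert(history: Sequence[float], latest: float, sd_cutoff: float = 1.5) -> bool:
    """Return True if `latest` lies more than `sd_cutoff` sample standard
    deviations from the mean of `history` (hypothetical helper)."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    if sd == 0:
        # A perfectly flat history: any change at all is an outlier.
        return latest != mean
    return abs(latest - mean) > sd_cutoff * sd


def alert_indices(series: Sequence[float], window: int = 15,
                  sd_cutoff: float = 1.5) -> List[int]:
    """Return indices whose value breaches the threshold relative to the
    preceding `window` points (assumed meaning of `win`)."""
    return [
        i
        for i in range(window, len(series))
        if stddev_alert(series[i - window:i], series[i], sd_cutoff)
    ]
```

For example, with a steady history of `[1000, 1010, 990, 1005, 995] * 3`, appending `1002` raises no alert, while a following `820` (an 18% drop) is flagged at index 16.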
Comment 9•11 years ago
All of this is in place and has been running on the Ops-supported shared Heka; we just need to get the telemetry edge nodes updated to send the data there: https://heka.shared.us-west-2.prod.mozaws.net/ (you will probably need to ask whd for access to the dashboard).
Assignee: mtrinkala → mreid
Comment 10•10 years ago
The automated alerting has been running for a long time.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Webtools → Webtools Graveyard