Closed Bug 962811 Opened 11 years ago Closed 10 years ago

Automatic e-mail notifications of Telemetry submission rate spikes & drops

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: vladan, Assigned: mreid)

References

Details

Vladan Djeric (:vladan)

Reporter

Description

•

11 years ago

We should set up automatic e-mail notifications for significant changes in Telemetry submission rates. perf@mozilla.com should probably be on the recipient list.

Jonas Finnemann Jensen (:jonasfj)

Comment 1

•

11 years ago

CC'ing trink, as we already have these measures in heka, so I'm hoping this is just a small heka hack :)

Mark Reid [:mreid]

Assignee

Updated

•

11 years ago

Assignee: nobody → mtrinkala

Mark Reid [:mreid]

Assignee

Updated

•

11 years ago

Blocks: 972889

Mark Reid [:mreid]

Assignee

Updated

•

11 years ago

No longer blocks: 972889

Mark Reid [:mreid]

Assignee

Comment 2

•

11 years ago

See also: https://github.com/mozilla-services/heka/issues/677

Mark Reid [:mreid]

Assignee

Updated

•

11 years ago

Comment 3

•

11 years ago

In addition to the Standard-Deviation-based alerting implemented by :trink, I've added a cron job to monitor the submission rates externally using :rvitillo's suggestion of a predictor based on Mann-Whitney's U test from https://gist.github.com/vitillo/9023560/. It was added to the telemetry-server project in this commit: https://github.com/mozilla/telemetry-server/commit/abc644b0b70777889b8c29d45f05fb2eae69b302
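For readers unfamiliar with the approach, here is a minimal sketch of how a Mann-Whitney U test can flag a shift in submission rates by comparing a recent window of counts against a reference window. It is not the code from the gist or the commit linked above. The function names, the `alpha` cutoff, and the sample counts are assumptions made for illustration, and the p-value uses the normal approximation with a tie correction and no continuity correction.

```python
import math


def mann_whitney_u(reference, recent):
    """Return (U, two-sided p-value) for two samples.

    Uses the normal approximation with a tie correction, which is
    reasonable once each sample has roughly 10 or more points.
    """
    n1, n2 = len(reference), len(recent)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Rank the pooled sample; tied values share their average rank.
    pooled = sorted((v, idx) for idx, v in enumerate(list(reference) + list(recent)))
    ranks = [0.0] * (n1 + n2)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    # U statistic for the reference sample, then the smaller of the two U values.
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)

    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, min(1.0, p)


def rate_changed(reference, recent, alpha=0.01):
    """Hypothetical predictor: True when the two windows differ significantly."""
    _, p = mann_whitney_u(reference, recent)
    return p < alpha


# Made-up hourly submission counts, for illustration only.
reference_counts = [100, 102, 98, 101, 99, 103, 97, 100, 104, 96]
dropped_counts = [80, 82, 79, 81, 78, 83, 77, 80, 84, 76]
print(rate_changed(reference_counts, dropped_counts))
print(rate_changed(reference_counts, reference_counts))
```

Because the test is rank-based, it makes no assumption that submission counts are normally distributed, which is one reason it can complement a standard-deviation cutoff.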

Mike Trinkala [:trink]

Comment 4

•

11 years ago

Any idea what causes the 4am PST interruption to the data stream every Saturday (consistently generating the 4:05 alert)?

Mark Reid [:mreid]

Assignee

Comment 5

•

11 years ago

I think it's just due to Saturdays being consistently ~15-20% lower volume than Fridays, so it hits the normal stddev cutoff...
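To make the arithmetic concrete, the sketch below shows how a drop of this size can fall outside a standard-deviation band built from steadier recent data. It is only an illustration, not the Heka alerting code. The window of 15 points and the 1.5 standard-deviation multiplier are assumptions borrowed from the `win=15` and `sd=1.5` query parameters in the dashboard URL in comment 8, and the counts and function name are made up.

```python
import statistics


def stddev_alert(history, current, num_sd=1.5):
    """Return True when `current` lies more than `num_sd` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    lower = mean - num_sd * sd
    upper = mean + num_sd * sd
    return current < lower or current > upper


# Hypothetical window of 15 recent submission counts (mean 1000).
history = [1000, 1020, 980, 1010, 990, 1005, 995, 1015,
           985, 1000, 1010, 990, 1005, 995, 1000]

# A Saturday running about 17% below the recent mean trips the cutoff.
print(stddev_alert(history, 830))
# A value within normal variation does not.
print(stddev_alert(history, 995))
```

With a band this tight, any regular weekly dip larger than the usual variation in the window will alert, which matches the explanation above.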

Roberto Agostino Vitillo (:rvitillo)

Comment 6

•

11 years ago

Mike, what are you using as reference distribution for your stddev approach?

Mike Trinkala [:trink]

Comment 7

•

11 years ago

The data as-is will not cause an alert. There must have been an interruption in the data stream and when it was over the old data was backfilled correcting the graph.

Mike Trinkala [:trink]

Comment 8

•

11 years ago

(In reply to Mike Trinkala [:trink] from comment #7)
> The data as-is will not cause an alert. There must have been an
> interruption in the data stream and when it was over the old data was
> backfilled correcting the graph.

Ignore the comment above... It helps if I look at the right graph (it just clips the threshold)
http://ec2-50-112-66-71.us-west-2.compute.amazonaws.com:4352/alert_threshold.html?win=15&col=1&sd=1.5&file=TelemetryChannelMetrics60DaysAggregatorAlerting.ALL.cbuf

Mike Trinkala [:trink]

Comment 9

•

11 years ago

All this is in and has been running on the Ops-supported shared Heka; we just need to get the telemetry edge nodes updated to send the data there. https://heka.shared.us-west-2.prod.mozaws.net/ (you will probably need to ask whd for access to the dashboard)

Assignee: mtrinkala → mreid

Mark Reid [:mreid]

Assignee

Comment 10

•

10 years ago

The automated alerting has been running for a long time.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

7 years ago

Product: Webtools → Webtools Graveyard

Categories

(Webtools Graveyard :: Telemetry Server, defect)
