Set CloudAMQP queue alerts per queue
Categories: Tree Management :: Treeherder: Infrastructure, defect
Tracking: Not tracked
People: Reporter: armenzg; Assigned: armenzg
Attachments: 3 files
We currently send an alert when we reach 2000 messages in the last 5 minutes for all queues:

    300 seconds | 2000 messages | Any queue
I suggest the following alerts:
    120 seconds | 500 messages  | log_parser.*
    120 seconds | 1000 messages | store_pulse.*
    300 seconds | 1000 messages | Any queue
Doing this will help us notice issues a bit earlier, and we can adjust the values over time.
camd, sclements: does this work for you?
We can also set up alerts for the sheriffs' email alias if it helps.
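For reference, here's a minimal sketch of creating these per-queue alarms programmatically. It assumes the CloudAMQP alarm API accepts the same fields as the cloudamqp_alarm Terraform resource (type, value_threshold, time_threshold, queue_regex) and HTTP basic auth with a blank user and the instance API key as password; verify against the current CloudAMQP API docs before relying on it.

    # Sketch: create the proposed queue alarms via the CloudAMQP API.
    # Field names and auth scheme are assumptions (based on the
    # cloudamqp_alarm Terraform resource), not confirmed API details.
    import os

    import requests

    API_URL = "https://api.cloudamqp.com/api/alarms"
    API_KEY = os.environ["CLOUDAMQP_APIKEY"]  # per-instance API key

    # (time window in seconds, message threshold, queue regex)
    PROPOSED_ALARMS = [
        (120, 500, "log_parser.*"),
        (120, 1000, "store_pulse.*"),
        (300, 1000, ".*"),  # any queue
    ]

    for time_threshold, value_threshold, queue_regex in PROPOSED_ALARMS:
        response = requests.post(
            API_URL,
            auth=("", API_KEY),  # assumed: blank user, API key as password
            json={
                "type": "queue",
                "time_threshold": time_threshold,
                "value_threshold": value_threshold,
                "queue_regex": queue_regex,
            },
            timeout=30,
        )
        response.raise_for_status()
        print(f"created alarm: {queue_regex} @ {value_threshold} msgs / {time_threshold}s")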
Comment 1 • 4 years ago (Assignee)
This shows the alarms as I've set them up for myself.
Comment 2 • 4 years ago (Assignee)
Ignore what happens to the left of 12:10pm ET (that's some pulse tasks from GitHub PRs that got stuck in the queues).
I got pinged at 12:18, meaning that sheriffs can notice quite quickly when we fall behind.
Given that, we might want to change this:

    120 seconds | 500 messages | log_parser.*

to this:

    60 seconds  | 50 messages  | log_parser_fail
    120 seconds | 200 messages | log_parser
At some point, getting emails does not necessarily beat a ping, since we're less likely to check our email that often.
This extra granularity might not be necessary, but maybe it is.
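If we go that route, the same sketch from the description would apply with the tuples swapped, plus anchored regexes so the broad log_parser.* pattern doesn't cover both queues at once (still assuming the same hypothetical field names):

    # Hypothetical refinement of the earlier sketch: exact-match regexes so
    # log_parser_fail and log_parser get separate thresholds.
    PROPOSED_ALARMS = [
        (60, 50, "^log_parser_fail$"),
        (120, 200, "^log_parser$"),
    ]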
Comment 3 • 4 years ago
I wonder if the 50 messages for log_parser_fail might be too low and then we get emails all the time, potentially drowning out the more serious issues.
Comment 4 • 4 years ago (Assignee)
I'm going to add Slack integration for the #treeherder-ops channel. I can be notified more easily through that than via email.
(In reply to Sarah Clements [:sclements] from comment #3)
> I wonder if the 50 messages for log_parser_fail might be too low and then we get emails all the time, potentially drowning out the more serious issues.
I agree. At the moment, I'm setting these up for myself. We can easily adjust.
Comment 5 • 4 years ago (Assignee)
I've set up treeherder-{stage,prod} to report to Slack.
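In case anyone needs to reproduce this, here's a sketch of registering a Slack recipient, assuming the notifications endpoint mirrors the cloudamqp_notification Terraform resource (a type plus a value holding the webhook URL); the endpoint path, field names, and recipient name are assumptions rather than confirmed API details.

    # Sketch: register a Slack webhook as a CloudAMQP alarm recipient.
    # Endpoint and field names are assumptions (based on the
    # cloudamqp_notification Terraform resource).
    import os

    import requests

    API_KEY = os.environ["CLOUDAMQP_APIKEY"]
    SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]  # webhook for #treeherder-ops

    response = requests.post(
        "https://api.cloudamqp.com/api/notifications",
        auth=("", API_KEY),
        json={
            "type": "slack",
            "value": SLACK_WEBHOOK,
            "name": "treeherder-ops",  # hypothetical label
        },
        timeout=30,
    )
    response.raise_for_status()
    print("Slack recipient registered")

This would need to be run once per instance, since CloudAMQP alarms and recipients are configured per instance, covering both treeherder-stage and treeherder-prod.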