Closed Bug 1641884 Opened 4 years ago Closed 4 years ago

Set CloudAMQP queue alerts per queue

Categories

(Tree Management :: Treeherder: Infrastructure, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

Attachments

(3 files)

We currently send an alert when we reach 2000 messages in the last 5 minutes for all the queues:

300 seconds 	2000 messages  Any queue

I suggest the following alerts:

 120 seconds 	 500 messages 	log_parser.*
 120 seconds 	1000 messages 	store_pulse.*
 300 seconds 	1000 messages 	Any queue

Doing this will help us notice backlogs a bit earlier, and we can adjust the values over time.
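To make the proposed rules concrete, here is a minimal Python sketch of how per-queue thresholds could be evaluated. It is a local model only, not the CloudAMQP API: the `AlertRule` class, the `should_alert` helper, and the sample format are hypothetical, and it assumes an alert fires only when every reading inside the time window is at or above the message threshold.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    window_seconds: int
    max_messages: int
    queue_pattern: str  # regular expression matched against the full queue name


# Thresholds proposed in this bug; ".*" stands in for "Any queue".
RULES = [
    AlertRule(120, 500, r"log_parser.*"),
    AlertRule(120, 1000, r"store_pulse.*"),
    AlertRule(300, 1000, r".*"),
]


def should_alert(rule, queue_name, samples):
    """samples: list of (seconds_ago, message_count) readings for one queue."""
    if not re.fullmatch(rule.queue_pattern, queue_name):
        return False
    in_window = [count for age, count in samples if age <= rule.window_seconds]
    # Assumption: alert only if every reading in the window is at or above the threshold.
    return bool(in_window) and all(count >= rule.max_messages for count in in_window)
```

With this model, a `log_parser` queue would trip the first rule after two minutes of sustained backlog, well before the 300-second catch-all rule could fire.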

camd, sclements: works for you?

We can also set up alerts for the sheriffs email alias if it helps.

This shows the alarms as I've set them up for myself.

Ignore what happens to the left of 12:10pm ET (that's some pulse tasks from github PRs that get stuck in the queues).

I got pinged at 12:18, meaning that sheriffs can notice quite quickly when we fall behind.

Given that, we might want to change this:

120 seconds 	 500 messages 	log_parser.*

to this:

 60 seconds 	  50 messages 	log_parser_fail
120 seconds 	 200 messages 	log_parser
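One thing to watch with the split above: `log_parser` is a prefix of `log_parser_fail`, so whether the two rules stay separate depends on how the pattern is matched. The Python sketch below illustrates the difference between anchored and unanchored matching. How CloudAMQP itself matches queue patterns is not stated in this bug, so treat the `matching_rules` helper and its `anchored` flag as hypothetical.

```python
import re

# (window_seconds, max_messages, pattern) taken from the split proposed above.
SPLIT_RULES = [
    (60, 50, r"log_parser_fail"),
    (120, 200, r"log_parser"),
]


def matching_rules(queue_name, anchored=True):
    """Return the (window_seconds, max_messages) pairs that apply to queue_name."""
    # fullmatch requires the whole name to match; search accepts a match anywhere.
    match = re.fullmatch if anchored else re.search
    return [(window, limit) for window, limit, pattern in SPLIT_RULES
            if match(pattern, queue_name)]
```

If matching turns out to be unanchored, `log_parser_fail` would be covered by both rules, and the `log_parser` pattern would need explicit anchors such as `^log_parser$`.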

At some point getting emails does not necessarily beat a ping (since we're less likely to check our emails that often).

This extra granularity might not be necessary, but maybe it is.

I wonder if the 50 messages for log_parser_fail might be too low and then we get emails all the time, potentially drowning out the more serious issues.

I'm going to add Slack integration for the #treeherder-ops channel. I can get notified more easily through that than through emails.

(In reply to Sarah Clements [:sclements] from comment #3)

I wonder if the 50 messages for log_parser_fail might be too low and then we get emails all the time, potentially drowning out the more serious issues.

I agree. At the moment, I'm setting this up for myself. We can easily adjust.

I've set up treeherder-{stage,prod} to report to Slack.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED