Closed
Bug 1016416
Opened 10 years ago
Closed 10 years ago
Add nagios monitoring of rabbitmq messages to rabbit[1-2].webapp.scl3
Categories
(Infrastructure & Operations :: MOC: Service Requests, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cliang, Assigned: dgarvey)
References
Details
(Whiteboard: [triage:monitoring])
Please add a check similar to webops-rabbitmq-overview for rabbit[1-2].webapp.scl3. The warning level should be set at 5000 messages and the critical level set at 11000.
Comment 1 (Assignee) • 10 years ago
Verified in rabbit.pp that the Nagios check is there:
# SCL3 shared RabbitMQ cluster
node /^rabbit[12].webapp.scl3.mozilla.com$/ {
    include rabbitmq::nagios
    realize(Nrpe::Plugin['check_rabbitmq_overview_webops'])
On the box, it looks like the check is there as well:
command[check_rabbitmq_overview]
[dgarvey@rabbit1.webapp.scl3 ~]$ /usr/bin/perl /usr/lib64/nagios/plugins/custom/check_rabbitmq_overview.pl -H localhost --port=55672 -u admin -p secret -w 5000,6000,7000 -c 11000,12000,13000
RABBITMQ_OVERVIEW OK - messages OK (1) messages_ready OK (1) messages_unacknowledged OK (0) | messages=1;5000;11000 messages_ready=1;6000;12000 messages_unacknowledged=0;7000;13000
[dgarvey@rabbit1.webapp.scl3 ~]$
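For anyone reading the plugin output above: the part after the `|` is Nagios performance data, where each item has the form `label=value;warn;crit`. A minimal Python sketch of pulling those numbers apart follows. The function name `parse_perfdata` is hypothetical, and the sketch assumes plain numeric thresholds with no units or range syntax, which is all that appears in this output.

```python
def parse_perfdata(plugin_output):
    """Split Nagios plugin output into its status text and a dict of metrics.

    Assumes each perfdata item looks like label=value;warn;crit with plain
    numbers (no units of measure, no range syntax such as "10:20").
    """
    text, _, perf = plugin_output.partition("|")
    metrics = {}
    for item in perf.split():
        label, _, rest = item.partition("=")
        fields = rest.split(";")
        metrics[label] = {
            "value": float(fields[0]),
            "warn": float(fields[1]) if len(fields) > 1 and fields[1] else None,
            "crit": float(fields[2]) if len(fields) > 2 and fields[2] else None,
        }
    return text.strip(), metrics


# Output copied from the check run in comment 1.
OUTPUT = (
    "RABBITMQ_OVERVIEW OK - messages OK (1) messages_ready OK (1) "
    "messages_unacknowledged OK (0) | messages=1;5000;11000 "
    "messages_ready=1;6000;12000 messages_unacknowledged=0;7000;13000"
)

status, metrics = parse_perfdata(OUTPUT)
for name, m in metrics.items():
    print(name, m["value"], m["warn"], m["crit"])
```

This shows that the three comma-separated values passed to `-w` and `-c` line up, in order, with `messages`, `messages_ready` and `messages_unacknowledged`.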
Updated•10 years ago
Assignee: server-ops → dgarvey
Comment 2 (Assignee) • 10 years ago
C. Liang,
We currently have the alert thresholds set to 100k (warning) and 250k (critical), as shown below. Is that OK?
'webops-rabbitmq-overview' => {
    service_description => "WebOps Rabbit Unread Messages",
    check_command => 'check_rabbitmq_overview!100000,100000,100000!250000,250000,250000,',
    contact_groups => 'sysalertsonly',
    hostgroups => $::fqdn ? {
        'nagios1.private.phx1.mozilla.com' => [
            'rabbitmq',
        ],
        default => [
        ]
    }
Flags: needinfo?(cliang)
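To make the `check_command` string above easier to read: Nagios separates the command name from its arguments with `!`, so this value carries two arguments, each a comma-separated triple. The Python sketch below splits it up. It assumes the first argument is the warning triple and the second the critical triple, and that the triples follow the metric order seen in the plugin output in comment 1. The command definition that maps them to `-w` and `-c` is not shown in this bug, so treat that mapping as an assumption. The helper names are hypothetical.

```python
# Metric order as it appears in the plugin output in comment 1.
METRICS = ("messages", "messages_ready", "messages_unacknowledged")


def split_check_command(check_command):
    """Split 'name!arg1!arg2' into the command name and its argument list.

    Escaped '\\!' sequences are not handled in this sketch.
    """
    command, *args = check_command.split("!")
    return command, args


def thresholds_by_metric(arg):
    """Turn '5000,6000,7000' (a trailing comma is tolerated) into a dict."""
    values = [int(v) for v in arg.split(",") if v]
    return dict(zip(METRICS, values))


command, (warn_arg, crit_arg) = split_check_command(
    "check_rabbitmq_overview!100000,100000,100000!250000,250000,250000,"
)
print(command)
print("warning: ", thresholds_by_metric(warn_arg))
print("critical:", thresholds_by_metric(crit_arg))
```

The trailing comma in the second argument is kept in the input exactly as it appears in the config and is simply ignored when parsing.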
Comment 3 (Reporter) • 10 years ago
The thresholds should be decreased to 5K (warn) and 11K (critical).
Do all alerts that show up in #sysadmins end up generating bug messages in the Server Operations:MOC component? Two weeks ago, one of the MDN components stopped processing messages, which led to a large backlog that ended up growing even after functionality was restored - the queues hit a max of 280K messages [1] and I don't recall seeing any alerts for these.
[1] http://screencast.com/t/DEtkm5Jt7nS
Flags: needinfo?(cliang)
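As a quick way to compare the threshold sets discussed here, the following Python sketch classifies a message count against a warning/critical pair. It assumes the usual Nagios plugin convention that a plain threshold `N` alerts when the value is strictly greater than `N`; the plugin's exact boundary behaviour is not shown in this bug. The function name `classify` and the 8,000 sample value are hypothetical; the 280,000 figure is the peak reported in comment 3.

```python
def classify(value, warn, crit):
    """Return the Nagios state name for a value and a warn/crit pair.

    Assumes a plain threshold alerts when value > threshold.
    """
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"


CONFIGURED = (100000, 250000)  # thresholds in the check_command from comment 2
REQUESTED = (5000, 11000)      # thresholds requested in this bug

for count in (8000, 280000):
    print(count, "configured:", classify(count, *CONFIGURED),
          "requested:", classify(count, *REQUESTED))
```

With the requested thresholds, a backlog of a few thousand messages would already raise a warning, well before the queue reaches the sizes described above.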
Comment 4 • 10 years ago
(In reply to C. Liang [:cyliang] from comment #3)
> The thresholds should be decreased to 5K (warn) and 11K (critical).
>
> Do all alerts that show up in #sysadmins end up generating bug messages in
> the Server Operations:MOC component? Two weeks ago, one of the MDN
> components stopped processing messages, which led to a large backlog that
> ended up growing even after functionality was restored - the queues hit a
> max of 280K messages [1] and I don't recall seeing any alerts for these.
>
> [1] http://screencast.com/t/DEtkm5Jt7nS
C.
Thanks for pointing this out. We could easily DDoS Bugzilla with that kind of action. Just yesterday I started looking at rate-limiting for that bug submitter script.
dgarvey - I will get with you in the office to discuss this further.
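For context on the rate-limiting idea mentioned above, one common approach is a token bucket. The Python sketch below is purely illustrative: the bug submitter script itself is not shown in this bug, and the class name, parameters and limits here are all hypothetical.

```python
import time


class TokenBucket:
    """Allow at most `capacity` submissions in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True and consume a token if a submission is permitted now."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical limits: bursts of up to 5 bugs, then one new bug per minute.
bucket = TokenBucket(rate_per_sec=1 / 60, capacity=5)
results = [bucket.allow() for _ in range(7)]
print(results)
```

An alert handler would call `allow()` before filing a bug and drop or batch the alert when it returns False, so an alert storm cannot flood Bugzilla.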
Updated•10 years ago
Group: mozilla-employee-confidential
Component: Server Operations → MOC: Service Requests
Product: mozilla.org → Infrastructure & Operations
Updated•10 years ago
QA Contact: shyam → lypulong
Comment 5 (Assignee) • 10 years ago
cyliang,
Do we still need to proceed with this? Sorry, rbryce has left and I don't know what he was thinking. ;)
Flags: needinfo?(cliang)
Comment 6 • 10 years ago
Comment 4 isn't relevant anymore. There are two RabbitMQ checks these days, but I'll defer to :cyliang on which ones also need to be assigned to these servers.
Comment 7 (Reporter) • 10 years ago
Reading through with fresh eyes:
We need a new RabbitMQ check for the total number of messages added to the RabbitMQ cluster in SCL3.
This can be based on the existing check "webops-rabbitmq-overview", with the following changes:
- the check thresholds should be lower:
    check_command => 'check_rabbitmq_overview!5000,5000,5000!11000,11000,11000,'
- it should apply to SCL3 rather than PHX1:
    hostgroups => $::fqdn ? {
        'nagios1.private.scl3.mozilla.com' => [
            'rabbitmq',
        ],
        default => [
        ]
    }
Documentation for this check can be linked to the existing documentation (https://mana.mozilla.org/wiki/display/NAGIOS/Rabbit+Unread+Messages).
Flags: needinfo?(cliang)
Comment 8 (Assignee) • 10 years ago
Cyliang,
Done...
dgarvey@dgarvey-mozilla:~/bug1016416$ nc nagios1.private.scl3.mozilla.com 6557 < rabbit_mq.query
check_rabbitmq_overview!5000,5000,5000!11000,11000,11000,;sysalertsonly;irchilight;virtual,rabbitmq,generic;sysalerts;irc,pagerduty-funnel,sysadmin-oncall
dgarvey@dgarvey-mozilla:~/bug1139483$ cat rabbit_mq.query
GET services
Columns: check_command contact_groups contacts host_groups host_contact_groups host_contacts
Filter: host_name = rabbit1.webapp.scl3.mozilla.com
Filter: description = WebOps Rabbit Unread Messages again
dgarvey@dgarvey-mozilla:~/bug1139483$
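The same Livestatus query can be issued without `nc`. The Python sketch below builds the query text shown above, sends it over TCP, and splits a response row into named columns. It assumes the default Livestatus CSV output, where columns are separated by `;` and list members by `,`, which matches the response line above. The function names are hypothetical, and the network call is left for the caller to make.

```python
import socket

COLUMNS = (
    "check_command", "contact_groups", "contacts",
    "host_groups", "host_contact_groups", "host_contacts",
)


def build_query(host_name, description):
    """Return the Livestatus query text, terminated by a blank line."""
    lines = [
        "GET services",
        "Columns: " + " ".join(COLUMNS),
        f"Filter: host_name = {host_name}",
        f"Filter: description = {description}",
    ]
    return "\n".join(lines) + "\n\n"


def run_query(server, port, query, timeout=10.0):
    """Send the query to a Livestatus TCP endpoint and return the raw reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal end of request, as nc does at EOF
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


def parse_row(line):
    """Map one ';'-separated response line onto the requested column names."""
    return dict(zip(COLUMNS, line.rstrip("\n").split(";")))


# Response line copied from the nc output above.
row = parse_row(
    "check_rabbitmq_overview!5000,5000,5000!11000,11000,11000,;sysalertsonly;"
    "irchilight;virtual,rabbitmq,generic;sysalerts;irc,pagerduty-funnel,sysadmin-oncall"
)
print(row["check_command"])
print(row["host_groups"].split(","))
```

To run it against the server, pass the result of `build_query("rabbit1.webapp.scl3.mozilla.com", "WebOps Rabbit Unread Messages again")` to `run_query("nagios1.private.scl3.mozilla.com", 6557, ...)`, using the host and port from the `nc` command above.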
Updated (Assignee) • 10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•9 years ago
Whiteboard: [triage:monitoring]