Closed Bug 1016416 Opened 10 years ago Closed 10 years ago

Add nagios monitoring of rabbitmq messages to rabbit[1-2].webapp.scl3

Categories

(Infrastructure & Operations :: MOC: Service Requests, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cliang, Assigned: dgarvey)

References

Details

(Whiteboard: [triage:monitoring])

Please add a check similar to webops-rabbitmq-overview for rabbit[1-2].webapp.scl3. The warning level should be set at 5000 messages and the critical level set at 11000.
Verified in rabbit.pp that the nagios check is there.

    # SCL3 shared RabbitMQ cluster
    node /^rabbit[12].webapp.scl3.mozilla.com$/ {
        include rabbitmq::nagios
        realize(Nrpe::Plugin['check_rabbitmq_overview_webops'])

On the box it looks like the check is there?

    command[check_rabbitmq_overview]
    [dgarvey@rabbit1.webapp.scl3 ~]$ /usr/bin/perl /usr/lib64/nagios/plugins/custom/check_rabbitmq_overview.pl -H localhost --port=55672 -u admin -p secret -w 5000,6000,7000 -c 11000,12000,13000
    RABBITMQ_OVERVIEW OK - messages OK (1) messages_ready OK (1) messages_unacknowledged OK (0) | messages=1;5000;11000 messages_ready=1;6000;12000 messages_unacknowledged=0;7000;13000
    [dgarvey@rabbit1.webapp.scl3 ~]$
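The plugin output above carries its thresholds in the performance data after the `|`, in `label=value;warn;crit` form. A minimal sketch of reading that back, assuming integer values and exactly these three fields per label (real perfdata can also carry units and min/max); `parse_perfdata` is a hypothetical helper, not part of the plugin:

```python
def parse_perfdata(plugin_output: str) -> dict:
    """Parse 'label=value;warn;crit' tokens found after the '|' separator."""
    _, _, perf = plugin_output.partition("|")
    metrics = {}
    for token in perf.split():
        label, _, rest = token.partition("=")
        # Assumption: the first three fields are integer value, warn, crit.
        value, warn, crit = (int(field) for field in rest.split(";")[:3])
        metrics[label] = {"value": value, "warn": warn, "crit": crit}
    return metrics


# Output line copied from the run on rabbit1.webapp.scl3 above.
SAMPLE = (
    "RABBITMQ_OVERVIEW OK - messages OK (1) messages_ready OK (1) "
    "messages_unacknowledged OK (0) | messages=1;5000;11000 "
    "messages_ready=1;6000;12000 messages_unacknowledged=0;7000;13000"
)

if __name__ == "__main__":
    for name, metric in parse_perfdata(SAMPLE).items():
        print(name, metric)
```

This shows that `-w 5000,6000,7000` and `-c 11000,12000,13000` map positionally onto messages, messages_ready and messages_unacknowledged.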
Assignee: server-ops → dgarvey
C. Liang, we currently have the alert thresholds set to 10k and 15; is that OK?

    'webops-rabbitmq-overview' => {
        service_description => "WebOps Rabbit Unread Messages",
        check_command => 'check_rabbitmq_overview!100000,100000,100000!250000,250000,250000,',
        contact_groups => 'sysalertsonly',
        hostgroups => $::fqdn ? {
            'nagios1.private.phx1.mozilla.com' => [ 'rabbitmq', ],
            default => [ ]
        }
Flags: needinfo?(cliang)
The thresholds should be decreased to 5K (warn) and 11K (critical).

Do all alerts that show up in #sysadmins end up generating bug messages in the Server Operations:MOC component? Two weeks ago, one of the MDN components stopped processing messages, which led to a large backlog that kept growing even after functionality was restored - the queues hit a max of 280K messages [1] and I don't recall seeing any alerts for these.

[1] http://screencast.com/t/DEtkm5Jt7nS
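For reference, a minimal sketch of how the requested thresholds would classify a queue depth. It assumes the usual Nagios convention that a plain threshold alerts when the value exceeds it; the plugin's exact comparison isn't shown in this bug, and `classify` is a hypothetical name:

```python
WARN = 5000    # requested warning threshold (5K)
CRIT = 11000   # requested critical threshold (11K)


def classify(messages: int, warn: int = WARN, crit: int = CRIT) -> str:
    """Return a Nagios-style state for a message count.

    Assumption: a value strictly above a threshold trips it.
    """
    if messages > crit:
        return "CRITICAL"
    if messages > warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for depth in (1, 6000, 280000):
        print(depth, classify(depth))
```

Under that assumption, the 280K backlog mentioned above would be well into CRITICAL.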
Flags: needinfo?(cliang)
(In reply to C. Liang [:cyliang] from comment #3)
> The thresholds should be decreased to 5K (warn) and 11K (critical).
>
> Do all alerts that show up in #sysadmins end up generating bug messages in
> the Server Operations:MOC component? Two weeks ago, one of the MDN
> components stopped processing messages, which led to a large backlog that
> ended up growing even after functionality was restored - the queues hit a
> max of 280K messages [1] and I don't recall seeing any alerts for these.
>
> [1] http://screencast.com/t/DEtkm5Jt7nS

C., thanks for pointing this out. We could easily DDoS Bugzilla with that kind of action. Just yesterday I started looking at rate-limiting for that bug submitter script.

dgarvey - I will get with you in the office to discuss this further.
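The rate-limiting idea mentioned above could look roughly like the sliding-window sketch below. This is purely illustrative: the actual bug submitter script isn't shown in this bug, and the class name, limits and injectable clock are all assumptions.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most max_events submissions per per_seconds window (hypothetical)."""

    def __init__(self, max_events: int, per_seconds: float, clock=time.monotonic):
        self._max = max_events
        self._per = per_seconds
        self._clock = clock          # injectable so the limiter can be tested
        self._stamps = deque()       # timestamps of recently allowed events

    def allow(self) -> bool:
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._per:
            self._stamps.popleft()
        if len(self._stamps) < self._max:
            self._stamps.append(now)
            return True
        return False
```

A submitter would call `allow()` before filing each bug and skip (or batch) the alert when it returns `False`.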
Group: mozilla-employee-confidential
Component: Server Operations → MOC: Service Requests
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → lypulong
cyliang, do we still need to progress with this? Sorry, rbryce has left and I don't know what he was thinking. ;)
Flags: needinfo?(cliang)
Comment 4 isn't relevant anymore. There are two rabbitmq checks these days, but I'll defer to :cyliang on which ones also need to be assigned to these servers.
Reading through with fresh eyes:

We need a new rabbit check for the total number of messages added to the rabbitmq cluster in SCL3. This can be based on the existing check "webops-rabbitmq-overview", with the following changes:

- the check thresholds should be lower:
    check_command => 'check_rabbitmq_overview!5000,5000,5000!11000,11000,11000,'
- it should apply to SCL3 rather than PHX1:
    hostgroups => $::fqdn ? { 'nagios1.private.scl3.mozilla.com' => [ 'rabbitmq', ], default => [ ] }

Documentation for this check can be linked to the existing documentation (https://mana.mozilla.org/wiki/display/NAGIOS/Rabbit+Unread+Messages).
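To make the `check_command` string above easier to read, here is a minimal sketch that splits it into per-metric thresholds. It assumes the `!`-separated arguments are warn then crit, and that the three comma-separated values map to messages, messages_ready and messages_unacknowledged in that order (as the plugin output in comment 1 suggests); `parse_check_command` is a hypothetical helper:

```python
# Assumed metric order, taken from the plugin's perfdata in comment 1.
METRICS = ("messages", "messages_ready", "messages_unacknowledged")


def parse_check_command(check_command: str) -> dict:
    """Split 'name!w1,w2,w3!c1,c2,c3,' into per-metric warn/crit values."""
    name, warn_arg, crit_arg = check_command.split("!")

    def to_ints(arg: str) -> list:
        # Skip empty items so the trailing comma in the config is tolerated.
        return [int(item) for item in arg.split(",") if item]

    thresholds = {
        metric: {"warn": warn, "crit": crit}
        for metric, warn, crit in zip(METRICS, to_ints(warn_arg), to_ints(crit_arg))
    }
    return {"command": name, "thresholds": thresholds}


if __name__ == "__main__":
    parsed = parse_check_command(
        "check_rabbitmq_overview!5000,5000,5000!11000,11000,11000,"
    )
    print(parsed)
```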
Flags: needinfo?(cliang)
Cyliang,

Done...

    dgarvey@dgarvey-mozilla:~/bug1016416$ nc nagios1.private.scl3.mozilla.com 6557 < rabbit_mq.query
    check_rabbitmq_overview!5000,5000,5000!11000,11000,11000,;sysalertsonly;irchilight;virtual,rabbitmq,generic;sysalerts;irc,pagerduty-funnel,sysadmin-oncall
    dgarvey@dgarvey-mozilla:~/bug1139483$ cat rabbit_mq.query
    GET services
    Columns: check_command contact_groups contacts host_groups host_contact_groups host_contacts
    Filter: host_name = rabbit1.webapp.scl3.mozilla.com
    Filter: description = WebOps Rabbit Unread Messages again
    dgarvey@dgarvey-mozilla:~/bug1139483$
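The verification above is a Livestatus query piped through `nc`. A rough Python equivalent is sketched below, assuming the default `;` column separator seen in the output; the helper names are hypothetical, and `run_query` is defined but not called here since it needs network access to the Nagios host:

```python
import socket

# Columns and filters copied from rabbit_mq.query above.
COLUMNS = [
    "check_command", "contact_groups", "contacts",
    "host_groups", "host_contact_groups", "host_contacts",
]
FILTERS = [
    "host_name = rabbit1.webapp.scl3.mozilla.com",
    "description = WebOps Rabbit Unread Messages again",
]


def build_query(table: str, columns: list, filters: list) -> str:
    """Build a Livestatus query; a blank line terminates the request."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {flt}" for flt in filters]
    return "\n".join(lines) + "\n\n"


def parse_row(line: str, columns: list) -> dict:
    """Map one ';'-separated response line onto the requested columns."""
    return dict(zip(columns, line.rstrip("\n").split(";")))


def run_query(host: str, port: int, query: str) -> str:
    """Send the query over TCP and read until the server closes the socket."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(query.encode())
        sock.shutdown(socket.SHUT_WR)  # like nc reaching end of input
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()
```

For example, `parse_row(run_query("nagios1.private.scl3.mozilla.com", 6557, build_query("services", COLUMNS, FILTERS)), COLUMNS)` would return the single result row as a dict, with list-valued columns such as `host_groups` left as comma-separated strings.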
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [triage:monitoring]