Closed
Bug 945834
Opened 11 years ago
Closed 11 years ago
Add nagios monitoring for rabbit[12].releng.webapp.scl3.mozilla.com
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cliang, Assigned: ashish)
Please add a check similar to webops-rabbitmq-overview for the two hosts listed above, with the warning level set at 300 messages and the critical level set at 600 messages.

Dustin:
1) The webops-rabbitmq-overview check (for number of RabbitMQ unread messages) normally goes to just sysalertsonly. Should anyone else be included in the contact group for that check?
2) I didn't know if it made sense for these hosts to have the same beam-procs check as the older releng RabbitMQ servers.
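For illustration only, the requested thresholds could be sketched as below. This is not the actual check_rabbitmq_overview.pl logic: the function name, the inclusive comparisons, and the status-line wording are assumptions; only the 300/600 levels come from the request above. The exit codes follow the usual Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_unread(messages, warn=300, crit=600):
    """Map an unread-message count to (exit_code, status_line).

    Hypothetical helper: treating the thresholds as inclusive is an
    assumption, not something the bug specifies.
    """
    if messages >= crit:
        return CRITICAL, f"CRITICAL - {messages} unread messages"
    if messages >= warn:
        return WARNING, f"WARNING - {messages} unread messages"
    return OK, f"OK - {messages} unread messages"


if __name__ == "__main__":
    for count in (0, 300, 600):
        print(classify_unread(count))
```

A real plugin would exit with the returned code so that Nagios can pick up the state.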
Comment 1•11 years ago
I'd prefer that these look as much like "normal" webops rabbitmq hosts as possible. We haven't had any problem with message overflow, so for the moment I think it's fine for those alerts to just go to sysalertsonly.
Comment 2•11 years ago (Assignee)
This is complete: https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=rabbitmq-releng&style=detail
Assignee: server-ops → ashish
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 3•11 years ago (Assignee)
Err, wrong nagios instance. The correct one: https://nagios.mozilla.org/scl3/cgi-bin/status.cgi?hostgroup=rabbitmq-releng&style=detail

Also, errors:

rabbit1.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - Access Refused : /api/overview
rabbit2.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - Access Refused : /api/overview

Checks are downtimed, please verify that they recover once fixed. Thanks!
Comment 4•11 years ago (Reporter)
So, between RabbitMQ 2.x and 3.x, they changed the default administrative port number from 55672 to 15672. The check was not passing along a port number, so it was defaulting to 55672, which will not work for this set of servers (since they are running RabbitMQ 3.2.1). I've modified the check so that it passes along a port option, taking the port number from 'rabbitmq_admin_port' in Hiera. There is a default 'rabbitmq_admin_port' entry of '55672' in hiera/site.yaml, which can be overridden by node-specific hiera files.
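The default-plus-override lookup described above can be sketched roughly as follows. This is a plain-Python stand-in for the Hiera hierarchy, not the actual Puppet or Hiera configuration: the dictionaries, the function names, and the use of plain http are all assumptions; only the key name 'rabbitmq_admin_port', the default of 55672, and the 15672 port for RabbitMQ 3.x come from the comment.

```python
# Site-wide default, mirroring the entry described for hiera/site.yaml.
SITE_DEFAULTS = {"rabbitmq_admin_port": 55672}

# Hypothetical node-specific overrides for the RabbitMQ 3.x hosts.
NODE_OVERRIDES = {
    "rabbit1.releng.webapp.scl3.mozilla.com": {"rabbitmq_admin_port": 15672},
    "rabbit2.releng.webapp.scl3.mozilla.com": {"rabbitmq_admin_port": 15672},
}


def lookup(key, node):
    """Return the node-specific value if present, else the site default."""
    return NODE_OVERRIDES.get(node, {}).get(key, SITE_DEFAULTS[key])


def overview_url(node):
    """Build the management API URL the check would query (http assumed)."""
    port = lookup("rabbitmq_admin_port", node)
    return f"http://{node}:{port}/api/overview"


if __name__ == "__main__":
    print(overview_url("rabbit1.releng.webapp.scl3.mozilla.com"))
    print(overview_url("some-older-rabbit.example.com"))
```

Keeping the old port as the site default means existing 2.x hosts need no change, while 3.x hosts opt in per node.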
Comment 5•11 years ago (Assignee)
Nice, that's a great new approach. I like! Thanks C!
Comment 6•11 years ago (Assignee)
(In reply to C. Liang [:cyliang] from comment #4)
> So, between RabbitMQ 2.x and 3.x, they changed the default administrative
> port number from 55672 to 15672. The check was not passing along a port
> number, so it was defaulting to 55672, which will not work for this set of
> servers (since they are running RabbitMQ 3.2.1).

Something else too. The check still isn't working, and running it by hand returns "Not a HASH reference at /usr/lib64/nagios/plugins/custom/check_rabbitmq_overview.pl line 119.". So I dug into the code and found that queue_totals was empty in /api/overview instead of containing message metrics:

> "queue_totals": [],

It should have been:

> "queue_totals": {
>   "messages": 0,
>   "messages_details": {
>     "rate": 0
>   },
>   "messages_ready": 0,
>   "messages_ready_details": {
>     "rate": 0
>   },
>   "messages_unacknowledged": 0,
>   "messages_unacknowledged_details": {
>     "rate": 0
>   }
> },

I poked around the queues in the web UI, and opening the test queue seemed to "initialise" it: the messages count is now 0 instead of "?". So when nuking the test queue, please make sure that the queues are initialised :)
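As a rough sketch of how a check could tolerate the empty queue_totals seen above instead of dying on it, the Python below treats anything other than a populated object as "no data yet". It is not the Perl plugin's code: the function name and the choice to return None are assumptions, and a real check would probably map that case to an UNKNOWN state.

```python
import json


def unread_messages(overview_json):
    """Return the total message count from an /api/overview body.

    Returns None when queue_totals is missing or is not an object,
    for example the empty list seen before any queue was initialised.
    """
    overview = json.loads(overview_json)
    totals = overview.get("queue_totals")
    if not isinstance(totals, dict) or "messages" not in totals:
        return None
    return totals["messages"]


if __name__ == "__main__":
    print(unread_messages('{"queue_totals": []}'))
    print(unread_messages('{"queue_totals": {"messages": 0}}'))
```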
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard