Closed Bug 945834 Opened 11 years ago Closed 11 years ago

Add nagios monitoring for rabbit[12].releng.webapp.scl3.mozilla.com

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cliang, Assigned: ashish)

References

Details

Please add a check similar to webops-rabbitmq-overview for the two hosts listed above, with the warning level set at 300 messages and the critical level set at 600 messages.
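
For illustration only, the requested thresholds map onto the standard Nagios plugin exit codes roughly as in the sketch below. This is not the code of the existing check; the function name `classify_unread` and the choice that a count at or above a threshold triggers that level are assumptions.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_unread(messages, warn=300, crit=600):
    """Map an unread-message count to a Nagios (exit_code, label) pair.

    Assumption: a count at or above a threshold triggers that level.
    The real check may compare differently.
    """
    if messages is None or messages < 0:
        # No usable count was obtained from the broker.
        return UNKNOWN, "UNKNOWN"
    if messages >= crit:
        return CRITICAL, "CRITICAL"
    if messages >= warn:
        return WARNING, "WARNING"
    return OK, "OK"
```

With the values requested here, 299 messages would be OK, 300 would warn, and 600 would go critical.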


Dustin: 

1) The webops-rabbitmq-overview check (for the number of unread RabbitMQ messages) normally notifies only sysalertsonly.  Should anyone else be included in the contact group for that check?

2) I wasn't sure whether it makes sense for these hosts to have the same beam-procs check as the older releng RabbitMQ servers.
I'd prefer that these look as much like "normal" webops rabbitmq hosts as possible.

We haven't had any problem with message overflow, so for the moment I think it's fine for those alerts to just go to sysalertsonly.
Blocks: 934593
This is complete:

https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=rabbitmq-releng&style=detail
Assignee: server-ops → ashish
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Err, wrong nagios instance

https://nagios.mozilla.org/scl3/cgi-bin/status.cgi?hostgroup=rabbitmq-releng&style=detail

Also, errors:

rabbit1.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - Access Refused : /api/overview
rabbit2.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - Access Refused : /api/overview

The checks are downtimed; please verify that they recover once this is fixed. Thanks!
So, between RabbitMQ 2.x and 3.x, they changed the default administrative port number from 55672 to 15672.  The check was not passing along a port number, so it was defaulting to 55672, which will not work for this set of servers (since they are running RabbitMQ 3.2.1).

I've modified the check so that it passes along a port option, taking the port number from 'rabbitmq_admin_port' in Hiera.  There is a default 'rabbitmq_admin_port' entry of '55672' in hiera/site.yaml, which can be overridden by node-specific hiera files.
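
The override behaviour described above can be sketched as follows. This is a plain-Python imitation of the lookup order, not Hiera itself: the dictionaries stand in for hiera/site.yaml and the node-specific files, the node entries are hypothetical (the real node files are not shown in this bug), and the `http` scheme in the URL is an assumption.

```python
# Stand-in for the default entry described for hiera/site.yaml.
SITE_DEFAULTS = {"rabbitmq_admin_port": 55672}

# Hypothetical node-specific data for the RabbitMQ 3.x hosts.
NODE_OVERRIDES = {
    "rabbit1.releng.webapp.scl3.mozilla.com": {"rabbitmq_admin_port": 15672},
    "rabbit2.releng.webapp.scl3.mozilla.com": {"rabbitmq_admin_port": 15672},
}


def lookup(key, host, node_overrides=NODE_OVERRIDES, site_defaults=SITE_DEFAULTS):
    """Return the node-specific value for key if present, else the site default."""
    node = node_overrides.get(host, {})
    if key in node:
        return node[key]
    return site_defaults[key]


def overview_url(host):
    """Build the management API URL that the check would query (scheme assumed)."""
    port = lookup("rabbitmq_admin_port", host)
    return f"http://{host}:{port}/api/overview"
```

A host with no node-specific entry falls back to 55672, so existing RabbitMQ 2.x hosts keep working without changes.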
Nice, that's a great new approach. I like! Thanks C!
(In reply to C. Liang [:cyliang] from comment #4)
> So, between RabbitMQ 2.x and 3.x, they changed the default administrative
> port number from 55672 to 15672.  The check was not passing along a port
> number, so it was defaulting to 55672, which will not work for this set of
> servers (since they are running RabbitMQ 3.2.1).
> 

There's another problem too. The check still isn't working, and running it by hand returns "Not a HASH reference at /usr/lib64/nagios/plugins/custom/check_rabbitmq_overview.pl line 119." I dug into the code and found that queue_totals in /api/overview was an empty array instead of an object containing message metrics:

>  "queue_totals": [],

Should have been:

>  "queue_totals": {
>    "messages": 0,
>    "messages_details": {
>      "rate": 0
>    },
>    "messages_ready": 0,
>    "messages_ready_details": {
>      "rate": 0
>    },
>    "messages_unacknowledged": 0,
>    "messages_unacknowledged_details": {
>      "rate": 0
>    }
>  },

I poked around the queues in the web UI, and opening the test queue seemed to "initialise" it: the message count is now 0 instead of "?". So when you nuke the test queue, please make sure the queues are still initialised :)
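
A check could guard against the uninitialised case instead of crashing. The sketch below is in Python rather than the plugin's Perl and is not the plugin's actual code; the function name `unread_messages` is hypothetical. It treats anything other than an object in queue_totals as "no data yet".

```python
import json


def unread_messages(overview_json):
    """Return the total message count from an /api/overview response body.

    Returns None when queue_totals is missing or is not an object,
    for example the empty array seen before any queue was initialised.
    """
    overview = json.loads(overview_json)
    totals = overview.get("queue_totals")
    if not isinstance(totals, dict):
        return None
    return totals.get("messages")
```

A caller could then report UNKNOWN when the result is None, rather than failing with a dereference error.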
Product: mozilla.org → mozilla.org Graveyard