618896 - Integrate pulse.mozilla.org with nagios (or whatever mozilla uses)

LegNeato

Reporter

Description

15 years ago

Integrate pulse.mozilla.org with nagios (or whatever mozilla uses). It'd be nice to have the server portion monitored.

Dustin J. Mitchell [:dustin] (he/him)

Updated

14 years ago

Assignee: clegnitto → dustin

Jonathan Griffin (:jgriffin)

Comment 1

14 years ago

I'm not exactly sure how nagios works, but it would be great to be alerted if the RAM consumption by rabbitmq on this box exceeds 1GB. To find the RAM consumption, you can execute /usr/sbin/rabbitmqctl status, which produces output like this:

Status of node 'rabbit@dp-pulse01' ...
[{pid,3413},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","2.6.0"},
      {webmachine,"webmachine","1.7.0-rmq2.6.0-hg0c4b60a"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","2.6.0"},
      {amqp_client,"RabbitMQ AMQP Client","2.6.0"},
      {rabbit,"RabbitMQ","2.6.0"},
      {os_mon,"CPO CXC 138 46","2.2.6"},
      {sasl,"SASL CXC 138 11","2.1.9.4"},
      {rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.6.0"},
      {mochiweb,"MochiMedia Web Server","1.3-rmq2.6.0-git9a53dbd"},
      {inets,"INETS CXC 138 49","5.6"},
      {mnesia,"MNESIA CXC 138 12","4.4.19"},
      {stdlib,"ERTS CXC 138 10","1.17.4"},
      {kernel,"ERTS CXC 138 10","2.14.4"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang R14B03 (erts-5.8.4) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
 {memory,
     [{total,167177304},
      {processes,99064976},
      {processes_used,98955024},
      {system,68112328},
      {atom,1333337},
      {atom_used,1307882},
      {binary,18775888},
      {code,14435840},
      {ets,31873880}]}]
...done.

It's the "total" field under "memory" that's the interesting number here. Can nagios periodically execute this and alert you and me if this number is > 1GB?
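For illustration only, here is a minimal Python sketch of the kind of check described in this comment. It is not the check that was eventually deployed. It assumes that "1GB" means 1 GiB, that the status output keeps the `{memory, [{total,N}, ...` layout shown above, and that an over-limit reading should map to the standard Nagios CRITICAL exit code. The names `memory_total`, `check_memory`, and `run_check` are hypothetical.

```python
import re
import subprocess

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Assumption: "1GB" is read as 1 GiB; adjust if a decimal gigabyte is intended.
LIMIT_BYTES = 1024 ** 3

# Matches the "total" entry that opens the memory list in the status output.
MEMORY_TOTAL_RE = re.compile(r"\{memory,\s*\[\s*\{total,\s*(\d+)\}")


def memory_total(status_output):
    """Return the memory total in bytes, or None if it is not found."""
    match = MEMORY_TOTAL_RE.search(status_output)
    return int(match.group(1)) if match else None


def check_memory(status_output, limit=LIMIT_BYTES):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    total = memory_total(status_output)
    if total is None:
        return UNKNOWN, "RABBITMQ_MEMORY UNKNOWN - no memory total in output"
    if total > limit:
        return CRITICAL, "RABBITMQ_MEMORY CRITICAL - total=%d bytes" % total
    return OK, "RABBITMQ_MEMORY OK - total=%d bytes" % total


def run_check():
    """Run rabbitmqctl (path taken from the comment above) and evaluate it."""
    output = subprocess.run(
        ["/usr/sbin/rabbitmqctl", "status"],
        capture_output=True, text=True, check=False,
    ).stdout
    return check_memory(output)
```

A wrapper script would call `run_check()`, print the message, and exit with the returned code so that Nagios can pick up the state.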

Dustin J. Mitchell [:dustin] (he/him)

Comment 2

14 years ago

I'll see what I can whip up.

Dustin J. Mitchell [:dustin] (he/him)

Comment 3

14 years ago

This will require getting the hardware running first, which I haven't been pushing on much, lately. I'm checking up on it.

Jonathan Griffin (:jgriffin)

Comment 4

14 years ago

Might be useful: http://syslog.tv/rabbitmq-nagios/

Dustin J. Mitchell [:dustin] (he/him)

Comment 5

14 years ago

We have some monitoring now - making sure the service is running, in general. I think that the thing to do on this bug is to monitor unread messages. RabbitMQ gives an easy count of those clusterwide. https://github.com/jamesc/nagios-plugins-rabbitmq can check that easily:

[root@pulse-rabbit1.dmz.phx1 ~]# perl check_rabbitmq_overview -H localhost -u nagios -p <elided> -w 50,50,50 -c 200,200,200
RABBITMQ_OVERVIEW WARNING - messages WARNING (56) messages_ready WARNING (56), messages_unacknowledged OK (0) | messages=56;50;200 messages_ready=56;50;200 messages_unacknowledged=0;50;200

(this was with a test queue that wasn't being consumed from - from what I can see rabbit rarely gets over 10 unread messages in normal operation, so the 50 and 200 thresholds are probably good)

Since nagios is very much up in the air right now, I'm not going to work on this at the moment, which means it will probably get handed to the dev services group first.
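As a rough illustration of how the -w and -c thresholds in the command above relate to the reported states, here is a small Python sketch. It is not the plugin's actual code. It assumes that each comma-separated threshold applies, in order, to messages, messages_ready, and messages_unacknowledged, and that a value strictly above a threshold triggers that state. The names `parse_thresholds` and `evaluate` are hypothetical.

```python
# Nagios states, ordered by severity.
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}

# Assumed order of the comma-separated thresholds.
COUNTERS = ("messages", "messages_ready", "messages_unacknowledged")


def parse_thresholds(spec):
    """Turn a string such as "50,50,50" into a list of integers."""
    return [int(part) for part in spec.split(",")]


def evaluate(counts, warn_spec, crit_spec):
    """Return (overall_state, per_counter_states) for the given counts."""
    warn = parse_thresholds(warn_spec)
    crit = parse_thresholds(crit_spec)
    results = {}
    for name, warn_at, crit_at in zip(COUNTERS, warn, crit):
        value = counts[name]
        if value > crit_at:
            results[name] = CRITICAL
        elif value > warn_at:
            results[name] = WARNING
        else:
            results[name] = OK
    # The overall state is the most severe state of any single counter.
    return max(results.values()), results
```

Fed the counts from the sample output above (56, 56, 0) with thresholds 50,50,50 and 200,200,200, this yields WARNING overall, matching what the plugin reported.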

Jonathan Griffin (:jgriffin)

Comment 6

14 years ago

I agree that's exactly what we want.

Shyam Mani [:fox2mike]

Updated

14 years ago

Assignee: dustin → server-ops

Component: Pulse → Server Operations

Product: Webtools → mozilla.org

QA Contact: pulse → phong

Version: Trunk → other

Rick Bryce [:rbryce]

Assignee

Updated

14 years ago

Assignee: server-ops → rbryce

Rick Bryce [:rbryce]

Assignee

Comment 7

13 years ago

I can add the nagios check listed in comment 5. I just need a list of nodes, and who to alert and escalate to.

Dustin J. Mitchell [:dustin] (he/him)

Comment 8

13 years ago

hosts: pulse-rabbit{1,2}.dmz.phx1
alert: sysadmins first, escalation to jgriffin and me

David Lawrence [:dkl]

Comment 9

13 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> hosts: pulse-rabbit{1,2}.dmz.phx1
> alert: sysadmins first, escalation to jgriffin and me

I will be involved with Pulse work and maintenance for the A-Team starting this quarter, so please add me to the nagios alerts if possible.

Thanks
dkl

Shyam Mani [:fox2mike]

Comment 10

13 years ago

Rick, Can we get these added please?

Severity: enhancement → normal

QA Contact: phong → shyam

Rick Bryce [:rbryce]

Assignee

Comment 11

13 years ago

Added the check in Comment 5. The perl install on pulse-rabbit2.dmz.phx1 is horked. I tried to fix it, but there seem to be a number of broken deps, and I didn't want to make a bad situation worse. Dustin, can you help here? Also, the check I added (which is working on pulse-rabbit1) is in a CRITICAL state. I assume you would like to attend to those messages before we turn the alerting on.

Dustin J. Mitchell [:dustin] (he/him)

Comment 12

13 years ago

It looks like pulse-build-translator-whimboo isn't reading its messages. :dkl, do you want to take care of that?

As for perl, uh, those machines should be configured identically. It looks like packages that used to be in the rpmforge-extras repo aren't there anymore (specifically, perl-IO-Compress-2.052-1.el6.rfx.noarch is installed on pulse-rabbit1, but not on 2). I don't know why that would happen, though.

Jonathan Griffin (:jgriffin)

Comment 13

13 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> It looks like pulse-build-translator-whimboo isn't reading its messages.
> :dkl, do you want to take care of that?

I was following this bug and I just nuked that queue.

Rick Bryce [:rbryce]

Assignee

Comment 14

13 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> It looks like pulse-build-translator-whimboo isn't reading its messages.
> :dkl, do you want to take care of that?
>
> As for perl, uh, those machines should be configured identically. It looks
> like files used to be in the rpmforge-extras repo that aren't anymore
> (specifically, perl-IO-Compress-2.052-1.el6.rfx.noarch is installed on
> pulse-rabbit1, but not on 2). I don't know why that would happen, though.

Thanks for the input, dustin. I got the perl libs hammered out and the check script is now running on pulse-rabbit2. I still see that this check is in a CRITICAL state.

https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?hostgroup=pulse-rabbits&style=overview

Dustin J. Mitchell [:dustin] (he/him)

Comment 15

13 years ago

:dkl, looks like there are two more queues to nuke. I'm guessing we should probably have you notified directly when this alert fires, and set it not to page SREs, at least initially. Is that OK?

Rick Bryce [:rbryce]

Assignee

Comment 16

13 years ago

Added this doc: https://mana.mozilla.org/wiki/display/NAGIOS/Rabbit+Unread+Messages

Please help document the troubleshooting procedures and escalation paths.

Status: NEW → UNCONFIRMED

Ever confirmed: false

Jonathan Griffin (:jgriffin)

Comment 17

13 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> :dkl, looks like there are two more queues to nuke.
>
> I'm guessing we should probably have you notified directly when this alert
> fires, and set it not to page SREs, at least initially. Is that OK?

FYI, these queues were the result of whimboo testing changes to the pulsetranslator. I've asked him to be careful not to use durable queues for this, and filed bug 860372 to make durable queues an option, rather than the default, for pulsetranslator.

David Lawrence [:dkl]

Comment 18

13 years ago

I get the following when accessing

https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?hostgroup=pulse-rabbits&style=overview

"It appears as though you do not have permission to view information for any of the hosts you requested... If you believe this is an error, check the HTTP server authentication requirements for accessing this CGI and check the authorization options in your CGI configuration file."

dkl

Rick Bryce [:rbryce]

Assignee

Comment 19

13 years ago

(In reply to David Lawrence [:dkl] from comment #18)
> I get the following when accessing
>
> https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?hostgroup=pulse-rabbits&style=overview
>
> "It appears as though you do not have permission to view information for any
> of the hosts you requested... If you believe this is an error, check the
> HTTP server authentication requirements for accessing this CGI and check the
> authorization options in your CGI configuration file."
>
> dkl

Had to configure the contactgroup properly to get you access. Please test again for me.

David Lawrence [:dkl]

Comment 20

13 years ago

(In reply to Rick Bryce [:rbryce] from comment #19)
> (In reply to David Lawrence [:dkl] from comment #18)
> > I get the following when accessing
> >
> > https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?hostgroup=pulse-rabbits&style=overview
> >
> > "It appears as though you do not have permission to view information for any
> > of the hosts you requested... If you believe this is an error, check the
> > HTTP server authentication requirements for accessing this CGI and check the
> > authorization options in your CGI configuration file."
> >
> > dkl
>
> Had to configure the contactgroup properly to get you access. Please test
> again for me.

FWIW, this is still broken for me. Sorry :(

dkl

Rick Bryce [:rbryce]

Assignee

Comment 21

13 years ago

(In reply to David Lawrence [:dkl] from comment #20)
> (In reply to Rick Bryce [:rbryce] from comment #19)
> > (In reply to David Lawrence [:dkl] from comment #18)
> > > I get the following when accessing
> > >
> > > https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?hostgroup=pulse-rabbits&style=overview
> > >
> > > "It appears as though you do not have permission to view information for any
> > > of the hosts you requested... If you believe this is an error, check the
> > > HTTP server authentication requirements for accessing this CGI and check the
> > > authorization options in your CGI configuration file."
> > >
> > > dkl
> >
> > Had to configure the contactgroup properly to get you access. Please test
> > again for me.
>
> FWIW, this is still broken for me. Sorry :(
>
> dkl

dkl - I had to run on Thursday to catch an airplane. I will get back on this issue today, and keep you posted.

Ashish Vijayaram [:ashish]

Comment 22

13 years ago

Fixed this up and confirmed with :dkl that it was working. Sorry for the tangle-ness!

Status: UNCONFIRMED → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

11 years ago

Product: mozilla.org → mozilla.org Graveyard