Closed
Bug 618896
Opened 15 years ago
Closed 13 years ago
Integrate pulse.mozilla.org with nagios (or whatever mozilla uses)
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: christian, Assigned: rbryce)
Details
Integrate pulse.mozilla.org with nagios (or whatever mozilla uses). It'd be nice to have the server portion monitored.
Updated•14 years ago
Assignee: clegnitto → dustin
Comment 1•14 years ago
I'm not exactly sure how nagios works, but it would be great to be alerted if the RAM consumption by rabbitmq on this box exceeds 1GB. To find the RAM consumption, you can execute /usr/sbin/rabbitmqctl status, which produces output like this:
Status of node 'rabbit@dp-pulse01' ...
[{pid,3413},
{running_applications,
[{rabbitmq_management,"RabbitMQ Management Console","2.6.0"},
{webmachine,"webmachine","1.7.0-rmq2.6.0-hg0c4b60a"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","2.6.0"},
{amqp_client,"RabbitMQ AMQP Client","2.6.0"},
{rabbit,"RabbitMQ","2.6.0"},
{os_mon,"CPO CXC 138 46","2.2.6"},
{sasl,"SASL CXC 138 11","2.1.9.4"},
{rabbitmq_mochiweb,"RabbitMQ Mochiweb Embedding","2.6.0"},
{mochiweb,"MochiMedia Web Server","1.3-rmq2.6.0-git9a53dbd"},
{inets,"INETS CXC 138 49","5.6"},
{mnesia,"MNESIA CXC 138 12","4.4.19"},
{stdlib,"ERTS CXC 138 10","1.17.4"},
{kernel,"ERTS CXC 138 10","2.14.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R14B03 (erts-5.8.4) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
{memory,
[{total,167177304},
{processes,99064976},
{processes_used,98955024},
{system,68112328},
{atom,1333337},
{atom_used,1307882},
{binary,18775888},
{code,14435840},
{ets,31873880}]}]
...done.
It's the "total" field under "memory" that's the interesting number here. Can nagios periodically execute this and alert you and me if this number is > 1GB?
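A minimal sketch of such a check, assuming the status output keeps the Erlang-term layout shown above and that the check runs on the broker host with permission to call rabbitmqctl. The function names and the exit-code mapping (0 for OK, 2 for CRITICAL, the usual Nagios plugin convention) are illustrative; this is not an existing plugin.

```python
import re
import subprocess

ONE_GB = 1024 ** 3  # threshold from the request above; assumed to mean 2**30 bytes


def parse_total_memory(status_text):
    """Return the {total,N} value from the {memory,[...]} section, in bytes."""
    match = re.search(r"\{memory,\s*\[\s*\{total,\s*(\d+)\}", status_text)
    if match is None:
        raise ValueError("no memory total found in rabbitmqctl output")
    return int(match.group(1))


def check_memory(status_text, critical_bytes=ONE_GB):
    """Map the memory total onto a Nagios-style (exit_code, message) pair."""
    total = parse_total_memory(status_text)
    if total > critical_bytes:
        return 2, "RABBITMQ_MEMORY CRITICAL - total=%d bytes" % total
    return 0, "RABBITMQ_MEMORY OK - total=%d bytes" % total


def run_check(rabbitmqctl="/usr/sbin/rabbitmqctl"):
    """Run rabbitmqctl status and evaluate it; needs a live broker."""
    output = subprocess.check_output([rabbitmqctl, "status"]).decode()
    return check_memory(output)
```

With the sample output above, `check_memory` would report OK, since 167177304 bytes is well under the 1GB threshold.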
Comment 2•14 years ago
I'll see what I can whip up.
Comment 3•14 years ago
This will require getting the hardware running first, which I haven't been pushing on much, lately. I'm checking up on it.
Comment 4•14 years ago
Might be useful:
http://syslog.tv/rabbitmq-nagios/
Comment 5•14 years ago
We have some monitoring now - making sure the service is running, in general.
I think that the thing to do on this bug is to monitor unread messages. RabbitMQ gives an easy count of those clusterwide.
https://github.com/jamesc/nagios-plugins-rabbitmq can check that easily:
[root@pulse-rabbit1.dmz.phx1 ~]# perl check_rabbitmq_overview -H localhost -u nagios -p <elided> -w 50,50,50 -c 200,200,200
RABBITMQ_OVERVIEW WARNING - messages WARNING (56) messages_ready WARNING (56), messages_unacknowledged OK (0) | messages=56;50;200 messages_ready=56;50;200 messages_unacknowledged=0;50;200
(this was with a test queue that wasn't being consumed from - from what I can see rabbit rarely gets over 10 unread messages in normal operation, so the 50 and 200 thresholds are probably good)
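The perfdata after the `|` in that output follows the usual Nagios `label=value;warn;crit` form, so the WARNING result can be reproduced by hand. A small sketch, assuming plain integer values and an "alert when above the threshold" comparison; the helper names are hypothetical and this is not how the plugin itself is implemented:

```python
def parse_perfdata(perfdata):
    """Parse 'label=value;warn;crit ...' into {label: (value, warn, crit)}."""
    metrics = {}
    for item in perfdata.split():
        label, _, rest = item.partition("=")
        value, warn, crit = (int(field) for field in rest.split(";"))
        metrics[label] = (value, warn, crit)
    return metrics


def classify(value, warn, crit):
    """Return the state for one metric: above crit, above warn, or OK."""
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"


# Perfdata copied from the check output above.
line = "messages=56;50;200 messages_ready=56;50;200 messages_unacknowledged=0;50;200"
states = {name: classify(*fields) for name, fields in parse_perfdata(line).items()}
```

Here 56 unread messages is above the warning threshold of 50 but below the critical threshold of 200, which matches the WARNING state the plugin reported.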
Since nagios is very much up in the air right now, I'm not going to work on this at the moment, which means it will probably get handed to the dev services group first.
Comment 6•14 years ago
I agree that's exactly what we want.
Updated•14 years ago
Assignee: dustin → server-ops
Component: Pulse → Server Operations
Product: Webtools → mozilla.org
QA Contact: pulse → phong
Version: Trunk → other
Updated•14 years ago
Assignee: server-ops → rbryce
Comment 7•13 years ago
I can add the nagios check listed in comment 5. I just need a list of nodes and who to alert and escalate to.
Comment 8•13 years ago
hosts: pulse-rabbit{1,2}.dmz.phx1
alert: sysadmins first, escalation to jgriffin and me
Comment 9•13 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> hosts: pulse-rabbit{1,2}.dmz.phx1
> alert: sysadmins first, escalation to jgriffin and me
I will be involved with Pulse work and maintenance for the A-Team starting this quarter, so please add me to the nagios alerts if possible.
Thanks
dkl
Comment 10•13 years ago
Rick,
Can we get these added please?
Severity: enhancement → normal
QA Contact: phong → shyam
Comment 11•13 years ago
Added the check in comment 5. The perl install on pulse-rabbit2.dmz.phx1 is horked. I tried to fix it, but there seem to be a number of broken deps, and I didn't want to make a bad situation worse. Dustin, can you help here?
Also, the check I added (that's working on pulse-rabbit1) is in a CRITICAL state. I assume you would like to attend to those messages before we turn the alerting on.
Comment 12•13 years ago
It looks like pulse-build-translator-whimboo isn't reading its messages. :dkl, do you want to take care of that?
As for perl, uh, those machines should be configured identically. It looks like files used to be in the rpmforge-extras repo that aren't anymore (specifically, perl-IO-Compress-2.052-1.el6.rfx.noarch is installed on pulse-rabbit1, but not on 2). I don't know why that would happen, though.
Comment 13•13 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> It looks like pulse-build-translator-whimboo isn't reading its messages.
> :dkl, do you want to take care of that?
>
I was following this bug and I just nuked that queue.
Comment 14•13 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> It looks like pulse-build-translator-whimboo isn't reading its messages.
> :dkl, do you want to take care of that?
>
> As for perl, uh, those machines should be configured identically. It looks
> like files used to be in the rpmforge-extras repo that aren't anymore
> (specifically, perl-IO-Compress-2.052-1.el6.rfx.noarch is installed on
> pulse-rabbit1, but not on 2). I don't know why that would happen, though.
Thanks for the input, Dustin. I got the perl libs hammered out and the check script is now running on pulse-rabbit2. The check is still in a CRITICAL state, though.
https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?hostgroup=pulse-rabbits&style=overview
Comment 15•13 years ago
:dkl, looks like there are two more queues to nuke.
I'm guessing we should probably have you notified directly when this alert fires, and set it not to page SREs, at least initially. Is that OK?
Comment 16•13 years ago
Added this doc https://mana.mozilla.org/wiki/display/NAGIOS/Rabbit+Unread+Messages
Please help document the troubleshooting procedures and escalation paths.
Status: NEW → UNCONFIRMED
Ever confirmed: false
Comment 17•13 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> :dkl, looks like there are two more queues to nuke.
>
> I'm guessing we should probably have you notified directly when this alert
> fires, and set it not to page SREs, at least initially. Is that OK?
FYI, these queues were the result of whimboo testing changes to the pulsetranslator. I've asked him to be careful not to use durable queues for this, and filed bug 860372 to make durable queues an option, rather than the default, for pulsetranslator.
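For context, a durable queue is kept across broker restarts and goes on collecting messages after its consumer has gone away, which is how abandoned test queues can trip the unread-messages check. A sketch of declaring a throwaway test queue instead, assuming a pika-style channel whose `queue_declare` accepts `durable` and `auto_delete` keywords. pulsetranslator's actual client code is not shown in this bug, and the function, class, and queue names below are made up for illustration.

```python
def declare_test_queue(channel, name):
    """Declare a queue that should not outlive a testing session.

    durable=False: the queue is not kept across a broker restart.
    auto_delete=True: the broker removes it once its last consumer goes away.
    """
    return channel.queue_declare(queue=name, durable=False, auto_delete=True)


class RecordingChannel:
    """Stand-in for a real channel so the sketch runs without a broker."""

    def __init__(self):
        self.calls = []

    def queue_declare(self, **kwargs):
        # Record the arguments instead of talking to RabbitMQ.
        self.calls.append(kwargs)
        return kwargs


channel = RecordingChannel()
declare_test_queue(channel, "translator-test")  # hypothetical queue name
```

Against a real broker, the same call would be made on a channel obtained from an open connection rather than on the recording stand-in.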
Comment 18•13 years ago
I get the following when accessing
https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?hostgroup=pulse-rabbits&style=overview
"It appears as though you do not have permission to view information for any of the hosts you requested... If you believe this is an error, check the HTTP server authentication requirements for accessing this CGI and check the authorization options in your CGI configuration file."
dkl
Comment 19•13 years ago
(In reply to David Lawrence [:dkl] from comment #18)
> I get the following when accessing
>
> https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?hostgroup=pulse-rabbits&style=overview
>
> "It appears as though you do not have permission to view information for any
> of the hosts you requested... If you believe this is an error, check the
> HTTP server authentication requirements for accessing this CGI and check the
> authorization options in your CGI configuration file."
>
> dkl
Had to configure the contactgroup properly to get you access. Please test again for me.
Comment 20•13 years ago
(In reply to Rick Bryce [:rbryce] from comment #19)
> (In reply to David Lawrence [:dkl] from comment #18)
> > I get the following when accessing
> >
> > https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?hostgroup=pulse-rabbits&style=overview
> >
> > "It appears as though you do not have permission to view information for any
> > of the hosts you requested... If you believe this is an error, check the
> > HTTP server authentication requirements for accessing this CGI and check the
> > authorization options in your CGI configuration file."
> >
> > dkl
>
> Had to configure the contactgroup properly to get you access. Please test
> again for me.
FWIW, this is still broken for me. Sorry :(
dkl
Comment 21•13 years ago
(In reply to David Lawrence [:dkl] from comment #20)
> (In reply to Rick Bryce [:rbryce] from comment #19)
> > (In reply to David Lawrence [:dkl] from comment #18)
> > > I get the following when accessing
> > >
> > > https://nagios.mozilla.org/phx1/cgi-bin/status.cgi?hostgroup=pulse-rabbits&style=overview
> > >
> > > "It appears as though you do not have permission to view information for any
> > > of the hosts you requested... If you believe this is an error, check the
> > > HTTP server authentication requirements for accessing this CGI and check the
> > > authorization options in your CGI configuration file."
> > >
> > > dkl
> >
> > Had to configure the contactgroup properly to get you access. Please test
> > again for me.
>
> FWIW, this is still broken for me. Sorry :(
>
> dkl
dkl -
I had to run on Thursday to catch an airplane. I will get back on this issue today and keep you posted.
Comment 22•13 years ago
Fixed this up and confirmed with :dkl that it was working. Sorry for the tangle-ness!
Status: UNCONFIRMED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → mozilla.org Graveyard