Closed
Bug 923571
Opened 11 years ago
Closed 9 years ago
Please add nagios monitoring to vertica staging and production cluster
Categories
(Data & BI Services Team :: DB: MySQL, task)
Tracking
(Not tracked)
RESOLVED
FIXED
Due Date:
People
(Reporter: aelliott, Assigned: mpressman)
References
Details
(Whiteboard: [data:monitoring] [2015q1])
Today (10/3/13) our production Vertica cluster experienced a segfault and was unusable. No one received any alerts, and we only found out because someone could not access it. Please create Nagios alerting for the two Vertica clusters (staging and production) to monitor for this, along with any standard monitoring that is done for MySQL instances. :cyborgshadow helped clear the issue.
Updated•11 years ago
Whiteboard: [2014q1]
Updated•10 years ago
Assignee: server-ops-database → mpressman
Updated•10 years ago
Whiteboard: [2014q1] → [2014q2] April
Comment 1•10 years ago
So this command shows how to monitor:
[dbadmin@vertica5.metrics.scl3 ~]$ /opt/vertica/bin/admintools -t view_cluster
 DB      | Host | State
---------+------+-------
 metrics | ALL  | UP
Comment 2•10 years ago
Here's what it looks like when things are down:
[dbadmin@vertica4.metrics.scl3 ~]$ /opt/vertica/bin/admintools -t view_cluster
 DB      | Host          | State
---------+---------------+-------
 metrics | 192.168.100.7 | UP
 metrics | 192.168.100.8 | UP
 metrics | 192.168.100.9 | DOWN
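The two outputs above share one layout: a header row, a dashed separator, then one row per host (or a single ALL row). A minimal sketch of pulling host and state pairs out of that layout follows. The helper name parse_view_cluster is hypothetical, and the sketch assumes the pipe-delimited format stays exactly as shown in these comments.

```shell
#!/bin/sh
# Sketch only: parse_view_cluster is a hypothetical helper, not the deployed plugin.
# Reads `admintools -t view_cluster` output on stdin and prints "host state" per row.
parse_view_cluster() {
    awk -F'|' '
        NF >= 3 {                                  # separator row has no "|", so it is skipped
            host = $2; state = $3
            gsub(/^[ \t]+|[ \t]+$/, "", host)      # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", state)
            if (state != "" && state != "State")   # skip the header row
                print host, state
        }
    '
}

# Example use on a Vertica host (not executed here):
#   /opt/vertica/bin/admintools -t view_cluster | parse_view_cluster
```

Matching on the column contents rather than on fixed line numbers keeps the sketch working whether the table lists one ALL row or one row per host.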
Comment 3•10 years ago
Also: select * from vert_sys.v_catalog_nodes ;
Updated•10 years ago
Whiteboard: [2014q2] April → [2014q2] May
Updated•10 years ago
Whiteboard: [2014q2] May → [2014q2] June
Assignee | ||
Comment 4•10 years ago
Added plugin check_vertica_cluster.sh Committed revision 89747.
Assignee | ||
Comment 5•10 years ago
This is completed
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•10 years ago
Whiteboard: [2014q2] June → [monitoring]
Updated•10 years ago
Whiteboard: [monitoring] → [data: monitoring]
Comment 6•10 years ago
Adding the check to puppet is not the last step in this. Once the nrpe check is added to puppet (in modules/nrpe/files/plugins/), we need to:
0) make sure the check is defined in nrpe (modules/nrpe/manifests/plugins.pp)
1) make sure the proper machines get the check (realize(Nrpe::Plugin['plugin_name']) in the puppet manifest)
2) make sure the check is defined in Nagios (modules/nagios/manifests/mozilla/checkcommands.pp)
3) make sure the service is defined in Nagios (modules/nagios/manifests/mozillaservices.pp)
4) make sure the proper hosts *use* the check (modules/nagios/manifests/hosts/xxx.pp)
I've done all this in r94285. It may need tweaking, so I'll post here when the work is complete and the vertica cluster is actually being monitored. For now, I've set it up to go to the postgres DBA oncall, so I didn't have to make a separate "vertica" group.
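As a rough aid for the checklist above, the sketch below searches a puppet checkout for a check name in the locations this comment lists. The function name check_wiring and its arguments are hypothetical. Step 1 is left out because the comment does not say which manifest holds the realize() call, and the hosts directory is searched as a whole because the host file name is elided as xxx.pp.

```shell
#!/bin/sh
# Sketch only: check_wiring is a hypothetical helper, not part of the puppet repo.
# Usage: check_wiring <puppet checkout root> <check name>
# Returns 0 if the name appears in every listed location, 1 otherwise.
check_wiring() {
    root=$1
    name=$2
    missing=0

    # The plugin itself is a file, so look for a matching file name.
    set -- "$root"/modules/nrpe/files/plugins/"$name"*
    if [ -e "$1" ]; then
        echo "found    modules/nrpe/files/plugins"
    else
        echo "MISSING  modules/nrpe/files/plugins"
        missing=$((missing + 1))
    fi

    # The remaining locations are manifests, so search their contents.
    for path in \
        modules/nrpe/manifests/plugins.pp \
        modules/nagios/manifests/mozilla/checkcommands.pp \
        modules/nagios/manifests/mozillaservices.pp \
        modules/nagios/manifests/hosts
    do
        if grep -rqs -- "$name" "$root/$path"; then
            echo "found    $path"
        else
            echo "MISSING  $path"
            missing=$((missing + 1))
        fi
    done

    [ "$missing" -eq 0 ]
}

# Example (not executed here):
#   check_wiring /path/to/puppet check_vertica_cluster
```

A text search only shows that the name is mentioned in each place. It does not prove the definitions are correct, so it is a reminder of the steps rather than a validation of them.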
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 7•10 years ago
I made a new "vertica-services" hostgroup too, in modules/nagios/manifests/mozilla/hostgroups.pp - this only needs to be done when there isn't already a hostgroup for the check.
Comment 8•10 years ago
Oh, there also needs to be a template in modules/nrpe/templates/nrpe.d, which I added.
Comment 9•10 years ago
I got the script into nagios, and changed it so that if something is not UP it gives a warning. Previously it checked for UP or DOWN, but the host could also be initializing. I've put it on vertica[4,5,6].stage.metrics and put the check into a downtime window for a month. We still need to figure out how to run the check as the dbadmin user, or give the nagios user permissions to run admintools.
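The "anything that is not UP is a warning" rule could look roughly like the sketch below. This is not the committed check_vertica_cluster.sh; the function name check_states and the message wording are assumptions. Exit status 1 is the standard Nagios WARNING code. The sudo line in the closing comment is only one possible answer to the dbadmin question, and it assumes a sudoers rule that has not been set up in this bug.

```shell
#!/bin/sh
# Sketch only: check_states is a hypothetical name, not the deployed plugin.
# Reads raw `admintools -t view_cluster` output on stdin.
# Any state other than UP (DOWN, INITIALIZING, ...) counts as not UP.
# Note: output with no data rows is not treated specially here.
check_states() {
    not_up=$(awk -F'|' '
        NF >= 3 {
            state = $3
            gsub(/^[ \t]+|[ \t]+$/, "", state)       # trim surrounding whitespace
            if (state != "State" && state != "UP")   # ignore header, count non-UP rows
                n++
        }
        END { print n + 0 }
    ')
    if [ "$not_up" -gt 0 ]; then
        echo "WARNING: $not_up node(s) not UP"
        return 1    # Nagios WARNING
    fi
    echo "OK: all nodes UP"
    return 0        # Nagios OK
}

# Possible invocation (assumes a sudoers rule letting the nagios user run
# admintools as dbadmin; that rule is an assumption, not existing config):
#   sudo -n -u dbadmin /opt/vertica/bin/admintools -t view_cluster | check_states
```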
Comment 10•10 years ago
The script will also need to deal with the case that the db name isn't found - e.g. if there is no cluster running on the machine:
[dbadmin@vertica4.stage.metrics.scl3 ~]$ /opt/vertica/bin/adminTools -t view_cluster
 DB | Host | State
----+------+-------
[dbadmin@vertica4.stage.metrics.scl3 ~]$
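One way to catch the empty table shown above is to count data rows before looking at any state. The sketch below does only that. The name require_cluster_rows is hypothetical, and reporting the case as Nagios UNKNOWN (exit status 3) is an assumption, since the comment does not say which status it should map to.

```shell
#!/bin/sh
# Sketch only: require_cluster_rows is a hypothetical helper.
# Reads raw view_cluster output on stdin and fails if the table has no data rows.
require_cluster_rows() {
    rows=$(awk -F'|' '
        NF >= 3 {
            state = $3
            gsub(/^[ \t]+|[ \t]+$/, "", state)        # trim surrounding whitespace
            if (state != "" && state != "State")      # skip the header row
                n++
        }
        END { print n + 0 }
    ')
    if [ "$rows" -eq 0 ]; then
        echo "UNKNOWN: view_cluster returned no database rows"
        return 3    # Nagios UNKNOWN (assumed mapping)
    fi
    echo "$rows row(s) found"
    return 0
}
```

Without a guard like this, a check that only counts non-UP rows would see zero problems in an empty table and report the host as healthy.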
Whiteboard: [data: monitoring] → [data:monitoring]
Updated•10 years ago
Whiteboard: [data:monitoring] → [data:monitoring] [2014q4]
Updated•10 years ago
Product: mozilla.org → Data & BI Services Team
Comment 12•10 years ago
This has been added to puppet.
Comment 13•9 years ago
This is done, except for what's in bug 1110972.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
Whiteboard: [data:monitoring] [2014q4] → [data:monitoring] [2015q1]
Updated•9 years ago
Due Date: 2015-03-31