Closed Bug 923571 Opened 11 years ago Closed 9 years ago

Please add nagios monitoring to vertica staging and production cluster

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Due Date:

People

(Reporter: aelliott, Assigned: mpressman)

References

Details

(Whiteboard: [data:monitoring] [2015q1])

Today (10/3/13) our production vertica cluster segfaulted and became unusable. No one received any alerts, and we only found out because someone could not access it.

Please create nagios alerting for the two vertica clusters (staging and production) to monitor for this, along with any standard monitoring that is done for mysql instances.

:cyborgshadow helped clear the issue
Whiteboard: [2014q1]
Assignee: server-ops-database → mpressman
Whiteboard: [2014q1] → [2014q2] April
So this command shows how to monitor:

[dbadmin@vertica5.metrics.scl3 ~]$ /opt/vertica/bin/admintools -t view_cluster
 DB      | Host | State 
---------+------+-------
 metrics | ALL  | UP
Here's what it looks like when things are down:

[dbadmin@vertica4.metrics.scl3 ~]$ /opt/vertica/bin/admintools -t view_cluster
 DB      | Host          | State 
---------+---------------+-------
 metrics | 192.168.100.7 | UP    
 metrics | 192.168.100.8 | UP    
 metrics | 192.168.100.9 | DOWN
Also:

select * from vert_sys.v_catalog_nodes ;
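A check script has to turn the view_cluster table above into something it can test. The following is only a minimal sketch of one way to do that, not the contents of check_vertica_cluster.sh. It assumes the pipe-separated layout shown in the two outputs above, and parse_view_cluster is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: print "host state" for each data row of
# `admintools -t view_cluster` output read from stdin.
# parse_view_cluster is a hypothetical name, not part of Vertica.
parse_view_cluster() {
    awk -F'|' '
        # Data and header rows have exactly three pipe-separated fields;
        # the shell prompt line and the ---+--- separator do not.
        NF == 3 {
            for (i = 1; i <= 3; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($1 == "DB" || $1 == "") next   # skip the header row
            print $2, $3
        }'
}
```

It would be fed from the command in this comment, for example `/opt/vertica/bin/admintools -t view_cluster | parse_view_cluster`.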
Whiteboard: [2014q2] April → [2014q2] May
Whiteboard: [2014q2] May → [2014q2] June
Added plugin check_vertica_cluster.sh
Committed revision 89747.
This is completed
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 1032409
Whiteboard: [2014q2] June → [monitoring]
Whiteboard: [monitoring] → [data: monitoring]
Adding the check to puppet is not the last step in this. Once the nrpe check is added to puppet (in modules/nrpe/files/plugins/), we need to:

0) make sure the check is defined in nrpe (modules/nrpe/manifests/plugins.pp)
1) make sure the proper machines get the check (realize(Nrpe::Plugin['plugin_name']) in the puppet manifest)
2) make sure the check is defined in Nagios (modules/nagios/manifests/mozilla/checkcommands.pp)
3) make sure the service is defined in Nagios (modules/nagios/manifests/mozillaservices.pp)
4) make sure the proper hosts *use* the check (modules/nagios/manifests/hosts/xxx.pp)

I've done all this in r94285. It may need tweaking, so I'll post here when the work is complete and the vertica cluster is actually being monitored.

For now, I've set it up to go to the postgres DBA oncall, so I didn't have to make a separate "vertica" group.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I made a new "vertica-services" hostgroup too, in modules/nagios/manifests/mozilla/hostgroups.pp - this only needs to be done when there isn't already a hostgroup for the check.
oh, there also needs to be a template in 

modules/nrpe/templates/nrpe.d

which I added.
I got the script into nagios, and changed it so that if something is not UP it gives a warning. Previously it checked only for UP or DOWN, but the host could also be initializing.
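The "anything other than UP is a warning" rule could look roughly like the sketch below. This is not the committed plugin. The function name check_all_up and the message wording are assumptions; the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING). An empty table is not handled here and would report OK.

```shell
#!/bin/sh
# Sketch only: read `admintools -t view_cluster` output on stdin and
# return 1 (Nagios WARNING) if any State column is not exactly "UP".
# check_all_up is a hypothetical name.
check_all_up() {
    bad="$(awk -F'|' 'NF == 3 {
        gsub(/[ \t]/, "", $1); gsub(/[ \t]/, "", $2); gsub(/[ \t]/, "", $3)
        # Skip the header row; report every host whose state is not UP,
        # which covers DOWN as well as initializing or other states.
        if ($1 != "DB" && $3 != "UP") printf "%s=%s ", $2, $3
    }')"
    if [ -n "$bad" ]; then
        echo "WARNING: nodes not UP: ${bad% }"
        return 1
    fi
    echo "OK: all nodes UP"
    return 0
}
```

A wrapper would then end with something like `/opt/vertica/bin/admintools -t view_cluster | check_all_up; exit $?`.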

I've put it on vertica[4,5,6].stage.metrics and put the check into a downtime window for a month. 

We need to figure out how to run the check as the dbadmin user, or give the nagios user permissions to run admintools.
The script will also need to deal with the case that the db name isn't found - e.g. if there is no cluster running on the machine:

[dbadmin@vertica4.stage.metrics.scl3 ~]$ /opt/vertica/bin/adminTools -t view_cluster
 DB | Host | State 
----+------+-------

[dbadmin@vertica4.stage.metrics.scl3 ~]$
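One possible way to catch the empty-table case is to count data rows before looking at states. This is a sketch under assumptions: the helper names are hypothetical, and mapping "no database listed" to Nagios UNKNOWN (exit code 3) is a choice made here for illustration, not something decided in this bug.

```shell
#!/bin/sh
# Sketch only: detect a view_cluster table that has a header but no rows.
# count_cluster_rows and check_db_present are hypothetical names.
count_cluster_rows() {
    awk -F'|' '
        NF == 3 {
            gsub(/[ \t]/, "", $1)
            if ($1 != "DB" && $1 != "") n++   # count only real data rows
        }
        END { print n + 0 }'
}

check_db_present() {
    rows="$(count_cluster_rows)"
    if [ "$rows" -eq 0 ]; then
        # Assumption: treat a missing database as UNKNOWN (3).
        echo "UNKNOWN: no database listed by view_cluster"
        return 3
    fi
    echo "OK: $rows row(s) listed"
    return 0
}
```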
Whiteboard: [data: monitoring] → [data:monitoring]
Whiteboard: [data:monitoring] → [data:monitoring] [2014q4]
Product: mozilla.org → Data & BI Services Team
This has been added to puppet.
This is done, except for what's in bug 1110972.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
Whiteboard: [data:monitoring] [2014q4] → [data:monitoring] [2015q1]
Due Date: 2015-03-31