Closed Bug 923571 Opened 11 years ago Closed 9 years ago

Please add nagios monitoring to vertica staging and production cluster

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Due Date:

2015-03-31

People

(Reporter: aelliott, Assigned: mpressman)

References

Details

(Whiteboard: [data:monitoring] [2015q1])

Annie Elliott

Reporter

Description

•

11 years ago

Today (10/3/13) our production vertica experienced a segfault and was not usable. No one received any alerts, and we only found out because someone could not access it.

Please create nagios alerting for the two vertica clusters (staging and production) to monitor for this, along with any standard monitoring that is done for mysql instances.

:cyborgshadow helped clear the issue

Sheeri Cabral [:sheeri]

Updated

•

11 years ago

Whiteboard: [2014q1]

Sheeri Cabral [:sheeri]

Updated

•

10 years ago

Assignee: server-ops-database → mpressman

Sheeri Cabral [:sheeri]

Updated

•

10 years ago

Whiteboard: [2014q1] → [2014q2] April

Sheeri Cabral [:sheeri]

Comment 1

•

10 years ago

So this command shows how to monitor:

[dbadmin@vertica5.metrics.scl3 ~]$ /opt/vertica/bin/admintools -t view_cluster
 DB      | Host | State 
---------+------+-------
 metrics | ALL  | UP

Sheeri Cabral [:sheeri]

Comment 2

•

10 years ago

Here's what it looks like when things are down:

[dbadmin@vertica4.metrics.scl3 ~]$ /opt/vertica/bin/admintools -t view_cluster
 DB      | Host          | State 
---------+---------------+-------
 metrics | 192.168.100.7 | UP    
 metrics | 192.168.100.8 | UP    
 metrics | 192.168.100.9 | DOWN
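
For reference, the two outputs above could be reduced to a list of problem nodes roughly as follows. This is only a sketch, not the plugin that was later committed: list_not_up is a hypothetical name, and the code assumes the view_cluster output is piped in on stdin in exactly the table layout shown in these comments.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the committed plugin): print "host state" for
# every node whose State column is anything other than UP.
# Expects `admintools -t view_cluster` output on stdin.
list_not_up() {
  awk -F'|' '
    # Data and header rows have three pipe-separated columns;
    # the ---+--- separator and blank lines do not, so they are skipped.
    NF >= 3 {
      for (i = 1; i <= 3; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
      if ($1 != "DB" && $3 != "UP") print $2, $3
    }
  '
}

# Example (on a cluster host, as a user allowed to run admintools):
#   /opt/vertica/bin/admintools -t view_cluster | list_not_up
```

When every node is UP (including the collapsed "ALL | UP" row), the function prints nothing.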

Sheeri Cabral [:sheeri]

Comment 3

•

10 years ago

Also:

select * from vert_sys.v_catalog_nodes ;

Sheeri Cabral [:sheeri]

Updated

•

10 years ago

Whiteboard: [2014q2] April → [2014q2] May

Sheeri Cabral [:sheeri]

Updated

•

10 years ago

Whiteboard: [2014q2] May → [2014q2] June

Matt Pressman [:mpressman]

Assignee

Comment 4

•

10 years ago

Added plugin check_vertica_cluster.sh
Committed revision 89747.

Matt Pressman [:mpressman]

Assignee

Comment 5

•

10 years ago

This is completed

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Matt Pressman [:mpressman]

Assignee

Updated

•

10 years ago

Blocks: 1032409

Matt Pressman [:mpressman]

Assignee

Updated

•

10 years ago

Whiteboard: [2014q2] June → [monitoring]

Sheeri Cabral [:sheeri]

Updated

•

10 years ago

Whiteboard: [monitoring] → [data: monitoring]

Sheeri Cabral [:sheeri]

Comment 6

•

10 years ago

Adding the check to puppet is not the last step in this. Once the nrpe check is added to puppet (in modules/nrpe/files/plugins/), we need to:

0) make sure the check is defined in nrpe (modules/nrpe/manifests/plugins.pp)
1) make sure the proper machines get the check (realize(Nrpe::Plugin['plugin_name']) in the puppet manifest)
2) make sure the check is defined in Nagios (modules/nagios/manifests/mozilla/checkcommands.pp)
3) make sure the service is defined in Nagios (modules/nagios/manifests/mozillaservices.pp)
4) make sure the proper hosts *use* the check (modules/nagios/manifests/hosts/xxx.pp)
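
As a rough aid for the checklist above, something like the following could confirm that a plugin name shows up in each of the listed files. It is a sketch under assumptions: check_wiring is a hypothetical helper, it assumes the plugin name appears literally in each file, and it only covers steps 0, 2, 3 and 4, since step 1 does not name a specific file. For step 4 it searches the whole hosts directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the puppet files from the checklist
# mention a given plugin name.
# Usage: check_wiring <puppet-checkout-root> <plugin_name>
check_wiring() {
  local root=$1 plugin=$2 missing=0 path
  for path in \
    modules/nrpe/manifests/plugins.pp \
    modules/nagios/manifests/mozilla/checkcommands.pp \
    modules/nagios/manifests/mozillaservices.pp \
    modules/nagios/manifests/hosts
  do
    # -F: fixed string, -r: recurse into directories,
    # -q: quiet, -s: stay silent about missing files
    if grep -rqsF -- "$plugin" "$root/$path"; then
      echo "ok       $path"
    else
      echo "MISSING  $path"
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   check_wiring /path/to/puppet/checkout check_vertica_cluster
```

The function returns non-zero if any location lacks the name, so it can be chained before a commit.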

I've done all this in r94285. It may need tweaking, so I'll post here when the work is complete and the vertica cluster is actually being monitored.

For now, I've set it up to go to the postgres DBA oncall, so I didn't have to make a separate "vertica" group.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Sheeri Cabral [:sheeri]

Comment 7

•

10 years ago

I made a new "vertica-services" hostgroup too, in modules/nagios/manifests/mozilla/hostgroups.pp - this only needs to be done when there isn't already a hostgroup for the check.

Sheeri Cabral [:sheeri]

Comment 8

•

10 years ago

oh, there also needs to be a template in 

modules/nrpe/templates/nrpe.d

which I added.

Sheeri Cabral [:sheeri]

Comment 9

•

10 years ago

I got the script into nagios, and changed it so that if something is not UP it gives a warning. Previously it checked only for UP or DOWN, but a host could also be initializing.

I've put it on vertica[4,5,6].stage.metrics and put the check into a downtime window for a month. 

We need to figure out how to run the check as the dbadmin user, or give the nagios user permissions to run admintools.

Sheeri Cabral [:sheeri]

Comment 10

•

10 years ago

The script will also need to deal with the case that the db name isn't found - e.g. if there is no cluster running on the machine:

[dbadmin@vertica4.stage.metrics.scl3 ~]$ /opt/vertica/bin/adminTools -t view_cluster
 DB | Host | State 
----+------+-------

[dbadmin@vertica4.stage.metrics.scl3 ~]$
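
One way the empty-table case might be handled alongside the "anything not UP is a warning" rule from comment 9 is sketched below. This is not the committed check_vertica_cluster.sh: check_view_cluster is a hypothetical name, and returning UNKNOWN (3) when no database is listed is an assumption rather than something decided in this bug. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 3 UNKNOWN).

```shell
#!/usr/bin/env bash
# Sketch of a Nagios-style check over `admintools -t view_cluster` output
# read from stdin. Hypothetical; not the plugin committed for this bug.
check_view_cluster() {
  local rows bad
  # Keep only data rows: drop the header row, the separator and blank lines.
  rows=$(awk -F'|' 'NF >= 3 && $1 !~ /^[ \t]*DB[ \t]*$/')
  if [ -z "$rows" ]; then
    # Assumption: an empty table is reported as UNKNOWN.
    echo "UNKNOWN: no database listed by view_cluster"
    return 3
  fi
  # Collect "host=state" for every row whose state is not exactly UP.
  bad=$(printf '%s\n' "$rows" | awk -F'|' '
    { gsub(/[ \t]/, "", $2); gsub(/[ \t]/, "", $3) }
    $3 != "UP" { printf "%s%s=%s", sep, $2, $3; sep = ", " }')
  if [ -n "$bad" ]; then
    echo "WARNING: nodes not UP: $bad"
    return 1
  fi
  echo "OK: all nodes UP"
  return 0
}

# Example (must run as a user that is allowed to run admintools):
#   /opt/vertica/bin/admintools -t view_cluster | check_view_cluster
```

Treating every non-UP state the same way means an initializing node raises a warning too, without the script having to know every state name.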

Whiteboard: [data: monitoring] → [data:monitoring]

Sheeri Cabral [:sheeri]

Updated

•

10 years ago

Whiteboard: [data:monitoring] → [data:monitoring] [2014q4]

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: mozilla.org → Data & BI Services Team

Sheeri Cabral [:sheeri]

Comment 12

•

10 years ago

This has been added to puppet.

Sheeri Cabral [:sheeri]

Comment 13

•

9 years ago

This is done, except for what's in bug 1110972.

Status: REOPENED → RESOLVED

Closed: 10 years ago → 9 years ago

Resolution: --- → FIXED

Whiteboard: [data:monitoring] [2014q4] → [data:monitoring] [2015q1]

Sheeri Cabral [:sheeri]

Updated

•

9 years ago

Due Date: 2015-03-31

Categories

(Data & BI Services Team :: DB: MySQL, task)
