Improve nagios checks for ES



5 years ago
2 years ago


(Reporter: laura, Assigned: cliang)



(Whiteboard: [kanban:] )


(1 attachment)



5 years ago
As per, solarce said:
"This nagios check, , appears to offer a lot more logic and error output than the current one, so I am going to file a bug to get it added to our Nagios setup and we can test it on stage ES"

Let's try it out.

Comment 1

5 years ago
I've updated the script (so that it will work with ES 0.90.x clusters).[1]  In particular, I noted the following:
        # To be of any use in detecting split-brains, this value must be
        # set to the *total* number of master-eligible nodes in the
        # cluster, not whatever you set in ElasticSearch's
        # 'discovery.zen.minimum_master_nodes' configuration parameter.
        # (See ES issue #2488.) Of course, this will trip whenever a
        # node is taken down for maintenance, so we raise only a warning
        # -- not a critical -- status condition.
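
The logic described in that comment can be sketched roughly as follows. This is a minimal Python illustration, not the actual plugin code: the function name `master_node_status` and the input shape (a mapping of node name to a master-eligible flag) are assumptions made for this example. The exit codes are the standard Nagios plugin values.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def master_node_status(nodes, expected_masters):
    """Compare the master-eligible nodes a server can see with the expected total.

    `nodes` is a hypothetical mapping of node name -> True if master-eligible.
    `expected_masters` is the *total* number of master-eligible nodes in the
    cluster, not the 'discovery.zen.minimum_master_nodes' setting.
    """
    seen = sorted(name for name, eligible in nodes.items() if eligible)
    if len(seen) < expected_masters:
        # This may be a split-brain, or just a node down for maintenance,
        # so it is reported as a warning rather than a critical.
        message = "WARNING: expected %d master-eligible nodes, saw %d (%s)" % (
            expected_masters, len(seen), ", ".join(seen))
        return WARNING, message
    return OK, "OK: %d master-eligible nodes visible" % len(seen)
```

For instance, `master_node_status({"es1": True, "es2": True, "data1": False}, 3)` returns a warning status, because only two of the three expected master-eligible nodes are visible.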

I updated the Nagios configs to add a check command, "check_elasticsearch_nodes", that runs "[..]/check_elasticsearch --master-nodes=<num>", where <num> is defined in puppet hiera as 'elasticsearch_master_nodes'.
The NRPE Elasticsearch plugin has been realized on the socorro stage ES servers.
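
As a rough illustration of how a plugin might accept the `--master-nodes=<num>` option, here is a small Python sketch using `argparse`. Only the flag name comes from the check command above; the helper name `build_parser`, the default value, and the help text are assumptions, and the real script's option handling may differ.

```python
import argparse


def build_parser():
    """Build a parser for the one option named in the check command.

    Hypothetical sketch: only '--master-nodes' is taken from the check
    command; everything else here is assumed for illustration.
    """
    parser = argparse.ArgumentParser(prog="check_elasticsearch")
    parser.add_argument(
        "--master-nodes",
        type=int,
        default=None,
        help="total number of master-eligible nodes expected in the cluster",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["--master-nodes=3"]).master_nodes` yields the integer that puppet would fill in from the 'elasticsearch_master_nodes' hiera value.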

[1] From the SVN log: 
r88398 | | 2014-06-02 16:08:12 -0500 (Mon, 02 Jun 2014) | 1 line

Update NPRE Elasticsearch script (BZ 1017194)

Comment 2

5 years ago
Created attachment 8432880

ashish: Would you be willing to review the proposed changes?  I believe that this will add the check I created to the socorro stage servers, with the alerts going to just #sysadmins.  If I'm wrong, please let me know.
Attachment #8432880 - Flags: review?(ashish)
Comment on attachment 8432880

Review of attachment 8432880:

No syntax issues. However, I'd suggest "notification_period   => 'usworkinghours'" for staging. I'm in favor of replacing the current checks, but before this goes onto the production nodes, please make sure the documentation is updated to assist the on-call staff. The documentation link for this check would be:
Attachment #8432880 - Flags: review?(ashish) → review+

Comment 4

5 years ago
Checks made live.  I tested by deliberately lying to one of the servers about the total number of master-eligible nodes in the cluster (telling it that there ought to be 5 nodes rather than the actual 3).  Obligingly, that server complained that it saw only three nodes.

I've stubbed out basic documentation.  I'm letting this check run for a few days before adding it to other clusters.


5 years ago
Whiteboard: [kanban]
Whiteboard: [kanban] → [kanban:]

Comment 5

5 years ago
There are now two versions of the check:
   - elasticsearch-nodes-workhours
   - elasticsearch-nodes

Now running on MDN stage and prod clusters.
Set to start running on socorro prod and SCL3 prod clusters.

Once those look good, I'll start removing the older check from these clusters and put out a general notice about changing the remaining clusters to use this check.

Comment 6

5 years ago
All remaining active clusters are now using the new check.  A separate bug has been filed for removing the old check. 

If this check needs to be implemented on a new set of clusters, what needs to happen is:
  - In the puppet definition for each node in the cluster, realize the new check: 
    "realize (Nrpe::Plugin['elasticsearch'])"
  - add or edit hiera files for the nodes in the cluster, adding the correct value for "elasticsearch_master_nodes"
  - if needed, edit puppet to create a Nagios hostgroup for the cluster
  - edit puppet to add the hostgroup to either elasticsearch-nodes-workhours or elasticsearch-nodes


5 years ago
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard