Closed Bug 1017194 Opened 10 years ago Closed 10 years ago

Improve nagios checks for ES

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: laura, Assigned: cliang)

Details

(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/443] )

Attachments

(1 file)

nrpe_es.diff, 10 years ago, C. Liang [:cyliang], 992 bytes, patch, ashish: review+

Laura Thomson :laura

Reporter

Description

10 years ago

As per https://bugzilla.mozilla.org/show_bug.cgi?id=945248#c12, solarce said:
"This nagios check, https://github.com/anchor/nagios-plugin-elasticsearch , appears to offer a lot more logic and error output than the current one, so I am going to file a bug to get it added to our Nagios setup and we can test it on stage ES"

Let's try it out.

C. Liang [:cyliang]

Assignee

Comment 1

10 years ago

I've updated the script (so that it will work with ES 0.90.x clusters).[1]  In particular, I noted the following:
        # To be of any use in detecting split-brains, this value must be
        # set to the *total* number of master-eligible nodes in the
        # cluster, not whatever you set in ElasticSearch's
        # 'discovery.zen.minimum_master_nodes' configuration parameter.
        # (See ES issue #2488.) Of course, this will trip whenever a
        # node is taken down for maintenance, so we raise only a warning
        # -- not a critical -- status condition.


I updated Nagios configs to add a check command "check_elasticsearch_nodes" that will run "[..]/check_elasticsearch --master-nodes=<num>" where <num> is defined in puppet hiera as 'elasticsearch_master_nodes'.
The NRPE Elasticsearch plugin has been realized on the socorro stage ES servers.
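
A minimal sketch of the comparison described in the script comment above. It assumes the number of master-eligible nodes currently visible has already been obtained from the cluster; how the plugin fetches that number is not shown. The function name and the choice to warn only when fewer nodes are seen are illustrative assumptions, not taken from the plugin.

```python
# Standard Nagios plugin exit codes.
OK = 0
WARNING = 1


def check_master_nodes(seen, expected):
    """Compare the master-eligible nodes seen against the expected total.

    `expected` is the *total* number of master-eligible nodes in the
    cluster, i.e. the value passed via --master-nodes.
    Returns a (status, message) tuple. Hypothetical helper.
    """
    if seen < expected:
        # Warning only, not critical: this also trips when a node is
        # taken down for maintenance.
        msg = "WARNING: %d of %d master-eligible nodes seen" % (seen, expected)
        return WARNING, msg
    return OK, "OK: %d master-eligible nodes seen" % seen
```

With this sketch, a server told to expect 5 nodes while only 3 are visible would report a warning rather than a critical status.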


[1] From the SVN log: 
------------------------------------------------------------------------
r88398 | cliang@mozilla.com | 2014-06-02 16:08:12 -0500 (Mon, 02 Jun 2014) | 1 line

Update NPRE Elasticsearch script (BZ 1017194)
------------------------------------------------------------------------

C. Liang [:cyliang]

Assignee

Comment 2

10 years ago

Attached patch nrpe_es.diff

ashish: Would you be willing to review the proposed changes?  I believe that this will add the check I created to the socorro stage servers, with the alerts going to just #sysadmins.  If I'm wrong, please let me know.

Attachment #8432880 - Flags: review?(ashish)

Ashish Vijayaram [:ashish]

Comment 3

10 years ago

Review of attachment 8432880 (nrpe_es.diff):
-----------------------------------------------------------------

No syntax issues. However, I'd suggest "notification_period   => 'usworkinghours'" for staging. I'm in favor of replacing the current checks, but before this goes onto the production nodes, please make sure the documentation is updated so that it can assist the on-call staff. The documentation link for this check would be: http://m.mozilla.org/nodes+-+Elasticsearch

Attachment #8432880 - Flags: review?(ashish) → review+

C. Liang [:cyliang]

Assignee

Comment 4

10 years ago

Checks made live.  I tested by deliberately lying to one of the servers about the total number of master-eligible nodes in the cluster (telling it that there ought to be 5 nodes rather than the actual 3).  Obligingly, that server did complain that it only saw three nodes.

I've stubbed out basic documentation at http://m.mozilla.org/nodes+-+Elasticsearch.  I'm letting this check run for a few days before adding it to other clusters.

C. Liang [:cyliang]

Assignee

Updated

10 years ago

Whiteboard: [kanban]

Philippe M. Chiasson (:gozer)

Updated

10 years ago

Whiteboard: [kanban] → [kanban:https://kanbanize.com/ctrl_board/4/443]

C. Liang [:cyliang]

Assignee

Comment 5

10 years ago

There are now two versions of the check:
   - elasticsearch-nodes-workhours
   - elasticsearch-nodes

Now running on MDN stage and prod clusters.
Set to start running on socorro prod and SCL3 prod clusters.

Once those look good, I'll start removing the older check from these clusters and put out a general notice about changing the remaining clusters to use this check.

C. Liang [:cyliang]

Assignee

Comment 6

10 years ago

All remaining active clusters are now using the new check.  A separate bug has been filed for removing the old check. 

If this check needs to be implemented on a new cluster, the steps are:
  - In the puppet definition for each node in the cluster, realize the new check:
    "realize (Nrpe::Plugin['elasticsearch'])"
  - Add or edit hiera files for the nodes in the cluster, adding the correct value for "elasticsearch_master_nodes" (the total number of master-eligible nodes in the cluster).
  - If needed, edit puppet to create a Nagios hostgroup for the cluster.
  - Edit puppet to add the hostgroup to either elasticsearch-nodes-workhours or elasticsearch-nodes.
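
As a rough illustration of how the per-cluster "elasticsearch_master_nodes" value ends up on the check's command line, here is a small sketch. The helper name and the example cluster names are hypothetical, and the plugin's install directory is left out because it is elided in this bug.

```python
def build_check_args(master_nodes, plugin="check_elasticsearch"):
    """Return the argument list for the check.

    `master_nodes` stands in for the hiera value
    'elasticsearch_master_nodes'. Hypothetical helper for illustration.
    """
    count = int(master_nodes)
    if count < 1:
        raise ValueError("master_nodes must be a positive integer")
    return [plugin, "--master-nodes=%d" % count]


# Hypothetical per-cluster values, standing in for hiera data.
clusters = {"example-stage": 3, "example-prod": 5}
for name, nodes in sorted(clusters.items()):
    print(name, " ".join(build_check_args(nodes)))
```

Each cluster needs its own value, since the check only helps when the number matches that cluster's actual count of master-eligible nodes.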

C. Liang [:cyliang]

Assignee

Updated

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

8 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

Categories

(Infrastructure & Operations Graveyard :: WebOps: Socorro, task)
