945248 - Add nagios checks to our ES clusters

Reporter

Description

•

11 years ago

I'd like to check for two things:
1. That the cluster isn't red
2. That the number of nodes reported is equal to the number we expect

You can get both these things from e.g.
curl -XGET 'http://<hostname>:<port>/_cluster/health?pretty=true'

This will return JSON something like:
{
  "cluster_name" : "sss_stage",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 225,
  "active_shards" : 450,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

status should be green; we should WARN on yellow and CRITICAL on red.

number_of_nodes should match the expected node count for that cluster; we should CRITICAL if it's lower.

We should probably run this check on each node, and every node should report the same values.

This will help us notice issues like bug 944762.
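
For illustration only, the thresholds described above could be sketched in Python roughly like this. The function name `evaluate_health` and the `expected_nodes` parameter are assumptions made for this sketch, not part of any existing plugin; the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import json

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_health(health, expected_nodes):
    """Map a parsed _cluster/health response to (exit_code, message).

    `expected_nodes` is a hypothetical per-cluster setting supplied by the caller.
    """
    status = health.get("status")
    nodes = health.get("number_of_nodes")
    if status not in ("green", "yellow", "red") or not isinstance(nodes, int):
        return UNKNOWN, "UNKNOWN: unexpected health response"
    if status == "red":
        return CRITICAL, "CRITICAL: cluster status is red"
    if nodes < expected_nodes:
        return CRITICAL, f"CRITICAL: {nodes} nodes reported, expected {expected_nodes}"
    if status == "yellow":
        return WARNING, "WARNING: cluster status is yellow"
    return OK, f"OK: status green, {nodes} nodes"


# A trimmed-down version of the sample response from this description.
sample = json.loads('{"status": "green", "number_of_nodes": 2}')
code, message = evaluate_health(sample, expected_nodes=2)
print(code, message)
```

In a real check the dictionary would come from the `_cluster/health` URL shown above, queried on each node in turn, and the script would exit with the returned code.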

Brandon Burton [:solarce]

Updated

•

11 years ago

Assignee: server-ops-webops → bburton

Brandon Burton [:solarce]

Comment 1

•

11 years ago

-> % svn diff
Index: hosts/phx1.pp
===================================================================
--- hosts/phx1.pp	(revision 79320)
+++ hosts/phx1.pp	(working copy)
@@ -3265,6 +3265,33 @@
                 'elasticsearch-it',
             ]
         },
+        'socorro-es1.stage.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-4.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-stage-elasticsearch',
+            ]
+        },
+        'socorro-es2.stage.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-4.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-stage-elasticsearch',
+            ]
+        },
+        'socorro-es3.stage.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-4.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-stage-elasticsearch',
+            ]
+        },
         'tbpl1.db.phx1.mozilla.com' => {
             parents => 'boa-d09-b1.console.phx1.mozilla.com',
             hostgroups => [
Index: mozilla/services.pp
===================================================================
--- mozilla/services.pp	(revision 79320)
+++ mozilla/services.pp	(working copy)
@@ -1075,7 +1075,7 @@
                 default => [
                     'generic',
                     "!elasticsearch-it",
-                    "!socorro-es",
+                    "!socorro-stage-elasticsearch",
                     "!load-100",
                     "!fuzzer-hosts",
                     '!generic-preprod'
@@ -1090,7 +1090,7 @@
             hostgroups => $::fqdn ? {
                 'nagios1.private.phx1.mozilla.com' => [
                     'elasticsearch-it',
-                    'socorro-es',
+                    'socorro-stage-elasticsearch',
                 ],
                 default => [
                 ]
@@ -3937,7 +3937,7 @@
                     'elasticsearch-jenkins',
                     'elasticsearch-it',
                     'elasticsearch-vip',
-#                    "socorro-es",
+                    "socorro-stage-elasticsearch",
                 ],
                 'nagios1.private.scl3.mozilla.com' => [
                     'elasticsearch-it',
Index: mozilla/hostgroups.pp
===================================================================
--- mozilla/hostgroups.pp	(revision 79320)
+++ mozilla/hostgroups.pp	(working copy)
@@ -951,9 +951,12 @@
         'graphs-vip' => {
             alias => "VIP for graphs.m.o",
         },
-        'socorro-es' => {
-            alias => "Socorro Search Service nodes",
+        'socorro-stage-elasticsearch' => {
+            alias => "Socorro Stage ElasticSearch nodes",
         },
+        'socorro-elasticsearch' => {
+            alias => "Socorro ElasticSearch nodes",
+        },
         'jenkins-servers' => {
             alias => "Jenkins nodes",
         },
bburton@althalus [07:02:55] [~/code/mozilla/sysadmins/puppet/trunk/modules/nagios/manifests]
-> % svn ci -m "adding es color check to socorro stage es, bug 945248"
Sending        manifests/hosts/phx1.pp
Sending        manifests/mozilla/hostgroups.pp
Sending        manifests/mozilla/services.pp
Transmitting file data ...
Committed revision 79330.

Brandon Burton [:solarce]

Comment 2

•

11 years ago

Things are looking good on stage; going to roll out to production.

Brandon Burton [:solarce]

Comment 3

•

11 years ago

Adding to prod, checks are all green

-> % svn diff
Index: hosts/phx1.pp
===================================================================
--- hosts/phx1.pp	(revision 79348)
+++ hosts/phx1.pp	(working copy)
@@ -3292,6 +3292,51 @@
                 'socorro-stage-elasticsearch',
             ]
         },
+        'socorro-es1.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-elasticsearch',
+            ]
+        },
+        'socorro-es2.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-elasticsearch',
+            ]
+        },
+        'socorro-es3.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-elasticsearch',
+            ]
+        },
+        'socorro-es1.dev.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-elasticsearch',
+            ]
+        },
+        'socorro-es2.dev.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-elasticsearch',
+            ]
+        },
         'tbpl1.db.phx1.mozilla.com' => {
             parents => 'boa-d09-b1.console.phx1.mozilla.com',
             hostgroups => [
Index: mozilla/services.pp
===================================================================
--- mozilla/services.pp	(revision 79348)
+++ mozilla/services.pp	(working copy)
@@ -1091,6 +1091,7 @@
                 'nagios1.private.phx1.mozilla.com' => [
                     'elasticsearch-it',
                     'socorro-stage-elasticsearch',
+                    'socorro-elasticsearch',
                 ],
                 default => [
                 ]
@@ -3938,6 +3939,7 @@
                     'elasticsearch-it',
                     'elasticsearch-vip',
                     "socorro-stage-elasticsearch",
+                    "socorro-elasticsearch",
                 ],
                 'nagios1.private.scl3.mozilla.com' => [
                     'elasticsearch-it',
bburton@althalus [10:03:33] [~/code/mozilla/sysadmins/puppet/trunk/modules/nagios/manifests]
-> % svn ci -m "adding es color check to socorro prod es, bug 945248"
Sending        manifests/hosts/phx1.pp
Sending        manifests/mozilla/services.pp
Transmitting file data ..
Committed revision 79351.

Brandon Burton [:solarce]

Comment 4

•

11 years ago

Adding check for proper number of running processes

-> % svn diff
Index: mozilla/services.pp
===================================================================
--- mozilla/services.pp	(revision 79352)
+++ mozilla/services.pp	(working copy)
@@ -3965,7 +3965,8 @@
             hostgroups => $::fqdn ? {
                 'nagios1.private.phx1.mozilla.com' => [
                     'elasticsearch-it',
-#                    "socorro-es",
+                    'socorro-stage-elasticsearch',
+                    "socorro-elasticsearch",
                 ],
                 default => [
                 ]
bburton@althalus [10:24:23] [~/code/mozilla/sysadmins/puppet/trunk/modules/nagios/manifests]
-> % svn ci -m "adding es procs check to socorro stage and prod es, bug 945248"
Sending        manifests/mozilla/services.pp
Transmitting file data .
Committed revision 79353.

Status: NEW → ASSIGNED

Brandon Burton [:solarce]

Comment 5

•

11 years ago

(In reply to Laura Thomson :laura from comment #0)
> I'd like to check for two things:
> 1. That the cluster isn't red
> 2. That the number of nodes reported is equal to the number we expect
> 
> You can get both these things from e.g.
> curl -XGET 'http://<hostname>:<port>/_cluster/health?pretty=true'
> 
> This will return JSON something like:
> {
>   "cluster_name" : "sss_stage",
>   "status" : "green",
>   "timed_out" : false,
>   "number_of_nodes" : 2,
>   "number_of_data_nodes" : 2,
>   "active_primary_shards" : 225,
>   "active_shards" : 450,
>   "relocating_shards" : 0,
>   "initializing_shards" : 0,
>   "unassigned_shards" : 0
> }
> 
> status should be green, we should WARN on yellow and CRITICAL on red.
> 
> number_of_nodes should be whatever it is for that cluster, we should
> CRITICAL if it's less.
>

> We should probably check this on each node, and each node should report the
> same things. 
> 
> This will help us notice issues like bug 944762.


As of now, stage and prod have the same monitoring we use on other ES clusters. We'll get alerted if the "color" goes to yellow or red and we'll get alerted if the ES process is not running.

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Laura Thomson :laura

Reporter

Comment 6

•

11 years ago

I'm not sure the checks you have will detect split brain.  Thoughts?

Laura Thomson :laura

Reporter

Comment 7

•

11 years ago

In fact, I'm sure it won't. Can we get that check in place?

Brandon Burton [:solarce]

Comment 8

•

11 years ago

(In reply to Laura Thomson :laura from comment #7)
> In fact, I'm sure it won't. Can we get that check in place?

We don't currently have a check that does that; I've added everything we do have.

I'll discuss other check possibilities with :phrawzty and see what he knows.

Daniel Maher [:phrawzty]

Comment 9

•

11 years ago

(In reply to Laura Thomson :laura from comment #6)
> I'm not sure the checks you have will detect split brain.  Thoughts?

If your particular cluster is in split-brain it will (at least) be yellow, and likely red, insofar as colour goes.  That said, the literal condition of split-brain would not be explicitly detected by the commits noted above.

I would suggest adding configuration parameters to the cluster that explicitly declare the number of expected nodes, which will prevent split-brain from happening in the first place (by disallowing interactions on split nodes).  This boils down to setting "discovery.zen.minimum_master_nodes = (N/2)+1", where N is the number of nodes in the cluster.

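As a rough illustration of the formula in comment 9, the arithmetic can be sketched in Python. This is only the counting argument, assuming integer division and a clean two-way network partition; the function names are made up for this sketch, and in Elasticsearch itself N counts master-eligible nodes.

```python
def minimum_master_nodes(n):
    """Quorum size from the (N/2)+1 formula, using integer division."""
    return n // 2 + 1


def sides_with_quorum(n, side_a):
    """Count how many sides of a two-way partition still reach the quorum."""
    quorum = minimum_master_nodes(n)
    side_b = n - side_a
    return sum(1 for size in (side_a, side_b) if size >= quorum)


for n in (2, 3, 4, 5):
    # Worst case over every possible split of n nodes into two groups.
    worst = max(sides_with_quorum(n, a) for a in range(n + 1))
    print(n, minimum_master_nodes(n), worst)
```

Under these assumptions at most one side of any split can reach the quorum. With an even node count, an even split (for example 2 and 2 out of 4) leaves neither side able to elect a master, which is one reason odd cluster sizes are usually preferred.
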
Laura Thomson :laura

Reporter

Comment 10

•

11 years ago

(In reply to Daniel Maher [:phrawzty] from comment #9)
> (In reply to Laura Thomson :laura from comment #6)
> > I'm not sure the checks you have will detect split brain.  Thoughts?
> 
> If your particular cluster is in split-brain it will (at least) be yellow,
> and likely red, insofar as colour goes.  That said, the literal condition of
> split-brain would not be explicitly detected by the commits noted above.
> 
> I would suggest adding additional configuration parameters to the cluster
> that make explicit declarations about the number of expected nodes, which
> will actually prevent split-brain from happening in the first place (by
> disallowing interactions on split nodes).  This functionally boils down to
> setting 
> "discovery.zen.minimum_master_nodes = (N/2)+1" where N is the number of
> nodes in the cluster.

This doesn't always work, fwiw. I've seen the problem in flight and it's also described here: http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-elasticsearch/ (in the conclusion - there's also a link to the ES issue). He suggests monitoring the number of nodes each node sees as a detection mechanism.
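
A minimal sketch of that detection idea, assuming each node's own `_cluster/health` response has already been fetched and reduced to its `number_of_nodes` value. The host names and the function name are hypothetical and only illustrate the comparison.

```python
def find_disagreement(node_counts, expected_nodes):
    """Return the hosts whose reported number_of_nodes differs from expected_nodes."""
    return sorted(host for host, count in node_counts.items() if count != expected_nodes)


# Hypothetical results for a three-node cluster that has split into 2 + 1:
# each value is the number_of_nodes that host reported about its own view.
reports = {"es1": 2, "es2": 2, "es3": 1}
suspect = find_disagreement(reports, expected_nodes=3)
print(suspect)
```

A non-empty result means at least one node sees a different cluster than expected, which is the signal a per-node check would alert on.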

Daniel Maher [:phrawzty]

Comment 11

•

11 years ago

(In reply to Laura Thomson :laura from comment #10)
> This doesn't always work, fwiw. I've seen the problem in flight and it's
> also described here:
> http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-
> elasticsearch/ (in the conclusion - there's also a link to the ES issue). He
> suggests monitoring the number of nodes each node sees as a detection
> mechanism.

The URL you link to addresses this calculus for a two-node cluster only, and specifically notes that a three-node cluster "solves" the scenario; in fact, the calculus works for clusters of any odd size greater than one.  Furthermore, IIRC your cluster experienced split-brain when it had four nodes, no?

Brandon Burton [:solarce]

Comment 12

•

11 years ago

This Nagios check, https://github.com/anchor/nagios-plugin-elasticsearch, appears to offer a lot more logic and error output than the current one, so I am going to file a bug to get it added to our Nagios setup, and then we can test it on stage ES.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Brandon Burton [:solarce]

Updated

•

10 years ago

Assignee: bburton → server-ops-webops

Laura Thomson :laura

Reporter

Comment 13

•

10 years ago

Filed a separate bug for the plugin in comment 12.

Status: REOPENED → RESOLVED

Closed: 11 years ago → 10 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

8 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

Categories

(Infrastructure & Operations Graveyard :: WebOps: Socorro, task)

Tracking

(Not tracked)

People

(Reporter: laura, Unassigned)

Security

(public)
