Closed Bug 945248 Opened 11 years ago Closed 10 years ago

Add nagios checks to our ES clusters

Categories

(Infrastructure & Operations Graveyard :: WebOps: Socorro, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: laura, Unassigned)

Details

I'd like to check for two things:
1. That the cluster isn't red
2. That the number of nodes reported is equal to the number we expect

You can get both these things from e.g.
curl -XGET 'http://<hostname>:<port>/_cluster/health?pretty=true'

This will return JSON something like:
{
  "cluster_name" : "sss_stage",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 225,
  "active_shards" : 450,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

status should be green, we should WARN on yellow and CRITICAL on red.

number_of_nodes should be whatever it is for that cluster, we should CRITICAL if it's less.

We should probably check this on each node, and each node should report the same things. 

This will help us notice issues like bug 944762.
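
A minimal sketch of the check described above, assuming Python 3 and the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the default port 9200 are illustrative assumptions, not the check that was eventually deployed for this bug.

```python
import json
import urllib.request

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(health, expected_nodes):
    """Map a parsed _cluster/health document to (exit_code, message)."""
    status = health.get("status")
    nodes = health.get("number_of_nodes")
    if status not in ("green", "yellow", "red") or not isinstance(nodes, int):
        return UNKNOWN, "UNKNOWN: unexpected health document"
    if status == "red":
        return CRITICAL, "CRITICAL: cluster status is red"
    if nodes < expected_nodes:
        return CRITICAL, f"CRITICAL: {nodes} nodes, expected {expected_nodes}"
    if status == "yellow":
        return WARNING, "WARNING: cluster status is yellow"
    return OK, f"OK: status green, {nodes} nodes"


def fetch_health(host, port=9200, timeout=10):
    """Fetch and parse /_cluster/health from one node (port is an assumption)."""
    url = f"http://{host}:{port}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def check(host, expected_nodes, port=9200):
    """Run the check against one node; network or parse errors become UNKNOWN."""
    try:
        health = fetch_health(host, port)
    except (OSError, ValueError) as exc:
        return UNKNOWN, f"UNKNOWN: could not query {host}: {exc}"
    return evaluate(health, expected_nodes)
```

A wrapper script would print the message and exit with the returned code, for example `code, msg = check(hostname, 2); print(msg); sys.exit(code)`. Running it once per node covers the "check this on each node" point.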
Assignee: server-ops-webops → bburton
-> % svn diff
Index: hosts/phx1.pp
===================================================================
--- hosts/phx1.pp	(revision 79320)
+++ hosts/phx1.pp	(working copy)
@@ -3265,6 +3265,33 @@
                 'elasticsearch-it',
             ]
         },
+        'socorro-es1.stage.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-4.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-stage-elasticsearch',
+            ]
+        },
+        'socorro-es2.stage.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-4.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-stage-elasticsearch',
+            ]
+        },
+        'socorro-es3.stage.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-4.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-stage-elasticsearch',
+            ]
+        },
         'tbpl1.db.phx1.mozilla.com' => {
             parents => 'boa-d09-b1.console.phx1.mozilla.com',
             hostgroups => [
Index: mozilla/services.pp
===================================================================
--- mozilla/services.pp	(revision 79320)
+++ mozilla/services.pp	(working copy)
@@ -1075,7 +1075,7 @@
                 default => [
                     'generic',
                     "!elasticsearch-it",
-                    "!socorro-es",
+                    "!socorro-stage-elasticsearch",
                     "!load-100",
                     "!fuzzer-hosts",
                     '!generic-preprod'
@@ -1090,7 +1090,7 @@
             hostgroups => $::fqdn ? {
                 'nagios1.private.phx1.mozilla.com' => [
                     'elasticsearch-it',
-                    'socorro-es',
+                    'socorro-stage-elasticsearch',
                 ],
                 default => [
                 ]
@@ -3937,7 +3937,7 @@
                     'elasticsearch-jenkins',
                     'elasticsearch-it',
                     'elasticsearch-vip',
-#                    "socorro-es",
+                    "socorro-stage-elasticsearch",
                 ],
                 'nagios1.private.scl3.mozilla.com' => [
                     'elasticsearch-it',
Index: mozilla/hostgroups.pp
===================================================================
--- mozilla/hostgroups.pp	(revision 79320)
+++ mozilla/hostgroups.pp	(working copy)
@@ -951,9 +951,12 @@
         'graphs-vip' => {
             alias => "VIP for graphs.m.o",
         },
-        'socorro-es' => {
-            alias => "Socorro Search Service nodes",
+        'socorro-stage-elasticsearch' => {
+            alias => "Socorro Stage ElasticSearch nodes",
         },
+        'socorro-elasticsearch' => {
+            alias => "Socorro ElasticSearch nodes",
+        },
         'jenkins-servers' => {
             alias => "Jenkins nodes",
         },
bburton@althalus [07:02:55] [~/code/mozilla/sysadmins/puppet/trunk/modules/nagios/manifests]
-> % svn ci -m "adding es color check to socorro stage es, bug 945248"
Sending        manifests/hosts/phx1.pp
Sending        manifests/mozilla/hostgroups.pp
Sending        manifests/mozilla/services.pp
Transmitting file data ...
Committed revision 79330.
Things are looking good on stage, gonna roll out to production
Adding to prod, checks are all green

-> % svn diff
Index: hosts/phx1.pp
===================================================================
--- hosts/phx1.pp	(revision 79348)
+++ hosts/phx1.pp	(working copy)
@@ -3292,6 +3292,51 @@
                 'socorro-stage-elasticsearch',
             ]
         },
+        'socorro-es1.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-elasticsearch',
+            ]
+        },
+        'socorro-es2.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-elasticsearch',
+            ]
+        },
+        'socorro-es3.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-elasticsearch',
+            ]
+        },
+        'socorro-es1.dev.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-elasticsearch',
+            ]
+        },
+        'socorro-es2.dev.webapp.phx1.mozilla.com' => {
+            parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+            ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+            hostgroups => [
+                'hp-servers',
+                'generic',
+                'socorro-elasticsearch',
+            ]
+        },
         'tbpl1.db.phx1.mozilla.com' => {
             parents => 'boa-d09-b1.console.phx1.mozilla.com',
             hostgroups => [
Index: mozilla/services.pp
===================================================================
--- mozilla/services.pp	(revision 79348)
+++ mozilla/services.pp	(working copy)
@@ -1091,6 +1091,7 @@
                 'nagios1.private.phx1.mozilla.com' => [
                     'elasticsearch-it',
                     'socorro-stage-elasticsearch',
+                    'socorro-elasticsearch',
                 ],
                 default => [
                 ]
@@ -3938,6 +3939,7 @@
                     'elasticsearch-it',
                     'elasticsearch-vip',
                     "socorro-stage-elasticsearch",
+                    "socorro-elasticsearch",
                 ],
                 'nagios1.private.scl3.mozilla.com' => [
                     'elasticsearch-it',
bburton@althalus [10:03:33] [~/code/mozilla/sysadmins/puppet/trunk/modules/nagios/manifests]
-> % svn ci -m "adding es color check to socorro prod es, bug 945248"
Sending        manifests/hosts/phx1.pp
Sending        manifests/mozilla/services.pp
Transmitting file data ..
Committed revision 79351.
Adding check for proper number of running processes

-> % svn diff
Index: mozilla/services.pp
===================================================================
--- mozilla/services.pp	(revision 79352)
+++ mozilla/services.pp	(working copy)
@@ -3965,7 +3965,8 @@
             hostgroups => $::fqdn ? {
                 'nagios1.private.phx1.mozilla.com' => [
                     'elasticsearch-it',
-#                    "socorro-es",
+                    'socorro-stage-elasticsearch',
+                    "socorro-elasticsearch",
                 ],
                 default => [
                 ]
bburton@althalus [10:24:23] [~/code/mozilla/sysadmins/puppet/trunk/modules/nagios/manifests]
-> % svn ci -m "adding es procs check to socorro stage and prod es, bug 945248"
Sending        manifests/mozilla/services.pp
Transmitting file data .
Committed revision 79353.
Status: NEW → ASSIGNED
(In reply to Laura Thomson :laura from comment #0)
> I'd like to check for two things:
> 1. That the cluster isn't red
> 2. That the number of nodes reported is equal to the number we expect
> 
> You can get both these things from e.g.
> curl -XGET 'http://<hostname>:<port>/_cluster/health?pretty=true'
> 
> This will return JSON something like:
> {
>   "cluster_name" : "sss_stage",
>   "status" : "green",
>   "timed_out" : false,
>   "number_of_nodes" : 2,
>   "number_of_data_nodes" : 2,
>   "active_primary_shards" : 225,
>   "active_shards" : 450,
>   "relocating_shards" : 0,
>   "initializing_shards" : 0,
>   "unassigned_shards" : 0
> }
> 
> status should be green, we should WARN on yellow and CRITICAL on red.
> 
> number_of_nodes should be whatever it is for that cluster, we should
> CRITICAL if it's less.
> 
> We should probably check this on each node, and each node should report the
> same things. 
> 
> This will help us notice issues like bug 944762.


As of now, stage and prod have the same monitoring we use on other ES clusters. We'll get alerted if the "color" goes to yellow or red and we'll get alerted if the ES process is not running.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I'm not sure the checks you have will detect split brain.  Thoughts?
In fact, I'm sure it won't. Can we get that check in place?
(In reply to Laura Thomson :laura from comment #7)
> In fact, I'm sure it won't. Can we get that check in place?

We don't currently have a check that does that; I've added the checks we do have.

I'll discuss other check possibilities with :phrawzty and see what he knows
(In reply to Laura Thomson :laura from comment #6)
> I'm not sure the checks you have will detect split brain.  Thoughts?

If your particular cluster is in split-brain it will (at least) be yellow, and likely red, insofar as colour goes.  That said, the literal condition of split-brain would not be explicitly detected by the commits noted above.

I would suggest adding additional configuration parameters to the cluster that make explicit declarations about the number of expected nodes, which will actually prevent split-brain from happening in the first place (by disallowing interactions on split nodes).  This functionally boils down to setting 
"discovery.zen.minimum_master_nodes = (N/2)+1" where N is the number of nodes in the cluster.
(In reply to Daniel Maher [:phrawzty] from comment #9)
> (In reply to Laura Thomson :laura from comment #6)
> > I'm not sure the checks you have will detect split brain.  Thoughts?
> 
> If your particular cluster is in split-brain it will (at least) be yellow,
> and likely red, insofar as colour goes.  That said, the literal condition of
> split-brain would not be explicitly detected by the commits noted above.
> 
> I would suggest adding additional configuration parameters to the cluster
> that make explicit declarations about the number of expected nodes, which
> will actually prevent split-brain from happening in the first place (by
> disallowing interactions on split nodes).  This functionally boils down to
> setting 
> "discovery.zen.minimum_master_nodes = (N/2)+1" where N is the number of
> nodes in the cluster.

This doesn't always work, fwiw. I've seen the problem in flight and it's also described here: http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-elasticsearch/ (in the conclusion - there's also a link to the ES issue). He suggests monitoring the number of nodes each node sees as a detection mechanism.
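
The detection mechanism suggested there could be sketched roughly as follows: collect each node's own view of /_cluster/health and flag the cluster when the views disagree or fall short of the expected count. This is a sketch under assumptions: the function name and return shape are hypothetical, and the health documents are passed in already parsed, so how they are fetched is left open.

```python
def find_split_brain(views, expected_nodes):
    """Return a list of problem strings; an empty list means the views agree.

    views: mapping of host name -> parsed _cluster/health dict from that host.
    """
    counts = {host: h.get("number_of_nodes") for host, h in views.items()}
    problems = []
    # Nodes that see different cluster sizes are a strong split-brain hint.
    if len(set(counts.values())) > 1:
        problems.append(f"nodes disagree on cluster size: {counts}")
    # An even split can leave every node agreeing on a count that is too low,
    # so each view is also compared with the expected size.
    for host, count in sorted(counts.items()):
        if count != expected_nodes:
            problems.append(f"{host} sees {count} nodes, expected {expected_nodes}")
    return problems


# Hypothetical four-node cluster split two and two: every node reports 2.
views = {f"es{i}": {"number_of_nodes": 2} for i in range(1, 5)}
for problem in find_split_brain(views, expected_nodes=4):
    print(problem)
```

Comparing against the expected count as well as between nodes matters because a clean half-and-half split produces identical, but wrong, answers on every node.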
(In reply to Laura Thomson :laura from comment #10)
> This doesn't always work, fwiw. I've seen the problem in flight and it's
> also described here:
> http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-
> elasticsearch/ (in the conclusion - there's also a link to the ES issue). He
> suggests monitoring the number of nodes each node sees as a detection
> mechanism.

The URL you link to addresses this calculus in a two-node cluster only, and specifically notes that a three-node cluster "solves" the scenario; in fact, the calculus works for clusters of any odd number greater than one.  Furthermore, IIRC your cluster experienced split-brain when it had four nodes, no?
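
To make the arithmetic behind the (N/2)+1 setting concrete, here is a small sketch. It assumes integer division and that every node is master-eligible, and it uses an idealised model in which a partition splits the cluster cleanly into two sides. The function names are illustrative, and the model does not cover the failure reported in comment 10.

```python
def quorum(n):
    """minimum_master_nodes for n nodes: (n / 2) + 1 with integer division."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return n // 2 + 1


def both_sides_can_elect(n):
    """True if some clean two-way partition lets both sides reach quorum."""
    q = quorum(n)
    return any(a >= q and n - a >= q for a in range(n + 1))


def survives_one_node_loss(n):
    """True if the remaining n - 1 nodes can still reach quorum."""
    return n - 1 >= quorum(n)


# Columns: nodes, quorum, split-brain possible, survives losing one node.
for n in (2, 3, 4, 5):
    print(n, quorum(n), both_sides_can_elect(n), survives_one_node_loss(n))
```

In this model no cluster size lets both sides of a clean partition elect a master, but a two-node cluster cannot lose a node and still reach quorum, while three or more nodes can.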
This Nagios check, https://github.com/anchor/nagios-plugin-elasticsearch, appears to offer a lot more logic and error output than the current one. I am going to file a bug to get it added to our Nagios setup so we can test it on stage ES.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: bburton → server-ops-webops
Filed a separate bug for the plugin in comment 12.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard