Closed
Bug 945248
Opened 11 years ago
Closed 10 years ago
Add nagios checks to our ES clusters
Categories
(Infrastructure & Operations Graveyard :: WebOps: Socorro, task)
Infrastructure & Operations Graveyard
WebOps: Socorro
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: laura, Unassigned)
Details
I'd like to check for two things:

1. That the cluster isn't red
2. That the number of nodes reported is equal to the number we expect

You can get both these things from e.g.

curl -XGET 'http://<hostname>:<port>/_cluster/health?pretty=true'

This will return JSON something like:

{
  "cluster_name" : "sss_stage",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 225,
  "active_shards" : 450,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

status should be green; we should WARN on yellow and CRITICAL on red.

number_of_nodes should be whatever it is for that cluster; we should go CRITICAL if it's less.

We should probably check this on each node, and each node should report the same things.

This will help us notice issues like bug 944762.
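The WARN/CRITICAL logic described above can be sketched as a small Nagios-style check. This is a hypothetical illustration, not the plugin that was actually deployed; the host, port, and expected node count are parameters the caller supplies.

```python
import json
import sys
from urllib.request import urlopen

# Standard Nagios exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(health, expected_nodes):
    """Map a /_cluster/health response body to a Nagios (exit code, message) pair."""
    status = health.get("status")
    nodes = health.get("number_of_nodes", 0)
    # Fewer nodes than expected: possible node outage or split brain
    if nodes < expected_nodes:
        return CRITICAL, "CRITICAL: only %d of %d expected nodes visible" % (nodes, expected_nodes)
    if status == "red":
        return CRITICAL, "CRITICAL: cluster status is red"
    if status == "yellow":
        return WARNING, "WARNING: cluster status is yellow"
    return OK, "OK: cluster status is %s with %d nodes" % (status, nodes)

def check_cluster(host, port, expected_nodes, timeout=10):
    """Query one node's view of cluster health and classify it."""
    url = "http://%s:%s/_cluster/health" % (host, port)
    try:
        health = json.load(urlopen(url, timeout=timeout))
    except Exception as exc:
        return UNKNOWN, "UNKNOWN: could not query %s (%s)" % (url, exc)
    return classify(health, expected_nodes)

if __name__ == "__main__" and len(sys.argv) == 4:
    code, message = check_cluster(sys.argv[1], sys.argv[2], int(sys.argv[3]))
    print(message)
    sys.exit(code)
```

Run per node (e.g. `check_es_health.py socorro-es1 9200 3`) so each node's view of the cluster is checked independently, as the description suggests.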
Updated•11 years ago
Assignee: server-ops-webops → bburton
Comment 1•11 years ago
-> % svn diff
Index: hosts/phx1.pp
===================================================================
--- hosts/phx1.pp	(revision 79320)
+++ hosts/phx1.pp	(working copy)
@@ -3265,6 +3265,33 @@
         'elasticsearch-it',
       ]
     },
+    'socorro-es1.stage.webapp.phx1.mozilla.com' => {
+      parents => 'boa-b1.r101-4.console.phx1.mozilla.com',
+      ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+      hostgroups => [
+        'hp-servers',
+        'generic',
+        'socorro-stage-elasticsearch',
+      ]
+    },
+    'socorro-es2.stage.webapp.phx1.mozilla.com' => {
+      parents => 'boa-b1.r101-4.console.phx1.mozilla.com',
+      ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+      hostgroups => [
+        'hp-servers',
+        'generic',
+        'socorro-stage-elasticsearch',
+      ]
+    },
+    'socorro-es3.stage.webapp.phx1.mozilla.com' => {
+      parents => 'boa-b1.r101-4.console.phx1.mozilla.com',
+      ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+      hostgroups => [
+        'hp-servers',
+        'generic',
+        'socorro-stage-elasticsearch',
+      ]
+    },
     'tbpl1.db.phx1.mozilla.com' => {
       parents => 'boa-d09-b1.console.phx1.mozilla.com',
       hostgroups => [
Index: mozilla/services.pp
===================================================================
--- mozilla/services.pp	(revision 79320)
+++ mozilla/services.pp	(working copy)
@@ -1075,7 +1075,7 @@
       default => [
         'generic',
         "!elasticsearch-it",
-        "!socorro-es",
+        "!socorro-stage-elasticsearch",
         "!load-100",
         "!fuzzer-hosts",
         '!generic-preprod'
@@ -1090,7 +1090,7 @@ hostgroups => $::fqdn ?
     {
       'nagios1.private.phx1.mozilla.com' => [
         'elasticsearch-it',
-        'socorro-es',
+        'socorro-stage-elasticsearch',
       ],
       default => [
       ]
@@ -3937,7 +3937,7 @@
         'elasticsearch-jenkins',
         'elasticsearch-it',
         'elasticsearch-vip',
-#       "socorro-es",
+        "socorro-stage-elasticsearch",
       ],
       'nagios1.private.scl3.mozilla.com' => [
         'elasticsearch-it',
Index: mozilla/hostgroups.pp
===================================================================
--- mozilla/hostgroups.pp	(revision 79320)
+++ mozilla/hostgroups.pp	(working copy)
@@ -951,9 +951,12 @@
     'graphs-vip' => {
       alias => "VIP for graphs.m.o",
     },
-    'socorro-es' => {
-      alias => "Socorro Search Service nodes",
+    'socorro-stage-elasticsearch' => {
+      alias => "Socorro Stage ElasticSearch nodes",
     },
+    'socorro-elasticsearch' => {
+      alias => "Socorro ElasticSearch nodes",
+    },
     'jenkins-servers' => {
       alias => "Jenkins nodes",
     },

bburton@althalus [07:02:55] [~/code/mozilla/sysadmins/puppet/trunk/modules/nagios/manifests]
-> % svn ci -m "adding es color check to socorro stage es, bug 945248"
Sending        manifests/hosts/phx1.pp
Sending        manifests/mozilla/hostgroups.pp
Sending        manifests/mozilla/services.pp
Transmitting file data ...
Committed revision 79330.
Comment 2•11 years ago
Things are looking good on stage; going to roll out to production.
Comment 3•11 years ago
Adding to prod, checks are all green

-> % svn diff
Index: hosts/phx1.pp
===================================================================
--- hosts/phx1.pp	(revision 79348)
+++ hosts/phx1.pp	(working copy)
@@ -3292,6 +3292,51 @@
         'socorro-stage-elasticsearch',
       ]
     },
+    'socorro-es1.webapp.phx1.mozilla.com' => {
+      parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+      ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+      hostgroups => [
+        'hp-servers',
+        'generic',
+        'socorro-elasticsearch',
+      ]
+    },
+    'socorro-es2.webapp.phx1.mozilla.com' => {
+      parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+      ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+      hostgroups => [
+        'hp-servers',
+        'generic',
+        'socorro-elasticsearch',
+      ]
+    },
+    'socorro-es3.webapp.phx1.mozilla.com' => {
+      parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+      ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+      hostgroups => [
+        'hp-servers',
+        'generic',
+        'socorro-elasticsearch',
+      ]
+    },
+    'socorro-es1.dev.webapp.phx1.mozilla.com' => {
+      parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+      ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+      hostgroups => [
+        'hp-servers',
+        'generic',
+        'socorro-elasticsearch',
+      ]
+    },
+    'socorro-es2.dev.webapp.phx1.mozilla.com' => {
+      parents => 'boa-b1.r101-8.console.phx1.mozilla.com',
+      ganglia_phx1_cluster => "Socorro Stage ElasticSearch",
+      hostgroups => [
+        'hp-servers',
+        'generic',
+        'socorro-elasticsearch',
+      ]
+    },
     'tbpl1.db.phx1.mozilla.com' => {
       parents => 'boa-d09-b1.console.phx1.mozilla.com',
       hostgroups => [
Index: mozilla/services.pp
===================================================================
--- mozilla/services.pp	(revision 79348)
+++ mozilla/services.pp	(working copy)
@@ -1091,6 +1091,7 @@
       'nagios1.private.phx1.mozilla.com' => [
         'elasticsearch-it',
         'socorro-stage-elasticsearch',
+        'socorro-elasticsearch',
       ],
       default => [
       ]
@@ -3938,6 +3939,7 @@
         'elasticsearch-it',
         'elasticsearch-vip',
         "socorro-stage-elasticsearch",
+        "socorro-elasticsearch",
       ],
       'nagios1.private.scl3.mozilla.com' => [
         'elasticsearch-it',

bburton@althalus [10:03:33] [~/code/mozilla/sysadmins/puppet/trunk/modules/nagios/manifests]
-> % svn ci -m "adding es color check to socorro prod es, bug 945248"
Sending        manifests/hosts/phx1.pp
Sending        manifests/mozilla/services.pp
Transmitting file data ..
Committed revision 79351.
Comment 4•11 years ago
Adding check for proper number of running processes

-> % svn diff
Index: mozilla/services.pp
===================================================================
--- mozilla/services.pp	(revision 79352)
+++ mozilla/services.pp	(working copy)
@@ -3965,7 +3965,8 @@ hostgroups => $::fqdn ?
     {
       'nagios1.private.phx1.mozilla.com' => [
         'elasticsearch-it',
-#       "socorro-es",
+        'socorro-stage-elasticsearch',
+        "socorro-elasticsearch",
       ],
       default => [
       ]

bburton@althalus [10:24:23] [~/code/mozilla/sysadmins/puppet/trunk/modules/nagios/manifests]
-> % svn ci -m "adding es procs check to socorro stage and prod es, bug 945248"
Sending        manifests/mozilla/services.pp
Transmitting file data .
Committed revision 79353.
Status: NEW → ASSIGNED
Comment 5•11 years ago
(In reply to Laura Thomson :laura from comment #0)
> I'd like to check for two things:
> 1. That the cluster isn't red
> 2. That the number of nodes reported is equal to the number we expect
>
> You can get both these things from e.g.
> curl -XGET 'http://<hostname>:<port>/_cluster/health?pretty=true'
>
> This will return JSON something like:
> {
>   "cluster_name" : "sss_stage",
>   "status" : "green",
>   "timed_out" : false,
>   "number_of_nodes" : 2,
>   "number_of_data_nodes" : 2,
>   "active_primary_shards" : 225,
>   "active_shards" : 450,
>   "relocating_shards" : 0,
>   "initializing_shards" : 0,
>   "unassigned_shards" : 0
> }
>
> status should be green, we should WARN on yellow and CRITICAL on red.
>
> number_of_nodes should be whatever it is for that cluster, we should
> CRITICAL if it's less.
>
> We should probably check this on each node, and each node should report the
> same things.
>
> This will help us notice issues like bug 944762.

As of now, stage and prod have the same monitoring we use on other ES clusters. We'll get alerted if the "color" goes to yellow or red, and we'll get alerted if the ES process is not running.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter
Comment 6•11 years ago
I'm not sure the checks you have will detect split brain. Thoughts?
Reporter
Comment 7•11 years ago
In fact, I'm sure it won't. Can we get that check in place?
Comment 8•11 years ago
(In reply to Laura Thomson :laura from comment #7)
> In fact, I'm sure it won't. Can we get that check in place?

We don't have a check that currently does that; what we have is what I've added. I'll discuss other check possibilities with :phrawzty and see what he knows.
Comment 9•11 years ago
(In reply to Laura Thomson :laura from comment #6)
> I'm not sure the checks you have will detect split brain. Thoughts?

If your particular cluster is in split-brain it will (at least) be yellow, and likely red, insofar as colour goes. That said, the literal condition of split-brain would not be explicitly detected by the commits noted above.

I would suggest adding additional configuration parameters to the cluster that make explicit declarations about the number of expected nodes, which will actually prevent split-brain from happening in the first place (by disallowing interactions on split nodes). This functionally boils down to setting "discovery.zen.minimum_master_nodes = (N/2)+1", where N is the number of nodes in the cluster.
Reporter | ||
Comment 10•11 years ago
(In reply to Daniel Maher [:phrawzty] from comment #9)
> (In reply to Laura Thomson :laura from comment #6)
> > I'm not sure the checks you have will detect split brain. Thoughts?
>
> If your particular cluster is in split-brain it will (at least) be yellow,
> and likely red, insofar as colour goes. That said, the literal condition of
> split-brain would not be explicitly detected by the commits noted above.
>
> I would suggest adding additional configuration parameters to the cluster
> that make explicit declarations about the number of expected nodes, which
> will actually prevent split-brain from happening in the first place (by
> disallowing interactions on split nodes). This functionally boils down to
> setting "discovery.zen.minimum_master_nodes = (N/2)+1" where N is the
> number of nodes in the cluster.

This doesn't always work, fwiw. I've seen the problem in flight and it's also described here: http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-elasticsearch/ (in the conclusion; there's also a link to the ES issue). He suggests monitoring the number of nodes each node sees as a detection mechanism.
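The detection mechanism suggested here (compare the node count each node reports) could be sketched roughly as follows. This is a hypothetical helper, not an existing plugin; host names and port are illustrative.

```python
import json
from urllib.request import urlopen

def nodes_seen_by(host, port=9200, timeout=10):
    """Ask a single node how many cluster members it currently sees."""
    url = "http://%s:%d/_cluster/health" % (host, port)
    return json.load(urlopen(url, timeout=timeout)).get("number_of_nodes", 0)

def split_brain_suspected(views, expected_nodes):
    """views maps each host to the node count it reported, e.g.
    {"es1": 3, "es2": 3, "es3": 3}. Disagreement between hosts, or any
    host seeing fewer members than expected, suggests a partitioned
    (split-brain) cluster even if each partition reports itself green."""
    counts = set(views.values())
    return len(counts) > 1 or min(counts) < expected_nodes
```

The key point is that this check must query every node directly rather than a single VIP, since each side of a partition will happily answer for its own sub-cluster.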
Comment 11•11 years ago
(In reply to Laura Thomson :laura from comment #10)
> This doesn't always work, fwiw. I've seen the problem in flight and it's
> also described here:
> http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-elasticsearch/
> (in the conclusion; there's also a link to the ES issue). He suggests
> monitoring the number of nodes each node sees as a detection mechanism.

The URL you link to specifically addresses this calculus in a two-node cluster only, and specifically notes that a three-node cluster "solves" the scenario; in fact, the calculus works for clusters of any odd number greater than one. Furthermore, IIRC your cluster experienced split-brain when it had four nodes, no?
Comment 12•11 years ago
This Nagios check, https://github.com/anchor/nagios-plugin-elasticsearch, appears to offer a lot more logic and error output than the current one, so I am going to file a bug to get it added to our Nagios setup, and we can test it on stage ES.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•10 years ago
Assignee: bburton → server-ops-webops
Reporter
Comment 13•10 years ago
Filed a separate bug for the plugin mentioned in comment 12.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
Updated•8 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard