Closed Bug 1042336 Opened 10 years ago Closed 10 years ago

elasticsearch[1-3].bugs.scl3.mozilla.com is WARNING: Elasticsearch Health is Yellow

Categories

(mozilla.org Graveyard :: Server Operations: MOC, task)

Platform: Other
OS: Other
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nagiosapi, Unassigned)


Details

(Whiteboard: [id=nagios1.private.scl3.mozilla.com:387065])

Automated alert report from nagios1.private.scl3.mozilla.com:

Hostname: elasticsearch1.bugs.scl3.mozilla.com
Service:  nodes - Elasticsearch
State:    WARNING
Output:   One or more indexes are missing replica shards.  Use -vv to list them.

Runbook:  http://m.allizom.org/nodes+-+Elasticsearch
Automated alert acknowledgement: (rbryce) es2 and es3 reinitializing, slow but sure
Status: NEW → ASSIGNED
Kicked es3 as it seemed to be wedged.  According to the paramedic app, es3 is stuck initializing.  Not sure what action to take.
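
For reference, a minimal sketch (independent of the paramedic app) of pulling the same information straight from the ES HTTP API; the port and the use of Python here are assumptions, adjust for the real cluster endpoint:

import urllib.request

# Assumption: the ES HTTP API is reachable on port 9200 of any cluster node.
ES = "http://elasticsearch1.bugs.scl3.mozilla.com:9200"

# _cat/shards lists every shard with its index, shard number, primary/replica
# flag, state, and the node it is assigned to (blank when unassigned).
with urllib.request.urlopen(ES + "/_cat/shards") as resp:
    lines = resp.read().decode().splitlines()

# Anything not STARTED (INITIALIZING, RELOCATING, UNASSIGNED) is what is
# keeping the cluster health at yellow.
for line in lines:
    if " STARTED " not in line:
        print(line)
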
Summary: nodes - Elasticsearch on elasticsearch1.bugs.scl3.mozilla.com is WARNING: One or more indexes are missing replica shards. Use -vv to list them. → elasticsearch[1-3].bugs.scl3.mozilla.com is WARNING: Elasticsearch Health is Yellow
Automated alert acknowledgement: (w0ts0n) aware and looking
Brief update:

This morning, the state of the cluster was:
   elasticsearch1.bugs.scl3 (ES1) -- held the master copy of almost all index shards
   elasticsearch2.bugs.scl3 (ES2) -- held the master copy of one shard of the bmo_show_bug index, plus a few replica shards
   elasticsearch3.bugs.scl3 (ES3) -- held bupkis (no masters or replicas)

The last message in the Elasticsearch logs on ES3 shows that it had lost connectivity with the ES master (ES1).  There was a corresponding message in the logs on ES1.

Currently, we're going to try to let the cluster recover gracefully:
   1. wait for ES1 to stop GCing (it's been going on for about 40 minutes now)
   2. re-add ES2 and ES3 to the cluster if they've dropped out due to communication issues
   3. see if the cluster will self-heal (restart initializing nodes); a polling sketch for watching this follows below

If not, the plan is to work with Mark Cote to see if there is a need for a more aggressive recovery plan.  (The normal point of contact, Kyle, is out.)
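
To watch for step 3 without sitting on the boxes, something like the sketch below can poll the cluster health API until it goes back to green (endpoint, port and polling interval are assumptions):

import json
import time
import urllib.request

ES = "http://elasticsearch1.bugs.scl3.mozilla.com:9200"  # assumed endpoint/port

while True:
    with urllib.request.urlopen(ES + "/_cluster/health") as resp:
        health = json.load(resp)
    # status is green/yellow/red; the shard counters show whether recovery
    # is actually making progress or is stuck.
    print("%s  nodes=%d  initializing=%d  unassigned=%d" % (
        health["status"],
        health["number_of_nodes"],
        health["initializing_shards"],
        health["unassigned_shards"],
    ))
    if health["status"] == "green":
        break
    time.sleep(60)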

In trying to restart replication, and making a strategic choice to sacrifice bmo_show_bug if that's what it took to get replication recovered, the current state of the cluster is:

   ES1 -- holds master copy of most index shards
   ES2 -- holds bupkis
   ES3 -- holds master copy of bmo_allizom_show_bug shards 3+4, bmo_show_bug shard 1
 
I see attempts to initialize shards on ES2 and ES3, but they never leave the initializing state.  The theory is that ES1 is too busy with GC to respond correctly.
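
One way to sanity-check the GC theory from the outside is the node stats API; a rough sketch (endpoint/port assumed, field names per the ES 1.x node stats format as far as I recall):

import json
import urllib.request

ES = "http://elasticsearch1.bugs.scl3.mozilla.com:9200"  # assumed endpoint/port

# Per-node JVM stats: heap usage and cumulative GC time.  A node pegged near
# 100% heap and racking up old-gen GC time will be too busy to answer
# cluster-state requests promptly.
with urllib.request.urlopen(ES + "/_nodes/stats/jvm") as resp:
    stats = json.load(resp)

for node in stats["nodes"].values():
    jvm = node["jvm"]
    old_gc = jvm["gc"]["collectors"]["old"]
    print(node["name"],
          "heap=%d%%" % jvm["mem"]["heap_used_percent"],
          "old_gc=%ds" % (old_gc["collection_time_in_millis"] // 1000))
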
TIL: charts.mozilla.org will fall back to using the public bugzilla cluster if (rightly or wrongly) it determines that it cannot talk to the private bugzilla cluster.

The memory pressure on ES1 doesn't seem to be decreasing at all.  The current plan is to restart the cluster and force a rebuild of the indexes later this afternoon (when fubar is available).
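
For the restart this afternoon, the usual approach (as I understand it; not verified against our exact ES version) is to disable shard allocation before the full shutdown and re-enable it once all three nodes are back, so the cluster doesn't start shuffling shards mid-restart.  Roughly:

import json
import urllib.request

ES = "http://elasticsearch1.bugs.scl3.mozilla.com:9200"  # assumed endpoint/port

def set_allocation(mode):
    # cluster.routing.allocation.enable: "none" stops new shard allocation,
    # "all" restores normal behaviour.
    body = json.dumps(
        {"transient": {"cluster.routing.allocation.enable": mode}}
    ).encode()
    req = urllib.request.Request(
        ES + "/_cluster/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())

set_allocation("none")   # before stopping the ES services on es1-3
# ... full cluster stop/start happens out of band (init scripts) ...
set_allocation("all")    # once all three nodes have rejoined
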
Automated alert recovery:

Hostname: elasticsearch1.bugs.scl3.mozilla.com
Service:  nodes - Elasticsearch
State:    OK
Output:   Monitoring cluster 'bugs_public'
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
In a stroke of luck, it looks like we *might* not have to do any rebuilding of indexes.

After a full shutdown and restart of the cluster, it looks like all the shards came up cleanly, even the one shard of bmo_show_bug that it had been complaining about.  Some writes may be missing, but neither fubar nor I have the tools or knowledge to verify that.  In the worst case, one or more indexes may need to be rebuilt, but the cluster should now be back to being 1) fully replicated and 2) responsive (since query load should be spread back out among the servers).
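
For the record, a quick way to confirm the cluster really is fully replicated again, plus per-index doc counts as a very coarse check for missing writes (a sketch, endpoint/port assumed as before):

import json
import urllib.request

ES = "http://elasticsearch1.bugs.scl3.mozilla.com:9200"  # assumed endpoint/port

# Health must be green with nothing left unassigned or initializing before
# we can call the cluster fully replicated again.
with urllib.request.urlopen(ES + "/_cluster/health") as resp:
    health = json.load(resp)
assert health["status"] == "green", health
assert health["unassigned_shards"] == 0, health

# Per-index doc counts -- only a coarse signal, but a sudden drop on
# bmo_show_bug would suggest writes were lost and a rebuild is needed.
with urllib.request.urlopen(ES + "/_cat/indices?v") as resp:
    print(resp.read().decode())
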
Product: mozilla.org → mozilla.org Graveyard