Automated alert report from nagios1.private.scl3.mozilla.com: Hostname: elasticsearch1.bugs.scl3.mozilla.com Service: nodes - Elasticsearch State: WARNING Output: One or more indexes are missing replica shards. Use -vv to list them. Runbook: http://m.allizom.org/nodes+-+Elasticsearch
Automated alert acknowledgement: (rbryce)es2 and es3 reinitializing slowly but surely
Kicked es3 as it seemed to be wedged. According to the paramedic app, es3 is stuck initializing. Not sure what action to take.
Automated alert acknowledgement: (w0ts0n)aware and looking
Brief update:

This morning, the state of the cluster was:
  elasticsearch1.bugs.scl3 (ES1) -- held master copy of almost all index shards
  elasticsearch2.bugs.scl3 (ES2) -- held master copy of a shard for bmo_show_bug index, few replica shards
  elasticsearch3.bugs.scl3 (ES3) -- held bupkis (no masters or replicas)

The last message in the ES logs on ES3 shows that it had lost connectivity with the ES master (ES1). There was a corresponding message in the logs on ES1.

Currently, we're going to try to let the cluster recover gracefully:
  1. wait for ES1 to stop GCing (it's been going on for about 40 minutes now)
  2. re-add ES2 and ES3 to the cluster if they've dropped out due to communication issues
  3. see if the cluster will self-heal (restart initializing nodes)

If not, the plan is to work with Mark Cote to see if there is a need for a more aggressive recovery plan. (The normal point of contact, Kyle, is out.)

After trying to restart replication, and making a strategic choice to sacrifice bmo_show_bug if replication could be recovered, the current state of the cluster is:
  ES1 -- holds master copy of most index shards
  ES2 -- holds bupkis
  ES3 -- holds master copy of bmo_allizom_show_bug shards 3+4, bmo_show_bug shard 1

I see attempts to initialize shards on ES2 and ES3, but they never leave the initializing state. The theory is that ES1 is too busy with GC to respond correctly.
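For anyone who wants to reproduce the per-node tallies above without eyeballing the paramedic app, here is a rough sketch. It assumes shard records shaped like the JSON output of the `_cat/shards` endpoint (fields `index`, `shard`, `prirep`, `state`, `node`); that endpoint may not exist on the Elasticsearch version this cluster runs, in which case the same information would have to be pulled from the cluster state instead. The function name and the sample records are made up for illustration and are not real cluster output.

```python
from collections import defaultdict


def summarize_shards(shards):
    """Count started primaries, started replicas, and initializing shards per node.

    `shards` is assumed to be a list of dicts with the keys
    'prirep' ('p' or 'r'), 'state', and 'node' (None when unassigned).
    """
    summary = defaultdict(lambda: {"primary": 0, "replica": 0, "initializing": 0})
    for shard in shards:
        node = shard.get("node") or "UNASSIGNED"
        if shard["state"] == "INITIALIZING":
            summary[node]["initializing"] += 1
        elif shard["state"] == "STARTED":
            kind = "primary" if shard["prirep"] == "p" else "replica"
            summary[node][kind] += 1
        # Other states (e.g. UNASSIGNED, RELOCATING) are not counted here.
    return dict(summary)


# Hypothetical sample records, not taken from the real cluster.
sample = [
    {"index": "bmo_show_bug", "shard": "0", "prirep": "p", "state": "STARTED", "node": "ES1"},
    {"index": "bmo_show_bug", "shard": "1", "prirep": "p", "state": "STARTED", "node": "ES3"},
    {"index": "bmo_show_bug", "shard": "0", "prirep": "r", "state": "INITIALIZING", "node": "ES2"},
    {"index": "bmo_show_bug", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "node": None},
]

if __name__ == "__main__":
    for node, counts in sorted(summarize_shards(sample).items()):
        print(node, counts)
```

A node that shows only `initializing` counts across repeated runs would match the "never leaves the initializing state" symptom described above.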
TIL: charts.mozilla.org will fall back to using the public bugzilla cluster if (rightly or wrongly) it determines that it cannot talk to the private bugzilla cluster. The memory pressure on ES1 doesn't seem to be decreasing at all. The current plan is to restart the cluster and force a rebuild of the indexes later this afternoon (when fubar is available).
Automated alert recovery: Hostname: elasticsearch1.bugs.scl3.mozilla.com Service: nodes - Elasticsearch State: OK Output: Monitoring cluster 'bugs_public'
In a stroke of luck, it looks like we *might* not have to rebuild any indexes. After a full shutdown and restart of the cluster, it looks like all the shards came up cleanly, even the one shard of bmo_show_bug that it had been complaining about. Some writes may be missing, but neither fubar nor I have the tools or knowledge to verify that. In the worst case, one or more indexes may need to be rebuilt, but the cluster should now be back to being 1) fully replicated and 2) responsive (since query load should be spread back out among the servers).
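A quick way to double-check the "fully replicated" part is to look at the cluster health API. The sketch below assumes the standard `_cluster/health` response fields (`status`, `unassigned_shards`, `initializing_shards`); the base URL and the function names are placeholders I made up. It says nothing about whether any writes were lost, only whether every shard copy is currently allocated.

```python
import json
from urllib.request import urlopen


def is_fully_replicated(health):
    """Return True if the health document reports every shard copy as allocated.

    `health` is assumed to be the parsed JSON body of GET /_cluster/health.
    """
    return (
        health.get("status") == "green"
        and health.get("unassigned_shards", 0) == 0
        and health.get("initializing_shards", 0) == 0
    )


def fetch_health(base_url="http://localhost:9200"):
    """Fetch cluster health; base_url is a placeholder, not the real host."""
    with urlopen(base_url + "/_cluster/health", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    health = fetch_health()
    print(health.get("cluster_name"), health.get("status"))
    print("fully replicated:", is_fully_replicated(health))
```

A "yellow" status would mean all primaries are up but at least one replica is missing, which is the condition the original Nagios warning was reporting.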