Closed Bug 1016215 Opened 11 years ago Closed 11 years ago

color - Elasticsearch on elasticsearch1.webapp.scl3.mozilla.com is CRITICAL: Elasticsearch Health is red

Categories

(mozilla.org Graveyard :: Server Operations: MOC, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nagiosapi, Unassigned)

Details

(Whiteboard: [id=nagios1.private.scl3.mozilla.com:362045])

Automated alert report from nagios1.private.scl3.mozilla.com: Hostname: elasticsearch1.webapp.scl3.mozilla.com Service: color - Elasticsearch State: CRITICAL Output: Elasticsearch Health is red Runbook: http://m.allizom.org/color+-+Elasticsearch
Automated alert acknowledgement: (w0ts0n) should recover.
Status: NEW → ASSIGNED
CC'ing :cyliang. Looking at the head and paramedic plugins, I found 1 initialising/unassigned shard in the autolog index, which still hasn't found its home :)
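The shard counts that the head and paramedic plugins display are also exposed by the cluster health API. A minimal sketch, assuming the JSON body of `GET /_cluster/health` has already been fetched and parsed into a dict; the function name `summarize_health` and the sample values are hypothetical and were not captured from this cluster:

```python
def summarize_health(health):
    """Return a one-line summary of a parsed /_cluster/health response."""
    status = health.get("status", "unknown")
    initializing = health.get("initializing_shards", 0)
    unassigned = health.get("unassigned_shards", 0)
    # Anything other than green, or any shard without a home, needs a look.
    needs_attention = status != "green" or (initializing + unassigned) > 0
    return "status=%s initializing=%d unassigned=%d needs_attention=%s" % (
        status, initializing, unassigned, needs_attention)


# Hypothetical sample shaped like a cluster health response.
sample = {"status": "red", "initializing_shards": 1, "unassigned_shards": 0}
print(summarize_health(sample))
```

This only summarizes the counts; finding which index and shard is stuck still needs a per-shard view such as the plugins mentioned above.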
Automated alert recovery: Hostname: elasticsearch1.webapp.scl3.mozilla.com Service: color - Elasticsearch State: OK Output: Elasticsearch Health is Green
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
ericz pinged me about this. There was one ES shard that would repeatedly fail to initialize. In paramedic, the shard appeared to be failing to initialize on elasticsearch4.webapp.scl3. Digging into the elasticsearch log showed:

[2014-05-27 17:23:58,939][WARN ][cluster.action.shard ] [elasticsearch4_scl3] [autolog][3] sending failed shard for [autolog][3], node[wV4W-9YbTrG99pYEMjB6xQ], [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[autolog][3] failed recovery]; nested: EngineCreationFailureException[[autolog][3] failed to create engine]; nested: LockReleaseFailedException[Cannot forcefully unlock a NativeFSLock which is held by another indexer component: /var/lib/elasticsearch/es_prod_scl3/nodes/0/indices/autolog/3/index/write.lock]; ]]

lsof showed that the ES process itself was the only thing that had that file open.

I was able to get the shard initialized by:
- stopping ES on elasticsearch4.webapp.scl3
- removing the lock file from the file system
- restarting ES

(If I tried to restart ES without removing the lock file first, it would spew a fresh batch of "Cannot forcefully unlock a NativeFSLock" errors.) ES was then able to initialize the problematic shard.
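For anyone hitting the same error, the index, shard number, and lock file path can be pulled out of the log line mechanically before deciding what to remove. A rough sketch, assuming the log format matches the line quoted above; the regular expressions and the function name `parse_failed_shard` are illustrative and have not been checked against other Elasticsearch versions:

```python
import re

# Patterns written against the single log line quoted in this bug.
FAILED_SHARD = re.compile(
    r"sending failed shard for \[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]")
LOCK_PATH = re.compile(
    r"held by another indexer component: (?P<path>\S+?write\.lock)")


def parse_failed_shard(line):
    """Return index, shard and lock path from a failed-shard log line."""
    shard = FAILED_SHARD.search(line)
    if shard is None:
        return None  # not a failed-shard line
    lock = LOCK_PATH.search(line)
    return {
        "index": shard.group("index"),
        "shard": int(shard.group("shard")),
        "lock_path": lock.group("path") if lock else None,
    }


# Excerpts of the log line from this bug, joined into one string.
LINE = (
    "[elasticsearch4_scl3] [autolog][3] sending failed shard for [autolog][3], "
    "node[wV4W-9YbTrG99pYEMjB6xQ], [P], s[INITIALIZING], "
    "nested: LockReleaseFailedException[Cannot forcefully unlock a NativeFSLock "
    "which is held by another indexer component: "
    "/var/lib/elasticsearch/es_prod_scl3/nodes/0/indices/autolog/3/index/write.lock]; ]]"
)
info = parse_failed_shard(LINE)
print(info)
```

The sketch only reports the path. As described above, the lock file was removed by hand, and only after ES had been stopped on the affected node.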
Product: mozilla.org → mozilla.org Graveyard