Bug 1016215
Opened 11 years ago
Closed 11 years ago
color - Elasticsearch on elasticsearch1.webapp.scl3.mozilla.com is CRITICAL: Elasticsearch Health is red
Categories
(mozilla.org Graveyard :: Server Operations: MOC, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nagiosapi, Unassigned)
Details
(Whiteboard: [id=nagios1.private.scl3.mozilla.com:362045])
Automated alert report from nagios1.private.scl3.mozilla.com:
Hostname: elasticsearch1.webapp.scl3.mozilla.com
Service: color - Elasticsearch
State: CRITICAL
Output: Elasticsearch Health is red
Runbook: http://m.allizom.org/color+-+Elasticsearch
Comment 1 (Reporter)•11 years ago
Automated alert acknowledgement: (w0ts0n) should recover.
Status: NEW → ASSIGNED
Comment 2•11 years ago
CC'ing :cyliang. Looking at the head and paramedic plugins, I found 1 initialising/unassigned shard in the autolog index, which still hasn't found its home :)
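For reference, the same information the head and paramedic plugins display can be read from the cluster health API. The following is a minimal sketch, not what was actually run here: it assumes the node answers HTTP on `http://localhost:9200` (an assumed address), and the function names `fetch_health` and `unhealthy_indices` are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url="http://localhost:9200"):
    """GET /_cluster/health?level=indices and decode the JSON body.

    base_url is an assumption; point it at any node in the cluster.
    """
    with urlopen(base_url + "/_cluster/health?level=indices", timeout=10) as resp:
        return json.load(resp)


def unhealthy_indices(health):
    """Return {index: (status, initializing, unassigned)} for non-green indices."""
    result = {}
    for name, info in health.get("indices", {}).items():
        if info.get("status") != "green":
            result[name] = (
                info.get("status"),
                info.get("initializing_shards", 0),
                info.get("unassigned_shards", 0),
            )
    return result
```

Calling `unhealthy_indices(fetch_health())` against a cluster in the state described above would be expected to list only the autolog index.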
Comment 3 (Reporter)•11 years ago
Automated alert recovery:
Hostname: elasticsearch1.webapp.scl3.mozilla.com
Service: color - Elasticsearch
State: OK
Output: Elasticsearch Health is Green
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 4•11 years ago
ericz pinged me about this. One ES shard repeatedly failed to initialize. Paramedic showed the shard stuck initializing on elasticsearch4.webapp.scl3. The Elasticsearch log on that node showed:
[2014-05-27 17:23:58,939][WARN ][cluster.action.shard ] [elasticsearch4_scl3] [autolog][3] sending failed shard for [autolog][3], node[wV4W-9YbTrG99pYEMjB6xQ], [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[autolog][3] failed recovery]; nested: EngineCreationFailureException[[autolog][3] failed to create engine]; nested: LockReleaseFailedException[Cannot forcefully unlock a NativeFSLock which is held by another indexer component: /var/lib/elasticsearch/es_prod_scl3/nodes/0/indices/autolog/3/index/write.lock]; ]]
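The index name, shard number and lock file path can be pulled out of a warning like the one above when scanning a large log. This is only a sketch: the regular expressions are fitted to this single log line, and `parse_failed_shard` is a hypothetical helper name, not part of Elasticsearch.

```python
import re

# Matches "[autolog][3] failed recovery" and captures index and shard number.
SHARD_RE = re.compile(r"\[(?P<index>[^\]\[]+)\]\[(?P<shard>\d+)\] failed recovery")
# Matches an absolute path that ends in write.lock.
LOCK_RE = re.compile(r"(?P<path>/\S*write\.lock)")


def parse_failed_shard(line):
    """Return index, shard and lock path from a failed-recovery log line, or None."""
    shard = SHARD_RE.search(line)
    if shard is None:
        return None
    lock = LOCK_RE.search(line)
    return {
        "index": shard.group("index"),
        "shard": int(shard.group("shard")),
        "lock_path": lock.group("path") if lock else None,
    }
```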
lsof showed that the ES process itself was the only thing that had that file open.
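If lsof is not available, a rough equivalent of that check on Linux is to walk `/proc/<pid>/fd` and compare symlink targets. The sketch below is an illustration under that assumption; `pids_with_file_open` is a hypothetical name, and like lsof it only shows which processes have the file open, not which one holds the lock.

```python
import os


def pids_with_file_open(path):
    """Return sorted PIDs that have `path` open, by scanning /proc (Linux only)."""
    target = os.path.realpath(path)
    holders = set()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.add(int(entry))
            except OSError:
                continue  # descriptor closed while we were looking
    return sorted(holders)
```

Run as root (or as the elasticsearch user) so that the descriptors of the ES process are readable.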
I got the shard to initialize by:
- stopping ES on elasticsearch4.webapp.scl3
- removing the lock file from the file system
- restarting ES
(If I tried to restart ES without removing the lock file first, it would spew a fresh batch of "Cannot forcefully unlock a NativeFSLock" errors.)
ES was then able to initialize the problematic shard.
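The lock-removal step above can be guarded so that the file is never deleted while ES is still running. This is a hedged sketch, not the commands actually used: `remove_stale_lock` is a hypothetical helper, the caller is assumed to have determined whether ES is running, and stopping and restarting the service is left to whatever init tooling the host uses.

```python
import os


def remove_stale_lock(lock_path, es_is_running):
    """Delete a leftover write.lock, but only when Elasticsearch is stopped.

    Returns True if the file was removed, False if it was already gone.
    """
    if es_is_running:
        raise RuntimeError("stop Elasticsearch before removing " + lock_path)
    if os.path.basename(lock_path) != "write.lock":
        raise ValueError("refusing to remove unexpected file: " + lock_path)
    try:
        os.remove(lock_path)
        return True
    except FileNotFoundError:
        return False
```

Refusing to act while ES is up matches the order of the steps above: stop, remove, restart.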
Updated (Assignee)•10 years ago
Product: mozilla.org → mozilla.org Graveyard