[tracker] elasticsearch cluster stability issues
Categories
(Socorro :: General, defect, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Unassigned)
References
Details
Periodically, we have instability in our Elasticsearch cluster. We don't have a good log of Elasticsearch issues anywhere, but we do have some bugs related to some of the outages, especially ones that cover engineering work:
- bug #1319896 - evaluate increasing the elasticsearch shard count (2016/12)
- bug #1322629 - [tracker] improve the reliability of the elasticsearch cluster (2019/2)
- bug #1339636 - Add additional data nodes to production Socorro elasticsearch cluster (2017/2)
- bug #1465552 - evaluate stability of prod elasticsearch (2018/10)
We had a crash stats outage in 2023/12 due to the Elasticsearch cluster going unresponsive:
- bug #1872150 - crash stats outage (15 minutes) due to elasticsearch cluster going unresponsive (2023/12)
This bug is for keeping track of stability issues going forward, thoughts on what to do, etc.
Reporter
Comment 1 • 1 year ago
For the 2023-12-27 incident, Jason said this:
I've seen cluster instability due to JVM GC collections in the past, though not for Socorro specifically. Although I am surprised that the cluster didn't become "unhealthy" in the Elasticsearch sense, switching from green to yellow/red.
In my experience, a JVM pause (this looks like an "old" generation major GC, a stop-the-world event that seems to have happened on multiple nodes for greater than 10s around the same time) would have caused nodes to drop out of the cluster, but the health status of the cluster based on grafana/influx metrics shows "green" throughout the period. Either way, I see node heap usage behave similarly (large heap changes) on individual nodes but not all at once: https://earthangel-b40313e5.influxcloud.net/d/VOh0nhSVz/socorro-production-infra?orgId=1&from=now-7d&to=now&viewPanel=98. I'm not certain these are "old" generation; there is no indication in the logs.
ES node ip-172-31-27-222 has been running for quite some time. These pauses seem to happen a couple of times a year; I'm not sure if those times can be correlated back to outages: https://gist.github.com/jasonthomas/9ef3eac8384b563673355f36ee4c8226
Based on the investigation, it doesn't look like we saw increased request traffic that could have caused a spike.
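(For reference, the per-node heap and GC numbers described above are exposed by Elasticsearch's node stats API. A minimal sketch of pulling them, assuming an illustrative http://localhost:9200 endpoint rather than Socorro's actual hosts; collector names can differ between ES versions:)

# Sketch: dump per-node JVM heap usage and old-generation GC activity.
# The endpoint below is an illustrative assumption, not Socorro's config.
import requests

resp = requests.get("http://localhost:9200/_nodes/stats/jvm", timeout=10)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    jvm = node["jvm"]
    heap_pct = jvm["mem"]["heap_used_percent"]
    # "old" is the normalized old-generation collector name; older ES
    # versions may report the raw collector name instead.
    old_gc = jvm["gc"]["collectors"]["old"]
    print(
        f"{node.get('name', node_id)}: "
        f"heap {heap_pct}%, "
        f"old-gen GCs {old_gc['collection_count']} "
        f"({old_gc['collection_time_in_millis']} ms total)"
    )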
Then this:
There are a couple of options I'd consider investigating if stability issues continue, in no specific order:
- Add additional nodes and increase the index replica factor to spread traffic. This would spread requests across more nodes, and JVM GC might be able to handle cleanup events more quickly (see the sketch after this list).
- Enable JVM GC logging. This would give us a better idea of what sort of JVM GC events are happening; right now we don't seem to log anything other than old generation events. That said, I'd not recommend this: it's a rabbit hole and a pain.
- Investigate switching to G1 GC. The article mentioned by @bdanforth@mozilla.com references this. It's not recommended for the ES 1.x versions IIRC, but I've used it before with AMO and it resolved all the large pause issues.
- Upgrade ES? There have been a lot of improvements since, but I know that's already being considered.
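(As a rough illustration of the first option: the replica factor is an index setting that can be bumped at runtime once extra data nodes are in place. A minimal sketch, where the index name "socorro_crash_reports" and the endpoint are hypothetical, not Socorro's actual index or hosts:)

# Sketch: raise the replica count on an index so reads spread across more nodes.
# Index name and endpoint are illustrative assumptions.
import requests

ES_URL = "http://localhost:9200"
INDEX = "socorro_crash_reports"  # hypothetical index name

resp = requests.put(
    f"{ES_URL}/{INDEX}/_settings",
    json={"index": {"number_of_replicas": 2}},  # e.g. bump from 1 to 2
    timeout=10,
)
resp.raise_for_status()

# Replicas only help if there are enough data nodes to host them; check that
# the cluster returns to green once the new copies are allocated.
health = requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()
print(health["status"], health["unassigned_shards"])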