Bug 1881059 (Open)

[tracker] elasticsearch cluster stability issues

Categories

(Socorro :: General, defect, P2)

Tracking

(Not tracked)

People

(Reporter: willkg, Unassigned)

References

Details

Periodically, we have instability in our Elasticsearch cluster. We don't have a good log of Elasticsearch issues anywhere, but we do have some bugs related to some of the outages, especially ones that cover engineering work:

  • bug #1319896 - evaluate increasing the elasticsearch shard count (2016/12)
  • bug #1322629 - [tracker] improve the reliability of the elasticsearch cluster (2019/2)
  • bug #1339636 - Add additional data nodes to production Socorro elasticsearch cluster (2017/2)
  • bug #1465552 - evaluate stability of prod elasticsearch (2018/10)

We had a crash stats outage due to the Elasticsearch cluster going unresponsive in 2023/12.

This bug is for keeping track of stability issues going forward, thoughts on what to do, etc.

For the 2023-12-27 incident, Jason said this:

I've seen cluster instability due to JVM GC collections in the past, though not for Socorro specifically. That said, I am surprised the cluster didn't become "unhealthy" in the Elasticsearch sense, switching from green to yellow/red.
In my experience, a JVM pause (this looks like an "old" generation major GC, a stop-the-world event that seems to have happened on multiple nodes for more than 10s around the same time) would have caused nodes to drop out of the cluster, but the health status of the cluster based on the Grafana/Influx metrics shows "green" throughout the period. Either way, I see node heap usage behave similarly, as in large heap changes, on individual nodes but not all at once: https://earthangel-b40313e5.influxcloud.net/d/VOh0nhSVz/socorro-production-infra?orgId=1&from=now-7d&to=now&viewPanel=98. I'm not certain these are "old" generation collections; there is no indication in the logs.

ES node ip-172-31-27-222 has been running for quite some time; these pauses seem to happen a couple of times a year. I'm not sure if those times can be correlated back to outages: https://gist.github.com/jasonthomas/9ef3eac8384b563673355f36ee4c8226

The investigation doesn't indicate that we saw increased request traffic that could have caused a spike.
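
As a side note for future investigation: the per-node heap and old-generation GC numbers described above can also be pulled directly from the cluster's HTTP API instead of the Grafana/Influx dashboards. A minimal sketch, assuming Python with requests and a placeholder endpoint (localhost:9200, not the production hostname); the fields used are the standard node-stats JVM fields:

    # Sketch: cluster health plus per-node heap usage and old-generation GC counters.
    # The ES endpoint below is a placeholder; point it at the real Socorro cluster.
    import requests

    ES = "http://localhost:9200"

    health = requests.get(f"{ES}/_cluster/health").json()
    print("cluster status:", health["status"])  # green / yellow / red

    stats = requests.get(f"{ES}/_nodes/stats/jvm").json()
    for node_id, node in stats["nodes"].items():
        jvm = node["jvm"]
        old_gc = jvm["gc"]["collectors"]["old"]
        print(
            node["name"],
            f"heap {jvm['mem']['heap_used_percent']}%",
            f"old-gen GC count: {old_gc['collection_count']}",
            f"old-gen GC time: {old_gc['collection_time_in_millis']}ms",
        )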

Then this:

There are a couple of options I'd consider investigating if stability issues continue, not in any specific order:

  1. Add additional nodes and increase the index replica factor to spread traffic (see the sketch after this list). This would spread requests across more nodes, and JVM GC might be able to handle cleanup events more quickly.
  2. Enable JVM GC logging. This would give us a better idea of what sort of JVM GC events are happening; right now we don't seem to log anything other than old generation events. That said, I wouldn't recommend this: it's a rabbit hole and a pain.
  3. Investigate switching to G1 GC. The article mentioned by @bdanforth@mozilla.com references this. It's not recommended for ES 1.x versions IIRC, but I've used it before with AMO and it resolved all the large pause issues.
  4. Upgrade ES? There have been a lot of improvements since, but I know that's already being considered.
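
As a rough illustration of option 1, here is a minimal sketch assuming Python with requests, a placeholder endpoint, and a placeholder index name ("socorro_crashes" here, not necessarily the real index). The replica count is a dynamic index setting, but extra replicas only help if there are enough data nodes to host them:

    # Sketch for option 1: raise the replica count so read traffic spreads over
    # more nodes. The endpoint and index name are placeholders.
    import requests

    ES = "http://localhost:9200"
    INDEX = "socorro_crashes"

    resp = requests.put(
        f"{ES}/{INDEX}/_settings",
        json={"index": {"number_of_replicas": 2}},
    )
    resp.raise_for_status()
    print(resp.json())

    # Options 2 and 3 live at the JVM level rather than the API level: GC logging
    # can be turned on with flags like -Xloggc:/path/to/gc.log -XX:+PrintGCDetails
    # (Java 8 syntax), and G1 is selected with -XX:+UseG1GC, passed via
    # ES_JAVA_OPTS or the jvm.options file depending on the ES version.
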
See Also: bug 1908974