Closed Bug 1451449 Opened 7 years ago Closed 7 years ago

Deal with elasticsearch fielddata growth in -new-prod

Categories: (Socorro :: Infra, task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: brian, Assigned: brian)

Description
Background reading: https://www.elastic.co/guide/en/elasticsearch/guide/1.x/_limiting_memory_usage.html#circuit-breaker
In our old prod ES cluster, fielddata grew to 70GB used. [0]
In our new cluster, we only have 48GB available (5 nodes * 16GB heap * 60% circuit breaker). Usage has already grown to 32GB. [1]
To address this, I see a few options:
1) Have elasticsearch evict fielddata once it grows too large. This would impact performance to a degree that may or may not be tolerable. To do this we would set indices.fielddata.cache.size to a value less than 60%. This setting cannot be changed dynamically, we have to update elasticsearch.yml and perform a rolling restart.
2) Add more nodes to the cluster. 7 nodes would get us to 67GB, and 9 would get us to 86GB.
3) Replace our existing m4.2xlarge nodes ($0.4, 32GB) with r4.2xlarge ($0.532, 61GB) or m4.4xlarge ($0.8, 64GB) in order to get more than 90GB for fielddata.
[0] https://app.datadoghq.com/metric/explorer?live=false&page=0&is_auto=false&from_ts=1514786400000&to_ts=1522558799999&tile_size=m&exp_metric=elasticsearch.fielddata.size&exp_scope=environment%3Aprod&exp_agg=sum&exp_row_type=metric
[1] https://app.datadoghq.com/metric/explorer?live=true&page=0&is_auto=false&from_ts=1522263365705&to_ts=1522868165705&tile_size=m&exp_metric=elasticsearch.fielddata.size&exp_scope=app%3Asocorro%2Cenv%3Aprod&exp_agg=sum&exp_row_type=metric
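For reference, a minimal sketch of pulling the same per-node fielddata numbers straight from the node stats API, assuming Python with the `requests` library and a cluster reachable at a hypothetical http://localhost:9200 (the 16GB heap and 60% breaker figures come from the description above):

```python
# Sketch only: per-node fielddata usage and eviction counts from the ES
# node stats API. The endpoint URL is a placeholder; the 16GB/60% numbers
# are the ones quoted in the description above.
import requests

ES = "http://localhost:9200"        # hypothetical cluster endpoint
PER_NODE_LIMIT_GB = 16 * 0.60       # 16GB heap * 60% fielddata circuit breaker

stats = requests.get(ES + "/_nodes/stats/indices").json()
for node_id, node in stats["nodes"].items():
    fd = node["indices"]["fielddata"]
    used_gb = fd["memory_size_in_bytes"] / 2**30
    print("%s: fielddata %.1fGB of ~%.1fGB, evictions=%d" % (
        node.get("name", node_id), used_gb, PER_NODE_LIMIT_GB, fd["evictions"]))
```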
Comment 1 • 7 years ago (Assignee)
Actually, my comment about 70GB of fielddata usage was misleading. That was looking at it since the start of the year, around the time all data was wiped out. So 70GB covers only the roughly 3 months of data that we have now. Data will grow until we have a year's worth.
Looking at all of 2017, fielddata usage topped out at 190GB!
https://app.datadoghq.com/metric/explorer?live=false&page=0&is_auto=false&from_ts=1483596000000&to_ts=1514786399999&tile_size=m&exp_metric=elasticsearch.fielddata.size&exp_scope=environment%3Aprod&exp_agg=sum&exp_row_type=metric
Updated • 7 years ago (Assignee)
Assignee: nobody → bpitts
Comment 2 • 7 years ago (Assignee)
Miles pointed out that the 190GB figure was from before they did some data cleanup in ES, and he thinks a more realistic number is December's high-water mark, which was 130GB.
My thinking is to first limit fielddata to 50% of the heap on the existing 5 nodes, then watch performance [0] once we begin to evict fielddata to see if it's acceptable. If not, add two more nodes and watch performance again.
I want to limit it before adding more nodes since otherwise we'll eventually grow to more than a dozen m4.2xlarge nodes based on that 130GB figure. It would be nice to determine that's really necessary before doing it. I also want to limit it because I think the failure mode of "some queries get slower" is preferable to "some queries stop executing."
[0] https://app.datadoghq.com/dash/466613/socorro-elasticsearch--new-prod?live=true&page=0&is_auto=false&from_ts=1522856552335&to_ts=1522942952335&tile_size=m tracks query latency, and I added fielddata evictions to it.
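A rough sketch of what "watch for evictions" could look like outside of Datadog, assuming the same hypothetical endpoint and Python `requests`; it just polls the cluster-wide eviction counter and prints the per-interval delta:

```python
# Sketch only: poll the cluster-wide fielddata eviction counter once a minute
# and print how many evictions happened in each interval. Endpoint and
# polling interval are placeholders.
import time
import requests

ES = "http://localhost:9200"   # hypothetical cluster endpoint

def total_evictions():
    stats = requests.get(ES + "/_nodes/stats/indices").json()
    return sum(node["indices"]["fielddata"]["evictions"]
               for node in stats["nodes"].values())

last = total_evictions()
while True:
    time.sleep(60)
    current = total_evictions()
    print("fielddata evictions in the last minute:", current - last)
    last = current
```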
Comment 3 • 7 years ago
All of the literature about evictions basically says "Don't let this happen, it's baaad." [0] So my inclination is that we shouldn't rely on evictions as part of our regular flow.
What do you think about changing the instance type? m4.2xlarge=>r4.2xlarge [1] is a relatively cheap transition for doubling the cluster's available memory. It's a PITA for sure, but I think it's the right decision long-term.
[0] https://www.elastic.co/guide/en/elasticsearch/guide/1.x/_limiting_memory_usage.html
[1] https://ec2instances.info/?selected=m4.2xlarge,r4.2xlarge
Comment 4 • 7 years ago (Assignee)
I'm not opposed to increasing node size. We could either do a rolling restart or spin up a new cluster and backup/restore to it.
Fielddata is now limited to 50% of heap. My theory is that if people's searches perform acceptably when no fielddata is loaded (like after we restart ES), they'll still perform acceptably when enough is loaded that some is being evicted. If that proves untrue, I'll reopen this.
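For the record, the rolling-restart path is roughly the standard "pause allocation, bounce one node, resume, wait for green" loop. A sketch, assuming the same hypothetical endpoint, Python `requests`, and placeholder node names; the actual restart step is whatever our deployment tooling does:

```python
# Sketch only: semi-automated rolling restart. Pause shard allocation, let an
# operator restart one node, resume allocation, wait for green, repeat.
# Endpoint and node names are placeholders.
import requests

ES = "http://localhost:9200"   # hypothetical cluster endpoint
NODES = ["es-1", "es-2", "es-3", "es-4", "es-5"]   # hypothetical node names

def set_allocation(mode):
    # "none" pauses shard reallocation while a node is down; "all" resumes it.
    requests.put(ES + "/_cluster/settings",
                 json={"transient": {"cluster.routing.allocation.enable": mode}})

def wait_for_green():
    requests.get(ES + "/_cluster/health?wait_for_status=green&timeout=30m")

for node in NODES:
    set_allocation("none")
    input("Restart Elasticsearch on %s (picking up the new elasticsearch.yml), "
          "then press Enter... " % node)
    set_allocation("all")
    wait_for_green()
```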
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 5 • 7 years ago
We ran into issues on Monday (4/16) when garbage collections triggered by fielddata evictions caused significant timeouts between nodes within the ES cluster and from the webapp and processor querying the ES cluster.
As a stop-gap solution, I bumped the cluster size up from 5 to 7 nodes as discussed above.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 6 • 7 years ago (Assignee)
Closing this in favor of Bug 1465552
Updated • 7 years ago (Assignee)
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED