Bug 1465552 - evaluate stability of prod elasticsearch
Opened 7 years ago; closed 6 years ago
Categories: Socorro :: Infra (task)
Status: RESOLVED FIXED
Reporter: miles; Assignee: brian
Attachments: 1 file (366.21 KB, image/png)
We've had several instances of flapping monitors for JVM heap utilization and CPU utilization.
We previously discussed concerns about this in bug 1451449.
Looking at Sentry [0], we see ~80 requests whose connections to ES timed out during the window in which we saw the ES heap/CPU issues.
We also saw fielddata evictions in that window, which likely account for the heap and CPU behavior. The pattern is: we alert when we're close to hitting the fielddata circuit breaker, then we hit the circuit breaker, heap utilization drops, CPU spikes momentarily while the cluster adjusts, and we drop a few requests.
This happens infrequently, when large queries stack up. Hitting the fielddata circuit breaker is not ideal, but we've accepted it as part of our reality given the amount of data we're querying and the newly downsized cluster.
One possible solution is to tune down the monitoring around the cluster and let it handle evictions on its own. Another is to keep adding nodes to the cluster, particularly resized instances with more memory: feed the cluster RAM.
This bug covers the relevant discussion.
[0] https://sentry.prod.mozaws.net/operations/socorro-new-prod/searches/1385/?statsPeriod=14d
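As a reference point for the behavior described above, here is a minimal sketch of checking the per-node fielddata circuit breaker through the standard _nodes/stats/breaker API. ES_URL is a placeholder rather than the actual prod endpoint, and the script is illustrative, not part of our tooling.

    # Sketch: per-node fielddata breaker usage and trip counts.
    import requests

    ES_URL = "http://localhost:9200"  # placeholder; point at the real cluster

    stats = requests.get(f"{ES_URL}/_nodes/stats/breaker", timeout=10).json()
    for node_id, node in stats["nodes"].items():
        fd = node["breakers"]["fielddata"]
        used = fd["estimated_size_in_bytes"]
        limit = fd["limit_size_in_bytes"]
        print(
            f"{node.get('name', node_id)}: "
            f"fielddata at {used / limit:.0%} of the breaker limit, "
            f"tripped {fd['tripped']} times"
        )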
Comment 1 • 7 years ago (Assignee)
In May we had five heap alert warnings that quickly cleared. These were all from host i-050ddd453c4f78ac5. It had slightly more heap usage than the other hosts, but much more frequent old GC. I restarted it on the 28th to deal with that.
https://app.datadoghq.com/metric/explorer?live=true&page=0&is_auto=false&from_ts=1525112607935&to_ts=1527704607935&tile_size=m&exp_metric=system.cpu.user%2Cjvm.gc.collectors.old.count%2Cjvm.mem.heap_in_use&exp_scope=app%3Asocorro%2Cenv%3Aprod%2Ctype%3Aelasticsearch&exp_group=host&exp_agg=avg&exp_row_type=metric
The main troubling thing is whatever happened on the 28th; I'm working on a postmortem for that now. My preliminary conclusion is that, while the high heap usage on the prod cluster didn't help, the primary problem is an ES query that consumes all search threads for multiple minutes.
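For the record, the per-node comparison above (heap usage plus old-GC frequency) can also be pulled directly from the node stats API. A rough sketch, with ES_URL again a placeholder:

    # Sketch: heap usage and old-generation GC activity per node.
    import requests

    ES_URL = "http://localhost:9200"  # placeholder

    jvm = requests.get(f"{ES_URL}/_nodes/stats/jvm", timeout=10).json()
    for node_id, node in jvm["nodes"].items():
        mem = node["jvm"]["mem"]
        old_gc = node["jvm"]["gc"]["collectors"]["old"]
        print(
            f"{node.get('name', node_id)}: "
            f"heap {mem['heap_used_percent']}%, "
            f"old GC runs {old_gc['collection_count']}, "
            f"old GC time {old_gc['collection_time_in_millis']} ms"
        )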
Comment 2 • 7 years ago (Assignee)
"Postmortem" notes at https://docs.google.com/document/d/1W_w0rdZoGNJd7hxmCL5DcZyvL_7nUJ9DBJAkQOz3Stg/edit
Comment 3 • 7 years ago (Assignee)
I added two new nodes a month ago. Low-water memory usage on the two new nodes has stayed roughly 10 percentage points lower than on the older five nodes. I plan to perform a rolling restart of the older five nodes to drop their memory usage to the same level as the new two.
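For context, the rolling restart follows the usual disable-allocation / restart / re-enable / wait-for-green sequence. A rough sketch of one iteration is below; ES_URL is a placeholder, and the actual node restart (instance cycling, service restart, etc.) is environment-specific and omitted.

    # Sketch: one step of a rolling restart.
    import requests

    ES_URL = "http://localhost:9200"  # placeholder

    def set_allocation(enabled: bool) -> None:
        # cluster.routing.allocation.enable is a standard dynamic setting
        value = "all" if enabled else "none"
        requests.put(
            f"{ES_URL}/_cluster/settings",
            json={"transient": {"cluster.routing.allocation.enable": value}},
            timeout=10,
        ).raise_for_status()

    def wait_for_green() -> None:
        requests.get(
            f"{ES_URL}/_cluster/health",
            params={"wait_for_status": "green", "timeout": "30m"},
            timeout=None,
        ).raise_for_status()

    set_allocation(False)   # keep shards in place while the node is down
    # ... restart one node out-of-band here ...
    set_allocation(True)    # allow shard recovery/rebalancing again
    wait_for_green()        # block until the cluster has fully recovered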
Comment 4 • 7 years ago (Assignee)
heap usage stats per node during four hours of normal traffic
Comment 5 • 7 years ago (Assignee)
We're adding two more nodes, since some nodes were seeing sustained heap usage > 75%.
Comment 6 • 7 years ago (Assignee)
I think allocating 50% of the heap for fielddata is just too much for these instances with a 16 GB heap. Reducing that limit to 40% is easier than changing the instance size at this time.
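If that 50% figure is the fielddata circuit breaker limit, it can be lowered dynamically through the cluster settings API; if it is indices.fielddata.cache.size instead, that setting is static and has to change in elasticsearch.yml plus a restart. A hedged sketch of the dynamic case, with ES_URL a placeholder:

    # Sketch: lower the fielddata circuit breaker limit to 40% of heap.
    import requests

    ES_URL = "http://localhost:9200"  # placeholder

    resp = requests.put(
        f"{ES_URL}/_cluster/settings",
        json={"persistent": {"indices.breaker.fielddata.limit": "40%"}},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())  # echoes the accepted settings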
Comment 7 • 6 years ago (Assignee)
We haven't had any issues for a couple of months now.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Comment 8 • 6 years ago (Assignee)
This was fixed for good in bug 1323033.