Bug 1465552 - evaluate stability of prod elasticsearch
Opened 7 years ago; closed 6 years ago
Categories: Socorro :: Infra (task)
Status: RESOLVED FIXED
Reporter: miles; Assignee: brian
Attachments: 1 file (366.21 KB, image/png)
We've had several instances of flapping monitors for JVM heap utilization and CPU utilization.
We previously discussed concerns about this in bug 1451449.
Looking at Sentry [0], we see ~80 requests whose connections to ES timed out during the window in which we saw the ES heap/CPU issues.
We also saw fielddata evictions in that window, which likely account for the heap and CPU behavior. The pattern is: we alert when we're close to hitting the fielddata circuit breaker, then we hit the circuit breaker, heap utilization drops, CPU spikes momentarily while the cluster adjusts, and we drop a few requests.
This happens infrequently, when large queries stack up. Hitting the fielddata circuit breaker is not ideal, but we've accepted it as part of our reality given the amount of data we're querying and the newly downsized cluster.
One possible solution is to tune down the monitoring around the cluster and let it handle evictions on its own. Another is to keep adding nodes to the cluster, particularly resized instances with more memory: feed the cluster RAM.
This bug covers the relevant discussion.
[0] https://sentry.prod.mozaws.net/operations/socorro-new-prod/searches/1385/?statsPeriod=14d
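As a reference point for the behavior described above, here is a minimal sketch of checking the per-node fielddata circuit breaker through the standard _nodes/stats/breaker API. ES_URL is a placeholder rather than the actual prod endpoint, and the script is illustrative, not part of our tooling.

    # Sketch: per-node fielddata breaker usage and trip counts.
    import requests

    ES_URL = "http://localhost:9200"  # placeholder; point at the real cluster

    stats = requests.get(f"{ES_URL}/_nodes/stats/breaker", timeout=10).json()
    for node_id, node in stats["nodes"].items():
        fd = node["breakers"]["fielddata"]
        used = fd["estimated_size_in_bytes"]
        limit = fd["limit_size_in_bytes"]
        print(
            f"{node.get('name', node_id)}: "
            f"fielddata at {used / limit:.0%} of the breaker limit, "
            f"tripped {fd['tripped']} times"
        )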
Comment 1 • 7 years ago (Assignee)
In May we had five heap alert warnings that quickly cleared. These were all from host i-050ddd453c4f78ac5. It had slightly more heap usage than the other hosts, but much more frequent old GC. I restarted it on the 28th to deal with that.
https://app.datadoghq.com/metric/explorer?live=true&page=0&is_auto=false&from_ts=1525112607935&to_ts=1527704607935&tile_size=m&exp_metric=system.cpu.user%2Cjvm.gc.collectors.old.count%2Cjvm.mem.heap_in_use&exp_scope=app%3Asocorro%2Cenv%3Aprod%2Ctype%3Aelasticsearch&exp_group=host&exp_agg=avg&exp_row_type=metric
The main troubling thing is whatever happened on the 28th; I'm working on a postmortem for that now. My preliminary conclusion is that, while the high heap usage on the prod cluster didn't help, the primary problem is an ES query that consumes all search threads for multiple minutes.
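For the record, the per-node comparison above (heap usage plus old-GC frequency) can also be pulled directly from the node stats API. A rough sketch, with ES_URL again a placeholder:

    # Sketch: heap usage and old-generation GC activity per node.
    import requests

    ES_URL = "http://localhost:9200"  # placeholder

    jvm = requests.get(f"{ES_URL}/_nodes/stats/jvm", timeout=10).json()
    for node_id, node in jvm["nodes"].items():
        mem = node["jvm"]["mem"]
        old_gc = node["jvm"]["gc"]["collectors"]["old"]
        print(
            f"{node.get('name', node_id)}: "
            f"heap {mem['heap_used_percent']}%, "
            f"old GC runs {old_gc['collection_count']}, "
            f"old GC time {old_gc['collection_time_in_millis']} ms"
        )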
Comment 2 • 7 years ago (Assignee)
"Postmortem" notes at https://docs.google.com/document/d/1W_w0rdZoGNJd7hxmCL5DcZyvL_7nUJ9DBJAkQOz3Stg/edit
Comment 3 • 7 years ago (Assignee)
I added two new nodes a month ago. Low-water memory usage on the two new nodes has stayed roughly 10 percentage points lower than on the older five nodes. I plan to perform a rolling restart of the older five nodes to drop their memory usage to the same level as the new two.
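For context, the rolling restart follows the usual disable-allocation / restart / re-enable / wait-for-green sequence. A rough sketch of one iteration is below; ES_URL is a placeholder, and the actual node restart (instance cycling, service restart, etc.) is environment-specific and omitted.

    # Sketch: one step of a rolling restart.
    import requests

    ES_URL = "http://localhost:9200"  # placeholder

    def set_allocation(enabled: bool) -> None:
        # cluster.routing.allocation.enable is a standard dynamic setting
        value = "all" if enabled else "none"
        requests.put(
            f"{ES_URL}/_cluster/settings",
            json={"transient": {"cluster.routing.allocation.enable": value}},
            timeout=10,
        ).raise_for_status()

    def wait_for_green() -> None:
        requests.get(
            f"{ES_URL}/_cluster/health",
            params={"wait_for_status": "green", "timeout": "30m"},
            timeout=None,
        ).raise_for_status()

    set_allocation(False)   # keep shards in place while the node is down
    # ... restart one node out-of-band here ...
    set_allocation(True)    # allow shard recovery/rebalancing again
    wait_for_green()        # block until the cluster has fully recovered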
Comment 4 • 7 years ago (Assignee)
heap usage stats per node during four hours of normal traffic
Comment 5 • 7 years ago (Assignee)
We're adding two more nodes, since some nodes were seeing sustained heap usage > 75%.
Comment 6 • 7 years ago (Assignee)
I think allocating 50% of the heap for fielddata is just too much for these instances with a 16 GB heap. Reducing that limit to 40% is easier than changing the instance size at this time.
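If that 50% figure is the fielddata circuit breaker limit, it can be lowered dynamically through the cluster settings API; if it is indices.fielddata.cache.size instead, that setting is static and has to change in elasticsearch.yml plus a restart. A hedged sketch of the dynamic case, with ES_URL a placeholder:

    # Sketch: lower the fielddata circuit breaker limit to 40% of heap.
    import requests

    ES_URL = "http://localhost:9200"  # placeholder

    resp = requests.put(
        f"{ES_URL}/_cluster/settings",
        json={"persistent": {"indices.breaker.fielddata.limit": "40%"}},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())  # echoes the accepted settings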
Comment 7 • 6 years ago (Assignee)
We haven't had any issues for a couple of months now.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Comment 8 • 6 years ago (Assignee)
This was fixed for good in bug 1323033.