Closed Bug 972236 Opened 10 years ago Closed 9 years ago

Tune memory boundary for Bugs ES clusters

Categories

(bugzilla.mozilla.org :: Infrastructure, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WORKSFORME

People

(Reporter: dmaher, Unassigned)

References

Details

(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/172])

As noted in bug 968318, the upper memory boundary for the ES JVM on the Bugs clusters needs to be adjusted down to something reasonable.  I would suggest starting at 50% of total system RAM and seeing how that feels; for reference, the "generic" IT-managed ES clusters are set at 66%.
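
To make the suggestion concrete, here is a hypothetical Puppet sketch (not a reviewed change) that follows the same inline_template pattern already used for these hosts, capping the heap at roughly half of the memorysize fact:

  # Hypothetical sketch only: cap the ES JVM heap at ~50% of total system RAM.
  # Assumes @memorysize is the usual Facter string, e.g. "7.69 GB".
  $es_max_mem = inline_template('<%= @memorysize =~ /^(\d+)/; (($1.to_i * 1024) / 2).to_i %>m')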

This will require the JVMs to be restarted, which effectively translates to a cluster restart, so this will likely need to be co-ordinated with the interested parties on A-Team (et al.).
See Also: → 968318
As per :ekyle, this needs to be co-ordinated with :harsha.
09:43:30 < ekyle> phrawzty: you can bounce the cluster anytime that i good for harsha (in #metrics)
Flags: needinfo?(schintalapani)
Just for reference, the following nodes are alerting:
Fri 09:44:13 PST [5435] elasticsearch5.bugs.scl3.mozilla.com:color - Elasticsearch is WARNING: Elasticsearch Health is Yellow

Fri 09:44:13 PST [5437] elasticsearch6.bugs.scl3.mozilla.com:color - Elasticsearch is WARNING: Elasticsearch Health is Yellow
Add elasticsearch4.bugs.scl3.mozilla.com to the list.
It appears the yellow status is caused by the cluster *thinking* there are four nodes (or there were four nodes, or there was another node by a different name). This can be caused by the nodes being bounced individually AND there being no fixed name for each node. The ES config files let you specify the node names and the cluster members explicitly to avoid this happening.
(In reply to Kyle Lahnakoski [:ekyle] from comment #4)
> It appears the yellow status is caused by the cluster *thinking* there
> are four nodes (or there were four nodes, or there was another node by
> a different name). This can be caused by the nodes being bounced
> individually AND there being no fixed name for each node. The ES config
> files let you specify the node names and the cluster members explicitly
> to avoid this happening.

The colours are actually quite easy to define:
Green: All shards of all indices are completely available.
Yellow: All indices are available, but some shards are not (replicas).
Red: One or more indices are not completely available.

The *reasons* for the colours are much more varied.  The explanation you've provided above is one potential scenario, but it is not the case for the Bugs ES clusters, as those machines have fully deterministic node names based on their hostname and data centre location.
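
For reference only, a hypothetical sketch of how explicit node names and cluster membership could be expressed as a hash in the Puppet manifest (the $es_config name, the 'bugs' cluster name, and the host list are illustrative assumptions, and the module would still need to render the hash into elasticsearch.yml):

  # Hypothetical sketch: explicit node name and unicast-only discovery (ES 0.90 settings).
  # $es_config, the 'bugs' cluster name, and the host list are placeholders.
  $es_config = {
    'cluster.name'                         => 'bugs',
    'node.name'                            => $::fqdn,  # hostname plus data centre, e.g. elasticsearch4.bugs.scl3.mozilla.com
    'discovery.zen.ping.multicast.enabled' => false,
    'discovery.zen.ping.unicast.hosts'     => [
      'elasticsearch1.bugs.scl3.mozilla.com',
      'elasticsearch4.bugs.scl3.mozilla.com',
      'elasticsearch5.bugs.scl3.mozilla.com',
      'elasticsearch6.bugs.scl3.mozilla.com',
    ],
  }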
When we do this, it would be nice to enable the Java GC log as well. That's easy: the packages come with a simple way to enable it via /etc/sysconfig, just like we're managing the mem size. Here's my untested patch that does this for the staging cluster (the same change applies to prod):

--- bugzilla.pp	(revision 82580)
+++ bugzilla.pp	(working copy)
@@ -901,6 +901,7 @@

   # ES container can consume up to 90% of total RAM.
   $es_max_mem = inline_template('<%= @memorysize =~ /^(\d+)/; val = ( ( $1.to_i * 1024) / 1.05 ).to_i %>m')
+  $gc_logging = 1  # separate variable; $logging below already holds the logging.yml hash
   $ver = '0.90.10-2.el6'
   $javapackage = 'java-1.7.0-openjdk'
   $logging = { 'file' => 'logging.yml' }
@@ -908,7 +909,8 @@
   $sysconfig = {
     'file' => 'sysconfig.es',
     'var' => {
-      'ES_MAX_MEM' => $es_max_mem
+      'ES_MAX_MEM' => $es_max_mem,
+      'ES_USE_GC_LOGGING' => $gc_logging,
     }
   }


It might also be worthwhile to set ES_MIN_MEM too... right now it starts at 256MB and grows to the size specified in ES_MAX_MEM. However, if we expect it to grow to that size quickly anyway, then it's more efficient to just start there; there's less pausing to grow the heap on demand. I've also seen behavior with some JVMs (Sun JVM 6 for sure, but I haven't checked 7) where, if you let the heap grow like this, the size of the "new" generation doesn't grow proportionally... it stays fixed at the same size it was when the process started. This can lead to excessive garbage collections because the new generation fills up quickly.

You can set the min and max size to be the same by setting ES_HEAP_SIZE instead of the two settings individually. The GC logs will help us know for sure if there's any value here, but I'd bet money that we'd be at least slightly better off doing this. :)

We can also set the size of the "new" generation directly, but I recommend not doing that until we've had a chance to do some analysis on the garbage collector's efficiency... which we need the GC logs for. :)
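
Purely as a sketch of that idea (assuming the sysconfig template simply emits one VAR=value line per hash entry, as the existing ES_MAX_MEM handling suggests), the hash above could become:

  # Hypothetical sketch: ES_HEAP_SIZE sets -Xms and -Xmx to the same value,
  # replacing the separate ES_MIN_MEM/ES_MAX_MEM knobs; GC logging stays on.
  $sysconfig = {
    'file' => 'sysconfig.es',
    'var'  => {
      'ES_HEAP_SIZE'      => $es_max_mem,
      'ES_USE_GC_LOGGING' => $gc_logging,
    }
  }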
I have done a bit of tuning... min mem is set to 25%, max is set to 75%. However, the math is slightly wrong (it uses integer math on the total amount of system memory, so it gets 7 GB instead of 7.69 GB). The net effect is that these values are really closer to 15% and 65%, respectively.
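
To illustrate the rounding problem only (a hypothetical sketch, not the change that was deployed), keeping the fractional part of the memorysize fact avoids truncating 7.69 GB down to 7 GB before the percentages are applied:

  # Hypothetical sketch: parse "7.69" rather than "7" from the memorysize fact,
  # then take 25% / 75% of the full amount (values end up in MB).
  $es_min_mem = inline_template('<%= @memorysize =~ /^([\d.]+)/; ($1.to_f * 1024 * 0.25).to_i %>m')
  $es_max_mem = inline_template('<%= @memorysize =~ /^([\d.]+)/; ($1.to_f * 1024 * 0.75).to_i %>m')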

I also turned on the GC logging. However, this turns out not to work out of the box: the init script that launches ES does not pass that variable along, even though the downstream scripts look for it. I hacked up the init script on elasticsearch1.bugs.scl3 to do so, so that node is now creating the log file, but none of the others are.

In a few days we can analyze the file and see where we want to go from here.
Jake,
     I ran 3 months of data export from the ES instances and didn't have any issues, unlike previous times.
So far it looks good to me.
Thanks,
Harsha
Flags: needinfo?(schintalapani)
Harsha, 

There are still issues (https://bugzilla.mozilla.org/show_bug.cgi?id=976408), but looking at the date/time of Jake's comment here, it seems to have affected the ETL jobs.
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/172]
Not sure if there's anything more we can do here or just add RAM/nodes, but in any case things have changed a bit since the last comment and this has a new home. :)
Assignee: server-ops-webops → nobody
Component: WebOps: IT-Managed Tools → Infrastructure
Product: Infrastructure & Operations → bugzilla.mozilla.org
QA Contact: nmaul → mcote
I'm closing out some old bugs of mine - can this one be speedily resolved in some way?
We haven't had memory issues on those boxes in forever.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME