Closed Bug 972236 Opened 10 years ago Closed 9 years ago

Tune memory boundary for Bugs ES clusters

Categories

(bugzilla.mozilla.org :: Infrastructure, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WORKSFORME

People

(Reporter: dmaher, Unassigned)

References

Details

(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/172])

As noted in bug 968318, the upper memory boundary for the ES JVM on the Bugs clusters needs to be adjusted down to something reasonable.  I would suggest starting at 50% of total system RAM and seeing how that feels; for reference, the "generic" IT-managed ES clusters are set at 66%.
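
To make the suggestion concrete, here is a hypothetical Puppet sketch (not a reviewed change) that follows the same inline_template pattern already used for these hosts, capping the heap at roughly half of the memorysize fact:

  # Hypothetical sketch only: cap the ES JVM heap at ~50% of total system RAM.
  # Assumes @memorysize is the usual Facter string, e.g. "7.69 GB".
  $es_max_mem = inline_template('<%= @memorysize =~ /^(\d+)/; (($1.to_i * 1024) / 2).to_i %>m')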

This will require the JVMs to be restarted, which effectively translates to a cluster restart, so this will likely need to be co-ordinated with the interested parties on A-Team (et al.).
See Also: → 968318
As per :ekyle, this needs to be co-ordinated with :harsha.
09:43:30 < ekyle> phrawzty: you can bounce the cluster anytime that i good for harsha (in #metrics)
Flags: needinfo?(schintalapani)
Just for reference, the following nodes are alerting:
Fri 09:44:13 PST [5435] elasticsearch5.bugs.scl3.mozilla.com:color - Elasticsearch is WARNING: Elasticsearch Health is Yellow

Fri 09:44:13 PST [5437] elasticsearch6.bugs.scl3.mozilla.com:color - Elasticsearch is WARNING: Elasticsearch Health is Yellow
Add elasticsearch4.bugs.scl3.mozilla.com to the list.
It appears the yellow status is caused by the cluster *thinking* there are four nodes (or there were four nodes, or there was another node by a different name). This can be caused by the nodes being bounced individually AND there being no fixed name for each node. The ES config files let you specify the node names and the cluster members explicitly to avoid this happening.
(In reply to Kyle Lahnakoski [:ekyle] from comment #4)
> It appears the yellow status is caused by the cluster *thinking* there
> are four nodes (or there were four nodes, or there was another node by
> a different name). This can be caused by the nodes being bounced
> individually AND there being no fixed name for each node. The ES config
> files let you specify the node names and the cluster members explicitly
> to avoid this happening.

The colours are actually quite easy to define:
Green: All shards of all indices are completely available.
Yellow: All indices are available, but some shards are not (replicas).
Red: One or more indices are not completely available.

The *reasons* for the colours are much more varied.  The explanation you've provided above is one potential scenario, but it is not the case for the Bugs ES clusters, as those machines have fully deterministic node names based on their hostname and data centre location.
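
For reference only, a hypothetical sketch of how explicit node names and cluster membership could be expressed as a hash in the Puppet manifest (the $es_config name, the 'bugs' cluster name, and the host list are illustrative assumptions, and the module would still need to render the hash into elasticsearch.yml):

  # Hypothetical sketch: explicit node name and unicast-only discovery (ES 0.90 settings).
  # $es_config, the 'bugs' cluster name, and the host list are placeholders.
  $es_config = {
    'cluster.name'                         => 'bugs',
    'node.name'                            => $::fqdn,  # hostname plus data centre, e.g. elasticsearch4.bugs.scl3.mozilla.com
    'discovery.zen.ping.multicast.enabled' => false,
    'discovery.zen.ping.unicast.hosts'     => [
      'elasticsearch1.bugs.scl3.mozilla.com',
      'elasticsearch4.bugs.scl3.mozilla.com',
      'elasticsearch5.bugs.scl3.mozilla.com',
      'elasticsearch6.bugs.scl3.mozilla.com',
    ],
  }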
When we do this, it would be nice to enable the Java GC log as well. That's easy: the packages come with a simple way to enable it via /etc/sysconfig, just like we're managing the mem size. Here's my untested patch that does this for the staging cluster (the same change applies to prod):

--- bugzilla.pp	(revision 82580)
+++ bugzilla.pp	(working copy)
@@ -901,6 +901,7 @@

   # ES container can consume up to 90% of total RAM.
   $es_max_mem = inline_template('<%= @memorysize =~ /^(\d+)/; val = ( ( $1.to_i * 1024) / 1.05 ).to_i %>m')
+  $gc_logging = 1  # separate variable; $logging below already holds the logging.yml hash
   $ver = '0.90.10-2.el6'
   $javapackage = 'java-1.7.0-openjdk'
   $logging = { 'file' => 'logging.yml' }
@@ -908,7 +909,8 @@
   $sysconfig = {
     'file' => 'sysconfig.es',
     'var' => {
-      'ES_MAX_MEM' => $es_max_mem
+      'ES_MAX_MEM' => $es_max_mem,
+      'ES_USE_GC_LOGGING' => $gc_logging,
     }
   }


It might also be worthwhile to set ES_MIN_MEM too... right now it starts at 256MB and grows to the size specified in ES_MAX_MEM. However, if we expect it to grow to that size quickly anyway, then it's more efficient to just start there; there's less pausing to grow the heap on demand. I've also seen behavior with some JVMs (Sun JVM 6 for sure, but I haven't checked 7) where, if you let the heap grow like this, the size of the "new" generation doesn't grow proportionally... it stays fixed at the same size it was when the process started. This can lead to excessive garbage collections because the new generation fills up quickly.

You can set the min and max size to be the same by setting ES_HEAP_SIZE instead of the two settings individually. The GC logs will help us know for sure if there's any value here, but I'd bet money that we'd be at least slightly better off doing this. :)

We can also set the size of the "new" generation directly, but I recommend not doing that until we've had a chance to do some analysis on the garbage collector's efficiency... which we need the GC logs for. :)
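
Purely as a sketch of that idea (assuming the sysconfig template simply emits one VAR=value line per hash entry, as the existing ES_MAX_MEM handling suggests), the hash above could become:

  # Hypothetical sketch: ES_HEAP_SIZE sets -Xms and -Xmx to the same value,
  # replacing the separate ES_MIN_MEM/ES_MAX_MEM knobs; GC logging stays on.
  $sysconfig = {
    'file' => 'sysconfig.es',
    'var'  => {
      'ES_HEAP_SIZE'      => $es_max_mem,
      'ES_USE_GC_LOGGING' => $gc_logging,
    }
  }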
I have done a bit of tuning... min mem is set to 25%, max is set to 75%. However, the math is slightly wrong (it uses integer math on the total amount of system memory, so it gets 7 GB instead of 7.69 GB). The net effect is that these values are really closer to 15% and 65%, respectively.
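
To illustrate the rounding problem only (a hypothetical sketch, not the change that was deployed), keeping the fractional part of the memorysize fact avoids truncating 7.69 GB down to 7 GB before the percentages are applied:

  # Hypothetical sketch: parse "7.69" rather than "7" from the memorysize fact,
  # then take 25% / 75% of the full amount (values end up in MB).
  $es_min_mem = inline_template('<%= @memorysize =~ /^([\d.]+)/; ($1.to_f * 1024 * 0.25).to_i %>m')
  $es_max_mem = inline_template('<%= @memorysize =~ /^([\d.]+)/; ($1.to_f * 1024 * 0.75).to_i %>m')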

I also turned on the GC logging. However, this turns out not to work out of the box: the init script that launches ES does not pass that variable along, even though the downstream scripts look for it. I hacked up the init script on elasticsearch1.bugs.scl3 to do so, so that node is now creating the log file, but none of the others are.

In a few days we can analyze the file and see where we want to go from here.
Jake,
     I ran 3 months of data export from the ES instances and didn't have any issues, unlike previous times.
So far it looks good to me.
Thanks,
Harsha
Flags: needinfo?(schintalapani)
Harsha, 

There are still issues (https://bugzilla.mozilla.org/show_bug.cgi?id=976408), but looking at the date/time of Jake's comment here, it seems to have affected the ETL jobs.
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/172]
Not sure if there's anything more we can do here or just add RAM/nodes, but in any case things have changed a bit since the last comment and this has a new home. :)
Assignee: server-ops-webops → nobody
Component: WebOps: IT-Managed Tools → Infrastructure
Product: Infrastructure & Operations → bugzilla.mozilla.org
QA Contact: nmaul → mcote
I'm closing out some old bugs of mine - can this one be speedily resolved in some way?
We haven't had memory issues on those boxes in forever.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME