We're getting a major slowdown in Elasticsearch and it's causing tests to time out again: https://ci.mozilla.org/job/amo-master/4963/console This has happened before, and jbalogh's diagnosis was that ES does not like how we treat it during tests (maybe the setup/teardown stuff). One possible workaround for now is to restart it periodically.
How much memory/CPU is Elasticsearch taking up? This seemed to be a never-ending garbage collection issue.
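One quick way to check both questions (a sketch, not the exact commands run on the CI host; the bootstrap class name used to find the PID can vary by ES version):

```shell
# Find the Elasticsearch JVM and sample its memory, CPU, and GC activity.
ES_PID=$(pgrep -f org.elasticsearch.bootstrap | head -1)
ps -o pid,rss,%cpu,etime -p "$ES_PID"   # resident memory, CPU, uptime
jstat -gcutil "$ES_PID" 5000            # GC stats, sampled every 5 seconds
```

If `jstat` shows old-gen occupancy pinned near 100% with full-GC counts climbing, that matches the "never-ending garbage collection" theory.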
Shyam, can you check this out?
Assignee: server-ops → shyam
See also bug 704526. Here is our build trend, in which a pattern might be emerging as to how long it takes before ES dies: https://ci.mozilla.org/job/amo-master/buildTimeTrend (the 9-hour builds are where ES croaks)
Created attachment 580379 [details] Config diff for elasticsearch.in.sh (the flags which control the JVM are defined here) It would seem the JVM is running out of memory. The attached patch enables logging of GC events. (It uses /var/log/elasticsearch/gc.log.XXXXXXXX as the log destination.) --
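The attachment itself isn't reproduced in this thread, but a diff of this kind typically adds standard HotSpot GC-logging flags to elasticsearch.in.sh. The exact flags in the patch may differ; the XXXXXXXX suffix is the per-start placeholder from the comment above, left as-is:

```shell
# Hypothetical sketch of GC logging flags (standard HotSpot options);
# the actual patch may use a different set of flags.
JAVA_OPTS="$JAVA_OPTS -verbose:gc"
JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
JAVA_OPTS="$JAVA_OPTS -Xloggc:/var/log/elasticsearch/gc.log.XXXXXXXX"
```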
IIUC this is a single-node "ES cluster", so number_of_replicas should be set to 0 (it is currently set to 1). --
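As a sketch, the change in elasticsearch.yml would look like this:

```yaml
# With only one node there is no second machine to hold replica copies,
# so replica shards just sit unassigned; 0 is the right value here.
index.number_of_replicas: 0
```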
Next step is to limit the size of the thread pools used for search and indexing (the default is an ever-increasing pool of threads; prior to the last restart, it was at ~4000 threads). --
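A sketch of what bounding the pools might look like in elasticsearch.yml for ES of this era; the sizes below are placeholders, not the values actually deployed:

```yaml
# "fixed" pools cap the thread count and queue excess work instead of
# spawning new threads without bound (sizes are illustrative).
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100
threadpool.index.type: fixed
threadpool.index.size: 20
threadpool.index.queue_size: 100
```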
List of config changes (yet to be synced with the Puppet manifest):
- set HEAP_MIN = HEAP_MAX (/opt/elasticsearch/bin/elasticsearch.in.sh)
- enable GC logging etc. (/opt/elasticsearch/bin/elasticsearch.in.sh)
- set number_of_replicas=0 (/opt/elasticsearch/config/elasticsearch.yml.new)
- limit number of threads (/opt/elasticsearch/config/elasticsearch.yml.new)

In case the ES daemon needs to be restarted before these changes are merged into the Puppet manifest, please copy /opt/elasticsearch/config/elasticsearch.yml.new to /opt/elasticsearch/config/elasticsearch.yml before restarting the daemon. --
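The heap item can be sketched as an elasticsearch.in.sh fragment. The variable names follow the comment above and the 1g value is a placeholder, not the actual setting:

```shell
# Pinning the initial heap (-Xms) to the maximum (-Xmx) avoids
# heap-resize pauses under load. Size is illustrative.
HEAP_MIN=1g
HEAP_MAX=1g
JAVA_OPTS="$JAVA_OPTS -Xms$HEAP_MIN -Xmx$HEAP_MAX"
```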
We're hitting a lot of timeouts today (with some new exceptions this time too). So is sumo https://ci.mozilla.org/job/amo-master/buildTimeTrend https://ci.mozilla.org/job/sumo-master/1255/console Can someone take a look at ES in CI?
both SUMO and AMO test suites are blocked on this
Severity: normal → major
tmary: If I understand your comments correctly, your /opt/elasticsearch/config/elasticsearch.yml.new got knocked out by Puppet. I can implement your changes, but I don't know what the syntax for #4 in comment 7 should be. Otherwise your patch is ready to roll; all I have to do is commit.
(In reply to Kumar McMillan [:kumar] from comment #8)
> We're hitting a lot of timeouts today (with some new exceptions this time
> too). So is sumo
>
> https://ci.mozilla.org/job/amo-master/buildTimeTrend
> https://ci.mozilla.org/job/sumo-master/1255/console
>
> Can someone take a look at ES in CI?

A typo in the config (120 instead of 120s) caused these exceptions ("ElasticSearchException: RejectedExecutionException[Rejected execution after waiting 120 ms for task [class org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1] to be executed.]") - fixed. --
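For illustration: duration settings in elasticsearch.yml need an explicit unit suffix, and a bare number is interpreted as milliseconds, which is what produced the 120 ms rejections above. The setting name below is illustrative, not the exact line from the config:

```yaml
# Wrong: a bare 120 is read as 120 milliseconds
# threadpool.index.wait_time: 120
# Right: explicit unit suffix, 120 seconds
threadpool.index.wait_time: 120s
```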
(In reply to Rick Bryce [:rbryce] from comment #10)
> tmary:
>
> If I understand your comments correctly. Your
> /opt/elasticsearch/config/elasticsearch.yml.new got knocked out by puppet.
> I can implement your changes but I don't know what the syntax for #4 in
> Comment 7 should be. Otherwise your patch is ready to roll, all I have to
> do is commit.

Config changes for thread pool, queue sizing, etc. are part of elasticsearch.conf.new --
One or more of the *amo* tests seem to POST data (to /_bulk) whose size increases with every request within the same build (e.g. build #5093). In this case, it grew to about 5 MB. POST data (JSON) from one such request is available at ssh://people.mozilla.org:/tmp/payload.111221.1.json --
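That growth pattern is consistent with test state accumulating between requests. As a hypothetical illustration (not the actual zamboni test code): a _bulk body is one action line plus one source line per document, so a payload built from a list that is never reset in teardown grows on every request:

```python
import json

def bulk_payload(docs, index="addons", doc_type="addon"):
    """Build an Elasticsearch _bulk body: action line + source line per doc."""
    lines = []
    for i, doc in enumerate(docs):
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": i}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# If a suite appends to shared state instead of resetting it in teardown,
# each successive request's payload is strictly larger than the last:
accumulated = []
sizes = []
for run in range(3):
    accumulated.append({"name": "addon-%d" % run})
    sizes.append(len(bulk_payload(accumulated)))
assert sizes[0] < sizes[1] < sizes[2]
```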
Alright, I've been out of the loop for far too long on this one. What's the plan moving forward? tmary? kumar?
(In reply to Shyam Mani [:fox2mike] from comment #14) > Alright, I've been out of the loop for far too long on this one. What's the > plan moving forward? tmary? kumar? Need info on the large requests (https://bugzilla.mozilla.org/show_bug.cgi?id=706944#c13) --
SUMO has been doing pretty well--we haven't had our tests fail from ES paging out since December 21st. We've had 20 runs since then.
https://github.com/mozilla/zamboni/commit/804be6f Things should be good now. Please let me know if this issue appears resolved (or not).
All good here? Any more issues?
We're doing pretty well with ES, Jenkins and SUMO tests. Haven't seen an ES-related test failure in a while now. Thumbs-up from the SUMO team.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations