Closed
Bug 706944
Opened 13 years ago
Closed 12 years ago
Elastic Search is timing out on ci.mozilla.org - webdev Jenkins
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: kumar, Assigned: fox2mike)
References
Details
Attachments
(1 file)
We're getting a major slowdown on Elastic Search and it's causing tests to time out again: https://ci.mozilla.org/job/amo-master/4963/console

This has happened before, and jbalogh's diagnosis was that ES does not like how we treat it during tests (maybe the setup/teardown stuff). One possible workaround for now is to restart it periodically.
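If we go the periodic-restart route, a cron entry along these lines would do it (a sketch only; the schedule and init-script path are assumptions, not something already deployed):

# /etc/cron.d/elasticsearch-restart: bounce ES nightly at 03:00
0 3 * * * root /etc/init.d/elasticsearch restart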
Comment 1•13 years ago
How much memory/CPU is elasticsearch taking up? In the past this has seemed to be a never-ending garbage collection issue.
Reporter
Comment 3•13 years ago
See also bug 704526. Here is our build trend, in which a pattern might be emerging as to how long it takes before ES dies: https://ci.mozilla.org/job/amo-master/buildTimeTrend (the 9-hour builds are where ES croaks).
Comment 4•13 years ago
It would seem the JVM is running out of memory. The attached patch enables logging of GC events (it uses /var/log/elasticsearch/gc.log.XXXXXXXX as the log destination).
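For reference, GC logging in elasticsearch.in.sh typically looks something like this (a sketch; the exact flags in the attached patch may differ):

# Append HotSpot GC logging flags to the JVM options
JAVA_OPTS="$JAVA_OPTS -verbose:gc"
JAVA_OPTS="$JAVA_OPTS -Xloggc:/var/log/elasticsearch/gc.log"
JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"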
Comment 5•13 years ago
IIUC this is a single-node "ES cluster", so number_of_replicas should be set to 0 (it is currently set to 1).
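In elasticsearch.yml that would be something like the following (a sketch; this was the index-level default setting in ES of that era):

# No replicas on a single node: a replica can never be allocated to
# the same node as its primary, so it only adds useless bookkeeping.
index.number_of_replicas: 0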
Comment 6•13 years ago
The next step is to limit the size of the thread pools used for search and indexing; by default the pool grows without bound (prior to the last restart it was at ~4000 threads).
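Something along these lines in elasticsearch.yml (a sketch only; thread-pool setting names varied across early ES releases, and the sizes here are placeholders, not the production values):

# Cap the search and index pools instead of letting them grow unbounded
threadpool:
    search:
        type: fixed
        size: 20
        queue_size: 100
    index:
        type: fixed
        size: 10
        queue_size: 50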
Comment 7•13 years ago
List of config changes (yet to be synced with the Puppet manifest):
[1] set HEAP_MIN = HEAP_MAX (/opt/elasticsearch/bin/elasticsearch.in.sh)
[2] enable GC logging etc. (/opt/elasticsearch/bin/elasticsearch.in.sh)
[3] set number_of_replicas=0 (/opt/elasticsearch/config/elasticsearch.yml.new)
[4] limit number of threads (/opt/elasticsearch/config/elasticsearch.yml.new)

In case the ES daemon needs to be restarted before these changes are merged with the Puppet manifest, please copy /opt/elasticsearch/config/elasticsearch.yml.new to /opt/elasticsearch/config/elasticsearch.yml before restarting the daemon.
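For [1], the stock elasticsearch.in.sh of that era derived -Xms/-Xmx from two environment variables, so pinning them equal avoids heap resizing (a sketch assuming the stock script; the 2g value is a placeholder, not the production setting):

ES_MIN_MEM=2g
ES_MAX_MEM=2g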
Reporter
Comment 8•13 years ago
We're hitting a lot of timeouts today (with some new exceptions this time too), and so is SUMO:
https://ci.mozilla.org/job/amo-master/buildTimeTrend
https://ci.mozilla.org/job/sumo-master/1255/console
Can someone take a look at ES in CI?
Reporter
Comment 9•13 years ago
Both the SUMO and AMO test suites are blocked on this.
Severity: normal → major
Comment 10•13 years ago
tmary: If I understand your comments correctly, your /opt/elasticsearch/config/elasticsearch.yml.new got knocked out by Puppet. I can implement your changes, but I don't know what the syntax for [4] in comment 7 should be. Otherwise your patch is ready to roll; all I have to do is commit.
Updated•13 years ago
Severity: major → normal
Comment 11•13 years ago
(In reply to Kumar McMillan [:kumar] from comment #8)
> We're hitting a lot of timeouts today (with some new exceptions this time
> too), and so is SUMO:
>
> https://ci.mozilla.org/job/amo-master/buildTimeTrend
> https://ci.mozilla.org/job/sumo-master/1255/console
>
> Can someone take a look at ES in CI?

A typo in the config (120 instead of 120s) caused these exceptions - fixed:

"ElasticSearchException: RejectedExecutionException[Rejected execution after waiting 120 ms for task [class org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1] to be executed.]"
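The corrected setting looks something like this (a sketch assuming the blocking thread-pool type of that ES era; the pool name and size are placeholders):

threadpool:
    search:
        type: blocking
        size: 20
        wait_time: 120s    # was "120", which ES parses as 120 ms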
Comment 12•13 years ago
(In reply to Rick Bryce [:rbryce] from comment #10)
> If I understand your comments correctly, your
> /opt/elasticsearch/config/elasticsearch.yml.new got knocked out by Puppet.
> I can implement your changes, but I don't know what the syntax for [4] in
> comment 7 should be.

The config changes for the thread pool, queue sizing, etc. are part of elasticsearch.conf.new.
Comment 13•13 years ago
One or more of the *amo* tests seem to POST data (/_bulk) whose size increases with every request within the same build (e.g. build #5093). In this case it grew to about 5 MB. The POST data (JSON) from one such request is available at ssh://people.mozilla.org:/tmp/payload.111221.1.json
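For context, a _bulk request body is newline-delimited action/source pairs, so a test suite that keeps appending documents to the same payload grows it with every POST. A minimal sketch of such a request (the index, type, and documents here are hypothetical):

curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @- <<'EOF'
{"index":{"_index":"amo","_type":"addon","_id":"1"}}
{"name":"example-addon"}
{"index":{"_index":"amo","_type":"addon","_id":"2"}}
{"name":"another-addon"}
EOF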
Assignee
Comment 14•13 years ago
Alright, I've been out of the loop for far too long on this one. What's the plan moving forward? tmary? kumar?
Comment 15•13 years ago
(In reply to Shyam Mani [:fox2mike] from comment #14)
> Alright, I've been out of the loop for far too long on this one. What's the
> plan moving forward? tmary? kumar?

Need info on the large requests (https://bugzilla.mozilla.org/show_bug.cgi?id=706944#c13).
Comment 16•13 years ago
SUMO has been doing pretty well--we haven't had our tests fail from ES paging out since December 21st. We've had 20 runs since then.
Comment 17•13 years ago
Things should be good now with https://github.com/mozilla/zamboni/commit/804be6f. Please let me know if this issue appears resolved (or not).
Assignee
Comment 20•12 years ago
All good here? Any more issues?
Comment 21•12 years ago
We're doing pretty well with ES, Jenkins and SUMO tests. Haven't seen an ES-related test failure in a while now. Thumbs-up from the SUMO team.
Assignee
Updated•12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard