Closed
Bug 706944
Opened 13 years ago
Closed 13 years ago
Elastic Search is timing out on ci.mozilla.org - webdev Jenkins
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: kumar, Assigned: fox2mike)
References
Details
Attachments
(1 file)
We're getting a major slowdown on Elastic Search and it's causing tests to time out again: https://ci.mozilla.org/job/amo-master/4963/console
This has happened before and jbalogh's diagnosis was that ES does not like how we treat it during tests (maybe the setup/teardown stuff). One possible workaround for now is to restart it periodically.
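No specific mechanism for the periodic restart is given here; one hedged possibility would be a cron entry on the ES host along these lines (the interval and init-script path are assumptions, not something recorded in this bug):

    # Hypothetical /etc/cron.d entry: restart Elasticsearch every 6 hours
    0 */6 * * * root /etc/init.d/elasticsearch restart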
Comment 1•13 years ago
How much memory/CPU is elasticsearch taking up? This seemed to be a never-ending garbage collection issue.
Reporter
Comment 3•13 years ago
See also bug 704526. Here is our build time trend, in which a pattern might be emerging in how long it takes before ES dies: https://ci.mozilla.org/job/amo-master/buildTimeTrend (the 9-hour builds are where ES croaks).
Comment 4•13 years ago
It would seem the JVM is running out of memory. The attached patch enables logging of GC events (it uses /var/log/elasticsearch/gc.log.XXXXXXXX as the log destination).
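The attached patch itself isn't quoted in the bug; a minimal sketch of what enabling GC logging in elasticsearch.in.sh might look like with pre-Java-9 HotSpot flags (the date-style suffix standing in for the XXXXXXXX in the path is an assumption, as are the exact flags):

    # Appended to JAVA_OPTS in /opt/elasticsearch/bin/elasticsearch.in.sh (sketch only)
    JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
    JAVA_OPTS="$JAVA_OPTS -Xloggc:/var/log/elasticsearch/gc.log.$(date +%Y%m%d)"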
Comment 5•13 years ago
IIUC this is a single-node "ES cluster", so number_of_replicas should be set to 0 (it is currently set to 1).
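For reference, that is a one-line change in elasticsearch.yml (index.number_of_replicas is the standard setting; shown here as a sketch rather than the exact line deployed):

    # A single node has nowhere to allocate replica shards, so replicas only add overhead.
    index.number_of_replicas: 0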
Comment 6•13 years ago
Next step is to limit the size of the thread pools used for search and indexing (the default is an ever-growing pool of threads; prior to the last restart it was at ~4000 threads).
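A hedged sketch of what capping the pools could look like in elasticsearch.yml for that era of ES (blocking-style pools; the sizes and wait time below are illustrative, not the values actually used on ci.mozilla.org):

    # Illustrative only: bound the search and index pools instead of letting them grow without limit.
    threadpool:
      search:
        type: blocking
        min: 1
        size: 20
        wait_time: 60s
      index:
        type: blocking
        min: 1
        size: 20
        wait_time: 60s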
Comment 7•13 years ago
List of config changes (yet to be synced with Puppet manifest):
[1] set HEAP_MIN = HEAP_MAX (/opt/elasticsearch/bin/elasticsearch.in.sh; see the sketch at the end of this comment)
[2] enable GC logging etc (/opt/elasticsearch/bin/elasticsearch.in.sh)
[3] set number_of_replicas=0 (/opt/elasticsearch/config/elasticsearch.yml.new)
[4] limit number of threads (/opt/elasticsearch/config/elasticsearch.yml.new)
In case ES daemon needs to be restarted before these changes are merged with Puppet manifest, please copy /opt/elasticsearch/config/elasticsearch.yml.new to /opt/elasticsearch/config/elasticsearch.yml before restarting the daemon.
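For [1], a sketch of what pinning the heap might look like in elasticsearch.in.sh (ES_MIN_MEM/ES_MAX_MEM are the variables the stock startup script of that era reads; the 1g value is purely illustrative):

    # Equal min/max heap avoids resize pauses; 1g is a placeholder, not the CI host's real size.
    ES_MIN_MEM=1g
    ES_MAX_MEM=1g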
Reporter
Comment 8•13 years ago
We're hitting a lot of timeouts today (with some new exceptions this time too), and so is SUMO:
https://ci.mozilla.org/job/amo-master/buildTimeTrend
https://ci.mozilla.org/job/sumo-master/1255/console
Can someone take a look at ES in CI?
Reporter
Comment 9•13 years ago
Both SUMO and AMO test suites are blocked on this.
Severity: normal → major
Comment 10•13 years ago
tmary:
If I understand your comments correctly, your /opt/elasticsearch/config/elasticsearch.yml.new got knocked out by Puppet. I can implement your changes, but I don't know what the syntax for #4 in Comment 7 should be. Otherwise your patch is ready to roll; all I have to do is commit.
Updated•13 years ago
Severity: major → normal
Comment 11•13 years ago
(In reply to Kumar McMillan [:kumar] from comment #8)
> We're hitting a lot of timeouts today (with some new exceptions this time
> too). So is sumo
>
> https://ci.mozilla.org/job/amo-master/buildTimeTrend
> https://ci.mozilla.org/job/sumo-master/1255/console
>
> Can someone take a look at ES in CI?
A typo in the config (120 instead of 120s) caused these exceptions ("ElasticSearchException: RejectedExecutionException[Rejected execution after waiting 120 ms for task [class org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1] to be executed.]"); this is now fixed.
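The bug doesn't spell out which key carried the typo; assuming it was one of the threadpool time values from comment 7, the corrected line in elasticsearch.yml would look roughly like this (key name assumed from the blocking-pool settings of that ES version):

    # A bare "120" gets parsed as milliseconds; the unit suffix makes it 120 seconds.
    threadpool.index.wait_time: 120s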
Comment 12•13 years ago
(In reply to Rick Bryce [:rbryce] from comment #10)
> tmary:
>
> If I understand your comments correctly. Your
> /opt/elasticsearch/config/elasticsearch.yml.new got knocked out by puppet.
> I can implement your changes but I don't know what the syntax for #4 in
> Comment 7 should be. Otherwise your patch is ready to roll, all I have to
> do is commit.
Config changes for the threadpool, queue sizing, etc. are part of elasticsearch.conf.new.
Comment 13•13 years ago
One or more of the *amo* tests seem to POST data (/_bulk) whose size increases with every request within the same build (e.g. build #5093). In this case it grew to about 5 MB. POST data (JSON) from one such request is available at ssh://people.mozilla.org:/tmp/payload.111221.1.json
Assignee
Comment 14•13 years ago
Alright, I've been out of the loop for far too long on this one. What's the plan moving forward? tmary? kumar?
Comment 15•13 years ago
(In reply to Shyam Mani [:fox2mike] from comment #14)
> Alright, I've been out of the loop for far too long on this one. What's the
> plan moving forward? tmary? kumar?
Need info on the large requests (https://bugzilla.mozilla.org/show_bug.cgi?id=706944#c13)
Comment 16•13 years ago
SUMO has been doing pretty well; we haven't had our tests fail from ES paging out since December 21st. We've had 20 runs since then.
Comment 17•13 years ago
https://github.com/mozilla/zamboni/commit/804be6f
Things should be good now. Please let me know if this issue appears resolved (or not).
Assignee
Comment 20•13 years ago
All good here? Any more issues?
Comment 21•13 years ago
We're doing pretty well with ES, Jenkins and SUMO tests. Haven't seen an ES-related test failure in a while now. Thumbs-up from the SUMO team.
Assignee
Updated•13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•12 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard