Elastic Search is timing out on ci.mozilla.org - webdev Jenkins

RESOLVED FIXED

Status

Product: Infrastructure & Operations
Component: WebOps: Other
Opened: 7 years ago
Last updated: 5 years ago

People

(Reporter: kumar, Assigned: fox2mike)

Tracking

Details

Attachments

(1 attachment)

We're getting a major slowdown on Elasticsearch and it's causing tests to time out again: https://ci.mozilla.org/job/amo-master/4963/console

This has happened before, and jbalogh's diagnosis was that ES does not like how we treat it during tests (maybe the setup/teardown stuff). One possible workaround for now is to restart it periodically.
How much memory/CPU is elasticsearch taking up? This seemed to be a never-ending garbage collection issue.

Comment 2

7 years ago
Shyam, can you check this out?
Assignee: server-ops → shyam
See also bug 704526. Here is our build-time trend, in which a pattern may be emerging in how long it takes before ES dies: https://ci.mozilla.org/job/amo-master/buildTimeTrend (the 9 hr builds are where ES croaks).

Comment 4

7 years ago
Created attachment 580379 [details]
Config diff for elasticsearch.in.sh (the flags that control the JVM are defined here)

It seems the JVM is running out of memory. The attached patch enables logging of GC events.
(It uses /var/log/elasticsearch/gc.log.XXXXXXXX as the log destination.)
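The attachment itself isn't visible here, but a change of roughly this shape in elasticsearch.in.sh would produce the described logging. A minimal sketch, assuming the standard HotSpot GC-logging flags of that era and assuming the XXXXXXXX suffix is a date stamp (both assumptions, not the actual diff):

```shell
# Hypothetical sketch of attachment 580379's change to
# /opt/elasticsearch/bin/elasticsearch.in.sh; the flags are standard
# HotSpot options, not copied from the (unseen) attachment.
JAVA_OPTS="${JAVA_OPTS:-}"

# Assumed log destination; the bug only says gc.log.XXXXXXXX.
GC_LOG="/var/log/elasticsearch/gc.log.$(date +%Y%m%d)"

JAVA_OPTS="$JAVA_OPTS -verbose:gc"            # basic GC event logging
JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDetails"    # per-generation heap sizes
JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCTimeStamps" # seconds since JVM start
JAVA_OPTS="$JAVA_OPTS -Xloggc:$GC_LOG"        # write to a file, not stdout

echo "$JAVA_OPTS"
```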

--

Comment 5

6 years ago
IIUC this is a single-node "ES cluster", so number_of_replicas should be set to 0 (it is currently set to 1). With only one node there is nowhere to allocate the replica shards, so they just sit unassigned and leave the cluster in a yellow state.
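For newly created indices this can be pinned in the node config; a minimal sketch (path per comment 7; the setting name is the standard one):

```yaml
# /opt/elasticsearch/config/elasticsearch.yml
# A single node can never host a replica of its own primaries,
# so default new indices to zero replicas.
index.number_of_replicas: 0
```

Indices that already exist keep their old value; those can be updated live through the index settings API, e.g. `curl -XPUT localhost:9200/_settings -d '{"index": {"number_of_replicas": 0}}'`.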

--

Comment 6

6 years ago
Next step is to limit the size of the thread pools used for search and indexing (the default is an ever-growing pool of threads; prior to the last restart it had reached ~4000 threads).
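In 0.x-era Elasticsearch the thread-pool module offered bounded pool types (such as `blocking`) as alternatives to the unbounded `cached` default. A sketch of what such a cap could look like in elasticsearch.yml; the setting names follow that module, but the sizes and wait times are illustrative guesses, not the values actually deployed:

```yaml
# Cap the search and index pools instead of the unbounded cached default.
threadpool:
  search:
    type: blocking
    size: 20         # illustrative; tune to the box
    wait_time: 60s   # how long a request waits for a free thread
  index:
    type: blocking
    size: 10
    wait_time: 60s
```

With `blocking` pools, a request that cannot get a thread within `wait_time` is rejected instead of spawning yet another thread.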

--

Comment 7

6 years ago
List of config changes (yet to be synced with the Puppet manifest):

[1] set HEAP_MIN = HEAP_MAX (/opt/elasticsearch/bin/elasticsearch.in.sh)
[2] enable GC logging, etc. (/opt/elasticsearch/bin/elasticsearch.in.sh)
[3] set number_of_replicas=0 (/opt/elasticsearch/config/elasticsearch.yml.new)
[4] limit the number of threads (/opt/elasticsearch/config/elasticsearch.yml.new)

In case the ES daemon needs to be restarted before these changes are merged into the Puppet manifest, please copy /opt/elasticsearch/config/elasticsearch.yml.new over /opt/elasticsearch/config/elasticsearch.yml before restarting the daemon.
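Change [1] can be sketched as follows. In old elasticsearch.in.sh the heap is driven by ES_MIN_MEM/ES_MAX_MEM, and pinning them equal avoids pauses from the JVM growing or shrinking the heap; the 1g figure is illustrative, not the value used on ci:

```shell
# Sketch of change [1] in /opt/elasticsearch/bin/elasticsearch.in.sh.
ES_MIN_MEM=1g           # illustrative size, not the production value
ES_MAX_MEM=$ES_MIN_MEM  # pin max to min so -Xms == -Xmx

JAVA_OPTS="${JAVA_OPTS:-} -Xms${ES_MIN_MEM} -Xmx${ES_MAX_MEM}"
echo "$JAVA_OPTS"
```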

--
We're hitting a lot of timeouts today (with some new exceptions this time, too). So is SUMO:

https://ci.mozilla.org/job/amo-master/buildTimeTrend
https://ci.mozilla.org/job/sumo-master/1255/console

Can someone take a look at ES in CI? Both the SUMO and AMO test suites are blocked on this.
Severity: normal → major
tmary:

If I understand your comments correctly, your /opt/elasticsearch/config/elasticsearch.yml.new got knocked out by Puppet. I can implement your changes, but I don't know what the syntax for #4 in comment 7 should be. Otherwise your patch is ready to roll; all I have to do is commit.
Severity: major → normal
(In reply to Kumar McMillan [:kumar] from comment #8)
> We're hitting a lot of timeouts today (with some new exceptions this time
> too). So is sumo
> 
> https://ci.mozilla.org/job/amo-master/buildTimeTrend
> https://ci.mozilla.org/job/sumo-master/1255/console
> 
> Can someone take a look at ES in CI?

A typo in the config (120 instead of 120s) caused these exceptions ("ElasticSearchException: RejectedExecutionException[Rejected execution after waiting 120 ms for task [class org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1] to be executed.]"); fixed.
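The unit suffix matters because Elasticsearch parses a bare number in a time-valued setting as milliseconds, which is exactly what the exception shows ("after waiting 120 ms"). Assuming the affected setting was a blocking pool's wait_time (an assumption; the bug doesn't name the setting), the fix looks like:

```yaml
# Wrong: a bare 120 is read as 120 milliseconds, so requests were
# rejected almost immediately once all threads were busy:
#   threadpool.index.wait_time: 120
# Right: explicit unit suffix
threadpool.index.wait_time: 120s
```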

--
(In reply to Rick Bryce [:rbryce] from comment #10)
> tmary:
> 
> If I understand your comments correctly.  Your
> /opt/elasticsearch/config/elasticsearch.yml.new got knocked out by puppet. 
> I can implement your changes but I don't know what the syntax for #4 in
> Comment 7 should be.  Otherwise your patch is ready to roll, all I have to
> do is commit.

Config changes for thread pools, queue sizing, etc. are part of elasticsearch.conf.new

--
One or more of the *amo* tests seem to POST data (to /_bulk) whose size increases with every request within the same build (e.g. build #5093). In this case, it grew to about 5 MB. The POST data (JSON) from one such request is available at ssh://people.mozilla.org:/tmp/payload.111221.1.json

--
(Assignee)

Comment 14

6 years ago
Alright, I've been out of the loop for far too long on this one. What's the plan moving forward? tmary? kumar?
(In reply to Shyam Mani [:fox2mike] from comment #14)
> Alright, I've been out of the loop for far too long on this one. What's the
> plan moving forward? tmary? kumar?

Need info on the large requests (https://bugzilla.mozilla.org/show_bug.cgi?id=706944#c13)

--
SUMO has been doing pretty well; we haven't had our tests fail from ES paging out since December 21st. We've had 20 runs since then.
https://github.com/mozilla/zamboni/commit/804be6f

Things should be good now. Please let me know if this issue appears resolved (or not).
Duplicate of this bug: 711175
Duplicate of this bug: 706258
(Assignee)

Comment 20

6 years ago
All good here? Any more issues?
We're doing pretty well with ES, Jenkins and SUMO tests. Haven't seen an ES-related test failure in a while now. Thumbs-up from the SUMO team.
(Assignee)

Updated

6 years ago
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations