Closed Bug 706944 Opened 13 years ago Closed 12 years ago

Elastic Search is timing out on ci.mozilla.org - webdev Jenkins

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kumar, Assigned: fox2mike)

References

Details

Attachments

(1 file)

We're getting a major slowdown on Elastic Search and it's causing tests to time out again: https://ci.mozilla.org/job/amo-master/4963/console

This has happened before, and jbalogh's diagnosis was that ES does not like how we treat it during tests (maybe the setup/teardown stuff). One possible workaround for now is to restart it periodically.
How much memory/CPU is elasticsearch taking up? This seemed to be a never-ending garbage collection issue.
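
A minimal sketch of how this could be checked on the host; the commands themselves are standard, but the java process match, the 9200 port, and the pre-1.0 /_cluster/nodes/stats endpoint for this ES version are assumptions:

  # Resident memory, CPU and thread count of the ES JVM
  ps -o pid,rss,pcpu,nlwp,args -C java

  # JVM/heap stats straight from ES (assumes the default port and the
  # old-style node stats endpoint)
  curl -s 'http://localhost:9200/_cluster/nodes/stats?pretty=true'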
Shyam, can you check this out?
Assignee: server-ops → shyam
See also bug 704526. Here is our build time trend, in which a pattern might be emerging as to how long it takes before ES dies: https://ci.mozilla.org/job/amo-master/buildTimeTrend (the 9-hour builds are where ES croaks).
It would seem the JVM is running out of memory. The attached patch enables logging of GC events.
(It uses /var/log/elasticsearch/gc.log.XXXXXXXX as the log destination.)
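
A rough sketch of what such a change to /opt/elasticsearch/bin/elasticsearch.in.sh could look like; the JVM flags are standard GC-logging flags, but the use of JAVA_OPTS and the exact suffix scheme for the log file are assumptions about the actual patch:

  # Append GC logging flags to the JVM options built up by elasticsearch.in.sh
  JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
  # XXXXXXXX stands for whatever per-start suffix the patch actually uses
  JAVA_OPTS="$JAVA_OPTS -Xloggc:/var/log/elasticsearch/gc.log.XXXXXXXX"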

--
IIUC this is a single-node "ES cluster", so number_of_replicas should be set to 0 (it is currently set to 1).
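
A minimal sketch of the change, assuming the default host/port; the PUT /_settings call updates existing indices, and the yml line would cover indices created later:

  # Drop replicas on all existing indices (they can never be allocated on a
  # single node anyway)
  curl -XPUT 'http://localhost:9200/_settings' -d '{"index": {"number_of_replicas": 0}}'

  # And for new indices, in elasticsearch.yml:
  #   index.number_of_replicas: 0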

--
Next step is to limit the size of the thread pools used for search and indexing; the default is an ever-growing pool of threads (prior to the last restart, it was at ~4000 threads).
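
A hypothetical shape for this change; the key names follow the thread-pool module of that ES generation, and the sizes and wait time below are placeholder guesses, not the values actually committed:

  # Cap the search and index pools instead of letting them grow without bound
  cat >> /opt/elasticsearch/config/elasticsearch.yml.new <<'EOF'
  threadpool.search.type: blocking
  threadpool.search.size: 32
  threadpool.search.wait_time: 120s
  threadpool.index.type: blocking
  threadpool.index.size: 32
  threadpool.index.wait_time: 120s
  EOF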

--
List of config changes (yet to be synced with the Puppet manifest):

[1] set HEAP_MIN = HEAP_MAX (/opt/elasticsearch/bin/elasticsearch.in.sh)
[2] enable GC logging, etc. (/opt/elasticsearch/bin/elasticsearch.in.sh)
[3] set number_of_replicas=0 (/opt/elasticsearch/config/elasticsearch.yml.new)
[4] limit the number of threads (/opt/elasticsearch/config/elasticsearch.yml.new)

In case the ES daemon needs to be restarted before these changes are merged with the Puppet manifest, please copy /opt/elasticsearch/config/elasticsearch.yml.new to /opt/elasticsearch/config/elasticsearch.yml before restarting the daemon.
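
For reference, a sketch of that interim procedure (the init-script name is an assumption about how ES is started on this host):

  cp /opt/elasticsearch/config/elasticsearch.yml.new \
     /opt/elasticsearch/config/elasticsearch.yml
  /etc/init.d/elasticsearch restart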

--
We're hitting a lot of timeouts today (with some new exceptions this time too). So is SUMO.

https://ci.mozilla.org/job/amo-master/buildTimeTrend
https://ci.mozilla.org/job/sumo-master/1255/console

Can someone take a look at ES in CI?
Both the SUMO and AMO test suites are blocked on this.
Severity: normal → major
tmary:

If I understand your comments correctly, your /opt/elasticsearch/config/elasticsearch.yml.new got knocked out by Puppet. I can implement your changes, but I don't know what the syntax for #4 in comment 7 should be. Otherwise your patch is ready to roll; all I have to do is commit.
Severity: major → normal
(In reply to Kumar McMillan [:kumar] from comment #8)
> We're hitting a lot of timeouts today (with some new exceptions this time
> too). So is sumo
> 
> https://ci.mozilla.org/job/amo-master/buildTimeTrend
> https://ci.mozilla.org/job/sumo-master/1255/console
> 
> Can someone take a look at ES in CI?

A typo in the config (120 instead of 120s) caused these exceptions ("ElasticSearchException: RejectedExecutionException[Rejected execution after waiting 120 ms for task [class org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1] to be executed.]"); this has been fixed.
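
For illustration only, assuming the offending setting was one of the thread-pool wait times in elasticsearch.yml (the key name is a guess; ES parses a bare number as milliseconds, hence the "120 ms" in the exception):

  # before: threadpool.index.wait_time: 120    <- interpreted as 120 ms
  # after:  threadpool.index.wait_time: 120s   <- 120 seconds, as intended
  sed -i 's/wait_time: 120$/wait_time: 120s/' /opt/elasticsearch/config/elasticsearch.yml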

--
(In reply to Rick Bryce [:rbryce] from comment #10)
> tmary:
> 
> If I understand your comments correctly.  Your
> /opt/elasticsearch/config/elasticsearch.yml.new got knocked out by puppet. 
> I can implement your changes but I don't know what the syntax for #4 in
> Comment 7 should be.  Otherwise your patch is ready to roll, all I have to
> do is commit.

Config changes for the thread pool, queue sizing, etc. are part of elasticsearch.conf.new.

--
One or more of the *amo* tests seem to POST data (/_bulk) whose size increases with every request within the same build (e.g. build #5093). In this case, it grew to about 5 MB. POST data (JSON) from one such request is available at ssh://people.mozilla.org:/tmp/payload.111221.1.json
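
A quick way to eyeball the captured payload (the scp step assumes normal access to people.mozilla.org, and the grep assumes the bulk actions are plain "index" operations):

  scp people.mozilla.org:/tmp/payload.111221.1.json .
  ls -lh payload.111221.1.json    # ~5 MB per the comment above
  # _bulk bodies are newline-delimited JSON: one action line, then (usually)
  # one document line, so this gives a rough count of actions in the request
  grep -c '"index"' payload.111221.1.json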

--
Alright, I've been out of the loop for far too long on this one. What's the plan moving forward? tmary? kumar?
(In reply to Shyam Mani [:fox2mike] from comment #14)
> Alright, I've been out of the loop for far too long on this one. What's the
> plan moving forward? tmary? kumar?

Need info on the large requests (https://bugzilla.mozilla.org/show_bug.cgi?id=706944#c13)

--
SUMO has been doing pretty well; we haven't had our tests fail from ES paging out since December 21st. We've had 20 runs since then.
https://github.com/mozilla/zamboni/commit/804be6f

Things should be good now. Please let me know if this issue appears resolved (or not).
All good here? Any more issues?
We're doing pretty well with ES, Jenkins and SUMO tests. Haven't seen an ES-related test failure in a while now. Thumbs-up from the SUMO team.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard