ci.mozilla.org unbearably slow for amo-master

Status: RESOLVED FIXED
Product: mozilla.org Graveyard
Component: Server Operations
Severity: critical
Opened: 6 years ago
Last modified: 3 years ago
Reporter: clouserw
Assignee: fox2mike

(Reporter)

Description

6 years ago
AMO took 5 hours and 13 minutes to run tests last time, and the current run looks like it's following that trend.

It looks like fox2mike got paged last night because the box was out of memory; perhaps that's related?
Assignee: server-ops → shyam
(Reporter)

Comment 1

6 years ago
It's now taking over 8 hours to run the tests, and then they get aborted. -> critical
Severity: normal → critical
(Reporter)

Comment 2

6 years ago
I think just restarting Jenkins will fix this in the short term.
(Assignee)

Comment 3

6 years ago
This isn't Jenkins, it's Elasticsearch.

I see a continuous loop when I trace the process:

[root@jenkins1.dmz.phx1 ~]# strace -p 5291
Process 5291 attached - interrupt to quit
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/translog", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC)  = 519
read(519, "cpu  65540566 4907 15692238 4228"..., 8192) = 3125
close(519)                              = 0
futex(0x7fd34410103c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344101038, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd344101010, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fd34410103c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344101038, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd344101010, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/3", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 5 entries */, 32768)   = 144
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
futex(0x7fd34400361c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344003618, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd3440035f0, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/translog", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC)  = 519
read(519, "cpu  65540568 4907 15692241 4228"..., 8192) = 3125
close(519)                              = 0
futex(0x7fd34410103c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344101038, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd344101010, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/translog", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC)  = 519
read(519, "cpu  65540569 4907 15692243 4228"..., 8192) = 3125
close(519)                              = 0
futex(0x7fd34408e34c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd34408e348, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd34408e320, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0

And this never ends. 

Killing Elasticsearch allowed the build to complete. I tried upping the memory allocated to ES from 256M min / 512M max to 1024M min / 2048M max, and that didn't do much either.

How would you like to proceed? 

https://ci.mozilla.org/job/amo-master/4935/console
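
For what it's worth, the usual gotcha with a heap bump like that is the init wrapper not actually passing the new values through to the JVM. A minimal sketch of how to check, assuming an elasticsearch.in.sh-style wrapper that maps ES_MIN_MEM/ES_MAX_MEM (or ES_HEAP_SIZE) to -Xms/-Xmx; the variable names are the common defaults of that era, not confirmed from this box:

# Sketch: confirm which heap flags the running ES JVM actually picked up.
# The process match is deliberately loose; adjust it if other java processes
# on jenkins1 happen to match "elasticsearch".
ES_PID=$(pgrep -f elasticsearch | head -1)
ps -o args= -p "$ES_PID" | tr ' ' '\n' | grep -E '^-Xm[sx]'

If that still prints -Xms256m/-Xmx512m after the change, the bump never took effect.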
(Reporter)

Comment 4

6 years ago
Is there something different about ES on Hudson compared to production? What values does production use?
(Assignee)

Comment 5

6 years ago
(In reply to Wil Clouser [:clouserw] from comment #4)
> Is there something different about ES on hudson than in production?  What
> values does production use?

The ES used for jenkins is a standalone instance.

class {
    'it-elasticsearch':
        mem_min              => '1024',
        mem_max              => '2048',
        cluster_name         => 'jenkins',
        expected_nodes       => '1',
        minimum_master_nodes => '1',
        recover_after_nodes  => '1';
}

The ES on production is a clustered instance.

class {
    'it-elasticsearch':
        mem_max              => '22544',
        cluster_name         => "phxweb",
        minimum_master_nodes => '2';
}

The changes to the jenkins instance were mainly to get it to run standalone, so it can be used for testing.
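
For reference, a quick way to sanity-check that the standalone jenkins instance is actually healthy is the cluster health API; a sketch (the cluster name and node count come from the config above, and the exact JSON fields vary by ES version):

# Sketch: check the standalone "jenkins" cluster from the box itself.
curl -s 'http://localhost:9200/_cluster/health?pretty=true'
# Expect "cluster_name" : "jenkins" and "number_of_nodes" : 1. On a single
# node, yellow is normal (unassigned replicas); red means unassigned primary
# shards and would line up with the looping/hanging behaviour in comment 3.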
(Assignee)

Updated

6 years ago
Summary: ci.mozilla.org unbearably slow → ci.mozilla.org unbearably slow for amo-master
(Reporter)

Comment 6

6 years ago
Perhaps it's as easy as having the test script empty Elasticsearch when it starts? We should be doing that anyway so we don't have old cached results.
(Assignee)

Comment 7

6 years ago
Sure, but that's on your side? :)
(Reporter)

Comment 8

6 years ago
Yeah, it would be.  I think this is all I need to run?

> curl -XDELETE http://localhost:9201/celery/
> curl -XDELETE http://localhost:9201/images/
> curl -XDELETE http://localhost:9201/bulk/

Can you verify localhost:9201 is correct and that those commands work? Also, I'm not totally sure about the names of the queues; oremj would know what we run in production.

If you can verify that's right, I'll add it to the CI script.
(Assignee)

Comment 9

6 years ago
Jeremy, what are the queues in prod?

Wil, I found a hanging ES process, killed it, and fired it up afresh. Then I ran:

[root@jenkins1.dmz.phx1 ~]# curl -XDELETE http://localhost:9200/celery/
{"error":"IndexMissingException[[celery] missing]","status":404}

[root@jenkins1.dmz.phx1 ~]# curl -XDELETE http://localhost:9200/images/
{"error":"IndexMissingException[[images] missing]","status":404}

[root@jenkins1.dmz.phx1 ~]# curl -XDELETE http://localhost:9200/bulk/
{"error":"IndexMissingException[[bulk] missing]","status":404}

So I'm not sure if those indexes were missing because of our restart or if they just never existed.
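
One way to tell whether those 404s mean the restart wiped the indexes or they simply never existed is to list what the jenkins node actually holds. A sketch, hedged because the handiest endpoint depends on the ES version on jenkins1:

# Sketch: list the indices currently on the jenkins ES node. The old-style
# indices status API works on 0.x-era releases; newer ones have
# /_cat/indices?v as a friendlier equivalent.
curl -s 'http://localhost:9200/_status?pretty=true'
# The top-level keys under "indices" are the index names; if celery/images/
# bulk never show up there, those indexes were simply never created here.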
(Reporter)

Comment 10

6 years ago
It's possible the tests are using different queues than production too, I suppose. What did the ES instance you killed look like? Queues are visible on the command line via the -Q flag (I think).
(Reporter)

Comment 11

6 years ago
I mixed up celery queues and ES indexes.  Can you try these instead:

> curl -XDELETE http://localhost:9201/amo/
> curl -XDELETE http://localhost:9201/amo_stats/
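
If this does go into the CI script, here is a sketch of what the cleanup step could look like. The index names and port are the ones from comments 8 and 11 (whether 9200 or 9201 is right on jenkins1 still needs verifying), and tolerating a 404 so a missing index doesn't fail the build is an assumption, not something settled here:

#!/bin/bash
# Sketch: pre-test cleanup for the amo-master job -- drop the test indices
# so stale data can't pile up between builds.
set -u
ES_URL="http://localhost:9201"

for idx in amo amo_stats; do
  # -s keeps the build log quiet; "|| true" means an IndexMissingException
  # (404) on a fresh instance doesn't abort the build.
  curl -s -o /dev/null -XDELETE "${ES_URL}/${idx}/" || true
done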
(Assignee)

Comment 12

6 years ago
I also think ES is hanging after some time on this machine, or doing something weird, which is what's causing these issues.
(Assignee)

Comment 13

6 years ago
Any updates here? Still seeing issues? If so, I can spend some time with Kumar and Wil before the Monday morning meeting to look at this.
(Reporter)

Comment 14

6 years ago
Build times have been around 26 minutes since Dec 1st, but I'll let Kumar comment since he was closer to this and knows if ES is fixed for real or just slowly filling up. I haven't changed anything.

Comment 15

6 years ago
Hi Shyam. It looked like ES locked up again on Dec 5th -- actually, I assumed you restarted it or something because it eventually resolved itself. You can see the build trend here: https://ci.mozilla.org/job/amo-master/buildTimeTrend  The broken state is the 9hr build :(  Hmm, it looks like there might be a pattern to when it enters back into a broken state.
(Assignee)

Comment 16

6 years ago
Are we still seeing issues here?
(Reporter)

Comment 17

6 years ago
Builds are averaging around 25 minutes now. Still sad for server-class hardware (compared to how long it takes on devs' laptops), but certainly not 5 hours. I think we can call this fixed.
(Assignee)

Comment 18

6 years ago
I'm not sure what you're doing with ES. If we can point this at an ES instance other than the one on this box and it performs better, I'm more than happy to do that. I don't think this has anything to do with the hardware and everything to do with ES craziness (this used to build in 14 minutes until something changed and it just went crazy).
(Assignee)

Updated

6 years ago
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard