Bug 704526 (Closed) - Opened 13 years ago, Closed 12 years ago

ci.mozilla.org unbearably slow for amo-master

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: clouserw, Assigned: fox2mike)

Details

AMO took 5 hours and 13 minutes to run tests last time, and the current run looks like it's following that trend.

Looks like fox2mike got paged last night that the box was out of memory - perhaps it's related?
Assignee: server-ops → shyam
It's now taking over 8 hours to run tests, and then they get aborted. Bumping severity to critical.
Severity: normal → critical
I think just restarting jenkins will fix this in the short term.
This isn't jenkins, it's elasticsearch.

I see a continuous loop when I trace the process:

[root@jenkins1.dmz.phx1 ~]# strace -p 5291
Process 5291 attached - interrupt to quit
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/translog", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC)  = 519
read(519, "cpu  65540566 4907 15692238 4228"..., 8192) = 3125
close(519)                              = 0
futex(0x7fd34410103c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344101038, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd344101010, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fd34410103c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344101038, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd344101010, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/3", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 5 entries */, 32768)   = 144
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
futex(0x7fd34400361c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344003618, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd3440035f0, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/translog", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC)  = 519
read(519, "cpu  65540568 4907 15692241 4228"..., 8192) = 3125
close(519)                              = 0
futex(0x7fd34410103c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344101038, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd344101010, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768)   = 48
getdents(519, /* 0 entries */, 32768)   = 0
close(519)                              = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/translog", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC)  = 519
read(519, "cpu  65540569 4907 15692243 4228"..., 8192) = 3125
close(519)                              = 0
futex(0x7fd34408e34c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd34408e348, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd34408e320, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0

And this never ends. 

Killing elasticsearch allowed the build to complete. I tried upping the memory allocated to ES from 256M min / 512M max to 1024M min / 2048M max, and that didn't do much either.

How would you like to proceed? 

https://ci.mozilla.org/job/amo-master/4935/console
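
(For reference, a minimal sketch of how that heap bump would typically be applied on an ES instance of this vintage, assuming the init script sources bin/elasticsearch.in.sh; the it-elasticsearch Puppet module may wire this up differently:)

# hypothetical paths/service name; adjust to however Puppet lays ES out on this box
export ES_MIN_MEM=1024m
export ES_MAX_MEM=2048m
/etc/init.d/elasticsearch restart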
Is there something different about ES on hudson compared to production? What values does production use?
(In reply to Wil Clouser [:clouserw] from comment #4)
> Is there something different about ES on hudson compared to production? What
> values does production use?

The ES used for jenkins is a standalone instance.

class {
    'it-elasticsearch':
        mem_min              => '1024',
        mem_max              => '2048',
        cluster_name         => 'jenkins',
        expected_nodes       => '1',
        minimum_master_nodes => '1',
        recover_after_nodes  => '1';
}

The ES on production is a clustered instance.

class {
    'it-elasticsearch':
        mem_max              => '22544',
        cluster_name         => 'phxweb',
        minimum_master_nodes => '2';
}

The changes to the jenkins instance were mainly to get it to run standalone, so it can be used for testing.
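
(As a sanity check that the standalone instance actually picked those values up, the cluster health and nodes info APIs should reflect them; a sketch, assuming the instance answers on the default port 9200 and the 0.x-era nodes API:)

curl 'http://localhost:9200/_cluster/health?pretty=true'
# jvm=true should show the heap min/max the node is really running with
curl 'http://localhost:9200/_cluster/nodes?jvm=true&pretty=true'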
Summary: ci.mozilla.org unbearably slow → ci.mozilla.org unbearably slow for amo-master
Perhaps it's as easy as having the test script empty elasticsearch when it starts? We should be doing that anyway so we don't have old cached results.
Sure, but that's on your side? :)
Yeah, it would be.  I think this is all I need to run?

> curl -XDELETE http://localhost:9201/celery/
> curl -XDELETE http://localhost:9201/images/
> curl -XDELETE http://localhost:9201/bulk/

Can you verify localhost:9201 is correct and that those commands work? Also, I'm not totally sure about the names of the queues; oremj would know what we run in production.

If you can verify that's right, I'll add it to the CI script
Jeremy, what are the queues in prod?

Wil, I found a hanging ES process, killed it, and fired it up afresh. Then ran:

[root@jenkins1.dmz.phx1 ~]# curl -XDELETE http://localhost:9200/celery/
{"error":"IndexMissingException[[celery] missing]","status":404}

[root@jenkins1.dmz.phx1 ~]# curl -XDELETE http://localhost:9200/images/
{"error":"IndexMissingException[[images] missing]","status":404}

[root@jenkins1.dmz.phx1 ~]# curl -XDELETE http://localhost:9200/bulk/
{"error":"IndexMissingException[[bulk] missing]","status":404}

So I'm not sure if they were missing because of our restart or... they just don't exist.
It's possible the tests are using different queues from production too, I suppose. What did the ES instance you killed look like? Queues are visible on the worker command line as the -Q flag (I think).
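
(A quick way to check, assuming the workers run as celeryd processes on that box: the -Q argument on their command lines lists the queues each worker consumes.)

# the [c] trick keeps grep from matching itself
ps auxww | grep '[c]eleryd'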
I mixed up celery queues and ES indexes.  Can you try these instead:

> curl -XDELETE http://localhost:9201/amo/
> curl -XDELETE http://localhost:9201/amo_stats/
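
(If those names check out, the CI-script addition could be as simple as the sketch below. This assumes the standalone ES really answers on 9201; the commands earlier in this bug hit 9200, so the port is still worth confirming.)

# wipe stale test indexes before the run; a 404 here just means the index
# didn't exist yet, and curl still exits 0 so the build won't fail on it
curl -s -XDELETE http://localhost:9201/amo/
curl -s -XDELETE http://localhost:9201/amo_stats/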
I also think ES is hanging after some time on this machine, or doing something weird... which is what's causing these issues.
Any updates here? Still seeing issues? If so, I can spend some time with Kumar and Wil before the Monday morning meeting to look at this.
Build times have been around 26 minutes since Dec 1st, but I'll let Kumar comment since he was closer to this and knows whether ES is fixed for real or just slowly filling up. I haven't changed anything.
Hi Shyam. It looked like ES locked up again on Dec 5th; actually, I assumed you restarted it or something, because it eventually resolved itself. You can see the build trend here: https://ci.mozilla.org/job/amo-master/buildTimeTrend The broken state is the 9-hour build :( Hmm, it looks like there might be a pattern to when it enters a broken state again.
Are we still seeing issues here?
Builds are averaging around 25 minutes now. Still sad for server-class hardware (compared to how long it takes on devs' laptops), but certainly not 5 hours. I think we can call this fixed.
I'm not sure what you're doing with ES. If we can point this at an ES instance other than the one on this box and it performs better, I'm more than happy to do that. I don't think this has anything to do with the hardware and everything to do with ES craziness (this used to build in 14 minutes until something changed and then it just went crazy).
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard