Closed
Bug 704526
Opened 13 years ago
Closed 12 years ago
ci.mozilla.org unbearably slow for amo-master
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: clouserw, Assigned: fox2mike)
Details
AMO took 5 hours and 13 minutes to run tests last time, and the current run looks like it's following that trend. Looks like fox2mike got paged last night that the box was out of memory - perhaps it's related?
Updated•13 years ago
Assignee: server-ops → shyam
Reporter
Comment 1•13 years ago
It's now taking over 8 hours to run tests and then they get aborted. -> critical
Severity: normal → critical
Reporter
Comment 2•13 years ago
I think just restarting Jenkins will fix this in the short term.
Assignee
Comment 3•13 years ago
This isn't jenkins, it's elasticsearch. I see a continuous loop when I trace the process:

[root@jenkins1.dmz.phx1 ~]# strace -p 5291
Process 5291 attached - interrupt to quit
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768) = 48
getdents(519, /* 0 entries */, 32768) = 0
close(519) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768) = 48
getdents(519, /* 0 entries */, 32768) = 0
close(519) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/translog", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC) = 519
read(519, "cpu 65540566 4907 15692238 4228"..., 8192) = 3125
close(519) = 0
futex(0x7fd34410103c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344101038, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd344101010, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fd34410103c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344101038, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd344101010, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/3", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 5 entries */, 32768) = 144
getdents(519, /* 0 entries */, 32768) = 0
close(519) = 0
futex(0x7fd34400361c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344003618, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd3440035f0, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768) = 48
getdents(519, /* 0 entries */, 32768) = 0
close(519) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768) = 48
getdents(519, /* 0 entries */, 32768) = 0
close(519) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/1/translog", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC) = 519
read(519, "cpu 65540568 4907 15692241 4228"..., 8192) = 3125
close(519) = 0
futex(0x7fd34410103c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd344101038, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd344101010, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768) = 48
getdents(519, /* 0 entries */, 32768) = 0
close(519) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/index", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 519
getdents(519, /* 2 entries */, 32768) = 48
getdents(519, /* 0 entries */, 32768) = 0
close(519) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/2/translog", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC) = 519
read(519, "cpu 65540569 4907 15692243 4228"..., 8192) = 3125
close(519) = 0
futex(0x7fd34408e34c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd34408e348, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd34408e320, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0

And this never ends. Killing elasticsearch allowed the build to complete. I tried upping the memory allotted to ES from 256M min / 512M max to 1024M min / 2048M max, and that didn't do much either. How would you like to proceed?

https://ci.mozilla.org/job/amo-master/4935/console
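A loop this tight is easier to quantify from a saved trace than by eyeballing live output. As a sketch (not something that was run in this bug; the helper name and sample log are made up for illustration), the following tallies how often each syscall/path pair repeats in a captured strace log:

```shell
#!/bin/sh
# Hypothetical helper: count repeated syscall/path pairs in a saved
# strace log (as captured with e.g. `strace -p <pid> -o trace.log`).
count_calls() {
    # reduce each line to `syscall "first-arg"`, then count duplicates
    sed 's/(\("[^"]*"\).*/ \1/' "$1" | sort | uniq -c | sort -rn
}

# stand-in sample of a trace file
cat > /tmp/trace.sample <<'EOF'
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, ...}) = 0
stat("/usr/local/elasticsearch/data/jenkins/nodes/0/indices/test_amo/4/index", {st_mode=S_IFDIR|0755, ...}) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC) = 519
EOF

count_calls /tmp/trace.sample
```

The most-repeated entries float to the top, which makes a busy-loop over the same directory obvious at a glance.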
Reporter
Comment 4•13 years ago
Is there something different about ES on hudson than in production? What values does production use?
Assignee
Comment 5•13 years ago
(In reply to Wil Clouser [:clouserw] from comment #4)
> Is there something different about ES on hudson than in production? What
> values does production use?

The ES used for jenkins is a standalone instance:

class { 'it-elasticsearch':
    mem_min              => '1024',
    mem_max              => '2048',
    cluster_name         => 'jenkins',
    expected_nodes       => '1',
    minimum_master_nodes => '1',
    recover_after_nodes  => '1';
}

The ES on production is a clustered instance:

class { 'it-elasticsearch':
    mem_max              => '22544',
    cluster_name         => "phxweb",
    minimum_master_nodes => '2';
}

The changes to the jenkins instance were mainly to get it to run standalone, so it can be used for testing.
Assignee
Updated•13 years ago
Summary: ci.mozilla.org unbearably slow → ci.mozilla.org unbearably slow for amo-master
Reporter
Comment 6•13 years ago
Perhaps it's as easy as having the test script empty Elasticsearch when it starts? We should be doing that anyway so we don't have stale cached results.
Assignee
Comment 7•13 years ago
Sure, but that's on your side? :)
Reporter
Comment 8•13 years ago
Yeah, it would be. I think this is all I need to run:
> curl -XDELETE http://localhost:9201/celery/
> curl -XDELETE http://localhost:9201/images/
> curl -XDELETE http://localhost:9201/bulk/
Can you verify localhost:9201 is correct and that those commands work? Also, I'm not totally sure about the names of the queues, oremj would know what we run in production.
If you can verify that's right, I'll add it to the CI script
Assignee
Comment 9•13 years ago
Jeremy, what are the queues in prod?

Wil, I found a hanging ES process, killed it, and fired it up afresh. Then ran:

[root@jenkins1.dmz.phx1 ~]# curl -XDELETE http://localhost:9200/celery/
{"error":"IndexMissingException[[celery] missing]","status":404}
[root@jenkins1.dmz.phx1 ~]# curl -XDELETE http://localhost:9200/images/
{"error":"IndexMissingException[[images] missing]","status":404}
[root@jenkins1.dmz.phx1 ~]# curl -XDELETE http://localhost:9200/bulk/
{"error":"IndexMissingException[[bulk] missing]","status":404}

So I'm not sure whether those indexes were missing because of our restart or they just never existed.
Reporter
Comment 10•13 years ago
It's possible the tests are using different queues than production too, I suppose. What did the ES instance you killed look like? Queues are visible on the command line via the -Q flag (I think).
Reporter
Comment 11•13 years ago
I mixed up celery queues and ES indexes. Can you try these instead:
> curl -XDELETE http://localhost:9201/amo/
> curl -XDELETE http://localhost:9201/amo_stats/
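A cleanup step along these lines could be added to the CI script. As a sketch (the wrapper function and the ES_URL variable are assumptions for illustration, not the script actually deployed), it builds one DELETE request per index; a missing index just returns a 404, which is safe to ignore:

```shell
#!/bin/sh
# Hypothetical pre-test cleanup sketch: emit the curl call that would
# delete one Elasticsearch index. Pipe the output to `sh` (or swap the
# printf for a real curl invocation) to actually issue the requests.
delete_index_cmd() {
    printf 'curl -s -XDELETE %s/%s/\n' "${ES_URL:-http://localhost:9201}" "$1"
}

# index names taken from the comment above
for idx in amo amo_stats; do
    delete_index_cmd "$idx"
done
```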
Assignee
Comment 12•13 years ago
I also think ES is hanging after some time on this machine, or doing something weird, which is what's causing these issues.
Assignee
Comment 13•13 years ago
Any updates here? Still seeing issues? If so, I can spend some time with Kumar and Wil before the Monday morning meeting to look at this.
Reporter
Comment 14•13 years ago
Build times have been around 26 minutes since Dec 1st, but I'll let Kumar comment since he was closer to this and knows whether ES is fixed for real or just slowly filling up. I haven't changed anything.
Comment 15•13 years ago
Hi Shyam. It looked like ES locked up again on Dec 5th -- actually, I assumed you restarted it or something, because it eventually resolved itself. You can see the build trend here: https://ci.mozilla.org/job/amo-master/buildTimeTrend The broken state is the 9-hour build :( Hmm, it looks like there might be a pattern to when it enters a broken state again.
Assignee
Comment 16•12 years ago
Are we still seeing issues here?
Reporter
Comment 17•12 years ago
Builds are averaging around 25 minutes now. Still sad for server-class hardware (compared to how long the suite takes on devs' laptops), but certainly not 5 hours. I think we can call this fixed.
Assignee
Comment 18•12 years ago
I'm not sure what you're doing with ES. If we can point this at some ES instance other than the one on this box and it performs better, I'm more than happy to do that. I don't think this has anything to do with the hardware and everything to do with ES craziness (this used to build in 14 minutes until something changed, and then it just went crazy).
Assignee
Updated•12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard