Bug 749081
Opened 13 years ago
Closed 13 years ago
Clobberer's busted for mozilla-inbound
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
Every other tree I tried loads just fine, but https://build.mozilla.org/clobberer/?branch=mozilla-inbound gives a 500 "Service Unavailable" error. We may or may not be able to elude the current bustage I wanted to clobber with a backout, but as often as we seem to land clobber-required things, it's only a matter of time before that becomes a blocker instead of just critical.
Comment 1•13 years ago
I've just checked the URL and it's loading slowly, but it's still usable. I'll keep an eye on it.
Severity: critical → normal
Reporter
Comment 2•13 years ago
Interesting, because I get the 500 fairly quickly (in clobberer terms, at least). Wonder whether that's because of the $SPECIAL_PEOPLE handling (though I don't see an obvious spot for it to be).
Severity: normal → critical
Reporter
Comment 3•13 years ago
And from a discussion in #developers when we needed to clobber something and nobody but releng people could load it, apparently it is either VPN-related, or $SPECIAL_PEOPLE related. I just checked every branch including the odd releasey bits like "None", and inbound is the only one that gives a quick 500 to unspecial people. Be interesting to know what it might be throwing in the server error log when it does.
Comment 4•13 years ago
FWIW, I'm not in $SPECIAL_PEOPLE but while connected to the [scl3] Build-VPN I am able to load it, though it does take a while to load.
Comment 5•13 years ago
I think the difference here from the old is that the load balancer produces a 500 pretty quickly. I thought clobberer was fixed, or at least didn't suck as much as before.
I bumped what I *believe* is the correct timeout in zeus to 90 seconds. If you see the 500 with a red "Service Unavailable" in less than 90 seconds, let me know.
When connected to Build-VPN, you aren't traversing the load balancer, so you may see a longer timeout.
Reporter
Comment 6•13 years ago
It is the red "Service Unavailable" (still), in what looks like 30 seconds plus connection time.
Updated•13 years ago
Assignee: nobody → dmaher
Comment 7•13 years ago
The connection and response timeouts in Zeus for the HTTP site were set to 40 and 30 seconds respectively; and for the HTTPS site, both were set to 30 seconds.
I have adjusted all of those to 90 seconds, as requested.
Updated•13 years ago
Assignee: dmaher → nobody
Comment 8•13 years ago
Thanks phrawzty :)
Philor, re-open if you're still seeing failures in <90s, or if 90s isn't enough?
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Reporter
Comment 9•13 years ago
[10:20:51.192] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 33692ms]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 10•13 years ago
Was that a zeus error or from Apache?
The logfile is odd... is this an automated script of some sort?
Comment 11•13 years ago
Ben, can you take a look and see if there's a particular query here that's running too slowly, that sheeri might be able to help us with?
Reporter
Comment 12•13 years ago
It was the same red "Service Unavailable," and the odd thing looking like a logfile line is just what you get when you copy-paste from Firefox's Tools - Web Console, which was handier than using wget to know how long it was taking.
Comment 13•13 years ago
Not working for me either:
[09:36:51.269] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 30471ms]
Comment 14•13 years ago
Yeah, I'm getting ~30000ms long timeouts as well:
[03:37:43.374] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 30199ms]
--
[03:38:20.015] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 32639ms]
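The Web Console lines pasted above are easier to compare once the status and elapsed time are pulled out of them. A small Python sketch, assuming the line format stays exactly as pasted in this bug (the helper name `parse_console_line` is hypothetical):

```python
import re

# Pattern for the Firefox Web Console lines as they appear in this bug.
CONSOLE_LINE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\] (?P<method>[A-Z]+) (?P<url>\S+) "
    r"\[HTTP/(?P<version>\S+) (?P<status>\d{3}) (?P<reason>.*) (?P<ms>\d+)ms\]"
)

def parse_console_line(line):
    """Return (status, reason, elapsed_ms), or None if the line does not match."""
    match = CONSOLE_LINE.search(line)
    if match is None:
        return None
    return int(match.group("status")), match.group("reason"), int(match.group("ms"))

# Lines copied from the comments above.
samples = [
    "[09:36:51.269] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 30471ms]",
    "[03:37:43.374] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 30199ms]",
    "[03:38:20.015] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 32639ms]",
]

for line in samples:
    print(parse_console_line(line))
```

All three failures land a little over 30 seconds, which fits a fixed timeout somewhere in front of the application rather than a varying backend error.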
Comment 15•13 years ago
Well, I think the fix is to make this service take less than 30s to respond, but let's see if we can make the timeouts longer, too.
Comment 16•13 years ago
So, internally (bypassing zeus), I managed:
--2012-05-01 10:32:25-- https://build.mozilla.org/clobberer/?branch=mozilla-inbound
Resolving build.mozilla.org... 10.22.74.128
Connecting to build.mozilla.org|10.22.74.128|:443... connected.
HTTP request sent, awaiting response... 401 Authorization Required
Connecting to build.mozilla.org|10.22.74.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: “index.html?branch=mozilla-inbound”
[ <=> ] 347,175 99.5K/s in 3.4s
2012-05-01 10:33:07 (99.5 KB/s) - “index.html?branch=mozilla-inbound” saved [347175]
real 0m47.218s
user 0m0.016s
sys 0m0.003s
having taken less than 15s to type my password. So I think the timeout here is still on the Zeus side.
Indeed, it looks like phrawzty missed the setting on the pools - I upped that to 90s.
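The sequence so far fits a setup where every layer in front of the application enforces its own timeout, so the shortest one decides when the 500 appears. A minimal Python sketch of that reasoning; the layer names and the 30-second pool value are assumptions, since the bug does not give the actual Zeus setting names or the old pool value:

```python
# Hypothetical layer names. The bug only states that the virtual-server
# timeouts were raised to 90s while the setting on the pools was missed.
def effective_timeout(layer_timeouts):
    """Return (layer, seconds) for the layer that cuts the request off first."""
    layer = min(layer_timeouts, key=layer_timeouts.get)
    return layer, layer_timeouts[layer]

before_pool_fix = {
    "virtual_server_connect": 90,
    "virtual_server_response": 90,
    "pool_response": 30,  # assumed value, inferred from the ~30s failures
}
after_pool_fix = dict(before_pool_fix, pool_response=90)

# Assumed: 47s wall clock from wget minus up to ~15s spent typing a password.
page_load_seconds = 33

for name, layers in [("before", before_pool_fix), ("after", after_pool_fix)]:
    layer, limit = effective_timeout(layers)
    verdict = "500" if page_load_seconds > limit else "200"
    print(f"{name}: limited by {layer} at {limit}s, expect {verdict}")
```

Raising only some of the timeouts leaves the effective limit unchanged, which is why the earlier adjustment to 90 seconds did not stop the failures at roughly 30 seconds.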
Comment 17•13 years ago
WFM now with the longer timeout, thank you :-)
[18:44:18.485] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 200 OK 40226ms]
Comment 18•13 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #16)
Works for me in ~42 seconds now:
[12:46:10.525] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 200 OK 42735ms]
--
[12:46:47.220] GET https://build.mozilla.org/clobberer/clobberer.css [HTTP/1.1 200 OK 289ms]
[12:46:47.225] GET https://build.mozilla.org/clobberer/jquery.min.js [HTTP/1.1 200 OK 513ms]
Reporter
Comment 19•13 years ago
The query would be "SELECT DISTINCT id, branch, builddir, buildername, slave FROM builds WHERE builddir NOT LIKE 'rel-' AND branch=mozilla-inbound ORDER BY branch ASC, buildername ASC" but I don't know where the db lives, so I can't do anything about getting sheeri to help us fix it.
Comment 20•13 years ago
It's at tm-b01-master01.m.o, db is clobberer. So the clobberer instance in scl3 is reaching back to sjc1 to do the db queries.
Comment 21•13 years ago
Whilst I can successfully load the clobberer for inbound, selecting machines to mark for clobber and submitting the page just returns the same page again, with none of the machines marked as clobbered by me.
Comment 22•13 years ago
Phil - I know where the clobberer db is (or rather, I have docs). And nthomas knew, too.
Off to check this out; maybe there's an easy fix like adding an index.
Comment 23•13 years ago
:philor - if that's actually the query, this might be the problem:
mysql> explain SELECT DISTINCT id, branch, builddir, buildername, slave FROM builds WHERE builddir NOT LIKE 'rel-' AND branch=mozilla-inbound ORDER BY branch ASC, buildername ASC\G
ERROR 1054 (42S22): Unknown column 'mozilla' in 'where clause'
The problem is that "mozilla-inbound" isn't quoted. That wouldn't cause a timeout, just an error.
Here's what the EXPLAIN looks like when "mozilla-inbound" is quoted:
mysql> explain SELECT DISTINCT id, branch, builddir, buildername, slave FROM builds WHERE builddir NOT LIKE 'rel-' AND branch='mozilla-inbound' ORDER BY branch ASC, buildername ASC\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: builds
type: ref
possible_keys: ix_branch
key: ix_branch
key_len: 53
ref: const
rows: 1561
Extra: Using where; Using filesort
1 row in set (0.01 sec)
This should not take very long, and indeed I just ran it and it came back right away (0.01 sec) with 1416 rows.
I tried this on production, both the read-only server and the read-write server, and the results were the same.
Dustin, can you check the VIPs? We should probably be using the VIP for the b1 cluster for ro and rw, instead of whatever VIP is in use for tm-b01-master01 and -slave01. Maybe the problem is that the zeus forwarding adds a little extra time? I can tell you that the query is not causing the page load to take 47 seconds.
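The ERROR 1054 above comes from the unquoted branch name being parsed as the expression mozilla - inbound, so the server looks for a column called mozilla. A small sketch of the same failure and the usual remedy of passing the value as a bound parameter, using Python's sqlite3 module as a stand-in for MySQL (an assumption made for portability; the table is a hypothetical cut-down version of builds with made-up rows):

```python
import sqlite3

# In-memory stand-in for the clobberer database; rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE builds (id INTEGER PRIMARY KEY, branch TEXT, builddir TEXT)")
conn.executemany(
    "INSERT INTO builds (branch, builddir) VALUES (?, ?)",
    [("mozilla-inbound", "m-in-lnx64"), ("mozilla-beta", "m-beta-lnx64")],
)

def unquoted_error():
    """Run the query with a bare branch name and return the error text."""
    try:
        conn.execute("SELECT id FROM builds WHERE branch=mozilla-inbound")
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None

def builddirs_for(branch):
    """Same lookup with the branch passed as a bound parameter."""
    rows = conn.execute(
        "SELECT builddir FROM builds WHERE branch = ? ORDER BY builddir",
        (branch,),
    )
    return [row[0] for row in rows]

print(unquoted_error())
print(builddirs_for("mozilla-inbound"))
```

As in the MySQL session, the unquoted form fails immediately instead of running slowly, so it cannot explain a 30-second timeout.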
Reporter
Comment 24•13 years ago
That was me mentally translating from PHP - the actual query is http://mxr.mozilla.org/build/source/tools/clobberer/index.php#385 and I missed seeing the e() that does quote the branch name.
Comment 25•13 years ago
[root@relengweb1.dmz.scl3 ~]# nc -vz b1-db1.db.scl3.mozilla.com 3306
^C
[root@relengweb1.dmz.scl3 ~]# nc -vz b1-db2.db.scl3.mozilla.com 3306
^C
I'll get that set up.
Comment 26•13 years ago
We did some query logging. A request from a slave looks like:
11300323 Connect clobberer@10.2.10.103 on clobberer
11300323 Query SELECT DISTINCT builddir from builds where slave='linux64-ix-slave03'
11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'srv-cen-lnx64-pgo' ORDER by last_build_time DESC LIMIT 1
11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'rel-m-beta-lnx64-rpk-5' ORDER by last_build_time DESC LIMIT 1
11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'm-beta-lnx64' ORDER by last_build_time DESC LIMIT 1
11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'oak-lnx64-dbg' ORDER by last_build_time DESC LIMIT 1
11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'tb-comm-cen-lnx64-dbg' ORDER by last_build_time DESC LIMIT 1
11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'bir-lnx64' ORDER by last_build_time DESC LIMIT 1
...
11300323 Query SELECT id, who, lastclobber FROM clobber_times WHERE builddir = 'srv-cen-lnx64-pgo' AND (branch IS NULL OR branch = 'services-central') AND (master IS NULL OR master = 'http://buildbot-master13.build.scl1.mozilla.com:8001/') AND (slave IS NULL OR slave = 'linux64-ix-slave03') ORDER BY lastclobber DESC LIMIT 1
11300323 Query SELECT id, who, lastclobber FROM clobber_times WHERE builddir = 'rel-m-beta-lnx64-rpk-5' AND (branch IS NULL OR branch = 'release-mozilla-beta') AND (master IS NULL OR master = 'http://buildbot-master13.build.scl1.mozilla.com:8001/') AND (slave IS NULL OR slave = 'linux64-ix-slave03') ORDER BY lastclobber DESC LIMIT 1
11300323 Query SELECT id, who, lastclobber FROM clobber_times WHERE builddir = 'm-beta-lnx64' AND (branch IS NULL OR branch = 'mozilla-beta') AND (master IS NULL OR master = 'http://buildbot-master13.build.scl1.mozilla.com:8001/') AND (slave IS NULL OR slave = 'linux64-ix-slave03') ORDER BY lastclobber DESC LIMIT 1
11300323 Query SELECT id, who, lastclobber FROM clobber_times WHERE builddir = 'oak-lnx64-dbg' AND (branch IS NULL OR branch = 'oak') AND (master IS NULL OR master = 'http://buildbot-master13.build.scl1.mozilla.com:8001/') AND (slave IS NULL OR slave = 'linux64-ix-slave03') ORDER BY lastclobber DESC LIMIT 1
...
(whew!)
while a UI hit looks like this:
11300044 Query SELECT DISTINCT id, branch, builddir, buildername, slave FROM builds WHERE builddir NOT LIKE 'rel-%' AND branch='mozilla-inbound' ORDER BY branch ASC, buildername ASC
11300044 Query SELECT DISTINCT buildername, builddir FROM builds WHERE builddir LIKE 'rel-%'
11300044 Query SELECT DISTINCT buildername, builddir FROM builds WHERE builddir LIKE 'rel-%'
.. 1411 times ..
11300044 Query SELECT id, who, lastclobber FROM clobber_times WHERE builddir = 'm-in-andrd-dbg' AND (branch IS NULL OR branch = 'mozilla-inbound') AND (master IS NULL OR master = '') AND (slave IS NULL OR slave = 'linux-ix-slave01') ORDER BY lastclobber DESC LIMIT 1
.. 1434 times, with different branches and slaves
So, there's some optimization to be done in the PHP here (on a new bug, plz)
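The UI log above shows one clobber_times lookup per build row, over 1400 of them for a single page. One possible optimization is to fetch the newest clobber for every builddir in a single grouped query. The sketch below uses Python's sqlite3 as a stand-in for the PHP/MySQL code, with simplified hypothetical tables and made-up rows; it is not the change that landed in the follow-up bug, whose details are not given here.

```python
import sqlite3

# Simplified, hypothetical schema; the who values and timestamps are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE builds (id INTEGER PRIMARY KEY, branch TEXT, builddir TEXT, slave TEXT);
CREATE TABLE clobber_times (id INTEGER PRIMARY KEY, builddir TEXT, branch TEXT,
                            who TEXT, lastclobber INTEGER);
INSERT INTO builds (branch, builddir, slave) VALUES
  ('mozilla-inbound', 'm-in-andrd-dbg', 'linux-ix-slave01'),
  ('mozilla-inbound', 'm-in-andrd-dbg', 'linux-ix-slave02'),
  ('mozilla-inbound', 'm-in-lnx64', 'linux64-ix-slave03');
INSERT INTO clobber_times (builddir, branch, who, lastclobber) VALUES
  ('m-in-andrd-dbg', 'mozilla-inbound', 'alice', 100),
  ('m-in-andrd-dbg', NULL, 'bob', 200),
  ('m-in-lnx64', 'mozilla-inbound', 'carol', 150);
""")

def last_clobbers_per_row(branch):
    """N+1 pattern: one clobber_times query for every build row."""
    result = {}
    builds = conn.execute(
        "SELECT builddir FROM builds WHERE branch = ?", (branch,)).fetchall()
    for (builddir,) in builds:
        row = conn.execute(
            "SELECT lastclobber FROM clobber_times WHERE builddir = ? "
            "AND (branch IS NULL OR branch = ?) "
            "ORDER BY lastclobber DESC LIMIT 1", (builddir, branch)).fetchone()
        result[builddir] = row[0] if row else None
    return result

def last_clobbers_batched(branch):
    """Single grouped query returning the newest clobber per builddir."""
    rows = conn.execute(
        "SELECT b.builddir, MAX(c.lastclobber) "
        "FROM (SELECT DISTINCT builddir FROM builds WHERE branch = ?) AS b "
        "LEFT JOIN clobber_times AS c ON c.builddir = b.builddir "
        "AND (c.branch IS NULL OR c.branch = ?) "
        "GROUP BY b.builddir", (branch, branch))
    return dict(rows)

print(last_clobbers_per_row("mozilla-inbound"))
print(last_clobbers_batched("mozilla-inbound"))
```

Both functions return the same mapping, but the batched version issues one query regardless of how many build rows the branch has. The real clobber_times lookup also filters on master and slave, which this sketch leaves out.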
Comment 27•13 years ago
I turned all clobberer instances over to using the scl3 VIPs. I don't expect that to fix the performance issues, but it's a good idea all the same :)
I'll punt this back to releng now to work on the clobberer code itself.
Comment 28•13 years ago
Fixed by bug 756532.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Depends on: 756532
Resolution: --- → FIXED
Assignee
Updated•12 years ago
Product: mozilla.org → Release Engineering
Assignee
Updated•8 years ago
Component: Tools → General