Closed Bug 749081 Opened 13 years ago Closed 13 years ago

Clobberer's busted for mozilla-inbound

Categories

(Release Engineering :: General, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)


Every other tree I tried loads just fine, but https://build.mozilla.org/clobberer/?branch=mozilla-inbound gives a 500 "Service Unavailable" error. We may or may not be able to elude the current bustage I wanted to clobber with a backout, but as often as we seem to land clobber-required things, it's only a matter of time before that becomes a blocker instead of just critical.
I've just checked the URL and it's loading slowly, but it's still usable. I'll keep an eye on it.
Severity: critical → normal
Interesting, because I get the 500 fairly quickly (in clobberer terms, at least). Wonder whether that's because of the $SPECIAL_PEOPLE handling (though I don't see an obvious spot for it to be).
Severity: normal → critical
And from a discussion in #developers when we needed to clobber something and nobody but releng people could load it, apparently it is either VPN-related, or $SPECIAL_PEOPLE related. I just checked every branch including the odd releasey bits like "None", and inbound is the only one that gives a quick 500 to unspecial people. Be interesting to know what it might be throwing in the server error log when it does.
FWIW, I'm not in $SPECIAL_PEOPLE but while connected to the [scl3] Build-VPN I am able to load it, though it does take a while to load.
I think the difference here from the old is that the load balancer produces a 500 pretty quickly. I thought clobberer was fixed, or at least didn't suck as much as before. I bumped what I *believe* is the correct timeout in zeus to 90 seconds. If you see the 500 with a red "Service Unavailable" in less than 90 seconds, let me know. When connected to Build-VPN, you aren't traversing the load balancer, so you may see a longer timeout.
It is the red "Service Unavailable" (still), in what looks like 30 seconds plus connection time.
Assignee: nobody → dmaher
The connection and response timeouts in Zeus for the HTTP site were set to 40 and 30 seconds respectively; and for the HTTPS site, both were set to 30 seconds. I have adjusted all of those to 90 seconds, as requested.
Assignee: dmaher → nobody
Thanks phrawzty :) Philor, re-open if you're still seeing failures in <90s, or if 90s isn't enough?
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
[10:20:51.192] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 33692ms]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Was that a zeus error or from Apache? The logfile is odd... is this an automated script of some sort?
Ben, can you take a look and see if there's a particular query here that's running too slowly, that sheeri might be able to help us with?
It was the same red "Service Unavailable," and the odd thing looking like a logfile line is just what you get when you copy-paste from Firefox's Tools - Web Console, which was handier than using wget to know how long it was taking.
Not working for me either: [09:36:51.269] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 30471ms]
Yeah, I'm getting ~30000ms long timeouts as well:
    [03:37:43.374] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 30199ms]
    --
    [03:38:20.015] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 32639ms]
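The Web Console lines pasted in the comments above all share one shape, so the status and duration can be pulled out mechanically instead of read by eye. A minimal sketch in Python, assuming the `[time] METHOD url [HTTP/x.y status reason NNNms]` layout seen in this bug; the regex and the name `parse_console_line` are illustrative and not part of any Firefox tooling:

```python
import re

# Matches lines copied from Firefox's Web Console in the form seen in this bug.
CONSOLE_LINE = re.compile(
    r"\[(?P<time>[\d:.]+)\] (?P<method>\S+) (?P<url>\S+) "
    r"\[HTTP/[\d.]+ (?P<status>\d{3}) (?P<reason>.*?) (?P<ms>\d+)ms\]"
)


def parse_console_line(line):
    """Return (status, milliseconds) for one console line, or None if it does not match."""
    match = CONSOLE_LINE.search(line)
    if match is None:
        return None
    return int(match.group("status")), int(match.group("ms"))


# Lines taken verbatim from the comments in this bug.
samples = [
    "[10:20:51.192] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 33692ms]",
    "[03:37:43.374] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 30199ms]",
    "[03:38:20.015] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 500 Internal Server Error 32639ms]",
]

parsed = [parse_console_line(line) for line in samples]
failure_ms = [ms for status, ms in parsed if status == 500]
print(min(failure_ms), max(failure_ms))
```

Every failure here lands a little over 30 seconds, which is what you would expect if some timeout of about 30 seconds is still in force somewhere in front of the application.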
Well, I think the fix is to make this service take less than 30s to respond, but let's see if we can make the timeouts longer, too.
So, internally (bypassing zeus), I managed:
    --2012-05-01 10:32:25-- https://build.mozilla.org/clobberer/?branch=mozilla-inbound
    Resolving build.mozilla.org... 10.22.74.128
    Connecting to build.mozilla.org|10.22.74.128|:443... connected.
    HTTP request sent, awaiting response... 401 Authorization Required
    Connecting to build.mozilla.org|10.22.74.128|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    Saving to: “index.html?branch=mozilla-inbound”
    [ <=> ] 347,175 99.5K/s in 3.4s
    2012-05-01 10:33:07 (99.5 KB/s) - “index.html?branch=mozilla-inbound” saved [347175]
    real 0m47.218s
    user 0m0.016s
    sys 0m0.003s
having taken less than 15s to type my password. So I think the timeout here is still on the Zeus side. Indeed, it looks like phrawzty missed the setting on the pools - I upped that to 90s.
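The reason the earlier change did not help can be put in one line: when several timeouts sit along the same request path, the shortest one decides when the client sees a failure. A rough sketch in Python of that reasoning. The setting names are made up for illustration and are not Zeus configuration keys, and the pool's 30 seconds is an assumption inferred from the ~30s failures reported in this bug, not a value read from the configuration:

```python
def effective_timeout(timeouts):
    """Return (name, seconds) of the shortest timeout along the request path."""
    name = min(timeouts, key=timeouts.get)
    return name, timeouts[name]


# After the first adjustment: the virtual-server settings are at 90s,
# but the pool setting was missed (its 30s value is assumed, see above).
after_first_fix = {
    "http_connection": 90,
    "http_response": 90,
    "https_connection": 90,
    "https_response": 90,
    "pool": 30,
}

# After the pool setting was also raised to 90s.
after_pool_fix = dict(after_first_fix, pool=90)

print(effective_timeout(after_first_fix))
print(effective_timeout(after_pool_fix)[1])
```

With the pool left at its old value, the page still fails at about 30 seconds no matter how high the other settings go; once every layer is at 90 seconds, a page that takes 40 to 47 seconds fits.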
WFM now with the longer timeout, thank you :-) [18:44:18.485] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 200 OK 40226ms]
(In reply to Dustin J. Mitchell [:dustin] from comment #16) Works for me in ~42 seconds now:
    [12:46:10.525] GET https://build.mozilla.org/clobberer/?branch=mozilla-inbound [HTTP/1.1 200 OK 42735ms]
    --
    [12:46:47.220] GET https://build.mozilla.org/clobberer/clobberer.css [HTTP/1.1 200 OK 289ms]
    [12:46:47.225] GET https://build.mozilla.org/clobberer/jquery.min.js [HTTP/1.1 200 OK 513ms]
The query would be "SELECT DISTINCT id, branch, builddir, buildername, slave FROM builds WHERE builddir NOT LIKE 'rel-' AND branch=mozilla-inbound ORDER BY branch ASC, buildername ASC" but I don't know where the db lives, so I can't do anything about getting sheeri to help us fix it.
It's at tm-b01-master01.m.o, db is clobberer. So the clobberer instance in scl3 is reaching back to sjc1 to do the db queries.
Whilst I can successfully load the clobberer for inbound, selecting machines to mark for clobber and submitting the page just returns the same page again, with none of the machines marked as clobbered by me.
Phil - I know where the clobberer db is (or rather, I have docs). And nthomas knew, too. off to check this out, maybe there's an easy fix like adding an index.
:philor - if that's actually the query, this might be the problem:
    mysql> explain SELECT DISTINCT id, branch, builddir, buildername, slave FROM builds WHERE builddir NOT LIKE 'rel-' AND branch=mozilla-inbound ORDER BY branch ASC, buildername ASC\G
    ERROR 1054 (42S22): Unknown column 'mozilla' in 'where clause'
The problem is that "mozilla-inbound" isn't quoted. That wouldn't cause a timeout, just an error. Here's what the EXPLAIN looks like when "mozilla-inbound" is quoted:
    mysql> explain SELECT DISTINCT id, branch, builddir, buildername, slave FROM builds WHERE builddir NOT LIKE 'rel-' AND branch='mozilla-inbound' ORDER BY branch ASC, buildername ASC\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: builds
             type: ref
    possible_keys: ix_branch
              key: ix_branch
          key_len: 53
              ref: const
             rows: 1561
            Extra: Using where; Using filesort
    1 row in set (0.01 sec)
This should not take very long, and indeed I just ran it and it came back right away (0.01 sec) with 1416 rows. I tried this on production, both the read-only server and the read-write server, and the results were the same. Dustin, can you check the VIPs? We should probably be using the VIP for the b1 cluster for ro and rw, instead of whatever VIP is in use for tm-b01-master01 and -slave01. Maybe the problem is that the zeus forwarding adds a little extra time? I can tell you that the query is not causing the page load to take 47 seconds.
That was me mentally translating from PHP - the actual query is http://mxr.mozilla.org/build/source/tools/clobberer/index.php#385 and I missed seeing the e() that does quote the branch name.
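To make the quoting point concrete: an unquoted `branch=mozilla-inbound` is parsed as the expression `mozilla - inbound`, so the database looks for a column named `mozilla`. A minimal sketch in Python, using an in-memory SQLite database as a stand-in for the MySQL `clobberer` db. The table is cut down to the columns named in the query, the buildernames are invented, and a bound parameter is used here in place of the PHP `e()` helper, whose implementation is not shown in this bug:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE builds ("
    "id INTEGER PRIMARY KEY, branch TEXT, builddir TEXT, buildername TEXT, slave TEXT)"
)
conn.executemany(
    "INSERT INTO builds (branch, builddir, buildername, slave) VALUES (?, ?, ?, ?)",
    [
        # builddir and slave values appear in this bug; buildernames are made up.
        ("mozilla-inbound", "m-in-andrd-dbg", "example inbound builder", "linux-ix-slave01"),
        ("mozilla-beta", "m-beta-lnx64", "example beta builder", "linux64-ix-slave03"),
        ("release-mozilla-beta", "rel-m-beta-lnx64-rpk-5", "example release builder", "linux64-ix-slave03"),
    ],
)


def builds_for_branch(conn, branch):
    """Run the branch listing query with the branch name passed as a bound parameter."""
    return conn.execute(
        "SELECT DISTINCT id, branch, builddir, buildername, slave FROM builds "
        "WHERE builddir NOT LIKE 'rel-%' AND branch = ? "
        "ORDER BY branch ASC, buildername ASC",
        (branch,),
    ).fetchall()


# The unquoted form fails outright rather than running slowly.
try:
    conn.execute("SELECT id FROM builds WHERE branch=mozilla-inbound")
    unquoted_error = None
except sqlite3.OperationalError as exc:
    unquoted_error = str(exc)

print(unquoted_error)
print(builds_for_branch(conn, "mozilla-inbound"))
```

Either way the conclusion from the thread stands: a quoting mistake would produce an immediate error, not a 30 to 47 second page load.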
    [root@relengweb1.dmz.scl3 ~]# nc -vz b1-db1.db.scl3.mozilla.com 3306
    ^C
    [root@relengweb1.dmz.scl3 ~]# nc -vz b1-db2.db.scl3.mozilla.com 3306
    ^C
I'll get that set up.
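Both `nc -vz` probes had to be interrupted, which suggests the connection attempts to port 3306 were neither accepted nor refused. The same check can be scripted with an explicit timeout so it never hangs. A small sketch in Python; `can_connect` is a hypothetical helper name, and the 5 second default is an arbitrary choice:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

From the web host, `can_connect("b1-db1.db.scl3.mozilla.com", 3306)` would be the equivalent of the first probe above; a `False` after the full timeout corresponds to the hang seen with `nc`.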
We did some query logging. A slave looks like:
    11300323 Connect clobberer@10.2.10.103 on clobberer
    11300323 Query SELECT DISTINCT builddir from builds where slave='linux64-ix-slave03'
    11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'srv-cen-lnx64-pgo' ORDER by last_build_time DESC LIMIT 1
    11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'rel-m-beta-lnx64-rpk-5' ORDER by last_build_time DESC LIMIT 1
    11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'm-beta-lnx64' ORDER by last_build_time DESC LIMIT 1
    11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'oak-lnx64-dbg' ORDER by last_build_time DESC LIMIT 1
    11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'tb-comm-cen-lnx64-dbg' ORDER by last_build_time DESC LIMIT 1
    11300323 Query SELECT buildername, builddir, branch FROM builds WHERE builddir = 'bir-lnx64' ORDER by last_build_time DESC LIMIT 1
    ...
    11300323 Query SELECT id, who, lastclobber FROM clobber_times WHERE builddir = 'srv-cen-lnx64-pgo' AND (branch IS NULL OR branch = 'services-central') AND (master IS NULL OR master = 'http://buildbot-master13.build.scl1.mozilla.com:8001/') AND (slave IS NULL OR slave = 'linux64-ix-slave03') ORDER BY lastclobber DESC LIMIT 1
    11300323 Query SELECT id, who, lastclobber FROM clobber_times WHERE builddir = 'rel-m-beta-lnx64-rpk-5' AND (branch IS NULL OR branch = 'release-mozilla-beta') AND (master IS NULL OR master = 'http://buildbot-master13.build.scl1.mozilla.com:8001/') AND (slave IS NULL OR slave = 'linux64-ix-slave03') ORDER BY lastclobber DESC LIMIT 1
    11300323 Query SELECT id, who, lastclobber FROM clobber_times WHERE builddir = 'm-beta-lnx64' AND (branch IS NULL OR branch = 'mozilla-beta') AND (master IS NULL OR master = 'http://buildbot-master13.build.scl1.mozilla.com:8001/') AND (slave IS NULL OR slave = 'linux64-ix-slave03') ORDER BY lastclobber DESC LIMIT 1
    11300323 Query SELECT id, who, lastclobber FROM clobber_times WHERE builddir = 'oak-lnx64-dbg' AND (branch IS NULL OR branch = 'oak') AND (master IS NULL OR master = 'http://buildbot-master13.build.scl1.mozilla.com:8001/') AND (slave IS NULL OR slave = 'linux64-ix-slave03') ORDER BY lastclobber DESC LIMIT 1
    ...
(whew!) while a UI hit looks like this:
    11300044 Query SELECT DISTINCT id, branch, builddir, buildername, slave FROM builds WHERE builddir NOT LIKE 'rel-%' AND branch='mozilla-inbound' ORDER BY branch ASC, buildername ASC
    11300044 Query SELECT DISTINCT buildername, builddir FROM builds WHERE builddir LIKE 'rel-%'
    11300044 Query SELECT DISTINCT buildername, builddir FROM builds WHERE builddir LIKE 'rel-%'
    .. 1411 times ..
    11300044 Query SELECT id, who, lastclobber FROM clobber_times WHERE builddir = 'm-in-andrd-dbg' AND (branch IS NULL OR branch = 'mozilla-inbound') AND (master IS NULL OR master = '') AND (slave IS NULL OR slave = 'linux-ix-slave01') ORDER BY lastclobber DESC LIMIT 1
    .. 1434 times, with different branches and slaves
So, there's some optimization to be done in the PHP here (on a new bug, plz).
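The UI log above is a per-row query pattern: the same `rel-%` query is repeated about 1411 times and a separate `clobber_times` lookup is issued about 1434 times. One common way to cut that down is to fetch the newest clobber for every builddir in a single grouped query and look results up in memory. The sketch below is in Python with an in-memory SQLite table standing in for MySQL. It is not the change that landed in bug 756532, it ignores the `master` and `slave` matching of the real query, and the `who` values, the integer timestamps and the `m-in-lnx64` builddir are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clobber_times ("
    "id INTEGER PRIMARY KEY, builddir TEXT, branch TEXT, who TEXT, lastclobber INTEGER)"
)
conn.executemany(
    "INSERT INTO clobber_times (builddir, branch, who, lastclobber) VALUES (?, ?, ?, ?)",
    [
        ("m-in-andrd-dbg", "mozilla-inbound", "alice", 100),
        ("m-in-andrd-dbg", "mozilla-inbound", "bob", 200),
        ("m-in-lnx64", None, "carol", 150),  # NULL branch applies to every branch
        ("m-beta-lnx64", "mozilla-beta", "dave", 300),
    ],
)


def latest_clobbers(conn, branch):
    """Map builddir -> (who, lastclobber) for one branch using a single query."""
    rows = conn.execute(
        """
        SELECT c.builddir, c.who, c.lastclobber
        FROM clobber_times AS c
        JOIN (
            SELECT builddir, MAX(lastclobber) AS newest
            FROM clobber_times
            WHERE branch IS NULL OR branch = ?
            GROUP BY builddir
        ) AS latest
          ON latest.builddir = c.builddir AND latest.newest = c.lastclobber
        WHERE c.branch IS NULL OR c.branch = ?
        """,
        (branch, branch),
    ).fetchall()
    return {builddir: (who, when) for builddir, who, when in rows}


print(latest_clobbers(conn, "mozilla-inbound"))
```

The page code would then call `latest_clobbers` once per request and index into the returned dictionary for each row, rather than going back to the database for every builddir. The repeated `rel-%` query can be handled the same way by running it once and reusing the result.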
I turned all clobberer instances over to using the scl3 VIPs. I don't expect that to fix the performance issues, but it's a good idea all the same :) I'll punt this back to releng now to work on the clobberer code itself.
Fixed by bug 756532.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Depends on: 756532
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: Tools → General