Closed Bug 1049430 Opened 11 years ago Closed 11 years ago

Frequent nagios ** PROBLEM alert - buildapi.pvt.build.mozilla.org/http - /buildapi/self-serve/jobs is CRITICAL **

Categories

(Release Engineering :: General, defect)

defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: emorley, Unassigned)


First started at 08:38 UK:
***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-06-2014 00:38:12
Additional Info: CRITICAL - Socket timeout after 10 seconds
http://m.allizom.org/http%2B-%2B/buildapi/self-serve/jobs
And then has recovered/regressed multiple times since. I don't know if it's fallout from the current hg.m.o ISE 500s/timeouts (bug 1040308).
Latest:
***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-06-2014 02:32:12
Additional Info: CRITICAL - Socket timeout after 10 seconds
It'll come back to hg.m.o issues. Buildapi periodically pulls these files:
hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-masters.json
and blocks on them.
Ah thank you. Perhaps we should add a timeout to the urlopen()s? http://mxr.mozilla.org/build-central/search?string=urlopen&find=buildapi
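For illustration, a minimal sketch of what a timeout on those fetches could look like. It is written against Python 3's `urllib.request` rather than whatever buildapi actually runs, and `fetch_json`, `BRANCHES_URL` and the `opener` parameter are hypothetical names, not buildapi's own. The 30-second default is only a placeholder.

```python
import json
import urllib.request

# Assumed URL, taken from the file paths quoted in the comment above.
BRANCHES_URL = (
    "http://hg.mozilla.org/build/tools/raw-file/default/"
    "buildfarm/maintenance/production-branches.json"
)


def fetch_json(url, timeout=30, opener=urllib.request.urlopen):
    """Fetch and decode a JSON document, giving up after `timeout` seconds.

    `opener` is a hypothetical hook so the network call can be replaced
    in tests; by default it is the standard urlopen().
    """
    # Without an explicit timeout, urlopen() uses the global socket
    # default, which is None (block indefinitely) unless it was changed.
    response = opener(url, timeout=timeout)
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
```

With something like this, a stalled hg.m.o request would raise after the timeout instead of holding the caller until the remote end answers.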
Depends on: 1040308
Depends on: 1049446
(In reply to Ed Morley [:edmorley] from comment #3) > Perhaps we should add a timeout to the urlopen()s? Filed bug 1049446.
I still got a recent wave of these even now that bug 1049446 has been deployed :-(
***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-12-2014 18:42:14
Additional Info: CRITICAL - Socket timeout after 10 seconds
***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-13-2014 09:42:08
Additional Info: CRITICAL - Socket timeout after 10 seconds
I verified bug 1049446 did get deployed correctly, and we're getting Tracebacks from unhandled timeouts after 30 seconds (should probably add handling for that). eg
2014-08-13 17:48:42,106 INFO [buildapi.lib.helpers] [MainThread] Fetching branches list from http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-08-13 17:48:55,118 INFO [buildapi.lib.helpers] [MainThread] Fetching branches list from http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-08-13 17:49:00,238 INFO [buildapi.lib.helpers] [MainThread] Fetching branches list from http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-08-13 17:49:25,139 ERROR [buildapi.lib.helpers] [MainThread] Error loading branches json; using old list
Traceback (most recent call last):
  File "/data/www/buildapi/virtualenv/lib/python2.7/site-packages/buildapi/lib/helpers.py", line 172, in get_branches
    branches = json.load(urllib2.urlopen(branches_url, timeout=30))
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
timeout: timed out
The request nagios is making should be getting to this bit of code:
https://hg.mozilla.org/build/buildapi/file/66f1d42de07d/buildapi/controllers/selfserve.py#l261
which is only interacting with a mysql database. For some reason that's taking longer than 10 seconds.
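One way the timeout handling mentioned above might look, as a rough sketch only. It is written for Python 3 (`urllib.request`) rather than the Python 2.7 `urllib2` call in the traceback. The module-level `_cached_branches` and the `opener` parameter are hypothetical names chosen for the example, not buildapi's actual code. The idea is to catch the timeout explicitly, log a single line, and keep serving the previously loaded list.

```python
import json
import logging
import socket
import urllib.error
import urllib.request

log = logging.getLogger(__name__)

# Hypothetical cache holding the last successfully loaded branch list.
_cached_branches = {}


def get_branches(url, timeout=30, opener=urllib.request.urlopen):
    """Return the branch list, falling back to the old one on failure."""
    global _cached_branches
    try:
        response = opener(url, timeout=timeout)
        try:
            _cached_branches = json.loads(response.read().decode("utf-8"))
        finally:
            response.close()
    except (socket.timeout, urllib.error.URLError, ValueError) as exc:
        # Timeouts, connection errors and malformed JSON all end up here;
        # the cache is left untouched, so callers still get the old list.
        log.warning("Could not refresh branches from %s (%s); using old list",
                    url, exc)
    return _cached_branches
```

This does not explain why the nagios check itself exceeds 10 seconds; it would only replace the traceback with a one-line warning.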
Where do we stand here, Nick? The bug is in the buildduty queue, though I have not seen errors over the past day or two...
Flags: needinfo?(nthomas)
Component: Buildduty → Tools
QA Contact: bugspam.Callek → hwine
Nagios hasn't reported this problem since 21 Aug, which is good, but I wouldn't say we understand where it came from in the first place. Hard to debug now, so resolving WFM.
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(nthomas)
Resolution: --- → WORKSFORME
Component: Tools → General