Closed
Bug 1049430
Opened 10 years ago
Closed 10 years ago
Frequent nagios ** PROBLEM alert - buildapi.pvt.build.mozilla.org/http - /buildapi/self-serve/jobs is CRITICAL **
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: emorley, Unassigned)
References
Details
First started at 08:38 UK:

***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-06-2014 00:38:12
Additional Info: CRITICAL - Socket timeout after 10 seconds
http://m.allizom.org/http%2B-%2B/buildapi/self-serve/jobs

And then has recovered/regressed multiple times since. I don't know if it's fallout from the current hg.m.o ISE 500s/timeouts (bug 1040308).
Reporter | ||
Comment 1•10 years ago
Latest:

***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-06-2014 02:32:12
Additional Info: CRITICAL - Socket timeout after 10 seconds
Comment 2•10 years ago
It'll come back to hg.m.o issues. Buildapi pulls these files periodically:

hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-masters.json

and blocks on them.
Reporter | ||
Comment 3•10 years ago
Ah thank you. Perhaps we should add a timeout to the urlopen()s? http://mxr.mozilla.org/build-central/search?string=urlopen&find=buildapi
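A minimal sketch of what such a timeout might look like. It assumes Python 3's `urllib.request` rather than the Python 2 `urllib2` that buildapi used; `fetch_json` and its `opener` parameter are hypothetical names for illustration, not buildapi code, and the `http://` scheme on the URL is assumed.

```python
import json
from urllib.request import urlopen

# URL from comment 2; the scheme is an assumption.
BRANCHES_URL = (
    "http://hg.mozilla.org/build/tools/raw-file/default/"
    "buildfarm/maintenance/production-branches.json"
)


def fetch_json(url, timeout=30, opener=urlopen):
    """Fetch and decode a JSON document, giving up after `timeout` seconds.

    With no timeout and no global socket default set, urlopen() can block
    indefinitely on a slow server. `opener` is a hypothetical injection
    point so the function can be exercised without network access.
    """
    response = opener(url, timeout=timeout)
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
```

A caller would use `fetch_json(BRANCHES_URL, timeout=30)`. When the timeout expires an exception is raised, so the caller still has to decide what to do in that case.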
Depends on: 1040308
Reporter | ||
Comment 4•10 years ago
(In reply to Ed Morley [:edmorley] from comment #3)
> Perhaps we should add a timeout to the urlopen()s?

Filed bug 1049446.
Reporter | ||
Comment 5•10 years ago
I still got a recent wave of these even now that bug 1049446 has been deployed :-(

***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-12-2014 18:42:14
Additional Info: CRITICAL - Socket timeout after 10 seconds
Reporter | ||
Comment 6•10 years ago
***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-13-2014 09:42:08
Additional Info: CRITICAL - Socket timeout after 10 seconds
Comment 7•10 years ago
I verified bug 1049446 did get deployed correctly, and we're getting Tracebacks from unhandled timeouts after 30 seconds (should probably add handling for that). eg

2014-08-13 17:48:42,106 INFO [buildapi.lib.helpers] [MainThread] Fetching branches list from http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-08-13 17:48:55,118 INFO [buildapi.lib.helpers] [MainThread] Fetching branches list from http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-08-13 17:49:00,238 INFO [buildapi.lib.helpers] [MainThread] Fetching branches list from http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-08-13 17:49:25,139 ERROR [buildapi.lib.helpers] [MainThread] Error loading branches json; using old list
Traceback (most recent call last):
  File "/data/www/buildapi/virtualenv/lib/python2.7/site-packages/buildapi/lib/helpers.py", line 172, in get_branches
    branches = json.load(urllib2.urlopen(branches_url, timeout=30))
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
timeout: timed out

The request nagios is making should be getting to this bit of code:
https://hg.mozilla.org/build/buildapi/file/66f1d42de07d/buildapi/controllers/selfserve.py#l261
which is only interacting with a mysql database. For some reason that's taking longer than 10 seconds.
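On the timeout handling: the ERROR line above suggests the exception is already caught somewhere and the old list reused, just with a full traceback each time. A rough sketch of narrower handling, assuming Python 3 (`urllib.request`, not the `urllib2` in the traceback); the `_cache` dict, the `opener` parameter, and this version of `get_branches` are hypothetical and not the code in `helpers.py`.

```python
import json
import logging
import socket
from urllib.error import URLError
from urllib.request import urlopen

log = logging.getLogger(__name__)

# Last successfully fetched branch list (hypothetical module-level cache).
_cache = {"branches": {}}


def get_branches(branches_url, timeout=30, opener=urlopen):
    """Return the branch list, keeping the previous copy if the fetch fails."""
    try:
        response = opener(branches_url, timeout=timeout)
        try:
            _cache["branches"] = json.loads(response.read().decode("utf-8"))
        finally:
            response.close()
    except (socket.timeout, URLError, ValueError) as exc:
        # Expected failures: a slow or erroring server, or an undecodable body.
        # Log a single line rather than a traceback and serve the old list.
        log.warning("Could not refresh branches from %s (%s); using old list",
                    branches_url, exc)
    return _cache["branches"]
```

This only quiets the logging and makes the fallback explicit. The calling request still waits up to `timeout` seconds for the fetch, so it would not by itself explain or fix a 10 second Nagios timeout.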
Comment 8•10 years ago
Where do we stand here, Nick? This bug is in the buildduty queue, though I have not seen errors over the past day or two.
Flags: needinfo?(nthomas)
Updated•10 years ago
Component: Buildduty → Tools
QA Contact: bugspam.Callek → hwine
Comment 9•10 years ago
Nagios hasn't reported this problem since 21 Aug, which is good, but I wouldn't say we understand where it came from in the first place. Hard to debug now, so resolving WFM.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(nthomas)
Resolution: --- → WORKSFORME
Assignee | ||
Updated•7 years ago
Component: Tools → General