Closed
Bug 1049430
Opened 11 years ago
Closed 11 years ago
Frequent nagios ** PROBLEM alert - buildapi.pvt.build.mozilla.org/http - /buildapi/self-serve/jobs is CRITICAL **
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: emorley, Unassigned)
First started at 08:38 UK:
***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-06-2014 00:38:12
Additional Info:
CRITICAL - Socket timeout after 10 seconds
http://m.allizom.org/http%2B-%2B/buildapi/self-serve/jobs
It has then recovered and regressed multiple times since.
I don't know if it's fallout from the current hg.m.o ISE 500s/timeouts (bug 1040308).
Comment 1•11 years ago (Reporter)
Latest:
***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-06-2014 02:32:12
Additional Info:
CRITICAL - Socket timeout after 10 seconds
Comment 2•11 years ago
This will come back to the hg.m.o issues. Buildapi periodically pulls these files:
hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-masters.json
and blocks on them.
Comment 3•11 years ago (Reporter)
Ah thank you.
Perhaps we should add a timeout to the urlopen()s?
http://mxr.mozilla.org/build-central/search?string=urlopen&find=buildapi
Depends on: 1040308
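A minimal sketch of what a timeout on such a fetch might look like. This is not buildapi's actual code: the traceback later in this bug shows buildapi using Python 2.7's `urllib2.urlopen`, whereas this sketch uses the Python 3 equivalent, `urllib.request.urlopen`. The function name `fetch_json` and the `opener` parameter are hypothetical and exist only so the example can be exercised without network access.

```python
import json
import urllib.request

# URL taken from the log output in this bug.
BRANCHES_URL = (
    "http://hg.mozilla.org/build/tools/raw-file/default/"
    "buildfarm/maintenance/production-branches.json"
)


def fetch_json(url, timeout=30, opener=urllib.request.urlopen):
    """Fetch `url` and decode it as JSON (hypothetical helper).

    `timeout` is passed straight to urlopen, so a stalled connection
    raises an exception instead of blocking the caller indefinitely.
    """
    response = opener(url, timeout=timeout)
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        # Always release the connection, even if decoding fails.
        response.close()
```

Note that the `timeout` argument bounds each blocking socket operation (connect, each read), not the total duration of the request, so a server that trickles data slowly can still hold the caller for longer than the nominal value.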
Comment 4•11 years ago (Reporter)
(In reply to Ed Morley [:edmorley] from comment #3)
> Perhaps we should add a timeout to the urlopen()s?
Filed bug 1049446.
Comment 5•11 years ago (Reporter)
I still got a recent wave of these even now that bug 1049446 has been deployed :-(
***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-12-2014 18:42:14
Additional Info:
CRITICAL - Socket timeout after 10 seconds
Comment 6•11 years ago (Reporter)
***** Nagios *****
Notification Type: PROBLEM
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 08-13-2014 09:42:08
Additional Info:
CRITICAL - Socket timeout after 10 seconds
Comment 7•11 years ago
I verified that bug 1049446 was deployed correctly, and we're now getting tracebacks from unhandled timeouts after 30 seconds (we should probably add handling for that), e.g.:
2014-08-13 17:48:42,106 INFO [buildapi.lib.helpers] [MainThread] Fetching branches list from http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-08-13 17:48:55,118 INFO [buildapi.lib.helpers] [MainThread] Fetching branches list from http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-08-13 17:49:00,238 INFO [buildapi.lib.helpers] [MainThread] Fetching branches list from http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-08-13 17:49:25,139 ERROR [buildapi.lib.helpers] [MainThread] Error loading branches json; using old list
Traceback (most recent call last):
  File "/data/www/buildapi/virtualenv/lib/python2.7/site-packages/buildapi/lib/helpers.py", line 172, in get_branches
    branches = json.load(urllib2.urlopen(branches_url, timeout=30))
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
timeout: timed out
The request Nagios is making should reach this bit of code:
https://hg.mozilla.org/build/buildapi/file/66f1d42de07d/buildapi/controllers/selfserve.py#l261
which only interacts with a MySQL database. For some reason that is taking longer than 10 seconds.
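The handling suggested earlier in this comment could be sketched roughly as follows: catch the timeout explicitly, log a single warning line instead of a full traceback, and keep serving the previously fetched list. This is an illustrative Python 3 sketch under stated assumptions, not the code in buildapi's `helpers.py`; the logger name, the module-level cache `_cached_branches`, and the `opener` parameter are all hypothetical.

```python
import json
import logging
import socket
import urllib.error
import urllib.request

log = logging.getLogger("example.helpers")  # hypothetical logger name

# Last successfully fetched branch list (hypothetical module-level cache).
_cached_branches = {}


def get_branches(url, timeout=30, opener=urllib.request.urlopen):
    """Return the branch list, falling back to the cached copy on failure."""
    global _cached_branches
    try:
        response = opener(url, timeout=timeout)
        try:
            _cached_branches = json.loads(response.read().decode("utf-8"))
        finally:
            response.close()
    except (socket.timeout, urllib.error.URLError, ValueError) as exc:
        # Timeouts, connection errors, and malformed JSON are expected
        # while the upstream host is struggling: log one line, keep the
        # old list, and let the caller carry on.
        log.warning(
            "Could not refresh branches from %s (%s); using old list", url, exc
        )
    return _cached_branches
```

This only tidies up the failure path. The fetch still runs inline, so a caller can wait up to the full timeout before getting the cached list back, which is longer than the 10-second limit of the Nagios check.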
Comment 8•11 years ago
Where do we stand here, Nick? The bug is in the buildduty queue, though I have not seen errors over the past day or two.
Flags: needinfo?(nthomas)
Updated•11 years ago
Component: Buildduty → Tools
QA Contact: bugspam.Callek → hwine
Comment 9•11 years ago
Nagios hasn't reported this problem since 21 Aug, which is good, but I wouldn't say we understand where it came from in the first place. Hard to debug now, so resolving WFM.
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(nthomas)
Resolution: --- → WORKSFORME
Updated•8 years ago (Assignee)
Component: Tools → General