Closed
Bug 907514
Opened 12 years ago
Closed 12 years ago
relay boards are falling down in p4.releng.scl1
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dividehex, Assigned: dividehex)
Not sure what is causing this, but something seems to be overloading the relay boards in p4 to the point that they become unresponsive.
[15:54] <nagios-releng> Tue 15:54:42 PDT [4962] panda-relay-038.p4.releng.scl1.mozilla.com:http - /login.htm is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/http+-+/login.htm)
[15:54] <nagios-releng> Tue 15:54:42 PDT [4963] panda-relay-034.p4.releng.scl1.mozilla.com:http - /login.htm is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/http+-+/login.htm)
[15:54] <nagios-releng> Tue 15:54:42 PDT [4964] panda-relay-037.p4.releng.scl1.mozilla.com:http - /login.htm is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/http+-+/login.htm)
[15:54] <nagios-releng> Tue 15:54:42 PDT [4965] panda-relay-032.p4.releng.scl1.mozilla.com:http - /login.htm is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/http+-+/login.htm)
[15:54] <nagios-releng> Tue 15:54:42 PDT [4966] panda-relay-031.p4.releng.scl1.mozilla.com:http - /login.htm is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/http+-+/login.htm)
[16:00] <nagios-releng> Tue 16:00:31 PDT [4967] panda-relay-035.p4.releng.scl1.mozilla.com:Memory is WARNING: SNMP WARNING - Mem *7549796* Bytes (http://m.allizom.org/Memory)
[16:03] <nagios-releng> Tue 16:03:41 PDT [4968] panda-relay-035.p4.releng.scl1.mozilla.com:Mozpool relay health is UNKNOWN: UNKNOWN: Traceback (most recent call last): (http://m.allizom.org/Mozpool+relay+health)
[16:04] <nagios-releng> Tue 16:04:31 PDT [4969] panda-relay-038.p4.releng.scl1.mozilla.com:http - /login.htm is OK: HTTP OK: HTTP/1.1 200 OK - 4145 bytes in 0.098 second response time (http://m.allizom.org/http+-+/login.htm)
[16:04] <nagios-releng> Tue 16:04:31 PDT [4970] panda-relay-037.p4.releng.scl1.mozilla.com:http - /login.htm is OK: HTTP OK: HTTP/1.1 200 OK - 4145 bytes in 1.109 second response time (http://m.allizom.org/http+-+/login.htm)
[16:04] <nagios-releng> Tue 16:04:31 PDT [4971] panda-relay-034.p4.releng.scl1.mozilla.com:http - /login.htm is OK: HTTP OK: HTTP/1.1 200 OK - 4145 bytes in 1.509 second response time (http://m.allizom.org/http+-+/login.htm)
[16:04] <nagios-releng> Tue 16:04:31 PDT [4972] panda-relay-031.p4.releng.scl1.mozilla.com:http - /login.htm is OK: HTTP OK: HTTP/1.1 200 OK - 4145 bytes in 2.271 second response time (http://m.allizom.org/http+-+/login.htm)
[16:04] <nagios-releng> Tue 16:04:31 PDT [4973] panda-relay-032.p4.releng.scl1.mozilla.com:http - /login.htm is OK: HTTP OK: HTTP/1.1 200 OK - 4145 bytes in 3.176 second response time (http://m.allizom.org/http+-+/login.htm)
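To put a number on how long each board was unreachable, the alerts above can be paired up per host. This is only a rough sketch: the regular expression is inferred from the lines pasted here rather than from any documented Nagios format, and the function name `outage_seconds` is made up for illustration.

```python
import re
from datetime import datetime

# Pattern inferred from the nagios-releng lines pasted above (an assumption,
# not an official format): "<day> <HH:MM:SS> <tz> [<id>] <host>:<service> is <STATE>:"
ALERT_RE = re.compile(
    r"\w{3} (?P<time>\d{2}:\d{2}:\d{2}) \w+ \[\d+\] "
    r"(?P<host>[\w.-]+):(?P<service>.+?) is (?P<state>OK|WARNING|CRITICAL|UNKNOWN):"
)


def outage_seconds(lines, service="http - /login.htm"):
    """Return {host: seconds from the first CRITICAL to the next OK} for one service.

    The alerts carry no date, so this only works within a single day.
    """
    went_down = {}
    outages = {}
    for line in lines:
        m = ALERT_RE.search(line)
        if not m or m["service"] != service:
            continue
        stamp = datetime.strptime(m["time"], "%H:%M:%S")
        host = m["host"]
        if m["state"] == "CRITICAL":
            # Remember only the first CRITICAL for this host.
            went_down.setdefault(host, stamp)
        elif m["state"] == "OK" and host in went_down:
            outages[host] = int((stamp - went_down.pop(host)).total_seconds())
    return outages
```

Fed the alerts above, each of the five boards that went CRITICAL at 15:54:42 and recovered at 16:04:31 comes out at 589 seconds, just under ten minutes.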
Comment 1•12 years ago
The mozpool logs show several failed connections to the relay boards in p4:
127.0.0.1:56815 - - [20/Aug/2013 16:01:34] "HTTP/1.1 GET /api/relay/panda-relay-031/test/" - 200 OK
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 239, in process
    return self.handle()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 230, in handle
    return self._delegate(fn, self.fvars, args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 420, in _delegate
    return handle_class(cls)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 396, in handle_class
    return tocall(*args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/web/handlers.py", line 76, in wrapped
    return function(self, id, *args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/templeton/handlers.py", line 65, in wrap
    results = json.dumps(func(*a, **kw), cls=DateTimeJSONEncoder)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/bmm/handlers.py", line 33, in GET
    return { 'success' : a.test_two_way_comms.run(relay_name)}
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/async.py", line 75, in run
    raise TimeoutError
TimeoutError
127.0.0.1:56812 - - [20/Aug/2013 16:01:35] "HTTP/1.1 GET /api/relay/panda-relay-035/test/" - 500 Internal Server Error
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 239, in process
    return self.handle()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 230, in handle
    return self._delegate(fn, self.fvars, args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 420, in _delegate
    return handle_class(cls)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 396, in handle_class
    return tocall(*args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/web/handlers.py", line 76, in wrapped
    return function(self, id, *args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/templeton/handlers.py", line 65, in wrap
    results = json.dumps(func(*a, **kw), cls=DateTimeJSONEncoder)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/bmm/handlers.py", line 33, in GET
    return { 'success' : a.test_two_way_comms.run(relay_name)}
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/async.py", line 75, in run
    raise TimeoutError
TimeoutError
127.0.0.1:56813 - - [20/Aug/2013 16:01:35] "HTTP/1.1 GET /api/relay/panda-relay-032/test/" - 500 Internal Server Error
bmm.relay ERROR - [2013-08-20 16:01:35,124] timeout communicating with panda-relay-035.p4.releng.scl1.mozilla.com
bmm.relay ERROR - [2013-08-20 16:01:35,140] timeout communicating with panda-relay-032.p4.releng.scl1.mozilla.com
db.pool DEBUG - [2013-08-20 16:02:20,594] setting SO_KEEPALIVE on MySQL socket 4
127.0.0.1:56819 - - [20/Aug/2013 16:03:25] "HTTP/1.1 GET /api/relay/panda-relay-032/test/" - 200 OK
Traceback (most recent call last):
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 239, in process
    return self.handle()
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 230, in handle
    return self._delegate(fn, self.fvars, args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 420, in _delegate
    return handle_class(cls)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 396, in handle_class
    return tocall(*args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/web/handlers.py", line 76, in wrapped
    return function(self, id, *args)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/templeton/handlers.py", line 65, in wrap
    results = json.dumps(func(*a, **kw), cls=DateTimeJSONEncoder)
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/bmm/handlers.py", line 33, in GET
    return { 'success' : a.test_two_way_comms.run(relay_name)}
  File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/async.py", line 75, in run
    raise TimeoutError
TimeoutError
127.0.0.1:56820 - - [20/Aug/2013 16:03:35] "HTTP/1.1 GET /api/relay/panda-relay-035/test/" - 500 Internal Server Error
bmm.relay ERROR - [2013-08-20 16:03:35,278] timeout communicating with panda-relay-035.p4.releng.scl1.mozilla.com
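To see which boards are timing out most often, the `bmm.relay ERROR` lines in a log like the one above can be tallied per host. A minimal sketch, assuming the log format matches the excerpt exactly; the pattern and the helper name `count_relay_timeouts` are illustrative and not part of mozpool.

```python
import re
from collections import Counter

# Pattern inferred from the "bmm.relay ERROR" lines in the excerpt above
# (an assumption about the log format, not taken from mozpool itself).
TIMEOUT_RE = re.compile(
    r"bmm\.relay ERROR - \[(?P<stamp>[^\]]+)\] "
    r"timeout communicating with (?P<host>\S+)"
)


def count_relay_timeouts(log_lines):
    """Count 'timeout communicating with <host>' errors per relay host."""
    counts = Counter()
    for line in log_lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[m["host"]] += 1
    return counts
```

Calling `count_relay_timeouts(open(path))` on a saved log (where `path` is wherever the log was copied to) and then `.most_common()` on the result lists the worst-affected relays first.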
Updated•12 years ago
Assignee: relops → jwatkins
Comment 2•12 years ago
At this time I suspect the scans from scan1.ops.scl3.m.c are overwhelming the boards and starving them of resources. See bug 907528.
Comment 3•12 years ago
Closing as RESOLVED/FIXED, per https://bugzilla.mozilla.org/show_bug.cgi?id=907528#c7 and because I haven't seen any problems here since last week.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED