Closed Bug 918932 Opened 12 years ago Closed 12 years ago

slaveapi doesn't escalate reboots properly when exceptions are thrown

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Details

Attachments

(1 file)

2013-09-20 11:17:43,278 - INFO - talos-r4-lion-003 - Getting inventory info
2013-09-20 11:17:43,717 - INFO - talos-r4-lion-003 - Getting bug info
2013-09-20 11:17:43,717 - INFO - Sending request: GET https://bugzilla-dev.allizom.org/rest/bug/talos-r4-lion-003
2013-09-20 11:17:44,005 - INFO - Got response: 200
2013-09-20 11:17:44,006 - INFO - 10.12.51.165 - Attempting to reboot
2013-09-20 11:18:05,006 - ERROR - talos-r4-lion-003 - Caught exception.
Traceback (most recent call last):
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 19, in reboot
    console.reboot()
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 137, in reboot
    rc, output = self.run_cmd(cmd)
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 81, in run_cmd
    self.connect()
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 50, in connect
    self.client.connect(hostname=self.fqdn, username=username, password=p, timeout=timeout, look_for_keys=False, allow_agent=False)
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/paramiko/client.py", line 305, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/paramiko/util.py", line 278, in retry_on_signal
    return function()
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/paramiko/client.py", line 305, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/gevent/socket.py", line 384, in connect
    raise error(err, strerror(err))
error: [Errno 110] Connection timed out
2013-09-20 11:18:05,006 - INFO - pdu1.r102-1.build.scl1.mozilla.com - Powercycling via PDU.
2013-09-20 11:18:06,210 - INFO - 10.26.48.43 - - [2013-09-20 11:18:06] "GET /slave/talos-r4-lion-003/action/reboot HTTP/1.1" 200 190 0.000954
2013-09-20 11:18:11,024 - INFO - 10.26.48.43 - - [2013-09-20 11:18:11] "GET /slave/bld-centos6-hp-008/action/reboot HTTP/1.1" 200 829 0.001024
2013-09-20 11:18:11,043 - ERROR - Something went wrong while processing!
Traceback (most recent call last):
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/slaveapi/processor.py", line 58, in _worker
    res, msg = action(slave, *args, **kwargs)
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 35, in reboot
    slave.pdu.powercycle()
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/slaveapi/clients/pdu.py", line 34, in powercycle
    self.poweroff()
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/slaveapi/clients/pdu.py", line 27, in poweroff
    self._run_cmd(self.off_cmd)
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/slaveapi/clients/pdu.py", line 45, in _run_cmd
    return check_output(full_cmd, stderr=STDOUT)
  File "/builds/slaveapi/dev/lib/python2.7/site-packages/gevent_subprocess/gevent_subprocess.py", line 371, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['snmpset', '-v', '1', '-c', 'private', u'pdu1.r102-1.build.scl1.mozilla.com', u'1.3.6.1.4.1.1718.3.2.3.1.11.1.2.5', 'i', '2']' returned non-zero exit status 1

It should escalate to a bug modification in this case. Probably just need a try/except somewhere.
Looks like this is only true for PDU/IPMI reboots. SSH ones already try/except. This patch should fix PDU/IPMI ones and improve the logging a little bit.
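For readers who want to see the general shape of such a fix, here is a minimal sketch. It is not the attached patch: `escalating_reboot`, the `methods` list, and the `file_bug` callback are hypothetical names chosen for illustration, and the real slaveapi code may be structured quite differently. The idea is simply that every reboot method gets its own try/except, so a failure in one (for example a non-zero exit from snmpset) moves on to the next step instead of aborting the whole action.

```python
import logging

log = logging.getLogger(__name__)


def escalating_reboot(name, methods, file_bug):
    """Try each reboot method in order; fall back to updating a bug.

    `methods` is a list of (label, callable) pairs and `file_bug` is a
    callable taking the slave name. All of these names are hypothetical.
    """
    for label, attempt in methods:
        try:
            attempt()
        except Exception:
            # Any failure (SSH timeout, PDU command error, ...) is logged
            # with its traceback and treated as "this method did not work".
            log.exception("%s - %s reboot failed, escalating", name, label)
            continue
        log.info("%s - rebooted via %s", name, label)
        return True, "rebooted via %s" % label
    # Every automated method raised: escalate to a bug modification.
    file_bug(name)
    return False, "all reboot methods failed; bug updated"
```

A caller would pass something like `[("SSH", console.reboot), ("PDU", slave.pdu.powercycle)]` as `methods`. This sketch only treats an exception as failure; a fuller version would presumably also check that the machine actually came back up before declaring success.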
Attachment #817823 - Flags: review?(jhopkins)
Attachment #817823 - Flags: review?(jhopkins) → review+
Comment on attachment 817823: reboot-escalation-slaveapi.diff

Landed. Will need to roll a new version of the package to deploy. Going to wait to see if there are more changes in the near future before doing that.
Attachment #817823 - Flags: checked-in+
The slaveapi hosts are getting upgraded for this along with bug 922858.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Tools → General