Closed Bug 965877 Opened 11 years ago Closed 11 years ago

slaveapi still files IT bugs for some slaves that aren't actually down

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Details

Attachments

(1 file)

Latest example is bug 965829. I think they all stem from a failed SSH reboot....
I found two unique tracebacks associated with these machines just before their bugs were filed.

2014-02-06 04:59:48,476 - ERROR - talos-linux64-ix-031 - Caught exception during SSH reboot.
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 47, in reboot
    console.reboot()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
    rc, output = self.run_cmd(cmd)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 92, in run_cmd
    self.connect()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 58, in connect
    self.client.connect(hostname=self.fqdn, username=username, password=p, timeout=timeout, look_for_keys=False, allow_agent=False)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/paramiko/client.py", line 311, in connect
    t.start_client()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/paramiko/transport.py", line 465, in start_client
    raise e
SSHException: Error reading SSH protocol banner

This one looks like an edge case that should've been handled by bug 964247 - but SSHException isn't caught by RemoteCommandError. We probably need to catch both of those types of errors...

2014-02-04 01:16:16,282 - ERROR - w64-ix-slave164 - Caught exception during IPMI reboot.
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 60, in reboot
    slave.ipmi.powercycle()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ipmi.py", line 49, in powercycle
    raise Exception()
Exception

This is poor exception raising from the ipmi client code, but it gets raised when a hard shutdown doesn't complete.
This seems like more of a freak case where the IPMI interface is acting funny. I'm inclined not to do anything about this unless we see more issues.
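For the SSH case above, catching both error types could look roughly like the sketch below. It is only an illustration: the two exception classes are local stand-ins (the real ones would be slaveapi's RemoteCommandError and paramiko's SSHException), and try_ssh_reboot and its arguments are hypothetical names, not the code in reboot.py.

```python
import logging

log = logging.getLogger("slaveapi.sketch")


class RemoteCommandError(Exception):
    """Stand-in for slaveapi's RemoteCommandError (real definition not shown in this bug)."""


class SSHException(Exception):
    """Stand-in for paramiko.SSHException, so the sketch runs without paramiko installed."""


def try_ssh_reboot(console, name):
    """Attempt an SSH reboot; return True if the command was sent, False if it failed.

    `console` is assumed to be any object with a reboot() method (hypothetical interface).
    """
    try:
        console.reboot()
    except (RemoteCommandError, SSHException) as e:
        # Both error types mean "the SSH reboot did not go through", so treat
        # them the same way instead of letting SSHException escape.
        log.warning("%s - SSH reboot failed: %s", name, e)
        return False
    return True
```

Any other exception type still propagates here, which is exactly the weakness of listing error types one by one.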
We also got a new IPMI failure today (https://bugzilla.mozilla.org/show_bug.cgi?id=969056 and https://bugzilla.mozilla.org/show_bug.cgi?id=969058):

2014-02-06 12:59:55,912 - INFO - w64-ix-slave139-mgmt.build.mozilla.org - Unable to establish IPMI session, retrying...
2014-02-06 12:59:55,912 - DEBUG - w64-ix-slave139-mgmt.build.mozilla.org - Return code was 1, output was:
2014-02-06 12:59:55,913 - DEBUG - Error: Unable to establish IPMI v2 / RMCP+ session
Unable to set Chassis Power Control to Soft

2014-02-06 12:59:55,913 - ERROR - w64-ix-slave139 - Caught exception during IPMI reboot.
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 60, in reboot
    slave.ipmi.powercycle()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ipmi.py", line 40, in powercycle
    self.poweroff(hard=False)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ipmi.py", line 32, in poweroff
    self.run_cmd("power soft")
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ipmi.py", line 74, in run_cmd
    return check_output(full_cmd, stderr=STDOUT)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/gevent_subprocess/gevent_subprocess.py", line 371, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['ipmitool', '-H', 'w64-ix-slave139-mgmt.build.mozilla.org', '-I', 'lanplus', '-U', 'releng', '-P', u'xxxxxxxxxxxx', 'power', 'soft']' returned non-zero exit status 1

Which was preceded by a different type of SSH exception:

2014-02-06 12:59:14,597 - ERROR - w64-ix-slave137 - Caught exception during SSH reboot.
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 47, in reboot
    console.reboot()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
    rc, output = self.run_cmd(cmd)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 92, in run_cmd
    self.connect()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 58, in connect
    self.client.connect(hostname=self.fqdn, username=username, password=p, timeout=timeout, look_for_keys=False, allow_agent=False)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/paramiko/client.py", line 305, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/paramiko/util.py", line 278, in retry_on_signal
    return function()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/paramiko/client.py", line 305, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/gevent/socket.py", line 384, in connect
    raise error(err, strerror(err))
error: [Errno 110] Connection timed out

It's becoming clear that we're never going to catch all of the possible things that could happen when trying to reboot a machine. In particular, it seems that the SSH reboot can throw a multitude of different errors. I think it's time to try the approach of eating any errors that occur during the _triggering_ of a reboot, and then continue to watch for the slave to go down and back up again afterwards.
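The idea of eating trigger errors and then watching the slave could be sketched as follows. This is a minimal illustration under stated assumptions, not the attached patch: reboot_and_wait, trigger, and is_alive are hypothetical names, and the timeout and polling values are arbitrary.

```python
import logging
import time

log = logging.getLogger("slaveapi.sketch")


def reboot_and_wait(trigger, is_alive, timeout=300, poll_interval=5,
                    sleep=time.sleep, clock=time.time):
    """Trigger a reboot, ignore trigger errors, then watch for down-then-up.

    trigger  -- hypothetical callable that asks the machine to reboot
    is_alive -- hypothetical callable returning True when the machine responds
    Returns True if the machine was seen down and then up again before the
    timeout, False otherwise.
    """
    try:
        trigger()
    except Exception:
        # Any failure here is only logged: the reboot may still have started,
        # so the observed state of the machine decides success, not this call.
        log.exception("Caught exception while triggering reboot; still watching.")

    deadline = clock() + timeout
    seen_down = False
    while clock() < deadline:
        if not is_alive():
            seen_down = True
        elif seen_down:
            # Machine went down earlier and is responding again.
            return True
        sleep(poll_interval)
    return False
```

With this shape, a problem would only be reported when the function returns False, that is, when the machine never completes a down-and-up cycle, regardless of which exception the trigger raised.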
Attachment #8372277 - Flags: review?(jhopkins)
Attachment #8372277 - Flags: review?(jhopkins) → review+
Comment on attachment 8372277: eat errors during reboot triggering

Landed and version bumped slaveapi in Puppet. After it gets deployed, I'll restart the servers to pick it up.
Attachment #8372277 - Flags: checked-in+
This landed in production today, but SlaveAPI is shut off until bug 970590 is fixed - hopefully tomorrow.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Sadly, there have still been bugs filed for slaves that aren't actually down.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Haven't seen any more of these in the last week or so, for what it's worth.
I'm optimistically closing this because it hasn't been seen again in over two weeks. The instances mentioned in comment #7 may have been me misreading logs?
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Component: Tools → General