Closed
Bug 965877
Opened 11 years ago
Closed 11 years ago
slaveapi still files IT bugs for some slaves that aren't actually down
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: bhearsum)
Details
Attachments
(1 file)
2.72 KB, patch | jhopkins: review+ | bhearsum: checked-in+
Latest example is bug 965829. I think they all stem from a failed SSH reboot....
Assignee
Comment 1•11 years ago
Assignee
Comment 2•11 years ago
I found two unique tracebacks associated with these machines just before their bugs were filed.
> 46137 2014-02-06 04:59:48,476 - ERROR - talos-linux64-ix-031 - Caught exception during SSH reboot.
> 46138 Traceback (most recent call last):
> 46139 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 47, in reboot
> 46140 console.reboot()
> 46141 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
> 46142 rc, output = self.run_cmd(cmd)
> 46143 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 92, in run_cmd
> 46144 self.connect()
> 46145 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 58, in connect
> 46146 self.client.connect(hostname=self.fqdn, username=username, password=p, timeout=timeout, look_for_keys=False, allow_agent=False)
> 46147 File "/builds/slaveapi/prod/lib/python2.7/site-packages/paramiko/client.py", line 311, in connect
> 46148 t.start_client()
> 46149 File "/builds/slaveapi/prod/lib/python2.7/site-packages/paramiko/transport.py", line 465, in start_client
> 46150 raise e
> 46151 SSHException: Error reading SSH protocol banner
This one looks like an edge case that should've been handled by bug 964247, but paramiko raises SSHException here, and the RemoteCommandError handling doesn't catch that. We probably need to catch both of those types of errors...
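For illustration, catching both error types around the SSH reboot could look roughly like the sketch below. The two exception classes are local stand-ins so the example runs on its own (paramiko's real class is `paramiko.SSHException`; slaveapi's `RemoteCommandError` definition isn't shown in this bug), and `try_ssh_reboot` and the `console` interface are hypothetical names, not slaveapi's actual code.

```python
import logging

log = logging.getLogger(__name__)


class SSHException(Exception):
    """Stand-in for paramiko.SSHException so the sketch runs without paramiko."""


class RemoteCommandError(Exception):
    """Stand-in for slaveapi's RemoteCommandError (assumed; not shown in this bug)."""


def try_ssh_reboot(name, console):
    """Return True if the reboot command was sent, False if SSH failed.

    `console` is any object with a reboot() method (hypothetical interface).
    """
    try:
        console.reboot()
        return True
    except (RemoteCommandError, SSHException) as e:
        # Either failure means SSH could not deliver the reboot, so the caller
        # can fall through to the next reboot method instead of aborting.
        log.warning("%s - SSH reboot failed: %s", name, e)
        return False
```

This only covers the two listed types; any other exception would still propagate to the caller.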
78534 2014-02-04 01:16:16,282 - ERROR - w64-ix-slave164 - Caught exception during IPMI reboot.
78535 Traceback (most recent call last):
78536 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 60, in reboot
78537 slave.ipmi.powercycle()
78538 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ipmi.py", line 49, in powercycle
78539 raise Exception()
78540 Exception
This is poor exception raising from the ipmi client code (a bare Exception with no message), but it gets raised when a hard shutdown doesn't complete. This seems like more of a freak case where the IPMI interface is acting funny. I'm inclined not to do anything about this unless we see more issues.
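If the IPMI client does get touched later, a more descriptive exception would make this kind of log entry easier to read. The following is only a sketch: `PowerOffTimeout`, `wait_for_power_off`, and the `is_powered_off` callable are hypothetical names, and the polling loop is an assumption about how the client waits, not a copy of ipmi.py.

```python
import time


class IPMIError(Exception):
    """Hypothetical base class for IPMI client failures."""


class PowerOffTimeout(IPMIError):
    """Raised when a hard shutdown does not complete in time."""

    def __init__(self, host, waited):
        message = "%s did not power off after hard shutdown (waited %.0fs)" % (host, waited)
        super(PowerOffTimeout, self).__init__(message)
        self.host = host
        self.waited = waited


def wait_for_power_off(host, is_powered_off, attempts=6, interval=5.0, sleep=time.sleep):
    """Poll is_powered_off() up to `attempts` times, then raise with context."""
    for _ in range(attempts):
        if is_powered_off():
            return
        sleep(interval)
    # Unlike a bare Exception(), this says which host failed and how long we waited.
    raise PowerOffTimeout(host, attempts * interval)
```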
Assignee
Comment 3•11 years ago
We also got a new IPMI failure today (https://bugzilla.mozilla.org/show_bug.cgi?id=969056 and https://bugzilla.mozilla.org/show_bug.cgi?id=969058):
88476 2014-02-06 12:59:55,912 - INFO - w64-ix-slave139-mgmt.build.mozilla.org - Unable to establish IPMI session, retrying...
88477 2014-02-06 12:59:55,912 - DEBUG - w64-ix-slave139-mgmt.build.mozilla.org - Return code was 1, output was:
88478 2014-02-06 12:59:55,913 - DEBUG - Error: Unable to establish IPMI v2 / RMCP+ session^M
88479 Unable to set Chassis Power Control to Soft^M
88480
88481 2014-02-06 12:59:55,913 - ERROR - w64-ix-slave139 - Caught exception during IPMI reboot.
88482 Traceback (most recent call last):
88483 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 60, in reboot
88484 slave.ipmi.powercycle()
88485 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ipmi.py", line 40, in powercycle
88486 self.poweroff(hard=False)
88487 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ipmi.py", line 32, in poweroff
88488 self.run_cmd("power soft")
88489 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ipmi.py", line 74, in run_cmd
88490 return check_output(full_cmd, stderr=STDOUT)
88491 File "/builds/slaveapi/prod/lib/python2.7/site-packages/gevent_subprocess/gevent_subprocess.py", line 371, in check_output
88492 raise CalledProcessError(retcode, cmd, output=output)
88493 CalledProcessError: Command '['ipmitool', '-H', 'w64-ix-slave139-mgmt.build.mozilla.org', '-I', 'lanplus', '-U', 'releng', '-P', u'xxxxxxxxxxxx', 'power', 'soft']' returned non-zero exit status 1
Which was preceded by a different type of SSH exception:
88190 2014-02-06 12:59:14,597 - ERROR - w64-ix-slave137 - Caught exception during SSH reboot.
88191 Traceback (most recent call last):
88192 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 47, in reboot
88193 console.reboot()
88194 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
88195 rc, output = self.run_cmd(cmd)
88196 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 92, in run_cmd
88197 self.connect()
88198 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 58, in connect
88199 self.client.connect(hostname=self.fqdn, username=username, password=p, timeout=timeout, look_for_keys=False, allow_agent=False)
88200 File "/builds/slaveapi/prod/lib/python2.7/site-packages/paramiko/client.py", line 305, in connect
88201 retry_on_signal(lambda: sock.connect(addr))
88202 File "/builds/slaveapi/prod/lib/python2.7/site-packages/paramiko/util.py", line 278, in retry_on_signal
88203 return function()
88204 File "/builds/slaveapi/prod/lib/python2.7/site-packages/paramiko/client.py", line 305, in <lambda>
88205 retry_on_signal(lambda: sock.connect(addr))
88206 File "/builds/slaveapi/prod/lib/python2.7/site-packages/gevent/socket.py", line 384, in connect
88207 raise error(err, strerror(err))
88208 error: [Errno 110] Connection timed out
It's becoming clear that we're never going to catch all of the possible things that could happen when trying to reboot a machine. In particular, it seems that the SSH reboot can throw a multitude of different errors. I think it's time to try the approach of eating any errors that occur during the _triggering_ of a reboot, and then continue to watch for the slave to go down and back up again afterwards.
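Roughly, that approach could be sketched as follows. This is not the attached patch: `trigger_reboot`, `wait_for`, `reboot`, the `triggers` list, and the `is_alive` callable are all hypothetical names, and polling by attempt count is an assumption made to keep the example small.

```python
import logging
import time

log = logging.getLogger(__name__)


def trigger_reboot(name, trigger):
    """Call one reboot trigger, logging and swallowing any exception it raises."""
    try:
        trigger()
    except Exception:
        # Deliberately broad: a failed trigger does not prove the reboot failed,
        # so we record the traceback and let the caller watch the slave anyway.
        log.exception("%s - Caught exception while triggering reboot.", name)


def wait_for(predicate, attempts, interval, sleep=time.sleep):
    """Poll predicate() up to `attempts` times; return True as soon as it holds."""
    for _ in range(attempts):
        if predicate():
            return True
        sleep(interval)
    return False


def reboot(name, triggers, is_alive, attempts=60, interval=5.0, sleep=time.sleep):
    """Try each trigger in order; succeed once the slave goes down and comes back."""
    for trigger in triggers:
        trigger_reboot(name, trigger)
        went_down = wait_for(lambda: not is_alive(), attempts, interval, sleep)
        if went_down and wait_for(is_alive, attempts, interval, sleep):
            return True
    # Only after every method has been tried and watched would a bug be filed.
    return False
```

One caveat with this shape: a slave that reboots faster than the polling interval is never observed as down, so the interval needs to be short relative to the reboot time.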
Assignee
Comment 4•11 years ago
Attachment #8372277 -
Flags: review?(jhopkins)
Updated•11 years ago
Attachment #8372277 -
Flags: review?(jhopkins) → review+
Assignee
Comment 5•11 years ago
Comment on attachment 8372277 [details] [diff] [review]
eat errors during reboot triggering
Landed and version bumped slaveapi in Puppet. After it gets deployed, I'll restart the servers to pick it up.
Attachment #8372277 -
Flags: checked-in+
Assignee
Comment 6•11 years ago
This landed in production today, but SlaveAPI is shut off until bug 970590 is fixed - hopefully tomorrow.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee
Comment 7•11 years ago
Sadly, there have still been bugs filed for slaves that aren't actually down.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 8•11 years ago
Haven't seen any more of these in the last week or so, for what it's worth.
Assignee
Comment 9•11 years ago
I'm optimistically closing this because it hasn't been seen again in over two weeks. The instances mentioned in comment #7 may have been me misreading logs?
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•8 years ago
Component: Tools → General