Closed Bug 964247 Opened 10 years ago Closed 10 years ago

slaveapi filed a bunch of "unreachable" slave bugs for slaves that aren't down now

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Details

Attachments

(1 file)

Either these were falsely filed, or a bunch of machines came back up shortly after the bug was filed. Specifically:
talos-r3-fed64-069
talos-r3-fed64-070
bld-lion-r5-076
talos-mtnlion-r5-006
talos-mtnlion-r5-009
talos-mtnlion-r5-069
talos-mtnlion-r5-079
talos-mtnlion-r5-082
talos-mtnlion-r5-089

Need to dig up stuff from the logs still.
Found this:
2014-01-24 20:49:32,619 - ERROR - talos-r3-fed64-069 - Caught exception during SSH reboot.
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 40, in reboot
    console.reboot()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
    rc, output = self.run_cmd(cmd)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 139, in run_cmd
    raise RemoteCommandError("Caught exception while running command.", output=output, rc=rc)
RemoteCommandError: Caught exception while running command.

2014-01-25 17:00:54,300 - ERROR - talos-mtnlion-r5-079 - Caught exception during SSH reboot.
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 40, in reboot
    console.reboot()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
    rc, output = self.run_cmd(cmd)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 139, in run_cmd
    raise RemoteCommandError("Caught exception while running command.", output=output, rc=rc)
RemoteCommandError: Caught exception while running command.


So, something is happening while running "sudo reboot". I'm trying to get more information still...
It looks like the reboot was successfully initiated and happened on talos-r3-fed64-069. From the logs:
Jan 24 20:49:53 talos-r3-fed64-069 rpc.statd[1057]: Caught signal 15, un-registering and exiting.
Jan 24 20:49:59 talos-r3-fed64-069 rsyslogd: [origin software="rsyslogd" swVersion="4.4.1" x-pid="1025" x-info="http://www.rsyslog.com"] exiting on signal 15.
Jan 24 20:49:27 talos-r3-fed64-069 sshd[3631]: Accepted password for root from 10.26.48.16 port 60310 ssh2
Jan 24 20:49:27 talos-r3-fed64-069 sshd[3631]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jan 24 20:49:34 talos-r3-fed64-069 sshd[3627]: Connection closed by 10.26.48.16

Followed by:
Jan 24 20:51:07 talos-r3-fed64-069 kernel: imklog 4.4.1, log source = /proc/kmsg started.
Jan 24 20:51:07 talos-r3-fed64-069 kernel: Initializing cgroup subsys cpuset
Jan 24 20:51:07 talos-r3-fed64-069 kernel: Initializing cgroup subsys cpu

...which indicates a fresh boot.

So, perhaps the way ssh exits when the machine goes down is causing SlaveAPI to think the reboot attempt failed? If so, we could ignore such failures and watch for the slave to reboot anyway...
A better solution would be to delay the reboot by a few seconds after requesting it, so the SSH session can close cleanly first, but on Mac and Linux the delay can only be given in whole minutes, which is way too long. Given that, ignoring the SSH failure and watching for the reboot seems like a good alternative.
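To illustrate the idea of ignoring the SSH failure and watching for the reboot, here is a minimal Python 3 sketch. It is not the attached patch and not SlaveAPI's actual code: `reboot_and_wait`, `run_reboot` and `is_alive` are hypothetical names, and `RemoteCommandError` is only a stand-in for the exception shown in the traceback above. The assumption is that the caller can supply one callable that issues the reboot over SSH and another that reports whether the slave currently responds.

```python
import time


class RemoteCommandError(Exception):
    """Stand-in for the exception seen in the traceback above."""


def reboot_and_wait(run_reboot, is_alive, timeout=300, interval=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Request a reboot, tolerate a failed SSH command, then confirm the
    reboot by watching the slave go down and come back up.

    run_reboot and is_alive are hypothetical callables supplied by the
    caller. Returns True if the slave was seen down and then up again
    before the timeout, False otherwise.
    """
    try:
        run_reboot()
    except RemoteCommandError:
        # The SSH session may have died because the reboot already started,
        # so do not treat this as a failed reboot yet.
        pass

    deadline = clock() + timeout
    seen_down = False
    while clock() < deadline:
        if not is_alive():
            seen_down = True
        elif seen_down:
            # The slave went away and is responding again: it rebooted.
            return True
        sleep(interval)
    return False
```

One caveat with this sketch: a slave that goes down and comes back faster than the polling interval would never be seen down, so a real implementation would probably compare uptime or boot time instead of relying on reachability alone.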
Attachment #8366107 - Flags: review?(bugspam.Callek)
Attachment #8366107 - Flags: review?(bugspam.Callek) → review+
Attachment #8366107 - Flags: checked-in+
This is in production and slaverebooter has been turned back on. I'm doing a manual run of it now to get caught up, then it will continue to run every 4 hours as before.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Tools → General