Closed Bug 964247 Opened 11 years ago Closed 11 years ago

slaveapi filed a bunch of "unreachable" slave bugs for slaves that aren't down now

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Details

Attachments

(1 file)

Either these were falsely filed, or a bunch of machines came back up shortly after the bug was filed. Specifically:
talos-r3-fed64-069
talos-r3-fed64-070
bld-lion-r5-076
talos-mtnlion-r5-006
talos-mtnlion-r5-009
talos-mtnlion-r5-069
talos-mtnlion-r5-079
talos-mtnlion-r5-082
talos-mtnlion-r5-089

Need to dig up stuff from the logs still.
Found this:

2014-01-24 20:49:32,619 - ERROR - talos-r3-fed64-069 - Caught exception during SSH reboot.
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 40, in reboot
    console.reboot()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
    rc, output = self.run_cmd(cmd)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 139, in run_cmd
    raise RemoteCommandError("Caught exception while running command.", output=output, rc=rc)
RemoteCommandError: Caught exception while running command.

2014-01-25 17:00:54,300 - ERROR - talos-mtnlion-r5-079 - Caught exception during SSH reboot.
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 40, in reboot
    console.reboot()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
    rc, output = self.run_cmd(cmd)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 139, in run_cmd
    raise RemoteCommandError("Caught exception while running command.", output=output, rc=rc)
RemoteCommandError: Caught exception while running command.

So, something is happening while running "sudo reboot". I'm trying to get more information still...
It looks like the reboot was successfully initiated and happened on talos-r3-fed64-069. From the logs:

Jan 24 20:49:53 talos-r3-fed64-069 rpc.statd[1057]: Caught signal 15, un-registering and exiting.
Jan 24 20:49:59 talos-r3-fed64-069 rsyslogd: [origin software="rsyslogd" swVersion="4.4.1" x-pid="1025" x-info="http://www.rsyslog.com"] exiting on signal 15.

Jan 24 20:49:27 talos-r3-fed64-069 sshd[3631]: Accepted password for root from 10.26.48.16 port 60310 ssh2
Jan 24 20:49:27 talos-r3-fed64-069 sshd[3631]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jan 24 20:49:34 talos-r3-fed64-069 sshd[3627]: Connection closed by 10.26.48.16

Followed by:

Jan 24 20:51:07 talos-r3-fed64-069 kernel: imklog 4.4.1, log source = /proc/kmsg started.
Jan 24 20:51:07 talos-r3-fed64-069 kernel: Initializing cgroup subsys cpuset
Jan 24 20:51:07 talos-r3-fed64-069 kernel: Initializing cgroup subsys cpu

...which indicates a fresh boot. So, perhaps the way ssh is exiting is causing SlaveAPI to think the reboot attempt failed? If so, we could ignore such failures and watch for the slave to reboot anyways...
The better solution to this would be to delay the reboot by a few seconds after requesting it, but on Mac and Linux the delay can only be given in multiples of minutes, which is way too long. Given that, this seems like a good alternative.
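For readers following along, the idea of ignoring the SSH failure and watching for the slave to reboot anyway might look roughly like the sketch below. This is not the attached patch. The `reboot_and_wait` function, the `is_alive` callable, and the timeout values are all hypothetical, and `RemoteCommandError` here is a local stand-in for the slaveapi exception seen in the traceback, not its real definition.

```python
import time


class RemoteCommandError(Exception):
    """Local stand-in for slaveapi's exception; the real signature is not reproduced."""


def reboot_and_wait(console, is_alive, timeout=300, interval=5, sleep=time.sleep):
    """Request a reboot, tolerate a failed SSH command, then confirm by observation.

    console  -- hypothetical object whose reboot() may raise RemoteCommandError
    is_alive -- hypothetical callable returning True when the slave responds
    Returns True only if the slave was seen going down and then coming back up.
    """
    try:
        console.reboot()
    except RemoteCommandError:
        # The reboot itself may tear down the SSH session, so a failure here
        # does not prove the reboot was never initiated. Keep going and watch.
        pass

    waited = 0
    seen_down = False
    while waited < timeout:
        if not is_alive():
            seen_down = True   # slave stopped responding: reboot is under way
        elif seen_down:
            return True        # slave is back after having been down
        sleep(interval)
        waited += interval
    return False               # never saw a full down/up cycle in time
```

With this shape, a slave is only reported as failing to reboot when no down/up cycle is observed within the timeout, rather than whenever the SSH command exits uncleanly.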
Attachment #8366107 - Flags: review?(bugspam.Callek)
Attachment #8366107 - Flags: review?(bugspam.Callek) → review+
Attachment #8366107 - Flags: checked-in+
This is in production and slaverebooter has been turned back on. I'm doing a manual run of it now to get caught up; then it will continue to run every 4 hours as before.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Tools → General
