Closed
Bug 964247
Opened 11 years ago
Closed 11 years ago
slaveapi filed a bunch of "unreachable" slave bugs for slaves that aren't down now
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: bhearsum)
Attachments
(1 file)
patch, 1.79 KB (Callek: review+, bhearsum: checked-in+)
Either these were falsely filed, or a bunch of machines came back up shortly after the bug was filed. Specifically:
talos-r3-fed64-069
talos-r3-fed64-070
bld-lion-r5-076
talos-mtnlion-r5-006
talos-mtnlion-r5-009
talos-mtnlion-r5-069
talos-mtnlion-r5-079
talos-mtnlion-r5-082
talos-mtnlion-r5-089
Need to dig up stuff from the logs still.
Comment 1 (Assignee) • 11 years ago
Found this:
42600 2014-01-24 20:49:32,619 - ERROR - talos-r3-fed64-069 - Caught exception during SSH reboot.
42601 Traceback (most recent call last):
42602 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 40, in reboot
42603 console.reboot()
42604 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
42605 rc, output = self.run_cmd(cmd)
42606 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 139, in run_cmd
42607 raise RemoteCommandError("Caught exception while running command.", output=output, rc=rc)
42608 RemoteCommandError: Caught exception while running command.
69837 2014-01-25 17:00:54,300 - ERROR - talos-mtnlion-r5-079 - Caught exception during SSH reboot.
69838 Traceback (most recent call last):
69839 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 40, in reboot
69840 console.reboot()
69841 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
69842 rc, output = self.run_cmd(cmd)
69843 File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 139, in run_cmd
69844 raise RemoteCommandError("Caught exception while running command.", output=output, rc=rc)
69845 RemoteCommandError: Caught exception while running command.
So, something is happening while running "sudo reboot". I'm trying to get more information still...
Comment 2 (Assignee) • 11 years ago
It looks like the reboot was successfully initiated and happened on talos-r3-fed64-069. From the logs:
Jan 24 20:49:53 talos-r3-fed64-069 rpc.statd[1057]: Caught signal 15, un-registering and exiting.
Jan 24 20:49:59 talos-r3-fed64-069 rsyslogd: [origin software="rsyslogd" swVersion="4.4.1" x-pid="1025" x-info="http://www.rsyslog.com"] exiting on signal 15.
Jan 24 20:49:27 talos-r3-fed64-069 sshd[3631]: Accepted password for root from 10.26.48.16 port 60310 ssh2
Jan 24 20:49:27 talos-r3-fed64-069 sshd[3631]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jan 24 20:49:34 talos-r3-fed64-069 sshd[3627]: Connection closed by 10.26.48.16
Followed by:
Jan 24 20:51:07 talos-r3-fed64-069 kernel: imklog 4.4.1, log source = /proc/kmsg started.
Jan 24 20:51:07 talos-r3-fed64-069 kernel: Initializing cgroup subsys cpuset
Jan 24 20:51:07 talos-r3-fed64-069 kernel: Initializing cgroup subsys cpu
...which indicates a fresh boot.
So, perhaps the way ssh is exiting is causing SlaveAPI to think the reboot attempt failed? If so, we could ignore such failures and watch for the slave to reboot anyways...
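A minimal sketch of that idea, not the attached patch: treat an error from the SSH reboot command as inconclusive, then confirm the reboot by watching the slave go down and come back. `reboot_and_wait`, `request_reboot`, and `is_alive` are hypothetical names, and `RemoteCommandError` is redefined here only as a stand-in for the SlaveAPI exception seen in the traceback.

```python
import time


class RemoteCommandError(Exception):
    """Stand-in for the SlaveAPI exception of the same name (assumption)."""


def reboot_and_wait(request_reboot, is_alive, timeout=300, interval=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Request a reboot and confirm it by observing a down/up transition.

    request_reboot: callable that issues the reboot (e.g. over SSH).
    is_alive: callable returning True when the slave answers (e.g. ping).
    Returns True if the slave was seen down and then up again before the
    timeout, False otherwise.
    """
    try:
        request_reboot()
    except RemoteCommandError:
        # The reboot can tear down the SSH session before the command
        # returns cleanly, so a failure here is not proof the reboot failed.
        pass

    deadline = clock() + timeout
    seen_down = False
    while clock() < deadline:
        if not is_alive():
            seen_down = True
        elif seen_down:
            # The slave went away and has come back: the reboot happened.
            return True
        sleep(interval)
    return False
```

One caveat with this approach: a slave that goes down and comes back entirely between two polls is never observed as down, so the polling interval needs to be short relative to the boot time.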
Comment 3 (Assignee) • 11 years ago
The better solution would be to delay the reboot by a few seconds after it is requested, but on Mac and Linux the delay can only be specified in whole minutes, which is way too long. Given that, this seems like a good alternative.
Attachment #8366107 - Flags: review?(bugspam.Callek)
Updated • 11 years ago
Attachment #8366107 - Flags: review?(bugspam.Callek) → review+
Updated (Assignee) • 11 years ago
Attachment #8366107 - Flags: checked-in+
Comment 4 (Assignee) • 11 years ago
This is in production and slaverebooter has been turned back on. I'm doing a manual run of it now to get caught up; after that it will continue to run every 4 hours as before.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated • 8 years ago
Component: Tools → General