Closed Bug 964247 Opened 11 years ago Closed 11 years ago

slaveapi filed a bunch of "unreachable" slave bugs for slaves that aren't down now

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Details

Attachments

(1 file)

fix overly aggressive bug filing (patch, 1.79 KB), attached 11 years ago by bhearsum@mozilla.com (:bhearsum). Flags: Callek: review+, bhearsum: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Description

11 years ago

Either these were falsely filed, or a bunch of machines came back up shortly after the bug was filed. Specifically:

talos-r3-fed64-069
talos-r3-fed64-070
bld-lion-r5-076
talos-mtnlion-r5-006
talos-mtnlion-r5-009
talos-mtnlion-r5-069
talos-mtnlion-r5-079
talos-mtnlion-r5-082
talos-mtnlion-r5-089

Need to dig up stuff from the logs still.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 1

11 years ago

Found this:

42600 2014-01-24 20:49:32,619 - ERROR - talos-r3-fed64-069 - Caught exception during SSH reboot.
42601 Traceback (most recent call last):
42602   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 40, in reboot
42603     console.reboot()
42604   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
42605     rc, output = self.run_cmd(cmd)
42606   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 139, in run_cmd
42607     raise RemoteCommandError("Caught exception while running command.", output=output, rc=rc)
42608 RemoteCommandError: Caught exception while running command.

69837 2014-01-25 17:00:54,300 - ERROR - talos-mtnlion-r5-079 - Caught exception during SSH reboot.
69838 Traceback (most recent call last):
69839   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 40, in reboot
69840     console.reboot()
69841   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 148, in reboot
69842     rc, output = self.run_cmd(cmd)
69843   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 139, in run_cmd
69844     raise RemoteCommandError("Caught exception while running command.", output=output, rc=rc)
69845 RemoteCommandError: Caught exception while running command.

So, something is happening while running "sudo reboot". I'm trying to get more information still...

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 2

11 years ago

It looks like the reboot was successfully initiated and happened on talos-r3-fed64-069. From the logs:

Jan 24 20:49:53 talos-r3-fed64-069 rpc.statd[1057]: Caught signal 15, un-registering and exiting.
Jan 24 20:49:59 talos-r3-fed64-069 rsyslogd: [origin software="rsyslogd" swVersion="4.4.1" x-pid="1025" x-info="http://www.rsyslog.com"] exiting on signal 15.
Jan 24 20:49:27 talos-r3-fed64-069 sshd[3631]: Accepted password for root from 10.26.48.16 port 60310 ssh2
Jan 24 20:49:27 talos-r3-fed64-069 sshd[3631]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jan 24 20:49:34 talos-r3-fed64-069 sshd[3627]: Connection closed by 10.26.48.16

Followed by:

Jan 24 20:51:07 talos-r3-fed64-069 kernel: imklog 4.4.1, log source = /proc/kmsg started.
Jan 24 20:51:07 talos-r3-fed64-069 kernel: Initializing cgroup subsys cpuset
Jan 24 20:51:07 talos-r3-fed64-069 kernel: Initializing cgroup subsys cpu

...which indicates a fresh boot.

So, perhaps the way ssh is exiting is causing SlaveAPI to think the reboot attempt failed? If so, we could ignore such failures and watch for the slave to reboot anyways...
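The idea of ignoring the SSH failure and watching for the reboot anyway could look roughly like the sketch below. It is not the attached patch and not SlaveAPI's actual code: `reboot_and_wait`, the `is_alive` callback, the timeout values, and the local `RemoteCommandError` stand-in are all hypothetical names made up for illustration, and it is written for Python 3 rather than the Python 2.7 seen in the traceback.

```python
import time


class RemoteCommandError(Exception):
    """Local stand-in for the exception seen in the traceback (hypothetical)."""

    def __init__(self, msg, output=None, rc=None):
        super().__init__(msg)
        self.output = output
        self.rc = rc


def reboot_and_wait(console, is_alive, timeout=300, interval=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Request a reboot, tolerate an SSH error, then watch for down -> up.

    `console` is any object with a reboot() method and `is_alive` is a
    zero-argument callable returning True when the slave responds; both
    are assumptions for this sketch.
    """
    try:
        console.reboot()
    except RemoteCommandError:
        # The reboot itself may tear down the SSH session, so an error
        # here does not prove that the reboot failed to start.
        pass

    deadline = clock() + timeout
    seen_down = False
    while clock() < deadline:
        if not is_alive():
            seen_down = True   # the slave went away: reboot is in progress
        elif seen_down:
            return True        # the slave came back after being down
        sleep(interval)
    return False               # never observed a full down -> up cycle
```

One trade-off of this approach: a slave that goes down and comes back within a single polling interval would be reported as not rebooted, and a command that genuinely failed is only noticed once the timeout expires.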

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 3

11 years ago

Attached patch: fix overly aggressive bug filing

The better solution would be to delay the reboot by a few seconds after requesting it, but on Mac and Linux the delay can only be specified in whole minutes, which is way too long. Given that, this seems like a good alternative.

Attachment #8366107 - Flags: review?(bugspam.Callek)

Justin Wood (:Callek)

Updated

11 years ago

Attachment #8366107 - Flags: review?(bugspam.Callek) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

11 years ago

Attachment #8366107 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 4

11 years ago

This is in production and slaverebooter has been turned back on. I'm doing a manual run of it now to get caught up; after that it will continue to run every 4 hours as before.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

8 years ago

Component: Tools → General
