Closed Bug 691244 Opened 14 years ago Closed 14 years ago

SeaMonkey Idle Slaves don't reboot properly...

Categories

(SeaMonkey :: Release Engineering, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: Callek)

Details

Attachments

(1 file)

Ok, so I recently deployed Buildbot-0.8.4-pre-moz2 to the Linux slaves and enabled Idleizer on them. Our master is 0.8.2, just like the MoCo masters at the moment.

I just noticed that all but one of the slaves are disconnected at the moment. We should figure out what is wrong with this setup and fix it, even though SeaMonkey machines rarely go idle.

Dustin, do you have any ideas, given the data I am about to provide below?

twistd.log:

2011-10-01 12:19:20-0700 [-] command finished with signal None, exit code 0, elapsedTime: 0.322054
2011-10-01 12:19:20-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x8fe79cc>
2011-10-01 19:19:43-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2011-10-01 19:19:43-0700 [-] Telling the master we want to shutdown after any running builds are finished
2011-10-01 19:19:43-0700 [Broker,client] Master does not support slave initiated shutdown. Upgrade master to 0.8.3 or later to use this feature.
2011-10-01 19:19:43-0700 [Broker,client] rebooting NOW, since the master won't talk to us
2011-10-01 19:19:43-0700 [Broker,client] Invoking platform-specific reboot command
2011-10-01 19:19:44-0700 [Broker,client] lost remote
[... the same "lost remote" line repeats dozens more times, all at 19:19:44 ...]
2011-10-01 19:19:44-0700 [Broker,client] Lost connection to cb-seamonkey-linuxmaster-01.mozilla.org: 9010
2011-10-01 19:19:44-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x8de00ec>
2011-10-01 19:19:44-0700 [-] Main loop terminated.
2011-10-01 19:19:44-0700 [-] Server Shut Down.
[seabld@cb-sea-linux-tbox ~]$

==============
buildbot.tac (sanitized)
==============

[seabld@cb-sea-linux-tbox ~]$ cat /builds/slave/buildbot.tac
from twisted.application import service
from buildslave.bot import BuildSlave

maxdelay = 300
buildmaster_host = r<masterURL>
passwd = 'somethingS3CRET'
maxRotatedFiles = None
basedir = r'/builds/slave'
umask = 002
slavename = 'cb-sea-linux-tbox'
usepty = False
rotateLength = 1000000
port = 90210 # Yes not really
keepalive = None

application = service.Application('buildslave')
try:
    from twisted.python.logfile import LogFile
    from twisted.python.log import ILogObserver, FileLogObserver
    logfile = LogFile.fromFullPath("twistd.log", rotateLength=rotateLength,
                                   maxRotatedFiles=maxRotatedFiles)
    application.setComponent(ILogObserver, FileLogObserver(logfile).emit)
except ImportError:
    pass # old Twisted install - mostly on geriatric slaves
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty, umask=umask, maxdelay=maxdelay)
s.setServiceParent(application)

# enable idleizer
from buildslave import idleizer
idlz = idleizer.Idleizer(s,
    # 7 hours idle time before a reboot
    max_idle_time=3600*7,
    # 1 hour disconnect from a master before a reboot
    max_disconnected_time=3600*1)
idlz.setServiceParent(application)

=======================
Exceptions on master
=======================

The following exceptions (total 3) were detected on cb-seamonkey-linuxmaster-01 master01:

Exception in /builds/buildbot/master01/master/twistd.log.1:
2011-10-01 19:19:32-0700 [Broker,819,63.245.212.102] Unhandled Error
    Traceback (most recent call last):
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/banana.py", line 153, in gotItem
        self.callExpressionReceived(item)
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/banana.py", line 116, in callExpressionReceived
        self.expressionReceived(obj)
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 514, in expressionReceived
        method(*sexp[1:])
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 826, in proto_message
        self._recvMessage(self.localObjectForID, requestID, objectID, message, answerRequired, netArgs, netKw)
    --- <exception caught here> ---
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 840, in _recvMessage
        netResult = object.remoteMessageReceived(self, message, netArgs, netKw)
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 223, in perspectiveMessageReceived
        method = getattr(self, "perspective_%s" % message)
    exceptions.AttributeError: BuildSlave instance has no attribute 'perspective_shutdown'
--------------------------------------------------------------------------------
Exception in /builds/buildbot/master01/master/twistd.log.1:
2011-10-01 19:19:48-0700 [Broker,571,63.245.210.16] Unhandled Error
    Traceback (most recent call last):
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/banana.py", line 153, in gotItem
        self.callExpressionReceived(item)
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/banana.py", line 116, in callExpressionReceived
        self.expressionReceived(obj)
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 514, in expressionReceived
        method(*sexp[1:])
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 826, in proto_message
        self._recvMessage(self.localObjectForID, requestID, objectID, message, answerRequired, netArgs, netKw)
    --- <exception caught here> ---
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 840, in _recvMessage
        netResult = object.remoteMessageReceived(self, message, netArgs, netKw)
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 223, in perspectiveMessageReceived
        method = getattr(self, "perspective_%s" % message)
    exceptions.AttributeError: BuildSlave instance has no attribute 'perspective_shutdown'
--------------------------------------------------------------------------------
Exception in /builds/buildbot/master01/master/twistd.log.1:
2011-10-01 19:24:44-0700 [Broker,567,63.245.210.36] Unhandled Error
    Traceback (most recent call last):
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/banana.py", line 153, in gotItem
        self.callExpressionReceived(item)
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/banana.py", line 116, in callExpressionReceived
        self.expressionReceived(obj)
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 514, in expressionReceived
        method(*sexp[1:])
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 826, in proto_message
        self._recvMessage(self.localObjectForID, requestID, objectID, message, answerRequired, netArgs, netKw)
    --- <exception caught here> ---
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 840, in _recvMessage
        netResult = object.remoteMessageReceived(self, message, netArgs, netKw)
      File "/builds/buildbot/master01/lib/python2.6/site-packages/twisted/spread/pb.py", line 223, in perspectiveMessageReceived
        method = getattr(self, "perspective_%s" % message)
    exceptions.AttributeError: BuildSlave instance has no attribute 'perspective_shutdown'
I rebooted all the slaves except cn-sea-qm-centos5-01, which I'm leaving in its "twisted/buildbot is shut down but machine still up" state for now, in case it helps isolate the issue.
So the exceptions are normal. The slave tries to shut itself down gracefully, but the old masters don't support that, so they log an exception and the slave falls back to just rebooting. My guess is that the code the slaves use to reboot isn't working, for whatever reason. On Linux, that's running 'sudo reboot'. Should that work?
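To illustrate why the master logs that AttributeError, here is a minimal sketch of the name-based dispatch visible in the traceback (getattr(self, "perspective_%s" % message)). It is a toy, not Buildbot or Twisted code: the classes OldMaster and NewMaster and the function request_shutdown are hypothetical names, and the sketch collapses the slave and master sides into one process, whereas in reality the failure travels back over the network connection.

```python
class ToyPerspective:
    """Toy stand-in for the dispatch step shown in the traceback."""

    def perspectiveMessageReceived(self, message, *args):
        # Same lookup as in the traceback: the remote message name is
        # turned into a method name and fetched with getattr().
        method = getattr(self, "perspective_%s" % message)
        return method(*args)


class OldMaster(ToyPerspective):
    """Hypothetical 0.8.2-style master side: no perspective_shutdown."""

    def perspective_keepalive(self):
        return "alive"


class NewMaster(OldMaster):
    """Hypothetical newer master side that does accept 'shutdown'."""

    def perspective_shutdown(self):
        return "graceful shutdown scheduled"


def request_shutdown(master):
    """Return what the toy slave ends up doing against this master."""
    try:
        return master.perspectiveMessageReceived("shutdown")
    except AttributeError:
        # Corresponds to the slave log line
        # "rebooting NOW, since the master won't talk to us".
        return "fallback: reboot immediately"


if __name__ == "__main__":
    print(request_shutdown(OldMaster()))
    print(request_shutdown(NewMaster()))
```

The point of the sketch is only that a missing perspective_shutdown method is an expected outcome against an older master, so the exception by itself does not explain why the machines failed to come back.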
Yes, it certainly should, and it is what I used when I rebooted the machines manually myself.
So I would recommend shortening the time-scale (in buildbot.tac), and then watching a machine try to reboot. Does it go down for reboot and then abort?
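Shortening the time-scale amounts to shrinking the two Idleizer arguments in the buildbot.tac quoted in the first comment. A sketch of the changed fragment follows; the 5-minute and 2-minute values are arbitrary debugging choices, not values anyone in this bug used, and s and application are the objects already defined earlier in that file, so this is not a standalone script.

```python
# Fragment of buildbot.tac, debugging values only.
# Restore the production values (3600*7 and 3600*1) afterwards.
from buildslave import idleizer
idlz = idleizer.Idleizer(s,
    # 5 minutes idle time before a reboot (assumed test value)
    max_idle_time=60*5,
    # 2 minutes disconnected from a master before a reboot (assumed test value)
    max_disconnected_time=60*2)
idlz.setServiceParent(application)
```

After restarting the slave with these values, twistd.log should show the idle message within a few minutes, which makes it practical to watch what the reboot command does.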
Attached patch: diff of sudoers
Ok, with Dustin's help I learned that the problem was buildbot failing to run 'sudo reboot' properly. It worked from the command line, but buildbot was configured not to use a tty, and sudoers was configured to require one. This was fixed on Firefox's end in Bug 649683.

When I updated these slaves, I didn't catch that sudoers was now hosted in puppet (http://mxr.mozilla.org/build/source/puppet-manifests/modules/sudoers/templates/sudoers.erb), and Dustin grabbed the file from the production-puppet-files, which in this case was a bit older.

Attached is the diff of the sudoers file I just deployed against what was on the machines. I made sure it was root/root, chmod 0440, and rebooted the last slave.
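For anyone checking another machine for the same mismatch, the following sketch scans sudoers text for the requiretty setting. It is a simplification under stated assumptions: it only looks at global "Defaults" lines and per-user "Defaults:user" lines, it lets a per-user line win over a global one regardless of position, and it ignores includes and aliases. The function name tty_required_for and both sample files are hypothetical; they are not the contents of the attached diff, and /sbin/reboot is an assumed path.

```python
import re

# Matches "Defaults <settings>" and "Defaults:<users> <settings>" only.
DEFAULTS_RE = re.compile(r"^Defaults(?::(?P<users>\S+))?\s+(?P<settings>.+)$")


def tty_required_for(sudoers_text, user):
    """Return True if this (simplified) sudoers text demands a tty for user."""
    global_setting = False  # requiretty is off unless a file turns it on
    user_setting = None     # None means no per-user override was seen
    for raw_line in sudoers_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        match = DEFAULTS_RE.match(line)
        if match is None:
            continue
        users = match.group("users")
        for setting in match.group("settings").split(","):
            setting = setting.strip()
            if setting not in ("requiretty", "!requiretty"):
                continue
            value = setting == "requiretty"
            if users is None:
                global_setting = value
            elif user in users.split(","):
                user_setting = value
    return global_setting if user_setting is None else user_setting


# Hypothetical "older" file: every user needs a tty.
SAMPLE_OLD = (
    "Defaults    requiretty\n"
    "seabld ALL=(ALL) NOPASSWD: /sbin/reboot\n"
)

# Hypothetical "newer" file: the build user is exempted.
SAMPLE_NEW = SAMPLE_OLD + "Defaults:seabld !requiretty\n"

if __name__ == "__main__":
    print("old file requires tty for seabld:", tty_required_for(SAMPLE_OLD, "seabld"))
    print("new file requires tty for seabld:", tty_required_for(SAMPLE_NEW, "seabld"))
```

With usepty = False in buildbot.tac, a file for which this check reports a tty requirement for the build user is consistent with the symptom seen here: 'sudo reboot' works in an interactive shell but not when buildbot invokes it.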
Assignee: nobody → bugspam.Callek
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED