Bug 816345 (Closed). Opened 12 years ago. Closed 12 years ago.

solution to stop buildslaves on foopies for b2g pandas

Product/Component: Release Engineering :: General
Type: defect
Hardware: x86
OS: Linux
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: mozilla
Assignee: mozilla

We stop buildslaves on talos hardware by rebooting and checking slavealloc before starting buildbot. We stop buildslaves on Android tegra foopies by a combination of clientproxy, flag files, and cron jobs, none of which know whether we consider the tegra/buildslave to be in use at that moment. We want a non-clientproxy solution for the b2g panda foopies, but we can't reboot because we run multiple buildslaves on a single host.
Solution 1: Set reboot_command to a buildslave_stop.sh script, e.g.:

    'reboot_command': ['scripts/external_tools/buildslave_stop.sh']

In that example, we're creating a mozharness/external_tools/buildslave_stop.sh script. At its bare bones, it's

    #!/bin/sh
    cd ..
    /tools/buildbot-0.8.4-pre-moz2/bin/buildslave stop .
    sleep 600

However, if we just run it like that, it'll stop the buildslave after every test run, and we'll run out of buildslaves very fast. Putting in a grep for something would allow us to make this conditional, e.g.

    #!/bin/sh
    # egrep -q exits 0 when a matching line is found
    egrep -q '^bad_slave:' properties/*
    if [ $? -eq 0 ] ; then
        cd ..
        /tools/buildbot-0.8.4-pre-moz2/bin/buildslave stop .
        sleep 600
    fi

To trigger it, we would have to call

    self.set_buildbot_property("bad_slave", "somevalue", write_to_file=True)

Ideally we would also call self.buildbot_status(TBPL_RETRY) so buildbot retries the job. We could also do a similar grep in logs/log_error.log for a special string.

This makes the buildslave go away without any sort of weird error message, since the reboot step already expects that the slave might go away.
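For comparison, the conditional check in Solution 1 could also be written in Python. This is only a minimal sketch: the property name bad_slave comes from the comment above, but the function name should_stop_buildslave and the assumption that the properties directory holds plain-text files of "name: value" lines are illustrative, not something this bug confirms.

```python
import glob
import os
import re

# "bad_slave" is the property name used in the discussion above. The layout of
# the properties directory (plain-text "name: value" lines) is an assumption.
BAD_SLAVE_RE = re.compile(r"^bad_slave:", re.MULTILINE)


def should_stop_buildslave(properties_dir):
    """Return True if any file in properties_dir has a line starting with 'bad_slave:'."""
    for path in sorted(glob.glob(os.path.join(properties_dir, "*"))):
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as handle:
            if BAD_SLAVE_RE.search(handle.read()):
                return True
    return False
```

The function returns True only when the property is present, which is the case in which the buildslave should be stopped; a run that never set the property leaves the buildslave alone.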
Solution 2: We can call buildslave stop . directly from the script, e.g.:

    dirs = self.query_abs_dirs()
    self.warning("Looks like a bad slave; setting buildbot RETRY and stopping the buildslave.")
    self.buildbot_status(TBPL_RETRY)
    buildslave = self.query_exe('buildslave', return_type="list")
    self.run_command(buildslave + ["stop", "."], cwd="%s/.." % dirs['base_work_dir'])
    time.sleep(600)

This is a bit messier log-wise, since you'll see the connection reset message in the log. We also won't process any properties (maybe we don't want to?). On the other hand, we don't have to rely on property-file parsing or log parsing to get to the buildslave stop . call.

In either approach, we have to somehow mark the device as bad, and make sure we know that the buildslave was stopped on purpose (so we don't log into a foopy, wonder why 5 buildslaves are stopped, and just start them up again).
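One possible way to record that a buildslave was stopped on purpose is to leave a small note in the buildslave's directory before stopping it. The sketch below is hypothetical throughout: the marker file name, its contents, and both function names are invented for illustration and are not part of mozharness or buildbot.

```python
import os
import time

# Hypothetical marker file name; nothing in this bug fixes one.
MARKER_NAME = "buildslave_stopped_on_purpose.txt"


def record_intentional_stop(slave_dir, reason, now=None):
    """Write a note in slave_dir explaining why the buildslave was stopped."""
    # time.gmtime(None) uses the current time; tests can pass a fixed epoch value.
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    marker_path = os.path.join(slave_dir, MARKER_NAME)
    with open(marker_path, "w") as handle:
        handle.write("stopped_at_utc: %s\nreason: %s\n" % (timestamp, reason))
    return marker_path


def was_stopped_on_purpose(slave_dir):
    """Return True if the marker file exists in slave_dir."""
    return os.path.exists(os.path.join(slave_dir, MARKER_NAME))
```

Someone logging into a foopy could then check for the marker before restarting a stopped buildslave.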
I think we could output something into a file or property and then, in the reboot function, include a case for "property/file-based disconnect requested". What do you think? The disconnection would happen on the reboot step rather than on the test step.

BTW, let's remind ourselves that this is a nice-to-have, since the tegras don't have this feature.

From looking at the code here:
http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py#l6273
hg.mozilla.org/build/buildbotcustom/file/default/steps/misc.py#l446

"If force_disconnect is True, then the slave will always be disconnected after the command completes".

    def reboot(self):
        def do_disconnect(cmd):
            try:
                if 'SCHEDULED REBOOT' in cmd.logs['stdio'].getText():
                    return True
            except:
                pass
            return False
        if self.reboot_command:
            self.addStep(DisconnectStep(
                name='reboot',
                flunkOnFailure=False,
                warnOnFailure=False,
                alwaysRun=True,
                workdir='.',
                description="reboot",
                command=self.reboot_command,
                force_disconnect=do_disconnect,
                env=self.env,
            ))
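The extra case suggested above might look roughly like the following. It is a sketch under assumptions: 'SCHEDULED REBOOT' is the string the quoted do_disconnect already looks for, while 'DISCONNECT REQUESTED' and the helper name wants_disconnect are hypothetical and do not exist in buildbotcustom.

```python
# 'SCHEDULED REBOOT' is the string the existing do_disconnect checks for.
# 'DISCONNECT REQUESTED' is a hypothetical marker invented for this sketch.
DISCONNECT_MARKERS = ("SCHEDULED REBOOT", "DISCONNECT REQUESTED")


def wants_disconnect(stdio_text, markers=DISCONNECT_MARKERS):
    """Return True if the step output contains any known disconnect marker."""
    return any(marker in stdio_text for marker in markers)


def do_disconnect(cmd):
    """Same shape as the quoted helper, delegating the string test."""
    try:
        return wants_disconnect(cmd.logs["stdio"].getText())
    except Exception:
        # Missing or unreadable stdio log: do not force a disconnect.
        return False
```

Keeping the string test in a separate function means it can be exercised without a running buildbot master; a reboot_command script would only need to print the agreed marker to request the disconnect.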
Sure. The Tegras solve this by creating an error.flg and clientproxy shuts down the processes. I think this is better.
I have tested that from buildbot I can call this: /builds/manage_buildslave.sh stop WithProperties('%(slavename)s')
No longer blocks: 802317
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General