Closed Bug 428124 Opened 12 years ago Closed 11 years ago

mac buildbot slaves should reboot ready for use


(Release Engineering :: General, defect, P2)



(Not tracked)



(Reporter: joduinn, Assigned: bhearsum)




(2 files, 3 obsolete files)

Splitting out from bug 417887, as each OS will have different gotchas.

Basically: how do we make each buildbot master and slave reboot cleanly, reconnect, and handle new jobs?
IIRC we don't have a way to launch Buildbot properly on boot in OS X. I believe there are hdiutil security context problems when it is launched from a startup script. This will require further investigation.
yeah, the problem on mac is that you need to be on the console when you launch the process so you inherit security settings from the login window.

We might be able to fake that by invoking an AppleScript from the user's StartupItems that does something like the following:

tell application "Terminal"
	do script "buildbot start /builds/slave"
end tell
Component: Release Engineering → Release Engineering: Future
Priority: -- → P3
taking this to test out on qm-moz2mini01.
Assignee: nobody → rcampbell
Priority: P3 → P2
Assignee: rcampbell → nobody
Priority: P2 → P3
Blocks: 472517
We chatted a bunch about this today and decided that part of this will be doing scheduled, periodic reboots of staging machines both to iron out kinks in the rebooting and to look for potential performance gains.
Component: Release Engineering: Future → Release Engineering
Assignee: nobody → bhearsum
Priority: P3 → P2
Thanks to the work in bug 430833 this was easy-peasy to do on moz2-darwin9-slave08. I ran it overnight and all builds went green, except for mozilla-central leak tests, which are legitimately busted right now. I'm going to roll this out on the rest of the staging Macs later today. Once that's run for a bit we can roll it out in production.
Attached file buildbot launch agent file (obsolete) —
I should probably document the details here:
* Ensure /builds/slave is the slavedir or a symlink to it
* Download buildbot.start.slave.plist and put it in /Library/LaunchAgents
* Make sure it is owned by root:wheel

Copy and paste:
cd /Library/LaunchAgents
sudo wget --no-check-certificate -Obuildbot.start.slave.plist
sudo chown root:wheel buildbot.start.slave.plist

From VNC:
* Make sure the resolution is set to 1280x1024.
* System Prefs -> Accounts -> Login Options
** Set 'Automatic Login' to 'cltbld', enter the password when prompted.

Attachment #361784 - Flags: review?(catlee) → review+
Comment on attachment 361784 [details] [diff] [review]
enable periodic reboots on staging for macs

changeset:   938:9710768473dd
Attachment #361784 - Flags: checked‑in+
Alright, these changes have been deployed on moz2-darwin9-slave03, 04, and 08 and periodic reboots of them have been enabled in staging. Once we're confident they're stable this can be rolled out to production.
I found out this morning that Macs need an /etc/sudoers update just like Linux. I added this to the staging Macs; hopefully they'll reboot this time.
Just had the first Mac reboot and come back up successfully. Let's keep running this for a while to make sure it's stable.
Some of the Macs haven't been coming back up properly. I've added the following to the plist file to try and help with that:

<key>StartInterval</key>
<integer>600</integer>

This will cause OS X to try to start it every 600 seconds, which will fail gracefully if it's already started but _should_ start it up if it happens not to start on boot.
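For reference, a complete launch agent along these lines might look like the sketch below. Only the /builds/slave path and the StartInterval value of 600 come from this bug; the Label, the buildbot location (/usr/bin/buildbot) and the RunAtLoad key are assumptions, and the real attachment may well differ. The function only writes the file, so it can be inspected before it is copied into /Library/LaunchAgents.

```shell
#!/bin/sh
# Sketch: write a launch agent plist that starts the slave at login and
# retries every 600 seconds. Label, buildbot path and RunAtLoad are
# assumptions for illustration, not taken from the attachment.
write_slave_plist() {
    # $1 = path of the plist file to create
    cat > "$1" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>buildbot.start.slave</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/buildbot</string>
        <string>start</string>
        <string>/builds/slave</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>StartInterval</key>
    <integer>600</integer>
</dict>
</plist>
EOF
}
```

Usage would be something like `write_slave_plist ./buildbot.start.slave.plist`, followed on the Mac by `plutil -lint ./buildbot.start.slave.plist` to catch syntax errors before installing the file.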
I haven't seen a Mac fail to restart its slave since I added the <key>StartInterval</key> parameter. This is ready to deploy.
Attached file updated plist file (obsolete) —
Here's an updated plist file with the StartInterval in it.
Attachment #361795 - Attachment is obsolete: true
This is getting deployed tomorrow. Here are copy/paste instructions (assumes /builds/slave is the slavedir or a symlink to it):

cd /Library/LaunchAgents
sudo wget --no-check-certificate -Obuildbot.start.slave.plist
sudo chown root:wheel buildbot.start.slave.plist 

And once the slave becomes idle it should be rebooted to ensure everything got installed okay.
Attached file plist, correct syntax (obsolete) —
Here's the plist file with the correct syntax. I didn't have time to deploy it on everything today, but the following slaves are set up for clean boots:

This only leaves the try slaves (try-mac-slave01 -> try-mac-slave05), which I will finish up with on Monday.
Attachment #364362 - Attachment is obsolete: true
Adjusting summary to remove masters, since they all run on linux.
Summary: mac buildbot masters/slaves should reboot ready for use → mac buildbot slaves should reboot ready for use
Depends on: 480753
Depends on: 480692
Alright, I've deployed these changes (with the CVS_RSH key) to the try server pool: try-mac-slave01 -> try-mac-slave05. I'm still waiting for slave04 to become idle to make sure it comes back up okay. Things are looking good other than that.
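One way to confirm that a deployed plist carries the CVS_RSH setting is to read the value back. The sketch below is a plain-text scan, not a real plist parser: it assumes the setting is stored as a <key>/<string> pair (for example inside launchd's EnvironmentVariables dict, which is an assumption about the attachment), and the function name plist_string_value is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the first <string> value that follows <key>NAME</key>
# in a plist file. Text-based and intentionally simple; it does not
# understand nesting, so it only suits flat key/string pairs.
plist_string_value() {
    # $1 = plist path, $2 = key name
    awk -v key="<key>$2</key>" '
        !found {
            i = index($0, key)
            if (!i) next
            found = 1
            # Keep only the text after the key so a value on the
            # same line is still picked up.
            $0 = substr($0, i + length(key))
        }
        match($0, /<string>[^<]*<\/string>/) {
            # Strip the 8-character opening and 9-character closing tag.
            print substr($0, RSTART + 8, RLENGTH - 17)
            exit
        }
    ' "$1"
}
```

Running `plist_string_value /Library/LaunchAgents/buildbot.start.slave.plist CVS_RSH` on a slave prints the configured value, or nothing if the key is missing.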

Still need to update support docs and inventory, almost done here though.
No longer depends on: 480692, 480753
Summary: mac buildbot slaves should reboot ready for use → mac buildbot masters/slaves should reboot ready for use
Whoops, messed up the dependencies.
Depends on: 480692, 480753
Okay, this has been deployed on all of our Mac buildbot slaves now, and the support docs have been updated. We're all done here.
Closed: 11 years ago
Resolution: --- → FIXED
Summary: mac buildbot masters/slaves should reboot ready for use → mac buildbot slaves should reboot ready for use
Also fixed up the CVS_RSH=ssh on fx--mac-1.9-slave1 & 2.
And on bm-xserve16 thru 19 & 22, moz2-darwin9-slave01 thru 08 - need CVS for release update verify.
Attachment #364605 - Attachment is obsolete: true
Product: → Release Engineering