Closed Bug 429427 Opened 16 years ago Closed 15 years ago

Redesign linux refimage VM so no additional manual setup needed

Categories

(Release Engineering :: General, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Unassigned)

References

Details

Attachments

(3 files, 16 obsolete files)

4.60 KB, text/plain (catlee: review+, bhearsum: checked-in+)
7.97 KB, patch (catlee: review+, bhearsum: checked-in+)
2.49 KB, text/plain (catlee: review+, bhearsum: checked-in+)
To set up a new Linux VM, we start by cloning an existing refimage VM, which
gives us a basic set of OS+toolchain installs. However, we then need to follow
these manual instructions:

http://wiki.mozilla.org/ReferencePlatforms
or
http://wiki.mozilla.org/BuildbotTestfarm

to manually install the rest of the toolchain software not included in the
refimage VM. All this takes a long time and can be tricky. As we set up more
and more machines, this overhead is becoming a problem.

Let's eliminate these manual post-install steps:

1) For software that needs configuring, write a script, run as root on first
boot, which fills in details like machine/host name, SMTP configuration, etc.

2) For software that changes rapidly, whatever version is on the refimage is
likely to be out of date anyway by the time we get to clone new VMs. Instead,
let's design the refimage to contain just enough to get started, and include in
the refimage a script that checks for updates on first launch and refreshes
forward to the latest versions available at that time. One way of doing this
would be to pull specific tagged versions of buildbot from CVS, for example,
but there are probably other ways to do this. Or pull a tagged version of a
text file which contains URLs for downloading with wget. (A minimal sketch of
points 1 and 2 follows after this list.)

3) After a VM is created, with bang-up-to-date versions of software, it will
have to keep checking and refreshing forward, or else this new VM will drift
out of date. We need a periodic check to verify the VM is still running the
right versions of the toolchain, and refresh forward if not. This would allow
us to know that all slaves are always in sync with each other. Open question:
how frequently should we be rechecking and refreshing? Once a day seems a
reasonable start, but that's just a SWAG.
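
As a minimal sketch of points 1 and 2 only (not an actual refimage script; the package-list module path, CVS tag, and network-config handling below are hypothetical):

#!/bin/bash
# Hypothetical first-boot sketch; run as root. The package-list path and the
# CVS tag are placeholders, not real repository contents.
set -e

# 1) Per-machine configuration that can't be baked into the refimage:
#    take the new hostname as an argument and write it into the network config.
NEW_HOSTNAME="$1"
hostname "${NEW_HOSTNAME}"
sed -i "s/^HOSTNAME=.*/HOSTNAME=${NEW_HOSTNAME}/" /etc/sysconfig/network

# 2) Refresh fast-moving software: check out a tagged list of download URLs
#    and pull each package into /tools/dist with wget.
mkdir -p /tools/dist
cvs -d:pserver:anonymous@cvs-mirror.mozilla.org:/cvsroot \
    co -r PACKAGE_LIST_TAG -d /tools/dist mozilla/tools/package-urls.txt
while read -r url; do
    wget -N -P /tools/dist "${url}"
done < /tools/dist/package-urls.txt

The same script, rerun from cron, would also cover point 3's periodic refresh.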
I'll take this while doing some buildbot setup on the new moz2 unittest master. I'm writing some scripts to pull and install the files automagically.
Assignee: nobody → rcampbell
Priority: -- → P2
Attached file centos5 python buildbot setup scripts (obsolete) —
WIP: a set of scripts running through installation of the latest versions of Python, zope-interface, Twisted and Buildbot. The scripts download, install, then clean up after themselves. An additional script to configure the user's profile upon successful completion is forthcoming.
tweaked and hopefully improved version of setup scripts.
Attachment #316279 - Attachment is obsolete: true
TBD: parametrization of installables.
Attached file centos5 buildbot scripts (obsolete) —
This set of scripts works for the downloadable centos-5-ref-tools VM, and it adds:

1. making directories for logging in grab_files.sh
2. exporting paths to cltbld's .bash_profile in post_install.sh
3. using /tools/dist for downloading packages

Now to look at parametrization and incorporating a CVS file or some other way to auto-update. (A rough sketch of the three additions above follows below.)
Attachment #321286 - Attachment is obsolete: true
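For illustration, the three additions amount to roughly the following (a sketch, not the attached scripts; the logging directory name and the exported paths are guesses, while /tools/dist and the cltbld user come from the comment above):

# 1. grab_files.sh: create directories for logging before downloading anything.
mkdir -p /tools/dist/logs                        # exact logging path is a guess

# 2. post_install.sh: export the tool paths in cltbld's .bash_profile.
cat >> /home/cltbld/.bash_profile <<'EOF'
export PATH=/tools/python/bin:/tools/buildbot/bin:$PATH
EOF

# 3. Download packages into /tools/dist rather than a temporary location.
wget -P /tools/dist "${PACKAGE_URL}"             # URL variable is a guess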
Assignee: rcampbell → lukasblakk
The reason your image is much larger is that there are a whole bunch of 50k files with "._" prefixes on them, probably Mac resource forks.
I ran each of the steps listed in install.sh individually and each seemed to run through without a hitch. The only issues were:

1) cleanup not removing previous versions of Python, Twisted(-core) and zope-interface.
2) .bash_profile options not picked up on login. Moved contents to .bashrc

Otherwise, this seems good to go.
Also, we should add Mercurial to these.
Whoops, ignore that last comment. Mercurial's already been included in the ref platform image.
Did you run them as root and then su - cltbld? Because that's how I was running them (which means maybe I should have a readme.txt in there), and when I did the user switch, cltbld had the correct paths.
Ignore that last comment, just did some testing and saw the error of my ways.  A readme.txt is probably still a good idea though...

So now there's a little ReadMe.txt about running as root, the PATH is exported to .bashrc instead of .bash_profile, and the cleanup script should delete the old versions of twisted/zope/python in /tools.
Attachment #316454 - Attachment is obsolete: true
Attachment #321295 - Attachment is obsolete: true
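Roughly what those changes amount to (a sketch only; the exported paths and version patterns are illustrative, not the attached script's contents):

# Run the installers as root, then switch to the build user:
#   ./install.sh && su - cltbld

# PATH now goes into cltbld's .bashrc instead of .bash_profile (per the
# comments above); the paths shown are illustrative.
echo 'export PATH=/tools/python/bin:/tools/buildbot/bin:$PATH' >> /home/cltbld/.bashrc

# cleanup.sh removes the old versions left in /tools; take care not to remove
# anything a symlink still points at (version patterns illustrative).
rm -rf /tools/Python-2.4.* /tools/Twisted-2.4.* /tools/zope-interface-3.3.*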
Line 10 in install_buildbot.sh should create another level of directory:

  mkdir buildbotcustom
+ cd buildbotcustom
  cvs -d:pserver:anonymous@cvs-mirror.mozilla.org:/cvsroot co -d buildbotcustom mozilla/tools/buildbotcustom

This fixes the situation where $PYTHONPATH=/tools/buildbotcustom did not allow "import buildbotcustom" to work.
To fix it locally I had to "export PYTHONPATH=/tools:$PYTHONPATH"

NOTE: I believe this script would actually be better as one file, which would allow you to comment it.
I used it for my local purposes; it might be good to put a lot of things into variables, like the PROFILE variable I use.

It might be interesting to have another one to "update", since this one is for a "first-time-run".
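
Pulling the suggested fix together (a sketch, assuming the script works under /tools like the other scripts; only the extra nesting and the PYTHONPATH lines come from the comment above):

cd /tools
mkdir buildbotcustom
cd buildbotcustom
cvs -d:pserver:anonymous@cvs-mirror.mozilla.org:/cvsroot co -d buildbotcustom mozilla/tools/buildbotcustom

# /tools/buildbotcustom/buildbotcustom/ now holds the package, so this works:
export PYTHONPATH=/tools/buildbotcustom:$PYTHONPATH
python -c "import buildbotcustom"

# Alternatively, with the original single-level checkout directly in /tools:
# export PYTHONPATH=/tools:$PYTHONPATH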
Attachment #323439 - Attachment mime type: application/octet-stream → text/plain
For some reason I was removing twisted-core-2.4.0 from the /tools dir, which broke the symlink and also removed twisted-core, which we kind of need.

Also, I made some changes to the ReadMe regarding CVS key issues that arose when trying to use these scripts on the build machines.
Attachment #323176 - Attachment is obsolete: true
Putting back in the pool after triage.
Assignee: lukasblakk → nobody
Component: Release Engineering → Release Engineering: Future
Priority: P2 → P3
Still tweaking this - added more scripts for Mercurial, Nagios and autoconf so this can be used for moz2. Also cleaned up some of cleanup.sh - fixed path settings and improved the buildbot installation.

At this point - run it as root, and be sure to have copied ssh keys into ~/.ssh for both root and cltbld.
Attachment #323439 - Attachment is obsolete: true
Attachment #326069 - Attachment is obsolete: true
Er. I'm a little confused. This bug is about automating the post-install setup of the ref platform, right? Nagios, Mercurial, and Autoconf come standard with the ref platform...
This time the PYTHONHOME and PATH are exported in the zope and twisted install scripts because they are not set in the root user's profile.

The most important thing to know about running this automated script is that you must have CVS keys in root's .ssh dir. Otherwise buildbot will not check out and thus will not install.

This information is in the README as well.
Attachment #329016 - Attachment is obsolete: true
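In sketch form, the change described above looks roughly like this (the /tools/python prefix is an assumption; only the need for PYTHONHOME/PATH and for the CVS keys comes from the comment):

# Inside the zope-interface / twisted install scripts: point at the freshly
# installed python explicitly, since root's profile doesn't set these.
export PYTHONHOME=/tools/python            # prefix is an assumption
export PATH=/tools/python/bin:$PATH
python setup.py install

# Prerequisite from the README: root needs CVS keys in ~/.ssh, otherwise the
# buildbot checkout (and therefore its install) fails.
ls /root/.ssh/ || echo "copy the CVS keys into /root/.ssh first"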
So, if we were to put the stgbld keys on our ref image (for both root and cltbld), then these scripts would remove the manual steps.
Comment on attachment 332960 [details]
Automated install script for update/new cltbld setup

this is definitely obsolete now
Attachment #332960 - Attachment is obsolete: true
Comment on attachment 341934 [details]
Automated Install Scripts with stgbld keys

this is definitely obsolete now
Attachment #341934 - Attachment is obsolete: true
Assignee: nobody → bhearsum
Attached file buildbot tac generator
Catlee, this is very similar to what I showed you last week, with the fixes you suggested and some others. I think this will be a good starting point for when we want to extend it to talk to a web service or some other way of autobalancing.

get_default_options is kind of messy, but I'm not sure how to improve it.

I diff'ed the generated buildbot.tac against the existing one on linux-slave03. The only differences were related to the fact that the existing .tac on linux-slave03 was generated a long time ago, before a lot of log handling code was added. When diff'ed against a newer, production slave, the only differences aside from the ordering of the options and whitespace were the buildmaster_host and slavename values - for obvious reasons.
Attachment #410018 - Flags: review?(catlee)
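The real generator is the attachment above; purely as an illustration of the idea (derive the slave name from the hostname, default everything else), the output resembles a stock Buildbot 0.7-era slave .tac. In this sketch the master host, password, and keepalive/usepty values are placeholders; the basedir and port come from this bug:

#!/bin/bash
# Illustrative only: conceptually what buildbot-tac.py writes out.
SLAVENAME="$(hostname -s)"
BASEDIR=/builds/slave
cat > "${BASEDIR}/buildbot.tac" <<EOF
from twisted.application import service
from buildbot.slave.bot import BuildSlave

basedir = '${BASEDIR}'
buildmaster_host = 'MASTER_HOST_PLACEHOLDER'
port = 9010
slavename = '${SLAVENAME}'
passwd = 'PASSWORD_FROM_PRIVATE_REPO'

application = service.Application('buildslave')
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive=600, usepty=1)
s.setServiceParent(application)
EOF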
Attached file buildbot-tac init service (obsolete) —
This script should be able to handle both CentOS-style init (chkconfig) as well as being launched from launchd. The start and stop actions are a little strange here, but they seem to work (I copied them from the 'firstboot' service on the ref platform).

I tested this a couple of times on moz2-linux-slave03 and it works as intended:
if /builds/slave/buildbot.tac doesn't exist and /etc/sysconfig/buildbot-tac doesn't have 'RUN=YES' in it, it does nothing. It also bails early if the current hostname is in IGNORE_HOSTS.

Otherwise, it calls out to buildbot-tac.py, which intelligently generates the tac file based on the hostname. Note that the password is in this script, as I'm intending it to live in the private puppet-files repository.

Still to do, Puppet manifest updates to maintain a /tools checkout and deploy this script.
Attachment #410023 - Flags: review?(catlee)
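In rough outline, the start logic described above looks like this (a sketch, not the attached script; the IGNORE_HOSTS contents, the generator location, and its invocation are guesses, while the two paths come from the comment):

#!/bin/bash
# Sketch of the 'start' action only.
TAC_FILE=/builds/slave/buildbot.tac
CONTROL_FILE=/etc/sysconfig/buildbot-tac
IGNORE_HOSTS="ref-image-host"               # placeholder list

# Bail early if the current hostname is in IGNORE_HOSTS.
for h in ${IGNORE_HOSTS}; do
    [ "$(hostname -s)" = "$h" ] && exit 0
done

# Do nothing unless the control file opts in with RUN=YES, and never
# clobber an existing tac file.
grep -q '^RUN=YES' "${CONTROL_FILE}" 2>/dev/null || exit 0
[ -e "${TAC_FILE}" ] && exit 0

# Otherwise generate the tac file from the hostname and hand it to cltbld.
python /tools/buildbot-tac.py > "${TAC_FILE}"   # invocation is a guess
chown cltbld:cltbld "${TAC_FILE}"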
Attached file buildbot-tac service with chown (obsolete) —
Sorry, forgot to :w before submitting
Attachment #410023 - Attachment is obsolete: true
Attachment #410025 - Flags: review?(catlee)
Attachment #410023 - Flags: review?(catlee)
Attachment #410025 - Attachment mime type: application/octet-stream → text/plain
Comment on attachment 410018 [details]
buildbot tac generator

A teeny tiny nit:
add a \ after the initial triple quote of your header and footer to prevent extra newlines in the buildbot.tac file.

Looks good otherwise.
Attachment #410018 - Flags: review?(catlee) → review+
Comment on attachment 410025 [details]
buildbot-tac service with chown

How hard would it be to add some detection and handling for stale lock files? Or does the 'stop' action handle this case when the machine is shut down or rebooted?
(In reply to comment #27)
> (From update of attachment 410018 [details])
> A teeny tiny nit:
> add a \ after the initial triple quote of your header and footer to prevent
> extra newlines in the buildbot.tac file.

The one on the footer is intentional - it separates the footer from the options above it. I'll fix the header one, though.
Comment on attachment 410018 [details]
buildbot tac generator

This worked great in my testing, too, so in it goes:
changeset:   407:25611d4e0cf9
Attachment #410018 - Flags: checked-in+
Attached patch buildbot-tac, one more time (obsolete) — Splinter Review
Ok, this version has lockfile aging as you requested. I've also made the action=skip messages better, and made sure we don't overwrite existing tac files (this could otherwise have happened when we roll this out). I ended up putting buildbot-tac.py somewhere else, so I've also updated this script to reflect that. Should be pretty straightforward.
Attachment #410025 - Attachment is obsolete: true
Attachment #410588 - Flags: review?(catlee)
Attachment #410025 - Flags: review?(catlee)
These manifests can be a little obtuse, so here's the order of operations (a rough shell-level sketch follows after this comment):
* Untar build-tools checkout into /tools, set-up symlink
* Copy buildbot-tac to /etc/init.d/buildbot-tac and install the service
* Start the buildbot-tac service

On existing machines this should amount to no change, as the buildbot.tac already exists. For those, it will also fill in the control file so that even if we keep buildbot.tac out of the way for a while it will never be regenerated.

The buildbot service has also been updated to ensure that the buildbot-tac service starts first, thus ensuring that new slaves will come up properly.
Attachment #410592 - Flags: review?(catlee)
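As a shell-level equivalent of the ordering the manifests enforce (a sketch only; the tarball source and symlink names are assumptions, the rest follows the list above):

# 1. Untar the build-tools checkout into /tools and set up the symlink.
tar -C /tools -xzf /path/from/puppet/build-tools.tar.gz   # source path assumed
ln -sfn /tools/build-tools-checkout /tools/build-tools    # link names assumed

# 2. Copy buildbot-tac into /etc/init.d and register the service.
cp buildbot-tac /etc/init.d/buildbot-tac
chmod 755 /etc/init.d/buildbot-tac
chkconfig --add buildbot-tac

# 3. Start the buildbot-tac service; the buildbot service is ordered to start
#    after it, so new slaves come up with a generated buildbot.tac.
service buildbot-tac start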
Attached patch one more time, again (obsolete) — Splinter Review
Sorry about all the churn; my last version forgot to update ${CONTROL_FILE} if buildbot.tac already exists.
Attachment #410588 - Attachment is obsolete: true
Attachment #410593 - Flags: review?(catlee)
Attachment #410588 - Flags: review?(catlee)
Comment on attachment 410592 [details] [diff] [review]
puppet manifests to roll out buildbot-tac.py and the init script

Going to be modifying this slightly to make it work better for Mac.
Attachment #410592 - Attachment is obsolete: true
Attachment #410592 - Flags: review?(catlee)
Comment on attachment 410593 [details] [diff] [review]
one more time, again

Going to be modifying this slightly to make it work better for Mac.
Attachment #410593 - Attachment is obsolete: true
Attachment #410593 - Flags: review?(catlee)
The mac part of this should probably go in bug 429430, but I thought keeping it together would be easier for review purposes.
Attachment #410822 - Flags: review?(catlee)
Attached file buildbot-tac, again
Pretty much the same as the last version, just changed the location of the control file on Mac.

This script will end up in the puppet-files CVS repo.
Attachment #410824 - Flags: review?(catlee)
Attached patch buildbot-tac plist launcher (obsolete) — Splinter Review
Attachment #410825 - Flags: review?(catlee)
Comment on attachment 410825 [details] [diff] [review]
buildbot-tac plist launcher

Sorry, wrong bug for this one.
Attachment #410825 - Attachment is obsolete: true
Attachment #410825 - Flags: review?(catlee)
Attachment #410822 - Flags: review?(catlee) → review+
Attachment #410824 - Attachment mime type: application/octet-stream → text/plain
Comment on attachment 410824 [details]
buildbot-tac, again

>        if ! `ps ax | awk '{print $1}' | grep -q \`cat ${LOCKFILE}\`` ||
>          `find ${LOCKFILE} -cmin ${MAX_LOCKFILE_AGE} | grep -q ${LOCKFILE}`;

I think you need -cmin +${MAX_LOCKFILE_AGE} here. r=me with that change.
Attachment #410824 - Flags: review?(catlee) → review+
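The '+' matters because find's -cmin N matches a file whose status changed exactly N minutes ago, while -cmin +N matches one changed more than N minutes ago, which is the stale case the review is after. A minimal standalone illustration (the lockfile path and age threshold are assumptions):

LOCKFILE=/var/lock/buildbot-tac.lock   # path is an assumption
MAX_LOCKFILE_AGE=60                    # minutes; value is an assumption

# The lockfile is stale if its recorded PID is no longer running, or if it
# is older than MAX_LOCKFILE_AGE minutes (note the '+' in -cmin).
if ! ps -p "$(cat "${LOCKFILE}")" > /dev/null 2>&1 ||
   [ -n "$(find "${LOCKFILE}" -cmin +${MAX_LOCKFILE_AGE})" ]; then
    echo "stale lockfile, removing"
    rm -f "${LOCKFILE}"
fi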
Comment on attachment 410824 [details]
buildbot-tac, again

checked-in with the fix and the correct password.

Checking in buildbot-tac;
/mofo/puppet-files/shared/buildbot-tac,v  <--  buildbot-tac
initial revision: 1.1
done
Attachment #410824 - Flags: checked-in+
Comment on attachment 410822 [details] [diff] [review]
deploy zero to staging scripts on linux and mac

changeset:   74:2787b357f0c7
Attachment #410822 - Flags: checked-in+
Assignee: bhearsum → nobody
Component: Release Engineering: Future → Release Engineering
After getting everything landed and updating the ref platform I had Phong clone a new machine for me, moz2-linux-test01. After he cloned it and turned it on it appeared on staging-master.b.m.o:9010.

This should be the case for all new slaves, provided the buildbot-configs are updated to know about them in advance.

I've also updated the reference platform doc and removed the now-unnecessary manual steps, https://wiki.mozilla.org/ReferencePlatforms/Linux-CentOS-5.0#Post-Install_Setup.

Victory!
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
(In reply to comment #43)
> After getting everything landed and updating the ref platform I had Phong clone
> a new machine for me, moz2-linux-test01. After he cloned it and turned it on it
> appeared on staging-master.b.m.o:9010.
> 
> This should be the case for all new slaves, provided the buildbot-configs are
> updated to know about them in advance.
> 
> I've also updated the reference platform doc and removed the now-unnecessary
> manual steps,
> https://wiki.mozilla.org/ReferencePlatforms/Linux-CentOS-5.0#Post-Install_Setup.
> 
> Victory!

Very very sweet! :-)
Product: mozilla.org → Release Engineering