Closed
Bug 429418
Opened 17 years ago
Closed 15 years ago
Redesign win32 refimage VM so no additional manual setup needed to get into staging
Categories
(Release Engineering :: General, defect, P3)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: joduinn, Assigned: bhearsum)
References
Details
Attachments
(8 files, 8 obsolete files)
7.56 KB, patch | catlee: review+ | bhearsum: checked-in+
2.95 KB, patch | catlee: review+ | bhearsum: checked-in+
1.01 KB, patch | coop: review+ | bhearsum: checked-in+
3.08 KB, patch | nthomas: review+ | bhearsum: checked-in+
3.58 KB, patch | catlee: review+ | bhearsum: checked-in+
950 bytes, patch | catlee: review+ | bhearsum: checked-in+
1.52 KB, patch | catlee: review+ | bhearsum: checked-in+
639 bytes, patch | nthomas: review+ | bhearsum: checked-in+
To set up a new win32 VM, we start by cloning an existing refimage VM, which gives us a basic set of OS+toolchain installs. However, we then need to follow these manual instructions:
http://wiki.mozilla.org/ReferencePlatforms
or
http://wiki.mozilla.org/BuildbotTestfarm
to manually install the rest of the toolchain software not included in the refimage VM. All this takes a long time, and can be tricky. As we set up more and more machines, this overhead is becoming a problem.
Let's eliminate these manual post-install steps:
1) For software that needs configuring, write a script, run as root, which can be run on first boot, and which fills in details like machine/host name, SMTP configuration, etc.
2) For software that changes rapidly, whatever version is on the refimage is likely to be out of date anyway by the time we get to clone new VMs. Instead, let's design the refimage to contain enough to get started, and include in the refimage a script that checks for updates on first launch and refreshes forward to the latest available at that time. One way of doing this would be to pull specific tag versions of buildbot from CVS, for example, but there are probably other ways to do this. Or pull a tagged version of a text file, which contains URLs for downloading using wget.
3) After the VM is created, with bang-up-to-date versions of software, it will have to keep checking and refreshing forward, or else this new VM will drift out of date. We need a periodic check to verify the VM is still running the right versions of the toolchain, and refresh forward if not. This would allow us to know that all slaves are always in sync with each other. Open question: how frequently should we be rechecking & refreshing? Once a day seems a reasonable start, but that's just a swag.
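The periodic check in step 3 could look roughly like the sketch below. It only illustrates the idea: the function name, the package names, and the version numbers are made up for the example, and how the installed versions and the wanted manifest are obtained is left open, as it is in the description above.

```python
# Hypothetical sketch of a "refresh forward" check. Nothing here comes
# from the refimage itself; names and versions are placeholders.

def find_stale(installed, wanted):
    """Return {name: (installed_version_or_None, wanted_version)} for
    every package whose installed version differs from the manifest."""
    stale = {}
    for name, want in sorted(wanted.items()):
        have = installed.get(name)
        if have != want:
            stale[name] = (have, want)
    return stale


if __name__ == "__main__":
    # Placeholder data standing in for a tagged manifest and a local scan.
    wanted = {"buildbot": "version-B", "python": "version-A"}
    installed = {"buildbot": "version-A", "python": "version-A"}
    for name, (have, want) in find_stale(installed, wanted).items():
        print("%s: have %s, want %s" % (name, have, want))
```

A job like this, run once a day as suggested above, would give a list of packages to refresh; the refresh step itself depends on the deployment tool in use.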
Assignee | ||
Updated•17 years ago
|
Priority: -- → P3
Updated•17 years ago
|
Assignee: nobody → rcampbell
Updated•17 years ago
|
Assignee: rcampbell → nobody
Component: Release Engineering → Release Engineering: Future
Assignee | ||
Comment 1•16 years ago
|
||
Things have changed a lot since comment #0, here's what needs to be done, as I see it:
* Change the hostname
* Reinstallation of OPSI
* Create the buildbot.tac file
* Install Buildbot
Reporter | ||
Comment 2•16 years ago
|
||
(In reply to comment #1)
> Things have changed a lot since comment #0, here's what needs to be done, as I
> see it:
> * Change the hostname
> * Reinstallation of OPSI
ok.
> * Create the buildbot.tac file
Tracked in bug#520727.
> * Install Buildbot
Per https://bugzilla.mozilla.org/show_bug.cgi?id=520727#c3 isn't this already done?
Note: when I filed this originally, I was thinking of Build/unittest VMs. However, we're now running config mgmt on talos also, so want to make sure we include Talos machines in whatever we do here too.
Depends on: 520727
Summary: Redesign win32 refimage VM so no additional manual setup needed. → Redesign win32 refimage VM so no additional manual setup needed to get into staging
Assignee | ||
Comment 3•16 years ago
|
||
(In reply to comment #2)
> (In reply to comment #1)
> > Things have changed a lot since comment #0, here's what needs to be done, as I
> > see it:
> > * Change the hostname
> > * Reinstallation of OPSI
> ok.
>
> > * Install Buildbot
> Per https://bugzilla.mozilla.org/show_bug.cgi?id=520727#c3 isnt this already
> done?
We have an OPSI package that knows how to install Buildbot. It's not automatically set to install for new VMs, though.
> > * Create the buildbot.tac file
> Tracked in bug#520727.
Yup.
> Note: when I filed this originally, I was thinking of Build/unittest VMs.
> However, we're now running config mgmt on talos also, so want to make sure we
> include Talos machines in whatever we do here too.
Yup. I'm focusing on one thing at a time here, though.
Reporter | ||
Comment 4•16 years ago
|
||
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > Things have changed a lot since comment #0, here's what needs to be done, as I
> > > see it:
> > > * Change the hostname
> > > * Reinstallation of OPSI
> > ok.
> >
>
> > > * Install Buildbot
> > Per https://bugzilla.mozilla.org/show_bug.cgi?id=520727#c3 isnt this already
> > done?
>
> We have an OPSI package that knows how to install Buildbot. It's not
> automatically set to install for new VMs, though.
Ah, that explains the confusion. I thought that we'd enable packages as we got them working... incremental improvements, and all that. I know you've been debugging other opsi fun recently, but is there any reason to not enable this?
> > > * Create the buildbot.tac file
> > Tracked in bug#520727.
> Yup.
> > Note: when I filed this originally, I was thinking of Build/unittest VMs.
> > However, we're now running config mgmt on talos also, so want to make sure we
> > include Talos machines in whatever we do here too.
> Yup. I'm focusing on one thing at a time here, though.
cool, just wanted to confirm the scope.
Assignee | ||
Comment 5•16 years ago
|
||
(In reply to comment #4)
> (In reply to comment #3)
> >
> > We have an OPSI package that knows how to install Buildbot. It's not
> > automatically set to install for new VMs, though.
> Ah, that explains the confusion. I thought that we'd enable packages as we got
> them working... incremental improvements, and all that. I know you've been
> debugging other opsi fun recently, but is there any reason to not enable this?
>
It's not a simple thing, unfortunately. OPSI can set default actions for new machines, but only globally. Because Talos machines are managed by the same OPSI server we'll have to find a way around this. I'll figure something out as part of this bug.
Assignee | ||
Updated•16 years ago
|
Assignee: nobody → bhearsum
Assignee | ||
Comment 6•15 years ago
|
||
This hasn't been tested in any useful way yet, but the logic is mostly there. Basically, it updates the 'pckey' on the slave if it doesn't have one at all, or if it has the ref platform pckey but not the ref platform hostname.
Assignee | ||
Comment 7•15 years ago
|
||
This is a companion for the client side part I just posted. Currently, it accepts a list of slaves on the command line and adds them to /etc/opsi/pckeys, and copies the ref platform config for them.
We talked about automatically finding slaves by parsing config files. We should do that, but for now I'm going to focus on making it work by manually adding them.
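A minimal sketch of what the server side step described here might look like, written in current Python rather than the Python of the attached patch. The `hostname:key` line format for the pckeys file, the per-host `<hostname>.ini` config name, and the `add_slave` function itself are assumptions made for illustration; the attached patch is the authority on what the real script does.

```python
import os
import shutil


def add_slave(hostname, key, pckeys_path, config_dir, ref_config):
    """Register hostname in the pckeys file and give it a copy of the
    ref platform's config. Returns True if the host was added."""
    known = set()
    if os.path.exists(pckeys_path):
        with open(pckeys_path) as f:
            for line in f:
                # Assumed format: one "hostname:key" entry per line.
                known.add(line.split(":", 1)[0].strip())
    if hostname in known:
        return False  # already registered; leave its entry alone

    with open(pckeys_path, "a") as f:
        f.write("%s:%s\n" % (hostname, key))

    # Assumed naming scheme for the per-host config file.
    dest = os.path.join(config_dir, hostname + ".ini")
    shutil.copyfile(ref_config, dest)
    return True
```

Passing the paths in as arguments, rather than hard-coding /etc/opsi/pckeys, keeps the sketch easy to try against a scratch directory.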
Assignee | ||
Comment 8•15 years ago
|
||
new in this version: use the right permissions on the config files, use the right path to the ref platform configs
Attachment #414855 -
Attachment is obsolete: true
Assignee | ||
Comment 9•15 years ago
|
||
new in this version: avoid unnecessary atexit handler, fix tuple bug
Attachment #414853 -
Attachment is obsolete: true
Assignee | ||
Comment 10•15 years ago
|
||
Alright, so the latest (v2) versions of these scripts do the things they are intended to do. The next step is making the client side one work properly on first boot. Right now what happens is:
* new clone starts up
* script runs on boot, bails because it still has the same hostname as the ref platform
* hostname is changed, reboot
* opsi hangs because the hostname and the key don't match. it eventually times out and gets stuck at the login screen, similar to bug 522708. once the mouse is moved it logs in
* script runs again, updates key
* opsi works fine on subsequent boots
We need to either disable OPSI temporarily and re-enable it once it will work or, better yet, find a way to stop it from hanging in this situation.
Assignee | ||
Comment 11•15 years ago
|
||
I think I found a workaround for the hang. OPSI has a configuration parameter called 'SecsUntilConnectionTimeout', which is set to 180 by default. When I changed this to 10, the OPSI dialog went away quicker and the automatic login succeeded. My theory is that the screen saver starting is what's breaking the login, not OPSI. By timing out sooner we prevent that from happening.
I'm going to reboot a machine in a loop for a few hours to further test this theory.
Assignee | ||
Comment 12•15 years ago
|
||
Alright, this is a polished version of the server side script. It doesn't support automatically looking for new slaves, but I'll be adding that in today or tomorrow.
Attachment #414870 -
Attachment is obsolete: true
Attachment #415197 -
Flags: review?(catlee)
Assignee | ||
Comment 13•15 years ago
|
||
Alright, so this script, when run on boot, will regenerate the host key when no host key exists or the existing one is equal to that of the ref platform *and* the hostname is not the same as the ref platform's. By doing this only under these conditions we should see the following behaviour:
* existing slaves are untouched
* ref platform never gets changed
* new slaves will automatically change their hostkey after a post-hostname change reboot
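The condition and the three behaviours listed above can be written down as a small predicate. This is only a sketch of the decision, assuming the hostname and key have already been read from the machine; `should_regenerate` and its parameter names are hypothetical and not taken from regenerate-hostkey.py.

```python
def should_regenerate(hostname, hostkey, ref_hostname, ref_hostkey):
    """Decide whether a machine should get a new host key on this boot.

    hostkey is None when the machine has no key at all.
    """
    if hostname == ref_hostname:
        # Either the ref platform itself or a clone that has not been
        # renamed yet: never touch the key.
        return False
    # Renamed machine: regenerate only if the key is missing or is still
    # the one inherited from the ref platform.
    return hostkey is None or hostkey == ref_hostkey
```

An existing slave has its own hostname and its own key, so the predicate is false for it; a fresh clone only becomes eligible on the boot after its hostname has been changed.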
Before this works without intervention we need to deploy a quick OPSI package that lowers the timeout for OPSI, as described in comment #11. Since it's also a fix for bug 522078, I'll track that there.
Deploying this is going to be a little heavy handed. We'll need:
* A 'tools' checkout on the slaves. We should set it up to run at every boot and keep the repo up to date.
* The timeout described in comment #11 to be lowered. I'll do this in bug 522078, since it's a fix for that bug too.
* To add this script to Scheduled Tasks, so it runs as Administrator at every boot. This command line can be rolled out easily with OPSI, and should do it:
schtasks /create /tn opsikey /tr "\"d:\mozilla-build\python25\python.exe\" \"c:\documents and settings\administrator\tools\buildfarm\opsi\regenerate-hostkey.py"\" /sc ONSTART /ru administrator /rp adminpassword
Attachment #414871 -
Attachment is obsolete: true
Attachment #415203 -
Flags: review?(catlee)
Assignee | ||
Comment 14•15 years ago
|
||
Comment on attachment 415197 [details] [diff] [review]
script that can add new slaves to the opsi server
From catlee's IRC review:
* use 0664 instead of stat.*
* use 'for line in open()' instead of readline
* f.close() might end up with an AttributeError
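For readers without the patch open, the three review points can be illustrated as follows. This is a sketch in current Python (where the octal literal is spelled `0o664` rather than `0664`); `read_hosts` and `set_config_mode` are hypothetical names, not functions from the reviewed script.

```python
import os


def read_hosts(path):
    """Return the non-empty lines of a file, one hostname per line."""
    hosts = []
    f = None
    try:
        f = open(path)
        for line in f:  # iterate over the file instead of calling readline()
            line = line.strip()
            if line:
                hosts.append(line)
    finally:
        # If open() failed, f is still None and f.close() would raise
        # AttributeError, hiding the real error.
        if f is not None:
            f.close()
    return hosts


def set_config_mode(path):
    # A literal mode is easier to read than OR-ing stat.S_I* constants.
    os.chmod(path, 0o664)
```

In current Python a `with open(path) as f:` block handles the close case on its own; the explicit guard is shown because it matches the shape of the problem the review describes.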
Assignee | ||
Comment 15•15 years ago
|
||
Alright, so this version addresses all of the review comments plus adds the ability to read a list of hosts in from a file.
Attachment #415197 -
Attachment is obsolete: true
Attachment #415401 -
Flags: review?(catlee)
Attachment #415197 -
Flags: review?(catlee)
Assignee | ||
Comment 16•15 years ago
|
||
Forgot to mention: this is intended to be run through cron, like so:
python look-for-new-slaves.py -f production-slaves
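A sketch of how a script could accept both the command line list from the earlier version and the `-f` file shown above. Only the `-f` flag appears in this bug; the long option name, the use of `argparse`, and the one-hostname-per-line file format are assumptions for illustration.

```python
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Register new slaves")
    # "-f" is from the comment above; "--hosts-file" is a made-up long name.
    parser.add_argument("-f", "--hosts-file", dest="hosts_file",
                        help="file with one slave hostname per line")
    parser.add_argument("hosts", nargs="*",
                        help="slave hostnames given directly")
    return parser.parse_args(argv)


def collect_hosts(args):
    """Merge hostnames from the command line and from the hosts file."""
    hosts = list(args.hosts)
    if args.hosts_file:
        with open(args.hosts_file) as f:
            hosts.extend(line.strip() for line in f if line.strip())
    return hosts
```

With this shape, `collect_hosts(parse_args(["-f", "production-slaves"]))` returns the hostnames listed in that file.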
Assignee | ||
Comment 17•15 years ago
|
||
Attachment #415203 -
Attachment is obsolete: true
Attachment #415402 -
Flags: review?(catlee)
Attachment #415203 -
Flags: review?(catlee)
Assignee | ||
Comment 18•15 years ago
|
||
I just ran this script on a windows machine and found that the dirname it generates on Windows doesn't match what we use now. This patch corrects that.
Attachment #415446 -
Flags: review?(ccooper)
Updated•15 years ago
|
Attachment #415402 -
Flags: review?(catlee) → review+
Comment 19•15 years ago
|
||
Comment on attachment 415402 [details] [diff] [review]
slave side script, bugfixed and with better logging
>+ try:
>+ sys.stdout = open(options.logfile, "a")
>+ except IOError:
>+ log("WARN: Couldn't open %s, logging to STDOUT instead")
I think you're missing a ' % options.logfile' in there.
r+ with that change.
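The effect of the missing `% options.logfile` is that the literal `%s` ends up in the warning. A self-contained sketch of the corrected pattern, with the `log` callable passed in so it can be tried in isolation (the function name `redirect_stdout` is hypothetical):

```python
import sys


def redirect_stdout(logfile, log):
    """Send stdout to logfile; on failure, warn and keep the old stdout."""
    try:
        sys.stdout = open(logfile, "a")
    except IOError:
        # Without "% logfile" the message would contain a literal "%s".
        log("WARN: Couldn't open %s, logging to STDOUT instead" % logfile)
```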
Updated•15 years ago
|
Attachment #415401 -
Flags: review?(catlee) → review+
Assignee | ||
Comment 20•15 years ago
|
||
Comment on attachment 415402 [details] [diff] [review]
slave side script, bugfixed and with better logging
Landed with the review comment addressed,
changeset: 433:b5c5c5c144f7
Won't be deployed on the slaves at least until bug 531951 is resolved.
Attachment #415402 -
Flags: checked-in+
Assignee | ||
Comment 21•15 years ago
|
||
Comment on attachment 415401 [details] [diff] [review]
server side script that reads new hosts in from a file
changeset: 24:49912b230a12
I'll set up the cronjobs on the OPSI servers at a later time - when the rest of this bug is closer to landing.
Attachment #415401 -
Flags: checked-in+
Assignee | ||
Comment 22•15 years ago
|
||
Quick recap of what needs to be done still:
* Setup cronjob on OPSI servers to look for new slaves
* Land windows dirname fix
* Land timeout-lowering opsi package from bug 522078
(After bug 531951 is landed and deployed):
* Deploy client side script on existing slaves; ref platform (requires OPSI package that is to-be-posted)
* Get buildbot-tac.py running at startup or firstrun on slaves
Updated•15 years ago
|
Attachment #415446 -
Flags: review?(ccooper) → review+
Assignee | ||
Comment 23•15 years ago
|
||
Comment on attachment 415446 [details] [diff] [review]
one line fix for the windows dirname
changeset: 435:60c493f55bab
Attachment #415446 -
Flags: checked-in+
Assignee | ||
Comment 24•15 years ago
|
||
(In reply to comment #22)
Still to do:
* Setup cronjob on OPSI servers to look for new slaves
(After bug 531951 is landed and deployed):
* Deploy client side script on existing slaves; ref platform (requires OPSI package that is to-be-posted)
* Get buildbot-tac.py running at startup or firstrun on slaves
Assignee | ||
Comment 25•15 years ago
|
||
This is an extension of the buildbot.bat file we currently use to launch buildbot on the Windows slaves. This file will end up in the opsi-binaries CVS repo, with the real buildslave password. I considered doing this in Python, or porting the buildbot-tac init script used on linux and mac, but in the end I think this is actually the better solution, as odd as it sounds.
Package to deploy this incoming in a bit.
Attachment #415941 -
Flags: review?(catlee)
Assignee | ||
Comment 26•15 years ago
|
||
This package is capable of adding and removing a Scheduled Task that will run the regenerate-hostkey.py script at boot. We have to run it in Scheduled Tasks because it requires administrative privileges, and our automatic logon is done as cltbld.
I've tested both the install and the uninstall, and verified that a fresh clone does indeed run the script correctly and change the host key after its name is changed.
Attachment #415966 -
Flags: review?(nrthomas)
Assignee | ||
Comment 27•15 years ago
|
||
Still to do:
* Land buildbot.bat (waiting on review)
* Deploy tools checkout package (bug 531951 - ready to land at any time)
* Write & test & deploy OPSI package for buildbot.bat (will test and post this tomorrow)
* Land & deploy hostkey-generator OPSI package (just posted)
* Setup cronjob on OPSI servers to look for new slaves (will do this tomorrow)
Assignee | ||
Comment 28•15 years ago
|
||
Forgot about one eeny-weeny tiny little detail:
We're going to need to change the slavenames in the Buildbot configs and .tac files to make this work properly. Currently, all of the build try slaves and "moz2" win32 slaves' hostnames don't match the buildbot slavenames.
We should be able to do this without any downtime.
Comment 29•15 years ago
|
||
Comment on attachment 415966 [details] [diff] [review]
deploy hostkey generator
Looks fine to me.
Attachment #415966 -
Flags: review?(nrthomas) → review+
Assignee | ||
Comment 30•15 years ago
|
||
Comment on attachment 415966 [details] [diff] [review]
deploy hostkey generator
changeset: 27:4512fd032017
Checking in hostkey.ins;
/mofo/opsi-binaries/hostkey-generator/hostkey.ins,v <-- hostkey.ins
initial revision: 1.1
done
Haven't set this to roll out yet - might do that later today.
Attachment #415966 -
Flags: checked-in+
Assignee | ||
Comment 31•15 years ago
|
||
This is basically the same as the patch it deprecates, with the addition of an OPSI package to deploy it. I gave it a run through in staging, and it worked fine - although I need to fix buildbot-tac.py to support 'win32-slaveNN' hostnames.
Attachment #415941 -
Attachment is obsolete: true
Attachment #416079 -
Flags: review?(catlee)
Attachment #415941 -
Flags: review?(catlee)
Assignee | ||
Comment 32•15 years ago
|
||
Attachment #416080 -
Flags: review?(catlee)
Updated•15 years ago
|
Attachment #416080 -
Flags: review?(catlee) → review+
Assignee | ||
Comment 33•15 years ago
|
||
Comment on attachment 416080 [details] [diff] [review]
fix buildbot-tac.py to support win32 slaves properly
changeset: 436:5f52f4f0d579
Attachment #416080 -
Flags: checked-in+
Updated•15 years ago
|
Attachment #416079 -
Flags: review?(catlee) → review+
Assignee | ||
Comment 34•15 years ago
|
||
This should prevent us from having a half written tac file. If opening, writing, or closing the file fails we'll never get 'buildbot.tac'. If the move fails, the tac file should also not exist - and we'll try again next time.
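The approach described here is the usual write-to-a-temporary-file-then-move pattern. A minimal sketch in current Python, assuming the temporary file lives in the same directory as the destination; `save_tac` is a hypothetical name and the attached patch may differ in detail (for example, `os.replace` overwrites an existing destination, which the original move may not have done).

```python
import os
import tempfile


def save_tac(contents, dest):
    """Write contents to dest without ever leaving a partial file there."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # Same directory as dest, so the final step is a rename, not a copy.
    fd, tmp = tempfile.mkstemp(dir=dest_dir, prefix="buildbot.tac.")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(contents)
        os.replace(tmp, dest)  # only now does the real name appear
    except BaseException:
        # Any failure leaves dest untouched; clean up the temporary file.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

If opening, writing, closing, or moving fails, the destination name is never created with partial contents, so the next run can simply try again.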
Attachment #416166 -
Flags: review?(catlee)
Assignee | ||
Updated•15 years ago
|
Attachment #416079 -
Flags: checked-in+
Assignee | ||
Comment 35•15 years ago
|
||
Comment on attachment 416079 [details] [diff] [review]
buildbot batch launcher; opsi package to deploy it
changeset: 29:271beb023e94
will roll this out next week.
Assignee | ||
Comment 36•15 years ago
|
||
Attachment #416166 -
Attachment is obsolete: true
Attachment #416178 -
Flags: review?(catlee)
Attachment #416166 -
Flags: review?(catlee)
Updated•15 years ago
|
Attachment #416178 -
Flags: review?(catlee) → review+
Assignee | ||
Comment 37•15 years ago
|
||
Comment on attachment 416080 [details] [diff] [review]
fix buildbot-tac.py to support win32 slaves properly
Actually, this ended up being changeset: 443:6d33ff4f21dc
Assignee | ||
Comment 38•15 years ago
|
||
Comment on attachment 416178 [details] [diff] [review]
safer saving, v2
changeset: 444:3555ae980742
Assignee | ||
Updated•15 years ago
|
Attachment #416178 -
Flags: checked-in+
Assignee | ||
Comment 39•15 years ago
|
||
The tacfile generator and hostkey generator have been set to roll out.
Assignee | ||
Comment 40•15 years ago
|
||
All of the OPSI packages have been deployed to the existing slaves and the ref platform now. A couple small bits still to do:
* Make sure e:\builds\moz2_slave is created on new slaves
* Do a test with a newly cloned slave - verify new ref platform docs as part of this.
* Setup cronjob on OPSI servers to look for new slaves (will do this tomorrow)
Assignee | ||
Comment 41•15 years ago
|
||
So, it turns out that I never actually tested this in an end to end scenario. My testing involved installing the job on an existing slave or ref platform and letting it go from there. Because of that, I did not find out that setting the scheduled task to run as Administrator causes Windows to prepend the hostname to the username. This works fine until the hostname is changed - such as when we clone new slaves from the ref platform.
Additionally, I didn't get the quoting quite right in the '/tr' argument.
The fix here gets us running the scheduled task as SYSTEM - whose fully qualified username is 'NT AUTHORITY\SYSTEM' - which does not change upon cloning. Running as SYSTEM also doesn't require a password added to the task, which means we can move this script to the public repository.
I have fully tested this version in the real world scenario - deployed on the ref platform through the OPSI package, cloned to a new slave, and run automatically at start up.
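To make the two fixes concrete (run as SYSTEM, and nested quoting in '/tr'), here is a sketch that builds the `schtasks` arguments as a list and lets Python produce the Windows quoting. The paths in the example are the ones from comment 13 and may not match the landed package; `build_schtasks_args` is a hypothetical helper, and the sketch only prints the command line rather than running it.

```python
import subprocess


def build_schtasks_args(python_exe, script, task_name="opsikey"):
    """Arguments for a boot-time task that runs script as SYSTEM."""
    # /tr takes one string; both paths inside it need their own quotes
    # because they may contain spaces.
    task_run = '"%s" "%s"' % (python_exe, script)
    return ["schtasks", "/create", "/tn", task_name, "/tr", task_run,
            "/sc", "ONSTART", "/ru", "SYSTEM"]  # no /rp: SYSTEM has no password


if __name__ == "__main__":
    args = build_schtasks_args(
        r"d:\mozilla-build\python25\python.exe",
        r"c:\documents and settings\administrator\tools\buildfarm\opsi\regenerate-hostkey.py",
    )
    # list2cmdline applies the Windows quoting rules, escaping the inner
    # quotes as \" so the whole /tr value stays a single argument.
    print(subprocess.list2cmdline(args))
```

Building the list first and quoting once at the end avoids hand-counting backslashes and quotes, which is where the earlier command line went wrong.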
Attachment #416987 -
Flags: review?(nrthomas)
Comment 42•15 years ago
|
||
Comment on attachment 416987 [details] [diff] [review]
fix hostkey scheduled task installation
>diff --git a/hostkey-generator/CLIENT_DATA/hostkey.ins b/hostkey-generator/CLIENT_DATA/hostkey.ins
>+DefVar $User$
>+DefVar $Password$
You can remove these lines on checkin. r+
Attachment #416987 -
Flags: review?(nrthomas) → review+
Assignee | ||
Comment 43•15 years ago
|
||
Comment on attachment 416987 [details] [diff] [review]
fix hostkey scheduled task installation
Removed the obsolete variables on landing.
changeset: 30:97410da8d2c3
Attachment #416987 -
Flags: checked-in+
Assignee | ||
Comment 44•15 years ago
|
||
Okay, I believe we're done here:
* All existing slaves have had the OPSI packages deployed on them.
* The ref platform has the OPSI packages
* Freshly cloned slaves for both the "production" and try pools have been tested - both come up straight into staging. Note that they still have to be added to the configs manually, which is documented here: https://wiki.mozilla.org/ReferencePlatforms/Win32#Releng_stuff.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 45•15 years ago
|
||
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Updated•12 years ago
|
Product: mozilla.org → Release Engineering