Closed Bug 429418 Opened 17 years ago Closed 15 years ago

Redesign win32 refimage VM so no additional manual setup needed to get into staging

Categories

(Release Engineering :: General, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: bhearsum)

References

Details

Attachments

(8 files, 8 obsolete files)

7.56 KB, patch (catlee: review+, bhearsum: checked-in+)
2.95 KB, patch (catlee: review+, bhearsum: checked-in+)
1.01 KB, patch (coop: review+, bhearsum: checked-in+)
3.08 KB, patch (nthomas: review+, bhearsum: checked-in+)
3.58 KB, patch (catlee: review+, bhearsum: checked-in+)
950 bytes, patch (catlee: review+, bhearsum: checked-in+)
1.52 KB, patch (catlee: review+, bhearsum: checked-in+)
639 bytes, patch (nthomas: review+, bhearsum: checked-in+)
To set up a new win32 VM, we start by cloning an existing refimage VM, which gives us a basic set of OS + toolchain installs. However, we then need to follow the manual instructions at http://wiki.mozilla.org/ReferencePlatforms or http://wiki.mozilla.org/BuildbotTestfarm to install the rest of the toolchain software not included in the refimage VM. All this takes a long time and can be tricky. As we set up more and more machines, this overhead is becoming a problem. Let's eliminate these manual post-install steps:

1) For software that needs configuring, write a script, run as root on first boot, which fills in details like machine/host name, SMTP configuration, etc.

2) For software that changes rapidly, whatever version is on the refimage is likely to be out of date anyway by the time we get to clone new VMs. Instead, let's design the refimage to contain just enough to get started, and include in the refimage a script that checks for updates on first launch and refreshes forward to the latest available at that time. One way of doing this would be to pull specific tagged versions of buildbot from CVS, for example, but there are probably other ways to do this. Or pull a tagged version of a text file which contains URLs for downloading with wget.

3) After a VM is created with bang-up-to-date versions of the software, it will have to keep checking and refreshing forward, or else this new VM will drift out of date. We need a periodic check to verify the VM is still running the right versions of the toolchain, and refresh forward if not. This would let us know that all slaves are always in sync with each other. Open question: how frequently should we be rechecking and refreshing? Once a day seems a reasonable start, but that's just a guess.
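The "tagged text file of URLs" idea in point 2 could be sketched roughly as below (written in modern Python for illustration; the manifest URL, format, and function names are all hypothetical, not anything that exists in the repo):

```python
import urllib.request

# Hypothetical manifest: one "name version url" entry per line, pulled
# from a tagged location so the contents are versioned alongside the code.
MANIFEST_URL = "http://build.example.com/refimage/MANIFEST.txt"

def parse_manifest(text):
    """Parse a manifest into {name: (version, url)} entries,
    skipping blank lines and '#' comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, version, url = line.split()
        entries[name] = (version, url)
    return entries

def stale_packages(installed, manifest):
    """Return the names whose installed version differs from the manifest."""
    return [name for name, (version, _) in manifest.items()
            if installed.get(name) != version]

def refresh(installed, manifest_url=MANIFEST_URL):
    """Fetch the manifest and download anything out of date
    (the actual install step is omitted from this sketch)."""
    with urllib.request.urlopen(manifest_url) as f:
        manifest = parse_manifest(f.read().decode())
    for name in stale_packages(installed, manifest):
        version, url = manifest[name]
        urllib.request.urlretrieve(url, name)  # then run the installer...
```

Run on first boot and again from a periodic task, this would cover both the "refresh forward on first launch" and "keep checking" requirements.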
Priority: -- → P3
Blocks: 427919
Assignee: nobody → rcampbell
Assignee: rcampbell → nobody
Component: Release Engineering → Release Engineering: Future
Blocks: 522528
Things have changed a lot since comment #0, here's what needs to be done, as I see it:
* Change the hostname
* Reinstallation of OPSI
* Create the buildbot.tac file
* Install Buildbot
(In reply to comment #1)
> Things have changed a lot since comment #0, here's what needs to be done, as I
> see it:
> * Change the hostname
> * Reinstallation of OPSI

ok.

> * Create the buildbot.tac file

Tracked in bug #520727.

> * Install Buildbot

Per https://bugzilla.mozilla.org/show_bug.cgi?id=520727#c3, isn't this already done?

Note: when I filed this originally, I was thinking of build/unittest VMs. However, we're now running config mgmt on Talos also, so want to make sure we include Talos machines in whatever we do here too.
Depends on: 520727
Summary: Redesign win32 refimage VM so no additional manual setup needed. → Redesign win32 refimage VM so no additional manual setup needed to get into staging
(In reply to comment #2)
> (In reply to comment #1)
> > Things have changed a lot since comment #0, here's what needs to be done, as I
> > see it:
> > * Change the hostname
> > * Reinstallation of OPSI
> ok.
>
> > * Install Buildbot
> Per https://bugzilla.mozilla.org/show_bug.cgi?id=520727#c3 isnt this already
> done?

We have an OPSI package that knows how to install Buildbot. It's not automatically set to install for new VMs, though.

> > * Create the buildbot.tac file
> Tracked in bug#520727.

Yup.

> Note: when I filed this originally, I was thinking of Build/unittest VMs.
> However, we're now running config mgmt on talos also, so want to make sure we
> include Talos machines in whatever we do here too.

Yup. I'm focusing on one thing at a time here, though.
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > Things have changed a lot since comment #0, here's what needs to be done, as I
> > > see it:
> > > * Change the hostname
> > > * Reinstallation of OPSI
> > ok.
> >
> > > * Install Buildbot
> > Per https://bugzilla.mozilla.org/show_bug.cgi?id=520727#c3 isnt this already
> > done?
>
> We have an OPSI package that knows how to install Buildbot. It's not
> automatically set to install for new VMs, though.

Ah, that explains the confusion. I thought that we'd enable packages as we got them working... incremental improvements, and all that. I know you've been debugging other OPSI fun recently, but is there any reason to not enable this?

> > > * Create the buildbot.tac file
> > Tracked in bug#520727.
> Yup.
> > Note: when I filed this originally, I was thinking of Build/unittest VMs.
> > However, we're now running config mgmt on talos also, so want to make sure we
> > include Talos machines in whatever we do here too.
> Yup. I'm focusing on one thing at a time here, though.

Cool, just wanted to confirm the scope.
(In reply to comment #4)
> (In reply to comment #3)
> >
> > We have an OPSI package that knows how to install Buildbot. It's not
> > automatically set to install for new VMs, though.
> Ah, that explains the confusion. I thought that we'd enable packages as we got
> them working... incremental improvements, and all that. I know you've been
> debugging other opsi fun recently, but is there any reason to not enable this?

It's not a simple thing, unfortunately. OPSI can set default actions for new machines, but only globally. Because Talos machines are managed by the same OPSI server, we'll have to find a way around this. I'll figure something out as part of this bug.
Assignee: nobody → bhearsum
Attached patch [wip] slave side key updater (obsolete) — Splinter Review
This hasn't been tested in any useful way yet, but the logic is mostly there. Basically, it updates the 'pckey' on the slave if it doesn't have one at all, or if it has the ref platform pckey but not the ref platform hostname.
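The decision described above can be sketched as a small predicate (function and argument names are illustrative, not taken from the actual script):

```python
def should_regenerate(slave_key, slave_hostname, ref_key, ref_hostname):
    """Decide whether a slave needs a fresh OPSI pckey.

    Regenerate when the slave has no key at all, or when it still
    carries the ref platform's key but is no longer the ref platform
    (i.e. its hostname was changed after cloning).
    """
    if not slave_key:
        return True
    return slave_key == ref_key and slave_hostname != ref_hostname
```

Under this rule, existing slaves (own key) and the ref platform itself (ref key, ref hostname) are left alone, while a fresh clone that has just been renamed gets a new key on its next boot.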
This is a companion for the client side part I just posted. Currently, it accepts a list of slaves on the command line and adds them to /etc/opsi/pckeys, and copies the ref platform config for them. We talked about automatically finding slaves by parsing config files. We should do that, but for now I'm going to focus on making it work by manually adding them.
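The server-side half amounts to appending a host/key entry and copying the ref platform config. A minimal sketch of that shape, assuming a `hostname:key` line format and an `.ini` config per host (the paths, key format, and function names here are assumptions, not the real script):

```python
import os
import random
import shutil

def generate_key():
    """Generate a 32-hex-character host key (assumed format)."""
    return "%032x" % random.getrandbits(128)

def add_slave(hostname, pckeys_path, ref_config, config_dir):
    """Append a 'hostname:key' entry and copy the ref platform config."""
    with open(pckeys_path, "a") as f:
        f.write("%s:%s\n" % (hostname, generate_key()))
    os.chmod(pckeys_path, 0o664)  # per the review comments later in this bug
    shutil.copy(ref_config, os.path.join(config_dir, hostname + ".ini"))
```

The real script works against /etc/opsi/pckeys and the OPSI config directory, and takes its slave list from the command line.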
Attached file [wip] server side script, v2 (obsolete) —
new in this version: use the right permissions on the config files, use the right path to the ref platform configs
Attachment #414855 - Attachment is obsolete: true
Attached file [wip] client script, v2 (obsolete) —
new in this version: avoid unnecessary atexit handler, fix tuple bug
Attachment #414853 - Attachment is obsolete: true
Alright, so the latest (v2) versions of these scripts do the things they're intended to do. The next step is making the client-side one work properly on first boot. Right now what happens is:
* new clone starts up
* script runs on boot, bails because it still has the same hostname as the ref platform
* hostname is changed, reboot
* OPSI hangs because the hostname and the key don't match. It eventually times out and gets stuck at the login screen, similar to bug 522708. Once the mouse is moved, it logs in
* script runs again, updates key
* OPSI works fine on subsequent boots

We either need to disable OPSI temporarily and re-enable it once it will work, or better yet, find a way to stop it from hanging in this situation.
I think I found a workaround for the hang. OPSI has a configuration parameter called 'SecsUntilConnectionTimeout', which is set to 180 by default. When I changed this to 10, the OPSI dialog went away quicker and the automatic login succeeded. My theory is that the screen saver starting is what's breaking the login, not OPSI. By timing out sooner we prevent that from happening. I'm going to reboot a machine in a loop for a few hours to further test this theory.
Alright, this is a polished version of the server side script. It doesn't support automatically looking for new slaves, but I'll be adding that in today or tomorrow.
Attachment #414870 - Attachment is obsolete: true
Attachment #415197 - Flags: review?(catlee)
Alright, so this script, when run on boot, will regenerate the host key when the existing host key doesn't exist, or is equal to that of the ref platform *and* the hostname is not the same as the ref platform. By doing this only under these conditions we should see the following behaviour:
* existing slaves are untouched
* the ref platform never gets changed
* new slaves will automatically change their host key after the post-hostname-change reboot

Before this works without intervention we need to deploy a quick OPSI package that lowers the timeout for OPSI, as described in comment #11. Since it's also a fix for bug 522078, I'll track that there.

Deploying this is going to be a little heavy-handed. We'll need:
* A 'tools' checkout on the slaves. We should set it up to run at every boot and keep the repo up to date.
* The timeout described in comment #11 to be lowered. I'll do this in bug 522078, since it's a fix for that bug too.
* To add this script to Scheduled Tasks, so it runs as Administrator at every boot. This command line can be rolled out easily with OPSI, and should do it:

schtasks /create /tn opsikey /tr "\"d:\mozilla-build\python25\python.exe\" \"c:\documents and settings\administrator\tools\buildfarm\opsi\regenerate-hostkey.py"\" /sc ONSTART /ru administrator /rp adminpassword
Attachment #414871 - Attachment is obsolete: true
Attachment #415203 - Flags: review?(catlee)
Comment on attachment 415197 [details] [diff] [review]
script that can add new slaves to the opsi server

From catlee's IRC review:
* use 0664 instead of stat.*
* use 'for line in open()' instead of readline
* f.close() might end up with an AttributeError
Depends on: 531951
Alright, so this version addresses all of the review comments plus adds the ability to read a list of hosts in from a file.
Attachment #415197 - Attachment is obsolete: true
Attachment #415401 - Flags: review?(catlee)
Attachment #415197 - Flags: review?(catlee)
Forgot to mention: this is intended to be run through cron, like so:

python look-for-new-slaves.py -f production-slaves
Attachment #415203 - Attachment is obsolete: true
Attachment #415402 - Flags: review?(catlee)
Attachment #415203 - Flags: review?(catlee)
I just ran this script on a Windows machine and found that the dirname it generates on Windows doesn't match what we use now. This patch corrects that.
Attachment #415446 - Flags: review?(ccooper)
Attachment #415402 - Flags: review?(catlee) → review+
Comment on attachment 415402 [details] [diff] [review]
slave side script, bugfixed and with better logging

>+ try:
>+     sys.stdout = open(options.logfile, "a")
>+ except IOError:
>+     log("WARN: Couldn't open %s, logging to STDOUT instead")

I think you're missing a ' % options.logfile' in there. r+ with that change.
Attachment #415401 - Flags: review?(catlee) → review+
Comment on attachment 415402 [details] [diff] [review]
slave side script, bugfixed and with better logging

Landed with the review comment addressed, changeset: 433:b5c5c5c144f7

Won't be deployed on the slaves at least until bug 531951 is resolved.
Attachment #415402 - Flags: checked-in+
Comment on attachment 415401 [details] [diff] [review]
server side script that reads new hosts in from a file

changeset: 24:49912b230a12

I'll set up the cronjobs on the OPSI servers at a later time, when the rest of this bug is closer to landing.
Attachment #415401 - Flags: checked-in+
Quick recap of what still needs to be done:
* Set up cronjob on OPSI servers to look for new slaves
* Land windows dirname fix
* Land timeout-lowering OPSI package from bug 522078

(After bug 531951 is landed and deployed):
* Deploy client side script on existing slaves and the ref platform (requires OPSI package that is to-be-posted)
* Get buildbot-tac.py running at startup or firstrun on slaves
Attachment #415446 - Flags: review?(ccooper) → review+
Comment on attachment 415446 [details] [diff] [review]
one line fix for the windows dirname

changeset: 435:60c493f55bab
Attachment #415446 - Flags: checked-in+
(In reply to comment #22)

Still to do:
* Set up cronjob on OPSI servers to look for new slaves

(After bug 531951 is landed and deployed):
* Deploy client side script on existing slaves and the ref platform (requires OPSI package that is to-be-posted)
* Get buildbot-tac.py running at startup or firstrun on slaves
This is an extension of the buildbot.bat file we currently use to launch buildbot on the Windows slaves. This file will end up in the opsi-binaries CVS repo, with the real buildslave password. I considered doing this in Python, or porting the buildbot-tac init script used on linux and mac, but in the end I think this is actually the better solution, as odd as it sounds. Package to deploy this incoming in a bit.
Attachment #415941 - Flags: review?(catlee)
This package is capable of adding and removing a Scheduled Task that will run the regenerate-hostkey.py script at boot. We have to run it in Scheduled Tasks because it requires administrative privileges, and our automatic logon is done as cltbld. I've tested both the install and the uninstall, and verified that a fresh clone does indeed run the script correctly and change the host key after its name is changed.
Attachment #415966 - Flags: review?(nrthomas)
Still to do:
* Land buildbot.bat (waiting on review)
* Deploy tools checkout package (bug 531951 - ready to land at any time)
* Write, test, and deploy OPSI package for buildbot.bat (will test and post this tomorrow)
* Land & deploy hostkey-generator OPSI package (just posted)
* Set up cronjob on OPSI servers to look for new slaves (will do this tomorrow)
Forgot about one eeny-weeny tiny little detail: We're going to need to change the slavenames in the Buildbot configs and .tac files to make this work properly. Currently, all of the build try slaves and "moz2" win32 slaves' hostnames don't match the buildbot slavenames. We should be able to do this without any downtime.
Comment on attachment 415966 [details] [diff] [review]
deploy hostkey generator

Looks fine to me.
Attachment #415966 - Flags: review?(nrthomas) → review+
Comment on attachment 415966 [details] [diff] [review]
deploy hostkey generator

changeset: 27:4512fd032017

Checking in hostkey.ins;
/mofo/opsi-binaries/hostkey-generator/hostkey.ins,v  <--  hostkey.ins
initial revision: 1.1
done

Haven't set this to roll out yet - might do that later today.
Attachment #415966 - Flags: checked-in+
This is basically the same as the patch it deprecates, with the addition of an OPSI package to deploy it. I gave it a run through in staging, and it worked fine - although I need to fix buildbot-tac.py to support 'win32-slaveNN' hostnames.
Attachment #415941 - Attachment is obsolete: true
Attachment #416079 - Flags: review?(catlee)
Attachment #415941 - Flags: review?(catlee)
Depends on: 532924
Attachment #416080 - Flags: review?(catlee) → review+
Comment on attachment 416080 [details] [diff] [review]
fix buildbot-tac.py to support win32 slaves properly

changeset: 436:5f52f4f0d579
Attachment #416080 - Flags: checked-in+
Attachment #416079 - Flags: review?(catlee) → review+
This should prevent us from having a half written tac file. If opening, writing, or closing the file fails we'll never get 'buildbot.tac'. If the move fails, the tac file should also not exist - and we'll try again next time.
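The pattern being described is the usual write-to-temp-then-rename trick; a minimal sketch (function and file names are illustrative, not the actual buildbot-tac.py code):

```python
import os

def write_tac_file(path, contents):
    """Write atomically so a half-written file never appears at `path`.

    If opening, writing, or closing fails, only the temporary file is
    affected and `path` never exists. If the rename fails, `path` also
    doesn't exist, so the generator simply tries again on the next run.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(contents)
    # Atomic on POSIX; on Windows, plain rename fails if the target
    # already exists, which is fine here since we only ever create
    # buildbot.tac when it's missing.
    os.rename(tmp, path)
```

Readers of `path` therefore see either no file at all or a complete one, never a truncated tac file.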
Attachment #416166 - Flags: review?(catlee)
Attachment #416079 - Flags: checked-in+
Comment on attachment 416079 [details] [diff] [review]
buildbot batch launcher; opsi package to deploy it

changeset: 29:271beb023e94

Will roll this out next week.
Attached patch safer saving, v2Splinter Review
Attachment #416166 - Attachment is obsolete: true
Attachment #416178 - Flags: review?(catlee)
Attachment #416166 - Flags: review?(catlee)
Attachment #416178 - Flags: review?(catlee) → review+
Comment on attachment 416080 [details] [diff] [review]
fix buildbot-tac.py to support win32 slaves properly

Actually, this ended up being changeset: 443:6d33ff4f21dc
Comment on attachment 416178 [details] [diff] [review]
safer saving, v2

changeset: 444:3555ae980742
Attachment #416178 - Flags: checked-in+
The tacfile generator and hostkey generator have been set to roll out.
All of the OPSI packages have been deployed to the existing slaves and the ref platform now. A couple of small bits still to do:
* Make sure e:\builds\moz2_slave is created on new slaves
* Do a test with a newly cloned slave; verify new ref platform docs as part of this
* Set up cronjob on OPSI servers to look for new slaves (will do this tomorrow)
So, it turns out that I never actually tested this in an end-to-end scenario. My testing involved installing the job on an existing slave or the ref platform and letting it go from there. Because of that, I did not find out that setting the scheduled task to run as Administrator causes Windows to prepend the hostname to the username. This works fine until the hostname is changed, such as when we clone new slaves from the ref platform. Additionally, I didn't get the quoting quite right in the '/tr' argument.

The fix here gets us running the scheduled task as SYSTEM, whose fully qualified username is 'NT AUTHORITY\SYSTEM', which does not change upon cloning. Running as SYSTEM also doesn't require a password added to the task, which means we can move this script to the public repository. I have fully tested this version in the real-world scenario: deployed on the ref platform through the OPSI package, cloned to a new slave, and run automatically at startup.
Attachment #416987 - Flags: review?(nrthomas)
Comment on attachment 416987 [details] [diff] [review]
fix hostkey scheduled task installation

>diff --git a/hostkey-generator/CLIENT_DATA/hostkey.ins b/hostkey-generator/CLIENT_DATA/hostkey.ins
>+DefVar $User$
>+DefVar $Password$

You can remove these lines on checkin. r+
Attachment #416987 - Flags: review?(nrthomas) → review+
Comment on attachment 416987 [details] [diff] [review]
fix hostkey scheduled task installation

Removed the obsolete variables on landing.

changeset: 30:97410da8d2c3
Attachment #416987 - Flags: checked-in+
Okay, I believe we're done here:
* All existing slaves have had the OPSI packages deployed on them
* The ref platform has the OPSI packages
* Freshly cloned slaves for both the "production" and try pools have been tested; both come up straight into staging

Note that new slaves still have to be added to the configs manually, which is documented here: https://wiki.mozilla.org/ReferencePlatforms/Win32#Releng_stuff
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Product: mozilla.org → Release Engineering