Bug 683715 Opened 13 years ago Closed 13 years ago

Deploy process for setting up a rev4 10.6 test slave

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: dustin)

References

Details

This bug is to figure out how to go from a blank mini to the point where we are ready to run puppet to set up the machine as a tester.  The goal is to figure out a process that RelOps can support.  Per a meeting with zandr, this is not going to block our medium scale test.

This is the state that a Mac must be in to become a 10.6 tester:
1) Mac OS X 10.6 installed, with the 10.6.8 v1.1 combo update applied, from:
   curl -LO http://support.apple.com/downloads/DL1399/en_US/MacOSXUpdCombo10.6.8.dmg
2) User created with the following details:
   Full Name: Client Builder
   User Name: cltbld
   Password to be communicated to Release Engineering
3) VNC and SSH sharing enabled.  This can be done by
   launching System Preferences, going to 'Sharing' and ticking
   the 'Screen Sharing' and 'Remote Login' settings
4) 'cltbld' set to automatically log into a console session.  
   This can be done by launching System Preferences, going to 
   'Accounts', clicking the padlock to unlock the preference pane,
   pressing 'Login Options' then selecting "Client Builder" from
   the "Automatic Login" list
5) Puppet v 0.24.8 installed. This can be done by running
   curl -LO http://downloads.puppetlabs.com/gems/facter-1.5.6.gem
   curl -LO http://projects.puppetlabs.com/attachments/download/584/puppet-0.24.8.gem
   sudo gem install facter-1.5.6.gem puppet-0.24.8.gem
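The curl and gem commands above are the only parts of this list that can be scripted directly.  A minimal sketch, assuming a POSIX shell on the mini, that wraps them in a dry-run guard (the run helper and DRY_RUN variable are illustrative additions, not part of the documented process):

```shell
#!/bin/sh
# Sketch only: wraps the documented download/install commands in a dry-run
# guard so the sequence can be reviewed before running it on a fresh mini.
# The 'run' helper and DRY_RUN variable are illustrative additions.
set -e

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"          # preview mode: print instead of execute
    else
        "$@"
    fi
}

# 10.6.8 v1.1 combo update (applying the DMG still needs the GUI installer)
run curl -LO http://support.apple.com/downloads/DL1399/en_US/MacOSXUpdCombo10.6.8.dmg

# Puppet 0.24.8 and its matching facter
run curl -LO http://downloads.puppetlabs.com/gems/facter-1.5.6.gem
run curl -LO http://projects.puppetlabs.com/attachments/download/584/puppet-0.24.8.gem
run sudo gem install facter-1.5.6.gem puppet-0.24.8.gem
```

With DRY_RUN=1 (the default here) the script only prints the commands; set DRY_RUN=0 to actually execute them.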

At some point, we would like to have puppetd sync up as part of the imaging process.  I am not sure how to get puppet to sync without interacting with the cert tool on the puppet master.

If I can't get the Python dependencies figured out before we need to image the bulk of the machines for production, we might have one additional step before the machine is ready for puppet: installing a metapackage that has the Python headers and source.
Assignee: server-ops-releng → jhford
My testing is showing that a base os install with manual steps and puppet results in a machine that is running green in staging.  We still have to run these machines in staging for a week or two to get a baseline for performance numbers, but the imaging process developed in this bug shouldn't change.

I do need to make a change to the manual steps in comment 0, which affects our ability to have non-mac vnc clients connect to the mac vnc server.

The change is from:

(In reply to John Ford [:jhford] from comment #0)
> 3) VNC and SSH sharing enabled.  This can be done by
>    launching System Preferences, going to 'Sharing' and ticking
>    the 'Screen Sharing' and 'Remote Login' settings

to:

3) VNC and SSH sharing enabled with VNC password set to 'cltbld' password.
   This can be done by launching System Preferences, going to 'Sharing' and ticking
   the 'Screen Sharing' and 'Remote Login' settings.  To set the VNC password, select
   the 'Screen Sharing' item from the checkbox list and press "Computer Settings".  On
   the sheet that drops down, tick the 'VNC Viewers may control screen with password:'
   checkbox, then enter the communicated 'cltbld' user password.

As well as a clarification:

> 1) Mac OS X 10.6 installed

This installation should *not* be the one that shipped installed on the machine, as I think that includes the iLife package.  Using the system recovery disc should be fine, but we don't want iLife on these machines.



Pushing this bug back to ServerOps to figure out the assignee.  I am also adding dependency information.  To be clear, these bugs aren't strict dependencies and can be done concurrently.  I am adding them to this bug's dependency fields.
Assignee: jhford → server-ops-releng
Blocks: 671415
Depends on: 681726
Summary: figure out how to prepare base image for rev4 minis → Deploy process for setting up a rev4 10.6 test slave
Now that you have a baseline that's working, we need to figure out how to deploy and manage these boxes.  In order to facilitate future management of these machines, we're going to need to turn on ARD vs straight screen sharing, for example.  Can I have talos-r4-snow-009 to experiment with, so we can work on deployment tools?
FWIW, I've had recurring instances of ARD hanging on reboot (thus requiring a manual power cycle) so it's disabled on all my minis.  For system administration, I enable ARD as-needed (via SSH), do the work, then immediately turn ARD back off.  YMMV.
(In reply to Amy Rich [:arich] from comment #2)
> Now that you have a baseline that's working, we need to figure out how to
> deploy and manage these boxes.  In order to facilitate future management of
> these machines, we're going to need to turn on ARD vs straight screen
> sharing, for example.  Can I have talos-r4-snow-009 to experiment with, so
> we can work on deployment tools?

Why do we need ARD?  What can it do that we can't do with screensharing/vnc and puppet?

Given John's experience with ARD in comment 3 and my incomplete understanding of ARD's advantages, I think I'd prefer that we only turn it on when it is needed.  Also, due to the performance tests and the focus issues with unit tests, we should not perform any maintenance on a machine while it is in production.  Anything that could change performance numbers needs to be done during a tree closure.
The plan is to use ARD to do the base OS install since we've had issues with deploystudio in the past.
Can ARD be disabled and ScreenSharing+VNC enabled after the installation?
Also, it's worth noting that I was completely unable to have the ARD agent running with non-Mac VNC clients able to connect, even following the Apple KB article.  Having VNC working for the non-Mac users on our team is critical.
(In reply to John Ford [:jhford] from comment #1)
> This installation should *not* be what shipped installed on the machine, as
> I think that includes the ilife package.  Using the system recovery disc
> should be fine, but we don't want to have ilife on these machines.

There seems to be some confusion around this point.  Each of these machines comes with:

1) Mac OS X 10.6.4 and iLife preinstalled on the HD
2) an OS X 10.6.4 restore DVD
3) an Application restore DVD (iLife, etc)

I would like to have the hd erased and the OS reinstalled because there are OS integration packages installed by iLife.  Instead of deleting the apps, which does nothing to remove the OS integration, I would like our testing slaves to have never had iLife installed.

We have never had iLife installed on our testing slaves because we used reference images that had the entire machine setup from a clean install of the OS.  Now that we are doing base os + puppet, we need to make sure that all machines start in this same clean state.  Removing the app bundles in /Applications doesn't guarantee the removal of OS Integration portions of iLife.
This is a new version of ARD, and I want to try to get VNC working with ARD enabled. jhford: you still didn't say whether or not I can have one of those machines to test on (09 would work).  I'm going to need one whether or not we use ARD.
(In reply to Amy Rich [:arich] from comment #9)
> This is a new version of ard, and I want to try to get vnc working with ard
> enabled. jhford: you still didn't say whether or not I can have one of those
> machines to test on (09 would work).  I'm going to need one whether or not
> we use ard.

Apologies, I had a comment saying that it is OK to take that machine, but it wasn't submitted due to a Firefox crash.  Yes, please take that machine to work with.
Cross posting from bug 683718 to reduce confusion:

Because there are addenda in comments in this and other bugs, these are the updated requirements.

These machines must have:
1) The hard drive with pre-installed Mac OS X erased
2) Mac OS X 10.6 installed from the OS recovery DVD included with the hardware. Note that
   the contents of the Applications DVD included with the hardware should not be installed.
3) 10.6.8 v1.1 update applied, from:
   curl -LO http://support.apple.com/downloads/DL1399/en_US/MacOSXUpdCombo10.6.8.dmg
4) User created with the following details:
   Full Name: Client Builder
   User Name: cltbld
   Password to be communicated by Release Engineering
5) VNC and SSH sharing enabled with VNC password set to 'cltbld' password.
   This can be done by launching System Preferences, going to 'Sharing' and ticking
   the 'Screen Sharing' and 'Remote Login' settings.  To set the VNC password, select
   the 'Screen Sharing' item from the checkbox list and press "Computer Settings".  On
   the sheet that drops down, tick the 'VNC Viewers may control screen with password:'
   checkbox, then enter the communicated 'cltbld' user password.
6) 'cltbld' set to automatically log into a console session.  
   This can be done by launching System Preferences, going to 
   'Accounts', clicking the padlock to unlock the preference pane,
   pressing 'Login Options' then selecting "Client Builder" from
   the "Automatic Login" list
7) Puppet v 0.24.8 installed. This can be done by running
   curl -LO http://downloads.puppetlabs.com/gems/facter-1.5.6.gem
   curl -LO http://projects.puppetlabs.com/attachments/download/584/puppet-0.24.8.gem
   sudo gem install facter-1.5.6.gem puppet-0.24.8.gem
8) Hardware dongle installed that allows the display mode to be set to 1600x1200x32
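A hedged post-setup sanity check could confirm the scriptable parts of the list above.  The check helper and the exact launchctl/defaults probes are assumptions on my part, and requirements 1, 5 (the VNC password), and 8 still need a human at the machine:

```shell
#!/bin/sh
# Sketch only: spot-check the scriptable requirements from the list above.
# 'check' is an illustrative helper; the launchctl and defaults probes are
# assumptions about 10.6 internals.  On any other system every line just
# prints FAIL, which is the intended behaviour.
check() {
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "ok:   $desc"
    else
        echo "FAIL: $desc"
    fi
}

check "10.6.8 installed"        sh -c 'sw_vers -productVersion | grep -q "^10\.6\.8"'
check "cltbld user exists"      id cltbld
check "Remote Login (sshd) up"  sh -c 'launchctl list | grep -q com.openssh.sshd'
check "cltbld auto-login set"   sh -c 'defaults read /Library/Preferences/com.apple.loginwindow autoLoginUser | grep -q cltbld'
check "puppet 0.24.8 installed" sh -c 'puppetd --version | grep -q "^0\.24\.8"'
check "facter 1.5.6 installed"  sh -c 'facter --version | grep -q "^1\.5\.6"'
```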
Blocks: 690236
Last night, I set up 10 machines in SCL1 using the base os + manual steps + puppet solution.  My experience is that bootstrapping puppet is incredibly fragile and non-deterministic.  Bootstrapping is the process that takes a machine from 'base os + manual steps' to 'base os + manual steps + puppet'.

Unfortunately, while I was writing the puppet manifests, I didn't have local access to my testing machine [1] and it ended up taking two weeks [2] to have the OS reset on my testing machine in MV.  This meant that I was unable to test the bootstrapping process.  While doing the bootstrap on 10 machines, I ran into at least one bug [3] that is causing a lot of trouble in the process.  

My short-term priority is getting however many rev4s we pick (80?) into production as 10.6 machines.  To this end, I think the best approach is to take an image of a machine that has the base os + manual steps + puppet setup, image the (70?) remaining machines with it, and make sure they sync with puppet.  We should test this on a couple of machines before doing the bulk of them.

I don't consider this a complete failure of the goal to have base os + puppet as our deployment strategy.  In the past, we had a combined solution where half the image was set up by following button-by-button instructions in a wiki document and then running puppet.  There were also a lot of undocumented things, like a full Xcode installation on testing hardware.

We now have a process that installs the base os, updates the os, and runs puppet repeatedly until it works.  Beyond OS install, OS update, and puppet, the only manual step is setting the VNC password on the machines, which, try as I might, seems impossible on the command line.

For the longer term, I think we need to decide how important being able to bootstrap every single machine using puppet is.  Given Apple's direction of removing command line configurability, I doubt this is worthwhile.  If desired, a bug for fixing the bootstrapping issues can be filed.

One thing that I think would be very helpful is upgrading our puppet installation to a newer version.  There is now a feature called 'stages' which should make bootstrapping machines a whole bunch easier.

The steps for going from base os to full configured for me were:
-open vnc://talos-r4-snow-001.build.mozilla.org
-launch terminal on slave
-on puppet master run: 'puppetca --clean talos-r4-snow-001.build.scl1.mozilla.com'
-run 'sudo puppetd --test --server staging-puppet.build.mozilla.org'
-on puppet master run: 'puppetca --sign talos-r4-snow-001.build.scl1.mozilla.com'
-this will automatically reboot slave
-reconnect to slave over vnc/screensharing (mac screensharing will do this automatically)
-launch terminal on slave
-on puppet master run: 'puppetca --clean talos-r4-snow-001.build.scl1.mozilla.com'
-once slave has tried to sync puppet run on puppet master: 'puppetca --sign talos-r4-snow-001.build.scl1.mozilla.com'
  -this can be monitored by running this on staging master:
    tail -f /var/log/messages | grep 'has a waiting cert'
-slave will automatically reboot
-slave should automatically connect to buildbot master
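The per-slave sequence above could be sketched as a script run on the slave itself, with the master-side puppetca calls wrapped in ssh.  The MASTER/SLAVE defaults and passwordless ssh to the puppet master are assumptions:

```shell
#!/bin/sh
# Sketch only: the clean/run/sign dance above, assumed to run on the slave,
# with puppet-master-side commands wrapped in ssh.  MASTER/SLAVE values and
# ssh access to the master are assumptions; DRY_RUN=1 (the default here)
# only prints the plan.
MASTER=${MASTER:-staging-puppet.build.mozilla.org}
SLAVE=${SLAVE:-talos-r4-snow-001.build.scl1.mozilla.com}

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# First pass: clean any stale cert, run puppet once, sign the waiting cert.
run ssh "$MASTER" puppetca --clean "$SLAVE"
run sudo puppetd --test --server "$MASTER"    # on the slave, over VNC
run ssh "$MASTER" puppetca --sign "$SLAVE"
# The slave reboots itself here; after it comes back up and tries to sync
# again, repeat the clean/sign pair for the post-reboot puppet run.
run ssh "$MASTER" puppetca --clean "$SLAVE"
run ssh "$MASTER" puppetca --sign "$SLAVE"
```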

I also noticed that:
-the first few runs of puppet will have random, weird failures.  They are non-deterministic.  Most are fixed by running puppet in a loop
-if a slave isn't on the buildbot master and isn't running run-puppet.sh, it likely needs to be rebooted
-because of a bug/feature in Mac OS X, any time puppet needs to install a DMG file, the command *must* be run over screensharing.  The DMG mounting program will not work when launched from ssh
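The run-puppet-in-a-loop behaviour can be pictured as a simple retry wrapper.  puppet_run below is a stand-in that fails twice and then succeeds, purely so the sketch is self-contained; a real loop would invoke 'sudo puppetd --test --server ...' instead:

```shell
#!/bin/sh
# Sketch only: retry puppet until a run converges, which papers over the
# non-deterministic early failures described above.  puppet_run is a
# stand-in (it fails twice, then succeeds); swap in the real puppetd
# invocation when using this for real.
attempts=0
puppet_run() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]      # simulate: two failures, then success
}

max=10
until puppet_run; do
    if [ "$attempts" -ge "$max" ]; then
        echo "giving up after $attempts attempts"
        exit 1
    fi
    sleep 0    # a real loop would back off between attempts
done
echo "puppet converged after $attempts attempts"
```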

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=678112
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=684346
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=690082
Late to the party, since I'm supposed to be on vacation.

(In reply to John Ford [:jhford] from comment #4)

> Why do we need ARD?  What can it do that we can't do with screensharing/vnc
> and puppet?

Remotely reimage a machine. 

ARD needn't block now, but we *will* transition to OSX native management tools in the future.
Depends on: 690923
I think there are two options for how we deploy to the rev4 10.6 machines.

Option A)
For each machine to be imaged, follow all of the steps in https://bugzilla.mozilla.org/show_bug.cgi?id=683715#c11 to setup the base OS then manually do the following:
-open vnc://talos-r4-snow-XXX.build.mozilla.org
-on slave launch terminal
-on puppet master run: 'puppetca --clean talos-r4-snow-XXX.build.scl1.mozilla.com'
-on slave through screensharing run 'sudo puppetd --test --server staging-puppet.build.mozilla.org'
-on puppet master run: 'puppetca --sign talos-r4-snow-XXX.build.scl1.mozilla.com'
-this will automatically reboot slave when done
-reconnect to slave over vnc/screensharing (mac screensharing will do this automatically)
-on slave launch terminal
-on puppet master run: 'puppetca --clean talos-r4-snow-XXX.build.scl1.mozilla.com'
- wait for the slave to try to sync with puppet master
-on puppet master: 'puppetca --sign talos-r4-snow-XXX.build.scl1.mozilla.com'
 -on staging master, monitor progress by running this:
    tail -f /var/log/messages | grep 'has a waiting cert'
-slave will automatically reboot
-slave should automatically connect to buildbot master


NOTE:
-the first few runs of puppet will have random, weird failures.  They are non-deterministic.  As puppet is run in a loop, it should retry automatically, and this usually fixes them.
-if a slave isn't on the buildbot master and isn't running run-puppet.sh, it likely needs to be manually rebooted
-because of a bug/feature in Mac OS X, whenever puppet needs to install a DMG file, the command *must* be run over screensharing - the DMG mounting program will not work when launched from ssh


Option B)
Set up a single machine using the steps in (A) and keep that as a reference image.  (FYI, I have an image of this created last Wednesday.)
For each machine:
-set up DHCP/DNS to give the machine its hostname
-clean the puppet cert on the puppet master: "puppetca --clean talos-r4-snow-XXX.build.scl1.mozilla.com"
-deploy the image to the machines that need it
-boot the machine
-sign the puppet cert on the puppet master: "puppetca --sign talos-r4-snow-XXX.build.scl1.mozilla.com"
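Option B's per-machine cert handling is mechanical enough to loop over.  In this sketch the machine-number range, hostname pattern, and ssh access to the puppet master are all assumptions:

```shell
#!/bin/sh
# Sketch only: Option B's cert handling for a batch of minis.  The number
# range, hostname pattern, and ssh-to-master access are assumptions; with
# DRY_RUN=1 (the default) the commands are only printed, not executed.
MASTER=${MASTER:-staging-puppet.build.mozilla.org}

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

for n in 002 003 004; do
    slave="talos-r4-snow-${n}.build.scl1.mozilla.com"
    # Before imaging: make sure no stale cert is lying around.
    run ssh "$MASTER" puppetca --clean "$slave"
    # (deploy image, boot machine, wait for its first sync attempt...)
    run ssh "$MASTER" puppetca --sign "$slave"
done
```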

In bug 690923 I tested that the machine was able to sync with puppet after changing the hostname.  This is approximately what happens when the refimage is applied to the new machine.

While we originally intended to do option A, I find the manual interventions make this unworkable, and at this time I recommend option B.
> -the first few runs of puppet will have random, weird failures.  They are
> non-deterministic.  As puppet is being run in a loop, puppet should retry
> automatically, and this usually fixes it.

This is a bug in the puppet manifests, and while it doesn't block deployment, it should be fixed ASAP.  Is there a bug open for that?

> -because of a bug/feature in Mac OS X, whenever puppet needs to install a
> DMG file, the command *must* be run over screensharing - the dmg mounting
> program will not work when launched from ssh.

This will have something to do with mach contexts.  I don't remember the details, but I think you need to run it in the context of the logged-in user.  This should have an open bug, too.
(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> > -the first few runs of puppet will have random, weird failures.  They are
> > non-deterministic.  As puppet is being run in a loop, puppet should retry
> > automatically, and this usually fixes it.
> 
> This is a bug in the puppet manifests, and while it doesn't block
> deployment, it should be fixed ASAP.  Is there a bug open for that?
> 
> > -because of a bug/feature in Mac OS X, whenever puppet needs to install a
> > DMG file, the command *must* be run over screensharing - the dmg mounting
> > program will not work when launched from ssh.
> 
> This will have something to do with mach contexts.  I don't remember the
> details, but I think you need to run it in the context of the logged-in
> user.  This should have an open bug, too.

I won't stop you from filing them :)  I have filed bug 690082, which is about manifests that reboot the machine mid puppet run.
What's left to do here that's not covered in bug 671415?
Assignee: server-ops-releng → dustin
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> What's left to do here that's not covered in bug 671415?

That bug is about getting to the point of ready to accept an image.  This bug takes over from that bug and goes from 'ready to image' to 'able to image'.
Given the progress in bug 690236, it seems like this problem has been solved.  RESOLVED->FIXED?
Seeing as 70 machines have been imaged, I think it is safe to RESOLVED->FIXED this bug.  If you feel this is not the case, please reopen
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations