Closed Bug 1186197 Opened 9 years ago Closed 9 years ago

repurpose two casper servers as replacement deploystudio servers

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: arich)

References

Details

(Whiteboard: [osx])

Upgrading the existing DS servers is pretty risky and likely to leave us with no functioning install servers. The hardware is also out of warranty.

I'm going to attempt to repurpose two of the newer machines we bought for casper as new DS servers and get 10.10 and DS 1.6.15 installed on them.
Filed https://mozilla.service-now.com/nav_to.do?uri=sc_req_item.do%3Fsys_id=7c2d5fcb38d10a001518c8fe7d991fa5%26sysparm_stack=sc_req_item_list.do%3Fsysparm_query=active=true to obtain two copies of OS X Server.

10.10.2 installed on deploy.test.releng.scl3.mozilla.com, upgrades to 10.10.4 applied.
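The point-release upgrades were just a standard softwareupdate run, roughly:

  # list available updates, then install everything (reboot if required)
  sudo softwareupdate -l
  sudo softwareupdate -i -a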

DS configuration files copied from /Deploy on install.test.releng.scl3.mozilla.com to /Volumes/Deploy (with a /Deploy symlink pointing at it).
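For reference, the copy and symlink step looked roughly like this (a sketch; the exact rsync flags are an assumption, not a record of the actual command):

  # pull the DS configuration tree from the old server onto the new data volume
  sudo rsync -avE install.test.releng.scl3.mozilla.com:/Deploy/ /Volumes/Deploy/
  # keep the old /Deploy path working via a symlink
  sudo ln -s /Volumes/Deploy /Deploy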

DeployStudio downloaded and installed on deploy.test.
I've gotten OS X Server installed on deploy.test. I've also copied over the netboot sets from install.test:/Library/NetBoot.

I've turned on sharing and tried to match the sharing perms for the two netboot dirs and /Deploy (based on install.test).
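A sketch of what the sharing side amounts to, assuming the stock sharing(8) CLI and assuming the two netboot dirs are the usual NetBootSP0/NetBootClients0 (the actual config was done by hand to mirror install.test):

  # export the Deploy volume over AFP and SMB (share names are illustrative)
  sudo sharing -a /Volumes/Deploy -A Deploy -S Deploy
  sudo sharing -l
  # compare ownership and ACLs of the netboot dirs against install.test before adjusting
  ls -led@ /Library/NetBoot/NetBootSP0 /Library/NetBoot/NetBootClients0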

Configuring DeployStudio isn't working so far, since I presume the configuration files reference install.test. The best route might be to shut down install.test and put this machine in its place to try to finish the configuration, then see if it magically "just works."

:dividehex: do you have an opinion on the best course of action here? You've done this more times than anyone else.
Flags: needinfo?(jwatkins)
dividehex suggested moving the old DS server aside and standing this one up with the old IP and hostname to finish the config.

He also suggested using RAID1 for the disks instead of having separate ones for the boot disk and /Volumes/Deploy. For the time being, I've moved the data off of the second disk and onto the boot disk. Apparently RAID can be added after the fact.
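If we do go the mirror route later, the after-the-fact conversion with diskutil would in principle look something like this (a sketch; device identifiers are illustrative):

  # convert the existing boot volume into a (degraded) mirror set
  sudo diskutil appleRAID enable mirror disk0s2
  # then add the freed-up second disk as the other member and let it rebuild
  sudo diskutil appleRAID add member disk1s2 "/Volumes/Macintosh HD"
  # watch rebuild progress
  diskutil appleRAID list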

In order to puppetize the 10.10 machine, I needed to add PBKDF2 hashes of the dsadmin password to hiera.
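For future reference, one way to get a SALTED-SHA512-PBKDF2 hash for an account on a 10.8+ box is to decode its local ShadowHashData; this is a sketch of that general approach, not necessarily the exact procedure used here:

  # dump the dsadmin shadow hash blob and decode the binary plist inside it;
  # the SALTED-SHA512-PBKDF2 dict (entropy, salt, iterations) is what ends up in hiera
  sudo defaults read /var/db/dslocal/nodes/Default/users/dsadmin.plist ShadowHashData \
    | tr -dc '0-9a-f' | xxd -r -p | plutil -convert xml1 - -o -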
Flags: needinfo?(jwatkins)
Okay, I think I have a working installation (at least one that will install yosemite) for the new install.test.releng.scl3.mozilla.com. This required quite a bit of trial and error. The old install server was interfering, so I had to shut it down completely. Once we're sure the new machine is working, we should wipe the old one to make sure that it doesn't accidentally get booted and start claiming it's a netboot server again.

I worked from the old DS server configuration info on mana:

https://mana.mozilla.org/wiki/display/SYSADMIN/Installing+and+Configuring+DeployStudio+and+DeployStudio+PC

First, I had to generate a new certificate with the correct hostname (through the Server console, though it only created it for one year). I deleted the bogus deploy.test.releng.scl3.mozilla.com certificate.
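A quick way to sanity-check the new cert's subject and expiry, assuming it landed in the System keychain (commands are illustrative):

  # confirm the certificate has the right CN and note the one-year expiry
  security find-certificate -c install.test.releng.scl3.mozilla.com -p /Library/Keychains/System.keychain \
    | openssl x509 -noout -subject -dates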

I misread the original DS server setup and had specified a local disk. I went back to correct that and made it a share point using AFP (I also shared /Deploy over AFP and SMB with the dsadmin creds): afp://install.test.releng.scl3.mozilla.com/Deploy

Next I needed a new netboot set, because the old one had a version mismatch with the new server. This also required a substantial amount of trial and error. The final settings I used were NFS and SSL (with the newly generated cert for install.test.releng.scl3.mozilla.com), and, for reasons I don't understand, I had to specify the IP address of the netboot server instead of the hostname when creating the set. Using the hostname failed with a host not found/unreachable error; once I put in the IP, it worked like a champ. This seems broken, but I'm willing to let it slide for now.
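If anyone wants to chase the hostname failure later, the first things to check are what names the server is actually configured with and whether they resolve (illustrative commands):

  # the names the server thinks it has
  scutil --get HostName
  scutil --get LocalHostName
  scutil --get ComputerName
  # forward resolution of the name that would be embedded in the netboot set
  dscacheutil -q host -a name install.test.releng.scl3.mozilla.com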

I've now used this configuration to install the second server, which will become install.build (currently it's deploy.test so I can get the initial install done).

We still need to test the new install.test against older hardware to make sure it can image those machines correctly.

We also still need to get the boot disk mirrored on install.test.


As for the second machine, it's installed, puppet is disabled for the time being, and it's in the process of applying patches/upgrades. I'll work on more of the setup after we verify that install.test works and I move deploy.test (which will become install.build) back to the build VLAN and attach the backup drive from the old install.build.
I used the new install.test to reimage t-snow-r4-0005 and t-yosemite-r5-0003, both of which looked completely successful and are now back in the pool.
The new install.build is up and running: puppetized, OS X Server installed, netboot and sharing configured, DS installed and configured, and a new netboot set generated. I'll run some test reimages off of it tomorrow to verify.

Currently install.build has both backup drives attached so I can grab files from either old machine as necessary.

The old install.build is shut down.
Research on mirroring the boot drives is so far not looking promising. When Apple moved to CoreStorage (a crippled LVM), they appear to have removed the ability to RAID the boot drive with AppleRAID. CoreStorage doesn't support RAID1 either, since it's primarily focused on doing RAID0 for Fusion drives.

We could always revert the CoreStorage volumes back to plain physical devices and then try to RAID them, but that will likely cause severe issues (possibly leaving the machines non-functional) whenever we apply an OS update.
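For the record, that revert-then-mirror path would go something like this (sketch only, and we are not doing it given the update risk):

  # find the logical volume UUID for the boot volume
  diskutil cs list
  # convert it back to a plain HFS+ partition (UUID placeholder, not a real value)
  sudo diskutil cs revert <logical-volume-UUID>
  # only at that point could appleRAID mirroring even be attempted on it
  sudo diskutil appleRAID enable mirror disk0s2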
install.build is proving to be a bit more of a problem than install.test because we use RAID0 on the Lion boot disks.

asr (the program that's used to restore files) no longer supports plain file copy, so you have to specify --erase when restoring. That wasn't happening on the Lion builders, and I tracked it down to the fact that DS apparently doesn't pass --erase if you specify a disk name (like Macintosh HD), but WILL if you pass a disk device (like /dev/disk0s2).
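In asr terms, the difference boils down to something like this (illustrative invocations; the image path is made up):

  # target given as a volume name: DS builds the command without --erase, and asr refuses a plain file copy
  sudo asr restore --source /Deploy/Masters/lion.dmg --target "/Volumes/Macintosh HD" --noprompt
  # target given as a device: DS adds --erase and the block-level restore goes through
  sudo asr restore --source /Deploy/Masters/lion.dmg --target /dev/disk0s2 --erase --noprompt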

The last thing I attempted was to specify /dev/disk2 (which is what the RAIDed disk is called). That machine (bld-lion-r5-002) successfully passed the restore phase and went on to finish the imaging process, but it never came back up. I'm not sure if it overwrote key RAID info or what. I've opened bug 1187478 to get the machine back on the network and do some more troubleshooting (hopefully with dcops watching the console so they can tell me the failure mode).
Depends on: 1187478
Since RAID1 no longer works in Yosemite, I've set up the second disk as a Time Machine disk and turned on Time Machine. We can revisit the RAID issue at a later date.
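The Time Machine setup is just the stock tmutil workflow, roughly (the volume name is illustrative):

  # point Time Machine at the second disk, enable it, and kick off a first backup
  sudo tmutil setdestination /Volumes/TimeMachine
  sudo tmutil enable
  sudo tmutil startbackup
  tmutil destinationinfo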
We should NEVER opt to use RAID with Apple hardware again; it's nothing but trouble when it comes to support, upgrading, etc.

I believe I have install.build functioning again after a lot of headdesking and hands-on help from :van. 

Notes for future configuration and troubleshooting:

Even after modifying the Restore sequence to specify the disk device of the RAID0 (/dev/disk2) so that it used --erase as an argument to asr, the machine would appear to reimage correctly and then never successfully boot again.

I speculated that the boot blocks were getting munged on the RAID by the --erase flag. I tried adding in a partitioning step, but apparently you can't partition an Apple RAID disk. I tried various ways to specify the disk, tried wiping out the RAID and starting over from two bare disks, and tried turning on verbose boot to catch any errors (it only went straight to the folder-with-question-mark screen that signifies it couldn't find a boot disk). Nothing seemed to help.
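For anyone retracing this later, the knobs I was poking at were roughly these (illustrative commands, run against the target mini):

  # verbose boot so the console shows where startup dies
  sudo nvram boot-args="-v"
  # inspect the RAID set and member status
  diskutil appleRAID list
  # check the blessed system folder on the restored volume and the firmware's current boot selection
  bless --info "/Volumes/Macintosh HD"
  bless --getBoot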

At one point, I ran a disk repair on the RAID partition to see if that would help any. Miraculously the machine booted up after that. I reimaged with the old workflow again and it failed again.
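The manual repair was the standard diskutil pass against the RAID volume (sketch):

  # verify, then repair, the restored RAID volume
  diskutil verifyVolume disk2
  sudo diskutil repairVolume disk2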

Thankfully there's a checkbox in the Restore task in DS for "Preventative volume repair", which supposedly runs an fsck. I enabled that, and it seemed to do the trick on bld-lion-r5-002. I reimaged it twice without issue, then used another mini, bld-lion-r5-004, to make sure it was repeatable on a machine I hadn't already taken extreme liberties with. It worked on bld-lion-r5-004 as well.

I'm not 100% sure what it's doing under the hood, but I speculate that the following lines in the logs are where it's working its magic to fix the boot sectors:

2015-07-27 16:31:26.033 DeployStudio Runtime.bin[391:13067] Started file system repair on disk2 Macintosh HD
2015-07-27 16:31:26.134 DeployStudio Runtime.bin[391:13067] Updating boot support partitions for the volume as required
2015-07-27 16:34:29.227 DeployStudio Runtime.bin[391:13067] Error: -69673: Unable to unmount volume for repair
2015-07-27 16:34:29.228 DeployStudio Runtime.bin[391:13067] -> Restore action completed.

It claims to be doing something with the "boot support partitions" before it prints a message that it's unable to unmount the volume to repair it. So I don't know if it's actually running the fsck, or just messing with the boot blocks and then skipping the fsck. Either way, it adds a minute or two to the time required to image a machine, but it at least seems to work.
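If we want to confirm this is still happening on future images, those lines are easy to pull out of the per-host runtime logs DS writes back to the repository (the log path below is my assumption about where they land):

  # look for the repair / boot-support messages after a restore
  grep -E "file system repair|boot support partitions" /Deploy/Logs/bld-lion-r5-002*.log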
Given the screaming nightmare that is RAID on Apple hardware, I'm not even going to attempt anything hacky with the DS server boot disks. I'm going to leave the second disk as a Time Machine disk and let that be our source of local backups. Ideally we will also soon have remote backups via bacula once they fix their openssl libs.
I've re-enabled bld-lion-r5-002 and bld-lion-r5-004 to make sure that they still build things correctly (I don't have any reason to believe they wouldn't). Coop, could you take a look at them tomorrow to make sure that they're functioning correctly?
Flags: needinfo?(coop)
(In reply to Amy Rich [:arr] [:arich] from comment #12)
> I've re-enabled bld-lion-r5-002 and bld-lion-r5-004 to make sure that they
> still build things correctly (I don't have any reason to believe they
> wouldn't). Coop, could you take a look at them tomorrow to make sure that
> they're functioning correctly?

Will do.
Flags: needinfo?(coop)
(In reply to Chris Cooper [:coop] from comment #13)
> (In reply to Amy Rich [:arr] [:arich] from comment #12)
> > I've re-enabled bld-lion-r5-002 and bld-lion-r5-004 to make sure that they
> > still build things correctly (I don't have any reason to believe they
> > wouldn't). Coop, could you take a look at them tomorrow to make sure that
> > they're functioning correctly?
> 
> Will do.

All jobs have passed since last night on both machines.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED