Closed Bug 928148 Opened 11 years ago Closed 11 years ago

Repair/rebuild install.test.releng.scl3.m.c

Categories

(Infrastructure & Operations :: RelOps: General, task)

Platform: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: dividehex)

References

Details

It looks like the /Deploy partition was formatted when the deploystudio runtime was accidentally run on this host.  The shortest route to recovery would be to recover the lost files with recovery software.  I'm going to attempt to use testdisk and/or photorec to recover the most critical files.  From there I can rebuild the DS config and installation.
Assignee: relops → jwatkins
photorec is running in a screen instance under root, with the recovered files being written into /var/root/recovery.  Right now it has about 5 hours left, so I'll check back on it later.
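For reference, a rough sketch of how that kind of run can be kicked off (the exact invocation wasn't captured; the output directory is the one above, the device and flags are my assumption):

  # start a named screen session as root
  sudo screen -S photorec-recovery
  # inside the session: enable logging, write recovered files under /var/root/recovery
  photorec /log /d /var/root/recovery /dev/disk0s2
  # detach with ctrl-a d; reattach later with: screen -r photorec-recovery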
Photorec is still running, but I'm not sure what, if anything, it's doing (dtruss isn't showing much activity).  du -sm doesn't show anything that's going to be big enough to be an image, but it's likely it at least partially recovered some config files.

I also noticed that DS is still open on install.test, and it's still showing information under Computers (including machines and groups) and under Workflows. Masters, Scripts, and Packages are empty. I'm not sure if this is just resident in memory or if it's reading files from somewhere else on disk.

I've started to try to recreate what would have been in /Deploy from install.build.releng.scl3.mozilla.com.  I know there were some differences between the two, but this should at least give us a decent start.  I'm putting the files in /Deploy-recover and have so far:

* created a directory tree structure that matches install.build
* copied over "Databases/Workflows" from install.build - this doesn't have the mountain lion workflow, but we should be able to reconstruct it pretty easily from there and compare against pxe1.
* copied over any of the Databases/ByHost files from install.build that had mtnlion in them - happily this is almost all of them; only the ref machine, 90-92, and 95 are missing.  406c8f3dfed9 got retasked to do mavericks, so I've renamed the file to denote that, but haven't changed the contents.
* recovered the database files for 90-92 and 95 from the files that photorec had pulled off of disk (files named -RECOVERED)
* copied over Files from install.build - this should be pretty much the same across all of the OS X architectures, so I'm hoping this will "just work"
* copied over Packages from install.build - I'm not sure if we did a 10.8-specific compile of puppet, but if so, we can always recreate that package again since that was recent work that was done and checked in.
* copied over Masters/HFS/talos-mtnlion-r5-ref-2012-07-31.i386.hfs.dmg from install.relabs - I verified by looking at the workflow that's still accessible via the DS GUI on install.test that this is the file we were using as the base install
* identified all of the plist files for the workflows for mavericks, mtnlion, and mtnlion-ref (still need to go through and figure out which one is the right one for each)
* created Databases/Workflows/B058840B-C584-49D8-B2C3-58B7884F720D.plist from f30477976.plist for the mavericks restore workflow
* created Databases/Workflows/CA2AAB9E-D036-42E7-AEDD-7989EE8A6037.plist from f19799304.plist for the talos-r5-mtnlion-ref restore workflow
* created /Deploy-recover/Databases/Workflows/462B9EB1-1C31-47F3-8392-BBCD6A14C5A4.plist from f30481504.plist for the talos-r5-mtnlion restore workflow (a rough sketch of this copy step follows the list)
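A minimal sketch of that copy-and-rename step, assuming the recovered files ended up somewhere like /var/root/recovery.1/ (photorec's numbered output dirs) and that plutil is enough of a sanity check that a recovered plist still parses:

  # place a recovered workflow plist under the UUID filename DS expects
  cp /var/root/recovery.1/f30477976.plist \
     /Deploy-recover/Databases/Workflows/B058840B-C584-49D8-B2C3-58B7884F720D.plist
  # confirm it is still a well-formed plist and eyeball the contents
  plutil -lint /Deploy-recover/Databases/Workflows/B058840B-C584-49D8-B2C3-58B7884F720D.plist
  plutil -p /Deploy-recover/Databases/Workflows/B058840B-C584-49D8-B2C3-58B7884F720D.plist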

I'm not going to knock myself out trying to find the developer preview dmg image for 10.9 since we'll need to replace that with the release version, anyway.
* verified that Scripts/remove_systemconfig_plist.sh matched f19782616.sh
* verified that Scripts/enable_root.sh matched f19803928.sh
* verified that Scripts/copy_puppetize_files.sh matched f19793856.txt (with one small exception which can be ignored)
* created Scripts/enable_puppetize_at_boot.sh from f19794960.sh (not found on install.build).  Used this one instead of f19781320.sh or f19781696.sh because it appeared to be the latest one (better quoting, etc)
* replaced Scripts/set_hostname.sh with f19793776.sh since it appeared to be a better version than the one on install.build (it had ipconfig waitall in it). Chose it from among f19787264.sh, f19788144.sh, f19793768.sh, and f19793776.sh because it appeared to be the most recent and best commented (verification sketch below).
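The "verified ... matched" entries above are just byte-for-byte comparisons between the copy taken from install.build and what photorec pulled off disk; a sketch (the recovered-file paths are assumptions):

  # diff the install.build copy against the recovered file
  diff -u /Deploy-recover/Scripts/remove_systemconfig_plist.sh /var/root/recovery.1/f19782616.sh
  # or just compare checksums
  shasum -a 256 /Deploy-recover/Scripts/remove_systemconfig_plist.sh /var/root/recovery.1/f19782616.sh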
In hindsight I don't think the last photorec session would have picked up the mtnlion image since it would be a dmg file and photorec doesn't have a dmg signature.  We would need to build a dmg signature and include it as a custom sig.  On the other hand, I think it might be safe to assume the image on the JAMF test server is a good copy of the image.  When I get back next week, I can build a custom sig and search for the dmg, then compare hashes against the one found on the JAMF mini.
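If we do go that route, my understanding is that photorec will pick up extra signatures from a photorec.sig file in the user's home directory (one signature per line: extension, offset, magic bytes), but the catch is that UDIF dmgs keep their identifying 'koly' block at the end of the file, so there is no dependable leading magic to put in such a signature (which is what later made this a dead end; see the notes further down).  Sketch of the file format only, value left as a placeholder:

  # ~/photorec.sig -- custom signatures for photorec (format: extension offset value)
  # dmg 0 0x<magic-bytes>    <- placeholder; dmgs have no reliable leading magic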

Another consideration is that the DS versions are different between install.test and install.build.  We should record which versions are installed on both in case we need to reinstall the DS app to get the version-specific scripts back intact.

Let's also make sure we are not using /dev/disk0s* (which *was* /Deploy) so no other file corruption occurs.  Once we are confident no other files need recovery, we can format it and move /Deploy-recovery onto it.
I unmounted the disk before running photorec yesterday but it might have been automounted again at some point.
I am 99% sure that the DMG for mtnlion I got from the JAMF server is the correct one.

I've been putting /Deploy-recovery on the boot drive, so it explicitly does not overlap with the partition that got reformatted.  That partition is currently mounted at "/Volumes/Macintosh HD 1", but I've made sure not to touch it so as not to interfere with forensic recovery.
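To keep the reformatted partition from being written to in the meantime, a quick check/unmount sketch (device and volume name taken from the comments in this bug):

  # check whether the old /Deploy partition has been auto-mounted again
  diskutil info /dev/disk0s2 | grep -i mounted
  mount | grep disk0s2
  # if it is mounted, unmount it again before doing anything else
  sudo diskutil unmount "/Volumes/Macintosh HD 1"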
Picking up where we left off, I was able to get install.test into a working state.  Here are the steps I took (a rough command sketch follows the list).

* stopped photorec (it had completed its search)
* snapped screen shots of the critical workflows and saved them to the desktop in case DS stopped working
* unloaded DS service with launchctl
* created soft links to the DS reconstructed folder (/Deploy -> /Deploy-recover, /Volumes/Deploy -> /Deploy-recover)
* reloaded the DS service (Note: when the service reloaded, DS copied its built-in scripts from its static Application folder to the Scripts folder, so there was no need to reinstall DS to get the scripts to match the correct DS version)
* verified Restore talos-mtnlion-r5 and Restore talos-mtnlion-r5-ref matched the screenshots of the cached workflows
* Tested the Restore talos-mtnlion-r5 workflow on talos-mtnlion-r5-001
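A rough shell version of those steps, assuming the DS launchd job is the usual /Library/LaunchDaemons/com.deploystudio.server.plist (label from later in this bug; path not double-checked):

  # stop the DeployStudio server while the repository is swapped out
  sudo launchctl unload /Library/LaunchDaemons/com.deploystudio.server.plist
  # point both expected repository paths at the reconstructed tree
  sudo ln -s /Deploy-recover /Deploy
  sudo ln -s /Deploy-recover /Volumes/Deploy
  # bring DS back up; on reload it repopulates Scripts/ with its built-in copies
  sudo launchctl load /Library/LaunchDaemons/com.deploystudio.server.plist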

This is where I ran into problems.  I was seeing the same issue: the restored mini had a cached default route pointing at wherever the image was first taken (e.g. SCL1).  I checked the log file and made sure all steps in the workflow were executing properly.  I finally noticed the remove_systemconfig_plist.sh script was not using the DS environment variables to reference the correct path to the restored volume.  I added the proper variables and also added 3 other system configuration plist files to be removed during the post-image-restore step in the workflow.  These files also contain network identity references and should be removed so they can be regenerated on boot.
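For the record, the shape of that fix: the script needs to prefix its paths with the DS variable that points at the freshly restored volume instead of operating on the booted environment.  A hedged sketch, assuming DeployStudio exposes the restored-volume path as DS_LAST_RESTORED_VOLUME, and with the extra plists listed as plausible candidates (the exact three files added aren't recorded here):

  #!/bin/sh
  # strip network-identity state from the *restored* volume so it regenerates on first boot
  VOL="${DS_LAST_RESTORED_VOLUME}"   # assumption: DS exports the restored volume path
  rm -f "${VOL}/Library/Preferences/SystemConfiguration/preferences.plist"
  rm -f "${VOL}/Library/Preferences/SystemConfiguration/NetworkInterfaces.plist"
  rm -f "${VOL}/Library/Preferences/SystemConfiguration/com.apple.airport.preferences.plist"
  rm -f "${VOL}/Library/Preferences/SystemConfiguration/com.apple.network.identification.plist"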

* Tested again, checked the log, and verified puppetization.
It all seems to be working now.

Other notes:
* added DS versions of install.build and install.test to mana
* looked into adding a dmg sig to photorec, but this proved difficult since dmgs don't have a well-defined magic number or other file markers
* downloaded and ran Stellar Phoenix Mac Recovery software.  Running both lost volume recovery and deep scan came up empty.

The last step here is to move the reconstructed folder to the original partition (/dev/disk0s2).  Once we do so, recovering files won't be an option but since we are fairly confident we have the correct mtnlion image, I don't see a problem moving forward on this.
I've pulled the trigger on the last step: restoring the DS files to the original partition on the second disk.  A rough command sketch follows the list.

* unloaded the com.deploystudio.server
* removed symlinks /Deploy and /Volumes/Deploy
* unmounted and reformatted /dev/disk0s2 with the volume name Deploy
* rsync /Deploy-recovery/ /Volumes/Deploy/ -av
* recreated symlink /Deploy -> /Volumes/Deploy
* restored AFP share points (renamed Deploy -> Deploy-recovery and renamed Deploy-old -> Deploy)
* reloaded com.deploystudio.server
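Roughly, in shell terms (the JHFS+ format and the launchd plist path are assumptions; the rsync invocation is the one listed above):

  # stop DS and drop the temporary symlinks
  sudo launchctl unload /Library/LaunchDaemons/com.deploystudio.server.plist
  sudo rm /Deploy /Volumes/Deploy
  # reformat the original partition and copy the reconstructed tree onto it
  sudo diskutil eraseVolume JHFS+ Deploy /dev/disk0s2
  sudo rsync -av /Deploy-recovery/ /Volumes/Deploy/
  # repoint /Deploy at the real volume and bring DS back
  sudo ln -s /Volumes/Deploy /Deploy
  sudo launchctl load /Library/LaunchDaemons/com.deploystudio.server.plist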

And finally tested a restore on talos-mtnlion-r5-011 (reserved for testing in bug 917082).
Everything worked like a charm.  I'm calling this done.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED