Status

mozilla.org Graveyard
Server Operations
--
blocker
RESOLVED FIXED
12 years ago
3 years ago

People

(Reporter: justdave, Assigned: justdave)

Details

osadm01 corrupted its filesystem and remounted the primary drive read-only at approximately 11:51 this morning, based on evidence on the system (last-modified timestamps on logfiles and so forth).  For some reason, nothing got jammed up enough to page about it until around 2:50 this afternoon.  On investigating, I discovered the filesystem was read-only, and the machine refused to reboot gracefully.  OSUOSL support was asked to reboot the machine manually.  We had a long wait in line because they're still cleaning up from last night's wind storm (which apparently wreaked havoc on the lab; they were on generator power for several hours).

16:49:52 <@emsearcy> justdave: it has disk problems (I only let it reboot once ... it was executing PXE-booting over and over again when I hooked the console up, after restarting it booted RHEL, but kernel panics) ... more ->
16:51:03 <@emsearcy> justdave: it detects an ext3 problem and goes straight to disk recovery, which errors with 'no journal found at offset 0xfoo, recovery failed, then the box kernel panics after an attempted swap_root

Current status: waiting on ISO downloads to reload the box with the current version of RHEL4.

Plan:
1) Wipe the disk and reinstall RHEL4 from CDs.
2) Register the machine with a "reactivation" key generated by RHN, which will immediately reload the last known package set and centrally-managed configuration files from that machine upon registration.
3) Pull last night's backup to restore the nagios and RHN Proxy configuration files.
4) Rsync the web cluster config (/data) over from mradm01.
Blocks: 363995
03:12:00 < chizu> I tried to get it to boot, that seemed to show some hardware issue existed. One of the drives is randomly saying it's dead, but the cciss controller says the drive is fine or in recovery mode.
03:12:49 < chizu> fsck worked fine, but actually booting the filesystem re-corrupts it.
03:13:16 < chizu> I kept getting kernel tracebacks in cciss or errors in ext3.
03:14:09 < chizu> I couldn't burn any RHEL CDs, as the OSL was out of burnable discs.
03:14:44 < chizu> So... yeah. Got stuck :|

The next logical course of action (since it's sounding a lot like some hardware issue other than the drives) is to swap the drives and network cables with one of the webheads.

Working on getting someone at OSL to assist; can't find chizu right now.

And for reference (since I didn't add it here earlier) this is also being tracked on OSL ticket 2183.
OK, as far as I can tell, everything is back up and running.

We ended up trading hard drives with osweb04, since the two are the same model of machine and the webheads are painless to rebuild (that's all scripted to the hilt).  The error followed the hard drives :( so it is indeed the drives having problems.  Why the array controller didn't detect the problem, we don't know.

The old osadm01 machine, with the known-good drives from osweb04, was wiped clean, and reinstalled from RHEL4U4 CDs.

The old osweb04 machine was booted from a Gentoo LiveCD, and the drives' filesystem mounted under /mnt/osadm01.  As before, it will only mount read-only, but it still mounts, and having this available beat the pants off restoring backups from tape ;)

The new osadm01 was registered with RHN via rhnreg_ks with the re-activation key generated from the RHN website.  Unfortunately, it did not restore the package set.  When I logged into the RHN website to look, and tried to restore it from the server side, it recommended that I not do that, because several of the packages in the profile did not exist and could not be restored.  When I looked at the list of packages, it was all stuff from rpmforge or locally installed.

I rsynced /root, /var/named, /etc/ssh, and /etc/sysconfig off of the old box, then took the list of "nonexistent" packages from above and fed it to up2date from the command line, reinstalling most of those packages from rpmforge.

Once the ssh host keys were fixed, the /data directory magically appeared and started populating itself (because that's pushed from mradm01).  I rsynced /data/bin/config over later (that's not included in the push from mradm01).

Rather than following through with the rollback from RHN, I just grabbed the list of "packages that would be changed", taking just the ones that "only exist in the snapshot" and fed it to up2date on the command line.
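
For anyone repeating this, turning a pasted package list into a single up2date command line can be sketched roughly as below. This is an illustration only: it assumes the list arrives as one name-version-release string per line, which may not match what the RHN page actually shows, and the sample package strings are made up rather than taken from this incident.

```python
import shlex


def package_name(nvr):
    """Strip the trailing version and release from an RPM name-version-release string."""
    parts = nvr.rsplit("-", 2)
    # Input is assumed to be full NVR strings; a bare name that itself contains
    # two or more dashes would be split incorrectly here.
    return parts[0] if len(parts) == 3 else nvr


def up2date_command(nvr_lines):
    """Build one up2date command line from package strings (assumed NVR format)."""
    names = []
    for line in nvr_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        name = package_name(line)
        if name not in names:  # keep first-seen order, drop duplicates
            names.append(name)
    return " ".join(shlex.quote(arg) for arg in ["up2date"] + names)


if __name__ == "__main__":
    # Hypothetical sample input, not the real list from this incident.
    sample = ["cacti-0.8.6i-1.el4.rf", "nagios-plugins-1.4.5-1.el4.rf", "", "cacti-0.8.6i-1.el4.rf"]
    print(up2date_command(sample))
```

The function only builds the command string; it does not run anything, so the result can be eyeballed before pasting it into a shell.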

I then rsynced /usr/src/redhat off the old box, and installed our custom-built php packages (which changed nothing except depending on rhn-apache instead of httpd), and the bacula rpm out of /root.  Then installed cacti from rpmforge (which depended on php, which was installed above).

I did a dry-run rsync of the /etc directory, and then cherry-picked files I thought we might have customized and built an include file (which is at /root/crashrestore-rsync.include if anyone needs to see what got restored) for the live rsync of /etc.
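
A rough sketch of how an include file like that can be generated from a cherry-picked list of paths follows. It is not the script used here, and the real contents of /root/crashrestore-rsync.include are not reproduced; the function name and sample paths are hypothetical. The one non-obvious point it illustrates is that rsync only descends into directories that are themselves included, so every parent directory of a wanted file has to be listed too.

```python
from pathlib import PurePosixPath


def rsync_include_patterns(relative_paths):
    """Return rsync include patterns for files given relative to the transfer root.

    Each file's parent directories are emitted (with a trailing slash) before
    the file itself, shallowest first, and duplicates are dropped.
    """
    patterns = []
    for raw in relative_paths:
        path = PurePosixPath(raw.strip().lstrip("/"))
        if not path.parts:
            continue  # blank line
        # path.parents is deepest-first and ends with "."; reverse and drop ".".
        parents = [p for p in reversed(path.parents) if p.parts]
        for parent in parents:
            entry = "/" + str(parent) + "/"
            if entry not in patterns:
                patterns.append(entry)
        entry = "/" + str(path)
        if entry not in patterns:
            patterns.append(entry)
    return patterns


if __name__ == "__main__":
    # Hypothetical picks from a dry run, relative to /etc.
    picks = ["sysconfig/network-scripts/ifcfg-eth0", "resolv.conf"]
    print("\n".join(rsync_include_patterns(picks)))
```

Written one pattern per line to a file, the output would be passed to rsync with --include-from together with --exclude='*', with the source given as the old /etc/ directory (trailing slash) so that the leading slashes anchor at the transfer root.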

I also rsynced over most (but not all) of the stuff in /var (with almost all the daemons shut down at the time, including syslogd).  This included the nameserver config, cron jobs, mail, mail queues, and logs.

I rsynced over only files that didn't already exist from /usr/lib/nagios/plugins to pick up our custom plugins.
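
rsync's --ignore-existing option gives this "only copy what isn't already there" behaviour, though which options were actually used here isn't recorded. As a way to preview what such a copy would pick up, a minimal sketch (function name and demo file names are hypothetical) that lists files present in one tree but absent from another:

```python
import os
import tempfile


def files_only_in_source(source_dir, dest_dir):
    """List paths, relative to source_dir, of files that do not exist under dest_dir."""
    missing = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), source_dir)
            # lexists also counts dangling symlinks as "already there".
            if not os.path.lexists(os.path.join(dest_dir, rel)):
                missing.append(rel)
    return sorted(missing)


if __name__ == "__main__":
    # Throwaway directories stand in for the old and new plugin directories.
    with tempfile.TemporaryDirectory() as old, tempfile.TemporaryDirectory() as new:
        for name in ("check_ping", "check_custom"):
            open(os.path.join(old, name), "w").close()
        open(os.path.join(new, "check_ping"), "w").close()
        print(files_only_in_source(old, new))
```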

I chrooted into the old filesystem on osweb04, ran chkconfig --list, and saved the sorted output to a file on osadm01.  I then ran the same command on osadm01 and compared the two to get all the right daemons queued for running at boot.
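
The comparison was done by hand, but it can be sketched in code along these lines. This assumes the usual chkconfig --list layout of a service name followed by 0:off through 6:off style fields; lines that don't fit that layout (such as the xinetd section) are skipped. The function names are hypothetical.

```python
import re

_LEVEL = re.compile(r"^([0-6]):(on|off)$")


def parse_chkconfig(text):
    """Map service name -> set of runlevels (as strings) where the service is on."""
    services = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        matches = [_LEVEL.match(f) for f in fields[1:]]
        if not all(matches):
            continue  # e.g. the xinetd section, which uses a different layout
        services[fields[0]] = {m.group(1) for m in matches if m.group(2) == "on"}
    return services


def chkconfig_fixups(old_text, new_text):
    """Return chkconfig commands that would make the new box match the old one."""
    old, new = parse_chkconfig(old_text), parse_chkconfig(new_text)
    commands = []
    for name in sorted(old):
        if name not in new:
            # Emitted as a comment so the output can still be pasted into a shell.
            commands.append(f"# {name}: not installed on new box")
            continue
        turn_on = "".join(sorted(old[name] - new[name]))
        turn_off = "".join(sorted(new[name] - old[name]))
        if turn_on:
            commands.append(f"chkconfig --level {turn_on} {name} on")
        if turn_off:
            commands.append(f"chkconfig --level {turn_off} {name} off")
    return commands
```

Services that exist only on the new box are left alone by this sketch; it only reports differences for services the old box knew about, and it prints the commands rather than running them.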

As of this writing, everything I can remember and know how to test seems to be up and running again.

In case anyone runs across anything that is still missing, the original filesystem is at 192.168.0.104:/mnt/osadm01

Not sure how long we'll leave it there before rebuilding osweb04 again (that's probably up to justin).  Once we yank it, we'll still have the previous day's backups on tape if we really need something that bad.
Status: NEW → RESOLVED
Last Resolved: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard