Closed Bug 513718 Opened 16 years ago Closed 16 years ago

Investigate why cb-xserve01 keeps going offline

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: alqahira, Assigned: phong)

Details

+++ This bug was initially created as a clone of Bug #513412 +++

This morning, shortly after Phong rebooted cb-xserve01.mozilla.com (Camino's primary tinderbox) in bug 513412, it went yellow-for-longer-than-a-clobber on the very first build, and I can't ssh into the box (again). When I try to ssh into the box, I'm just sitting and waiting for the password prompts to appear and I assume, if I wait long enough, ssh will finally time out (I can, however, still ping it from the jumphost). Before restarting Tinderbox on the machine, I checked the disk in Disk Utility, and it found no problems. I also checked the system.log, console.log, and their rolled-over prior versions for anything that might tell me why the box went offline on Friday and saw nothing that jumped out (only the usual DNS spam about cb-xserve01.local ≠ cb-xserve01.mozilla.com). Please investigate why this Xserve is going offline when we try to make a build; thanks.
Probably need to put the KVM on this. Is VNC even working for you?
Flags: colo-trip+
I can't connect via ssh, which AFAIK is needed in order to get VNC to connect (at least via the jumphost). VNC did work this morning after the reboot, though; I ran Disk Utility and browsed the Console via the GUI, but that isn't the case now ;)
You could port-forward VNC through cb-jump01. Box is answering on port 5900 but I can't tell if VNC is working. Either way, needs an onsite trip to put a KVM on it. That'll happen later in the day.
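One way to tell "the port answers" apart from "VNC is actually working" is to read the first 12 bytes the server sends: a VNC (RFB) server opens every connection with a ProtocolVersion message of the form `RFB xxx.yyy` followed by a newline. The sketch below is only an illustration of that check, not something that was run on this box. The function names `parse_rfb_banner` and `probe_vnc` and the 5-second timeout are assumptions; the host and port would be whatever the tunnel through cb-jump01 exposes (for example a local port forwarded with `ssh -L`).

```python
import re
import socket

# An RFB ProtocolVersion message is exactly 12 bytes: b"RFB xxx.yyy\n".
_RFB_BANNER = re.compile(rb"RFB (\d{3})\.(\d{3})\n")


def parse_rfb_banner(data):
    """Return (major, minor) if data starts with an RFB banner, else None."""
    match = _RFB_BANNER.match(data[:12])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def probe_vnc(host, port=5900, timeout_s=5.0):
    """Connect and try to read the RFB banner.

    Returns (major, minor) if the server greets us like a VNC server,
    or None if the port accepts connections but stays silent or sends
    something else. Raises OSError if the TCP connection itself fails.
    The timeout value is an arbitrary assumption.
    """
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.settimeout(timeout_s)
        data = b""
        try:
            while len(data) < 12:
                chunk = sock.recv(12 - len(data))
                if not chunk:  # peer closed the connection early
                    break
                data += chunk
        except socket.timeout:
            pass  # port is open but nothing is talking on it
    return parse_rfb_banner(data)
```

A `None` result with a successful connect would match the symptom described here: the kernel still accepts the connection, but the VNC service behind it never answers.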
Is this machine using RAID for the disk? The canonical failure mode for MoCo Xserves is the filesystem cooking itself, manifesting as the box freezing up as soon as you try to run tinderbox on it. I've seen machines pass Disk Utility, but trying to do an ls in a terminal at the same time essentially hangs. Our solution has been to restore from a disk image.
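The "ls hangs while Disk Utility passes" symptom can be checked without tying up a terminal by running the directory listing under a timeout. This is a minimal sketch of that idea, not the procedure anyone used in this bug. The name `listing_responds`, the `lister` parameter, and the 5-second default are all assumptions made for illustration.

```python
import os
import threading


def listing_responds(path, timeout_s=5.0, lister=os.listdir):
    """Return True if listing `path` finishes within timeout_s seconds.

    The listing runs in a daemon thread so that a call stuck inside the
    filesystem cannot block the caller. An OSError (missing path,
    permission denied) still counts as a response, because the
    filesystem answered.
    """
    done = threading.Event()

    def worker():
        try:
            lister(path)
        except OSError:
            pass  # an error is still an answer from the filesystem
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    # Event.wait returns False if the timeout expires first.
    return done.wait(timeout_s)
```

Calling `listing_responds("/")` while a build is starting would return `False` on a machine in the state described above. The stuck thread is simply abandoned, which is acceptable for a one-off diagnostic but not for a long-running monitor.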
I believe the disk had a RAID label attached to it in Disk Utility when I ran it this morning (iirc, there's a bizarre setup where there appear to be three disks but we only have one volume); I don't look at Disk Utility there enough to remember well, though. I definitely was on the box over VNC as tinderbox started, but it looks like I didn't stay on to tail the log over ssh after that.
Server was unresponsive when I hooked up a local KVM. I power cycled it, and watched it spin on the boot screen for a few minutes. Hooked up to the Raritan (kvm01) on channel 16.
Please let me know if there's anything you need me to do/from me to aid in the investigation :)
I suspect, based on the age of the box, that it's out of warranty. If comment #4 is right, then it needs to be re-imaged.
Where are we on getting the box re-imaged then?
I think mentally I was waiting for you guys to say "go ahead" since it'll obviously wipe all data. I'm not sure what other recovery techniques we can use - Phong is back in the office today and might have some insights.
Assignee: server-ops → phong
Ah, sorry. "Go ahead." We can set up the tinderbox again. I don't think there's anything we absolutely need on it. Smokey?
This server is pretty much dead. We'll have to re-image it. It is stuck at the Apple screen with the spinning wheel.
I am going to use an image from cb-xserve03.
Re-imaged; it should be online again.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations