Closed
Bug 513718
Opened 16 years ago
Closed 16 years ago
Investigate why cb-xserve01 keeps going offline
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: alqahira, Assigned: phong)
Details
+++ This bug was initially created as a clone of Bug #513412 +++
This morning, shortly after Phong rebooted cb-xserve01.mozilla.com (Camino's primary tinderbox) in bug 513412, it went yellow-for-longer-than-a-clobber on the very first build, and I can't ssh into the box (again). When I try to ssh into the box, I'm just sitting and waiting for the password prompts to appear and I assume, if I wait long enough, ssh will finally time out (I can, however, still ping it from the jumphost).
Before restarting Tinderbox on the machine, I checked the disk in Disk Utility, and it found no problems. I also checked the system.log, console.log, and their rolled-over prior versions for anything that might tell me why the box went offline on Friday and saw nothing that jumped out (only the usual DNS spam about cb-xserve01.local ≠ cb-xserve01.mozilla.com).
Please investigate why this Xserve is going offline when we try to make a build; thanks.
Comment 1•16 years ago
Probably need to put the KVM on this. Is VNC even working for you?
Flags: colo-trip+
Comment 2•16 years ago (Reporter)
I can't connect via ssh, which AFAIK is needed in order to get VNC to connect (at least via the jumphost). VNC did work this morning after the reboot, though; I ran Disk Utility and browsed the Console via the GUI, but it isn't working now ;)
Comment 3•16 years ago
You could port-forward VNC through cb-jump01. Box is answering on port 5900 but I can't tell if VNC is working. Either way, needs an onsite trip to put a KVM on it. That'll happen later in the day.
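One way to tell "port 5900 answers" apart from "VNC is actually working" is to read the server's greeting: an RFB (VNC) server opens the conversation with a 12-byte version string of the form `RFB xxx.yyy\n`. The following is a minimal Python sketch of that check, not something that was run on this box. The function names `parse_rfb_banner` and `probe_vnc` and the timeout value are illustrative assumptions.

```python
import re
import socket

# An RFB server greets the client with exactly 12 bytes: b"RFB xxx.yyy\n".
RFB_BANNER = re.compile(rb"RFB (\d{3})\.(\d{3})\n")


def parse_rfb_banner(data):
    """Return (major, minor) if data is a valid RFB version greeting, else None."""
    match = RFB_BANNER.fullmatch(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def probe_vnc(host, port=5900, timeout=5.0):
    """Connect and read the greeting; None means no valid banner arrived in time.

    The host, port and timeout are whatever the caller supplies; nothing here
    is specific to cb-xserve01.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = b""
            while len(data) < 12:
                chunk = sock.recv(12 - len(data))
                if not chunk:  # peer closed before sending a full greeting
                    break
                data += chunk
    except OSError:  # refused, unreachable, or timed out
        return None
    return parse_rfb_banner(data)
```

If the port has been forwarded through the jumphost, the probe would be pointed at the local end of the tunnel, for example `probe_vnc("localhost", 5900)`. A TCP connection that succeeds but never yields a banner would suggest the kernel is still accepting connections while the VNC server itself is wedged.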
Comment 4•16 years ago
Is this machine using RAID for the disk? The canonical failure mode for MoCo Xserves is the filesystem cooking itself, manifesting as the box freezing up as soon as you try to run tinderbox on it. I've seen machines pass Disk Utility while an ls run in a terminal at the same time essentially hangs. Our solution has been to restore from a disk image.
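The "passes Disk Utility but ls hangs" symptom can be checked without tying up a terminal by running the listing under a timeout. This is a minimal Python sketch under stated assumptions: the function name `listing_responds` and the 10-second default are hypothetical, and a process stuck in uninterruptible disk I/O may not die when killed, so the check itself can stall on a badly wedged filesystem.

```python
import subprocess


def listing_responds(path, timeout=10.0):
    """Return True if `ls path` finishes successfully within timeout seconds.

    False means the listing either failed or did not complete in time,
    which on a healthy local disk would be unusual.
    """
    try:
        result = subprocess.run(
            ["ls", path],
            stdout=subprocess.DEVNULL,  # only the exit status matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Running it against a directory the build touches, while a build is in progress, would be closest to the situation described above.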
Comment 5•16 years ago (Reporter)
I believe the disk had a RAID label attached to it in Disk Utility when I ran it this morning (iirc, there's a bizarre setup where there appear to be three disks but we only have one volume); I don't look at Disk Utility there enough to remember well, though.
I definitely was on the box over VNC as tinderbox started, but it looks like I didn't stay on tailing the log over ssh after that.
Comment 6•16 years ago
Server was unresponsive when I hooked up a local KVM. I power cycled it, and watched it spin on the boot screen for a few minutes.
Hooked up to the Raritan (kvm01) on channel 16.
Comment 7•16 years ago (Reporter)
Please let me know if there's anything you need me to do/from me to aid in the investigation :)
Comment 8•16 years ago
I suspect, based on the age of the box, that it's out of warranty. If comment #4 is right, then it needs to be re-imaged.
Comment 9•16 years ago
Where are we on getting the box re-imaged then?
Comment 10•16 years ago
I think mentally I was waiting for you guys to say "go ahead" since it'll obviously wipe all data. I'm not sure what other recovery techniques we can use - Phong is back in the office today and might have some insights.
Assignee: server-ops → phong
Comment 11•16 years ago
Ah, sorry. "Go ahead."
We can set up the tinderbox again. I don't think there's anything we absolutely need on it. Smokey?
Comment 12•16 years ago (Assignee)
This server is pretty much dead. We'll have to re-image it. It is stuck at the Apple screen with the spinning wheel.
Comment 13•16 years ago (Assignee)
I am going to use an image from cb-xserve03.
Comment 14•16 years ago (Assignee)
Re-imaged; it should be online again.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations