Closed Bug 438811 Opened 16 years ago Closed 16 years ago

bm-xserve11 wont boot

Categories

(Release Engineering :: General, defect, P1)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: coop)

Details

(Keywords: fixed1.9.0.1, verified1.9.0.1)

bm-xserve11 does debug builds for Firefox 3.0 on Mac, so it's on the Firefox tinderbox tree. It's normally solid as a rock, but the build starting at 2008/06/11 23:25 PDT never completed (it normally takes 10 minutes). There were no checkins, so this is looking like a random glitch or hardware problem.
I can't get a login prompt over ssh even though the initial connection is made, and the box doesn't respond to VNC either.

Over to Server Ops for someone to look at the console. It's a Tier 1 box but the tree is closed at the moment, so no need to set blocker severity.
Assignee: nobody → server-ops
Severity: normal → critical
Component: Release Engineering: Maintenance → Server Operations: Tinderbox Maintenance
QA Contact: release → justin
Assignee: server-ops → mrz
Flags: colo-trip+
Console was hung (all keyboard LEDs lit up but nothing on the monitor).  Power cycled.  Box is up at login prompt.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Looks like it was doing a checkout when it went boom. Log saved at
 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/Darwin_8.8.4_Depend.log.20080611-hang

Tinderbox restarted.
Gah, it did it again. Could you do a disk check, please?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Called for remote hands to power cycle.
Power cycled, but it's down again?
I can keep power cycling but it keeps going dark, usually after you start doing something.  Something to fix on your end?
Assignee: mrz → nobody
Status: REOPENED → NEW
Component: Server Operations: Tinderbox Maintenance → Release Engineering
Flags: colo-trip+
QA Contact: justin → release
Maybe. Tinderbox says a build started at 2008/06/12 09:49, so that's comments #3 and #4. As far as I can tell no one touched it after that, but CC'ing people who might have.

If it was touched then we should definitely clobber the build and have a general poke around. If not, then I think we're requesting some hardware diagnostics and a disk verify. 
I haven't touched it
Ok, I'm on the hook from the build side. Please reboot this again when you can get hold of me on IRC, then I can look at it immediately.
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
Priority: -- → P2
QA Contact: release → justin
Remote hands called.
Assignee: server-ops → mrz
Box rebooted.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Host won't boot off local disks.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Nick - OS reload?  
It's not too bad if we have to put the 10.4 Intel image on there and set up tinderbox again, but I'd really like to be confident in the hardware first.

Can we boot off a diagnostic CD and try to figure out whether it's a problem with the RAID setup, a disk failure, or some other hardware fault? If one disk has died then we could replace that disk and rebuild the RAID. If the RAID is corrupted, could we look at the SMART status on the disks in case there's a failure right around the corner?
Over to Phong to check.
Assignee: mrz → phong.tran
Status: REOPENED → NEW
Flags: colo-trip+
Status: NEW → ASSIGNED
Verifying volume "untitled RAID Set 1"
Checking HFS Plus Volume
Checking Extents Overflows file.
Checking Catalog file.
Invalid extent entry
Incorrect block count for file Compose.strings
(It should be 0 instead of 49247)
Invalid extent entry
Checking multi-linked files.
Checking catalog hierarchy
Checking Extent Attributes file.
Checking volume bitmap.
Volume Bit Map needs minor repair.
Checking volume information.
Invalid volume free block count.
(It should be 13608976 instead of 13608963).
The volume Server RAID needs to be repaired.

Error: The underlying task reported failure on exit

1 HFS volume checked
	Volume needs repair
Volume was successfully repaired, but it still won't boot into the OS.
What else can we try here to get this machine back online?

Now that the disk has been repaired, maybe we could try imaging from another similar xserve, instead of imaging from a clean OS install?
Summary: bm-xserve11 is hung → bm-xserve11 wont boot
Do you have a server I can take down to create an image from?
Please use Build's "gold image" for Intel 10.4 rather than a running machine.
If needed, we can return qm-xserve06 for re-imaging. It's redundant on the unittest farm.
I'll try the 10.4 image first.
bm-xserve11 has been imaged with the gold 10.4 image from bm-xserve09.
Thanks Phong, looks good. Taking back to RelEng for tinderbox setup.

I've set the hostname to bm-xserve11.build.m.o. Coop, anything special for a debug tinderbox setup? Do you have time to get this going again?
Assignee: phong → nobody
Status: ASSIGNED → NEW
Component: Server Operations → Release Engineering
Flags: colo-trip+
QA Contact: justin → release
This is a tier 1 machine on the 1.9.0 branch... why isn't the tree closed?
(In reply to comment #26)
> This is a tier 1 machine on the 1.9.0 branch... why isn't the tree closed?

Tree is now closed.
This is blocking our work on Gecko 1.9.0.1, and should be considered a P1 blocker. To whom should it be assigned?
Flags: blocking1.9.0.1+
Priority: P2 → P1
Assignee: nobody → ccooper
Status: NEW → ASSIGNED
A build is in progress now, but we're running in one-off mode until I'm sure the config is right. Don't let me forget to get a multi-config.pl setup!
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
(In reply to comment #29)
> Don't let me forget to get a multi-config.pl setup!

Coop reminded me that it's done, and that it was done when I asked last time too. Adding this note so I don't stumble across comment #29 and worry anymore. :-)
Product: mozilla.org → Release Engineering