Closed Bug 444949 Opened 16 years ago Closed 16 years ago

bm-xserve07 has fts_read problem

Categories

(Release Engineering :: General, defect, P2)

PowerPC
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: nthomas)

Details

I was fixing Bug 443397, for which I updated tinderbox's code and tried to start it again.
For several cycles, it seems that two instances of tinderbox were running at the same time.
I stopped all instances and decided to remove the obj dir and source dir, but there is one folder that cannot be deleted. There might be some type of corruption.

Command run: rm -rf /builds/tinderbox/Tb-Trunk/Darwin_8.8.4_Depend/build/universal/i386/dist/include/dom

It returns this:
rm(315) malloc: *** vm_allocate(size=1069056) failed (error code=3)
rm(315) malloc: *** error: can't allocate region
rm(315) malloc: *** set a breakpoint in szone_error to debug
rm: fts_read: Cannot allocate memory
Assignee: server-ops → thardcastle
I wonder if there's a way we could force a disk check?

Can we get a rough idea of the work and time it may take to fix this? It was generating our nightlies and we were getting ready to cut 3.0a2.
Looks like Trevor rebooted this box shortly after he took the bug. The directory in comment #0 is still broken - a call to ls or rm uses 100% CPU and keeps on grabbing memory until it OOMs and the malloc error is reported.
Assignee: thardcastle → nobody
Component: Server Operations → Release Engineering
Priority: -- → P2
QA Contact: justin → release
Trevor didn't mind me grabbing this ;-)
Assignee: nobody → nthomas
Disk Utility's Verify Disk says "Keys out of order. The volume Macintosh RAID needs to be repaired". The Repair Disk button is disabled though. It also says the RAID status is OK, so go figure.

Anyway, I can't do much with it while it's booted off the disk that needs repairing, so back to Server Ops for a colo-trip and a boot off CD.
Assignee: nthomas → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → justin
Flags: colo-trip+
Severity: critical → major
Assignee: server-ops → phong
Sounds like the RAID array itself is fine, but the underlying filesystem is corrupted. Not something RAID will protect you against.

Have you tried booting in single-user mode (holding Apple-S down at boot time)?
That will bring you to a running system with / mounted read-only, so you should be able to run a manual fsck, i.e.:

$> fsck -fy
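
For anyone following along: the single-user check above is often repeated until fsck stops reporting problems. A minimal shell sketch of that loop follows. The names FSCK, MAX_PASSES and repair_root are hypothetical, and treating exit status 0 as "clean" is an assumption, not something confirmed in this bug. The function is only defined here, not run.

```shell
# Sketch only: re-run fsck a few times until it exits cleanly.
# FSCK and MAX_PASSES are hypothetical knobs introduced for illustration.
FSCK="${FSCK:-/sbin/fsck}"
MAX_PASSES="${MAX_PASSES:-3}"

repair_root() {
    pass=1
    while [ "$pass" -le "$MAX_PASSES" ]; do
        echo "fsck pass $pass"
        # -f forces a check, -y answers yes to every repair prompt
        if "$FSCK" -fy; then
            echo "volume reports clean"
            return 0
        fi
        pass=$((pass + 1))
    done
    echo "volume still damaged after $MAX_PASSES passes" >&2
    return 1
}
```

If the loop gives up, the volume probably needs a repair from other boot media or a re-image.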
Is this server frozen right now? It is not responding with the KVM attached.
Status: NEW → ASSIGNED
I have been able to ssh and VNC to it.
Increasing severity; this is keeping the tree closed. Is there any way I can help? If I had access to the box, I could possibly drive this myself.
Severity: major → critical
** /dev/redisk1
** root file system
** checking HFS Plus volume
** checking extents overflow file
** checking Catalog file
   Keys out of order
(4, 22709)
** Rebuilding Catalog B-tree
** The volume Macintosh RAID could not be repaired
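
A rough way to classify a run like the one above from a script is to look at fsck's closing summary line. This sketch assumes the usual fsck_hfs summary wording ("appears to be OK", "was repaired successfully", "could not be repaired"); fsck_verdict is a hypothetical helper name, not an existing tool.

```shell
# Sketch only: classify fsck_hfs output read from stdin.
# fsck_verdict is a hypothetical helper; the matched phrases are assumed
# to be the standard fsck_hfs summary lines.
fsck_verdict() {
    log=$(cat)
    case "$log" in
        *"could not be repaired"*)      echo "unrepairable" ;;
        *"was repaired successfully"*)  echo "repaired" ;;
        *"appears to be OK"*)           echo "ok" ;;
        *)                              echo "unknown" ;;
    esac
}
```

Piping the output above into fsck_verdict prints "unrepairable", which is the case where booting other repair media or re-imaging is the next step.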
Sounds bad. One possibility is to try DiskWarrior (http://www.alsoft.com/DiskWarrior); it has a bootable CD and can often repair more errors than OS X itself can. Worth a shot, IMO.
I've booted off the CD and am trying to run the repair that way.
It still won't repair from the install DVD.
We probably have to restore from a cloning image then, if the actual drives are OK. Please check that and then use the Intel/10.4 image.

It's fairly quick to set up tinderbox again.
Is there anything I need to save before I wipe it out and reimage this server?
Looking now ...
I've moved everything we care about to stage.m.o:/tmp/bm-xserve07/, so please go ahead with restoring the clone image.
I have images from bm-xserve02 and 03. Which one of those would you like?
I also have images from bm-xserve10 and 16.
Could we have the one from bm-xserve10, thanks? (x-refbug 410271 comment #18)
Re-imaged with 10.4.8. I also verified the disk and saw no errors.
Great. You could pass it back to RelEng for tinderbox setup, or we can file a dependent bug.
Back to RelEng to finish this off.
Assignee: phong → nobody
Status: ASSIGNED → NEW
Component: Server Operations → Release Engineering
Flags: colo-trip+
QA Contact: justin → release
Assignee: nobody → nthomas
Rebuilt tinderbox dirs for Thunderbird and XULRunner, and restarted tinderbox. Will close this once both go green.
Builds went green, resolving FIXED. Thanks to IT for doing most of the work here.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
I'd forgotten about the chown stuff [1] so the XULRunner build failed. Fixed that up and forced a clobber, which was green.

[1] http://wiki.mozilla.org/ReferencePlatforms/Mac#chown_scripts
Product: mozilla.org → Release Engineering