Closed Bug 652966 Opened 13 years ago Closed 13 years ago

decommission bm-xserve21 (remove from releng configs & slavealloc)

Categories

(Release Engineering :: General, defect, P2)

x86_64
macOS

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 688548

People

(Reporter: bear, Assigned: armenzg)

Details

(Whiteboard: [badslave?][hardware] DNR)

In #developers, the sheriff pinged me to look at

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1303846981.1303851389.5655.gz&fulltext=1

which has this output:

Processing file: ./dist/bin/WriteArgument.dSYM
Processing file: ./dist/bin/xpcshell.dSYM
Processing file: ./dist/bin/xpidl.dSYM
error: Invalid argument - unable to create './dist/bin/XUL.dSYM' bundle directory.
error: Invalid argument - unable to create './dist/bin/components/libalerts_s.dylib.dSYM' bundle directory.
error: Invalid argument - unable to create './dist/bin/crashreporter.app/Contents/MacOS/crashreporter.dSYM' bundle directory.

After talking in IRC, where dustin said that slave sometimes has issues, and after failing to ssh to it myself, I marked it as disabled and filed this bug.
QA Contact: zandr → dustin
This was toyed with in bug 644364, and seemed to be running fine, but wasn't.  So it at least needs a reimage, and any diagnostics we have for macs would be great too.
Assignee: server-ops-releng → zandr
colo-trip: --- → sjc1
Assignee: zandr → mlarrain
Dustin and I went onsite yesterday. Here are the notes from our findings:

There are three failed temp sensors with ridiculously hot values (one
was at 184C).  They are the three fans on the fanboard that are aimed at
the DIMMs.  I didn't remove the fanboard, but I did feel around the
location of these sensors and the temperature is certainly not above
boiling, so these are bad sensors.

This wouldn't cause failures, so I went on to run the HD diagnostics.
They weren't finished by the time I wandered off, but had already
detected three errors.  The crash-cart is still hooked up, so there
should be more to see when you're back.

IMHO, assuming the disk failures are real failures, we should report
this to releng and ask for their prescription - DNR or repair (where the
latter will likely be expensive).

I put the machine back in position, but the network is not re-connected.

I will check the rest of the notes later today to verify the HDD issues.
So Matt didn't get to look at the test results here, but assuming that they do show HDD failures as well as temp sensor failures, what's the plan?
Assignee: mlarrain → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: dustin → release
Let's pull this one and retire it. Can be used for parts if we need to repair others. We don't use these for releases anyway.
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
Priority: -- → P3
QA Contact: release → zandr
Whiteboard: [badslave?][hardware]
Summary: bm-xserve21 showing signs of hardware/drive issues → decommission bm-xserve21
Added to the decommissioning spreadsheet, removed from nagios.
Assignee: server-ops-releng → mlarrain
Severity: normal → minor
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE
Will be making sure it doesn't show up on slavealloc.
Assignee: mlarrain → armenzg
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Summary: decommission bm-xserve21 → decommission bm-xserve21 (remove from releng configs & slavealloc)
Whiteboard: [badslave?][hardware] → [badslave?][hardware] DNR
Priority: P3 → P2
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #7)
> Will be making sure it doesn't show up on slavealloc.

I will take care of it in bug 700705.
Product: mozilla.org → Release Engineering