Closed Bug 651558 Opened 13 years ago Closed 13 years ago

Re-enable nagios checks for geriatric Mac slaves

Categories

(Infrastructure & Operations :: RelOps: General, task, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: arich)

I just compiled and installed PPC versions of the nagios plugins on the geriatric Mac slaves in bug 578234, so we should re-enable whatever checks we can for these slaves.

The affected slaves are:
* bm-xserve0[1-5]
* g4-leopard01
Assignee: server-ops-releng → arich
I've added the checks for bm-xserve0[1-5] back in, but the DNS records for g4-leopard01 are inconsistent: the A record and the PTR record point to different names, so the nagios config build scripts will not handle it correctly.

host g4-leopard01.build
g4-leopard01.build.mozilla.org is an alias for g4-leopard1.build.mtv1.mozilla.com.
g4-leopard1.build.mtv1.mozilla.com has address 10.250.48.73
host 10.250.48.73
73.48.250.10.in-addr.arpa domain name pointer g4-leopard01.mv.mozilla.com.

The A and PTR need to match.

I'm not sure what the correct hostname/datacenter designator should be for this host.  Does someone have more information?
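
For reference, the mismatch above can be checked mechanically. What follows is only a minimal sketch, not how the nagios config build scripts actually work: the function names and the dict-based record tables are hypothetical, and the records are copied by hand from the `host` output above rather than queried live. Keys are assumed to be lowercase with no trailing dot.

```python
# Hypothetical record tables, transcribed from the `host` output above.
CNAMES = {"g4-leopard01.build.mozilla.org": "g4-leopard1.build.mtv1.mozilla.com."}
A_RECORDS = {"g4-leopard1.build.mtv1.mozilla.com": "10.250.48.73"}
PTR_RECORDS = {"10.250.48.73": "g4-leopard01.mv.mozilla.com."}


def normalize(name):
    """Lowercase a DNS name and drop the trailing root dot."""
    return name.rstrip(".").lower()


def follow_cnames(name, cnames, limit=8):
    """Follow CNAME records until a name with no CNAME is reached."""
    name = normalize(name)
    for _ in range(limit):
        if name not in cnames:
            return name
        name = normalize(cnames[name])
    raise ValueError("CNAME chain too long")


def forward_reverse_mismatch(name, cnames, a_records, ptr_records):
    """Return None if A and PTR agree, else (canonical, address, ptr)."""
    canonical = follow_cnames(name, cnames)
    address = a_records.get(canonical)
    if address is None:
        # No A record at all: report it as a mismatch with empty fields.
        return (canonical, None, None)
    ptr = ptr_records.get(address)
    ptr = normalize(ptr) if ptr else None
    if ptr == canonical:
        return None
    return (canonical, address, ptr)
```

Calling `forward_reverse_mismatch("g4-leopard01.build.mozilla.org", CNAMES, A_RECORDS, PTR_RECORDS)` on the tables above reports that the A record lives at g4-leopard1.build.mtv1.mozilla.com while the PTR for 10.250.48.73 names g4-leopard01.mv.mozilla.com.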
Status: NEW → ASSIGNED
Many of the checks failed due to lack of NRPE definitions.  I've acked them for now until we can decide what to do with them.
As for g4-leopard01, let's follow the usual slave pattern:

$ORIGIN build.mozilla.org.
g4-leopard01 IN CNAME g4-leopard01.build.mtv1.mozilla.com.

$ORIGIN build.mtv1.mozilla.com.
g4-leopard01 IN A 10.250.48.73

$ORIGIN 48.250.10.in-addr.arpa.
73 IN PTR g4-leopard01.build.mtv1.mozilla.com.


As for the NRPE checks, here's the list for future reference:

bm-xserve01.build:buildbot is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
bm-xserve01.build:disk - / is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.

bm-xserve02.build:hung slave is CRITICAL: NRPE: Command check_file_age not defined

bm-xserve03.build:buildbot is WARNING: PROCS WARNING: 0 processes with command name python, args buildbot.tac

bm-xserve04.build:hung slave is CRITICAL: NRPE: Command check_file_age not defined

bm-xserve05.build:hung slave is CRITICAL: NRPE: Command check_file_age not defined
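
To make the list above easier to triage, the alerts can be grouped by error message so that each distinct failure only needs to be looked at once. This is a rough sketch under stated assumptions: the `host:service is STATE: message` layout is inferred from the lines above rather than from any nagios specification, and `ALERT_RE` and `group_alerts` are hypothetical names.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the alert lines above:
#   <host>:<service> is <STATE>: <message>
ALERT_RE = re.compile(
    r"^(?P<host>[^:\s]+):(?P<service>.+?) is "
    r"(?P<state>OK|WARNING|CRITICAL|UNKNOWN): (?P<message>.*)$"
)


def group_alerts(text):
    """Map each distinct alert message to the (host, service) pairs showing it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue  # skip blank lines and anything that does not parse
        groups[match["message"]].append((match["host"], match["service"]))
    return dict(groups)
```

Fed the six alert lines above, this yields three groups: the SSL handshake failures on bm-xserve01, the missing check_file_age command on bm-xserve02, 04 and 05, and the PROCS WARNING on bm-xserve03.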


It looks like these systems aren't using runslave.py, either (hence the PROCS WARNING). We should probably fix that in bug 652125. These boxes also don't run puppet, but that's probably not worth fixing.
Coop, I've fixed all of these up except for g4-leopard01. It doesn't appear to be running nrpe, and I don't know the root or cltbld password, so I can't get in to take a look. Do you have a way I can log in?
This is all set.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations