Closed Bug 857209 Opened 12 years ago Closed 12 years ago

most (all?) of the releng scl1 machines are alerting in nagios

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: afernandez)

Details

Not 100% sure if it's a nagios issue or a network issue, I can still route to the hosts that I've tried. We're getting alerts like:
14:22 < nagios-releng> Tue 11:19:20 PDT [499] panda-0337.p3.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
14:22 < nagios-releng> Tue 11:19:20 PDT [400] panda-0311.p3.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
14:22 < nagios-releng> Tue 11:19:20 PDT [401] panda-0335.p3.releng.scl1.mozilla
14:19 < nagios-releng> Tue 11:18:47 PDT [420] master-puppet1.build.scl1.mozilla.com:Ganglia IO is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/Ganglia+IO)
14:19 < nagios-releng> Tue 11:18:47 PDT [421] mac-signing3.build.scl1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
14:20 < nagios-releng> Tue 11:18:47 PDT [422] mobile-imaging-009.p9.releng.scl1.mozilla.com:ntp time is
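To get a rough picture of how widespread alerts like the ones above are, the bot lines could be tallied by network segment and state. The following is only a sketch: the regular expression is inferred from the handful of lines quoted in this bug rather than from any documented nagios-releng format, and the names `ALERT_RE` and `summarize_alerts` are hypothetical. Truncated lines that never reach an "is STATE" part are simply skipped.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the alerts quoted in this bug:
#   "... [499] host.example.com is DOWN :(...)"
#   "... [420] host.example.com:Service name is CRITICAL: ..."
ALERT_RE = re.compile(
    r"\[(?P<id>\d+)\]\s+"
    r"(?P<host>[\w.-]+?)"
    r"(?::(?P<service>.+?))?"
    r"\s+is\s+(?P<state>DOWN|UP|UNREACHABLE|CRITICAL|WARNING|OK|UNKNOWN)\b"
)


def summarize_alerts(lines):
    """Count alerts per (domain suffix, state); skip lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = ALERT_RE.search(line)
        if match is None:
            continue  # truncated or non-alert line
        # Drop the first label so hosts group by segment,
        # e.g. "panda-0337.p3.releng.scl1.mozilla.com" -> "p3.releng.scl1.mozilla.com".
        suffix = match.group("host").split(".", 1)[-1]
        counts[(suffix, match.group("state"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "14:22 < nagios-releng> Tue 11:19:20 PDT [499] "
        "panda-0337.p3.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)",
        "14:19 < nagios-releng> Tue 11:18:47 PDT [421] "
        "mac-signing3.build.scl1.mozilla.com:ntp time is CRITICAL: "
        "CHECK_NRPE: Socket timeout after 15 seconds.",
    ]
    for (suffix, state), n in sorted(summarize_alerts(sample).items()):
        print(f"{suffix}\t{state}\t{n}")
```

If nearly every segment shows up with host-check or NRPE socket timeouts at the same moment, that points more toward the monitoring path than toward the individual hosts, which would fit with the hosts still being routable.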
This was related to admin1.mtv1.mozilla.com (VPN) not being up network-wise. It is up now; please check again.
Assignee: server-ops → afernandez
This was actually puppet killing keepalived on admin hosts (including those in scl1).
I'm pretty sure things have recovered now.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
There was further fallout (all the w7 machines fell off the net) and dhcp had to be restarted.
Product: mozilla.org → mozilla.org Graveyard