Bug 857209
Opened 12 years ago
Closed 12 years ago
most (all?) of the releng scl1 machines are alerting in nagios
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: afernandez)
Details
Not 100% sure whether it's a nagios issue or a network issue; I can still route to the hosts that I've tried. We're getting alerts like:
14:22 < nagios-releng> Tue 11:19:20 PDT [499] panda-0337.p3.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
14:22 < nagios-releng> Tue 11:19:20 PDT [400] panda-0311.p3.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
14:22 < nagios-releng> Tue 11:19:20 PDT [401] panda-0335.p3.releng.scl1.mozilla
14:19 < nagios-releng> Tue 11:18:47 PDT [420] master-puppet1.build.scl1.mozilla.com:Ganglia IO is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/Ganglia+IO)
14:19 < nagios-releng> Tue 11:18:47 PDT [421] mac-signing3.build.scl1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
14:20 < nagios-releng> Tue 11:18:47 PDT [422] mobile-imaging-009.p9.releng.scl1.mozilla.com:ntp time is
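To judge quickly whether alerts like these are confined to one site or one check type, the IRC lines can be tallied by datacenter and service. The following is only a rough sketch: the line shape is inferred from the excerpt above, and the helper names (`parse_alert`, `site_of`, `summarize`) and the assumption that the site is the label just before `mozilla.com` are illustrative, not part of nagios or any existing tooling.

```python
import re
from collections import Counter

# Assumed line shape, inferred only from the pasted excerpt:
#   HH:MM < nagios-releng> Day HH:MM:SS TZ [id] host[:service] is STATE ...
ALERT_RE = re.compile(
    r"\[(?P<id>\d+)\]\s+"
    r"(?P<host>[A-Za-z0-9.-]+)"
    r"(?::(?P<service>.+?))?"
    r"\s+is\s+(?P<state>DOWN|CRITICAL|WARNING|UNKNOWN|UP|OK)\b"
)


def parse_alert(line):
    """Return a dict with id, host, service and state, or None if the
    line is truncated or does not look like an alert."""
    match = ALERT_RE.search(line)
    return match.groupdict() if match else None


def site_of(host):
    """Hypothetical helper: treat the label before 'mozilla.com' as the site."""
    labels = host.split(".")
    if len(labels) >= 3 and labels[-2:] == ["mozilla", "com"]:
        return labels[-3]
    return None


def summarize(lines):
    """Count parseable alerts per (site, service); host checks have no service."""
    counts = Counter()
    for line in lines:
        alert = parse_alert(line)
        if alert is None:
            continue  # skip truncated or unrelated lines
        counts[(site_of(alert["host"]), alert["service"] or "host check")] += 1
    return counts


SAMPLE = [
    "14:22 < nagios-releng> Tue 11:19:20 PDT [499] panda-0337.p3.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)",
    "14:22 < nagios-releng> Tue 11:19:20 PDT [400] panda-0311.p3.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)",
    "14:22 < nagios-releng> Tue 11:19:20 PDT [401] panda-0335.p3.releng.scl1.mozilla",
    "14:19 < nagios-releng> Tue 11:18:47 PDT [420] master-puppet1.build.scl1.mozilla.com:Ganglia IO is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.",
    "14:19 < nagios-releng> Tue 11:18:47 PDT [421] mac-signing3.build.scl1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds.",
]

if __name__ == "__main__":
    for (site, service), count in sorted(summarize(SAMPLE).items()):
        print(site, service, count)
```

If every parsed alert lands in a single site and spans unrelated checks (host checks, NRPE timeouts), that points at the monitoring path to that site and not at the individual machines.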
Comment 1•12 years ago (Assignee)
This was related to admin1.mtv1.mozilla.com (VPN) not being up network-wise.
It is up now; please check again.
Assignee: server-ops → afernandez
Comment 2•12 years ago
This was actually puppet killing keepalived on admin hosts (including those in scl1).
Comment 3•12 years ago (Reporter)
I'm pretty sure things have recovered now.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 4•12 years ago
There was further fallout (all the w7 machines fell off the net) and dhcp had to be restarted.
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard