Closed Bug 768746 Opened 13 years ago Closed 13 years ago

Add nagios checks for all new windows infrastructure machines

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mlarrain, Assigned: mlarrain)

Details

I have installed the nagios client on dc7.releng.ad.mozilla.com [10.12.69.19] and need assistance setting up the server-side checks for this machine. They can mirror the checks that dc01.winbuild.scl1.mozilla.com [10.12.40.13] has set up.
Assignee: server-ops-releng → mlarrain
The client and server side checks need to be added for:
dc1.ad.mozilla.com
dc2.ad.mozilla.com
dc6.releng.ad.mozilla.com
dc7.releng.ad.mozilla.com
wds1.releng.ad.mozilla.com
(any others I'm missing?)
Summary: Setup server side checks for dc7.releng.ad → Add nagios checks for all new windows infrastructure machines
Assignee: mlarrain → dustin
storage1.releng.ad.mozilla.com
OK, these are in (not storage1 - turns out we'll be killing it). However, DNS isn't working yet, so nagios isn't monitoring most of them.
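A quick way to see which of the hosts named in this bug resolve yet is a small script along these lines. It is only a sketch: the host list is copied from the comments above, `unresolved_hosts` is a hypothetical helper name, and it relies on the Python standard library's `socket.gethostbyname` rather than on anything nagios itself runs.

```python
import socket

# Hosts named in this bug (storage1 omitted, since it is being retired).
HOSTS = [
    "dc1.ad.mozilla.com",
    "dc2.ad.mozilla.com",
    "dc6.releng.ad.mozilla.com",
    "dc7.releng.ad.mozilla.com",
    "wds1.releng.ad.mozilla.com",
]


def unresolved_hosts(hosts, resolve=socket.gethostbyname):
    """Return the hosts for which the resolver raises an OSError."""
    missing = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(host)
    return missing


if __name__ == "__main__":
    for host in unresolved_hosts(HOSTS):
        print(f"no DNS record: {host}")
```

The resolver is passed in as a parameter so the function can be exercised without touching real DNS.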
storage1 has been killed off. There are also kms1.ad.mozilla.com and kms2.ad.mozilla.com, which will be configured as the KMS and WSUS servers.
More accurately, dc6, dc7, and wds1 are added to admin1 for monitoring in releng. I just added dc1, dc2, and kms1 to nagios via puppet, all in the appropriate DCs. So far:

18:19 < nagios-releng-scl1> [18] dc7.releng.ad:disk - C is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
18:22 < nagios-releng-scl1> [19] wds1.releng.ad:disk - C is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.

and dc6 was already red in the web UI, so I downtimed it too. I'll leave those fixes up to Matt.
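For anyone triaging alerts like the two quoted above, the lines can be split into host, service, and state with a short parser. This is a sketch under assumptions: the line layout is inferred only from the two alerts pasted in this comment, not from any documented bot format, and `parse_alert` and `ALERT_RE` are hypothetical names.

```python
import re

# Layout inferred from the two alerts quoted in this bug:
#   HH:MM < bot> [id] host:service is STATE: message
ALERT_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}) < (?P<bot>[\w-]+)> \[(?P<id>\d+)\] "
    r"(?P<host>[^:]+):(?P<service>.+?) is "
    r"(?P<state>OK|WARNING|CRITICAL|UNKNOWN): (?P<message>.*)$"
)


def parse_alert(line):
    """Return a dict of alert fields, or None if the line does not match."""
    match = ALERT_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "18:19 < nagios-releng-scl1> [18] dc7.releng.ad:disk - C is CRITICAL: "
        "CHECK_NRPE: Error - Could not complete SSL handshake."
    )
    alert = parse_alert(sample)
    print(alert["host"], alert["service"], alert["state"])
```

Grouping the parsed alerts by message would make it easy to see that every failure here shares the same NRPE SSL handshake error.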
Assignee: dustin → mlarrain
kms1, dc1, and dc2 are failing their NRPE checks too.
Also, those paged infra oncall. Can this be fixed? Let them page Matt only or just alert in #somechannel? Thanks!
They are supposed to page oncall, although these particular alerts were bogus. Matt *also* gets paged. As this windows forest comes online, Matt will be working with oncall to make sure y'all can solve the problems that come up. These all got fixed on Friday, or they'd have been paging all weekend.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations