Closed Bug 488240 Opened 15 years ago Closed 15 years ago

Mega nagios config changes

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: aravind)

Apologies for the long list here, trying to get everything covered and consistent, plus fix up a couple of broken tests.

1) bm-xserve12.build
 a) service: RAID, always returns
      CHECK_NRPE: Socket timeout after 15 seconds. 
    The other checks (PING, disk use etc) work fine. In 
      /usr/local/nagios/etc/nrpe.cfg
    we have 
      command[check_appleraid]=/usr/local/nagios/plugins/check_appleraid
    and that utility returns
      Untitled RAID Set 1 Online: * disk0s2 Online SMART Verified * disk1s2 Online SMART Verified *
    We don't do this check on bm-xserve16 thru 19 or 22, so we can either get it working and roll it out there, or disable it here.
 b) Add a check_buildbot check, same as defined on bm-xserve16

2) bm-xserve22.build - add PING and buildbot checks as defined on bm-xserve16

3) moz2-darwin9-slave09/10 - set up same checks as moz2-darwin9-slave08; some of these may fail as these machines are not set up

4) try-mac-slave03/04 - set up same checks as try-mac-slave02

5) try-mac-slave05/06 - set up same checks as try-mac-slave02. We'll move a couple of machines from bug 480203 to be these two machines (by separate request), so checks will fail at first

6) moz2-linux-slave06.build - Service buildbot
   Flaps between "CHECK_NRPE: Error - Could not complete SSL handshake" and OK every few minutes. Other checks on this host are OK. Any ideas?

7) moz2-linux-slave11.build needs "disk - /builds" check cloned from another moz2-linux-slaveNN config

8) moz2-linux-slave20 thru 25 - set up same checks as moz2-linux-slave19 (new machines)

9) try-linux-slave04 - set up same checks as try-linux-slave03

10) moz2-win32-slave01 - remove processes and avg_load checks

11) moz2-win32-slave24 thru 29 - set up same checks as moz2-win32-slave23 (new machines)

12) fx-win32-1.9-slave10.build and fx-win32-1.9-slave11.build - remove all checks (no longer in use)

13) try-win32-slave01 thru 03, remove avg_load check

14) try-win32-slave04 thru 09, copy checks from try-win32-slave03 after 13)
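
For 1a), one way to narrow down the socket timeout might be to time the plugin directly on bm-xserve12 (ideally as the user NRPE runs as) and compare against the 15 second limit. A rough Python sketch, not a tested fix: `time_plugin`, `describe` and `NRPE_TIMEOUT_SECONDS` are names made up for illustration, and only the plugin path comes from the nrpe.cfg line above.

```python
import subprocess
import time

# The limit reported by "CHECK_NRPE: Socket timeout after 15 seconds".
NRPE_TIMEOUT_SECONDS = 15

# Standard Nagios plugin exit codes.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def time_plugin(command, limit=NRPE_TIMEOUT_SECONDS):
    """Run a plugin command and measure how long it takes.

    Returns (elapsed_seconds, exit_code, output). exit_code is None when
    the command did not finish within `limit` seconds.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=limit,
        )
    except subprocess.TimeoutExpired:
        return time.monotonic() - start, None, ""
    return time.monotonic() - start, result.returncode, result.stdout.strip()


def describe(elapsed, exit_code):
    """Summarise a timing result in one line."""
    if exit_code is None:
        return f"TIMEOUT after {elapsed:.1f}s"
    status = STATUS_NAMES.get(exit_code, f"exit code {exit_code}")
    return f"{status} in {elapsed:.1f}s"
```

Running `time_plugin(["/usr/local/nagios/plugins/check_appleraid"])` on the host would show whether the plugin itself is slow. If it comes back quickly, the cause is more likely elsewhere, for example in how NRPE invokes it.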


If it's easier/useful to define a standard set of checks for a platform and then assign hosts to that set, that'd probably work fine. We want everything to be the same for identical machines like moz2-win32-slaveN, but some of the older machines would still need individual configuration.
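
To illustrate the "standard set of checks per platform" idea, here is a minimal Python sketch that expands a host range and emits one Nagios hostgroup plus one service definition per check. It is not the actual config: the group name, the `generic-service` template and the `check_command` values are assumptions, and the helper names are hypothetical.

```python
def expand_hosts(prefix, first, last, width=2):
    """Expand e.g. ("moz2-linux-slave", 20, 25) into six host names."""
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


def render_hostgroup(name, hosts):
    """Render a Nagios hostgroup containing the given hosts."""
    return (
        "define hostgroup {\n"
        f"    hostgroup_name  {name}\n"
        f"    members         {','.join(hosts)}\n"
        "}\n"
    )


def render_services(group, checks):
    """Render one service per (description, command) pair for a hostgroup.

    Attaching services to a hostgroup means every member gets the same
    checks, which is the point of a standard set.
    """
    blocks = []
    for description, command in checks:
        blocks.append(
            "define service {\n"
            "    use                  generic-service\n"  # assumed template
            f"    hostgroup_name       {group}\n"
            f"    service_description  {description}\n"
            f"    check_command        {command}\n"
            "}\n"
        )
    return "\n".join(blocks)


# Hypothetical check set; the real one would be copied from an existing
# host such as moz2-linux-slave19.
LINUX_CHECKS = [
    ("PING", "check_ping!100.0,20%!500.0,60%"),
    ("buildbot", "check_nrpe!check_buildbot"),
]

if __name__ == "__main__":
    hosts = expand_hosts("moz2-linux-slave", 20, 25)
    print(render_hostgroup("moz2-linux-slaves", hosts))
    print(render_services("moz2-linux-slaves", LINUX_CHECKS))
```

Older machines that need individual configuration would simply stay out of the group and keep their own service definitions.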
Update:

9) try-linux-slave04 thru 09 - set up same checks as try-linux-slave03 (four of these are still to be cloned in bug 485885)
While fixing those nagios settings, can you also make sure to add the following machines - they are in inventory but not in nagios.

(ref platforms, ESX hosts, and Nokias have been removed from this list):
balsa-18branch
bm-symbolfetch01
bm-xserve03
bm-xserve04
bm-xserve07
bm-xserve15
bm-xserve22
crazyhorse
egg
fx-linux-1.9-slave03
fx-linux-1.9-slave04
fx-win32-1.9-slave03
fx-win32-1.9-slave04
karma
moz2-darwin9-slave02
moz2-darwin9-slave03
moz2-darwin9-slave04
moz2-darwin9-slave05
moz2-darwin9-slave06
moz2-darwin9-slave07
moz2-darwin9-slave08
moz2-darwin9-slave09
moz2-darwin9-slave10
moz2-darwin9-slave11
moz2-darwin9-slave12
moz2-linux-slave17
moz2-linux-slave18
moz2-linux-slave19
moz2-linux-slave20
moz2-linux-slave21
moz2-linux-slave22
moz2-linux-slave23
moz2-linux-slave24
moz2-linux-slave25
moz2-linux-workstation
moz2-linuxnonsse-slave01
moz2-win32-slave19
moz2-win32-slave20
moz2-win32-slave21
moz2-win32-slave22
moz2-win32-slave23
moz2-win32-slave24
moz2-win32-slave25
moz2-win32-slave26
moz2-win32-slave27
moz2-win32-slave28
moz2-win32-slave29
moz2-win32nonsse-slave01
production-1.8-master
production-crazyhorse
production-pacifica-vm02
production-patrocles
production-prometheus-vm02
prometheus.build
qm-buildbot01
qm-mini-centos01
qm-mini-centos02
qm-pleopard-slave01
qm-pleopard-slave02
qm-pleopard-try01
qm-pleopard-try02
qm-plinux-slave01
qm-plinux-slave02
qm-plinux-stage01
qm-plinux-trunk02
qm-pmac-slave01
qm-pmac-slave02
qm-ptiger-try02
qm-pubuntu-try02
qm-pvista-slave01
qm-pvista-slave02
qm-pvista-slave03
qm-pvista-slave04
qm-pvista-try01
qm-pvista-try02
qm-pxp-slave01
qm-pxp-slave02
qm-pxp-slave03
qm-pxp-slave04
qm-pxp-try02
staging-1.9-master
staging-crazyhorse
staging-opsi
staging-pacifica-vm
staging-pacifica-vm02
staging-patrocles
staging-prometheus-vm
staging-prometheus-vm02
staging-stage
staging-try-master
tb-linux-tbox
tbnewref-win32-tbox
try-linux-slave04
try-linux-slave05
try-linux-slave06
try-linux-slave07
try-linux-slave08
try-linux-slave09
try-mac-slave03
try-mac-slave04
try-mac-slave05
try-master
try-pmac-unit-01
try-win32-slave04
try-win32-slave05
try-win32-slave06
try-win32-slave07
try-win32-slave08
try-win32-slave09


The following machines were not monitored by nagios, but that's OK because they are either a) not production RelEng, or b) obsolete/powered off/etc.:
bm-centos5-unittest-01
bm-l10n-centos5-01
bm-l10n-pmac-01
bm-l10n-win2k3-01
bm-stage-osx-01
gaius.build
mozillabuild-builder
papaya
pineapple
qm-image-master
qm-ref-leopard
qm-ref-tiger
qm-ref-ubuntu
qm-ref-vista
qm-ref-xp
qm-leak-tiger-01
qm-leak-win2k3-01
qm-purify01
qm-rhel03
qm-vmware01
qm-win2k3-stage-pgo01
qm-xserve03
qm-xserve04
qm-xserve05
solaria
test-linslave
test-mgmt
test-opsi
test-winslave
test-winslave2
unknown-machine
unused-1463
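
The comparison behind the two lists above (in inventory, not in nagios, minus machines that are fine to leave unmonitored) could be scripted roughly like this. It's only a sketch: `missing_from_nagios` is a hypothetical helper, the inputs are assumed to be plain lists of host names exported by hand, and stripping a trailing `.build` is an assumption based on hosts showing up both with and without that suffix in this bug.

```python
def normalize(name):
    """Lower-case a host name and drop a trailing ".build" suffix."""
    name = name.strip().lower()
    if name.endswith(".build"):
        name = name[: -len(".build")]
    return name


def missing_from_nagios(inventory, monitored, ignored=()):
    """Return inventory hosts that are neither monitored nor ignored.

    inventory: host names from the inventory
    monitored: host names already configured in nagios
    ignored:   hosts that are deliberately unmonitored (obsolete, off, ...)
    """
    known = {normalize(h) for h in monitored} | {normalize(h) for h in ignored}
    wanted = {normalize(h) for h in inventory}
    wanted.discard("")  # skip blank lines in a hand-made export
    return sorted(wanted - known)
```

For example, `missing_from_nagios(["egg", "papaya", "try-master"], ["egg"], ignored=["papaya"])` would leave only `try-master` to be added.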
Assignee: server-ops → aravind
(In reply to comment #2)
> (ref platforms, esx hosts, and nokia's have been removed from this list):
> balsa-18branch
> bm-symbolfetch01

I need to go through this list to confirm it. balsa-18branch is a fx2.0 machine that was in nagios until very recently, and doesn't need to be added back. bm-symbolfetch01 is actually off; I need to ask Ted whether he's still going to use it. There's a lot of overlap with comment #0 too.
2) bm-xserve22.build - add PING and buildbot checks as defined on bm-xserve16 - DONE

3) moz2-darwin9-slave09/10 - setup same checks as moz2-darwin9-slave08, some of
these may fail as these machines are not set up - DONE

4) try-mac-slave03/04 - set up same checks as try-mac-slave02 - DONE

5) try-mac-slave05/06 - set up same checks as try-mac-slave02 - DONE
(In reply to comment #4)
> 3) moz2-darwin9-slave09/10 - setup same checks as moz2-darwin9-slave08, some of
> these may fail as these machines are not set up - DONE
> 4) try-mac-slave03/04 - set up same checks as try-mac-slave02 - DONE
> 5) try-mac-slave05/06 - set up same checks as try-mac-slave02 - DONE

We changed our minds here, sorry. So moz2-darwin9-slave09/10 don't exist (also can't see the checks using my nagios login), and we have try-mac-slave07/08/09 that need checks (like try-mac-slave02).
(In reply to comment #5)
> We changed our minds here, sorry. So moz2-darwin9-slave09/10 don't exist (also
> can't see the checks using my nagios login), and we have try-mac-slave07/08/09
> that need checks (like try-mac-slave02).

DONE
6) moz2-linux-slave06.build - Service buildbot
   Flaps between "CHECK_NRPE: Error - Could not complete SSL handshake" and OK
every few minutes. Other checks on this host are OK. Any ideas?

Fixed - was a config problem in the nagios master.
7) moz2-linux-slave11.build needs "disk - /builds" check cloned from another
moz2-linux-slaveNN config - DONE

8) moz2-linux-slave20 thru 25 - setup same checks as moz2-linux-slave19 (new
machines) - DONE
9) try-linux-slave04 - setup same checks as try-linux-slave03 - DONE
10) moz2-win32-slave01 - remove processes and avg_load checks - DONE
11) moz2-win32-slave24 thru 29 - setup same checks as moz2-win32-slave23 (new
machines) - DONE
12) fx-win32-1.9-slave10.build and fx-win32-1.9-slave11.build - remove all
checks (no longer in use) - DONE
13) try-win32-slave01 thru 03, remove avg_load check - DONE
14) try-win32-slave04 thru 09, copy checks from try-win32-slave03 after 13) - DONE
(In reply to comment #1)
> Update:
> 
> 9) try-linux-slave04 thru 09 - setup same checks as try-linux-slave03 (four of
> these are still to be cloned in bug 485885)

That one is done as well.  Please open a different bug when you have the stuff in comment 2 figured out.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Awesome, thanks aravind!
Product: mozilla.org → mozilla.org Graveyard