Closed Bug 488240 Opened 16 years ago Closed 16 years ago

Mega nagios config changes

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: x86, All
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: aravind)

Apologies for the long list here; I'm trying to get everything covered and consistent, plus fix up a couple of broken tests.
1) bm-xserve12.build
   a) service: RAID, always returns "CHECK_NRPE: Socket timeout after 15 seconds". The other checks (PING, disk use etc) work fine. In /usr/local/nagios/etc/nrpe.cfg we have
        command[check_appleraid]=/usr/local/nagios/plugins/check_appleraid
      and that utility returns
        Untitled RAID Set 1 Online: * disk0s2 Online SMART Verified * disk1s2 Online SMART Verified *
      We don't do this check on bm-xserve16 thru 19 or 22, so we can either get it working and roll it out there, or disable it here.
   b) Add a check_buildbot check, same as defined on bm-xserve16
2) bm-xserve22.build - add PING and buildbot checks as defined on bm-xserve16
3) moz2-darwin9-slave09/10 - set up same checks as moz2-darwin9-slave08, some of these may fail as these machines are not set up
4) try-mac-slave03/04 - set up same checks as try-mac-slave02
5) try-mac-slave05/06 - set up same checks as try-mac-slave02, we'll move a couple of machines from bug 480203 to be these two machines (by separate request) so checks will fail at first
6) moz2-linux-slave06.build - Service buildbot flaps between "CHECK_NRPE: Error - Could not complete SSL handshake" and OK every few minutes. Other checks on this host are OK. Any ideas?
7) moz2-linux-slave11.build needs "disk - /builds" check cloned from another moz2-linux-slaveNN config
8) moz2-linux-slave20 thru 25 - set up same checks as moz2-linux-slave19 (new machines)
9) try-linux-slave04 - set up same checks as try-linux-slave03
10) moz2-win32-slave01 - remove processes and avg_load checks
11) moz2-win32-slave24 thru 29 - set up same checks as moz2-win32-slave23 (new machines)
12) fx-win32-1.9-slave10.build and fx-win32-1.9-slave11.build - remove all checks (no longer in use)
13) try-win32-slave01 thru 03, remove avg_load check
14) try-win32-slave04 thru 09, copy checks from try-win32-slave03 after 13)
If it's easier/useful to define a standard set of checks for a platform and then assign hosts to that set, that'd probably work fine. We want everything to be the same for identical machines like moz2-win32-slaveN, but some of the older machines would still need individual configuration.
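The "standard set of checks for a platform" idea could be sketched roughly as follows: expand a range such as "moz2-linux-slave20 thru 25" into host names, then emit one Nagios hostgroup plus one service definition per check. This is only an illustration, not the config that was actually deployed. The group name, the generic-service template, the check_ping thresholds and the check_nrpe!check_buildbot command string are all assumptions; only the host names and the check_buildbot name come from this bug.

```python
import re


def expand_hosts(spec):
    """Expand 'moz2-linux-slave20 thru 25' into a list of host names.

    A spec without ' thru ' is returned unchanged as a one-element list.
    """
    match = re.fullmatch(r"(.*?)(\d+) thru (\d+)", spec.strip())
    if match is None:
        return [spec.strip()]
    prefix, first, last = match.groups()
    width = len(first)  # keep zero padding, e.g. slave04 .. slave09
    return [f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)]


def render_group(group, hosts, checks):
    """Render one hostgroup and one service definition per (name, command)."""
    lines = [
        "define hostgroup {",
        f"    hostgroup_name  {group}",
        f"    members         {','.join(hosts)}",
        "}",
    ]
    for description, command in checks:
        lines += [
            "define service {",
            # 'generic-service' is an assumed template name on the master.
            "    use                  generic-service",
            f"    hostgroup_name       {group}",
            f"    service_description  {description}",
            f"    check_command        {command}",
            "}",
        ]
    return "\n".join(lines)


# Hypothetical group name and check commands, for illustration only.
hosts = expand_hosts("moz2-linux-slave20 thru 25")
config = render_group(
    "moz2-linux-slaves",
    hosts,
    [("PING", "check_ping!100.0,20%!500.0,60%"),
     ("buildbot", "check_nrpe!check_buildbot")],
)
print(config)
```

With checks attached to a hostgroup, bringing a new identical slave under monitoring means adding one name to the members line, while older machines that need individual configuration can keep per-host service definitions.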
Update: 9) try-linux-slave04 thru 09 - setup same checks as try-linux-slave03 (four of these are still to be cloned in bug 485885)
While fixing those nagios settings, can you also make sure to add the following machines? They are in inventory but not in nagios (ref platforms, esx hosts, and Nokias have been removed from this list):
  balsa-18branch bm-symbolfetch01
  bm-xserve03 bm-xserve04 bm-xserve07 bm-xserve15 bm-xserve22
  crazyhorse egg
  fx-linux-1.9-slave03 fx-linux-1.9-slave04 fx-win32-1.9-slave03 fx-win32-1.9-slave04
  karma
  moz2-darwin9-slave02 moz2-darwin9-slave03 moz2-darwin9-slave04 moz2-darwin9-slave05 moz2-darwin9-slave06 moz2-darwin9-slave07 moz2-darwin9-slave08 moz2-darwin9-slave09 moz2-darwin9-slave10 moz2-darwin9-slave11 moz2-darwin9-slave12
  moz2-linux-slave17 moz2-linux-slave18 moz2-linux-slave19 moz2-linux-slave20 moz2-linux-slave21 moz2-linux-slave22 moz2-linux-slave23 moz2-linux-slave24 moz2-linux-slave25
  moz2-linux-workstation moz2-linuxnonsse-slave01
  moz2-win32-slave19 moz2-win32-slave20 moz2-win32-slave21 moz2-win32-slave22 moz2-win32-slave23 moz2-win32-slave24 moz2-win32-slave25 moz2-win32-slave26 moz2-win32-slave27 moz2-win32-slave28 moz2-win32-slave29
  moz2-win32nonsse-slave01
  production-1.8-master production-crazyhorse production-pacifica-vm02 production-patrocles production-prometheus-vm02
  prometheus.build
  qm-buildbot01 qm-mini-centos01 qm-mini-centos02
  qm-pleopard-slave01 qm-pleopard-slave02 qm-pleopard-try01 qm-pleopard-try02
  qm-plinux-slave01 qm-plinux-slave02 qm-plinux-stage01 qm-plinux-trunk02
  qm-pmac-slave01 qm-pmac-slave02 qm-ptiger-try02 qm-pubuntu-try02
  qm-pvista-slave01 qm-pvista-slave02 qm-pvista-slave03 qm-pvista-slave04 qm-pvista-try01 qm-pvista-try02
  qm-pxp-slave01 qm-pxp-slave02 qm-pxp-slave03 qm-pxp-slave04 qm-pxp-try02
  staging-1.9-master staging-crazyhorse staging-opsi staging-pacifica-vm staging-pacifica-vm02 staging-patrocles staging-prometheus-vm staging-prometheus-vm02 staging-stage staging-try-master
  tb-linux-tbox tbnewref-win32-tbox
  try-linux-slave04 try-linux-slave05 try-linux-slave06 try-linux-slave07 try-linux-slave08 try-linux-slave09
  try-mac-slave03 try-mac-slave04 try-mac-slave05
  try-master try-pmac-unit-01
  try-win32-slave04 try-win32-slave05 try-win32-slave06 try-win32-slave07 try-win32-slave08 try-win32-slave09
The following machines were not monitored by nagios, but that's OK because they are a) not production RelEng, or b) obsolete/powered off/etc:
  bm-centos5-unittest-01 bm-l10n-centos5-01 bm-l10n-pmac-01 bm-l10n-win2k3-01 bm-stage-osx-01
  gaius.build mozillabuild-builder papaya pineapple
  qm-image-master qm-ref-leopard qm-ref-tiger qm-ref-ubuntu qm-ref-vista qm-ref-xp
  qm-leak-tiger-01 qm-leak-win2k3-01 qm-purify01 qm-rhel03 qm-vmware01 qm-win2k3-stage-pgo01
  qm-xserve03 qm-xserve04 qm-xserve05
  solaria test-linslave test-mgmt test-opsi test-winslave test-winslave2
  unknown-machine unused-1463
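A comparison like the one above (hosts in inventory but not in nagios, minus machines that are deliberately unmonitored) can be expressed as a set difference. The sketch below is a guess at the approach, not the script that produced these lists. It assumes both systems can export plain host names and that a trailing ".build" is a domain suffix that can be ignored when comparing. The sample data is hypothetical and merely reuses a few names from this bug.

```python
def normalize(name):
    """Lower-case a host name and drop a trailing '.build'.

    Treating '.build' as an ignorable domain suffix is an assumption.
    Names are not split on '.', because hosts such as
    fx-linux-1.9-slave03 contain dots of their own.
    """
    name = name.strip().lower()
    suffix = ".build"
    return name[: -len(suffix)] if name.endswith(suffix) else name


def missing_from_nagios(inventory, monitored, excluded=()):
    """Return sorted inventory hosts that are neither monitored nor excluded."""
    known = {normalize(h) for h in monitored} | {normalize(h) for h in excluded}
    return sorted({normalize(h) for h in inventory} - known)


# Hypothetical sample data; real lists would come from exports of each system.
inventory = ["bm-xserve22", "try-master", "qm-rhel03",
             "moz2-linux-slave11", "fx-linux-1.9-slave03"]
monitored = ["moz2-linux-slave11.build"]
excluded = ["qm-rhel03"]  # obsolete, or not production RelEng

print(missing_from_nagios(inventory, monitored, excluded))
```

Keeping the exclusions as an explicit list means obsolete or non-production machines stay documented rather than silently dropping out of the report.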
Assignee: server-ops → aravind
(In reply to comment #2)
> (ref platforms, esx hosts, and nokia's have been removed from this list):
> balsa-18branch
> bm-symbolfetch01
I need to go through this list to confirm it. balsa-18branch is a fx2.0 machine that was in nagios until very recently, and doesn't need to be added back. bm-symbolfetch01 is actually off, need to talk to Ted if he's still going to use it. There's a lot of overlap with comment #0 too.
2) bm-xserve22.build - add PING and buildbot checks as defined on bm-xserve16 - DONE
3) moz2-darwin9-slave09/10 - set up same checks as moz2-darwin9-slave08, some of these may fail as these machines are not set up - DONE
4) try-mac-slave03/04 - set up same checks as try-mac-slave02 - DONE
5) try-mac-slave05/06 - set up same checks as try-mac-slave02 - DONE
(In reply to comment #4)
> 3) moz2-darwin9-slave09/10 - setup same checks as moz2-darwin9-slave08, some of
> these may fail as these machines are not set up - DONE
> 4) try-mac-slave03/04 - set up same checks as try-mac-slave02 - DONE
> 5) try-mac-slave05/06 - set up same checks as try-mac-slave02 - DONE
We changed our minds here, sorry. So moz2-darwin9-slave09/10 don't exist (also can't see the checks using my nagios login), and we have try-mac-slave07/08/09 that need checks (like try-mac-slave02).
(In reply to comment #5)
> We changed our minds here, sorry. So moz2-darwin9-slave09/10 don't exist (also
> can't see the checks using my nagios login), and we have try-mac-slave07/08/09
> that need checks (like try-mac-slave02).
DONE
6) moz2-linux-slave06.build - Service buildbot flaps between "CHECK_NRPE: Error - Could not complete SSL handshake" and OK every few minutes. Other checks on this host are OK. Any ideas?
Fixed - was a config problem in the nagios master.
7) moz2-linux-slave11.build needs "disk - /builds" check cloned from another moz2-linux-slaveNN config - DONE
8) moz2-linux-slave20 thru 25 - set up same checks as moz2-linux-slave19 (new machines) - DONE
9) try-linux-slave04 - set up same checks as try-linux-slave03 - DONE
10) moz2-win32-slave01 - remove processes and avg_load checks - DONE
11) moz2-win32-slave24 thru 29 - set up same checks as moz2-win32-slave23 (new machines) - DONE
12) fx-win32-1.9-slave10.build and fx-win32-1.9-slave11.build - remove all checks (no longer in use) - DONE
13) try-win32-slave01 thru 03, remove avg_load check - DONE
14) try-win32-slave04 thru 09, copy checks from try-win32-slave03 after 13) - DONE
(In reply to comment #1)
> Update:
>
> 9) try-linux-slave04 thru 09 - setup same checks as try-linux-slave03 (four of
> these are still to be cloned in bug 485885)
That one is done as well. Please open a different bug when you have the stuff in comment 2 figured out.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Awesome, thanks aravind!
Product: mozilla.org → mozilla.org Graveyard