Closed
Bug 488240
Opened 16 years ago
Closed 16 years ago
Mega nagios config changes
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: aravind)
References
Details
Apologies for the long list here; I'm trying to get everything covered and consistent, and to fix up a couple of broken tests.
1) bm-xserve12.build
a) service: RAID, always returns
CHECK_NRPE: Socket timeout after 15 seconds.
The other checks (PING, disk use etc) work fine. In
/usr/local/nagios/etc/nrpe.cfg
we have
command[check_appleraid]=/usr/local/nagios/plugins/check_appleraid
and that utility returns
Untitled RAID Set 1 Online: * disk0s2 Online SMART Verified * disk1s2 Online SMART Verified *
We don't do this check on bm-xserve16 thru 19 or 22, so we can either get it working and roll it out there, or disable it here.
b) Add a check_buildbot check, same as defined on bm-xserve16
2) bm-xserve22.build - add PING and buildbot checks as defined on bm-xserve16
3) moz2-darwin9-slave09/10 - setup same checks as moz2-darwin9-slave08, some of these may fail as these machines are not set up
4) try-mac-slave03/04 - set up same checks as try-mac-slave02
5) try-mac-slave05/06 - set up same checks as try-mac-slave02, we'll move a couple machines from bug 480203 to be these two machines (by separate request) so checks will fail at first
6) moz2-linux-slave06.build - Service buildbot
Flaps between "CHECK_NRPE: Error - Could not complete SSL handshake" and OK every few minutes. Other checks on this host are OK. Any ideas?
7) moz2-linux-slave11.build needs "disk - /builds" check cloned from another moz2-linux-slaveNN config
8) moz2-linux-slave20 thru 25 - setup same checks as moz2-linux-slave19 (new machines)
9) try-linux-slave04 - setup same checks as try-linux-slave03
10) moz2-win32-slave01 - remove processes and avg_load checks
11) moz2-win32-slave24 thru 29 - setup same checks as moz2-win32-slave23 (new machines)
12) fx-win32-1.9-slave10.build and fx-win32-1.9-slave11.build - remove all checks (no longer in use)
13) try-win32-slave01 thru 03, remove avg_load check
14) try-win32-slave04 thru 09, copy checks from try-win32-slave03 after 13)
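For 1a), one way to narrow down the timeout might be to time the plugin locally on bm-xserve12 and compare the result with the 15 seconds reported by CHECK_NRPE. The sketch below is only an illustration: `time_plugin` is a hypothetical helper, it assumes Python 3.7+ is available on the host, and the 15-second default simply mirrors the number in the error message.

```python
import subprocess
import time


def time_plugin(argv, limit_seconds=15.0):
    """Run a plugin command; return (exit_code, elapsed_seconds, timed_out, output).

    limit_seconds defaults to 15 to mirror the CHECK_NRPE message (an
    assumption about where the limit applies, not a confirmed setting).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=limit_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child once the limit is exceeded.
        return None, time.monotonic() - start, True, ""
    return proc.returncode, time.monotonic() - start, False, proc.stdout.strip()
```

Calling `time_plugin(["/usr/local/nagios/plugins/check_appleraid"])` on the host would show whether the plugin itself is slow. If it finishes quickly, the delay is more likely somewhere between the nagios server and the NRPE daemon.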
If it's easier or more useful to define a standard set of checks for each platform and then assign hosts to that set, that would probably work fine. We want everything to be the same for identical machines like moz2-win32-slaveN, but some of the older machines would still need individual configuration.
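A rough sketch of the "standard set per platform" idea, generating nagios object definitions from a host range and a check list. Everything named here is an assumption for illustration: the helpers `expand_hosts` and `render_platform`, the hostgroup name, the `generic-service` template, and the check commands in `WIN32_CHECKS` are hypothetical and not taken from the real nagios config.

```python
def expand_hosts(prefix, first, last, width=2):
    """Expand e.g. ('moz2-win32-slave', 24, 29) into zero-padded host names."""
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


def render_platform(hostgroup, hosts, checks, template="generic-service"):
    """Render one hostgroup plus one service per check as nagios object text.

    checks maps service_description -> check_command; the template name is
    an assumption, not taken from the real configuration.
    """
    lines = [
        "define hostgroup {",
        f"    hostgroup_name  {hostgroup}",
        f"    members         {','.join(hosts)}",
        "}",
    ]
    for description, command in checks.items():
        lines += [
            "define service {",
            f"    use                  {template}",
            f"    hostgroup_name       {hostgroup}",
            f"    service_description  {description}",
            f"    check_command        {command}",
            "}",
        ]
    return "\n".join(lines)


# Hypothetical check set for the identical win32 slaves.
WIN32_CHECKS = {
    "PING": "check_ping!100.0,20%!500.0,60%",
    "buildbot": "check_nrpe!check_buildbot",
}
```

For example, `render_platform("moz2-win32-slaves", expand_hosts("moz2-win32-slave", 24, 29), WIN32_CHECKS)` would emit one hostgroup and two services that apply to every member. Older machines that need individual configuration could stay as per-host definitions.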
Reporter
Comment 1•16 years ago
Update:
9) try-linux-slave04 thru 09 - setup same checks as try-linux-slave03 (four of these are still to be cloned in bug 485885)
Comment 2•16 years ago
While fixing those nagios settings, can you also make sure to add the following machines - they are in inventory but not in nagios.
(ref platforms, ESX hosts, and Nokias have been removed from this list):
balsa-18branch
bm-symbolfetch01
bm-xserve03
bm-xserve04
bm-xserve07
bm-xserve15
bm-xserve22
crazyhorse
egg
fx-linux-1.9-slave03
fx-linux-1.9-slave04
fx-win32-1.9-slave03
fx-win32-1.9-slave04
karma
moz2-darwin9-slave02
moz2-darwin9-slave03
moz2-darwin9-slave04
moz2-darwin9-slave05
moz2-darwin9-slave06
moz2-darwin9-slave07
moz2-darwin9-slave08
moz2-darwin9-slave09
moz2-darwin9-slave10
moz2-darwin9-slave11
moz2-darwin9-slave12
moz2-linux-slave17
moz2-linux-slave18
moz2-linux-slave19
moz2-linux-slave20
moz2-linux-slave21
moz2-linux-slave22
moz2-linux-slave23
moz2-linux-slave24
moz2-linux-slave25
moz2-linux-workstation
moz2-linuxnonsse-slave01
moz2-win32-slave19
moz2-win32-slave20
moz2-win32-slave21
moz2-win32-slave22
moz2-win32-slave23
moz2-win32-slave24
moz2-win32-slave25
moz2-win32-slave26
moz2-win32-slave27
moz2-win32-slave28
moz2-win32-slave29
moz2-win32nonsse-slave01
production-1.8-master
production-crazyhorse
production-pacifica-vm02
production-patrocles
production-prometheus-vm02
prometheus.build
qm-buildbot01
qm-mini-centos01
qm-mini-centos02
qm-pleopard-slave01
qm-pleopard-slave02
qm-pleopard-try01
qm-pleopard-try02
qm-plinux-slave01
qm-plinux-slave02
qm-plinux-stage01
qm-plinux-trunk02
qm-pmac-slave01
qm-pmac-slave02
qm-ptiger-try02
qm-pubuntu-try02
qm-pvista-slave01
qm-pvista-slave02
qm-pvista-slave03
qm-pvista-slave04
qm-pvista-try01
qm-pvista-try02
qm-pxp-slave01
qm-pxp-slave02
qm-pxp-slave03
qm-pxp-slave04
qm-pxp-try02
staging-1.9-master
staging-crazyhorse
staging-opsi
staging-pacifica-vm
staging-pacifica-vm02
staging-patrocles
staging-prometheus-vm
staging-prometheus-vm02
staging-stage
staging-try-master
tb-linux-tbox
tbnewref-win32-tbox
try-linux-slave04
try-linux-slave05
try-linux-slave06
try-linux-slave07
try-linux-slave08
try-linux-slave09
try-mac-slave03
try-mac-slave04
try-mac-slave05
try-master
try-pmac-unit-01
try-win32-slave04
try-win32-slave05
try-win32-slave06
try-win32-slave07
try-win32-slave08
try-win32-slave09
The following machines were not monitored by nagios, but that's OK because they are either a) not production RelEng, or b) obsolete, powered off, etc.:
bm-centos5-unittest-01
bm-l10n-centos5-01
bm-l10n-pmac-01
bm-l10n-win2k3-01
bm-stage-osx-01
gaius.build
mozillabuild-builder
papaya
pineapple
qm-image-master
qm-ref-leopard
qm-ref-tiger
qm-ref-ubuntu
qm-ref-vista
qm-ref-xp
qm-leak-tiger-01
qm-leak-win2k3-01
qm-purify01
qm-rhel03
qm-vmware01
qm-win2k3-stage-pgo01
qm-xserve03
qm-xserve04
qm-xserve05
solaria
test-linslave
test-mgmt
test-opsi
test-winslave
test-winslave2
unknown-machine
unused-1463
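A comparison like the one above can be sketched as a set difference between the inventory and the hosts nagios knows about. This is only an illustration: `missing_from_nagios` is a hypothetical helper, and treating `host` and `host.build` as the same machine is an assumption about the naming, not something confirmed in this bug.

```python
def normalize(name):
    """Lower-case a host name and drop a trailing '.build' (assumed alias)."""
    name = name.strip().lower()
    if name.endswith(".build"):
        name = name[: -len(".build")]
    return name


def missing_from_nagios(inventory, monitored, ignored=()):
    """Return inventory hosts that are neither monitored nor deliberately ignored."""
    inv = {normalize(h) for h in inventory}
    mon = {normalize(h) for h in monitored}
    ign = {normalize(h) for h in ignored}
    return sorted(inv - mon - ign)
```

Passing the obsolete or non-production machines as `ignored` keeps them out of the result, so the output would correspond to the first list only.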
Assignee
Updated•16 years ago
Assignee: server-ops → aravind
Reporter
Comment 3•16 years ago
(In reply to comment #2)
> (ref platforms, esx hosts, and nokia's have been removed from this list):
> balsa-18branch
> bm-symbolfetch01
I need to go through this list to confirm it. balsa-18branch is a fx2.0 machine that was in nagios until very recently, and doesn't need to be added back. bm-symbolfetch01 is actually off; I need to ask Ted whether he's still going to use it. There's a lot of overlap with comment #0 too.
Assignee
Comment 4•16 years ago
2) bm-xserve22.build - add PING and buildbot checks as defined on bm-xserve16 - DONE
3) moz2-darwin9-slave09/10 - setup same checks as moz2-darwin9-slave08, some of
these may fail as these machines are not set up - DONE
4) try-mac-slave03/04 - set up same checks as try-mac-slave02 - DONE
5) try-mac-slave05/06 - set up same checks as try-mac-slave02 - DONE
Reporter
Comment 5•16 years ago
(In reply to comment #4)
> 3) moz2-darwin9-slave09/10 - setup same checks as moz2-darwin9-slave08, some of
> these may fail as these machines are not set up - DONE
> 4) try-mac-slave03/04 - set up same checks as try-mac-slave02 - DONE
> 5) try-mac-slave05/06 - set up same checks as try-mac-slave02 - DONE
We changed our minds here, sorry. So moz2-darwin9-slave09/10 don't exist (also can't see the checks using my nagios login), and we have try-mac-slave07/08/09 that need checks (like try-mac-slave02).
Assignee
Comment 6•16 years ago
(In reply to comment #5)
> We changed our minds here, sorry. So moz2-darwin9-slave09/10 don't exist (also
> can't see the checks using my nagios login), and we have try-mac-slave07/08/09
> that need checks (like try-mac-slave02).
DONE
Assignee
Comment 7•16 years ago
6) moz2-linux-slave06.build - Service buildbot
Flaps between "CHECK_NRPE: Error - Could not complete SSL handshake" and OK
every few minutes. Other checks on this host are OK. Any ideas ?
Fixed - was a config problem in the nagios master.
Assignee
Comment 8•16 years ago
7) moz2-linux-slave11.build needs "disk - /builds" check cloned from another
moz2-linux-slaveNN config - DONE
8) moz2-linux-slave20 thru 25 - setup same checks as moz2-linux-slave19 (new
machines) - DONE
Assignee
Comment 9•16 years ago
9) try-linux-slave04 - setup same checks as try-linux-slave03 - DONE
10) moz2-win32-slave01 - remove processes and avg_load checks - DONE
11) moz2-win32-slave24 thru 29 - setup same checks as moz2-win32-slave23 (new
machines) - DONE
12) fx-win32-1.9-slave10.build and fx-win32-1.9-slave11.build - remove all
checks (no longer in use) - DONE
13) try-win32-slave01 thru 03, remove avg_load check - DONE
14) try-win32-slave04 thru 09, copy checks from try-win32-slave03 after 13) - DONE
Assignee
Comment 10•16 years ago
(In reply to comment #1)
> Update:
>
> 9) try-linux-slave04 thru 09 - setup same checks as try-linux-slave03 (four of
> these are still to be cloned in bug 485885)
That one is done as well. Please open a different bug when you have the stuff in comment 2 figured out.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Reporter
Comment 11•16 years ago
Awesome, thanks aravind!
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard