Closed Bug 410019 Opened 12 years ago Closed 12 years ago
set up nagios monitoring for build machines
Over in bug 393274 I'm setting up nrpe on the build machines. I've got 14 machines total that I'd like to be monitored, but I want to get it right on the first few and then roll that out. Should I put info on what we need monitored in this (or similar) bugs? Is there a nagios cfg repo that I can apply patches to, to make this easier? Any other ideas on getting the build farm monitoring situation (read: no monitoring) under control?
Please set up nagios to monitor the host staging-build-console.build.mozilla.org. NRPE is running and has the following checks set up:
check_users, check_load, check_disk1, check_disk2, check_disk3, check_disk4, check_zombie_procs, check_total_procs, check_buildbot
Can you also do ping, HTTP port 80 and HTTP port 8081 checks?
who should these page/email? do you guys have a build alias for these alerts?
(In reply to comment #2)
> who should these page/email? do you guys have a build alias for these alerts?
firstname.lastname@example.org is ok for now, this is where the tinderbox and ftp nagios alerts go. Also it should not go off very often, we should react if it does :)
Here are some more hosts that are set up. They all have the following monitors:
check_users, check_load, check_disk1, check_zombie_procs, check_total_procs, check_buildbot
Some of them have additional disk monitors too, listed per host below:
build-console: check_disk2 check_disk3 check_disk4
production-trunk-automation: check_disk2
fx-linux-1.9-slave2: check_disk2
fx-linux-1.9-slave1: check_disk2
production-prometheus-vm: (no extra monitors)
Do you have a build "group" set up within nagios? I can currently see Build-related tests like the stale file and tinderbox checks using my LDAP login, which helps for acknowledging problems in Tier 1 machines when I'm awake but IT isn't. It could be useful to include these new tests, and put other Build people in the "group", so that we can see all our status info in one place.
Some more machines:
staging-trunk-automation: check_users, check_load, check_disk1, check_disk2, check_zombie_procs, check_total_procs, check_buildbot
The rest are Windows machines.
fx-win32-1.9-slave1 AND fx-win32-1.9-slave2: check_cpu, check_disk1, check_disk2, check_disk3, check_total_procs, check_buildbot
production-pacifica-vm AND staging-pacifica-vm: check_cpu, check_disk1, check_total_procs, check_buildbot
Is there an ETA for getting these set up?
Macs are up and running now. fx-mac-1.9-slave1, fx-mac-1.9-slave2, bm-xserve03, and bm-xserve05 all have nrpe running with the following monitors:
check_users, check_load, check_disk1, check_total_procs, check_zombie_procs, check_buildbot
Aravind, you mentioned changes to the way disk monitoring is done. Just let me know what those are and I'll do it.
Here is a copy of the e-mail I sent bhearsum, just in case someone else from the build team decides to do this.
I will expect nrpe on the host to answer to at least the following. You can define any additional tests as you like and I will call those without any arguments. The check_cciss and the check_hp* stuff is for HP hardware and is optional if you don't want me to monitor that stuff. I don't see any scripts named check_zombie_proc and check_total_proc, but they should accept arguments similar to those of the check_proc script (unless they are Windows, then all bets are off). I will call check_buildbot without any arguments.
Note: the check_disk script will take a path (mount point) argument and will check that mount point for disk space. So for the stuff marked check_disk1, check_disk2 etc., you will have to give me the mount point names. There is always a root disk check (for mount point /) for each machine by default. Let me know if you need me to clarify any of this stuff.
Oh, and since nrpe will be accepting command line arguments, set dont_blame_nrpe to 1.
Aravind.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
command[check_users]=/usr/lib/nagios/plugins/check_users -w $ARG1$ -c $ARG2$
command[check_cciss]=/usr/lib/nagios/plugins/check_cciss -N 0 -v
command[check_procs]=/usr/lib/nagios/plugins/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$
command[check_hplog]=/usr/lib/nagios/plugins/check_hplog -t l
command[check_hpasm]=/usr/lib/nagios/plugins/contrib/check_hpasm -t 5
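For anyone unfamiliar with how the $ARG1$, $ARG2$, ... placeholders in the command definitions above get filled in, here is a rough Python model of the substitution. It is only an illustration, not NRPE's actual implementation: the function name expand_nrpe_args is made up for this sketch, the 20%/10% thresholds are example values, and treating a missing argument as an empty string is an assumption.

```python
import re


def expand_nrpe_args(template, args):
    """Replace $ARG1$, $ARG2$, ... in a command template with the given args.

    Hypothetical helper for illustration only. A placeholder with no
    matching argument becomes an empty string (an assumption of this sketch).
    """
    def substitute(match):
        index = int(match.group(1)) - 1  # $ARG1$ maps to args[0]
        return args[index] if 0 <= index < len(args) else ""

    return re.sub(r"\$ARG(\d+)\$", substitute, template)


# Command template copied from the check_disk definition above.
CHECK_DISK = "/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$"

if __name__ == "__main__":
    # Example thresholds and mount point; the real values are up to the nagios side.
    print(expand_nrpe_args(CHECK_DISK, ["20%", "10%", "/builds"]))
```

This is also why dont_blame_nrpe has to be 1: without it, nrpe refuses requests that carry arguments.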
OK. All of the Linux and Mac machines are now set up to respond to check_load, check_disk, check_users, and check_procs as specified in comment #8. The Windows machines will respond to check_load, check_disk, and check_procs. All of these machines respond to check_buildbot. I'm not sure what good warn/crit levels are, so I'll leave that to you. Here's a list of mount points/drives to run check_disk on (excluding /):
staging-build-console: /builds /cvs /data
build-console: /builds /cvs /data
staging-trunk-automation, production-trunk-automation, fx-linux-1.9-slave1, fx-linux-1.9-slave2: /builds
fx-win32-1.9-slave1, fx-win32-1.9-slave2: C D E
The Windows hosts require a % sign on the first two args of check_disk; the default unit is bytes, so it should be checked like this, e.g. './check_nrpe -H fx-win32-1.9-slave1.build -c check_disk -a 20% 10% C'. Let me know if you need any more information.
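To make the list above easier to act on, here is a small Python sketch that turns the host/mount-point list into check_nrpe invocations of the same shape as the example in this comment. The 20%/10% thresholds are just the values from that example, using the ".build" host suffix for every machine is an assumption, and the names DISK_TARGETS and disk_check_commands are made up for illustration.

```python
# Mount points and drives from the list above (the root "/" check is
# already done by default, so it is not repeated here).
DISK_TARGETS = {
    "staging-build-console": ["/builds", "/cvs", "/data"],
    "build-console": ["/builds", "/cvs", "/data"],
    "staging-trunk-automation": ["/builds"],
    "production-trunk-automation": ["/builds"],
    "fx-linux-1.9-slave1": ["/builds"],
    "fx-linux-1.9-slave2": ["/builds"],
    "fx-win32-1.9-slave1": ["C", "D", "E"],
    "fx-win32-1.9-slave2": ["C", "D", "E"],
}


def disk_check_commands(host, targets, warn="20%", crit="10%"):
    """Build one check_nrpe argument list per mount point or drive.

    warn/crit default to the example thresholds; percentages are used
    everywhere because the Windows hosts need the % sign.
    """
    return [
        ["./check_nrpe", "-H", f"{host}.build", "-c", "check_disk",
         "-a", warn, crit, target]
        for target in targets
    ]


if __name__ == "__main__":
    for host, targets in DISK_TARGETS.items():
        for argv in disk_check_commands(host, targets):
            print(" ".join(argv))
```

The commands are only printed, not executed, so the output can be reviewed or pasted into a shell on the nagios host.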
It looks like check_load on the Linux boxes is set to
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$ -p $ARG3$
Per comment 8, it should be set to
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$
Please make the correction on the build boxes and restart nrpe. Please make sure that the other checks are defined correctly as well.
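One way to double-check the definitions on each box is to compare the command[...] lines in the nrpe config against the ones from comment 8. The Python sketch below does that for a config passed in as text. The EXPECTED table is copied from comment 8, while the function name find_mismatches and the idea of feeding it the contents of nrpe.cfg are assumptions for illustration, since the location of the config file on these machines is not given here.

```python
import re

# Definitions copied from comment 8 (argument-taking checks only).
EXPECTED = {
    "check_load": "/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$",
    "check_disk": "/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$",
    "check_users": "/usr/lib/nagios/plugins/check_users -w $ARG1$ -c $ARG2$",
    "check_procs": "/usr/lib/nagios/plugins/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$",
}

COMMAND_LINE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(.*?)\s*$")


def find_mismatches(cfg_text):
    """Return {check_name: actual_command} for definitions that differ
    from EXPECTED. Comment lines and unknown checks are ignored."""
    mismatches = {}
    for line in cfg_text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        match = COMMAND_LINE.match(line)
        if not match:
            continue
        name, actual = match.group(1), match.group(2)
        expected = EXPECTED.get(name)
        # Compare token by token so extra whitespace does not matter.
        if expected is not None and actual.split() != expected.split():
            mismatches[name] = actual
    return mismatches


if __name__ == "__main__":
    sample = (
        "command[check_load]=/usr/lib/nagios/plugins/check_load "
        "-w $ARG1$ -c $ARG2$ -p $ARG3$\n"
        "command[check_users]=/usr/lib/nagios/plugins/check_users "
        "-w $ARG1$ -c $ARG2$\n"
    )
    for name, actual in find_mismatches(sample).items():
        print(f"{name}: {actual}")
```

Run against the broken definition quoted above, this reports check_load and nothing else.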
I noticed this too, but I thought I corrected them all.
As of 7:45pm PST, I'm getting nagios check_load usage errors from the following machines:
staging-trunk-automation.build
production-trunk-automation.build
staging-build-console.build
fx-linux-1.9-slave2.build
Hope that helps....
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard