Closed Bug 410019 Opened 12 years ago Closed 12 years ago

set up nagios monitoring for build machines

Categories

(mozilla.org Graveyard :: Server Operations, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: aravind)

Over in bug 393274 I'm setting up nrpe on the build machines.

I've got 14 machines total that I'd like to be monitored, but I want to get it right on the first few and then roll that out.

Should I put info on what we need monitored in this (or a similar) bug? Is there a nagios cfg repo that I can apply patches to, to make this easier?

Any other ideas on getting the build farm monitoring situation (read: no monitoring) under control?
Please set up nagios to monitor the host staging-build-console.build.mozilla.org. NRPE is running and has the following checks set up:

check_users
check_load
check_disk1
check_disk2
check_disk3
check_disk4
check_zombie_procs
check_total_procs
check_buildbot

Can you also do ping, HTTP port 80 and HTTP port 8081 checks?
Blocks: 393274
who should these page/email?  do you guys have a build alias for these alerts?
(In reply to comment #2)
> who should these page/email?  do you guys have a build alias for these alerts?

build@mozilla.org is ok for now, this is where the tinderbox and ftp nagios alerts go. Also, it should not go off very often, and we should react if it does :)
Here are some more hosts that are set up. They all have the following monitors:
check_users
check_load
check_disk1
check_zombie_procs
check_total_procs
check_buildbot

Some of them have additional disk monitors too. I'll list them with the hosts below:
build-console:
check_disk2
check_disk3
check_disk4

production-trunk-automation:
check_disk2

fx-linux-1.9-slave2:
check_disk2

fx-linux-1.9-slave1:
check_disk2

production-prometheus-vm:
(no extra monitors)
Assignee: server-ops → aravind
Do you have a build "group" set up within nagios? I can currently see Build-related tests like the stale file and tinderbox checks using my LDAP login, which helps for acknowledging problems in Tier 1 machines when I'm awake but IT isn't.

It could be useful to include these new tests, and put other Build people in the "group", so that we can see all our status info all in one place.
Some more machines:
staging-trunk-automation:
check_users, check_load, check_disk1, check_disk2, check_zombie_procs, check_total_procs, check_buildbot

The rest are Windows machines.
fx-win32-1.9-slave1 AND fx-win32-1.9-slave2:
check_cpu, check_disk1, check_disk2, check_disk3, check_total_procs, check_buildbot
production-pacifica-vm AND staging-pacifica-vm:
check_cpu, check_disk1, check_total_procs, check_buildbot


Is there an ETA for getting these set up?
Macs are up and running now. fx-mac-1.9-slave1, fx-mac-1.9-slave2, bm-xserve03, and bm-xserve05 all have nrpe running with the following monitors:
check_users
check_load
check_disk1
check_total_procs
check_zombie_procs
check_buildbot

Aravind, you mentioned changes to the way disk monitoring is done. Just let me know what those are and I'll do it.
Here is a copy of the e-mail I sent bhearsum, just in case someone else from the build team decides to do this.

I will expect nrpe on the host to answer to at least the following.
You can define any additional tests as you like and I will call those
without any arguments.  The check_cciss and the check_hp* stuff is for
HP hardware and is optional if you don't want me to monitor that
stuff.  I don't see any scripts named check_zombie_proc and
check_total_proc, but they should accept similar arguments as the
check_proc script (unless they are windows, then all bets are off).  I
will call check_buildbot without any arguments.

Note: the check_disk script will take a path (mount point) argument
and will check that mount point for disk space.  So for the stuff
marked check_disk1, check_disk2 etc, you will have to give me the
mount point names.  There is always a root disk check (for mount point
/) for each machine by default.  Let me know if you need me to clarify
any of this stuff.  Oh, and since nrpe will be accepting command line
arguments, set dont_blame_nrpe to 1.

Aravind.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
command[check_users]=/usr/lib/nagios/plugins/check_users -w $ARG1$ -c $ARG2$
command[check_cciss]=/usr/lib/nagios/plugins/check_cciss -N 0 -v
command[check_procs]=/usr/lib/nagios/plugins/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$
command[check_hplog]=/usr/lib/nagios/plugins/check_hplog -t l
command[check_hpasm]=/usr/lib/nagios/plugins/contrib/check_hpasm -t 5
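
For anyone unsure how the $ARG1$, $ARG2$, $ARG3$ placeholders in the definitions above get filled in, here is a minimal Python sketch of the idea. It is not how NRPE itself is implemented; the helper names parse_commands and expand are hypothetical, and the threshold values passed in at the end are only illustrative.

```python
import re

# Command definitions copied from the e-mail above (a subset).
NRPE_CFG = """\
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
command[check_cciss]=/usr/lib/nagios/plugins/check_cciss -N 0 -v
"""

COMMAND_RE = re.compile(r"^command\[(?P<name>[^\]]+)\]=(?P<line>.+)$")
ARG_RE = re.compile(r"\$ARG(\d+)\$")


def parse_commands(cfg_text):
    """Return {command name: command line template} from nrpe.cfg-style text."""
    commands = {}
    for raw in cfg_text.splitlines():
        match = COMMAND_RE.match(raw.strip())
        if match:
            commands[match.group("name")] = match.group("line").strip()
    return commands


def expand(template, args):
    """Replace each $ARGn$ with the n-th supplied argument (1-based)."""
    def substitute(match):
        index = int(match.group(1)) - 1
        if index >= len(args):
            raise ValueError("missing argument $ARG%d$" % (index + 1))
        return args[index]
    return ARG_RE.sub(substitute, template)


commands = parse_commands(NRPE_CFG)
# Illustrative thresholds and mount point, not agreed values.
print(expand(commands["check_disk"], ["20%", "10%", "/builds"]))
```

Commands without placeholders, such as check_cciss, pass through unchanged, which matches the note that additional tests are called without any arguments.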
OK. All of the Linux and Mac machines are now set up to respond to check_load, check_disk, check_users, and check_procs as specified in comment #8. The Windows machines will respond to check_load, check_disk, and check_procs. All of these machines respond to check_buildbot.

I'm not sure what good warn/crit levels are, so I'll leave that to you. Here's a list of mountpoints/drives to check_disk on, though (excluding /).
staging-build-console:
/builds
/cvs
/data

build-console:
/builds
/cvs
/data

staging-trunk-automation, production-trunk-automation, fx-linux-1.9-slave1, fx-linux-1.9-slave2:
/builds

fx-win32-1.9-slave1, fx-win32-1.9-slave2:
C
D
E

The Windows hosts require a % sign on the first two args of check_disk because the default unit is bytes, so it should be checked like this, e.g. './check_nrpe -H fx-win32-1.9-slave1.build -c check_disk -a 20% 10% C'.

Let me know if you need any more information.
It looks like check_load on the linux boxes is set to
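
As a rough illustration of how the mount point list above maps onto check_nrpe calls, here is a small Python sketch. It only builds the command lines and does not run them. The 20% and 10% thresholds are borrowed from the Windows example and are an assumption, not agreed warn/crit levels; the function name disk_check_commands and the "win32 in the hostname means Windows" rule are also assumptions.

```python
# Mount points and drives listed in the comment above (root "/" excluded).
DISK_TARGETS = {
    "staging-build-console": ["/builds", "/cvs", "/data"],
    "build-console": ["/builds", "/cvs", "/data"],
    "staging-trunk-automation": ["/builds"],
    "production-trunk-automation": ["/builds"],
    "fx-linux-1.9-slave1": ["/builds"],
    "fx-linux-1.9-slave2": ["/builds"],
    "fx-win32-1.9-slave1": ["C", "D", "E"],
    "fx-win32-1.9-slave2": ["C", "D", "E"],
}

# Assumed thresholds, taken from the Windows example; real levels are undecided.
WARN, CRIT = "20%", "10%"


def disk_check_commands(host, targets, warn=WARN, crit=CRIT, domain="build"):
    """Build check_nrpe argument lists for one host's disk checks."""
    # Assumption: Windows hosts are the ones with "win32" in their name.
    is_windows = "win32" in host
    # Non-Windows hosts always get a root disk check in addition to the list.
    paths = list(targets) if is_windows else ["/"] + list(targets)
    return [
        ["./check_nrpe", "-H", "%s.%s" % (host, domain),
         "-c", "check_disk", "-a", warn, crit, path]
        for path in paths
    ]


for host, targets in DISK_TARGETS.items():
    for argv in disk_check_commands(host, targets):
        print(" ".join(argv))
```

Keeping the % sign in the default thresholds means the same values work for the Windows hosts, where a bare number would be read as bytes.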
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$ -p $ARG3$


Per comment 8, it should be set to
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$

Please make the correction on the build boxes and restart nrpe.  Please make sure that the other checks are defined correctly as well.
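
One way to double-check the definitions before restarting nrpe is to compare each box's config against the lines from comment 8. The Python sketch below does that on a sample string; the function name find_mismatches is hypothetical, and reading the real nrpe.cfg from disk is left out because its path on the build boxes is not stated here.

```python
import re

# Expected definitions, copied from comment 8.
EXPECTED = {
    "check_load": "/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$",
    "check_disk": "/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$",
    "check_users": "/usr/lib/nagios/plugins/check_users -w $ARG1$ -c $ARG2$",
    "check_procs": "/usr/lib/nagios/plugins/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$",
}

COMMAND_RE = re.compile(r"^command\[([^\]]+)\]=(.*)$")


def find_mismatches(cfg_text, expected=EXPECTED):
    """Return {name: (found, wanted)} for commands that differ or are missing."""
    found = {}
    for raw in cfg_text.splitlines():
        match = COMMAND_RE.match(raw.strip())
        if match:
            # Collapse runs of whitespace so spacing differences are ignored.
            found[match.group(1)] = " ".join(match.group(2).split())
    return {
        name: (found.get(name), wanted)
        for name, wanted in expected.items()
        if found.get(name) != wanted
    }


# The faulty check_load line quoted above, next to a correct check_users line.
sample = """\
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$ -p $ARG3$
command[check_users]=/usr/lib/nagios/plugins/check_users -w $ARG1$ -c $ARG2$
"""
for name, (got, wanted) in sorted(find_mismatches(sample).items()):
    print("%s: found %r, expected %r" % (name, got, wanted))
```

A command that is absent from the config is reported with a found value of None, so missing definitions show up alongside incorrect ones.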
I noticed this too, but I thought I corrected them all.
As of 7:45pm PST, I'm getting nagios check_load usage errors from the following machines:

staging-trunk-automation.build
production-trunk-automation.build
staging-build-console.build
fx-linux-1.9-slave2.build

Hope that helps....
Done.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard