410019 - set up nagios monitoring for build machines

Reporter

Description

•

18 years ago

Over in bug 393274 I'm setting up nrpe on the build machines. I've got 14 machines total that I'd like to be monitored, but I want to get it right on the first few and then roll that out. Should I put info on what we need monitored in this (or a similar) bug? Is there a nagios cfg repo that I can apply patches to, to make this easier? Any other ideas on getting the build farm monitoring situation (read: no monitoring) under control?

Robert Helmer [:rhelmer]

Reporter

Comment 1

•

18 years ago

Please set up nagios to monitor the host staging-build-console.build.mozilla.org. NRPE is running and has the following checks set up:

check_users
check_load
check_disk1
check_disk2
check_disk3
check_disk4
check_zombie_procs
check_total_procs
check_buildbot

Can you also do ping, HTTP port 80 and HTTP port 8081 checks?

Robert Helmer [:rhelmer]

Reporter

Updated

•

18 years ago

Blocks: 393274

Justin Fitzhugh

Comment 2

•

18 years ago

who should these page/email? do you guys have a build alias for these alerts?

Robert Helmer [:rhelmer]

Reporter

Comment 3

•

18 years ago

(In reply to comment #2)
> who should these page/email? do you guys have a build alias for these alerts?

build@mozilla.org is ok for now; this is where the tinderbox and ftp nagios alerts go. Also, it should not go off very often; we should react if it does :)

bhearsum@mozilla.com (:bhearsum)

Comment 4

•

18 years ago

Here are some more hosts that are set up. They all have the following monitors:

check_users
check_load
check_disk1
check_zombie_procs
check_total_procs
check_buildbot

Some of them have additional disk monitors too; I'll list them with the hosts below:

build-console: check_disk2 check_disk3 check_disk4
production-trunk-automation: check_disk2
fx-linux-1.9-slave2: check_disk2
fx-linux-1.9-slave1: check_disk2
production-prometheus-vm: (no extra monitors)

Aravind Gottipati [:aravind]

Assignee

Updated

•

18 years ago

Assignee: server-ops → aravind

Nick Thomas [:nthomas] (UTC+12)

Comment 5

•

18 years ago

Do you have a build "group" set up within nagios? I can currently see Build-related tests like the stale file and tinderbox checks using my LDAP login, which helps for acknowledging problems in Tier 1 machines when I'm awake but IT isn't. It could be useful to include these new tests, and put other Build people in the "group", so that we can see all our status info in one place.

bhearsum@mozilla.com (:bhearsum)

Comment 6

•

18 years ago

Some more machines:

staging-trunk-automation: check_users, check_load, check_disk1, check_disk2, check_zombie_procs, check_total_procs, check_buildbot

The rest are Windows machines.

fx-win32-1.9-slave1 AND fx-win32-1.9-slave2: check_cpu, check_disk1, check_disk2, check_disk3, check_total_procs, check_buildbot
production-pacifica-vm AND staging-pacifica-vm: check_cpu, check_disk1, check_total_procs, check_buildbot

Is there an ETA for getting these set up?

bhearsum@mozilla.com (:bhearsum)

Comment 7

•

18 years ago

Macs are up and running now. fx-mac-1.9-slave1, fx-mac-1.9-slave2, bm-xserve03, and bm-xserve05 all have nrpe running with the following monitors:

check_users
check_load
check_disk1
check_total_procs
check_zombie_procs
check_buildbot

Aravind, you mentioned changes to the way disk monitoring is done. Just let me know what those are and I'll do it.

Aravind Gottipati [:aravind]

Assignee

Comment 8

•

18 years ago

Here is a copy of the e-mail I sent bhearsum, just in case someone else from the build team decides to do this.

I will expect nrpe on the host to answer to at least the following. You can define any additional tests as you like and I will call those without any arguments. The check_cciss and the check_hp* stuff is for HP hardware and is optional if you don't want me to monitor that stuff. I don't see any scripts named check_zombie_proc and check_total_proc, but they should accept similar arguments to the check_proc script (unless they are windows, then all bets are off). I will call check_buildbot without any arguments.

Note: the check_disk script will take a path (mount point) argument and will check that mount point for disk space. So for the stuff marked check_disk1, check_disk2 etc, you will have to give me the mount point names. There is always a root disk check (for mount point /) for each machine by default.

Let me know if you need me to clarify any of this stuff. Oh, and since nrpe will be accepting command line arguments, set dont_blame_nrpe to 1.

Aravind.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
command[check_users]=/usr/lib/nagios/plugins/check_users -w $ARG1$ -c $ARG2$
command[check_cciss]=/usr/lib/nagios/plugins/check_cciss -N 0 -v
command[check_procs]=/usr/lib/nagios/plugins/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$
command[check_hplog]=/usr/lib/nagios/plugins/check_hplog -t l
command[check_hpasm]=/usr/lib/nagios/plugins/contrib/check_hpasm -t 5
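To illustrate how the $ARGn$ placeholders in the command definitions above are meant to be filled in by the arguments the nagios server sends, here is a minimal Python sketch. It is not NRPE's own code: the function name expand_args is hypothetical, the thresholds and mount point in the usage line are made-up example values, and the handling of missing arguments (replaced by an empty string) is an assumption of this sketch rather than documented NRPE behavior.

```python
import re

# Two of the nrpe.cfg command templates quoted in comment 8.
TEMPLATES = {
    "check_load": "/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$",
    "check_disk": "/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$",
}


def expand_args(template, args):
    """Substitute $ARG1$, $ARG2$, ... with positional arguments (1-indexed).

    Assumption for this sketch: a placeholder with no matching argument
    is replaced by an empty string.
    """
    def substitute(match):
        index = int(match.group(1))
        if 1 <= index <= len(args):
            return args[index - 1]
        return ""

    return re.sub(r"\$ARG(\d+)\$", substitute, template)


if __name__ == "__main__":
    # Thresholds and mount point are illustrative values only.
    print(expand_args(TEMPLATES["check_disk"], ["20%", "10%", "/builds"]))
```

Seen this way, a template with one placeholder too many ends up with a dangling option on the command line, which is why the definitions on each host need to match the ones listed above exactly.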

bhearsum@mozilla.com (:bhearsum)

Comment 9

•

18 years ago

OK. All of the Linux and Mac machines are now set up to respond to check_load, check_disk, check_users, and check_procs as specified in comment #8. The Windows machines will respond to check_load, check_disk, and check_procs. All of these machines respond to check_buildbot. I'm not sure what good warn/crit levels are, so I'll leave that to you.

Here's a list of mountpoints/drives to check_disk on, though (excluding /).

staging-build-console: /builds /cvs /data
build-console: /builds /cvs /data
staging-trunk-automation, production-trunk-automation, fx-linux-1.9-slave1, fx-linux-1.9-slave2: /builds
fx-win32-1.9-slave1, fx-win32-1.9-slave2: C D E

The Windows hosts require a % sign on the first two args of check_disk (the default unit is bytes), so it should be checked like this, e.g. './check_nrpe -H fx-win32-1.9-slave1.build -c check_disk -a 20% 10% C'.

Let me know if you need any more information.
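The mount point list and the check_nrpe example above could be turned into per-host command lines along these lines. This is only a sketch under stated assumptions: the helper names are hypothetical, detecting Windows hosts by "win32" in the host name is a guess that happens to fit the hosts listed here, and the 20%/10% thresholds are taken from the Windows example and reused for Linux purely for illustration, since no warn/crit levels have been agreed in this bug.

```python
# Extra mount points/drives per host, copied from the list in comment 9.
EXTRA_MOUNTS = {
    "staging-build-console": ["/builds", "/cvs", "/data"],
    "build-console": ["/builds", "/cvs", "/data"],
    "staging-trunk-automation": ["/builds"],
    "production-trunk-automation": ["/builds"],
    "fx-linux-1.9-slave1": ["/builds"],
    "fx-linux-1.9-slave2": ["/builds"],
    "fx-win32-1.9-slave1": ["C", "D", "E"],
    "fx-win32-1.9-slave2": ["C", "D", "E"],
}


def is_windows(host):
    # Hypothetical heuristic: the Windows hosts in this bug have "win32" in their name.
    return "win32" in host


def disk_check_commands(host, warn="20%", crit="10%"):
    """Return one check_nrpe argument list per mount point or drive.

    Non-Windows hosts also get the default root ("/") check mentioned in comment 8.
    """
    mounts = list(EXTRA_MOUNTS[host])
    if not is_windows(host):
        mounts.insert(0, "/")
    return [
        ["./check_nrpe", "-H", host + ".build", "-c", "check_disk", "-a", warn, crit, mount]
        for mount in mounts
    ]


if __name__ == "__main__":
    for host in EXTRA_MOUNTS:
        for command in disk_check_commands(host):
            print(" ".join(command))
```

Keeping the thresholds as explicit percentage strings means the same call shape works for the Windows hosts, where the % sign is required.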

Aravind Gottipati [:aravind]

Assignee

Comment 10

•

18 years ago

It looks like check_load on the linux boxes is set to

command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$ -p $ARG3$

Per comment 8, it should be set to

command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$

Please make the correction on the build boxes and restart nrpe. Please make sure that the other checks are defined correctly as well.
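A quick way to double-check the definitions on each box would be to compare the command lines in nrpe.cfg against the ones from comment 8. The following is a minimal Python sketch of that idea; the function name find_mismatches and the sample config text are hypothetical, and it only understands plain command[name]=... lines, not includes or other nrpe.cfg directives.

```python
import re

# Expected definitions for the argument-taking checks, as listed in comment 8.
EXPECTED = {
    "check_load": "/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$",
    "check_disk": "/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$",
    "check_users": "/usr/lib/nagios/plugins/check_users -w $ARG1$ -c $ARG2$",
    "check_procs": "/usr/lib/nagios/plugins/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$",
}

COMMAND_RE = re.compile(r"^command\[(?P<name>[^\]]+)\]=(?P<line>.*)$")


def find_mismatches(cfg_text):
    """Return {check name: actual command line} for definitions that differ from EXPECTED."""
    mismatches = {}
    for raw in cfg_text.splitlines():
        match = COMMAND_RE.match(raw.strip())
        if not match:
            continue  # comments, blank lines and other directives are ignored
        name = match.group("name")
        line = match.group("line").strip()
        expected = EXPECTED.get(name)
        if expected is not None and line != expected:
            mismatches[name] = line
    return mismatches


if __name__ == "__main__":
    # Hypothetical config excerpt reproducing the check_load problem described above.
    sample = (
        "command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$ -p $ARG3$\n"
        "command[check_disk]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$\n"
    )
    for name, line in find_mismatches(sample).items():
        print(name + " differs from comment 8: " + line)
```

Checks that are not in the expected table (check_buildbot, for example) are skipped, since they are called without arguments and their definitions are up to the build team.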

bhearsum@mozilla.com (:bhearsum)

Comment 11

•

18 years ago

I noticed this too, but I thought I corrected them all.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 12

•

18 years ago

As of 7.45pm PST, I'm getting nagios check_load usage errors from the following machines:

staging-trunk-automation.build
production-trunk-automation.build
staging-build-console.build
fx-linux-1.9-slave2.build

Hope that helps....

Aravind Gottipati [:aravind]

Assignee

Comment 13

•

18 years ago

Done.

Status: NEW → RESOLVED

Closed: 18 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → mozilla.org Graveyard

Categories

(mozilla.org Graveyard :: Server Operations, task)

Tracking

(Not tracked)

People

(Reporter: rhelmer, Assigned: aravind)

Security

(public)
