Closed Bug 633365 Opened 14 years ago Closed 14 years ago

slavealloc: nagios monitoring

Categories

(Infrastructure & Operations :: RelOps: General, task, P3)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: dustin, Assigned: arich)

References

Details

(Whiteboard: [slavealloc])

We should have a nagios check for the slavealloc web UI and for the allocator. Both are basic HTTP upness checks.
Nagios should check that GETs to http://slavealloc.build.mozilla.org/api/pools return a 2xx status code. Also, the following checks that are usually applied to masters should be applied here, too: PING, Swap Space, avg load, and root partition.
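For illustration, the requested "GET returns 2xx" check could be sketched in Python roughly as follows. This is a standalone sketch, not the Nagios plugin: the function names `is_2xx` and `check_pools` and the 30-second default timeout are assumptions made for the example; only the URL comes from this bug.

```python
import urllib.error
import urllib.request

# URL taken from this bug; everything else here is illustrative.
POOLS_URL = "http://slavealloc.build.mozilla.org/api/pools"


def is_2xx(status):
    """Return True if an HTTP status code is in the 2xx (success) range."""
    return 200 <= status < 300


def check_pools(url=POOLS_URL, timeout=30.0):
    """GET the URL and return (ok, detail).

    urlopen follows redirects and raises HTTPError for 4xx/5xx responses,
    so both the normal and the error path are reported here.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_2xx(resp.status), f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 504).
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout, and similar.
        return False, f"no response: {err}"
```

Calling `check_pools()` from a host that can reach the allocator would return a pair such as `(True, "HTTP 200")` when the API is up.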
Assignee: dustin → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Assignee: server-ops-releng → arich
I've set up disk space monitoring similar to the buildbot servers and added ping, swap, load, ntp, and / disk space (10%, 5%) monitoring. Should this server be in a hostgroup for nagios gui purposes? I still need to do a bit more hacking on our internal nagios generation stuff to get the HTTP check done.
Status: NEW → ASSIGNED
(In reply to comment #2)
> I've set up disk space monitoring similar to the buildbot servers and added
> ping, swap, load, ntp, and / disk space (10%, 5%) monitoring.

Sounds good

> Should this server be in a hostgroup for nagios gui purposes?

No need

> I still need to do a bit more hacking on our internal nagios generation stuff
> to get the HTTP check done.

Cool
Finished adding the http check as well. This host should be monitored as you requested at this point.

/etc/nagios/mpt-build/autogen/NAGIOS/Services.pm:

    'http_expect' => '
    define service{
        use                     generic-service
        host_name               replace_with_host_name
        service_description     replace_with_description
        contact_groups          build
        check_command           check_http_expect!replace_with_args
        notification_period     24x7
    }
    ',

/etc/nagios/checkcommands.cfg:

    define command{
        command_name    check_http_expect
        command_line    $USER1$/check_http -I $HOSTADDRESS$ --onredirect=follow -w 30 -c 90 -H $ARG1$ -u $ARG2$ -e $ARG3$
    }

/etc/nagios/mpt-build/autogen/hosts.h:

    http_expect:::$slave-allocs:slavealloc.build.mozilla.org!http://slavealloc.build.mozilla.org/api/pools!'HTTP/1.1 2'
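To make the effect of the `-e 'HTTP/1.1 2'` argument concrete: check_http looks for the expected string in the first (status) line of the response, so any `HTTP/1.1 2xx` status line passes and anything else is CRITICAL. The Python below is a simplified model of that decision, not the plugin's actual code. The function name `classify` is hypothetical, and treating `-w 30 -c 90` as "alert when the response time exceeds the threshold" is an assumption about the threshold semantics.

```python
def classify(status_line, elapsed_seconds,
             expect="HTTP/1.1 2", warn=30.0, crit=90.0):
    """Simplified model of the http_expect check's outcome.

    status_line     -- first line of the HTTP response
    elapsed_seconds -- how long the response took
    expect          -- comma-delimited strings; at least one must appear
                       in the status line (mirrors check_http -e)
    warn, crit      -- response-time thresholds (mirrors -w 30 -c 90)
    """
    expected = [s for s in expect.split(",") if s]
    if not any(s in status_line for s in expected):
        # e.g. "HTTP/1.1 504 Gateway Time-out" does not contain "HTTP/1.1 2"
        return "CRITICAL"
    if elapsed_seconds > crit:
        return "CRITICAL"
    if elapsed_seconds > warn:
        return "WARNING"
    return "OK"
```

For example, `classify("HTTP/1.1 200 OK", 0.2)` yields `"OK"`, while a `504 Gateway Time-out` status line yields `"CRITICAL"` regardless of timing.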
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
I just verified that this even catches a hung slavealloc:

    kill -STOP $SLAVEALLOC_PID

results in

    12:04 < nagios> [44] slavealloc.build.scl1:http_expect - slavealloc.build.scl1 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 504 Gateway Time-out

some time later - the gateway timeout is 2m, I believe, and then there's a delay for nagios to figure it out.
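A hang test like this can be made a little safer by guaranteeing the process is resumed afterwards. The sketch below is one possible way to do that in Python on a POSIX system; the helper name `suspended` is hypothetical, and it assumes the caller already knows the PID and has permission to signal it.

```python
import os
import signal
from contextlib import contextmanager


@contextmanager
def suspended(pid):
    """Pause process `pid` with SIGSTOP and always resume it with SIGCONT.

    While paused, the process keeps its sockets open but answers nothing,
    which is what makes it look "hung" to an HTTP check.
    """
    os.kill(pid, signal.SIGSTOP)
    try:
        yield
    finally:
        # Resume even if the body raises, so the service is not left frozen.
        os.kill(pid, signal.SIGCONT)
```

Inside a `with suspended(pid):` block one would wait long enough for the gateway timeout plus the Nagios check interval, then confirm that the alert fired.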
Status: RESOLVED → VERIFIED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations