Bug 633365
Opened 14 years ago
Closed 14 years ago
slavealloc: nagios monitoring
Categories
(Infrastructure & Operations :: RelOps: General, task, P3)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: dustin, Assigned: arich)
References
Details
(Whiteboard: [slavealloc])
We should have a nagios check for the slavealloc web UI and for the allocator.
Both are basic HTTP upness checks.
Comment 1 • 14 years ago (Reporter)
Nagios should check that GETs to
http://slavealloc.build.mozilla.org/api/pools
return a 2xx status code. Also, the following checks that are usually applied to masters should be applied here, too:
PING
Swap Space
avg load
root partition
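As a rough illustration of the 2xx requirement above, here is a minimal Python sketch. It is not the Nagios plugin itself; the names is_success and check_url and the 30-second default timeout are assumptions made for this example.

```python
import urllib.error
import urllib.request
from typing import Tuple


def is_success(status: int) -> bool:
    """Return True for any 2xx HTTP status code."""
    return 200 <= status < 300


def check_url(url: str, timeout: float = 30.0) -> Tuple[bool, str]:
    """GET the URL and report whether the final response was 2xx.

    urllib follows redirects on its own and raises HTTPError for
    4xx/5xx responses, so both paths end up with a status code.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 504).
        status = exc.code
    except (urllib.error.URLError, OSError) as exc:
        # No HTTP response at all: DNS failure, refused connection, timeout.
        return False, f"CRITICAL: no response from {url}: {exc}"
    if is_success(status):
        return True, f"OK: HTTP {status}"
    return False, f"CRITICAL: HTTP {status}"
```

Calling check_url("http://slavealloc.build.mozilla.org/api/pools") would return a (bool, message) pair that a wrapper could map to a Nagios exit code.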
Assignee: dustin → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Updated • 14 years ago (Reporter)
Assignee: server-ops-releng → arich
Comment 2 • 14 years ago (Assignee)
I've set up monitoring similar to the buildbot servers and added ping, swap, load, ntp, and / disk space (10%, 5%) monitoring.
Should this server be in a hostgroup for nagios gui purposes?
I still need to do a bit more hacking on our internal nagios generation stuff to get the HTTP check done.
Status: NEW → ASSIGNED
Comment 3 • 14 years ago (Reporter)
(In reply to comment #2)
> I've monitored the disk space monitored similar to the buildbot servers and
> added ping, swap, load, ntp, and / disk space (10%, 5%) monitoring.
Sounds good
> Should this server be in a hostgroup for nagios gui purposes?
No need
> I still need to do a bit more hacking on our internal nagios generation stuff
> to get the HTTP check done.
Cool
Comment 4 • 14 years ago (Assignee)
Finished adding the http check as well. This host should be monitored as you requested at this point.
/etc/nagios/mpt-build/autogen/NAGIOS/Services.pm:
'http_expect' => '
define service{
    use                     generic-service
    host_name               replace_with_host_name
    service_description     replace_with_description
    contact_groups          build
    check_command           check_http_expect!replace_with_args
    notification_period     24x7
}
',
/etc/nagios/checkcommands.cfg:
define command{
    command_name    check_http_expect
    command_line    $USER1$/check_http -I $HOSTADDRESS$ --onredirect=follow -w 30 -c 90 -H $ARG1$ -u $ARG2$ -e $ARG3$
}
/etc/nagios/mpt-build/autogen/hosts.h:
http_expect:::$slave-allocs:slavealloc.build.mozilla.org!http://slavealloc.build.mozilla.org/api/pools!'HTTP/1.1 2'
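To make the wiring between these three files easier to follow, here is a small Python sketch of how the "!"-separated arguments map onto $ARG1$, $ARG2$ and $ARG3$ in the command line. It only imitates Nagios macro substitution. The expand helper is hypothetical, and the $USER1$ and $HOSTADDRESS$ values in the usage note are placeholders, not values from this installation.

```python
import shlex

# Command line copied from the check_http_expect definition above.
COMMAND_LINE = (
    "$USER1$/check_http -I $HOSTADDRESS$ --onredirect=follow "
    "-w 30 -c 90 -H $ARG1$ -u $ARG2$ -e $ARG3$"
)

# Argument string copied from the hosts.h entry above (the part after the
# host group); Nagios separates check_command arguments with "!".
CHECK_ARGS = (
    "slavealloc.build.mozilla.org"
    "!http://slavealloc.build.mozilla.org/api/pools"
    "!'HTTP/1.1 2'"
)


def expand(command_line: str, check_args: str, macros: dict) -> str:
    """Substitute $ARGn$ and the given macros into command_line.

    Hypothetical helper for illustration; it is not how Nagios is
    implemented, only a model of the substitution it performs.
    """
    values = dict(macros)
    for index, arg in enumerate(check_args.split("!"), start=1):
        values[f"$ARG{index}$"] = arg
    for name, value in values.items():
        command_line = command_line.replace(name, value)
    return command_line
```

For example, expand(COMMAND_LINE, CHECK_ARGS, {"$USER1$": "/usr/lib/nagios/plugins", "$HOSTADDRESS$": "192.0.2.10"}) yields a single shell command. Passing the result through shlex.split shows that the single quotes keep HTTP/1.1 2 together as one argument to -e, assuming the command is run through a shell.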
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 5 • 14 years ago (Reporter)
I just verified that this even catches a hung slavealloc:
kill -STOP $SLAVEALLOC_PID
results in
12:04 < nagios> [44] slavealloc.build.scl1:http_expect - slavealloc.build.scl1 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 504 Gateway Time-out
some time later - the gateway timeout is 2m, I believe, and then there's a delay for nagios to figure it out.
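The reason the 504 is reported as CRITICAL can be sketched in a few lines of Python: the check expects the string 'HTTP/1.1 2' in the status line, and a gateway timeout does not contain it. The function name status_line_ok is an assumption for this example, and the substring match is a simplified model of check_http's -e option rather than its actual code.

```python
def status_line_ok(status_line: str, expected: str = "HTTP/1.1 2") -> bool:
    """Return True if the expected string occurs in the HTTP status line.

    Simplified model of an "expect" check: any 2xx response over
    HTTP/1.1 contains "HTTP/1.1 2", while 3xx/4xx/5xx responses do not.
    """
    return expected in status_line
```

With this model, "HTTP/1.1 200 OK" passes and "HTTP/1.1 504 Gateway Time-out" fails, matching the alert shown above.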
Status: RESOLVED → VERIFIED
Updated • 11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations