Closed
Bug 799616
Opened 13 years ago
Closed 13 years ago
Set up monitoring for imaging servers
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: ashish)
Details
Attachments
(1 file)
1.28 KB, patch | Callek: review+
We'll need to have NFS checks in place to:
- make sure the imaging service is running (process check, HTTP check)
- make sure rsyslog is running
- make sure in.tftpd is running
Comment 1•13 years ago (Reporter)
That should be *nagios* monitoring. Here are updated, more detailed requirements now that we've built one of these hosts.
The first host to set this up for is mobile-services.build.scl1.mozilla.com. We'll add the production hosts to the same hostgroups when they're up.
In addition to the usual load, swap, etc., the following checks should occur:
[root@host ~]# /usr/lib64/nagios/plugins/check_procs -c 1:1 --ereg-argument-array rsyslogd
PROCS OK: 1 process with regex args 'rsyslogd'
[root@host ~]# /usr/lib64/nagios/plugins/check_procs -c 1:1 --ereg-argument-array xinetd
PROCS OK: 1 process with regex args 'xinetd'
[root@host ~]# /usr/lib64/nagios/plugins/check_http -H mobile-services.build.scl1.mozilla.com -u /api/board/list/
HTTP OK: HTTP/1.1 200 OK - 341 bytes in 0.345 second response time |time=0.344605s;;;0.000000 size=341B;;;0
The first two will need to be done with NRPE. I'll make sure check_procs_regex is defined on these hosts for that purpose (bug 804654).
The check_http invocation can be done remotely.
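For anyone unfamiliar with the `-c 1:1` argument in the check_procs calls above: Nagios plugin thresholds are ranges, and `1:1` means the check goes critical unless exactly one matching process is found. The sketch below mimics that rule in plain shell. `range_state` is a hypothetical helper written for illustration only; it is not part of the Nagios plugins and handles just the simple `min:max` form.

```shell
#!/bin/bash
# Hypothetical helper (not a Nagios plugin): mimic a simple "min:max"
# critical range. Prints OK when count is inside [min, max], else CRITICAL.
range_state() {
    local count=$1 range=$2
    local min=${range%%:*}   # text before the colon
    local max=${range##*:}   # text after the colon
    if [ "$count" -ge "$min" ] && [ "$count" -le "$max" ]; then
        echo "OK"
    else
        echo "CRITICAL"
    fi
}

range_state 1 1:1   # exactly one rsyslogd process
range_state 0 1:1   # daemon not running
range_state 2 1:1   # an unexpected second copy
```

So a stopped daemon and a duplicated daemon are both reported as critical, which is the intent of the rsyslogd and xinetd checks.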
The root password for mobile-services.build.scl1.mozilla.com is in the releng password storage if you need it.
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → shyam
Updated•13 years ago (Reporter)
Assignee: dustin → server-ops
Comment 2•13 years ago
Updated list of servers:
mobile-services.build.scl1.mozilla.com
mobile-imaging-001.p1.releng.scl1.mozilla.com
mobile-imaging-002.p2.releng.scl1.mozilla.com
mobile-imaging-003.p3.releng.scl1.mozilla.com
mobile-imaging-004.p4.releng.scl1.mozilla.com
mobile-imaging-005.p5.releng.scl1.mozilla.com
Any ETA on when this will be set up? Thanks...
Comment 4•13 years ago
Just checking back in. It's been a week since the last update and two since the bug was opened.
Comment 5•13 years ago
And I hit submit too soon. Please also add:
mobile-imaging-006.p6.releng.scl1.mozilla.com
mobile-imaging-007.p7.releng.scl1.mozilla.com
mobile-imaging-008.p8.releng.scl1.mozilla.com
mobile-imaging-009.p9.releng.scl1.mozilla.com
mobile-imaging-010.p10.releng.scl1.mozilla.com
They will be installed as soon as dcops has configured IPMI in bug 800077.
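Since the imaging hostnames in the two lists follow one pattern (a zero-padded three-digit number plus a matching `pN` label), the full set can be generated rather than typed out when adding hosts to a hostgroup. A minimal sketch, assuming the pattern holds for all ten servers; `imaging_hosts` is a made-up helper name:

```shell
#!/bin/bash
# Hypothetical helper: print mobile-imaging hostnames for a numeric range,
# following the naming pattern seen in the lists in this bug.
# Pass plain integers (6, not 006) so bash does not parse them as octal.
imaging_hosts() {
    local first=$1 last=$2 n
    for (( n = first; n <= last; n++ )); do
        printf 'mobile-imaging-%03d.p%d.releng.scl1.mozilla.com\n' "$n" "$n"
    done
}

imaging_hosts 1 10
```

Calling `imaging_hosts 6 10` would list only the servers added in this comment.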
Comment 6•13 years ago (Assignee)
Load, swap, disk and procs checks added to mobile-imaging servers:
https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=mobile-imaging&style=detail
These hosts need check_swap to be defined. I'll add servers 006-010 and the swap check once it is defined.
Status: NEW → ASSIGNED
Comment 7•13 years ago (Assignee)
The http check is returning 404 on mobile-services.build.scl1 and 303 on the mobile-imaging servers:
[ashish@nagios1.private.releng.scl3 archives]$ /usr/lib64/nagios/plugins/check_http -H mobile-services.build.scl1.mozilla.com -u /api/board/list
HTTP WARNING: HTTP/1.1 404 Not Found - 188 bytes in 0.009 second response time |time=0.009307s;;;0.000000 size=188B;;;0
[ashish@nagios1.private.releng.scl3 archives]$ /usr/lib64/nagios/plugins/check_http -H mobile-imaging-005.p5.releng.scl1.mozilla.com -u /api/board/list
HTTP OK: HTTP/1.1 303 See Other - 240 bytes in 0.010 second response time |time=0.009734s;;;0.000000 size=240B;;;0
Has the uri changed?
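The two results above differ in severity because of how check_http maps status codes to Nagios states when run with no extra options. To the best of my understanding, 2xx is OK, 3xx is also OK unless `-f`/`--onredirect` says otherwise, 4xx is WARNING, and 5xx is CRITICAL. A rough shell sketch of that default mapping, where `http_state` is an illustrative name and not a plugin function:

```shell
#!/bin/bash
# Illustrative only: approximate check_http's default mapping from an
# HTTP status code to a Nagios state (assumes no -e or -f options).
http_state() {
    local code=$1
    if [ "$code" -lt 100 ] || [ "$code" -ge 600 ]; then
        echo "CRITICAL"          # not a valid HTTP status
    elif [ "$code" -ge 500 ]; then
        echo "CRITICAL"          # server errors
    elif [ "$code" -ge 400 ]; then
        echo "WARNING"           # client errors such as 404
    else
        echo "OK"                # 2xx, and 3xx with default redirect handling
    fi
}

http_state 404   # mobile-services result above
http_state 303   # mobile-imaging result above
```

This is why the 303 shows up as "HTTP OK" even though the request did not reach the intended page.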
Comment 8•13 years ago (Reporter)
Ah, thanks -- mobile-services is in use for development of the next version, so things have changed there. Let's downtime that one for a week. The URL has changed, so it's an easy fix when everything's upgraded to the new version.
For the others, the trailing slash on the URL is required (/api/board/list/).
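One way to avoid this class of mistake when building check commands is to normalize the URI before handing it to check_http. A small sketch, assuming the service keeps requiring the trailing slash; `with_trailing_slash` is a hypothetical helper, and the script only prints the command instead of running it:

```shell
#!/bin/bash
# Hypothetical helper: append a trailing slash to a URI path if missing.
with_trailing_slash() {
    case $1 in
        */) printf '%s\n' "$1" ;;    # already ends in a slash
        *)  printf '%s/\n' "$1" ;;   # add the missing slash
    esac
}

uri=$(with_trailing_slash /api/board/list)
# Print (rather than execute) the resulting check for one imaging server.
echo "/usr/lib64/nagios/plugins/check_http -H mobile-imaging-005.p5.releng.scl1.mozilla.com -u $uri"
```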
Comment 9•13 years ago
dustin: in puppet we also need to define the ntp check on these machines, as well as the swap check.
Comment 10•13 years ago (Reporter)
This introduces the required NRPE checks. I tried to divide them roughly by which service needs them - BMM or mozpool, or both.
Assignee: ashish → dustin
Attachment #680065 - Flags: review?(rail)
Comment 11•13 years ago (Reporter)
Attachment #680065 - Attachment is obsolete: true
Attachment #680065 - Flags: review?(rail)
Comment 12•13 years ago (Reporter)
Comment on attachment 680065 [details] [diff] [review]
bug799616.patch
never mind, no ganglia servers in these VLANs, so the patch is reviewable as-is
Attachment #680065 - Attachment is obsolete: false
Attachment #680065 - Flags: review?(rail)
Comment 13•13 years ago (Assignee)
The ntp and swap alerts have been ack'd on mobile-imaging servers:
https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=mobile-imaging&style=detail
Need to work on the http check.
Comment 14•13 years ago (Reporter)
Thanks, once those get reviewed they should go green.
Updated•13 years ago
Attachment #680065 - Attachment is patch: true
Attachment #680065 - Flags: review?(rail) → review+
Comment 15•13 years ago (Reporter)
Landed and should be active in 30m or so. Back to ashish to verify and fix up HTTP checks if still required.
Assignee: dustin → ashish
Comment 16•13 years ago (Assignee)
Added and downtimed http alerts for mobile-imaging servers.
Comment 17•13 years ago (Assignee)
Ditto for mobile-services:
<nagios-releng> ashish: Downtime for mobile-services.build.scl1.mozilla.com:http scheduled for 10 days, 0:00:00
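The downtime here was scheduled through the nagios-releng IRC bot. For reference, an equivalent request can be expressed as a Nagios external command (`SCHEDULE_SVC_DOWNTIME`). The sketch below only builds and prints that command line; `downtime_cmd` is a hypothetical helper, the author and comment values are placeholders, and writing the line to the Nagios command file is left out because that path is site-specific.

```shell
#!/bin/bash
# Hypothetical helper: build a SCHEDULE_SVC_DOWNTIME external command line
# for a fixed downtime of the given number of days. Field order:
# host;service;start;end;fixed;trigger_id;duration;author;comment
downtime_cmd() {
    local host=$1 service=$2 start=$3 days=$4 author=$5 comment=$6
    local duration=$(( days * 86400 ))   # seconds
    local end=$(( start + duration ))
    printf '[%d] SCHEDULE_SVC_DOWNTIME;%s;%s;%d;%d;1;0;%d;%s;%s\n' \
        "$start" "$host" "$service" "$start" "$end" "$duration" "$author" "$comment"
}

# Ten days of downtime for the http service, starting now.
downtime_cmd mobile-services.build.scl1.mozilla.com http "$(date +%s)" 10 ashish "bug 799616"
```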
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → mozilla.org Graveyard