Closed Bug 799616 Opened 13 years ago Closed 13 years ago

Set up monitoring for imaging servers

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: ashish)

Details

Attachments

(1 file)

We'll need to have NFS checks in place to:
- make sure the imaging service is running (process check, HTTP check)
- make sure rsyslog is running
- make sure in.tftpd is running
That should be *nagios* monitoring.

Updated and more detailed requirements now that we've built one of these hosts. The first host to set this up for is mobile-services.build.scl1.mozilla.com. We'll add the production hosts to the same hostgroups when they're up.

In addition to the usual load, swap, etc., the following checks should occur:

[root@host ~]# /usr/lib64/nagios/plugins/check_procs -c 1:1 --ereg-argument-array rsyslogd
PROCS OK: 1 process with regex args 'rsyslogd'
[root@host ~]# /usr/lib64/nagios/plugins/check_procs -c 1:1 --ereg-argument-array xinetd
PROCS OK: 1 process with regex args 'xinetd'
[root@host ~]# /usr/lib64/nagios/plugins/check_http -H mobile-services.build.scl1.mozilla.com -u /api/board/list/
HTTP OK: HTTP/1.1 200 OK - 341 bytes in 0.345 second response time |time=0.344605s;;;0.000000 size=341B;;;0

The first two will need to be done with NRPE. I'll make sure check_procs_regex is defined on these hosts for that purpose (bug 804654). The check_http invocation can be done remotely.

The root password for mobile-services.build.scl1.mozilla.com is in the releng password storage if you need it.
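For anyone wanting to try the three checks by hand before the Nagios side is wired up, here is a minimal sketch of a wrapper. The function names `status_name` and `run_imaging_checks` and the `PLUGIN_DIR` variable are made up for illustration; only the plugin path and arguments come from the comment above. It assumes the usual Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, anything else UNKNOWN).

```shell
#!/bin/sh
# Sketch only: status_name, run_imaging_checks and PLUGIN_DIR are
# illustrative names, not part of Nagios or of this bug's patch.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/lib64/nagios/plugins}"

# Map a plugin exit code to its conventional Nagios status name.
status_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Run the three checks from the comment against host $1, printing one
# line per check. Returns 0 only if every check exited 0.
run_imaging_checks() {
    host="$1"
    failed=0
    for spec in \
        "check_procs -c 1:1 --ereg-argument-array rsyslogd" \
        "check_procs -c 1:1 --ereg-argument-array xinetd" \
        "check_http -H $host -u /api/board/list/"
    do
        # Word splitting of $spec is intentional: each spec is a plugin
        # name followed by its arguments.
        output=$("$PLUGIN_DIR"/$spec 2>&1)
        rc=$?
        echo "$(status_name "$rc"): $output"
        [ "$rc" -eq 0 ] || failed=1
    done
    return "$failed"
}
```

On one of the imaging hosts this could be invoked as `run_imaging_checks mobile-services.build.scl1.mozilla.com`. A missing plugin shows up as UNKNOWN rather than aborting the run.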
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → shyam
No longer blocks: 764534
Assignee: dustin → server-ops
Updated list of servers:

mobile-services.build.scl1.mozilla.com
mobile-imaging-001.p1.releng.scl1.mozilla.com
mobile-imaging-002.p2.releng.scl1.mozilla.com
mobile-imaging-003.p3.releng.scl1.mozilla.com
mobile-imaging-004.p4.releng.scl1.mozilla.com
mobile-imaging-005.p5.releng.scl1.mozilla.com

Any ETA on when this will be set up? Thanks...
A week at most.
Assignee: server-ops → ashish
Just checking back in. It's been a week since the last update and two since the bug was opened.
And I hit submit too soon. Please also add:

mobile-imaging-006.p6.releng.scl1.mozilla.com
mobile-imaging-007.p7.releng.scl1.mozilla.com
mobile-imaging-008.p8.releng.scl1.mozilla.com
mobile-imaging-009.p9.releng.scl1.mozilla.com
mobile-imaging-010.p10.releng.scl1.mozilla.com

They will be installed as soon as dcops has configured IPMI in bug 800077.
Load, swap, disk and procs checks added to mobile-imaging servers: https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=mobile-imaging&style=detail These hosts need check_swap to be defined. I'll add servers 006-010 and the swap check once defined.
Status: NEW → ASSIGNED
The http check is returning 404 on mobile-services.build.scl1 and 303 on the mobile-imaging servers:

[ashish@nagios1.private.releng.scl3 archives]$ /usr/lib64/nagios/plugins/check_http -H mobile-services.build.scl1.mozilla.com -u /api/board/list
HTTP WARNING: HTTP/1.1 404 Not Found - 188 bytes in 0.009 second response time |time=0.009307s;;;0.000000 size=188B;;;0
[ashish@nagios1.private.releng.scl3 archives]$ /usr/lib64/nagios/plugins/check_http -H mobile-imaging-005.p5.releng.scl1.mozilla.com -u /api/board/list
HTTP OK: HTTP/1.1 303 See Other - 240 bytes in 0.010 second response time |time=0.009734s;;;0.000000 size=240B;;;0

Has the URI changed?
Ah, thanks -- mobile-services is in use for development of the next version, so things have changed there. Let's downtime that one for a week. The URL has changed, so it's an easy fix when everything's upgraded to the new version. For the others, the trailing slash on the URL is required (/api/board/list/)
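Note that the mobile-imaging check above reported "HTTP OK" even though the server answered 303, so the missing trailing slash would not have raised an alert by itself. As a rough sketch of a stricter test, the helpers below pull the HTTP status code out of a check_http status line and accept only 200. The names `http_code_from_output` and `require_http_200` are hypothetical, and the pattern assumes the output format shown in this bug.

```shell
#!/bin/sh
# Sketch only: http_code_from_output and require_http_200 are illustrative
# helper names, not part of Nagios or of the patch in this bug.

# Print the three-digit HTTP status code found in a check_http status line,
# or nothing if the line does not have the expected shape.
http_code_from_output() {
    printf '%s\n' "$1" |
        sed -n 's/^HTTP [A-Z]*: HTTP\/[0-9.]* \([0-9][0-9][0-9]\) .*/\1/p'
}

# Succeed only when the status line reports HTTP 200.
require_http_200() {
    [ "$(http_code_from_output "$1")" = "200" ]
}
```

Fed the 303 line from the previous comment, `require_http_200` fails; fed the 200 line from the original requirements, it succeeds. Many check_http builds also have an `--onredirect` option that can make redirects non-OK directly; check `check_http --help` on the Nagios host before relying on it.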
dustin: in puppet we also need to add/define the ntp check to the machines as well as the swap check.
Attached patch bug799616.patch
This introduces the required NRPE checks. I tried to divide them roughly by which service needs them - BMM or mozpool, or both.
Assignee: ashish → dustin
Attachment #680065 - Flags: review?(rail)
Comment on attachment 680065 (bug799616.patch): sorry, we need ganglia too
Attachment #680065 - Attachment is obsolete: true
Attachment #680065 - Flags: review?(rail)
Comment on attachment 680065 (bug799616.patch): never mind, no ganglia servers in these VLANs, so the patch is reviewable as-is
Attachment #680065 - Attachment is obsolete: false
Attachment #680065 - Flags: review?(rail)
The ntp and swap alerts have been ack'd on mobile-imaging servers: https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=mobile-imaging&style=detail Need to work on the http check.
Thanks, once those get reviewed they should go green.
Attachment #680065 - Attachment is patch: true
Attachment #680065 - Flags: review?(rail) → review+
Landed and should be active in 30m or so. Back to ashish to verify and fix up HTTP checks if still required.
Assignee: dustin → ashish
Added and downtimed http alerts for mobile-imaging servers.
Ditto for:
<nagios-releng> ashish: Downtime for mobile-services.build.scl1.mozilla.com:http scheduled for 10 days, 0:00:00
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard