Set up Nagios monitoring on release automation machines

RESOLVED FIXED

Status

Release Engineering
General
P2
normal
10 years ago
4 years ago

People

(Reporter: joduinn, Assigned: bhearsum)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: waiting on bug 410019)

Each of the newly set up machines & VMs used for staging & production release automation should be monitored.
This is not blocking us, but I'm filing a tracking bug so we don't forget.
Priority: -- → P3
What sort of monitoring did you have in mind? If it's tree monitoring, then the production machines will tend to come and go quite a lot and end up producing spam.
(In reply to comment #2)
> What sort of monitoring did you have in mind? If it's tree monitoring, then
> the production machines will tend to come and go quite a lot and end up
> producing spam.

Monitoring the machines directly. Disk space, memory, whether important processes are running, log monitoring, etc.
Aravind and I discussed this; I am going to go ahead and set up the Nagios Remote Plugin Executor (NRPE) service, and accept connections from the monitoring server.

I'll install, configure, and document the setup of the plugins on the build machine side, and someone on the IT side can configure the server to poll these values and notify us when a problem is found.
Assignee: build → rhelmer
Status: NEW → ASSIGNED
Whiteboard: ETA August 29
I played a bit with NRPE setup on the staging environment. It looks pretty straightforward to monitor the following cross-platform:

* free disk space
* load average (a little different on Windows)
* process check (by name, number of processes, zombie procs, etc)
* free memory

This would be great to start with.
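
As a rough illustration of the free disk space check above, here is a minimal Python sketch of a Nagios-style plugin. It is not the plugin used on these machines. It assumes only the standard Nagios plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, plus a one-line status message). The function names and the 20%/10% thresholds are hypothetical.

```python
import shutil
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free_space(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to (exit_code, label).

    The default thresholds are illustrative assumptions, not values from this bug.
    """
    if free_pct < crit_pct:
        return CRITICAL, "CRITICAL"
    if free_pct < warn_pct:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/", warn_pct=20.0, crit_pct=10.0):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    free_pct = 100.0 * usage.free / usage.total
    code, label = classify_free_space(free_pct, warn_pct, crit_pct)
    return code, f"DISK {label} - {free_pct:.1f}% free on {path}"


def main(argv):
    """Print the status line and return the exit code for the given path."""
    path = argv[1] if len(argv) > 1 else "/"
    code, line = check_disk(path)
    print(line)
    return code
```

When run as a script, the plugin would end with `sys.exit(main(sys.argv))` so that the caller sees the status through the exit code.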

The default log check is pretty simplistic (it can only search for a "bad" query); it'd be nice to search for known-good queries and report on all others (e.g. http://logcheck.org/).
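
The "known-good" idea can be sketched in a few lines of Python: keep a list of patterns for lines that are expected, and report every non-blank line that matches none of them. This illustrates the approach only; it is not logcheck's implementation. The patterns and sample log lines are made up for the example.

```python
import re

# Hypothetical known-good patterns; real ones would be derived from the actual logs.
KNOWN_GOOD = [
    re.compile(r"^INFO\b"),
    re.compile(r"build finished: success$"),
]


def unexpected_lines(lines, known_good=KNOWN_GOOD):
    """Return the non-blank lines that match none of the known-good patterns."""
    flagged = []
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue  # ignore blank lines
        if not any(pattern.search(text) for pattern in known_good):
            flagged.append(text)
    return flagged


# Made-up sample input for illustration.
sample = [
    "INFO slave connected\n",
    "step 12 build finished: success\n",
    "\n",
    "Traceback (most recent call last):\n",
]
flagged = unexpected_lines(sample)
```

Unlike a "bad" query, this approach also surfaces failure modes nobody anticipated, at the cost of having to maintain the known-good list as the logs change.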
Whiteboard: ETA August 29 → ETA Sept 28
Assignee: rhelmer → build
Status: ASSIGNED → NEW
Whiteboard: ETA Sept 28
Assignee: build → nobody
QA Contact: mozpreed → build
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
This is starting to bug me again :)
Let's start by getting the staging environments going.
Depends on: 410019
Whiteboard: waiting on bug 410019
(Assignee)

Updated

10 years ago
Assignee: rhelmer → bhearsum
Status: ASSIGNED → NEW
(Assignee)

Comment 7

10 years ago
At this point, I've got all of the Windows machines and all of the Linux machines (sans staging-prometheus-vm) set up with NRPE running. staging-prometheus-vm won't let me install the NRPE RPMs ('rpm' hangs). I think a reboot will fix this problem, but I haven't found a decent time to do it yet.

I've been told justdave has a nice little script to help with nrpe on OS X. We'll see how that goes.
(Assignee)

Comment 8

10 years ago
As mentioned in bug 410019, Mac nrpe is up and running now.
(Assignee)

Updated

10 years ago
Priority: P3 → P2
(Assignee)

Comment 9

10 years ago
This bug is done now. If we want to add NRPE to other build machines, let's file a new bug to track that.

I've filed bug 412443 about getting the NRPE daemon into the ref platform.
(Assignee)

Updated

10 years ago
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering