Each of the newly set up machines & VMs used for staging & production release automation should be monitored.
This is not blocking us, but I'm filing a tracking bug so we don't forget.
What sort of monitoring did you have in mind? If it's tree monitoring, then the production machines will tend to come and go quite a lot and end up producing spam.
(In reply to comment #2)
> What sort of monitoring did you have in mind ? If it's tree monitoring, then
> the production machines will tend to come and go quite alot and end up
> producing spam.

Monitoring the machines directly. Disk space, memory, whether important processes are running, log monitoring, etc.
Aravind and I discussed this; I am going to go ahead and set up the Nagios remote plugin execution service (NRPE), and accept connections from the monitoring server. I'll install, configure, and document the setup of the plugins on the build machine side, and someone on the IT side can configure the server to poll these values and notify us when a problem is found.
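
For reference, a check that NRPE runs on the build machine side is just a program that prints one status line and exits with a Nagios status code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Below is a minimal sketch of a disk space check, assuming Python is available on the machine. The function names and thresholds are hypothetical; this is not the stock check_disk plugin.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free_space(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to (exit_code, label).

    The thresholds are illustrative defaults, not values from this bug.
    """
    if free_pct < crit_pct:
        return CRITICAL, "CRITICAL"
    if free_pct < warn_pct:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/"):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - zero-sized filesystem at {path}"
    free_pct = 100.0 * usage.free / usage.total
    code, label = classify_free_space(free_pct)
    return code, f"DISK {label} - {free_pct:.1f}% free on {path}"


def main(argv):
    """Print the status line and return the exit code for the monitoring server."""
    code, message = check_disk(argv[1] if len(argv) > 1 else "/")
    print(message)
    return code
```

To use it as a standalone plugin, the script would end with `sys.exit(main(sys.argv))`, so that the exit code is what the monitoring server sees when it polls.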
I played a bit with nrpe setup on the staging environment. Looks pretty straightforward to monitor the following cross-platform:
* free disk space
* load average (little different on windows)
* process check (by name, number of processes, zombie procs, etc)
* free memory

This would be great to start with. The default log check is pretty simplistic (can search for a "bad" query), it'd be nice to search for known-good queries and report on all others (e.g. http://logcheck.org/).
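
The known-good idea can be sketched roughly as follows: keep a list of patterns for lines that are expected, and report every line that matches none of them. This is only an illustration of the approach, not logcheck itself; the patterns and sample log lines are made up, and a real list would have to be built from the actual logs on each build machine.

```python
import re

# Hypothetical patterns for lines considered normal.
KNOWN_GOOD = [
    re.compile(r"^\S+ sshd\[\d+\]: Accepted publickey for \w+"),
    re.compile(r"^\S+ CRON\[\d+\]: "),
]


def unexpected_lines(lines, known_good=KNOWN_GOOD):
    """Return every non-blank line that matches none of the known-good patterns."""
    return [
        line
        for line in lines
        if line.strip() and not any(p.search(line) for p in known_good)
    ]


# Made-up sample input: two expected lines and one that should be reported.
sample = [
    "host1 sshd[123]: Accepted publickey for build",
    "host1 CRON[456]: (root) CMD (run-parts /etc/cron.hourly)",
    "host1 kernel: Out of memory: Kill process 789",
]

for line in unexpected_lines(sample):
    print(line)
```

With the sample above, only the kernel line is printed. The trade-off is that an incomplete pattern list produces noise at first, but unknown failures are not silently missed the way they are when searching only for known-bad strings.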
This is starting to bug me again :) Let's start by getting the staging environments going.
At this point, I've got all of the Windows machines and all of the Linux machines (sans staging-prometheus-vm) set up with NRPE running. staging-prometheus-vm won't let me install the nrpe RPMs ('rpm' hangs). I think a reboot will fix this problem, but I haven't found a decent time to do it yet. I've been told justdave has a nice little script to help with nrpe on OS X. We'll see how that goes.
As mentioned in bug 410019, Mac nrpe is up and running now.
This bug is done now. If we want to add NRPE to other build machines, let's file a new bug to track that. I've filed bug 412443 about getting the NRPE daemon into the ref platform.