Closed Bug 419506 Opened 17 years ago Closed 17 years ago

add nagios to all unittest machines

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rcampbell, Unassigned)

We should be running Nagios on all of these machines (qm-rhel02, qm-centos5-01, etc.). For a full list, see:

http://wiki.mozilla.org/Buildbot/IT_Unittest_Support_Document

and,

http://wiki.mozilla.org/Buildbot/Talos/Machines
Also see:

http://wiki.mozilla.org/Build:Nagios:Win32

Setting up Nagios on Mac and Linux is quite a bit easier. bhearsum, do we have docs for those too?

Not sure if our configs are in CVS or not, but we should make those available.

Should we get the Nagios client into the ref platform, so we don't need to do this after the fact in the future?
That'd be ideal. Thanks for the pointers!
I'm collecting stock nagios configs for all the platforms. I'll be putting a patch in bug 412816 shortly.
With regard to Linux docs, I'll be adding the nrpe daemon to the ref platform today; docs for installing it will go on the ref platform page.

Our Macs come with Nagios installed (afaik), so I don't have any plans to write docs there. All of the ones I've set up have literally been "drop in nrpe.cfg".
I'm setting up nagios on the rest of the build machines right now, do you want me to do these, too? You may have to hook me up with passwords, I don't think I know them (anymore).

With regard to Talos machines, I think we should be careful, as it may affect the numbers. If there's a machine on each platform I can install it on as a test, I'd be happy to do so.
(In reply to comment #5)
> I'm setting up nagios on the rest of the build machines right now, do you want
> me to do these, too? You may have to hook me up with passwords, I don't think I
> know them (anymore).
Ben, yes, it would be great if you could do that too. I'll send you the usernames/passwords that I know of offline, but I think I only have some of them.


> With regard to Talos machines, I think we should be careful as it may affect
> the numbers. If there's a machine on each platform I can install it on as a
> test I'd be happy to do so.
Excellent point. The buildbot masters would be good either way, but Alice would know best about whether we should touch the talos slave machines... 
Rob's getting me a list of these machines and the passwords; I'll do this.
Assignee: build → bhearsum
Status: NEW → ASSIGNED
Priority: -- → P2
To be clear, I'm going to be adding Nagios to the following machines:
qm-rhel02
qm-centos5-01
qm-xserve01
qm-win2k3-01
Alright, Nagios is on them.
Depends on: 420875
Anything left to do here or can we mark as FIXED?
Component: Testing → Release Engineering
Product: Core → mozilla.org
Version: Trunk → other
QA Contact: testing → release
If we still want Nagios on Talos machines, then sure. I'm not sure how viable this is going to be. We haven't even tested to see if/how much nagios is going to affect the numbers.
I'm inclined to leave nagios off the slaves for the reason Ben states. We can add it later if the need arises after we've tested it on the staging machines.
Alright, this is done then.
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Summary: add nagios to all unittest / talos machines → add nagios to all unittest machines
The following emails have been getting sent to build@moco for the last few days, since this was enabled. Reopening bug to track.

Subject: [build] ** PROBLEM alert - qm-xserve01/buildbot is WARNING **
Date: Tue, 18 Mar 2008 03:12:08 -0700 (PDT)
From: nagios@dm-nagios01.mozilla.org (nagios)
To: build@mozilla.org

***** Nagios  *****

Notification Type: PROBLEM

Service: buildbot
Host: qm-xserve01
Address: 10.2.73.11
State: WARNING

Date/Time: 03-18-2008 03:12:08

Additional Info:

PROCS WARNING: 0 processes with args /tools/buildbot/bin/buildbot
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The error leads me to believe that the test is checking for a process with /tools/buildbot/bin/buildbot in the name.

Buildbot is running, but it's installed in MacPython, so the process has the name:

/Library/Frameworks/Python.framework/Versions/Current/bin/buildbot

This could be the issue, but I don't know enough about nagios to tell one way or another.
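To illustrate the suspected mismatch: the -a option of check_procs counts processes whose argument string contains the given text. The Python sketch below imitates that with a plain substring test. The process list and the /builds/slave path are made up for illustration, and this is not how the plugin itself is implemented.

```python
def count_matching(command_lines, arg_pattern):
    """Count command lines whose text contains arg_pattern (substring match)."""
    return sum(1 for cmd in command_lines if arg_pattern in cmd)


# Hypothetical process table; only the MacPython buildbot path comes from this bug.
processes = [
    "python /Library/Frameworks/Python.framework/Versions/Current/bin/buildbot start /builds/slave",
    "/usr/sbin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d",
]

# The path named in the alert never appears, so the count is zero.
missing = count_matching(processes, "/tools/buildbot/bin/buildbot")

# A shorter pattern matches the running Buildbot process.
found = count_matching(processes, "buildbot")

print(missing, found)
```

If the real process list resembles the hypothetical one, a check keyed on /tools/buildbot/bin/buildbot would report zero processes even though Buildbot is running.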
Another quick observation: one of the tests being run clicks on a webcal: URL, which launches iCal.

The URL it's trying to open is:

webcal://127.0.0.1/rheeeeet.html
Sorry, I totally forgot about this problem. I'll have a look at it today.
Alright, the Buildbot monitor is fixed. I had to change this:
command[check_buildbot]=/usr/local/nagios/plugins/check_procs -w 1:1 -a /Library/Frameworks/Python.framework/Versions/Current/bin/buildbot
to this:
command[check_buildbot]=/usr/local/nagios/plugins/check_procs -w 1:1 -a buildbot

For some reason the full path to Buildbot doesn't work on OS X. It works fine on the Build machines, not sure why it doesn't here.
I'm not sure about Calendar opening up -- maybe one of the tests is supposed to do that?
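For anyone reading the check_buildbot lines above: -w 1:1 is a warning range. As I understand the Nagios plugin range syntax, the check warns when the process count falls outside 1..1, which would explain why zero matching processes produced a WARNING. Below is a simplified Python sketch of that rule. The function names are hypothetical, and this is an approximation of the range syntax, not the plugin's own code.

```python
import math


def parse_range(spec):
    """Parse a Nagios-style range such as '1:1', '5', '10:', '~:10' or '@1:3'."""
    alert_inside = spec.startswith("@")  # '@' inverts the range: alert when inside
    if alert_inside:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
    else:
        low_text, high_text = "0", spec  # a bare number N means 0..N
    if low_text == "~":
        low = -math.inf  # '~' stands for negative infinity
    elif low_text == "":
        low = 0.0
    else:
        low = float(low_text)
    high = math.inf if high_text == "" else float(high_text)
    return low, high, alert_inside


def should_alert(value, spec):
    """Return True if value triggers the threshold described by spec."""
    low, high, alert_inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if alert_inside else not in_range


# With -w 1:1, exactly one matching process is OK; zero or two would warn.
for count in (0, 1, 2):
    print(count, should_alert(count, "1:1"))
```

Under this reading, the check only passes when exactly one process matches the -a pattern, so both a missing Buildbot and a duplicate one would raise a warning.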
Assignee: bhearsum → nobody
Status: REOPENED → NEW
Anything left to do here or can we mark as FIXED? Is nagios on the new PGO unittest machines in bug#420073?
Component: Release Engineering: Talos → Release Engineering
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering