Closed
Bug 564404
Opened 15 years ago
Closed 15 years ago
Install Nagios checks for hung slaves
Categories
(mozilla.org Graveyard :: Server Operations, task, P4)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rail, Assigned: jabba)
References
Details
(Whiteboard: [needs nrpe fixed])
Bug 523827 requires some additional Nagios checks for buildbot slaves:
Windows:
check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\slave\\twistd.log" filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"
Linux and OSX:
check_nrpe -H $HOST -c check_file_age -a 7200 21600 /builds/slave/twistd.log
Could you please install these checks on the production Nagios?
Updated•15 years ago
Assignee: server-ops → jdow
Comment 1•15 years ago
Justin: ping? Any progress here?
Assignee
Comment 2•15 years ago
I installed the checks for all the slaves. The Windows slaves are reporting that the path E:\builds\slave\twistd.log can't be found. Bug 569448 is tracking the progress on getting that path standardized.
Updated•15 years ago
Assignee: jdow → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Whiteboard: [needs nrpe fixed]
Comment 3•15 years ago
Something funky is going on. Here's the test run directly on the slave:
[cltbld@mv-moz2-linux-ix-slave05 ~]$ /usr/lib/nagios/plugins/check_file_age -w 7200 -c 21600 -f /builds/slave/twistd.log
FILE_AGE OK: /builds/slave/twistd.log is 3214 seconds old and 833410 bytes
While the result on Nagios is:
FILE_AGE WARNING: /builds/slave/twistd.log is 13457 seconds old and 773022 bytes
Last update: 06-03-2010 19:29:06 ( 0d 0h 0m 6s ago)
They disagree quite a bit.
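For reference, both status labels above are what the warning/critical thresholds used in the test run (7200 and 21600) would give for the reported ages, so the disagreement seems to be in the measured file age rather than in the threshold handling. A minimal sketch of that comparison follows. `classify_age` is a hypothetical helper, not part of the real check_file_age plugin, and the strict greater-than comparison is an assumption:

```shell
# Sketch of warning/critical threshold logic; NOT the real check_file_age plugin.
# classify_age AGE_SECONDS WARN_SECONDS CRIT_SECONDS
# Prints OK, WARNING or CRITICAL and returns the usual Nagios exit code (0, 1, 2).
# Assumption: a threshold is exceeded only when the age is strictly greater.
classify_age() {
    local age=$1 warn=$2 crit=$3
    if [ "$age" -gt "$crit" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$age" -gt "$warn" ]; then
        echo "WARNING"
        return 1
    fi
    echo "OK"
    return 0
}

classify_age 3214 7200 21600    # age reported on the slave: prints OK
classify_age 13457 7200 21600   # age reported by Nagios: prints WARNING
```

Under these assumptions an age of 3214 seconds is below both thresholds, while 13457 seconds falls between the warning and critical values.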
Comment 4•15 years ago
P4 until bug 569448 is fixed.
Move back to Server Ops at that point.
Do we want to file a separate bug for the discrepancy, or morph this bug to cover it?
Priority: -- → P4
Reporter
Comment 6•15 years ago
It looks like nrpe properly reports "hung" slaves.
I saved the Nagios page (https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28) as HTML and grepped for the problematic slaves (CRITICAL):
$ grep -A2 statusBGCRITICAL nagios.html |grep hung+slave |sed 's/.*&host=\(.*.build\)&.*/\1/g'
Here is the list: http://pastebin.mozilla.org/735603
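To illustrate what the sed expression in the pipeline above extracts, here is a minimal sketch run against a made-up input line. The sample HTML line and the `extract_host` name are assumptions for illustration only, and the real saved page may look different. The dot before `build` is escaped here, which is slightly stricter than the original expression:

```shell
# Hypothetical helper wrapping the sed expression from the pipeline above.
# Replaces each input line with the host name found between "&host=" and the next "&".
extract_host() {
    sed 's/.*&host=\(.*\.build\)&.*/\1/'
}

# Made-up sample line; the real Nagios HTML may differ.
sample='<a href="extinfo.cgi?type=2&host=bm-xserve08.build&service=hung+slave">hung slave</a>'
printf '%s\n' "$sample" | extract_host   # prints bm-xserve08.build
```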
Then I ran "ls -l /builds/slave/twistd.log" on those machines. Only 3 of them have logs dated Jun 15 (today), but the nrpe check failed due to "NRPE: Command check_file_age not defined".
bm-xserve08.build.mozilla.org ls -l /builds/slave/twistd.log
ls: /builds/slave/twistd.log: No such file or directory
Date breakdown is here: http://pastebin.mozilla.org/735605
Full log: http://pastebin.mozilla.org/735611
We have master pings disabled on slaves, so logs may sometimes stay untouched for several days. It would probably be better to increase the file age threshold to 4-7 days.
I also checked one of the slaves with twistd.log dated May 10. The master reported that the slave was not connected, while the slave was running. Definitely hung.
Reporter
Comment 7•15 years ago
I would like to ask for the warning/critical args to be increased to 86400 (1 day) and 172800 (2 days), so the final command should look like this:
check_nrpe -H $HOST -c check_file_age -a 86400 172800 /builds/slave/twistd.log
Could you also enable "hung slave" service notifications for linux, linux64, darwin9 and darwin10 hosts?
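The threshold arithmetic in this comment can be sketched as follows. `days_to_seconds` is a hypothetical helper introduced only for illustration, and `$HOST` is printed as a literal placeholder rather than expanded:

```shell
# Hypothetical helper: convert whole days to seconds (1 day = 86400 seconds).
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

warn=$(days_to_seconds 1)   # 86400
crit=$(days_to_seconds 2)   # 172800

# Print the resulting command line; the single quotes keep $HOST literal.
printf 'check_nrpe -H $HOST -c check_file_age -a %s %s /builds/slave/twistd.log\n' "$warn" "$crit"
```

With the same helper, the 4-7 day range suggested in the previous comment would correspond to 345600 and 604800 seconds.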
Reporter
Comment 8•15 years ago
Reassigning to IT for Nagios configuration.
Assignee: rail → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Reporter
Comment 9•15 years ago
It looks like the Windows checks should be modified as well:
* The file is E:\builds\moz2_slave\twistd.log
* filter-written=>24h
The final command is:
check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\moz2_slave\\twistd.log" filter-written=\>24h MaxCrit=1 syntax="%filename% last changed %write%"
Assignee
Updated•15 years ago
Assignee: server-ops → jdow
Assignee
Comment 10•15 years ago
I made the changes to the checks. Should I re-enable notifications for them?
Reporter
Comment 11•15 years ago
(In reply to comment #10)
> I made the changes to the checks. Should I re-enable notifications for them?
Yes, please.
Could you also exclude the following hosts from the checks:
bm-xserve08.build: doesn't run buildbot
bm-xserve20.build: doesn't run puppetd to get the latest nrpe.cfg (used for 1.9 unittests)
Assignee
Comment 12•15 years ago
Notifications are re-enabled, and bm-xserve08 and bm-xserve20 have been excluded from the check.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard