Closed Bug 564404 Opened 15 years ago Closed 15 years ago

Install Nagios checks for hung slaves

Categories

(mozilla.org Graveyard :: Server Operations, task, P4)

x86
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: jabba)

References

Details

(Whiteboard: [needs nrpe fixed])

Bug 523827 requires some additional Nagios checks for buildbot slaves:
Windows:
  check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\slave\\twistd.log" filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"
Linux and OSX:
  check_nrpe -H $HOST -c check_file_age -a 7200 21600 /builds/slave/twistd.log
Could you please install these checks on the production Nagios?
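As a rough illustration of what the two numeric arguments in the Linux/OSX check mean, here is a minimal Python sketch of the warning/critical decision. It assumes the plugin treats the first value (7200) as the warning threshold and the second (21600) as the critical threshold, both in seconds, and that an age strictly greater than a threshold trips it. The names `State` and `classify_file_age` are hypothetical and not part of Nagios or NRPE.

```python
from enum import Enum


class State(Enum):
    """Nagios-style service states (hypothetical helper type)."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def classify_file_age(age_seconds: int, warn: int = 7200, crit: int = 21600) -> State:
    """Map a log file's age in seconds to a state.

    Assumption: an age strictly greater than a threshold trips it,
    and the critical threshold is checked first.
    """
    if age_seconds > crit:
        return State.CRITICAL
    if age_seconds > warn:
        return State.WARNING
    return State.OK


# Illustrative ages only: one hour, roughly three hours, roughly eight hours.
for age in (3600, 10000, 30000):
    print(age, classify_file_age(age).name)
```

With these defaults, a twistd.log untouched for more than 2 hours would warn and one untouched for more than 6 hours would go critical.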
Assignee: server-ops → jdow
Justin: ping? Any progress here?
Depends on: 569448
I installed the checks for all the slaves. The Windows slaves are reporting that the path E:\builds\slave\twistd.log can't be found. Bug 569448 is tracking the progress on getting that path standardized.
Assignee: jdow → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Whiteboard: [needs nrpe fixed]
Something funky is going on. Here's the test run directly on the slave:
  [cltbld@mv-moz2-linux-ix-slave05 ~]$ /usr/lib/nagios/plugins/check_file_age -w 7200 -c 21600 -f /builds/slave/twistd.log
  FILE_AGE OK: /builds/slave/twistd.log is 3214 seconds old and 833410 bytes
While the result on Nagios is:
  FILE_AGE WARNING: /builds/slave/twistd.log is 13457 seconds old and 773022 bytes
  Last update: 06-03-2010 19:29:06 ( 0d 0h 0m 6s ago)
They disagree quite a bit.
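To put numbers on the disagreement, the two FILE_AGE lines quoted above can be parsed and compared. This is a minimal Python sketch; the regular expression is inferred only from the two output lines in this comment, and the names `FileAgeResult` and `parse_file_age` are hypothetical.

```python
import re
from typing import NamedTuple


class FileAgeResult(NamedTuple):
    state: str
    path: str
    age_seconds: int
    size_bytes: int


# Pattern inferred from the two sample lines in this bug; other plugin
# versions may format their output differently.
_PATTERN = re.compile(
    r"^FILE_AGE (?P<state>\w+): (?P<path>\S+) is (?P<age>\d+) seconds old "
    r"and (?P<size>\d+) bytes"
)


def parse_file_age(line: str) -> FileAgeResult:
    """Extract state, path, age and size from one FILE_AGE output line."""
    match = _PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized FILE_AGE line: {line!r}")
    return FileAgeResult(
        match["state"], match["path"], int(match["age"]), int(match["size"])
    )


local = parse_file_age(
    "FILE_AGE OK: /builds/slave/twistd.log is 3214 seconds old and 833410 bytes"
)
remote = parse_file_age(
    "FILE_AGE WARNING: /builds/slave/twistd.log is 13457 seconds old and 773022 bytes"
)

age_gap = remote.age_seconds - local.age_seconds   # 10243 seconds
size_gap = local.size_bytes - remote.size_bytes    # 60388 bytes
print(f"age differs by {age_gap} s, size differs by {size_gap} bytes")
```

The Nagios-side reading is both older (by 10243 seconds) and smaller (by 60388 bytes) than the reading taken directly on the slave.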
P4 until bug 569448 is fixed. Move back to Server Ops at that point. Do we want to file a separate bug for the discrepancy, or morph this bug for it?
Priority: -- → P4
I think Rail is working on this, actually.
Assignee: nobody → rail
It looks like NRPE properly reports "hung" slaves. I saved the Nagios page (https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28) as HTML and grepped for the problematic slaves (CRITICAL):
  $ grep -A2 statusBGCRITICAL nagios.html |grep hung+slave |sed 's/.*&host=\(.*.build\)&.*/\1/g'
Here is the list: http://pastebin.mozilla.org/735603
Then I ran "ls -l /builds/slave/twistd.log" on those machines. Only 3 of them have logs dated Jun 15 (today), but their NRPE check failed with "NRPE: Command check_file_age not defined". On bm-xserve08.build.mozilla.org the log does not exist at all:
  ls -l /builds/slave/twistd.log
  ls: /builds/slave/twistd.log: No such file or directory
Date breakdown is here: http://pastebin.mozilla.org/735605
Full log: http://pastebin.mozilla.org/735611
We have master pings disabled on slaves, so logs may stay untouched for several days at a time. It would probably be better to raise the file age threshold to 4-7 days.
I also checked one of the slaves with twistd.log dated May 10. The master reported that the slave was not connected, while the slave process was running. Definitely hung.
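For anyone repeating the extraction without grep and sed, the pipeline above can be approximated in Python. This is only a sketch: the sample HTML is invented for illustration (the real status page markup is not shown in this bug), and the function name `hung_slave_hosts` is hypothetical. It mimics `grep -A2` by looking at each CRITICAL line and the two lines after it.

```python
import re
from typing import List

# Captures the value of the host= query parameter when it ends in ".build",
# mirroring the sed expression quoted in the comment above.
HOST_RE = re.compile(r"&host=([^&]*\.build)&")


def hung_slave_hosts(html: str) -> List[str]:
    """Return hosts whose 'hung slave' service is shown as CRITICAL."""
    lines = html.splitlines()
    hosts: List[str] = []
    for index, line in enumerate(lines):
        if "statusBGCRITICAL" not in line:
            continue
        # Like grep -A2: the matching line plus the next two lines.
        for candidate in lines[index:index + 3]:
            if "hung+slave" not in candidate:
                continue
            match = HOST_RE.search(candidate)
            if match and match.group(1) not in hosts:
                hosts.append(match.group(1))
    return hosts


# Hypothetical markup and host names, for illustration only.
SAMPLE_HTML = """<td class='statusBGCRITICAL'>
<a href='extinfo.cgi?type=2&host=example-slave01.build&service=hung+slave'>hung slave</a>
<td class='statusOK'>
<a href='extinfo.cgi?type=2&host=example-slave02.build&service=hung+slave'>hung slave</a>
"""

print(hung_slave_hosts(SAMPLE_HTML))
```

Unlike the shell pipeline, this version de-duplicates hosts, which is a convenience added here rather than something the original command did.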
I would like to ask for the warning/critical args to be increased to 86400 (1 day) and 172800 (2 days). So the final command should look like this:
  check_nrpe -H $HOST -c check_file_age -a 86400 172800 /builds/slave/twistd.log
Could you also enable "hung slave" service notifications for linux, linux64, darwin9 and darwin10 hosts?
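The day-to-second arithmetic behind the requested thresholds can be double-checked with a short Python sketch that builds the argument list for the command above. The helper name `file_age_check_args` and the host `example-slave.build` are hypothetical; only the command layout is taken from this comment.

```python
from typing import List

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def file_age_check_args(
    host: str,
    warn_days: float,
    crit_days: float,
    log_path: str = "/builds/slave/twistd.log",
) -> List[str]:
    """Build the check_nrpe argument list with thresholds given in days."""
    warn = int(warn_days * SECONDS_PER_DAY)
    crit = int(crit_days * SECONDS_PER_DAY)
    if warn >= crit:
        raise ValueError("warning threshold must be below the critical threshold")
    return [
        "check_nrpe", "-H", host,
        "-c", "check_file_age",
        "-a", str(warn), str(crit), log_path,
    ]


# Hypothetical host name; 1 day warning, 2 days critical.
args = file_age_check_args("example-slave.build", warn_days=1, crit_days=2)
print(" ".join(args))
```

Returning a list rather than a single string keeps the arguments safe to hand to something like `subprocess.run` without shell quoting.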
Reassigning to IT for Nagios configuration.
Assignee: rail → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Looks like the Windows checks should be modified as well:
 * The file is E:\builds\moz2_slave\twistd.log
 * filter-written=>24h
The final command is:
  check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\moz2_slave\\twistd.log" filter-written=\>24h MaxCrit=1 syntax="%filename% last changed %write%"
Assignee: server-ops → jdow
I made the changes to the checks. Should I re-enable notifications for them?
(In reply to comment #10)
> I made the changes to the checks. Should I re-enable notifications for them?
Yes, please. Could you also exclude the following hosts from the checks:
  bm-xserve08.build: doesn't run buildbot
  bm-xserve20.build: doesn't run puppetd to get the latest nrpe.cfg (used for 1.9 unittests)
Notifications re-enabled and bm-xserve08 and bm-xserve20 have been excluded from the check.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard