Closed Bug 774507 Opened 12 years ago Closed 12 years ago

change-oncall.sh bug: nagios configtest before nagios stop to avoid config errors

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Atoll, Assigned: jabba)

Details

Attachments

(1 file)

15:52 < jabba> but the change-oncall script
15:52 < jabba> I think it does /etc/init.d/nagios stop and then /etc/init.d/nagios start
15:52 < jabba> which will then fail, if there was a previous error

trunk and weave are presumably both affected; please ping ckolos when ready to push the change.

if [ $CHANGED_FILES -ne 0 ]; then
    /etc/init.d/nagios stop && sleep 2
    /etc/init.d/nagios stop >/dev/null 2>&1
    /etc/init.d/nagios start
fi
Assignee: server-ops → jdow
I think the following 3 changes could help prevent this problem:

1) Change modules/nagios/manifests/init.pp. In the service { "nagios": ... } code, add restart => "/usr/bin/nagios -v /etc/nagios/nagios.cfg && /etc/init.d/nagios restart" along with status => "/usr/bin/nagios -v /etc/nagios/nagios.cfg && /etc/init.d/nagios status", which would force puppet to check the config on each run and always fail if the configs are broken.

2) Implement a nagios alert to grep for failure in puppet's last_run_report.yaml. This will cause oncall to be alerted *when the failure occurs* and not potentially hours later when oncall is switched.

3) Add /usr/bin/nagios -v /etc/nagios/nagios.cfg || { mail -s "can't switch oncall on $hostname" infra-all@mozilla.com; exit; } to the code snippet referenced in comment 0 in the change-oncall.sh script, so that if the config validation fails, the script bails out with a notification but leaves nagios running with the old oncall until the config is fixed.

Does that sound reasonable to fix the current issues?
3) prevents change-oncall.sh from crashing nagios, and addresses this bug.

1) describes a different issue I wasn't aware of, which is unrelated to change-oncall.sh. Does it have its own bug? If not, here is fine too.
Proposed changes to address concerns 1 and 3. Concern 2 will be addressed in a separate bug.
Comment on attachment 642804
Proposed changes to address concerns 1 and 3

Review of attachment 642804:
-----------------------------------------------------------------

::: modules/nagios/files/data-bin/change-oncall.sh
@@ +140,4 @@
>  CHANGED_FILES=`find /etc/nagios -type f -name \*.cfg -newer $FLAGFILE | wc -l`
>  
>  if [ $CHANGED_FILES -ne 0 ]; then
> +    result=$(/usr/bin/nagios -v /etc/nagios/nagios.cfg)

$( ... 2>&1 ) so that it emails STDOUT and STDERR combined, not just STDOUT
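
A minimal sketch of the difference this review comment points at. The fake_configtest function and its messages are hypothetical stand-ins for /usr/bin/nagios -v /etc/nagios/nagios.cfg, used here only so the example runs without nagios installed:

```shell
# Hypothetical stand-in for the real config check: writes to both
# STDOUT and STDERR and returns a non-zero status.
fake_configtest() {
    echo "Reading configuration data..."
    echo "Error: bad config" >&2
    return 1
}

# Without 2>&1 inside the substitution, only STDOUT lands in the variable.
# STDERR is discarded here just to keep the demo quiet; in the reviewed
# patch it would go wherever the script's STDERR goes, not into the mail.
stdout_only=$(fake_configtest 2>/dev/null)

# With 2>&1 inside the substitution, both streams are captured, and the
# exit status of the check is still available right after the assignment.
combined=$(fake_configtest 2>&1)
status=$?

echo "captured without 2>&1: $stdout_only"
echo "captured with 2>&1:    $combined"
echo "exit status:           $status"
```

The status is read immediately after the assignment because any later command would overwrite $?.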
(In reply to Justin Dow [:jabba] from comment #1)
> 2) Implement a nagios alert to grep for failure in puppet's
> last_run_report.yaml. This will cause oncall to be alerted *when the failure
> occurs* and not potentially hours later when oncall is switched.

This has been implemented on all nagios servers and works as expected.
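
The actual check is not shown in this bug. As a rough sketch only, a nagios-style plugin along these lines could do the grep. The function name check_puppet_report, the report path argument, and the 'status: failed' pattern are all assumptions for illustration; only the plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) are standard nagios conventions:

```shell
# Hypothetical plugin body: takes the path to a puppet report file and
# maps its contents to a nagios plugin exit code.
check_puppet_report() {
    report=$1

    # A report that cannot be read says nothing about puppet's state.
    if [ ! -r "$report" ]; then
        echo "UNKNOWN: cannot read $report"
        return 3
    fi

    # Assumed failure marker; adjust to whatever the report really contains.
    if grep -q 'status: failed' "$report"; then
        echo "CRITICAL: last puppet run reported a failure"
        return 2
    fi

    echo "OK: no failure found in $report"
    return 0
}
```

It would be called with the report location on the host in question, for example check_puppet_report /path/to/last_run_report.yaml, where the path is a placeholder.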
Component: Server Operations → Server Operations: Infrastructure
QA Contact: phong → jdow
I implemented part 3 of comment 1:

jabba@jabbamini:~/svn/puppet/trunk/modules/nagios/files/data-bin> svn diff change-oncall.sh 
Index: change-oncall.sh
===================================================================
--- change-oncall.sh	(revision 45086)
+++ change-oncall.sh	(working copy)
@@ -7,6 +7,7 @@
 SYSADMINONCALLCFG=/etc/nagios/mozilla/oncall.cfg
 DESKTOPONCALLCFG=/etc/nagios/mozilla/desktop.cfg
 FLAGFILE=/tmp/flagfile
+MAILTO=infra-all@mozilla.com
 
 if [ -e $FLAGFILE ]; then
     echo "${FLAGFILE} already exists, someone else is already running this script?"
@@ -140,6 +141,7 @@
 CHANGED_FILES=`find /etc/nagios -type f -name \*.cfg -newer $FLAGFILE | wc -l`
 
 if [ $CHANGED_FILES -ne 0 ]; then
+    /usr/bin/nagios -v /etc/nagios/nagios.cfg || echo "Nagios Configuration check doesn't pass. Not changing oncall" |  mail -s "can't switch oncall on `hostname`" $MAILTO; exit 1
     /etc/init.d/nagios stop && sleep 2
     /etc/init.d/nagios stop >/dev/null 2>&1
     /etc/init.d/nagios start
jabba@jabbamini:~/svn/puppet/trunk/modules/nagios/files/data-bin> svn ci -m "adding safety check before restarting nagios"
Sending        data-bin/change-oncall.sh
Transmitting file data .
Committed revision 45114.
jabba@jabbamini:~/svn/puppet/trunk/modules/nagios/files/data-bin>
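
One thing worth double-checking in the committed one-liner: in shell, ";" separates commands and binds more loosely than "||", so in "check || notify; exit 1" the exit runs whether or not the check passed. The sketch below shows the difference with grouping braces. The function names and echo messages are hypothetical stand-ins for the nagios check, the mail notification, and the restart; each body runs in a subshell so that exit does not end the calling shell:

```shell
# Same shape as the committed line: "exit 1" is a separate command after
# the ";" and runs regardless of the check's result.
as_committed() {
    ( "$@" || echo "notify oncall"; exit 1
      echo "restart nagios" )
}

# Braces make the notification and the exit a single right-hand side of
# "||", so both run only when the check fails.
grouped() {
    ( "$@" || { echo "notify oncall"; exit 1; }
      echo "restart nagios" )
}

# "true" and "false" stand in for a passing and a failing config check.
as_committed true  || echo "as_committed: exited non-zero even though the check passed"
grouped true       && echo "grouped: restart reached when the check passed"
grouped false      || echo "grouped: stopped before restart when the check failed"
```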
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations