Closed
Bug 774507
Opened 12 years ago
Closed 12 years ago
change-oncall.sh bug: nagios configtest before nagios stop to avoid config errors
Categories
(Infrastructure & Operations :: Infrastructure: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Atoll, Assigned: jabba)
Details
Attachments
(1 file)
patch, 1.28 KB
15:52 < jabba> but the change-oncall script
15:52 < jabba> I think it does /etc/init.d/nagios stop and then /etc/init.d/nagios start
15:52 < jabba> which will then fail, if there was a previous error

trunk and weave are presumably both affected, please ping ckolos when ready to push change.

if [ $CHANGED_FILES -ne 0 ]; then
    /etc/init.d/nagios stop && sleep 2
    /etc/init.d/nagios stop >/dev/null 2>&1
    /etc/init.d/nagios start
fi
Assignee
Updated•12 years ago
Assignee: server-ops → jdow
Assignee
Comment 1•12 years ago
I think the following 3 changes could help keep this problem from happening as it currently does:

1) Change modules/nagios/manifests/init.pp. For the service { "nagios" ... code, add in
   restart => "/usr/bin/nagios -v /etc/nagios/nagios.cfg && /etc/init.d/nagios restart"
   along with
   status => "/usr/bin/nagios -v /etc/nagios/nagios.cfg && /etc/init.d/nagios status"
   which would force it to check the config on each puppet run and always fail if the configs are broken.

2) Implement a nagios alert to grep for failure in puppet's last_run_report.yaml. This will cause oncall to be alerted *when the failure occurs* and not potentially hours later when oncall is switched.

3) Add
   /usr/bin/nagios -v /etc/nagios/nagios.cfg || mail -s "can't switch oncall on $hostname; exit" infra-all@mozilla.com
   to the code snippet referenced in comment 0 in the change-oncall.sh script, so that if the config validation fails, it bails out with a notification but leaves nagios running with the old oncall until it is fixed.

Does that sound reasonable to fix the current issues?
3) prevents change-oncall.sh from crashing nagios, and addresses this bug. 1) describes a different issue i wasn't aware of, that's unrelated to change-oncall.sh. does it have its own bug? if not, here is fine too.
Comment 3•12 years ago
Proposed changes to address concerns 1 and 3. Concern 2 will be addressed in a separate bug.
Comment on attachment 642804
Proposed changes to address concerns 1 and 3

Review of attachment 642804:
-----------------------------------------------------------------

::: modules/nagios/files/data-bin/change-oncall.sh
@@ +140,4 @@
> CHANGED_FILES=`find /etc/nagios -type f -name \*.cfg -newer $FLAGFILE | wc -l`
>
> if [ $CHANGED_FILES -ne 0 ]; then
> +    result=$(/usr/bin/nagios -v /etc/nagios/nagios.cfg)

$( ... 2>&1 ) so that it emails STDOUT and STDERR combined, not just STDOUT
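The review point above could be sketched roughly like this: capture both output streams of the config check in one variable so the notification carries the full error text. The names check_config, NAGIOS_BIN, and NAGIOS_CFG are hypothetical and not from the attachment. The overrides exist only so the sketch can be exercised without Nagios installed. The defaults are the command and path quoted in the patch.

```shell
#!/bin/bash
# Sketch only: check_config, NAGIOS_BIN, and NAGIOS_CFG are illustrative
# names, not part of the reviewed attachment.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}

# Print the combined STDOUT and STDERR of the config check and
# return the check's own exit status.
check_config() {
    local result status
    # 2>&1 inside the substitution folds STDERR into the captured text.
    result=$("$NAGIOS_BIN" -v "$NAGIOS_CFG" 2>&1)
    status=$?
    printf '%s\n' "$result"
    return "$status"
}
```

Without the 2>&1, anything the check writes to STDERR would go to the terminal or cron mail instead of into $result.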
Comment 5•12 years ago
(In reply to Justin Dow [:jabba] from comment #1)
> 2) Implement a nagios alert to grep for failure in puppet's
> last_run_report.yaml. This will cause oncall to be alerted *when the failure
> occurs* and not potentially hours later when oncall is switched.

This has been implemented on all nagios servers and works as expected.
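For readers wondering what such a check might look like, here is a minimal sketch. It is not the check that was deployed. It assumes the report file contains a "status: failed" line when a run fails, which should be verified against the Puppet version in use. The function name check_puppet_report is hypothetical, and the report path is passed in by the caller. Return codes follow the Nagios plugin convention.

```shell
#!/bin/bash
# Sketch only: check_puppet_report and the 'status: failed' marker are
# assumptions, not the check that was actually deployed.

# Nagios plugin convention: 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
check_puppet_report() {
    local report=$1
    if [ ! -r "$report" ]; then
        echo "UNKNOWN: cannot read $report"
        return 3
    fi
    # Match a line consisting only of "status: failed", at any indentation.
    if grep -Eq '^[[:space:]]*status:[[:space:]]*failed[[:space:]]*$' "$report"; then
        echo "CRITICAL: last puppet run failed"
        return 2
    fi
    echo "OK: last puppet run did not report failure"
    return 0
}
```

A plugin script would call the function with the path to last_run_report.yaml and exit with its return value.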
Assignee
Updated•12 years ago
Component: Server Operations → Server Operations: Infrastructure
QA Contact: phong → jdow
Assignee
Comment 6•12 years ago
I implemented part 3 of comment 1:

jabba@jabbamini:~/svn/puppet/trunk/modules/nagios/files/data-bin> svn diff change-oncall.sh
Index: change-oncall.sh
===================================================================
--- change-oncall.sh (revision 45086)
+++ change-oncall.sh (working copy)
@@ -7,6 +7,7 @@
 SYSADMINONCALLCFG=/etc/nagios/mozilla/oncall.cfg
 DESKTOPONCALLCFG=/etc/nagios/mozilla/desktop.cfg
 FLAGFILE=/tmp/flagfile
+MAILTO=infra-all@mozilla.com
 
 if [ -e $FLAGFILE ]; then
     echo "${FLAGFILE} already exists, someone else is already running this script?"
@@ -140,6 +141,7 @@
 CHANGED_FILES=`find /etc/nagios -type f -name \*.cfg -newer $FLAGFILE | wc -l`
 
 if [ $CHANGED_FILES -ne 0 ]; then
+    /usr/bin/nagios -v /etc/nagios/nagios.cfg || echo "Nagios Configuration check doesn't pass. Not changing oncall" | mail -s "can't switch oncall on `hostname`" $MAILTO; exit 1
     /etc/init.d/nagios stop && sleep 2
     /etc/init.d/nagios stop >/dev/null 2>&1
     /etc/init.d/nagios start
jabba@jabbamini:~/svn/puppet/trunk/modules/nagios/files/data-bin> svn ci -m "adding safety check before restarting nagios"
Sending data-bin/change-oncall.sh
Transmitting file data .
Committed revision 45114.
jabba@jabbamini:~/svn/puppet/trunk/modules/nagios/files/data-bin>
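One detail in the added line may be worth double-checking. In shell grammar, ";" ends the whole "a || b | c" list, so the trailing "exit 1" appears to run whether or not the config check passes. A sketch of one way to scope the exit to the failure case follows. The names verify_or_bail, NAGIOS_BIN, NAGIOS_CFG, and MAIL_CMD are hypothetical and not from the commit, and the overrides are there only so the sketch can run without Nagios or a mailer installed.

```shell
#!/bin/bash
# Sketch only: verify_or_bail, NAGIOS_BIN, NAGIOS_CFG, and MAIL_CMD are
# illustrative names, not part of the committed script.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}
MAIL_CMD=${MAIL_CMD:-mail}
MAILTO=${MAILTO:-infra-all@mozilla.com}

# Return 0 if the config validates. Otherwise mail the combined check
# output and return 1, so the caller can exit before stopping nagios.
verify_or_bail() {
    local result
    if result=$("$NAGIOS_BIN" -v "$NAGIOS_CFG" 2>&1); then
        return 0
    fi
    printf '%s\n' "Nagios configuration check doesn't pass. Not changing oncall." "$result" |
        "$MAIL_CMD" -s "can't switch oncall on $(hostname)" "$MAILTO"
    return 1
}

# In change-oncall.sh the call site could then read:
#   verify_or_bail || exit 1
```

With this shape, a passing check falls through to the existing stop/start lines, and a failing check leaves nagios running with the old oncall, as proposed in comment 1.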
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations