When something goes wrong with puppet, we usually lose a lot of capacity. We could change the initscript to touch a file (/etc/puppet/last-update) whenever the run succeeds (puppet agent exits 0 or 2). If the puppet run fails, the script should retry N times, then check whether /etc/puppet/last-update is older than X hours. If the file is fresh enough, the script should send an email about the failure and exit 0 so the machine boots normally. If the file is too old, we can either keep retrying or do something else (reboot?).
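To make the proposed flow concrete, here is a rough sketch in Python rather than in the initscript itself. It is only an illustration under assumptions: the function names, the retry delay, and the action labels are hypothetical, and the real values of N and X are still to be decided. Sending the email and rebooting are left to the caller.

```python
import os
import time

# Exit codes the proposal treats as a good puppet run.
GOOD_EXIT_CODES = {0, 2}


def stamp_age_seconds(path, now=None):
    """Seconds since the stamp file was last touched, or None if it is missing."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path)
    except OSError:
        return None


def run_with_retries(run_agent, max_tries, stamp_path, sleep=time.sleep, delay=60):
    """Call run_agent() up to max_tries times; touch stamp_path on success.

    run_agent is any callable returning the agent's exit code (hypothetical
    hook; the real script would invoke puppet agent here). The delay between
    attempts is an assumed value.
    """
    for attempt in range(max_tries):
        if run_agent() in GOOD_EXIT_CODES:
            # Equivalent of `touch`: create the file if needed, refresh its mtime.
            with open(stamp_path, "a"):
                pass
            os.utime(stamp_path, None)
            return True
        if attempt + 1 < max_tries:
            sleep(delay)
    return False


def decide(succeeded, age, max_age):
    """Map the outcome to an action label (labels are illustrative only)."""
    if succeeded:
        return "boot"
    if age is not None and age <= max_age:
        # Last good run is recent enough: report the failure, boot anyway.
        return "email-and-boot"
    # Stamp missing or too old: keep retrying, or do something else (reboot?).
    return "keep-retrying"
```

A caller would run `run_with_retries(...)`, then pass the result and `stamp_age_seconds("/etc/puppet/last-update")` to `decide(...)`. Treating a missing stamp file the same as a stale one is a design choice in this sketch, not something settled in this bug.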
It would have helped us today, when a simple typo caused tree closures and left 1500 Amazon machines up but not running.
And again today.
Do we have something in place to prevent running jobs on minis with incorrect resolutions? This plan sounds great *except* that those minis will run jobs until the last-update semaphore file is too old -- and I'm assuming "too old" is on the order of hours to a day.
Created attachment 8375039 (bug959404.patch). Totally untested, aside from the perl snippet, but what do you think about this approach?
Comment on attachment 8375039 (bug959404.patch): Looks great to me. It might also be worth having it start spamming us once it reaches the MAX_SECS_SINCE_GOOD_RUN point, in case all the puppet masters are down or something is wrong in between.
Created attachment 8380765 (bug959404-p1.patch). Tested on CentOS, and with sending of email.
Comment on attachment 8380765 (bug959404-p1.patch): woot!
Tested fine on Ubuntu, too.
Tested fine on OS X Lion.
I don't see any problems. I watched a spot host apply this, reboot, and successfully re-run puppet (since the worst-case here was puppet runs not working, leaving machines unmanaged).
Woot! Thanks a lot for this. No more puppet typos breaking the WORLD! :)
One hopes.. we'll see :)