puppet agent initscript shouldn't fail



5 years ago


(Reporter: rail, Assigned: dustin)




(1 attachment, 1 obsolete attachment)



5 years ago
When something goes wrong with puppet we usually lose a lot of capacity. We can change the initscript to touch a file (/etc/puppet/last-update) whenever a run succeeds (puppet agent exits 0 or 2). If puppetizing fails, the script should retry N times, then check whether /etc/puppet/last-update is older than X hours. If the file is fresh enough, the script should send an email about the failure and exit 0 so the machine boots up properly. If the file is too old, we can either keep retrying or do something else (reboot?).
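For illustration only, the proposed flow could be sketched roughly as below. This is not the attached patch (which is a change to the initscript itself). The values for N and X, the function names, and the exact `puppet agent` flags are assumptions made up for the sketch. Only the semaphore path and the "exit 0 or 2 means success" rule come from the description above.

```python
import os
import subprocess
import time

LAST_UPDATE = "/etc/puppet/last-update"  # semaphore path from the description
RETRIES = 5                              # "N" above; the value is an assumption
MAX_AGE_SECS = 6 * 3600                  # "X hours" above; the value is an assumption
GOOD_EXIT_CODES = (0, 2)                 # exit codes the description treats as success


def run_puppet_agent():
    """Run one puppet agent pass and return its exit code (flags are an assumption)."""
    return subprocess.call(
        ["puppet", "agent", "--onetime", "--no-daemonize", "--detailed-exitcodes"]
    )


def touch(path):
    """Create the file if needed and set its mtime to now."""
    with open(path, "a"):
        pass
    os.utime(path, None)


def is_fresh(path, max_age_secs, now=None):
    """True if the file exists and was modified within max_age_secs."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) <= max_age_secs
    except OSError:
        return False  # a missing semaphore counts as "too old"


def puppetize(run_agent=run_puppet_agent, path=LAST_UPDATE, retries=RETRIES,
              max_age_secs=MAX_AGE_SECS, delay=60, sleep=time.sleep):
    """Return 'ok', 'boot-with-warning', or 'keep-retrying'."""
    for attempt in range(retries):
        if run_agent() in GOOD_EXIT_CODES:
            touch(path)  # record the good run
            return "ok"
        if attempt < retries - 1:
            sleep(delay)
    # Every attempt failed: decide based on the age of the last good run.
    if is_fresh(path, max_age_secs):
        return "boot-with-warning"  # caller emails the failure and exits 0
    return "keep-retrying"          # or reboot; left open in the description
```

A caller would map `"ok"` and `"boot-with-warning"` to exit status 0 (sending the failure email in the second case) and handle `"keep-retrying"` however the open question above is settled.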

Comment 1

5 years ago
It would have helped us today, when a simple typo caused tree closures and left 1500 Amazon machines up but not running.
Assignee: relops → dustin
And today too
Do we have something in place to prevent running jobs on minis with incorrect resolutions?  This plan sounds great *except* that those minis will run jobs until the last-update semaphore file is too old -- and I'm assuming "too old" is on the order of hours to a day.
Created attachment 8375039 [details] [diff] [review]

Totally untested, aside from the perl snippet, but what do you think about this approach?
Attachment #8375039 - Flags: feedback?(rail)

Comment 5

5 years ago
Comment on attachment 8375039 [details] [diff] [review]

Review of attachment 8375039 [details] [diff] [review]:

Looks great to me. It might also be worth making it start spamming us once it reaches the MAX_SECS_SINCE_GOOD_RUN point, in case all the puppet masters are down or something is wrong in between.
Attachment #8375039 - Flags: feedback?(rail) → feedback+
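One way the suggestion in this review could look, as a rough sketch: start alerting once the last good run is older than MAX_SECS_SINCE_GOOD_RUN, and repeat at a fixed interval. The constant name comes from the review. Its value, the repeat interval, and the function name are hypothetical, and this says nothing about how the attached patch actually uses the constant.

```python
MAX_SECS_SINCE_GOOD_RUN = 24 * 3600  # name from the review; the value is an assumption
ALERT_INTERVAL_SECS = 3600           # hypothetical gap between repeated alerts


def should_alert(last_good_run, last_alert, now,
                 max_secs=MAX_SECS_SINCE_GOOD_RUN,
                 interval=ALERT_INTERVAL_SECS):
    """Decide whether to send an alert email at time `now`.

    All times are seconds since the epoch; last_alert is None if no
    alert has been sent yet.
    """
    if now - last_good_run < max_secs:
        return False  # the last good run is still recent enough
    # Past the threshold: alert right away, then once per interval.
    return last_alert is None or now - last_alert >= interval
```

Spacing the repeats keeps a prolonged puppet master outage from producing one email per retry on every affected machine.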
Created attachment 8380765 [details] [diff] [review]

Tested on CentOS, and with sending of email.
Attachment #8375039 - Attachment is obsolete: true
Attachment #8380765 - Flags: review?(rail)

Comment 7

5 years ago
Comment on attachment 8380765 [details] [diff] [review]

Attachment #8380765 - Flags: review?(rail) → review+
Tested fine on Ubuntu, too.
Tested fine on OS X Lion.
I don't see any problems.  I watched a spot host apply this, reboot, and successfully re-run puppet (since the worst-case here was puppet runs not working, leaving machines unmanaged).
Last Resolved: 5 years ago
Resolution: --- → FIXED

Comment 12

5 years ago
Woot! Thanks a lot for this. No more puppet typos breaking the WORLD! :)
One hopes.. we'll see :)