puppet agent initscript shouldn't fail

RESOLVED FIXED

Status

Infrastructure & Operations
RelOps: Puppet
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: rail, Assigned: dustin)

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

4 years ago
When something goes wrong with puppet, we usually lose a lot of capacity. We can change the initscript to touch a file (/etc/puppet/last-update) whenever everything goes right (puppet agent exits 0 or 2). If puppetizing fails, the script should retry N times, then check whether /etc/puppet/last-update is older than X hours. If the file is fresh enough, the script should send an email about the failure and exit 0 so the machine boots up properly. If the file is too old, we can either keep retrying or do something else (reboot?).
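The proposal above could be sketched roughly as follows. This is only an illustration of the decision logic, written in Python for readability; it is not the attached patch, and the names run_with_fallback, seconds_since, run_agent, and notify are hypothetical. N and X from the description appear as the retries and max_age_secs parameters, and the "file is old" branch is left to the caller because the description leaves it open.

```python
import os
import time
from typing import Callable, Optional

# From the description: puppet agent exiting 0 or 2 counts as a good run.
GOOD_EXIT_CODES = {0, 2}


def seconds_since(path: str) -> Optional[float]:
    """Return the age of the semaphore file in seconds, or None if it is missing."""
    try:
        return time.time() - os.path.getmtime(path)
    except OSError:
        return None


def run_with_fallback(run_agent: Callable[[], int],
                      stamp_path: str,
                      retries: int,
                      max_age_secs: float,
                      notify: Callable[[str], None]) -> str:
    """Try puppet up to `retries` times and classify the outcome.

    Returns "ok" after a good run, "degraded" when every attempt failed but
    the last good run is recent enough to boot anyway, and "stale" when there
    is no recent good run (the caller decides: keep retrying, reboot, ...).
    """
    for _ in range(retries):
        if run_agent() in GOOD_EXIT_CODES:
            # Equivalent of `touch`: create the file if needed, then bump its mtime.
            with open(stamp_path, "a"):
                pass
            os.utime(stamp_path, None)
            return "ok"
        # A real initscript would presumably sleep between attempts; omitted here.

    age = seconds_since(stamp_path)
    if age is not None and age <= max_age_secs:
        notify("puppet failed %d times; last good run %.0fs ago" % (retries, age))
        return "degraded"
    return "stale"
```

In this sketch the initscript would exit 0 on both "ok" and "degraded", so the machine still boots after a recent good run, and only "stale" would block or trigger further action.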
(Reporter)

Comment 1

4 years ago
This would have helped us today, when a simple typo caused tree closures and left 1500 Amazon machines up but not running.
(Assignee)

Updated

4 years ago
Assignee: relops → dustin
(Assignee)

Comment 2

4 years ago
And today, too.
(Assignee)

Comment 3

4 years ago
Do we have something in place to prevent running jobs on minis with incorrect resolutions? This plan sounds great *except* that those minis will keep running jobs until the last-update semaphore file is too old, and I'm assuming "too old" is on the order of hours to a day.
(Assignee)

Comment 4

4 years ago
Created attachment 8375039
bug959404.patch

Totally untested, aside from the perl snippet, but what do you think about this approach?
Attachment #8375039 - Flags: feedback?(rail)
(Reporter)

Comment 5

4 years ago
Comment on attachment 8375039
bug959404.patch

Review of attachment 8375039:
-----------------------------------------------------------------

Looks great to me. It might also be worth having it start spamming us once it reaches the MAX_SECS_SINCE_GOOD_RUN point, in case all puppet masters are down or something is wrong in between.
Attachment #8375039 - Flags: feedback?(rail) → feedback+
(Assignee)

Comment 6

4 years ago
Created attachment 8380765
bug959404-p1.patch

Tested on CentOS, including sending of email.
Attachment #8375039 - Attachment is obsolete: true
Attachment #8380765 - Flags: review?(rail)
(Reporter)

Comment 7

4 years ago
Comment on attachment 8380765
bug959404-p1.patch

woot!
Attachment #8380765 - Flags: review?(rail) → review+
(Assignee)

Comment 8

4 years ago
Tested fine on Ubuntu, too.
(Assignee)

Comment 9

4 years ago
Tested fine on OS X Lion.
(Assignee)

Comment 10

4 years ago
https://hg.mozilla.org/build/puppet/rev/97ec0177f2ee
(Assignee)

Comment 11

4 years ago
I don't see any problems. I watched a spot host apply this, reboot, and successfully re-run puppet (the worst case here was puppet runs no longer working, which would leave machines unmanaged).
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
(Reporter)

Comment 12

4 years ago
Woot! Thanks a lot for this. No more puppet typos breaking the WORLD! :)
(Assignee)

Comment 13

4 years ago
One hopes... we'll see :)