Caught a few masters with a lockfile from August 10; fixing them required a restart of puppet. Presumably fallout from network issues? In any case, we need a check for these lock files, or some other way of checking whether puppet is wedged.
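For illustration only, a lockfile check along these lines could be sketched as below. The function name, path and threshold are hypothetical and not taken from any plugin mentioned in this bug; the exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_stale_lock(lock_path, max_age_seconds, now=None):
    """Return (exit_code, message) for a lockfile that may have been left behind.

    A missing lockfile is fine (no run in progress). A lockfile older than
    max_age_seconds suggests the agent is wedged.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return OK, "OK: no lockfile present"
    age = now - mtime
    if age > max_age_seconds:
        return CRITICAL, f"CRITICAL: lockfile {lock_path} is {int(age)}s old"
    return OK, f"OK: lockfile {lock_path} is {int(age)}s old"
```

The lockfile location depends on the puppet version and configuration, so it would have to be passed in per host rather than hard-coded.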
OS: Linux → All
Priority: -- → P3
Hardware: x86_64 → All
at $previous_job we used to have cfengine touch a file every time it ran, then check the age of that file with nagios. That not only tells you that puppet ran, but that it ran successfully.
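A rough sketch of that touch-file approach, assuming the config run touches a stamp file only after it completes successfully. The names and thresholds here are hypothetical; this is not the cfengine or nagios setup from the previous job.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_heartbeat(stamp_path, warn_seconds, crit_seconds, now=None):
    """Return (exit_code, message) based on the age of a 'last good run' stamp file.

    Unlike a lockfile, a missing stamp file is a failure: it means no
    successful run has ever been recorded on this host.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.stat(stamp_path).st_mtime
    except FileNotFoundError:
        return CRITICAL, f"CRITICAL: {stamp_path} missing; no successful run recorded"
    if age >= crit_seconds:
        return CRITICAL, f"CRITICAL: last successful run {int(age)}s ago"
    if age >= warn_seconds:
        return WARNING, f"WARNING: last successful run {int(age)}s ago"
    return OK, f"OK: last successful run {int(age)}s ago"
```

The thresholds would need to be a comfortable multiple of the run interval so that a single slow run does not page anyone.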
this bit us again today when signing3 was silently failing to sync up with puppet
Severity: normal → critical
Found a suitable nagios NRPE plugin via Nagios Exchange (thanks to :arr):
<https://github.com/aswen/nagios-plugins/blob/1766d1fdf3b32477a8fa0dc3d754188ed1c0e2cc/check_puppet_agent>

Forked & modified for our puppet version at:
<https://github.com/hwine/nagios-plugins>

Installed on buildbot-master32 for testing, and works on that host:

  [root@buildbot-master32 plugins]# ./check_nrpe -H 127.0.0.1 -c check_puppet_agent
  OK: Puppet agent <unknown> running catalog version <unknown>

Next steps:
- have relops monitor on this one host
- adjust timings as appropriate
- put into puppet for deployment to all clients running puppetd (not talos)
- activate monitoring on all clients
The state file for our version of puppet doesn't contain lines with 'version:' or 'config:', which leads to those two '<unknown>' in the OK message. Could we finesse those away?
Sure - I'm the one who put them there. :) My assumption is that when we migrate to a newer version, they may be present and would then auto populate, since they were present in the upstream version of the code. I can certainly use a different constant or dynamic values there for now.
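One way to finesse the '<unknown>' values away, sketched in Python rather than in the plugin's own shell code: leave a field out of the message when the state file has no such line. The helper names are hypothetical, and the mapping of 'version:' to the agent version and 'config:' to the catalog version is an assumption read off the OK message above.

```python
def extract_field(state_text, key):
    """Return the value of the first 'key: value' line, or None if absent or empty."""
    prefix = key + ":"
    for line in state_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            value = stripped[len(prefix):].strip()
            return value or None
    return None


def ok_message(state_text):
    """Build the OK message, omitting any field the state file does not provide."""
    version = extract_field(state_text, "version")  # assumed: agent version
    config = extract_field(state_text, "config")    # assumed: catalog version
    message = "OK: Puppet agent"
    if version:
        message += " " + version
    message += " running"
    if config:
        message += " catalog version " + config
    return message
```

With this shape the message populates itself once a newer puppet writes those lines, and stays clean until then.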
this keeps biting us at really inconvenient times. Hal - do you have time to finish this up?
Priority: P3 → P2
I can work on this soonish - fwiw, it will be my first significant puppet work, so if we need it faster than a week, someone else should grab.
Created attachment 745427
puppetAgain patch for buildbot masters

Passes manual tests from AWS buildbot master:

puppet agent --test --noop --environment test
  shows 2 files to be deployed, and nrpe to be reconfigured
puppet agent --test --environment test
  shows correct values for the files delivered via puppet

upstream of plugin code is https://github.com/hwine/nagios-plugins
Attachment #745427 - Flags: review?(rail)
Comment on attachment 745427
puppetAgain patch for buildbot masters

Review of attachment 745427:

You'll want this on foopies and imaging servers as well. It's probably best to include it from toplevel::server.
Comment on attachment 745427
puppetAgain patch for buildbot masters

(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> You'll want this on foopies and imaging servers as well. It's probably best
> to include it from toplevel::server.

Agree. It won't work on the AWS puppet masters (unless you disable daemon_check), but it shouldn't hurt them either. Once we switch to the cluster model this won't be an issue.

Hal, can you move "include nrpe::check::puppet_agent" from buildmaster::buildbot_master to toplevel::server when you land?
Attachment #745427 - Flags: review?(rail) → review+
Comment on attachment 745427
puppetAgain patch for buildbot masters

https://hg.mozilla.org/build/puppet/rev/985bdf0507d0
Attachment #745427 - Flags: checked-in+
Note: this plugin does not work on the ancient buildbot-master12, the only one left in-house at the moment. This is not an issue, as it's scheduled for replacement in bug 867593 by a modern version that this plugin supports.
Created attachment 747043
remove broken & useless daemon code

The test for a running puppet daemon is very brittle, and gives lots of false positives. Instead of fixing it, remove the functionality from this plugin, as many of our servers do not run a traditional puppet daemon. If a daemon check is needed, we'll implement it via a proper nagios check_procs.
Attachment #747043 - Flags: review?(rail)
In fact, none run a puppet daemon. The servers run puppet from a cron task.
Comment on attachment 747043
remove broken & useless daemon code

https://hg.mozilla.org/build/puppet/rev/32beb7ce6cc0
Attachment #747043 - Flags: checked-in+
Product: mozilla.org → Release Engineering
What's left to do here?
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #17)
> What's left to do here?

I'm going to assume nothing.
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED