nagios check for age of /var/lib/puppet/state/puppetdlock

RESOLVED FIXED

Status

P2
critical
RESOLVED FIXED
7 years ago
5 years ago

People

(Reporter: catlee, Assigned: hwine)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [nagios][buildbotmaster][puppet])

Attachments

(2 attachments)

(Reporter)

Description

7 years ago
Caught a few masters with a lockfile from August 10; required a restart of puppet to fix. Presumably fallout from network issues?

in any case, we need a check for these lock files, or some other way of checking if puppet is wedged.
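A minimal sketch of such a lock-file check, written as a Nagios-style function that returns an exit code and a message. The function name and the thresholds used in the usage note are illustrative assumptions, not anything deployed for this bug; only the lock file path comes from the bug summary.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_lock_age(path, warn_secs, crit_secs, now=None):
    """Return (exit_code, message) for a lock file that should be short-lived.

    Hypothetical helper: a missing lock is treated as healthy, because the
    lock is only expected to exist while a puppet run is in progress.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return OK, f"OK: no lock file at {path}"
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot stat {path}: {exc}"

    age = int(now - mtime)
    if age >= crit_secs:
        return CRITICAL, f"CRITICAL: {path} is {age}s old"
    if age >= warn_secs:
        return WARNING, f"WARNING: {path} is {age}s old"
    return OK, f"OK: {path} is {age}s old"
```

A wrapper script would call something like `check_lock_age("/var/lib/puppet/state/puppetdlock", 1800, 3600)`, print the message, and exit with the returned code. The 30 and 60 minute thresholds there are placeholders that would need tuning against real run times.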
OS: Linux → All
Priority: -- → P3
Hardware: x86_64 → All
At $previous_job we had cfengine touch a file every time it ran, then checked the age of that file with nagios. Done the same way here, that not only tells you that puppet ran, but that it ran successfully.
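The touch-a-file approach could be sketched roughly as follows. Both function names and the stamp path in the usage note are hypothetical; the one design assumption worth calling out is that the stamp is only touched after a run finishes successfully, so a missing or old stamp is treated as a failure.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def record_successful_run(stamp_path):
    """Create the stamp file if needed and set its mtime to the current time."""
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, None)


def check_stamp_fresh(stamp_path, warn_secs, crit_secs, now=None):
    """Return (exit_code, message) based on how old the stamp file is."""
    now = time.time() if now is None else now
    try:
        age = int(now - os.stat(stamp_path).st_mtime)
    except FileNotFoundError:
        # Unlike a lock file, a missing stamp means no successful run was recorded.
        return CRITICAL, f"CRITICAL: {stamp_path} is missing"
    if age >= crit_secs:
        return CRITICAL, f"CRITICAL: last successful run was {age}s ago"
    if age >= warn_secs:
        return WARNING, f"WARNING: last successful run was {age}s ago"
    return OK, f"OK: last successful run was {age}s ago"
```

The configuration run would call `record_successful_run()` as its last step, and the Nagios side would call `check_stamp_fresh()` with thresholds set to a small multiple of the run interval.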
(Assignee)

Updated

6 years ago
Duplicate of this bug: 751578
(Assignee)

Updated

6 years ago
Assignee: nobody → hwine
(Reporter)

Comment 3

6 years ago
this bit us again today when signing3 was silently failing to sync up with puppet
Severity: normal → critical
(Assignee)

Comment 4

6 years ago
Found a suitable nagios NRPE plugin via Nagios Exchange (thanks to :arr):
 <https://github.com/aswen/nagios-plugins/blob/1766d1fdf3b32477a8fa0dc3d754188ed1c0e2cc/check_puppet_agent>
forked & modified for our puppet version at:
 <https://github.com/hwine/nagios-plugins>

Installed on buildbot-master32 for testing, and works on that host:
 [root@buildbot-master32 plugins]# ./check_nrpe -H 127.0.0.1 -c check_puppet_agent
 OK: Puppet agent <unknown> running catalog version <unknown>

Next steps:
 - have relops monitor on this one host
 - adjust timings as appropriate
 - put into puppet for deployment to all clients running puppetd (not talos)
 - activate monitoring on all clients
(Assignee)

Updated

6 years ago
Depends on: 752332
The state file for our version of puppet doesn't contain lines with 'version:' or 'config:', which leads to those two '<unknown>' values in the OK message. Could we finesse those away?
(Assignee)

Comment 6

6 years ago
Sure - I'm the one who put them there. :)  My assumption is that when we migrate to a newer version, they may be present and would then auto-populate, since they were present in the upstream version of the code. I can certainly use different constant or dynamic values there for now.
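The fallback behaviour discussed here can be illustrated with a small sketch. It is not the plugin's actual code (the plugin is a separate script); it assumes the state file holds flat `key: value` lines, and it assumes 'version:' feeds the agent slot and 'config:' the catalog slot of the OK message. The helper names are hypothetical.

```python
def read_field(state_text, key, default="<unknown>"):
    """Return the value of the first non-empty 'key: value' line, else default."""
    prefix = key + ":"
    for line in state_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            value = stripped[len(prefix):].strip()
            if value:
                return value
    return default


def ok_message(state_text):
    """Build the OK line, substituting the default for any missing field."""
    version = read_field(state_text, "version")  # assumed mapping, see lead-in
    config = read_field(state_text, "config")    # assumed mapping, see lead-in
    return f"OK: Puppet agent {version} running catalog version {config}"
```

With a state file that has neither line, `ok_message` produces the same text seen on buildbot-master32. Passing a different `default` to `read_field` is one way to swap in a constant until a newer puppet version starts writing those fields.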
(Assignee)

Updated

6 years ago
Status: NEW → ASSIGNED
(Reporter)

Comment 7

6 years ago
this keeps biting us at really inconvenient times.

Hal - do you have time to finish this up?
Priority: P3 → P2
(Assignee)

Comment 8

6 years ago
I can work on this soonish - fwiw, it will be my first significant puppet work, so if we need it faster than a week, someone else should grab it.
(Assignee)

Comment 9

5 years ago
Created attachment 745427
puppetAgain patch for buildbot masters

Passes manual tests from AWS buildbot master:
 puppet agent --test --noop --environment test
shows 2 files to be deployed, and nrpe to be reconfigured
 puppet agent --test --environment test
shows correct values for the files delivered via puppet

upstream of plugin code is https://github.com/hwine/nagios-plugins
Attachment #745427 - Flags: review?(rail)
Comment on attachment 745427
puppetAgain patch for buildbot masters

Review of attachment 745427:
-----------------------------------------------------------------

You'll want this on foopies and imaging servers as well.  It's probably best to include it from toplevel::server.
Comment on attachment 745427
puppetAgain patch for buildbot masters

(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> You'll want this on foopies and imaging servers as well.  It's probably best
> to include it from toplevel::server.

Agree. It won't work on the AWS puppet masters (unless you disable daemon_check), but it shouldn't hurt them either. Once we switch to the cluster model this won't be an issue.

Hal, can you move "include nrpe::check::puppet_agent" from buildmaster::buildbot_master to toplevel::server when you land?
Attachment #745427 - Flags: review?(rail) → review+
(Assignee)

Comment 12

5 years ago
Comment on attachment 745427
puppetAgain patch for buildbot masters

https://hg.mozilla.org/build/puppet/rev/985bdf0507d0
Attachment #745427 - Flags: checked-in+
(Assignee)

Comment 13

5 years ago
Note: this plugin does not work on the ancient buildbot-master12, the only one left in-house at the moment. This is not an issue, as it's scheduled for replacement in bug 867593 with a modern version that this plugin supports.
(Assignee)

Comment 14

5 years ago
Created attachment 747043
remove broken & useless daemon code

The test for a running puppet daemon is very brittle, and gives lots of false positives.

Instead of fixing it, remove the functionality from this plugin, as many of our servers do not run a traditional puppet daemon. If a daemon check is needed, we'll implement it via a proper nagios check_procs check.
Attachment #747043 - Flags: review?(rail)
In fact, none run a puppet daemon.  The servers run puppet from a cron task.
Attachment #747043 - Flags: review?(rail) → review+
(Assignee)

Comment 16

5 years ago
Comment on attachment 747043
remove broken & useless daemon code

https://hg.mozilla.org/build/puppet/rev/32beb7ce6cc0
Attachment #747043 - Flags: checked-in+
(Assignee)

Updated

5 years ago
Blocks: 885560
Product: mozilla.org → Release Engineering
What's left to do here?
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #17)
> What's left to do here?

I'm going to assume nothing.
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED