enable nagios check of puppet agent status on bm32

Status

RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: hwine, Assigned: ashish)

(Reporter)

Description

6 years ago
Trial deployment of a new check that the puppet agent has run successfully recently (see bug 685527#c4 for details).

Please add checking of "check_puppet_agent" on buildbot-master32 via NRPE, with notifications disabled. Sample nagios service configuration given at:
 <https://github.com/hwine/nagios-plugins/blob/master/check_puppet_agent>

After some burn-in and tuning, we'll deploy via puppet on a broader scale and then ask for general activation in another bug.
(Reporter)

Comment 1

6 years ago
To clarify, please use this line for enabling the service:
  check_command check_nrpe!check_puppet_agent!3600!7200
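A minimal sketch of what the two trailing numbers could mean, assuming they are the warning and critical thresholds in seconds since the last successful agent run. That reading is an assumption, not something confirmed by the plugin source. The function name `classify_puppet_age`, the boundary handling (`>=`), and the message text are hypothetical; only the exit codes follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_puppet_age(age_seconds, warn=3600, crit=7200):
    """Map seconds since the last puppet agent run to (exit_code, message).

    Assumption: warn/crit correspond to the 3600 and 7200 arguments passed
    through check_nrpe. The real check_puppet_agent plugin may differ.
    """
    if age_seconds is None or age_seconds < 0:
        return UNKNOWN, "UNKNOWN: cannot determine last puppet agent run"
    if age_seconds >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif age_seconds >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, f"{label}: Puppet agent last run: {age_seconds} sec ago"


if __name__ == "__main__":
    for age in (600, 4000, 9000, None):
        print(classify_puppet_age(age))
```

Under these assumptions, a host whose agent last ran under an hour ago is OK, one between one and two hours is WARNING, and anything older is CRITICAL.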

Comment 2

6 years ago
I've added the check to the existing nagios with notifications disabled. Rick, please make sure to copy this check from admin1.infra.scl1.mozilla.com to nagios1.private.releng.scl3.mozilla.com when you migrate things today.
Assignee: server-ops-releng → rbryce
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → phong

Comment 3

6 years ago
This is no longer needed, as releng is staying put on scl1.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → INVALID
(Reporter)

Comment 4

6 years ago
The move to scl3 may be invalid, but tracking this on scl1 nagios is not. Moving back to server-ops-releng and marking fixed instead.
Assignee: rbryce → server-ops-releng
Component: Server Operations → Server Operations: RelEng
QA Contact: phong → arich
Resolution: INVALID → FIXED
(Reporter)

Comment 5

5 years ago
(In reply to Rick Bryce [:rbryce] from comment #3)
> This is no longer needed, as releng is staying put on scl1.

Times have changed: we're in scl3. Please enable the check per comment 1 for buildbot-master32 on the nagios server at http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/

The plugin functions correctly locally; I want to triple-check that it works okay under nagios before rolling it out to all hosts.
Assignee: server-ops-releng → server-ops
Status: RESOLVED → REOPENED
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → shyam
Resolution: FIXED → ---

Comment 6

5 years ago
Given that we're not building masters with old-puppet any more, and that all current masters are on KVM and thus will be replaced in the move to scl3, is this still necessary?
(Assignee)

Updated

5 years ago
Flags: needinfo?(hwine)
(Reporter)

Comment 7

5 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> Given that we're not building masters with old-puppet any more, and that all
> current masters are on KVM and thus will be replaced in the move to scl3, is
> this still necessary?

Yes - it will be deployed on ALL puppetized non-talos machines. I'm just starting with this older one since the check used to work there, before nagios, etc. were upgraded. That makes it easier to troubleshoot.

And, yes, we want a nagios alert for this condition. As I understood puppetAgain, the dashboard will flag the error, but not trigger a nagios alert.
Flags: needinfo?(hwine)

Comment 8

5 years ago
We also get an email for every failed puppet run, in the releng-shared mailbox. I really don't think a nagios alert is necessary.

Comment 9

5 years ago
And I should add, at least Callek and I check those religiously. I'd like to know that others are watching that mailbox, too.
(Reporter)

Comment 10

5 years ago
Per IRC chat with Dustin, we can proceed with hooking this up.
(Assignee)

Comment 11

5 years ago
OK, there is no buildbot-master32:

Host buildbot-master32.srv.releng.scl3.mozilla.com not found: 3(NXDOMAIN)

Or am I missing something? :)
Assignee: server-ops → ashish

Comment 12

5 years ago
That was one of the buildbot-masters that was recently decommissioned.
(Reporter)

Comment 13

5 years ago
(In reply to Ashish Vijayaram [:ashish] from comment #11)
> OK, there is no buildbot-master32:
> 
> Host buildbot-master32.srv.releng.scl3.mozilla.com not found: 3(NXDOMAIN)
> 
> Or am I missing something? :)

No, I am - it was there when I started testing. I'll move my setup, then update this request. Taking out of your queue for now.
Assignee: ashish → hwine
(Reporter)

Comment 14

5 years ago
Okay, build-master12 is even older than 32 was (puppet version) -- I'll have to do some work there to support the plugin.

Ashish - can you hook up buildbot-master63.srv.releng.use1.mozilla.com (in AWS), please? The plugin runs clean there.

Thanks!
Assignee: hwine → ashish
(Assignee)

Comment 15

5 years ago
Done!

< nagios-releng> ashish: buildbot-master63.srv.releng.use1.mozilla.com:Puppet freshness is OK - OK: Puppet agent last run: 1706 sec ago
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago → 5 years ago
Resolution: --- → FIXED
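
For anyone scripting against the bot output quoted in comment 15, here is a rough sketch of pulling the host, service, state, and age out of such a line. The layout (`host:service is STATE - plugin output`) is inferred from that single quoted line and is an assumption; `parse_status_line` is a hypothetical helper, not part of nagios or the plugin.

```python
import re

# Assumed layout, inferred from one example line:
#   <host>:<service> is <STATE> - <plugin output>
STATUS_RE = re.compile(
    r"^(?P<host>[^:\s]+):(?P<service>.+?) is "
    r"(?P<state>OK|WARNING|CRITICAL|UNKNOWN) - (?P<output>.*)$"
)
AGE_RE = re.compile(r"last run: (\d+) sec ago")


def parse_status_line(line):
    """Return a dict with host, service, state, output, and age_seconds.

    Returns None if the line does not match the assumed layout.
    age_seconds is None when the plugin output carries no age.
    """
    match = STATUS_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    age = AGE_RE.search(fields["output"])
    fields["age_seconds"] = int(age.group(1)) if age else None
    return fields


if __name__ == "__main__":
    sample = (
        "buildbot-master63.srv.releng.use1.mozilla.com:Puppet freshness "
        "is OK - OK: Puppet agent last run: 1706 sec ago"
    )
    print(parse_status_line(sample))
```

The IRC nick prefix is left out of the sample on purpose; strip it before calling the helper.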
(Reporter)

Comment 16

5 years ago
(In reply to Hal Wine [:hwine] from comment #14)
> Okay, build-master12 is even older than 32 was (puppet version) -- I'll have
> to do some work there to support the plugin.

No update needed, see bug 685527 comment 13
Product: mozilla.org → mozilla.org Graveyard