Closed Bug 888710 Opened 12 years ago Closed 11 years ago

fix ganglia machines talking to defunct sjc1 puppet server

Categories

(Infrastructure & Operations :: Infrastructure: Puppet, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: Atoll)

Details

We have a whole bunch of scl1 machines alerting about ganglia in nagios, e.g.

Sun 02:04:45 PDT [441] buildbot-master24.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[03:59:13] nagios-releng Sun 00:59:14 PDT [493] dev-master01.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:00:04] nagios-releng Sun 01:00:05 PDT [494] redis01.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:00:04] nagios-releng Sun 01:00:05 PDT [495] releng-puppet1.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:00:04] nagios-releng Sun 01:00:05 PDT [496] signing2.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:00:04] nagios-releng Sun 01:00:05 PDT [497] slavealloc.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:43] nagios-releng Sun 01:04:44 PDT [498] buildbot-master12.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:44] nagios-releng Sun 01:04:44 PDT [499] buildbot-master29.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:53] nagios-releng Sun 01:04:54 PDT [400] buildbot-master42.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:54] nagios-releng Sun 01:04:54 PDT [401] buildbot-master44.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:54] nagios-releng Sun 01:04:54 PDT [402] buildbot-master45.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:54] nagios-releng Sun 01:04:54 PDT [403] buildbot-master48.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:56] nagios-releng Sun 01:04:54 PDT [404] buildbot-master47.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)

among others. I restarted gmond on one buildbot master, with no effect on recovering the alert. This might not be a releng bug in the end, but I wanted to get it on file as fallout from the scl1 power outage first.
Nagios has all checks green for ganglia1.build.scl1.mozilla.com, but taking a look at the gmond service there seems like a good place to start.
Assignee: nobody → server-ops-releng
Component: Release Engineering: Machine Management → Server Operations: RelEng
QA Contact: armenzg → arich
A number of issues here.
1) ganglia1.build.scl1.mozilla.com lists puppet1.private.sjc1.mozilla.com as its puppet server. This host no longer exists, so ganglia1 dropped off active puppet management by infra some time ago. A direct swap-in of the scl3 puppet master didn't work (I didn't really expect it would), so I will catch up with atoll on this tomorrow.
2) As a result, gmetad.conf still listed a number of hosts that no longer exist (old, deleted buildbot-masters, etc.). I've modified gmetad.conf by hand and restarted gmetad, which at least seems to have solved the issue for the time being, but we'll need to hook this back up to a real puppet server until we decommission it when we replace ganglia with graphite.
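For anyone repeating the gmetad.conf cleanup, here is a minimal sketch of how the stale entries could be found automatically instead of by eye. It is not what was run on ganglia1. It assumes the standard `data_source "name" [interval] host[:port] ...` line format, treats "does not resolve in DNS" as a proxy for "host no longer exists", and the function names and the default path `/etc/ganglia/gmetad.conf` are assumptions for illustration.

```python
import shlex
import socket
import sys
from typing import Callable, List


def data_source_hosts(conf_text: str) -> List[str]:
    """Return the host names listed on data_source lines of a gmetad.conf."""
    hosts: List[str] = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplification)
        if not line.startswith("data_source"):
            continue
        try:
            tokens = shlex.split(line)  # keeps the quoted cluster name whole
        except ValueError:
            continue  # unbalanced quotes: skip the line rather than guess
        # tokens[0] is "data_source", tokens[1] is the cluster name.
        for token in tokens[2:]:
            if token.isdigit():
                continue  # optional polling interval in seconds
            hosts.append(token.rsplit(":", 1)[0])  # strip optional :port
    return hosts


def resolves(host: str) -> bool:
    """True if the name resolves; a weak stand-in for 'host still exists'."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def stale_hosts(conf_text: str,
                resolver: Callable[[str], bool] = resolves) -> List[str]:
    """Hosts named in data_source lines that the resolver cannot find."""
    return [h for h in data_source_hosts(conf_text) if not resolver(h)]


if __name__ == "__main__":
    # Assumed default location; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/ganglia/gmetad.conf"
    with open(path) as fh:
        for host in stale_hosts(fh.read()):
            print(host)
```

The resolver is a parameter so the parsing can be exercised without touching DNS. A host that still resolves but is powered off would not be caught by this check.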
Whiteboard: [buildduty][buildslaves][capacity]
I just double checked, and ganglia3.build.mtv1.mozilla.com has the same issue with the puppet master being the defunct one in sjc1.
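Since two hosts have now turned up pointing at the same dead master, a small check like the one below could be run against each machine's puppet.conf to find any others. This is a rough sketch only: the parser is deliberately simplified (it handles `[section]` headers and `key = value` lines, and prefers `[agent]` over `[main]`), the helper names are hypothetical, and the only defunct master listed is the one named in this bug.

```python
from typing import Dict, Optional, Set

# The one master this bug identifies as gone; extend as needed.
DEFUNCT_MASTERS: Set[str] = {"puppet1.private.sjc1.mozilla.com"}


def puppet_server(conf_text: str) -> Optional[str]:
    """Return the configured `server`, preferring [agent] over [main].

    Returns None when no server is set explicitly.
    """
    section = "main"  # simplification: settings before any header count as main
    found: Dict[str, str] = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "server":
            found[section] = value.strip()
    return found.get("agent", found.get("main"))


def uses_defunct_master(conf_text: str,
                        defunct: Set[str] = DEFUNCT_MASTERS) -> bool:
    """True if the config explicitly points at a master known to be gone."""
    server = puppet_server(conf_text)
    return server is not None and server.lower() in defunct
```

For example, `uses_defunct_master(open("/etc/puppet/puppet.conf").read())` would flag a host in the same state as ganglia1 and ganglia3; the path shown is an assumed location and may differ per install.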
Will coordinate with arr and SRE to repuppetize these up to current.
Assignee: server-ops-releng → rsoderberg
Component: Server Operations: RelEng → Infrastructure: Puppet
Product: mozilla.org → Infrastructure & Operations
QA Contact: arich → jdow
Summary: Ganglia Nagios Alerts for scl1 → fix ganglia machines talking to defunct sjc1 puppet server
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2838]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2838] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2844]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2844] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2846]
removing bogus whiteboard tag
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2846]
Pretty sure all the machines mentioned in this bug no longer exist under those hostnames, so this is probably good to close.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED