Closed
Bug 888710
Opened 12 years ago
Closed 11 years ago
fix ganglia machines talking to defunct sjc1 puppet server
Categories
(Infrastructure & Operations :: Infrastructure: Puppet, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Callek, Assigned: Atoll)
Details
We have a whole bunch of scl1 machines alerting about ganglia in nagios, e.g.
Sun 02:04:45 PDT [441] buildbot-master24.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[03:59:13] nagios-releng Sun 00:59:14 PDT [493] dev-master01.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:00:04] nagios-releng Sun 01:00:05 PDT [494] redis01.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:00:04] nagios-releng Sun 01:00:05 PDT [495] releng-puppet1.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:00:04] nagios-releng Sun 01:00:05 PDT [496] signing2.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:00:04] nagios-releng Sun 01:00:05 PDT [497] slavealloc.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:43] nagios-releng Sun 01:04:44 PDT [498] buildbot-master12.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:44] nagios-releng Sun 01:04:44 PDT [499] buildbot-master29.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:53] nagios-releng Sun 01:04:54 PDT [400] buildbot-master42.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:54] nagios-releng Sun 01:04:54 PDT [401] buildbot-master44.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:54] nagios-releng Sun 01:04:54 PDT [402] buildbot-master45.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:54] nagios-releng Sun 01:04:54 PDT [403] buildbot-master48.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
[04:04:56] nagios-releng Sun 01:04:54 PDT [404] buildbot-master47.build.scl1.mozilla.com:Ganglia IO is UNKNOWN: CHECKGANGLIA UNKNOWN: Error while getting value Host/value not found (http://m.allizom.org/Ganglia+IO)
among others.
I restarted gmond on one buildbot master, but it had no effect: the alert did not recover.
This might not be a releng bug in the end, but I wanted to get it on file first as fallout from the scl1 power outage.
Comment 1•12 years ago
Nagios shows all checks green for ganglia1.build.scl1.mozilla.com, but taking a look at the gmond service there seems like a good place to start.
Assignee: nobody → server-ops-releng
Component: Release Engineering: Machine Management → Server Operations: RelEng
QA Contact: armenzg → arich
Comment 2•12 years ago
A number of issues here.
1) ganglia1.build.scl1.mozilla.com lists puppet1.private.sjc1.mozilla.com as its puppet server. That host no longer exists, so ganglia1 dropped out of active puppet management by infra some time ago. Swapping in the scl3 puppet master directly didn't work (I didn't really expect it to), so I will catch up with atoll on this tomorrow.
2) As a result, a number of hosts still listed in gmetad.conf no longer exist (old, deleted buildbot-masters, etc.).
I've modified gmetad.conf by hand and restarted gmetad, which at least seems to have solved the issue for the time being, but we'll need to hook this host back up to a real puppet server until we decommission it when we replace ganglia with graphite.
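For anyone repeating the manual gmetad.conf cleanup, here is a rough sketch of how the stale entries from point 2 could be spotted. It is not the procedure used here. It assumes the usual `data_source "cluster" [interval] host[:port] ...` line format and treats "does not resolve in DNS" as a proxy for "host no longer exists". The function names `data_source_hosts` and `unresolvable` are made up for illustration.

```python
import shlex
import socket


def data_source_hosts(conf_text):
    """Return (cluster, host) pairs from gmetad.conf data_source lines."""
    pairs = []
    for raw in conf_text.splitlines():
        # comments=True drops anything after an unquoted '#'.
        tokens = shlex.split(raw, comments=True)
        if len(tokens) < 3 or tokens[0] != "data_source":
            continue
        cluster, rest = tokens[1], tokens[2:]
        # An optional polling interval (integer seconds) may precede the hosts.
        if rest[0].isdigit():
            rest = rest[1:]
        for entry in rest:
            # Strip a trailing :port if present (IPv6 literals are not handled).
            host = entry.rsplit(":", 1)[0] if ":" in entry else entry
            pairs.append((cluster, host))
    return pairs


def unresolvable(pairs, resolve=socket.gethostbyname):
    """Return the (cluster, host) pairs whose host fails name resolution."""
    stale = []
    for cluster, host in pairs:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            stale.append((cluster, host))
    return stale
```

Feeding it the contents of gmetad.conf and printing `unresolvable(data_source_hosts(text))` would give a list of candidates to review by hand. A host that resolves may still be dead, so this only narrows the search.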
Updated•12 years ago
Whiteboard: [buildduty][buildslaves][capacity]
Comment 3•12 years ago
I just double-checked: ganglia3.build.mtv1.mozilla.com has the same issue, with its puppet master set to the defunct one in sjc1.
I will coordinate with arr and SRE to repuppetize these hosts and bring them up to current.
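To find any other hosts in the same state, a check along these lines might help. It is only a sketch: it assumes the agent's master is set by a `server = ...` line in an INI-style puppet.conf, and the helper names are hypothetical. The defunct hostname is the one named in Comment 2.

```python
DEFUNCT_MASTER = "puppet1.private.sjc1.mozilla.com"  # host named in Comment 2


def puppet_servers(conf_text):
    """Return {section: server} for each `server = ...` line in puppet.conf text."""
    servers = {}
    section = "main"  # assumption: settings before any [header] count as main
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "server":
            servers[section] = value.strip()
    return servers


def points_at_defunct_master(conf_text, defunct=DEFUNCT_MASTER):
    """True if any section of the config names the defunct master."""
    return defunct in puppet_servers(conf_text).values()
```

Running `points_at_defunct_master` over each host's puppet.conf contents would flag machines that still need repuppetizing.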
Assignee: server-ops-releng → rsoderberg
Updated•12 years ago
Component: Server Operations: RelEng → Infrastructure: Puppet
Product: mozilla.org → Infrastructure & Operations
QA Contact: arich → jdow
Summary: Ganglia Nagios Alerts for scl1 → fix ganglia machines talking to defunct sjc1 puppet server
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2838]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2838] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2844]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2844] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2846]
Comment 5•11 years ago
Removing bogus whiteboard tag.
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2846]
Comment 6•11 years ago
Pretty sure all the machines mentioned in this bug no longer exist under those hostnames, so this is probably good to close.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED