Closed Bug 1505216 Opened 7 years ago Closed 7 years ago

monitor all UPS for temperature

Categories

(Infrastructure & Operations :: MOC: Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: van, Assigned: ryanc)

Details

it doesn't look like TPE1's UPS alerted us of the temperature issue in TPE1 today (11/6/2018) before the devices started shutting down. can we make sure all the UPSes are monitoring temperature? or maybe they are, but something is broken so it didn't alert in #sysadmins? i do see UPS ups-red01.df401-1.private.tpe1 on the observium list [1] though.

van> nagios-mdc1: status ups-red01.df401-1.private.tpe1.mozilla.net:*
3:34 PM <nagios-mdc1> van: [networkops] ups-red01.df401-1.private.tpe1.mozilla.net:PING is OK - PING OK - Packet loss = 0%, RTA = 136.26 ms Last Checked: 2018-11-06 23:33:22 UTC
3:34 PM van: [networkops] ups-red01.df401-1.private.tpe1.mozilla.net:UPS Battery Replacement is OK - SNMP OK - Status 1 Last Checked: 2018-11-06 23:32:49 UTC
3:34 PM van: [networkops] ups-red01.df401-1.private.tpe1.mozilla.net:UPS Battery Status is OK - SNMP OK - Status 2 Last Checked: 2018-11-06 23:28:47 UTC
3:34 PM van: [networkops] ups-red01.df401-1.private.tpe1.mozilla.net:UPS Output Status is OK - SNMP OK - Status 2 Last Checked: 2018-11-06 23:31:09 UTC

[1] https://observium1.private.mdc2.mozilla.com/alert_check/alert_test_id=19/
i thought observium interacted with the irc bots. is it possible to add the temperature/humidity UPS check to nagios so we get the alerts in #sysadmins? thanks!
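For illustration only, a Nagios-style temperature check would typically follow the standard plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). The sketch below is not the check running on nagios-mdc1: the function name and the default thresholds (30 and 35 degrees C) are assumptions, and the reading is expected to come from an SNMP query that is not shown.

```python
from typing import Optional, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_temperature(
    temp_c: Optional[float],
    warn_c: float = 30.0,  # hypothetical warning threshold
    crit_c: float = 35.0,  # hypothetical critical threshold
) -> Tuple[int, str]:
    """Map a temperature reading to a Nagios exit code and status line."""
    if temp_c is None:
        # No reading available (for example, the SNMP query failed).
        return UNKNOWN, "TEMPERATURE UNKNOWN - no reading"
    if temp_c >= crit_c:
        return CRITICAL, f"TEMPERATURE CRITICAL - {temp_c:.1f} C"
    if temp_c >= warn_c:
        return WARNING, f"TEMPERATURE WARNING - {temp_c:.1f} C"
    return OK, f"TEMPERATURE OK - {temp_c:.1f} C"
```

A wrapper script would print the status line and exit with the returned code, which is what lets Nagios route the result to a notification channel.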
(In reply to Van Le [:van] from comment #0)
> it doesn't look like TPE1's UPS alerted of us the temperature issue in TPE1
> today (11/6/2018) before the devices started shutting down. can we make sure
> we have all the UPS monitoring temperature or maybe it is but something is
> broken so it didn't alert in #sysadmins? i see UPS
> ups-red01.df401-1.private.tpe1 on the observium [1]list though.
>
> van> nagios-mdc1: status ups-red01.df401-1.private.tpe1.mozilla.net:*
> 3:34 PM
> <nagios-mdc1> van: [networkops]
> ups-red01.df401-1.private.tpe1.mozilla.net:PING is OK - PING OK - Packet
> loss = 0%, RTA = 136.26 ms Last Checked: 2018-11-06 23:33:22 UTC
> 3:34 PM van: [networkops] ups-red01.df401-1.private.tpe1.mozilla.net:UPS
> Battery Replacement is OK - SNMP OK - Status 1 Last Checked: 2018-11-06
> 23:32:49 UTC
> 3:34 PM van: [networkops] ups-red01.df401-1.private.tpe1.mozilla.net:UPS
> Battery Status is OK - SNMP OK - Status 2 Last Checked: 2018-11-06 23:28:47
> UTC
> 3:34 PM van: [networkops] ups-red01.df401-1.private.tpe1.mozilla.net:UPS
> Output Status is OK - SNMP OK - Status 2 Last Checked: 2018-11-06 23:31:09
> UTC
>
> [1] https://observium1.private.mdc2.mozilla.com/alert_check/alert_test_id=19/

Yeah it did,
https://observium1.private.mdc2.mozilla.com/graphs/to=1541548780/device=25/type=device_temperature/from=1541462380/legend=yes/
https://mozilla.pagerduty.com/incidents/PODKQRR
Assignee: nobody → rchilds
Status: NEW → ASSIGNED
(In reply to Van Le [:van] from comment #1)
> i thought observium interacted with the irc bots. is it possible to add the
> temperature/humidty UPS check to nagios so we get the alerts in #sysadmins?
> thanks!

There's an IRC bot to query Observium, but it doesn't display alerts. The alerts do go to Slack, which I can invite you and others to. Let me know.
And regarding the title of this bug, "monitor all UPS for temperature": we do monitor all UPSes, as long as they're listed in the secrets file we've designated for this host, so that Puppet adds them automatically.
(In reply to Ryan C [:ryanc] (UTC-4) from comment #4)
> And in regards to the title of this bug, "monitor all UPS for temperature",
> we do monitor all ups', as long as they're in the secrets file we've
> designated for this host so that Puppet automatically adds them

puppet_secrets/hiera/nodes/observium1.private.mdc2.mozilla.com.yaml
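As a rough way to confirm that every UPS is covered, one could compare a list of expected UPS hostnames against the contents of that secrets file. The sketch below assumes nothing about the file's YAML structure (it is not shown in this bug) and only checks whether each hostname appears somewhere in the text; the function name and any host list passed to it are hypothetical.

```python
from typing import Iterable, List


def missing_from_secrets(secrets_text: str, ups_hosts: Iterable[str]) -> List[str]:
    """Return the hostnames that do not appear anywhere in the file text.

    This is a plain substring check, so a hostname that is a prefix of
    another listed hostname would be reported as present.
    """
    return [host for host in ups_hosts if host not in secrets_text]


if __name__ == "__main__":
    # Hypothetical file contents; the real key layout is unknown.
    sample = "devices:\n  - ups-red01.df401-1.private.tpe1.mozilla.net\n"
    expected = [
        "ups-red01.df401-1.private.tpe1.mozilla.net",
        "ups-example.invalid",  # made-up host used to show a miss
    ]
    print(missing_from_secrets(sample, expected))
```

Any hostname the function returns would be a UPS that Puppet has not been told to add to Observium.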
thanks for clearing this up :ryanc!
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED