Closed Bug 1493981 Opened 6 years ago Closed 5 years ago

Add alerts for powered off moonshots

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: fubar, Unassigned)

(In reply to Dave House [:dhouse] from comment #39)
> We have the moonshots configured to power off after 3 failed boots. If we're going to stick with that then we should be getting alerts when that happens, either from nagios or iLO.

Notes: The iLO alertmail sends alert email "for each IML log entry as it is added" (http://h17007.www1.hpe.com/docs/enterprise/servers/moonshot/webhelp/content/s_alertmail_commands.html). I did not find any IML or iLO event log entries for the 3 cartridges that shut off.

The boot retry count limit only applies to manually triggered network boot:

"""
Enabling or disabling Network Boot Retry Support

Use the Network Boot Retry Support option to enable or disable the network boot retry function. When enabled, the system BIOS attempts to boot the network device up to the number of times set in the Network Boot Retry Count option before attempting to boot the next network device. This setting only takes effect when attempting to boot a network device from the F12 function key and one-time boot options.
"""

We already use nagios to monitor the host on the node, so I think that rather than putting effort into monitoring the power status of the node, we should simply use the current nagios checks to determine if a node is not working. Deadman checks in grafana would be a better use of effort.

Also, we currently don't fire off notifications in nagios for workers since it would increase the noise in the slack channel. But we could change that and alert when a worker is offline for an extended period.
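
As a rough illustration of alerting only when a worker has been offline for an extended period, here is a minimal sketch of the decision logic in the style of a nagios plugin. The function name `check_worker` and both thresholds are hypothetical and not taken from our configuration; the exit codes follow the standard nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
from datetime import datetime, timedelta, timezone

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_worker(last_seen, now,
                 warn_after=timedelta(minutes=30),  # hypothetical threshold
                 crit_after=timedelta(hours=2)):    # hypothetical threshold
    """Return (exit_code, message) based on how long the worker has been silent."""
    silent_for = now - last_seen
    if silent_for >= crit_after:
        return CRITICAL, f"CRITICAL - worker silent for {silent_for}"
    if silent_for >= warn_after:
        return WARNING, f"WARNING - worker silent for {silent_for}"
    return OK, f"OK - worker last seen {silent_for} ago"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Example: a worker last seen three hours ago.
    code, message = check_worker(now - timedelta(hours=3), now)
    print(message)
    raise SystemExit(code)
```

Notifying only on the CRITICAL state, with a long enough `crit_after`, is one way this could keep short reboots out of the slack channel while still catching nodes that stay down.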

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX