Closed
Bug 1493981
Opened 6 years ago
Closed 5 years ago
Add alerts for powered off moonshots
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: fubar, Unassigned)
(In reply to Dave House [:dhouse] from comment #39)
> We have the moonshots configured to power off after 3 failed boots.
If we're going to stick with that then we should be getting alerts when that happens, either from nagios or iLO.
notes:
The iLO alertmail feature sends an alert email "for each IML log entry as it is added" (http://h17007.www1.hpe.com/docs/enterprise/servers/moonshot/webhelp/content/s_alertmail_commands.html).
However, I did not find any IML or iLO event log entries for the 3 cartridges that shut off, so alertmail would not have sent anything for them.
The boot retry count limit only applies to manually triggered network boot:
"""
Enabling or disabling Network Boot Retry Support
Use the Network Boot Retry Support option to enable or disable the network boot retry function. When
enabled, the system BIOS attempts to boot the network device up to the number of times set in the
Network Boot Retry Count option before attempting to boot the next network device. This setting only
takes effect when attempting to boot a network device from the F12 function key and one-time boot
options.
"""
Comment 2•5 years ago
We already use nagios to monitor the host on the node, so rather than putting effort into monitoring the power status of the node, I think we should simply use the current nagios checks to determine whether a node is not working. Deadman checks in grafana would be a better use of effort.
Also, we currently don't fire off notifications in nagios for workers since it would increase the noise in the slack channel. But we could change that and alert when a worker is offline for an extended period.
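As a rough illustration of the "offline for an extended period" idea, here is a minimal sketch of a deadman-style check. It is not how nagios or grafana implement this; the function name, the worker names, and the one-hour threshold are all hypothetical, and it assumes something else already records the time of each worker's last successful check.

```python
from datetime import datetime, timedelta, timezone


def offline_workers(last_seen, now, threshold=timedelta(hours=1)):
    """Return sorted names of workers whose last successful check
    is older than `threshold` (hypothetical default: one hour)."""
    return sorted(
        name for name, seen in last_seen.items() if now - seen > threshold
    )


# Hypothetical sample data: worker names and timestamps are made up.
now = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "worker-a": now - timedelta(minutes=5),  # checked in recently
    "worker-b": now - timedelta(hours=3),    # silent for three hours
}

print(offline_workers(last_seen, now))  # ['worker-b']
```

Alerting only once the threshold has passed, rather than on every failed check, is what would keep the extra noise in the slack channel down.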
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX