kittenherder doesn't currently maintain a history of reboot attempts for a given slave: if the same slave appears in the slaves_needing_reboot.txt list 6 hours later, kittenherder will merrily try to reboot it again. This is a great opportunity for kittenherder to recognize a pattern in slave behavior and file an appropriate bug (bug 859403), but to do so we need to start tracking reboot attempts persistently, and possibly also double-check whether a reboot attempt succeeded before waiting for the next cycle. Tracking state this way may also let us iterate more quickly over the list of slaves needing reboot, because we won't spend time on slaves that are in a known bad state. If at all possible, the reboot history should be kept in a format (and location) that is easily digestible by other reporting tools, e.g. slave_health.
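One possible shape for such a history, as a rough sketch only: append each attempt as a JSON line to a flat file so that other tools can parse it without importing kittenherder code. The function names, the record fields (`slave`, `ts`), and the skip thresholds are all hypothetical and are not taken from kittenherder or briar-patch.

```python
import json
import time
from pathlib import Path


def record_attempt(history_path, slave, now=None):
    """Append one reboot attempt to the history file as a JSON line."""
    entry = {"slave": slave, "ts": time.time() if now is None else now}
    with Path(history_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def load_attempts(history_path):
    """Return {slave: [timestamps, oldest first]} read from the history file."""
    attempts = {}
    path = Path(history_path)
    if not path.exists():
        return attempts
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                entry = json.loads(line)
                attempts.setdefault(entry["slave"], []).append(entry["ts"])
    for stamps in attempts.values():
        stamps.sort()
    return attempts


def slaves_to_skip(attempts, now, window_s=24 * 3600, max_attempts=3):
    """Slaves rebooted at least max_attempts times within the last window_s seconds.

    The thresholds are illustrative assumptions, not values from the bug.
    """
    return {
        slave
        for slave, stamps in attempts.items()
        if sum(1 for t in stamps if now - t <= window_s) >= max_attempts
    }
```

A reboot pass could call `slaves_to_skip` before iterating over slaves_needing_reboot.txt and file a bug for the skipped slaves instead of rebooting them again. One JSON object per line keeps appends cheap and lets a reporting tool such as slave_health read the file incrementally.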
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
I believe this helps buildduty but I will remove the tag to get it out of the buildduty query.
Whiteboard: [buildduty][slaveduty][dashboard][kittenherder] → [slaveduty][dashboard][kittenherder]
Component: Release Engineering: Machine Management → Release Engineering: Developer Tools
QA Contact: armenzg → hwine
https://github.com/mozilla/briar-patch/commit/5c701aaa0361978d9e576e91a675aacec47c871d It doesn't track outcomes, but I'm not sure how we would properly verify that unless we looped on slave state after a reboot attempt. Reboot commands can return success without actually yielding a functional machine out the other side. We can track this based on subsequent reboot attempts though, especially if we start iterating more quickly than every 6 hours.
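To illustrate the idea of inferring outcomes from subsequent attempts, here is a minimal sketch that is not part of the linked commit: an attempt is presumed to have failed if the same slave needed another reboot within roughly one cycle. The function name, the 6-hour default, and the `slack` factor are assumptions made for the example.

```python
def presumed_failures(timestamps, cycle_s=6 * 3600, slack=1.5):
    """Return the attempt times that were followed by another attempt soon after.

    timestamps: reboot attempt times (in seconds) for a single slave.
    An attempt counts as presumed failed when the next attempt happened
    within cycle_s * slack seconds; the last attempt is never classified
    because nothing follows it yet.
    """
    stamps = sorted(timestamps)
    limit = cycle_s * slack
    return [a for a, b in zip(stamps, stamps[1:]) if b - a <= limit]
```

For example, attempts at 0 h, 6 h, and 12 h followed by a quiet period would mark the first two as presumed failures. This is only a heuristic: a slave that rebooted cleanly and then broke again for an unrelated reason would be counted the same way.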
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: Tools → General
Product: Release Engineering → Release Engineering