Create a persistent history of slave reboot attempts and outcomes for kittenherder

RESOLVED FIXED

Status

P2
normal
RESOLVED FIXED
6 years ago
2 years ago

People

(Reporter: coop, Assigned: coop)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [slaveduty][dashboard][kittenherder])

Description

6 years ago
kittenherder doesn't currently maintain a history of reboot attempts for a given slave: if the same slave appears in the slaves_needing_reboot.txt list 6 hours later, kittenherder will merrily try to reboot it again. This is a great opportunity for kittenherder to recognize a pattern in slave behavior and file an appropriate bug (bug 859403). To do that, we need to start tracking reboot attempts persistently, and possibly also check whether a reboot attempt actually succeeded before waiting for the next cycle.

If we begin tracking state this way, it may allow us to iterate more quickly over the list of slaves needing reboot because we won't spend time on slaves that are in a known bad state.

If at all possible, the reboot history should be kept in a format (and location) that is easily digestible by other reporting tools, e.g. slave_health.
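
One possible shape for such a history is an append-only file with one JSON object per line, which other tools could read without importing kittenherder. The following is only a sketch under assumptions: the file name, the field names, the example method values, and the functions `record_reboot_attempt` and `load_history` are hypothetical and are not taken from kittenherder or slave_health.

```python
import json
import time
from pathlib import Path

# Hypothetical location; this bug does not say where the history should live.
HISTORY_PATH = Path("reboot_history.jsonl")


def record_reboot_attempt(slave, method, command_ok, path=HISTORY_PATH, now=None):
    """Append one reboot attempt to the history as a single JSON line."""
    entry = {
        "slave": slave,
        "timestamp": int(now if now is not None else time.time()),
        # Illustrative values such as "ssh" or "ipmi"; not kittenherder's own names.
        "method": method,
        # Only records that the reboot command reported success,
        # not that the slave actually came back healthy.
        "command_ok": bool(command_ok),
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def load_history(path=HISTORY_PATH):
    """Return all recorded attempts, oldest first; empty if no file exists yet."""
    path = Path(path)
    if not path.exists():
        return []
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending one line per attempt keeps each write small and means a reporting tool only needs a JSON parser to consume the history.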

Updated

6 years ago
Blocks: 878051

Updated

6 years ago
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2

Comment 1

5 years ago
I believe this helps buildduty but I will remove the tag to get it out of the buildduty query.
Whiteboard: [buildduty][slaveduty][dashboard][kittenherder] → [slaveduty][dashboard][kittenherder]

Updated

5 years ago
Component: Release Engineering: Machine Management → Release Engineering: Developer Tools
QA Contact: armenzg → hwine

Comment 2

5 years ago
https://github.com/mozilla/briar-patch/commit/5c701aaa0361978d9e576e91a675aacec47c871d

It doesn't track outcomes, but I'm not sure how we would properly verify that unless we looped on slave state after a reboot attempt. Reboot commands can return success without actually yielding a functional machine out the other side.

We can track this based on subsequent reboot attempts though, especially if we start iterating more quickly than every 6 hours.
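
As a rough illustration of inferring outcomes from subsequent attempts, the sketch below flags any slave that was rebooted several times within a time window. It is not what the linked commit does: the function name `slaves_with_repeat_reboots`, the `(slave, unix_timestamp)` input shape, the 24-hour window, and the threshold of 3 are all assumptions chosen for the example.

```python
from collections import defaultdict

SIX_HOURS = 6 * 60 * 60  # the cycle length mentioned in this bug, in seconds


def slaves_with_repeat_reboots(attempts, window_seconds=4 * SIX_HOURS, threshold=3):
    """Return slaves with at least `threshold` attempts inside any window.

    `attempts` is an iterable of (slave, unix_timestamp) pairs; this input
    shape is an assumption made for the sketch.
    """
    by_slave = defaultdict(list)
    for slave, timestamp in attempts:
        by_slave[slave].append(timestamp)

    flagged = []
    for slave, stamps in by_slave.items():
        stamps.sort()
        start = 0
        # Slide a window over the sorted timestamps for this slave.
        for end in range(len(stamps)):
            while stamps[end] - stamps[start] > window_seconds:
                start += 1
            if end - start + 1 >= threshold:
                flagged.append(slave)
                break
    return sorted(flagged)


# Hypothetical slave names and timestamps, for illustration only.
attempts = [
    ("slave-a", 0),
    ("slave-a", SIX_HOURS),
    ("slave-a", 2 * SIX_HOURS),
    ("slave-b", 0),
    ("slave-b", 1_000_000),
]
print(slaves_with_repeat_reboots(attempts))  # ['slave-a']
```

A slave flagged this way is a candidate for a filed bug rather than yet another reboot, and a shorter cycle would simply surface the pattern sooner.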
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: Tools → General
Product: Release Engineering → Release Engineering