Closed Bug 1168389 Opened 10 years ago Closed 10 years ago

Mark slaves as "idle" instead of "broken" if there is no pending queue for that pool

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: coop)

Details

(Whiteboard: [buildduty][slavehealth])

In slave health, the "broken" designation for slaves is often inaccurate. When there are no pending jobs for a given pool, it's expected that machines may not have reported a job status in 6+ hours, e.g. on weekends. We can make a pretty simple change to increase the accuracy here. Just like we do for AWS slave classes ("stopped"), we can mark the unused capacity for hardware slaves as "idle" instead of "broken" by default. Once we check the pending counts for that pool, we can change that status to "broken" iff that slave class has any pending jobs.
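The proposed rule can be sketched roughly as follows. This is only an illustration, not the actual slave health code: the function name `classify_slave`, its parameters, and the "working" status label are hypothetical, and the 6-hour threshold is taken from the "6+ hours" figure above.

```python
from datetime import datetime, timedelta

# Assumption: a slave counts as stale once it has gone 6+ hours without
# reporting a job status, per the description above.
STALE_AFTER = timedelta(hours=6)


def classify_slave(last_job_time, pending_for_pool, now):
    """Return a status string for a hardware slave.

    last_job_time    -- datetime of the last reported job status, or None
    pending_for_pool -- number of pending jobs for this slave's pool
    now              -- current time, passed in to keep the function testable
    """
    # A slave that reported recently is neither idle nor broken.
    # ("working" is a placeholder label for this sketch.)
    if last_job_time is not None and now - last_job_time < STALE_AFTER:
        return "working"

    # Stale slaves default to "idle"; they are only considered "broken"
    # if the pool has pending jobs that they could have been taking.
    return "broken" if pending_for_pool > 0 else "idle"
```

For example, a slave that last reported 8 hours ago would be "idle" on a quiet weekend with no pending jobs for its pool, and "broken" only if that pool had jobs waiting.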
Some good points brought up in IRC this morning:
* "broken" does not necessarily mean "broken" if all pending jobs are in jacuzzis. We need to update the slave_health page legend to point to jacuzzis regardless, so that this is discoverable rather than tribal knowledge.
* How does changing hardware machine status from "broken" to "idle" affect batch actions on the slave type page? We don't currently look up pending job counts for the slavetype.html page, so nothing would change here initially. Assuming we start checking pending counts here too, we could also change the intent of the "Reboot all broken slaves" action to work on "idle" slaves when we update the state.
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Component: Tools → General