On occasion, a build slave may burn a lot of jobs back-to-back before it is noticed and disabled. A recent example: a slave added to the build pool had a hard disk issue and burned a lot of jobs. We should figure out a way to detect back-to-back build bustage on a slave and automatically disable/isolate it to prevent more jobs from burning. An infra issue (e.g. a DNS outage) could cause a similar burning-builds situation but on a larger scale, so we still want to stop slaves from building in that case, but leave an audit trail so it's clear which build slaves to reinstate after the outage (i.e. not slaves that were disabled for other reasons).
I think the best thing to do here would be to write a pulse consumer that only looks for non-passing jobs, and notifies releng if any single slave is burning/retrying jobs at an elevated rate. Pulsebuildmonitor makes this an easy project to get started: http://hg.mozilla.org/automation/pulsebuildmonitor What's an elevated rate? Well, for starters I'd say failing 2 jobs in a row within an hour, or failing 5 jobs in a row regardless of timing. The notification could take many forms. We'd probably want to start out with emails to releng/buildduty until we tweak the checks to our liking. After that, we could notify *and* automatically disable/reboot the slave. We'd also want to be able to disable this checking easily when we have planned (TCW) or unexpected closures so we don't end up disabling a whole bunch of slaves for a systemic failure.
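To make the "elevated rate" rule concrete, here is a rough sketch of just the per-slave bookkeeping. It assumes some consumer hands us a slave name, a pass/fail flag, and a finish time for each job; wiring it to pulsebuildmonitor is not shown. The class and method names (`BurnDetector`, `record`) and the `enabled` switch are hypothetical, not an existing API.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

SHORT_STREAK = 2                   # 2 failures in a row...
SHORT_WINDOW = timedelta(hours=1)  # ...within an hour
LONG_STREAK = 5                    # or 5 in a row regardless of timing


class BurnDetector:
    """Hypothetical helper: tracks consecutive failed jobs per slave."""

    def __init__(self):
        # slave name -> finish times of its current run of failed jobs
        self._streaks = defaultdict(lambda: deque(maxlen=LONG_STREAK))
        # Flip to False during a TCW or tree closure to silence the checks.
        self.enabled = True

    def record(self, slave, passed, finished_at):
        """Record one job result; return True if the slave looks rogue."""
        if passed:
            # Any passing job ends the streak.
            self._streaks.pop(slave, None)
            return False
        streak = self._streaks[slave]
        streak.append(finished_at)
        if not self.enabled:
            return False
        if len(streak) >= LONG_STREAK:
            return True
        if len(streak) >= SHORT_STREAK:
            recent = list(streak)[-SHORT_STREAK:]
            return recent[-1] - recent[0] <= SHORT_WINDOW
        return False
```

A caller would email releng/buildduty whenever `record()` returns True, and only later graduate to disabling or rebooting the slave. Streaks are still tracked while `enabled` is False, so nothing is lost across a closure; whether that is the right behaviour is a judgment call.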
catlee also suggested doing this in runner. As part of either the pre- or post-job cleanup, runner could check the job history for the current machine via slaveapi. If the machine is in one of the two failure states mentioned in comment #1, it could disable the slave via slaveapi. I like this approach because each slave is responsible for itself. It does have the potential to increase the load on slaveapi by a non-trivial amount, if all 5000 slaves are checking in with slaveapi before/after every job. How do we handle the case where slaveapi is unreachable? Defer the check until the next job after some small timeout?
We could avoid touching slaveapi altogether if there were a local override to disable a slave, e.g. /builds/slave/DONT_START_ME. That might be better from a security standpoint too: do we really want all slaves to have access to slaveapi, especially when there are no ACLs in it? They'd be able to do things like reboot other machines...
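A minimal sketch of what the local override could look like, assuming the flag file doubles as the audit trail by recording when and why the slave was disabled. The function names and the file's contents are made up for illustration; only the path comes from the suggestion above.

```python
import os
import time

# Path suggested above; what goes inside the file is an assumption.
FLAG_PATH = "/builds/slave/DONT_START_ME"


def disable_slave(reason, flag_path=FLAG_PATH):
    """Create the override file, noting when and why (hypothetical format)."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with open(flag_path, "w") as f:
        f.write("%s\t%s\n" % (stamp, reason))


def disabled_reason(flag_path=FLAG_PATH):
    """Return the recorded reason if the slave is disabled, else None."""
    try:
        with open(flag_path) as f:
            # A bare `touch`ed file still counts as disabled.
            return f.read().strip() or "(no reason recorded)"
    except FileNotFoundError:
        return None
```

Whatever starts the slave would call `disabled_reason()` first and refuse to start if it returns anything other than None. Keeping the reason in the file would make it possible to reinstate only the slaves that were disabled during an outage, and leave alone the ones disabled for other reasons.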
Quis custodiet ipsos custodes? Runner has probably already solved the problem this bug was filed for (or could solve it, if it doesn't already require a disk write): a slave with a read-only disk would fail out of the first buildstep that requires writing, set RETRY, then take the retried job itself since it was already ready for another, and so on. But still... making the slave the only thing responsible for killing a rogue slave implies that there are no other states with the same outcome.