Closed
Bug 865727
Opened 11 years ago
Closed 9 years ago
Detect and disable build slaves that are rapidly burning jobs
Categories
(Release Engineering :: General, enhancement, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: jhopkins, Unassigned, Mentored)
References
Details
On occasion, a build slave may burn many jobs back-to-back before anyone notices and disables it. In one recent example, a slave added to the build pool had a hard disk issue and burned a lot of jobs. We should find a way to detect back-to-back build bustage on a slave and automatically disable/isolate it to prevent more jobs from burning.

An infra issue (e.g. a DNS outage) could cause a similar burning-builds situation on a larger scale. We still want to stop slaves from building in that case, but we should leave an audit trail so it's clear which build slaves to reinstate after the outage (i.e. not the slaves that were disabled for other reasons).
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•11 years ago
Priority: -- → P3
Updated•11 years ago
Severity: normal → enhancement
Updated•10 years ago
Component: Buildduty → Tools
Updated•10 years ago
QA Contact: armenzg → hwine
Comment 1•10 years ago
I think the best thing to do here would be to write a pulse consumer that looks only for non-passing jobs and notifies releng if any single slave is burning/retrying jobs at an elevated rate. Pulsebuildmonitor makes this an easy project to get started: http://hg.mozilla.org/automation/pulsebuildmonitor

What's an elevated rate? For starters, I'd say failing 2 jobs in a row within an hour, or failing 5 jobs in a row regardless of timing.

The notification could take many forms. We'd probably want to start with emails to releng/buildduty until we tweak the checks to our liking. After that, we could notify *and* automatically disable/reboot the slave.

We'd also want to be able to disable this checking easily when we have planned (TCW) or unexpected closures, so we don't end up disabling a whole bunch of slaves for a systemic failure.
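A minimal sketch of the detection logic with the thresholds suggested above. The pulsebuildmonitor callback wiring is omitted; `BurnDetector.record` is a hypothetical hook you'd call once per finished job, not part of pulsebuildmonitor's API:

```python
import time
from collections import defaultdict

# Thresholds suggested above: 2 consecutive failures within an hour,
# or 5 consecutive failures regardless of timing.
WINDOW_SECS = 3600
FAST_BURN = 2
SLOW_BURN = 5

class BurnDetector:
    """Tracks consecutive non-passing jobs per slave."""

    def __init__(self):
        # slave name -> timestamps of failures since the slave's last pass
        self.failures = defaultdict(list)

    def record(self, slave, passed, ts=None):
        """Record one job result; return True if the slave looks burned."""
        ts = time.time() if ts is None else ts
        if passed:
            # Any passing job resets the streak.
            self.failures[slave] = []
            return False
        fails = self.failures[slave]
        fails.append(ts)
        if len(fails) >= SLOW_BURN:
            return True
        # FAST_BURN consecutive failures inside the one-hour window.
        if len(fails) >= FAST_BURN and fails[-1] - fails[-FAST_BURN] <= WINDOW_SECS:
            return True
        return False
```

When `record` returns True, the consumer would send the releng/buildduty email (and later, disable/reboot the slave). A global kill switch for TCWs would just make `record` return False unconditionally.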
Updated•10 years ago
Mentor: coop
Comment 2•10 years ago
catlee also suggested doing this in runner. As part of either the pre- or post-job cleanup, runner could check the job history for the current machine via slaveapi. If the machine is in one of the two failure states mentioned in comment #1, it could disable the slave via slaveapi. I like this approach because each slave is responsible for itself.

It does have the potential to increase the load on slaveapi by a non-trivial amount if all 5000 slaves are checking in with slaveapi before/after every job. And how do we handle the case where slaveapi is unreachable? Defer the check until the next job after some small timeout?
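A hedged sketch of what such a runner task might look like. The `/slaves/<name>/jobs` endpoint and response shape are assumptions, not slaveapi's documented API, and the fetch/disable calls are injectable so the defer-on-unreachable behavior can be exercised without a live slaveapi:

```python
import json
from urllib.request import urlopen

FAILURE_STREAK = 5  # consecutive failures that warrant disabling (see comment 1)

def recent_results(slaveapi_base, slave, fetch=None):
    """Fetch this slave's recent job results from slaveapi.

    Returns a list of booleans (True = passed), or None if slaveapi is
    unreachable or returns garbage.
    """
    fetch = fetch or (lambda url: urlopen(url, timeout=10).read())
    try:
        data = json.loads(fetch("%s/slaves/%s/jobs" % (slaveapi_base, slave)))
        return [job["passed"] for job in data["jobs"]]
    except (OSError, ValueError, KeyError):
        return None  # defer the check rather than block the job

def maybe_disable(slaveapi_base, slave, disable, fetch=None):
    """Disable the slave if its recent history is an unbroken failure streak."""
    results = recent_results(slaveapi_base, slave, fetch=fetch)
    if results is None:
        return "deferred"       # slaveapi unreachable; try again next job
    recent = results[-FAILURE_STREAK:]
    if len(recent) == FAILURE_STREAK and not any(recent):
        disable(slave)          # e.g. a slaveapi "disable" call, or comment 3's sentinel file
        return "disabled"
    return "ok"
```

Deferring on unreachability (rather than refusing to run) seems like the right default here: a slaveapi outage shouldn't close the trees, and the next job's check will catch a genuinely burning slave.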
Comment 3•10 years ago
We could avoid touching slaveapi altogether if there were a local override to disable a slave, e.g. /builds/slave/DONT_START_ME. That might be better from a security standpoint too: do we really want all slaves to have access to slaveapi, especially when there are no ACLs in it? They'd be able to do things like reboot other machines...
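The local-override check could be as small as this. The sentinel path is the one suggested above; the surrounding start-script logic is a hypothetical sketch:

```python
import os
import sys

# Sentinel file suggested above: its presence means "do not start buildbot here".
OVERRIDE = "/builds/slave/DONT_START_ME"

def should_start(override=OVERRIDE):
    """Return False if the local kill switch is present."""
    return not os.path.exists(override)

if __name__ == "__main__":
    if not should_start():
        sys.stderr.write("%s exists; refusing to start\n" % OVERRIDE)
        sys.exit(1)
    # ...proceed with normal slave startup here...
```

A nice property of the file-based switch is the audit trail the bug description asks for: whatever writes the sentinel can put the reason and timestamp in the file's contents, so it's obvious after an outage which slaves were disabled for systemic reasons versus hardware faults.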
Comment 4•10 years ago
Quis custodiet ipsos custodes? Runner has probably already solved the problem this bug was filed for (or could solve it, if it doesn't already require a disk write): a slave with a read-only disk would fail out of the first buildstep that requires writing, set RETRY, then take the retried job since it was already ready for another, and so on. But still: making the slave the only thing responsible for killing a rogue slave assumes there are no other failure states with the same outcome.
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Updated•7 years ago
Component: Tools → General