Closed Bug 865727 Opened 11 years ago Closed 9 years ago

Detect and disable build slaves that are rapidly burning jobs

Tracking

(Not tracked)

Status:

RESOLVED WONTFIX

People

(Reporter: jhopkins, Unassigned, Mentored)

References

Details

John Hopkins (:jhopkins)

Reporter

Description

•

11 years ago

On occasion, a build slave may burn a lot of jobs back-to-back before it is noticed and disabled. A recent example: a slave newly added to the build pool had a hard disk issue and burned a lot of jobs.

We should figure out a way to detect back-to-back build bustage on a slave and automatically disable/isolate it to prevent more jobs from burning.

An infra issue (e.g. a DNS outage) could cause a similar burning-builds situation on a larger scale. We would still want to stop slaves from building in that case, but leave an audit trail so it's clear which build slaves to reinstate after the outage (i.e. not slaves that were disabled for other reasons).

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Product: mozilla.org → Release Engineering

bhearsum@mozilla.com (:bhearsum)

Updated

•

11 years ago

Priority: -- → P3

Armen [:armenzg]

Updated

•

11 years ago

Severity: normal → enhancement

Chris Cooper [:coop] (he/him)

Updated

•

10 years ago

Component: Buildduty → Tools

Armen [:armenzg]

Updated

•

10 years ago

QA Contact: armenzg → hwine

Chris Cooper [:coop] (he/him)

Comment 1

•

10 years ago

I think the best thing to do here would be to write a pulse consumer that only looks for non-passing jobs, and notifies releng if any single slave is burning/retrying jobs at an elevated rate.

Pulsebuildmonitor makes this an easy project to get started:

http://hg.mozilla.org/automation/pulsebuildmonitor

What's an elevated rate? Well, for starters I'd say failing 2 jobs in a row within an hour, or failing 5 jobs in a row regardless of timing.
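A minimal sketch of that threshold check, assuming each slave's job history is available as a list of (finish time, passed) pairs ordered oldest first. The function name `is_burning`, its parameters, and the reading of "within an hour" as "the two most recent failures finished at most an hour apart" are illustrative assumptions, not part of pulsebuildmonitor.

```python
from datetime import datetime, timedelta


def is_burning(results, window=timedelta(hours=1), short_run=2, long_run=5):
    """Return True if the trailing run of failures hits either threshold.

    results: list of (finished_at: datetime, passed: bool), oldest first.
    This data shape is an assumption made for the sketch.
    """
    # Collect finish times of the consecutive failures at the end of the history.
    streak = []
    for finished_at, passed in reversed(results):
        if passed:
            break
        streak.append(finished_at)  # newest failure first

    # Rule 2: long_run failures in a row, regardless of timing.
    if len(streak) >= long_run:
        return True

    # Rule 1: the last short_run failures all finished within `window`.
    if len(streak) >= short_run:
        return streak[0] - streak[short_run - 1] <= window

    return False
```

Keeping the thresholds as parameters makes it cheap to tune the checks while the notifications are still going to email only.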

The notification could take many forms. We'd probably want to start out with emails to releng/buildduty until we tweak the checks to our liking. After that, we could notify *and* automatically disable/reboot the slave.

We'd also want to be able to disable this checking easily when we have planned (TCW) or unexpected closures so we don't end up disabling a whole bunch of slaves for a systemic failure.
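The staged rollout and the closure switch described above could be sketched as a single decision function. The names `decide_action`, `checks_paused`, and `auto_disable` are hypothetical; how the pause flag would actually be set (config file, chat command, etc.) is left open here.

```python
def decide_action(burning, checks_paused=False, auto_disable=False):
    """Map a detection result to an action name.

    burning:       result of the per-slave failure check
    checks_paused: True during a planned (TCW) or unexpected closure
    auto_disable:  False while the checks are still being tuned (email only)
    """
    if checks_paused or not burning:
        # During a systemic failure we deliberately do nothing per-slave.
        return "none"
    if auto_disable:
        return "notify_and_disable"
    return "notify"
```

For example, `decide_action(True, checks_paused=True)` yields `"none"`, so a tree closure does not disable a whole pool of slaves.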

Chris Cooper [:coop] (he/him)

Updated

•

10 years ago

Mentor: coop

Chris Cooper [:coop] (he/him)

Comment 2

•

10 years ago

catlee also suggested doing this in runner. 

As part of either the pre- or post-job cleanup, runner could check the job history for the current machine via slaveapi. If the machine is in one of the two failure states mentioned in comment #1, it could disable the slave via slaveapi.

I like this approach because each slave is responsible for itself.

It does have the potential to increase the load on slaveapi by a non-trivial amount, if all 5000 slaves are checking in with slaveapi before/after every job. 

How do we handle the case where slaveapi is unreachable? Defer the check until the next job after some small timeout?
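One possible shape for that runner-side check, with the "defer until the next job" behaviour on failure. This is a sketch under stated assumptions: `fetch_history`, `is_burning`, and `disable_self` are hypothetical callables supplied by the caller, and nothing here reflects the real slaveapi interface. Any timeout is assumed to be enforced inside `fetch_history` and surfaced as an `OSError` subclass such as `TimeoutError` or `ConnectionError`.

```python
def post_job_check(fetch_history, is_burning, disable_self, state):
    """Run after a job; returns "ok", "disabled", or "deferred".

    fetch_history: callable returning this machine's recent job results
                   (hypothetical; may raise OSError if the service is down)
    is_burning:    callable applying the failure thresholds to that history
    disable_self:  callable that takes this slave out of the pool
    state:         dict persisted between jobs, used to remember a deferral
    """
    try:
        history = fetch_history()
    except OSError:
        # Service unreachable or timed out: do not block the slave,
        # just remember that the check is owed and retry after the next job.
        state["deferred"] = True
        return "deferred"

    state["deferred"] = False
    if is_burning(history):
        disable_self()
        return "disabled"
    return "ok"
```

Deferring rather than retrying in a loop keeps an unreachable service from stalling every slave at once, at the cost of one extra job possibly burning before the check runs.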

bhearsum@mozilla.com (:bhearsum)

Comment 3

•

10 years ago

We could avoid touching slaveapi altogether if there was a local override to disable a slave, e.g. /builds/slave/DONT_START_ME. That might be better from a security standpoint too: do we really want all slaves to have access to slaveapi, especially when there are no ACLs in it? They'd be able to do things like reboot other machines...
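A minimal sketch of such a local override, assuming the startup script simply refuses to start when the marker file exists. The function names are hypothetical, and writing a timestamped reason into the file is an added assumption, intended to provide the audit trail asked for in the description.

```python
import os
from datetime import datetime, timezone


def disable_locally(marker_path, reason):
    """Create (or append to) the marker file, recording when and why."""
    with open(marker_path, "a") as marker:
        stamp = datetime.now(timezone.utc).isoformat()
        marker.write(f"{stamp} {reason}\n")


def may_start(marker_path):
    """The slave may start only if no marker file is present."""
    return not os.path.exists(marker_path)
```

In use, the startup path would call something like `may_start("/builds/slave/DONT_START_ME")` before launching the build process; after an outage, the recorded reasons show which slaves were disabled by the automatic check and are safe to reinstate.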

Phil Ringnalda (:philor)

Comment 4

•

10 years ago

Quis custodiet ipsos custodes?

Runner has probably already solved the problem this bug was filed for (or could solve it, if it doesn't already require a disk write): a slave with a read-only disk fails out of the first buildstep that requires writing, sets RETRY, then takes the retried job since it is already ready for another, and so on. But still... making the slave the only thing responsible for killing a rogue slave implies that there are no other states with the same outcome.

Chris Cooper [:coop] (he/him)

Updated

•

9 years ago

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → WONTFIX

Chris Cooper [:coop] (he/him)

Updated

•

9 years ago

Updated

•

7 years ago

Component: Tools → General


Categories

(Release Engineering :: General, enhancement, P3)
