Closed Bug 1000210 Opened 10 years ago Closed 6 years ago

slaverebooter should file bugs for and/or attempt to reboot machines that have retried more than the last N jobs

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: emorley, Unassigned)

References

Details

(Keywords: sheriffing-P1, Whiteboard: slaveapi)

Jobs that fail with a buildbot result of "RETRY" are (correctly) ignored for most of the TBPL starring workflow, but this means that bad machines can go on a rampage, chewing through jobs. Unless someone looking at TBPL has onlyunstarred=1 mode turned off and is paying attention, this can go unnoticed for several hours, if not days.

Ideally slaverebooter would file a bug for, and possibly also reboot, machines that have had a buildbot result of RETRY for more than the last N jobs they have performed (some of the retry failure modes, like the tools repo clone not working, can be fixed by a reboot alone). I would imagine an N of 10-20 might be a good place to start.
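
To make the proposed check concrete, here is a minimal sketch, not existing slaverebooter code. It assumes a machine's job history is available as a list of buildbot numeric result codes ordered newest first, and it reads "more than the last N jobs" as "at least the N most recent jobs all ended in RETRY". The function names and the default N are hypothetical.

```python
# Buildbot's numeric result code for RETRY
# (SUCCESS=0, WARNINGS=1, FAILURE=2, SKIPPED=3, EXCEPTION=4, RETRY=5).
RETRY = 5


def trailing_retries(results):
    """Count consecutive RETRY results, starting from the most recent job.

    `results` is assumed to be ordered newest first (an assumption of this
    sketch, not something the bug specifies).
    """
    count = 0
    for result in results:
        if result != RETRY:
            break
        count += 1
    return count


def needs_attention(results, n=10):
    """Return True when at least the `n` most recent jobs all ended in RETRY."""
    return trailing_retries(results) >= n
```

A machine whose history looks like `[5, 5, 5, 0, 5]` has three trailing retries, so with N=10 it would be left alone; a single successful job resets the streak.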

At the moment, the sheriffs have to manually go through recent blue jobs on TBPL, load the slave health page for each using the link in the bottom left of the UI, and look at the job history table. Joel is working on something that would be able to help in longer-term cases, such as intermittent failures caused by a bad SD card, but this bug would take care of the "retrying machine chewing through everything in sight over the last hour" case.
Ben, is this something that would be easily doable? :-)
Flags: needinfo?(bhearsum)
(In reply to Ed Morley [:edmorley UTC+0] from comment #1)
> Ben, is this something that would be easily doable? :-)

I'm busy with a high priority B2G item right now, I probably won't be able to respond soon. Callek may be able to help you out in the meantime.
(In reply to Ben Hearsum [:bhearsum] from comment #2)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #1)
> > Ben, is this something that would be easily doable? :-)
> 
> I'm busy with a high priority B2G item right now, I probably won't be able
> to respond soon. Callek may be able to help you out in the meantime.

Np, thank you :-)
Flags: needinfo?(bugspam.Callek)
So, slaverebooter already gets the most recent job information for each slave. We _could_ have it pull extra job information if the most recent job is RETRY. That needs a couple of things:
1) A new or modified endpoint to support pulling more than the most recent job. Best way to do that is probably to add a query arg to the /slaves/:slave endpoint (http://git.mozilla.org/?p=build/slaveapi.git;a=blob;f=slaveapi/web/slave.py;h=03108a044815dd3623844a5c4b6ad045676f26a0;hb=HEAD#l16)
2) Add the extra call to slaveapi in the slave rebooter script: https://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/reboot-idle-slaves.py#l49.
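
As a rough sketch of how the two steps above could fit together on the slaverebooter side: the query arg name (`recent_jobs`), the response keys (`recent_job`, `recent_jobs`, `results`), and the `fetch` callable are all assumptions made for illustration, not the actual slaveapi interface. `fetch` stands in for whatever HTTP helper the script uses and is expected to return decoded JSON for a URL.

```python
from urllib.parse import urlencode

RETRY = 5  # buildbot's numeric result code for RETRY


def slave_url(api_base, slave, num_jobs=None):
    """Build the /slaves/:slave URL, optionally asking for more job history.

    The `recent_jobs` query arg is hypothetical; it is the kind of
    argument step 1 proposes adding.
    """
    url = "%s/slaves/%s" % (api_base.rstrip("/"), slave)
    if num_jobs is not None:
        url += "?" + urlencode({"recent_jobs": num_jobs})
    return url


def is_stuck_retrying(api_base, slave, fetch, n=10):
    """Only pull extra history when the most recent job was a RETRY."""
    latest = fetch(slave_url(api_base, slave))
    if latest["recent_job"]["results"] != RETRY:  # assumed response shape
        return False
    # Step 2: the extra call, made only for slaves that look suspicious.
    history = fetch(slave_url(api_base, slave, num_jobs=n))
    jobs = history["recent_jobs"]  # assumed response shape
    return len(jobs) >= n and all(job["results"] == RETRY for job in jobs)
```

Making the second request only when the latest job is a RETRY keeps the extra load on slaveapi limited to the handful of machines that might actually be misbehaving.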
Flags: needinfo?(bhearsum)
I also don't foresee doing this anytime very soon. That said, it shouldn't be too hard if I get an inclination/free time sooner rather than later :)
Flags: needinfo?(bugspam.Callek)
Whiteboard: slaveapi
Component: Tools → General
Mass-closing old bugs I filed that have not had recent activity/no longer affect me.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE