Closed
Bug 1000210
Opened 11 years ago
Closed 7 years ago
slaverebooter should file bugs for and/or attempt to reboot machines that have retried more than the last N jobs
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: emorley, Unassigned)
References
Details
(Keywords: sheriffing-P1, Whiteboard: slaveapi)
Jobs that fail with a buildbot result of "RETRY" are (correctly) ignored for most of the TBPL starring workflow, but this means that bad machines can go on a rampage, chewing through jobs. Unless someone looking at TBPL has onlyunstarred=1 mode turned off and is paying attention, this can go overlooked for several hours, if not days.
Ideally slaverebooter would file a bug for, and possibly also reboot, machines that have had a buildbot result of RETRY for more than the last N jobs that machine has performed (some of the retry failure modes, like the tools repo clone not working, can be fixed by a reboot alone). I would imagine an N of 10-20 might be a good place to start.
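As a rough illustration of the check described above, here is a minimal sketch in Python. It assumes the job history is available as a list of buildbot numeric result codes ordered newest first; the function names and the list shape are hypothetical and not taken from slaverebooter. RETRY is 5 in buildbot's result constants.

```python
RETRY = 5  # buildbot's numeric result code for RETRY


def retry_streak(results):
    """Count consecutive RETRY results, starting from the most recent job.

    `results` is assumed to be ordered newest first.
    """
    streak = 0
    for result in results:
        if result != RETRY:
            break
        streak += 1
    return streak


def needs_attention(results, n=10):
    """True when more than the last `n` jobs all ended in RETRY."""
    return retry_streak(results) > n
```

With n=10, a machine whose 11 most recent jobs were all RETRY would be flagged, while one with a single green job among its last 11 would not.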
At the moment, the sheriffs have to manually go through recent blue jobs on TBPL, load the slave health page for each using the link in the bottom left of the UI, and look at the job history table. Joel is working on something that would help with longer-term intermittent failures such as bad SD cards, but this bug would take care of the "retrying machine chewing through everything in sight over the last hour" case.
Comment 1•11 years ago (Reporter)
Ben, is this something that would be easily doable? :-)
Flags: needinfo?(bhearsum)
Comment 2•11 years ago
(In reply to Ed Morley [:edmorley UTC+0] from comment #1)
> Ben, is this something that would be easily doable? :-)
I'm busy with a high priority B2G item right now, I probably won't be able to respond soon. Callek may be able to help you out in the meantime.
Comment 3•11 years ago (Reporter)
(In reply to Ben Hearsum [:bhearsum] from comment #2)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #1)
> > Ben, is this something that would be easily doable? :-)
>
> I'm busy with a high priority B2G item right now, I probably won't be able
> to respond soon. Callek may be able to help you out in the meantime.
Np, thank you :-)
Flags: needinfo?(bugspam.Callek)
Comment 4•11 years ago
So, slaverebooter already gets the most recent job information for each slave. We _could_ have it pull extra job information if the most recent job is RETRY. That needs a couple of things:
1) A new or modified endpoint to support pulling more than the most recent job. Best way to do that is probably to add a query arg to the /slaves/:slave endpoint (http://git.mozilla.org/?p=build/slaveapi.git;a=blob;f=slaveapi/web/slave.py;h=03108a044815dd3623844a5c4b6ad045676f26a0;hb=HEAD#l16)
2) Add the extra call to slaveapi in the slave rebooter script: https://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/reboot-idle-slaves.py#l49.
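The two steps above could look roughly like the following on the rebooter side. This is only a sketch under stated assumptions: the `recent_jobs` query argument, the `recent_job` / `recent_jobs` response keys, and the `results` field are all hypothetical, since the endpoint change does not exist yet. `fetch` is any callable that takes a URL and returns the parsed JSON response.

```python
from urllib.parse import urlencode

RETRY = 5  # buildbot's numeric result code for RETRY


def slave_url(base_url, slave, recent_jobs=None):
    """Build the /slaves/:slave URL, optionally asking for more job history.

    `recent_jobs` is a hypothetical query argument.
    """
    url = f"{base_url}/slaves/{slave}"
    if recent_jobs is not None:
        url += "?" + urlencode({"recent_jobs": recent_jobs})
    return url


def should_flag(fetch, base_url, slave, n=10):
    """True when more than the last `n` jobs on `slave` were all RETRY.

    The response shape used here is an assumption for illustration.
    """
    latest = fetch(slave_url(base_url, slave)).get("recent_job")
    if not latest or latest.get("results") != RETRY:
        return False  # common case: no extra request needed
    wanted = n + 1  # "more than the last n jobs"
    response = fetch(slave_url(base_url, slave, recent_jobs=wanted))
    jobs = response.get("recent_jobs", [])
    return len(jobs) >= wanted and all(
        job.get("results") == RETRY for job in jobs[:wanted]
    )
```

Only slaves whose most recent job is RETRY trigger the second, larger request, so the common case stays at one call per slave.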
Flags: needinfo?(bhearsum)
Comment 5•11 years ago
I also don't foresee doing this anytime very soon. That said, it shouldn't be too hard if I get the inclination/free time sooner rather than later :)
Flags: needinfo?(bugspam.Callek)
Whiteboard: slaveapi
Updated•10 years ago (Reporter)
Blocks: byebyebuildduty
Updated•8 years ago (Assignee)
Component: Tools → General
Comment 6•7 years ago (Reporter)
Mass-closing old bugs I filed that have not had recent activity/no longer affect me.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE