Closed
Bug 728459
Opened 12 years ago
Closed 12 years ago
Automatically try to reboot slaves that are hung, broken or slow
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bear, Assigned: bear)
Details
(Whiteboard: [buildduty][automation][capacity])
Attachments
(1 file)
2.42 KB,
text/plain
The original bug ended up being flagged moco-confidential, so I spun this one up. Create a series of tools that allow slave statuses to be checked and monitored. If any slave is found to be offline, hung, or idle for too long, reboot it.
Assignee | ||
Comment 1•12 years ago
The set of tools I'm creating (based on catlee and coop's work) is the following:
kittenreaper.py: scan for slaves that are known to be offline and auto-reboot them
kitten.py: check and display the status of a slave and optionally do "something" to it/them
rbug.py: allow buildduty to check and, if necessary, file a reboots bug for a hung slave
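As a rough illustration of the reaper idea described above, the decision could be sketched as follows. This is not the actual kittenreaper.py code: the `SlaveStatus` record, the `needs_reboot` and `reap` helpers, the `reboot` callback, and the 6-hour `IDLE_LIMIT` are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical cutoff for "idle for too long"; not taken from the real script.
IDLE_LIMIT = timedelta(hours=6)


@dataclass
class SlaveStatus:
    """Hypothetical snapshot of one slave, as a status checker might report it."""
    name: str
    reachable: bool          # did the host answer a probe at all?
    last_activity: datetime  # last buildbot activity seen for this host


def needs_reboot(status, now, idle_limit=IDLE_LIMIT):
    """Return True for hosts that are offline or idle past the cutoff."""
    if not status.reachable:
        return True
    return now - status.last_activity > idle_limit


def reap(statuses, now, reboot):
    """Call reboot(name) for each host that needs it.

    Returns the list of names acted on, which a caller could put
    into a notification email.
    """
    acted = []
    for status in statuses:
        if needs_reboot(status, now):
            reboot(status.name)
            acted.append(status.name)
    return acted
```

Passing the reboot action in as a callback keeps the decision logic testable without touching real hosts; a dry run can simply pass `print`.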
Assignee: nobody → bear
Priority: -- → P3
Whiteboard: [buildduty][automation][capacity]
Assignee | ||
Comment 2•12 years ago
Assignee | ||
Comment 4•12 years ago
created cron job on new dashboard sandbox vm that runs kittenreaper.py and sends an email
Assignee | ||
Comment 5•12 years ago
cronjob is now running on cruncher every 6 hours to check for idle hosts and autoreboot them with an email sent for any action taken
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 6•12 years ago
Since I don't see it addressed here or in the other bug...
(In reply to Mike Taylor [:bear] from comment #5)
> cronjob is now running on cruncher every 6 hours to check for idle hosts and
> autoreboot them with an email sent for any action taken
Running every 6 hours is fine, but what is the "have not heard anything in..." timeframe? 6 hours as well?
If so, 6 hours does not feel like a prudent threshold here. Idleizer is set at 7 hours, so I would have expected 8 hours here.
Assignee | ||
Comment 7•12 years ago
(In reply to Justin Wood (:Callek) from comment #6)
> Since I don't see it addressed here or in the other bug...
>
> (In reply to Mike Taylor [:bear] from comment #5)
> > cronjob is now running on cruncher every 6 hours to check for idle hosts and
> > autoreboot them with an email sent for any action taken
>
> Run every 6 hours is fine, but the "have not heard anything in..." timeframe
> is what? 6 hours as well??
>
> If so, that does not really feel prudent for 6 hours to be our threshold
> here, Idleizer is set at 7 hours, and as such I would have expected 8 hours
> here.
This code may catch some hosts that are running idleizer just before idleizer itself would have, but since it checks for zero buildbot activity, I wasn't too worried about moving the marker an hour earlier than idleizer's.
Comment 8•12 years ago
Well, idleizer also triggers on zero buildbot activity... BUT it basically shuts buildbot down when invoked, AND once we update the master it even does so more cleanly. I strongly suggest letting it run before this script.
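To make the ordering concern concrete, here is a minimal sketch of choosing the reboot script's idle cutoff so that idleizer always gets to act first. The 7-hour idleizer value and the 8-hour expectation come from comment 6; the function names and the one-hour margin parameter are hypothetical and not part of either tool.

```python
from datetime import timedelta

# Idleizer timeout as quoted in comment 6.
IDLEIZER_TIMEOUT = timedelta(hours=7)


def reaper_threshold(idleizer_timeout=IDLEIZER_TIMEOUT,
                     margin=timedelta(hours=1)):
    """Return an idle cutoff that only fires after idleizer has had its chance."""
    if margin <= timedelta(0):
        raise ValueError("margin must be positive so idleizer runs first")
    return idleizer_timeout + margin


def fires_before_idleizer(threshold, idleizer_timeout=IDLEIZER_TIMEOUT):
    """True when a reboot cutoff would preempt idleizer's own shutdown."""
    return threshold < idleizer_timeout
```

With these assumptions, a 6-hour cutoff preempts idleizer, while the default of 7 hours plus a 1-hour margin gives the 8 hours suggested in comment 6.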
Updated•11 years ago
Product: mozilla.org → Release Engineering