Closed Bug 728459 Opened 12 years ago Closed 12 years ago

Automatically try to reboot slaves that are hung, broken or slow

Categories

(Release Engineering :: General, defect, P3)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bear, Assigned: bear)

Details

(Whiteboard: [buildduty][automation][capacity])

Attachments

(1 file)

The original bug ended up being flagged moco-confidential, so I spun this one up...

Create a series of tools that allow slave statuses to be checked and monitored. If any slave is found to be offline, hung, or idle for too long, reboot it.
The set of tools I'm creating (based on catlee and coop's work) is the following:

kittenreaper.py

scan for slaves that are known to be offline and auto-reboot them

kitten.py

check and display the status of a slave and optionally do "something" to it/them

rbug.py

allow buildduty to check and, if necessary, file a reboots bug for a hung slave
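
To make the intended behaviour concrete, here is a minimal sketch of the check-then-reboot loop described above. It is not the actual kittenreaper.py code: `SlaveStatus`, `needs_reboot`, `reap`, and the 6-hour `IDLE_LIMIT` are hypothetical names and values chosen for illustration, and the real reboot mechanism is passed in as a callback.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed threshold for illustration only; the comments below debate 6 vs 8 hours.
IDLE_LIMIT = timedelta(hours=6)


@dataclass
class SlaveStatus:
    """Hypothetical snapshot of one slave's state."""
    name: str
    reachable: bool           # does the host respond at all?
    last_activity: datetime   # last buildbot activity seen for this slave


def needs_reboot(status, now, idle_limit=IDLE_LIMIT):
    """Return a reason string if the slave should be rebooted, else None."""
    if not status.reachable:
        return "offline"
    if now - status.last_activity > idle_limit:
        return "idle too long"
    return None


def reap(statuses, now, reboot):
    """Call reboot(name) for each slave that needs it.

    Returns the list of (name, reason) pairs acted on, which could feed
    the email report mentioned later in this bug.
    """
    actions = []
    for status in statuses:
        reason = needs_reboot(status, now)
        if reason is not None:
            reboot(status.name)
            actions.append((status.name, reason))
    return actions
```

Passing `reboot` in as a callable keeps the decision logic testable without touching real hardware; a dry run can simply collect the names instead of rebooting anything.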
Assignee: nobody → bear
Priority: -- → P3
Whiteboard: [buildduty][automation][capacity]
Attached file sample output
Blocks: 729548
Created a cron job on the new dashboard sandbox VM that runs kittenreaper.py and sends an email.
cronjob is now running on cruncher every 6 hours to check for idle hosts and autoreboot them with an email sent for any action taken
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Since I don't see it addressed here or in the other bug...

(In reply to Mike Taylor [:bear] from comment #5)
> cronjob is now running on cruncher every 6 hours to check for idle hosts and
> autoreboot them with an email sent for any action taken

Running every 6 hours is fine, but what is the "have not heard anything in..." timeframe? 6 hours as well?

If so, 6 hours does not really feel like a prudent threshold here. Idleizer is set at 7 hours, so I would have expected 8 hours here.
(In reply to Justin Wood (:Callek) from comment #6)
> Since I don't see it addressed here or in the other bug...
> 
> (In reply to Mike Taylor [:bear] from comment #5)
> > cronjob is now running on cruncher every 6 hours to check for idle hosts and
> > autoreboot them with an email sent for any action taken
> 
> Run every 6 hours is fine, but the "have not heard anything in..." timeframe
> is what? 6 hours as well??
> 
> If so, that does not really feel prudent for 6 hours to be our threshold
> here, Idleizer is set at 7 hours, and as such I would have expected 8 hours
> here.

This code may catch some hosts that are running idleizer just before idleizer itself would have, but since it checks for zero buildbot activity, I wasn't too worried about moving the marker an hour earlier than idleizer's.
Well, idleizer also triggers on zero buildbot activity, BUT it basically shuts buildbot down when it fires, AND once we update the master it will do so even more cleanly. I strongly suggest letting it run before this script.
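
The timing concern in this exchange can be worked through with a little arithmetic. The sketch below assumes a slave that goes silent at time zero, an idleizer limit of 7 hours (from the comments above), and a periodic check that reboots on the first run after the idle threshold has passed. The function names are hypothetical and the model ignores everything except these three numbers.

```python
from datetime import timedelta

IDLEIZER_LIMIT = timedelta(hours=7)   # value quoted in the discussion above


def reaper_window(reaper_limit, check_interval):
    """Earliest and latest idle time at which a periodic check would act.

    The check only fires on a cron run, so the reboot lands somewhere
    between the threshold itself and one full interval later.
    """
    return reaper_limit, reaper_limit + check_interval


def can_preempt_idleizer(reaper_limit, idleizer_limit=IDLEIZER_LIMIT):
    """True if the periodic check can fire before idleizer gets its turn."""
    return reaper_limit < idleizer_limit


for hours in (6, 8):
    limit = timedelta(hours=hours)
    earliest, latest = reaper_window(limit, timedelta(hours=6))
    print(hours, earliest, latest, can_preempt_idleizer(limit))
```

Under these assumptions a 6-hour threshold checked every 6 hours acts somewhere between 6 and 12 hours of silence, so it can beat idleizer's 7-hour mark; an 8-hour threshold acts between 8 and 14 hours and always lets idleizer go first.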
Product: mozilla.org → Release Engineering