Closed Bug 1220296 Opened 9 years ago Closed 9 years ago

Restart buildbot masters more frequently

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

Attachments

(1 file)

Attached patch master-age.diff
We have nagios alerts set up to restart the masters after 41 days. I think we need to cut this in half - to 20 days.
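For illustration, the age threshold described above could be checked along these lines. This is a minimal sketch, not the contents of master-age.diff: the function name, its arguments, and the way the start time is obtained are all hypothetical. Only the 20-day and 41-day figures come from this bug, and the exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
from datetime import datetime, timedelta

# Conventional Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_master_age(started_at, now, max_age_days=20):
    """Return (exit_code, message) for a master that started at `started_at`.

    Hypothetical helper: `max_age_days` defaults to the 20 days proposed
    in this bug; pass 41 to model the previous threshold.
    """
    age = now - started_at
    if age >= timedelta(days=max_age_days):
        return CRITICAL, "CRITICAL: master up %d days (limit %d)" % (age.days, max_age_days)
    return OK, "OK: master up %d days (limit %d)" % (age.days, max_age_days)
```

With these assumptions, a master that has been up for 30 days alerts under the 20-day limit but would have stayed quiet under the old 41-day limit.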
Attachment #8681493 - Flags: review?(arich)
Blocks: 1212993
Comment on attachment 8681493, master-age.diff:

FYI, we just got done increasing this number because people only wanted to reboot during a TCW every 6+ weeks. We need an accompanying process change to go along with the nagios change.
Attachment #8681493 - Flags: review?(arich) → review+
That's true...but I don't think there's a particular need to do this inside a TCW. Slow rolling restarts should be ok.
Any timing adjustments should be reflected in bug 1197853. FWIW, we cancelled October's restart based on a report that it wasn't needed (and would have been on TCW activity).
See Also: → 1057888, 1197853
Attachment #8681493 - Flags: checked-in+
Assignee: nobody → catlee
I think I did the October restart anyway as nagios was alerting. +1 to restarting more frequently. I think we need to take assorted tools we have and productionise them. Pretty sure coop has something, and I've kept catlee's fabric enhancements going at https://github.com/nthomas-mozilla/build-tools/tree/fabric (this got flaky on the last few masters when I used it, for unknown reasons).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Do we have a check to detect when the masters start getting into this bad state? I would assume it won't be hard to sell rebooting the masters more often if it helps us keep Windows throughput up, even if there is an increased risk of a master rebooting and getting into a bad state for a short time.
My script is here: https://github.com/ccooper/build-tools/blob/master/buildfarm/maintenance/restart_masters.py

I ran it this weekend to restart all the masters. It's not perfect -- we hung on two masters requiring manual intervention -- but we could certainly dig into those issues and fix them. We could schedule the script to trigger restarts every weekend without much issue.
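As a rough sketch of the "slow rolling restart" idea discussed in this bug, the loop below restarts one master at a time and stops at the first master that does not come back healthy, so a hang needs manual attention on only one host. It is not taken from restart_masters.py: `rolling_restart`, the `restart` and `is_healthy` callbacks, and the pause length are all hypothetical placeholders.

```python
import time


def rolling_restart(masters, restart, is_healthy, pause_seconds=300.0, sleep=time.sleep):
    """Restart masters one at a time; stop at the first unhealthy one.

    `restart(master)` and `is_healthy(master)` are caller-supplied
    callbacks (assumptions, not real buildbot APIs).
    Returns (restarted, failed), where `failed` is None on full success.
    """
    masters = list(masters)
    restarted = []
    for index, master in enumerate(masters):
        restart(master)
        if not is_healthy(master):
            # Leave the remaining masters untouched for manual follow-up.
            return restarted, master
        restarted.append(master)
        if index < len(masters) - 1:
            # Pause between masters so only one is down at any time.
            sleep(pause_seconds)
    return restarted, None
```

Stopping on the first failure is a design choice for this sketch: it trades completeness for safety, which seems to fit the concern above about masters hanging mid-restart.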
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard