Closed
Bug 1220296
Opened 9 years ago
Closed 9 years ago
Restart buildbot masters more frequently
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: catlee)
Attachments
(1 file)
755 bytes, patch | arich: review+ | catlee: checked-in+
We have nagios alerts set up to restart the masters after 41 days. I think we need to cut this in half - to 20 days.
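For illustration, the threshold change amounts to something like the check below. This is only a minimal sketch and not the attached patch: the function name `check_master_age`, its parameters, and the choice to raise a WARNING (rather than CRITICAL) are all assumptions. Only the 41-day and 20-day figures come from this bug.

```python
# Standard Nagios plugin exit codes.
OK, WARNING = 0, 1

SECONDS_PER_DAY = 86400.0


def check_master_age(start_time, now, max_age_days=20):
    """Return (exit_code, message) for a master started at start_time.

    start_time and now are epoch seconds. max_age_days is the proposed
    20-day limit; pass 41 to model the previous behaviour.
    """
    age_days = (now - start_time) / SECONDS_PER_DAY
    if age_days > max_age_days:
        return WARNING, "WARNING: master up %.1f days (limit %d)" % (
            age_days, max_age_days)
    return OK, "OK: master up %.1f days" % age_days
```

With the old limit a master that has been up for 21 days stays quiet; with the proposed limit it alerts and becomes due for a restart.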
Attachment #8681493 - Flags: review?(arich)
Comment 1•9 years ago
Comment on attachment 8681493
master-age.diff
Review of attachment 8681493:
-----------------------------------------------------------------
FYI, we just got done increasing this number because people only wanted to reboot during a TCW every 6+ weeks. We need an accompanying process change to go along with the nagios change.
Attachment #8681493 - Flags: review?(arich) → review+
Assignee
Comment 2•9 years ago
That's true...but I don't think there's a particular need to do this inside a TCW. Slow rolling restarts should be ok.
Any timing adjustments should be reflected in bug 1197853. FWIW, we cancelled October's restart based on a report that it wasn't needed (and it would have been a TCW activity).
Assignee
Updated•9 years ago
Attachment #8681493 - Flags: checked-in+
Assignee
Updated•9 years ago
Assignee: nobody → catlee
Comment 4•9 years ago
I think I did the October restart anyway as nagios was alerting.
+1 to restarting more frequently. I think we need to take assorted tools we have and productionise them. Pretty sure coop has something, and I've kept catlee's fabric enhancements going at https://github.com/nthomas-mozilla/build-tools/tree/fabric (this got flaky on the last few masters when I used it, for unknown reasons).
Assignee
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 5•9 years ago
Do we have a check to detect the masters starting to get into this bad state?
I would assume it won't be hard to sell rebooting the masters more often if it helps us keep Windows throughput up, even if there is an increased risk of a master getting into a bad state for a short time after the reboot.
Comment 6•9 years ago
My script is here: https://github.com/ccooper/build-tools/blob/master/buildfarm/maintenance/restart_masters.py
I ran it this weekend to restart all the masters. It's not perfect -- we hung on two masters requiring manual intervention -- but we could certainly dig into those issues and fix them. We could schedule the script to trigger restarts every weekend without much issue.
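To make the "slow rolling restart" idea concrete, here is a rough sketch of the loop. It is not the script linked above: `rolling_restart`, `restart_one`, and `pause_seconds` are hypothetical names, and the actual restart step is passed in as a callable. Masters that hang or fail are collected for manual follow-up instead of stopping the run.

```python
import time


def rolling_restart(masters, restart_one, pause_seconds=0, sleep=time.sleep):
    """Restart masters one at a time and return those needing attention.

    restart_one(master) is a caller-supplied (hypothetical) callable that
    returns True on a clean restart. A False result or an exception marks
    the master as failed; the loop then carries on with the next one.
    """
    failed = []
    for index, master in enumerate(masters):
        try:
            ok = restart_one(master)
        except Exception:
            ok = False
        if not ok:
            failed.append(master)
        # Pause between masters so only one is out of service at a time.
        if index < len(masters) - 1:
            sleep(pause_seconds)
    return failed
```

A weekend job could call this with the full master list and report whatever it returns, which covers the case above where two masters hung.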
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard