Closed
Bug 607180
Opened 14 years ago
Closed 14 years ago
Test rolling downtimes across pods
Categories
(Release Engineering :: General, defect, P4)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: joduinn, Unassigned)
References
Details
We used to need a complete downtime of all slaves whenever we stopped/started a master. To avoid this, we've set up 5 pods (5 different sets of masters, each with its own slaves). With this in place, stopping/starting a given master will only take 20% of our infrastructure offline at a time.

To test that this works, we should:
* declare a downtime
* make sure all slaves are shared out across our 5 buildbot-masters
* do a "graceful shutdown" of one master
* watch all slaves and jobs on that master complete gracefully, and the slaves shut down
* stop+start the master
* once the master has restarted, watch the slaves reconnect and take new jobs
* repeat for each of the 5 masters
* verify that no jobs burned/failed during these reboots
* reopen the tree
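The steps above can be sketched roughly as the loop below. This is only an illustration of the ordering, not our actual tooling: the `Master` class, `offline_fraction`, `rolling_restart`, and the three callbacks (`graceful_shutdown`, `restart`, `jobs_failed`) are all hypothetical names, and the callbacks stand in for whatever buildbot commands or manual steps are really used.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Master:
    """Hypothetical stand-in for one pod's buildbot master."""
    name: str
    slaves: List[str] = field(default_factory=list)
    running: bool = True


def offline_fraction(masters: List[Master], down: Master) -> float:
    """Share of all slaves that go offline while `down` is stopped."""
    total = sum(len(m.slaves) for m in masters)
    return len(down.slaves) / total if total else 0.0


def rolling_restart(
    masters: List[Master],
    graceful_shutdown: Callable[[Master], None],
    restart: Callable[[Master], None],
    jobs_failed: Callable[[Master], bool],
) -> List[Tuple]:
    """Restart masters one at a time; abort if any jobs burned."""
    events: List[Tuple] = []
    for master in masters:
        # Assumed to block until running jobs finish and slaves shut down.
        graceful_shutdown(master)
        master.running = False
        events.append(("stopped", master.name, offline_fraction(masters, master)))

        # Assumed to block until the slaves have reconnected.
        restart(master)
        master.running = True

        if jobs_failed(master):
            raise RuntimeError(f"jobs burned while restarting {master.name}")
        events.append(("restarted", master.name))
    return events
```

With 5 masters that each own the same number of slaves, every "stopped" event reports a fraction of 0.2, which matches the 20% figure above. The loop never starts on the next master until the previous one is back up, so at most one pod is offline at any time.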
Comment 1•14 years ago
While it's not a completely valid test, I did more or less this about a month ago and it went quite well. One tricky part about the desired end here is that we only have Mac slaves in MPT, plus a few in Castro. We have to be careful to only shut down one of the two masters that handle Mac machines at any given time.
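A minimal sketch of that check, assuming we keep a simple map from each master to the platforms it serves. The function name `safe_to_stop` and the master names in the usage note are made up for illustration.

```python
from typing import Dict, Set


def safe_to_stop(
    candidate: str,
    already_down: Set[str],
    platforms: Dict[str, Set[str]],
) -> bool:
    """Return True if stopping `candidate` leaves every platform it serves
    with at least one other master still running."""
    would_be_down = already_down | {candidate}
    for platform in platforms[candidate]:
        still_up = [
            name
            for name, served in platforms.items()
            if platform in served and name not in would_be_down
        ]
        if not still_up:
            return False
    return True
```

For example, with two hypothetical Mac-capable masters "pm01" and "pm02", `safe_to_stop("pm02", {"pm01"}, platforms)` would return False, because stopping both at once would leave no master for the Mac slaves.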
Comment 2•14 years ago
This plan sounds good! Doing these in non-peak hours will save us the hassle of asking for complete tree closures.

(In reply to comment #0)
> * reopen the tree

Isn't the end to avoid closing the tree? (Just checking to see if I misunderstood or if it was the natural inertia of having to close the tree.)
Reporter
Comment 3•14 years ago
Earlier this week, bhearsum restarted our two POD masters as part of a rolling deployment, without a downtime, and it all worked. It took approximately 3.5 hours from start to finish. There's some follow-on work in bug 617321, and we'll have better granularity once we have more PODs (bug 607179), but this goal is done.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering