We used to need a complete downtime of all slaves whenever we stopped/started a master. To avoid this, we've set up 5 pods (5 different sets of masters, each with its own slaves). With this in place, stopping/starting a given master will only take 20% of our infrastructure offline at a time.

To test that this works, we should:
* declare a downtime
* make sure all slaves are shared out across our 5 buildbot-masters
* do a "graceful shutdown" of one master
* watch all jobs on that master complete gracefully, and its slaves shut down
* stop+start the master
* watch the slaves reconnect and take new jobs
* repeat for each of the 5 masters
* verify that no jobs burned/failed during these restarts
* reopen the tree
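For illustration only, the test loop above could be sketched roughly like this. It is an in-memory simulation, not real Buildbot code: the `Master` class, the master and slave names, and the assumption of 4 slaves per master are all hypothetical, and it only shows the order of steps and where the 20% figure comes from.

```python
from dataclasses import dataclass


@dataclass
class Master:
    """Hypothetical stand-in for one buildbot master and its slaves."""
    name: str
    slaves: list
    running: bool = True
    accepting_jobs: bool = True


def offline_fraction(masters, down):
    """Fraction of all slaves offline while the masters named in `down` are stopped."""
    total = sum(len(m.slaves) for m in masters)
    offline = sum(len(m.slaves) for m in masters if m.name in down)
    return offline / total


def rolling_restart(masters):
    """Restart masters one at a time and return the ordered log of steps."""
    log = []
    for m in masters:
        # Graceful shutdown: stop handing out new jobs, let running ones finish.
        m.accepting_jobs = False
        log.append(f"{m.name}: graceful shutdown, waiting for running jobs")
        # Stop the master once its slaves are idle.
        m.running = False
        share = offline_fraction(masters, {m.name})
        log.append(f"{m.name}: stopped ({share:.0%} of slaves offline)")
        # Start it again; slaves reconnect and take new jobs.
        m.running = True
        m.accepting_jobs = True
        log.append(f"{m.name}: started, slaves reconnecting")
    return log


# Assumed layout: 5 masters with 4 slaves each (evenly shared out).
masters = [
    Master(f"master{i}", [f"slave{i}-{j}" for j in range(4)])
    for i in range(1, 6)
]
steps = rolling_restart(masters)
```

With slaves shared evenly, only one master's share of slaves is offline at any point, which is where the 20% comes from; an uneven split would change that number.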
While it's not a completely valid test, I did more or less this about a month ago and it went quite well. One tricky part about the desired end here is that we only have Mac slaves in MPT, plus a few in Castro. We have to be careful to shut down only one of the two masters that handle Mac machines at any given time.
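A minimal sketch of that Mac constraint as a pre-check before stopping a master. The master names are made up for illustration, and how the set of currently stopped masters is obtained is left out; this only captures the rule "never have both Mac masters down at once".

```python
def safe_to_stop(candidate, stopped, mac_masters):
    """Return True if stopping `candidate` still leaves a Mac master running."""
    if candidate not in mac_masters:
        # Non-Mac masters do not affect Mac coverage.
        return True
    still_up = mac_masters - stopped - {candidate}
    return len(still_up) >= 1


# Hypothetical names for the two masters that handle Mac machines.
mac_masters = {"mac-master-a", "mac-master-b"}
```

For example, `safe_to_stop("mac-master-a", {"mac-master-b"}, mac_masters)` is `False`, because the other Mac master is already down.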
This plan sounds good! Doing these in non-peak hours will save us the hassle of asking for complete tree closures.

(In reply to comment #0)
> * reopen the tree

Isn't the end to avoid closing the tree? (Just checking to see if I misunderstood or if it was the natural inertia of having to close the tree.)
Earlier this week, bhearsum restarted our two POD masters as part of a rolling deployment, without a downtime, and it all worked. It took approx 3.5 hours from start to finish. There's some follow-on work in bug#617321, and we'll have better granularity once we have more PODs (bug#607179), but this goal is done.