Test rolling downtimes across pods

Status: RESOLVED FIXED
Product: Release Engineering
Component: General
Priority: P4
Severity: normal
Opened: 7 years ago
Last updated: 4 years ago

People: (Reporter: joduinn, Unassigned)

Tracking flags: (Not tracked)

We used to need a complete downtime of all slaves whenever we stopped/started a master. To avoid this, we've set up 5 pods (5 different sets of masters, each with its own slaves). With this in place, stopping/starting a given master will only take 20% of our infrastructure offline at a time.

To test that this works, we should (a rough automation sketch follows this list):
* declare a downtime
* make sure all slaves are spread across our 5 buildbot masters
* do a "graceful shutdown" of one master
* watch all slaves and jobs on that master complete gracefully, and the slaves shut down
* stop and restart the master
* watch the slaves reconnect and take new jobs
* repeat for each of the 5 masters
* verify that no jobs burned/failed during these restarts
* reopen the tree
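
Roughly, the per-master loop could look like the following. This is a minimal sketch, not our actual tooling: the basedir paths and the reconnect wait are made up, and it assumes a buildbot version whose "buildbot stop --clean" waits for running builds to finish (older masters did the clean shutdown from the web status page instead).

#!/usr/bin/env python
# Rolling restart of the pod masters, one at a time, so only ~20% of
# the slaves are offline at any moment.
import subprocess
import time

# Hypothetical basedirs for the 5 pod masters; adjust to the real layout.
MASTERS = ["/builds/master1", "/builds/master2", "/builds/master3",
           "/builds/master4", "/builds/master5"]

RECONNECT_WAIT = 15 * 60  # seconds to let slaves reconnect; a guess

for basedir in MASTERS:
    # Graceful shutdown: let running jobs finish before the master exits.
    subprocess.check_call(["buildbot", "stop", "--clean", basedir])

    # Bring the master back up; slaves reconnect on their own.
    subprocess.check_call(["buildbot", "start", basedir])

    # Wait before draining the next pod. In practice you'd watch the
    # waterfall to confirm slaves reconnected and took new jobs.
    time.sleep(RECONNECT_WAIT)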
Comment 1

7 years ago
While it's not a completely valid test, I did more or less this about a month ago, and it went quite well.

One tricky part about the desired end state is that our only Mac slaves are in MPT, plus a few in Castro. We have to be careful to shut down only one of the two masters that handle Mac machines at any given time (see the sketch below).
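
With a strictly sequential restart loop this happens naturally, but if restarts can ever overlap, a simple guard helps. A sketch follows; the master names and pool assignments are illustrative, not the real pod layout:

# Hypothetical mapping of masters to the slave pools they serve.
POOLS = {
    "master1": {"linux"},
    "master2": {"mac"},    # MPT Mac slaves
    "master3": {"mac"},    # Castro Mac slaves
    "master4": {"win32"},
    "master5": {"linux"},
}

def safe_to_restart(candidate, currently_down):
    """Allow at most one down master per slave pool, so e.g. the two
    Mac-serving masters are never offline at the same time."""
    down_pools = set()
    for master in currently_down:
        down_pools |= POOLS[master]
    return not (POOLS[candidate] & down_pools)

# With master2 already draining, master3 (the other Mac master) must
# wait, but master4 can proceed.
assert safe_to_restart("master4", {"master2"})
assert not safe_to_restart("master3", {"master2"})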

Comment 2

7 years ago
This plan sounds good! Doing these restarts during off-peak hours will save us the hassle of asking for complete tree closures.

(In reply to comment #0)
> * reopen the tree
Isn't the end goal to avoid closing the tree? (Just checking whether I misunderstood, or whether this was the natural inertia of having to close the tree.)
Comment 3

7 years ago
Earlier this week, bhearsum restarted our two pod masters as part of a rolling deployment, without a downtime, and it all worked. It took approximately 3.5 hours from start to finish.

There's some follow-on work in bug#617321, and we'll have better granularity once we have more pods (bug#607179), but this goal is done.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED

Updated

4 years ago
Product: mozilla.org → Release Engineering