Closed Bug 607180 Opened 14 years ago Closed 14 years ago

Test rolling downtimes across pods

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: joduinn, Unassigned)

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Reporter

Description

14 years ago

We used to need a complete downtime of all slaves whenever we stopped/started a master. To avoid this, we've set up 5 pods (5 different sets of masters, each with its own slaves). With this in place, stopping/starting a given master will only take 20% of our infrastructure offline at a time.

To test that this works, we should:
* declare a downtime
* make sure all slaves are shared out across our 5 buildbot-masters
* do "graceful shutdown" of one master
* watch all jobs on that master complete gracefully, and its slaves shut down
* stop+start the master
* once the master is back up, watch the slaves reconnect and take new jobs
* repeat for each of the 5 masters
* verify that no jobs burned/failed during these reboots
* reopen the tree
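
The per-master steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Master` class, its methods, and the slave counts are hypothetical stand-ins, not buildbot APIs, and draining is modelled as instantaneous rather than polled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Master:
    """Hypothetical in-memory stand-in for one buildbot master and its slaves."""
    name: str
    slaves: List[str]
    running_jobs: int = 0
    up: bool = True
    accepting_jobs: bool = True

    def graceful_shutdown(self) -> None:
        # Stop handing out new jobs; jobs already running may finish.
        self.accepting_jobs = False

    def drain(self) -> None:
        # Stand-in for "watch all jobs complete"; a real check would poll.
        self.running_jobs = 0

    def stop(self) -> None:
        # Refuse to stop while work is in flight, so no job gets burned.
        if self.running_jobs:
            raise RuntimeError(f"{self.name} still has running jobs")
        self.up = False

    def start(self) -> None:
        self.up = True
        self.accepting_jobs = True


def rolling_restart(masters: List[Master]) -> List[float]:
    """Restart masters one at a time.

    Returns the fraction of all slaves that was offline during each step.
    """
    total = sum(len(m.slaves) for m in masters)
    offline_fractions = []
    for m in masters:
        m.graceful_shutdown()
        m.drain()
        m.stop()
        offline = sum(len(x.slaves) for x in masters if not x.up)
        offline_fractions.append(offline / total)
        m.start()
    return offline_fractions


# Assumed layout: 5 masters with 4 slaves each, 2 jobs running on each.
masters = [
    Master(f"master{i}", [f"slave{i}-{j}" for j in range(4)], running_jobs=2)
    for i in range(1, 6)
]
fractions = rolling_restart(masters)
print(fractions)
```

With evenly sized pods, each step takes one fifth of the slaves offline, which matches the 20% figure above; uneven pods would give uneven fractions.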

bhearsum@mozilla.com (:bhearsum)

Comment 1

14 years ago

While it's not a completely valid test, I did more or less this about a month ago and it went quite well.

One tricky part about the goal here is that we only have Mac slaves in MPT, plus a few in Castro. We have to be careful to shut down only one of the two masters that handle Mac machines at any given time.
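
A check for that constraint might look like the sketch below. The master names and platform assignments are made up for illustration; only the "two masters handle Mac" shape comes from this comment. The function flags any restart batch that would take every master for some platform offline at once.

```python
from typing import Dict, List, Set


def unsafe_batches(
    platforms_by_master: Dict[str, Set[str]],
    batches: List[List[str]],
) -> List[int]:
    """Return indexes of batches that leave some platform with no master up."""
    all_platforms = set().union(*platforms_by_master.values())
    bad = []
    for i, batch in enumerate(batches):
        down = set(batch)
        for platform in all_platforms:
            serving = {
                m for m, p in platforms_by_master.items() if platform in p
            }
            # Unsafe if every master serving this platform is in the batch.
            if serving <= down:
                bad.append(i)
                break
    return bad


# Hypothetical assignment: m1 and m2 are the two Mac-capable masters.
platforms_by_master = {
    "m1": {"mac"},
    "m2": {"mac", "linux"},
    "m3": {"linux", "win"},
    "m4": {"linux"},
    "m5": {"win"},
}

one_at_a_time = [["m1"], ["m2"], ["m3"], ["m4"], ["m5"]]
both_mac_together = [["m1", "m2"], ["m3"]]

print(unsafe_batches(platforms_by_master, one_at_a_time))
print(unsafe_batches(platforms_by_master, both_mac_together))
```

Restarting strictly one master at a time is always safe as long as every platform has at least two masters; the check only matters if masters are ever batched.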

Armen [:armenzg]

Comment 2

14 years ago

This plan sounds good! Doing these in non-peak hours will save us the hassle of asking for complete tree closures.

(In reply to comment #0)
> * reopen the tree
Isn't the goal to avoid closing the tree? (Just checking whether I misunderstood, or whether this step was left in out of the habit of having to close the tree.)

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Reporter

Comment 3

14 years ago

Earlier this week, bhearsum restarted our two POD masters as part of a rolling deployment, without a downtime, and it all worked. It took approximately 3.5 hours from start to finish.

There's some follow-on work in bug#617321, and we'll have better granularity once we have more PODs (bug#607179), but this goal is done.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Assignee

Updated

11 years ago

Product: mozilla.org → Release Engineering

Categories

(Release Engineering :: General, defect, P4)