Closed Bug 616658 Opened 14 years ago Closed 14 years ago

Rewire and Rebalance scl1

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zandr, Assigned: zandr)

Details

scl1, as deployed, has a number of issues:
* The minis aren't in any particular order, so they're hard to find.
* The minis aren't cabled in any particular order, so they're hard to troubleshoot. In addition, the cabling is not up to our usual standards.
* The minis *are* grouped by platform in racks. This means:
** Losing a circuit will take out all talos slaves for a platform
** Some circuits are running hot because of power mgmt in Fedora, etc. We have already received one warning from the datacenter about overloaded circuits. We did some Q&D rebalancing, but it's not enough.
* Switches are daisychained, so losing certain circuits will take down more than one platform.

I haven't investigated in any depth, but it appears that the iX builders probably also have the same distribution issues.

Actions to resolve:
* Zandr to investigate distribution of builders
* Zandr to complete power measurements of minis running each OS
* Zandr to propose new layout of machines:
** Split builders evenly by platform across the racks
** Split support machines (buildmasters) across racks
** Split minis across racks by platform, grouping by platform in rows, and keeping strict sequence.

Then:
** Take a full day of downtime to rearrange machines and rewire
** I know this is going to be difficult, but the current situation in scl1 is unmaintainable. We can close the tree for a day and fix this, or risk losing it at any time.

I suggest that we pick a day, perhaps in the week between Christmas and New Years, and get 4-6 people down to the datacenter to get this done. With good prep work, this should go smoothly, but it will take some time.
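
For illustration only, the layout proposed in the comment above (each platform split evenly across racks, one row per platform, hosts kept in strict sequence) could be sketched roughly as follows. This is not the tooling actually used for scl1; the function names, hostnames, rack count, and per-machine wattages are all hypothetical placeholders, and real numbers would come from the power measurements listed in the action items.

```python
from collections import defaultdict


def split_evenly(hosts, n_racks):
    """Split an ordered host list into n_racks contiguous, near-equal chunks."""
    base, extra = divmod(len(hosts), n_racks)
    chunks, start = [], 0
    for rack in range(n_racks):
        # The first `extra` racks each take one additional host.
        size = base + (1 if rack < extra else 0)
        chunks.append(hosts[start:start + size])
        start += size
    return chunks


def plan_layout(hosts_by_platform, n_racks):
    """Return {rack_index: {platform: [hosts]}}: one row per platform per rack."""
    layout = defaultdict(dict)
    for platform, hosts in hosts_by_platform.items():
        # Sorting keeps hostnames in strict sequence within each row
        # (assumes zero-padded names so string order matches numeric order).
        for rack, chunk in enumerate(split_evenly(sorted(hosts), n_racks)):
            layout[rack][platform] = chunk
    return dict(layout)


def rack_power(layout, watts_by_platform):
    """Estimate per-rack draw from an assumed per-machine wattage per platform."""
    return {
        rack: sum(len(hosts) * watts_by_platform[platform]
                  for platform, hosts in rows.items())
        for rack, rows in layout.items()
    }


if __name__ == "__main__":
    # Hypothetical inventory and wattages, for demonstration only.
    inventory = {
        "fedora": [f"fedora-{i:02d}" for i in range(1, 9)],
        "win7": [f"win7-{i:02d}" for i in range(1, 7)],
    }
    layout = plan_layout(inventory, n_racks=4)
    print(rack_power(layout, {"fedora": 40, "win7": 30}))
```

With this split, losing one rack's circuit removes only about 1/n of any platform's slaves instead of all of them. Note that this simple version always hands leftover machines to the lowest-numbered racks; a real plan would probably rotate the leftovers so the power totals stay closer together.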
Assignee: server-ops → zandr
Current thinking seems to be centering around 12/17. It's the Friday of the All-Hands, which will be a travel day for a lot of developers.
cc'ing RelEng as this will require coordinating tree closure.
Flags: needs-treeclosure?
We accomplished everything except rebalancing the iX machines. There is a lot less risk there, and I think I can manage rearranging those in smaller groups, thus avoiding a tree closure.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard