Closed
Bug 616658
Opened 14 years ago
Closed 14 years ago
Rewire and Rebalance scl1
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: zandr, Assigned: zandr)
Details
scl1, as deployed, has a number of issues:
* The minis aren't in any particular order, so they're hard to find.
* The minis aren't cabled in any particular order, so they're hard to troubleshoot. In addition, the cabling is not up to our usual standards.
* The minis *are* grouped by platform in racks. This means:
** Losing a circuit will take out all talos slaves for a platform
** Some circuits are running hot because of power mgmt in Fedora, etc. We have already received one warning from the datacenter about overloaded circuits. We did some Q&D rebalancing, but it's not enough.
* Switches are daisychained, so losing certain circuits will take down more than one platform.

I haven't investigated in any depth, but it appears that the iX builders probably also have the same distribution issues.

Actions to resolve:
* Zandr to investigate distribution of builders
* Zandr to complete power measurements of minis running each OS
* Zandr to propose new layout of machines:
** Split builders evenly by platform across the racks
** Split support machines (buildmasters) across racks
** Split minis across racks by platform, grouping by platform in rows, and keeping strict sequence.

Then:
** Take a full day of downtime to rearrange machines and rewire
** I know this is going to be difficult, but the current situation in scl1 is unmaintainable. We can close the tree for a day and fix this, or risk losing it at any time.

I suggest that we pick a day, perhaps in the week between Christmas and New Years, and get 4-6 people down to the datacenter to get this done. With good prep work, this should go smoothly, but it will take some time.
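
As a rough illustration of the proposed layout (each platform split evenly across racks, hostnames kept in strict sequence), the assignment could be sketched as a round-robin like the one below. This is only a sketch under stated assumptions: the hostnames, platform labels, rack count, and per-machine wattages are all hypothetical placeholders, not the real scl1 inventory or the measured power figures.

```python
from collections import defaultdict


def assign_racks(machines, num_racks):
    """Spread each platform's machines evenly across racks.

    machines: list of (hostname, platform) tuples (hypothetical names).
    Returns {rack_index: [(platform, hostname), ...]}.
    """
    by_platform = defaultdict(list)
    for host, platform in machines:
        by_platform[platform].append(host)

    racks = {r: [] for r in range(num_racks)}
    offset = 0  # rotate the starting rack so leftovers don't pile up in rack 0
    for platform in sorted(by_platform):
        hosts = sorted(by_platform[platform])  # strict sequence within a platform
        for i, host in enumerate(hosts):
            racks[(offset + i) % num_racks].append((platform, host))
        offset = (offset + len(hosts)) % num_racks
    return racks


def rack_load(racks, watts_by_platform):
    """Estimate per-rack draw from per-platform wattage (assumed values)."""
    return {
        r: sum(watts_by_platform[platform] for platform, _ in members)
        for r, members in racks.items()
    }


# Hypothetical inventory: 5 Fedora minis and 4 Windows 7 minis, 3 racks.
machines = [(f"fedora-{n:03d}", "fedora") for n in range(1, 6)]
machines += [(f"win7-{n:03d}", "win7") for n in range(1, 5)]

racks = assign_racks(machines, num_racks=3)
loads = rack_load(racks, {"fedora": 20, "win7": 15})  # watts are made up
```

With this kind of assignment, losing any one rack's circuit removes at most about 1/N of a platform's slaves rather than all of them, and the estimated load per rack can be compared against the circuit rating before anything is physically moved.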
Assignee | ||
Updated•14 years ago
Assignee: server-ops → zandr
Assignee | ||
Comment 1•14 years ago
Current thinking seems to be centering around 12/17. It's the Friday of the All-Hands, which will be a travel day for a lot of developers.
Comment 2•14 years ago
cc'ing RelEng as this will require coordinating tree closure.
Updated•14 years ago
Flags: needs-treeclosure?
Assignee | ||
Comment 3•14 years ago
We accomplished everything except rebalancing the iX machines. There is a lot less risk there, and I think I can manage rearranging those in smaller groups, thus avoiding a tree closure.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard