scl1, as deployed, has a number of issues:
* The minis aren't in any particular order, so they're hard to find.
* The minis aren't cabled in any particular order, so they're hard to troubleshoot. In addition, the cabling is not up to our usual standards.
* The minis *are* grouped by platform in racks. This means:
** Losing a circuit will take out all talos slaves for a platform
** Some circuits are running hot because of power mgmt in Fedora, etc. We have already received one warning from the datacenter about overloaded circuits. We did some Q&D rebalancing, but it's not enough.
* Switches are daisychained, so losing certain circuits will take down more than one platform.
I haven't investigated in any depth, but it appears that the iX builders probably also have the same distribution issues.
Actions to resolve:
* Zandr to investigate distribution of builders
* Zandr to complete power measurements of minis running each OS
* Zandr to propose new layout of machines:
** Split builders evenly by platform across the racks
** Split support machines (buildmasters) across racks
** Split minis across racks by platform, grouping by platform in rows, and keeping strict sequence.
* Take a full day of downtime to rearrange machines and rewire
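As a rough illustration of the "split evenly by platform across the racks, keeping strict sequence" idea, here is a minimal sketch. The hostnames, platform names, and rack names are made up for the example and are not the real scl1 inventory; the actual layout would also need to account for measured power draw per circuit, which this does not model.

```python
from collections import defaultdict


def assign_round_robin(machines, racks):
    """Spread each platform's machines evenly across racks.

    machines: list of (hostname, platform) tuples (hypothetical names).
    racks:    list of rack names (hypothetical names).
    Returns a dict mapping rack name -> list of hostnames, grouped by
    platform and kept in hostname order within each platform.
    """
    by_platform = defaultdict(list)
    for host, platform in machines:
        by_platform[platform].append(host)

    layout = {rack: [] for rack in racks}
    # Walk platforms in a fixed order so each rack lists them the same way.
    for platform in sorted(by_platform):
        # Sorting assumes zero-padded hostnames so sequence order is preserved.
        for i, host in enumerate(sorted(by_platform[platform])):
            layout[racks[i % len(racks)]].append(host)
    return layout


# Hypothetical inventory: four minis for each of two platforms, two racks.
machines = [(f"fedora-{n:02d}", "fedora") for n in range(1, 5)]
machines += [(f"leopard-{n:02d}", "leopard") for n in range(1, 5)]
racks = ["rack-a", "rack-b"]

layout = assign_round_robin(machines, racks)
for rack in racks:
    print(rack, layout[rack])
```

With a layout like this, losing one rack (or the circuit feeding it) costs each platform roughly 1/N of its machines instead of all of them.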
I know this is going to be difficult, but the current situation in scl1 is unmaintainable. We can close the tree for a day and fix this, or risk losing it at any time.
I suggest that we pick a day, perhaps in the week between Christmas and New Year's, and get 4-6 people down to the datacenter to get this done. With good prep work, this should go smoothly, but it will take some time.
Current thinking seems to be centering around 12/17. It's the Friday of the All-Hands, which will be a travel day for a lot of developers.
cc'ing RelEng as this will require coordinating tree closure.
We accomplished everything except rebalancing the iX machines. There is a lot less risk there, and I think I can manage rearranging those in smaller groups, thus avoiding a tree closure.