Rewire and Rebalance scl1

RESOLVED FIXED

Status

mozilla.org Graveyard
Server Operations
RESOLVED FIXED
7 years ago
3 years ago

People

(Reporter: zandr, Assigned: zandr)

Tracking

Bug Flags:
needs-treeclosure?

Details


Description

7 years ago
scl1, as deployed, has a number of issues:

* The minis aren't in any particular order, so they're hard to find.

* The minis aren't cabled in any particular order, so they're hard to troubleshoot. In addition, the cabling is not up to our usual standards.

* The minis *are* grouped by platform in racks. This means:
** Losing a circuit will take out all talos slaves for a platform
** Some circuits are running hot because of differences in power management between Fedora and the other OSes. We have already received one warning from the datacenter about overloaded circuits. We did some quick-and-dirty rebalancing, but it's not enough.

* Switches are daisychained, so losing certain circuits will take down more than one platform.

I haven't investigated in any depth, but the iX builders appear to have the same distribution issues.

Actions to resolve:
* Zandr to investigate distribution of builders
* Zandr to complete power measurements of minis running each OS
* Zandr to propose new layout of machines:
** Split builders evenly by platform across the racks
** Split support machines (buildmasters) across racks
** Split minis across racks by platform, grouping by platform in rows, and keeping strict sequence.

Then:
** Take a full day of downtime to rearrange machines and rewire **
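
As a rough illustration of the proposed layout (a sketch only, not the tooling used for this move): each platform's machines are split into contiguous, near-equal blocks, one block per rack. Losing a single rack then costs each platform only about 1/N of its machines, and machines stay in strict sequence within a rack. The hostnames, rack count, and wattage figures below are hypothetical placeholders; real numbers would come from the power measurements listed above.

```python
from collections import defaultdict


def split_evenly(hosts, rack_count):
    """Split sorted hosts into rack_count contiguous chunks.

    Chunk sizes differ by at most one, and hosts stay in sequence.
    Sorting is lexicographic, so names are assumed to be zero-padded.
    """
    hosts = sorted(hosts)
    base, extra = divmod(len(hosts), rack_count)
    chunks, start = [], 0
    for rack in range(rack_count):
        size = base + (1 if rack < extra else 0)
        chunks.append(hosts[start:start + size])
        start += size
    return chunks


def plan_layout(hosts_by_platform, rack_count):
    """Return {rack_index: {platform: [hosts in sequence]}}."""
    layout = defaultdict(dict)
    for platform in sorted(hosts_by_platform):
        chunks = split_evenly(hosts_by_platform[platform], rack_count)
        for rack, chunk in enumerate(chunks):
            if chunk:
                layout[rack][platform] = chunk
    return dict(layout)


def rack_watts(layout, watts_per_host):
    """Estimate draw per rack from a per-platform watts-per-host figure."""
    return {
        rack: sum(watts_per_host[p] * len(hosts) for p, hosts in rows.items())
        for rack, rows in layout.items()
    }


# Hypothetical fleet: names and counts are made up for illustration.
fleet = {
    "fedora": [f"fedora-{n:03d}" for n in range(1, 8)],  # 7 machines
    "win7": [f"win7-{n:03d}" for n in range(1, 7)],      # 6 machines
}
layout = plan_layout(fleet, rack_count=3)

# Placeholder wattages, NOT measured values.
load = rack_watts(layout, {"fedora": 10, "win7": 20})
```

One caveat with this simple split: leftover machines always land in the lowest-numbered racks, so with many platforms the first rack ends up slightly heavier. Checking the per-rack wattage estimate against the circuit limits before the move would catch that.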

I know this is going to be difficult, but the current situation in scl1 is unmaintainable. We can close the tree for a day and fix this, or risk losing it at any time.

I suggest that we pick a day, perhaps in the week between Christmas and New Year's, and get 4-6 people down to the datacenter to get this done. With good prep work, this should go smoothly, but it will take some time.

Updated

7 years ago
Assignee: server-ops → zandr

Comment 1

7 years ago
Current thinking seems to be centering around 12/17. It's the Friday of the All-Hands, which will be a travel day for a lot of developers.
cc'ing RelEng as this will require coordinating tree closure.
Flags: needs-treeclosure?

Comment 3

7 years ago
We accomplished everything except rebalancing the iX machines. There is a lot less risk there, and I think I can manage rearranging those in smaller groups, thus avoiding a tree closure.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard