Bug 616658
Opened 14 years ago
Closed 14 years ago
Rewire and Rebalance scl1
Categories: mozilla.org Graveyard :: Server Operations, task
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: zandr, Assigned: zandr
scl1, as deployed, has a number of issues:
* The minis aren't in any particular order, so they're hard to find.
* The minis aren't cabled in any particular order, so they're hard to troubleshoot. In addition, the cabling is not up to our usual standards.
* The minis *are* grouped by platform in racks. This means:
** Losing a circuit will take out all talos slaves for a platform.
** Some circuits are running hot because of power management in Fedora, etc. We have already received one warning from the datacenter about overloaded circuits. We did some quick-and-dirty rebalancing, but it's not enough.
* Switches are daisychained, so losing certain circuits will take down more than one platform.
I haven't investigated in any depth, but it appears that the iX builders probably also have the same distribution issues.
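The single-circuit risk described above can be checked mechanically once an inventory exists. What follows is a minimal sketch, not the tooling used for scl1: the function name, the platform labels, and the circuit names are all hypothetical, and it assumes the inventory is a list of (platform, circuit) pairs, one per machine.

```python
from collections import defaultdict


def platforms_lost_per_circuit(machines):
    """Map each circuit to the platforms that would lose every machine if it failed.

    `machines` is an iterable of (platform, circuit) pairs, one per machine.
    """
    circuits_by_platform = defaultdict(set)
    for platform, circuit in machines:
        circuits_by_platform[platform].add(circuit)

    lost = defaultdict(list)
    for platform, circuits in sorted(circuits_by_platform.items()):
        # A platform fed by exactly one circuit has no survivors when it trips.
        if len(circuits) == 1:
            (only_circuit,) = circuits
            lost[only_circuit].append(platform)
    return dict(lost)


# Hypothetical inventories: one grouped by platform, one spread across circuits.
grouped = [("fedora", "A")] * 3 + [("win7", "B")] * 3
spread = [(p, c) for p in ("fedora", "win7") for c in ("A", "B", "C")]

print(platforms_lost_per_circuit(grouped))
print(platforms_lost_per_circuit(spread))
```

With the grouped inventory every circuit is a single point of failure for one platform; with the spread inventory the result is empty.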
Actions to resolve:
* Zandr to investigate distribution of builders
* Zandr to complete power measurements of minis running each OS
* Zandr to propose new layout of machines:
** Split builders evenly by platform across the racks
** Split support machines (buildmasters) across racks
** Split minis across racks by platform, grouping by platform in rows, and keeping strict sequence.
Then:
** Take a full day of downtime to rearrange machines and rewire **
I know this is going to be difficult, but the current situation in scl1 is unmaintainable. We can close the tree for a day and fix this, or risk losing it at any time.
I suggest that we pick a day, perhaps in the week between Christmas and New Year's, and get 4-6 people down to the datacenter to get this done. With good prep work, this should go smoothly, but it will take some time.
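As part of the prep work, the proposed layout (each platform split evenly across racks, grouped by platform within a rack, in strict sequence) could be generated rather than drawn by hand. This is only a sketch under assumptions: the hostnames, platform labels, and rack names are made up for illustration, and it assumes hostnames sort into the desired sequence.

```python
from collections import defaultdict


def assign_round_robin(machines, racks):
    """Deal machines out to racks one at a time, platform by platform.

    `machines` is an iterable of (hostname, platform) pairs. Each platform's
    hosts are handled in sorted order, so every rack gets a near-equal share
    of each platform and the hosts stay in sequence within a rack.
    """
    by_platform = defaultdict(list)
    for hostname, platform in machines:
        by_platform[platform].append(hostname)

    layout = {rack: [] for rack in racks}
    slot = 0  # running counter so rack totals also stay balanced
    for platform in sorted(by_platform):
        for hostname in sorted(by_platform[platform]):
            rack = racks[slot % len(racks)]
            layout[rack].append((platform, hostname))
            slot += 1
    return layout


# Hypothetical inventory: 6 Fedora minis and 4 Windows 7 minis, 3 racks.
machines = [(f"talos-fed-{i:03d}", "fedora") for i in range(1, 7)]
machines += [(f"talos-w7-{i:03d}", "win7") for i in range(1, 5)]
racks = ["rack1", "rack2", "rack3"]

layout = assign_round_robin(machines, racks)
for rack in racks:
    print(rack, layout[rack])
```

Because the counter runs across platforms instead of restarting at the first rack, both the per-platform counts and the rack totals differ by at most one machine.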
Updated (Assignee) • 14 years ago
Assignee: server-ops → zandr
Comment 1 (Assignee) • 14 years ago
Current thinking seems to be centering around 12/17. It's the Friday of the All-Hands, which will be a travel day for a lot of developers.
Comment 2 • 14 years ago
cc'ing RelEng as this will require coordinating tree closure.
Updated • 14 years ago
Flags: needs-treeclosure?
Comment 3 (Assignee) • 14 years ago
We accomplished everything except rebalancing the iX machines. There is a lot less risk there, and I think I can manage rearranging those in smaller groups, thus avoiding a tree closure.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated • 10 years ago
Product: mozilla.org → mozilla.org Graveyard