Bug 616658 - Rewire and Rebalance scl1
Status: RESOLVED FIXED
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Platform: x86 All
Importance: -- normal
Target Milestone: ---
Assigned To: Zandr Milewski [:zandr]
QA Contact: matthew zeier [:mrz]
Reported: 2010-12-03 21:34 PST by Zandr Milewski [:zandr]
Modified: 2015-03-12 08:17 PDT
john+bugzilla: needs‑treeclosure?


Description Zandr Milewski [:zandr] 2010-12-03 21:34:09 PST
scl1, as deployed, has a number of issues:

* The minis aren't in any particular order, so they're hard to find.

* The minis aren't cabled in any particular order, so they're hard to troubleshoot. In addition, the cabling is not up to our usual standards.

* The minis *are* grouped by platform in racks. This means:
** Losing a circuit will take out all talos slaves for a platform
** Some circuits are running hot because of power management in Fedora, etc. We have already received one warning from the datacenter about overloaded circuits. We did some quick-and-dirty rebalancing, but it's not enough.

* Switches are daisychained, so losing certain circuits will take down more than one platform.
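
The overloaded-circuit concern can be checked with simple arithmetic once per-OS power measurements exist. The following is only a sketch under stated assumptions: the wattage figures, the 120 V supply, the 20 A breaker, and the 80% derating factor are all hypothetical placeholders, not values from this bug.

```python
# Hypothetical per-machine draw in watts, keyed by OS. Real values would
# come from the power measurements listed under "Actions to resolve".
ASSUMED_WATTS = {"fedora": 40.0, "win7": 30.0, "osx": 25.0}


def circuit_load_amps(machines, volts=120.0):
    """Sum the assumed draw of every machine on one circuit, in amps."""
    return sum(ASSUMED_WATTS[os_name] for os_name in machines) / volts


def overloaded(machines, breaker_amps=20.0, derate=0.8, volts=120.0):
    """Return True if the circuit exceeds the derated breaker capacity."""
    return circuit_load_amps(machines, volts) > breaker_amps * derate
```

Here `machines` is a list of OS names, one entry per mini on the circuit. A rack populated by a single platform puts every machine with the highest draw on the same circuits, which is why grouping by platform makes this check more likely to fail.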

I haven't investigated in any depth, but the iX builders appear to have the same distribution issues.

Actions to resolve:
* Zandr to investigate distribution of builders
* Zandr to complete power measurements of minis running each OS
* Zandr to propose new layout of machines:
** Split builders evenly by platform across the racks
** Split support machines (buildmasters) across racks
** Split minis across racks by platform, grouping by platform in rows, and keeping strict sequence.
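
One way to read the proposed layout is as a round-robin assignment: each platform's machines are dealt out across the racks in hostname order, so no rack (or circuit) holds more than its share of any platform. This is a minimal sketch of that idea, not the layout that was actually used; the function names and hostnames are hypothetical.

```python
from collections import defaultdict


def spread_across_racks(hosts_by_platform, rack_count):
    """Assign each platform's hosts to racks round-robin, in sorted order.

    Returns {rack_index: [(platform, hostname), ...]}, with each platform's
    hosts kept in sequence within a rack.
    """
    racks = defaultdict(list)
    for platform in sorted(hosts_by_platform):
        for i, host in enumerate(sorted(hosts_by_platform[platform])):
            racks[i % rack_count].append((platform, host))
    return dict(racks)


def worst_case_loss(racks, platform):
    """Largest number of one platform's hosts that share a single rack."""
    return max(
        sum(1 for p, _ in hosts if p == platform) for hosts in racks.values()
    )


# Hypothetical inventory: six Fedora minis and four Windows 7 minis.
inventory = {
    "fed": [f"fed-{n:02d}" for n in range(1, 7)],
    "w7": [f"w7-{n:02d}" for n in range(1, 5)],
}
layout = spread_across_racks(inventory, rack_count=3)
```

With this assignment, losing one rack removes at most ceil(n / rack_count) machines of a platform with n machines, rather than all of them.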

Then:
** Take a full day of downtime to rearrange machines and rewire **

I know this is going to be difficult, but the current situation in scl1 is unmaintainable. We can close the tree for a day and fix this, or risk losing it at any time.

I suggest that we pick a day, perhaps in the week between Christmas and New Year's, and get 4-6 people down to the datacenter to get this done. With good prep work, this should go smoothly, but it will take some time.
Comment 1 Zandr Milewski [:zandr] 2010-12-06 16:30:32 PST
Current thinking seems to be centering around 12/17. It's the Friday of the All-Hands, which will be a travel day for a lot of developers.
Comment 2 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2010-12-07 10:46:10 PST
cc'ing RelEng as this will require coordinating tree closure.
Comment 3 Zandr Milewski [:zandr] 2010-12-18 08:50:22 PST
We accomplished everything except rebalancing the iX machines. There is a lot less risk there, and I think I can manage rearranging those in smaller groups, thus avoiding a tree closure.
