Closed Bug 672969 Opened 14 years ago Closed 13 years ago

fan/heatsink/memory upgrades for ix machines in scl1

Categories

(Infrastructure & Operations :: RelOps: General, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: dividehex)


Details

(Whiteboard: [downtime 1/13 9am-12pm PST][buildduty])

All of the machines in scl1 have been upgraded with the new fan/heatsink combo and extra RAM in bug 668395. We still need to do the machines in mtv1. Dustin would like this to happen onsite since there are a few machines that should not be down for an extended period of time. Please refer to the spreadsheet referenced in this bug for the list of machines in mtv1 that still need modifications.
So there are no SPOFs on the machine list, which is good. However, 41 of 61 working w32 builders are in mtv1 - a significant part of our capacity to have out of commission for multiple days. If we can do the bulk of this onsite -- at least the m* machines that are staying in mountain view -- I think that would both save time (unrack, fix on desk nearby, re-rack vs. unrack, transport, fix at iX, transport, re-rack seems like a win to me) and get us the level of capacity we need.

I don't think we'd need a "rolling upgrade" like we did in scl1, though - if we disable and halt all of the builders in the CST morning, and bring them back online as they're re-racked, I think that would be adequate. As a rough data point, right now 33 of the 41 w32 machines in mtv1 are building (for comparison, 14 of the 20 working w32 machines in scl1 are building right now, too).
Blocks: 673029
per meeting with IT yesterday:

* no progress; bkero at conference so work deferred to next week
(In reply to John O'Duinn [:joduinn] from comment #2)
> per meeting with IT yesterday:
>
> * no progress; bkero at conference so work deferred to next week

I think you have the wrong bug here... :}
As I understand it, this is still waiting on a "go" from releng to take some/all of the w32 pool down. It will also need to be delayed until we have enough onsite hands to do the move, so a few weeks.
Assignee: zandr → arich
colo-trip: --- → mtv1
No longer depends on: 712461
The comment from Dustin in bug 642305:

Per discussion in the relops meeting, we won't try to do this over the holidays, since the logistics would be pretty painful. Instead, we'll do this in early January, being more conservative in the batching. As a point of reference, let's do three batches of 7:

Batch I
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com

Batch II
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com

Batch III
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com

We'll do these sequentially, so Batch II starts after Batch I is back online in scl1. Coop, does this seem reasonable? If so, we'll start scheduling with iX in January.
(In reply to Amy Rich [:arich] [:arr] from comment #5)
> Coop, does this seem reasonable? If so, we'll start scheduling with iX in
> January.

Batches look good to me. Let buildduty know when you need to start pulling these slaves (cc-ing Lukas who is buildduty this week).
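The sequential batching above can be sketched as a small driver script. This is purely illustrative: the host lists are taken from the batch plan in this bug, but the "disable" step is a placeholder (the actual disable/re-enable workflow used by buildduty is not described here and is an assumption):

```shell
#!/bin/sh
# Sketch: walk the three upgrade batches sequentially; each batch must be
# fully back online in scl1 before the next one starts.
BATCH1="w32-ix-slave05 w32-ix-slave09 w32-ix-slave10 w32-ix-slave11 linux-ix-slave03 linux-ix-slave04 linux-ix-slave05"
BATCH2="w32-ix-slave13 w32-ix-slave14 w32-ix-slave15 w32-ix-slave16 w32-ix-slave17 linux-ix-slave07 linux-ix-slave09"
BATCH3="w32-ix-slave18 w32-ix-slave19 w32-ix-slave20 w32-ix-slave21 linux-ix-slave10 linux-ix-slave11 linux-ix-slave15"

for batch in "$BATCH1" "$BATCH2" "$BATCH3"; do
  for host in $batch; do
    # Placeholder: the real workflow disables the slave in the pool here.
    echo "disabling $host.build.mtv1.mozilla.com"
  done
  # ...machines are unracked, upgraded, re-racked, and re-enabled here
  # before the loop moves on to the next batch...
done
```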
Assignee: arich → mlarrain
We should wire up the rack spaces in scl1 with network and power before these machines move, just to make sure things go smoothly and we have allocations for everything.
Jake Watkins can do that tomorrow while I prep the machines that need to move in 3/MDF. Adding Jake to this bug now.
Status: NEW → ASSIGNED
Jake will actually go and get the network and power ready today so we are set for tomorrow's move. I have spoken with iX and they will be at SCL1 to assist in the memory and heatsink upgrades as the machines arrive in each batch. I will be editing DHCP/DNS/Nagios for the machines as they move, and arr said she will assist if she is around. Will speak with lsblakk to coordinate this fully tomorrow.
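The Nagios edits mentioned above typically include scheduling downtime for each host so the moves don't page anyone. A minimal sketch using Nagios's standard SCHEDULE_HOST_DOWNTIME external command; the command-file path here is a local file for illustration (on a real Nagios server it would be the configured command pipe, commonly /var/lib/nagios/rw/nagios.cmd, which is an assumption about this setup):

```shell
#!/bin/sh
# Sketch: append fixed host downtime entries to a Nagios command file for
# a batch of hosts being moved. Host list and duration are illustrative.
CMDFILE=./nagios-downtime.cmd
NOW=$(date +%s)
END=$((NOW + 3 * 3600))   # three hours, matching a 9am-12pm window

for host in w32-ix-slave05.build.mtv1.mozilla.com \
            linux-ix-slave03.build.mtv1.mozilla.com; do
  # Format: SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
  printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;0;relops;fan/heatsink/memory upgrade\n' \
    "$NOW" "$host" "$NOW" "$END" >> "$CMDFILE"
done
```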
We will not be doing this tomorrow since iX is not available. Will work on figuring out a good time to perform this task.
Whiteboard: [downtime 1/13 9am-12pm PST]
All machines had their fans and heatsinks upgraded. The following machines did NOT have their memory upgraded (well, they were upgraded, but then didn't even output a VGA signal and wouldn't boot): w32-ix-{05,21,10,18,11,16} and linux-ix-{05,09,11}. The bad memory upgrades were removed, and those machines are all online and functional now, minus the added RAM. We will schedule with iX to come out and fix these at a later date. The rest of the machines had their RAM upgraded.
Summary: fan/heatsink/memory upgrades for ix machines in mtv1 → fan/heatsink/memory upgrades for ix machines in scl1
Making inventory changes today
per relops meeting today, asking matthew to ping me before he goes to the colo so I can downtime the slaves
Whiteboard: [downtime 1/13 9am-12pm PST] → [downtime 1/13 9am-12pm PST][buildduty]
Passing this over to dividehex as he goes to SCL1 more often. He will be able to work with bear to get these taken care of.
Assignee: mlarrain → jwatkins
IX has confirmed they will be onsite at SCL1 on Tues 1/31 at 10am to investigate/fix the 9 systems that didn't take the ram upgrades well.
(In reply to Jake Watkins [:dividehex] from comment #15)
> IX has confirmed they will be onsite at SCL1 on Tues 1/31 at 10am to
> investigate/fix the 9 systems that didn't take the ram upgrades well.

Jake, are these the machines from comment 11?
(In reply to John Ford [:jhford] from comment #16)
> (In reply to Jake Watkins [:dividehex] from comment #15)
> > IX has confirmed they will be onsite at SCL1 on Tues 1/31 at 10am to
> > investigate/fix the 9 systems that didn't take the ram upgrades well.
>
> Jake, are these the machines from comment 11?

That is correct. For your convenience: w32-ix-{05,21,10,18,11,16} and linux-ix-{05,09,11}
Turned off:
w32-ix-slave05
w32-ix-slave21
w32-ix-slave10
w32-ix-slave18
linux-ix-slave05
linux-ix-slave09

Can be turned off (I couldn't ssh in):
w32-ix-slave16

Gracefully shutting down (waiting for last build to finish):
w32-ix-slave11
linux-ix-slave11
w32-ix-slave05 | RAM upgrade fixed
w32-ix-slave21 | RAM upgrade fixed
w32-ix-slave10 | RAM upgrade fixed
w32-ix-slave18 | RAM upgrade fixed
w32-ix-slave16 | RAM upgrade fixed - then sent for repair - see bug 720167
w32-ix-slave11 | RAM upgrade fixed
linux-ix-slave11 | RAM upgrade fixed

These 2 did not have their RAM removed as originally reported and should not have been on this list:
linux-ix-slave05 | no fix needed - booted ok
linux-ix-slave09 | no fix needed - booted ok
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations