Closed
Bug 672969
Opened 14 years ago
Closed 13 years ago
fan/heatsink/memory upgrades for ix machines in scl1
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Assigned: dividehex)
Details
(Whiteboard: [downtime 1/13 9am-12pm PST][buildduty])
All of the machines in scl1 have been upgraded with the new fan/heatsink combo and extra RAM in bug 668395. We still need to do the machines in mtv1. Dustin would like this to happen onsite since there are a few machines that should not be down for an extended period of time.
Please refer to the spreadsheet referenced in this bug for the list of machines in mtv1 that still need modifications.
Comment 1•14 years ago
So there are no SPOFs on the machine list, which is good. However, 41 of 61 working w32 builders are in mtv1 - a significant part of our capacity to have out of commission for multiple days.
If we can do the bulk of this onsite -- at least the m* machines that are staying in mountain view -- I think that would both save time (unrack, fix on desk nearby, re-rack vs. unrack, transport, fix at iX, transport, re-rack seems like a win to me) and get us the level of capacity we need. I don't think we'd need a "rolling upgrade" like we did in scl1, though - if we disable and halt all of the builders in the CST morning, and bring them back online as they're re-racked, I think that would be adequate.
As a rough data point, right now 33 of the 41 w32 machines in mtv1 are building (for comparison, 14 of the 20 working w32 machines in scl1 are building right now, too).
Comment 2•14 years ago
per meeting with IT yesterday:
* no progress; bkero at conference so work deferred to next week
Reporter
Comment 3•14 years ago
(In reply to John O'Duinn [:joduinn] from comment #2)
> per meeting with IT yesterday:
>
> * no progress; bkero at conference so work deferred to next week
I think you have the wrong bug here... :}
Comment 4•14 years ago
As I understand it, this is still waiting on a "go" from releng to take some/all of the w32 pool down.
It will also need to be delayed until we have enough onsite hands to do the move, so a few weeks.
Reporter
Updated•14 years ago
Assignee: zandr → arich
Reporter
Updated•14 years ago
colo-trip: --- → mtv1
Reporter
Comment 5•13 years ago
The comment from Dustin in bug 642305:
Per discussion in the relops meeting, we won't try to do this over the holidays, since the logistics would be pretty painful. Instead, we'll do this in early January, being more conservative in the batching.
As a point of reference, let's do three batches of 7:
Batch I
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com
Batch II
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com
Batch III
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com
We'll do these sequentially, so Batch II starts after Batch I is back online in scl1.
Coop, does this seem reasonable? If so, we'll start scheduling with iX in January.
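(For illustration only: a minimal sketch of how the "Batch II starts after Batch I is back online" sequencing could be checked mechanically. The port-22 reachability test, the poll interval, and the assumption that the hosts keep their mtv1 names across the move are all assumptions, not the actual process buildduty used.)

import socket
import time

# Batches as listed in the comment above.  Batch II is only started once every
# Batch I host has gone down (pulled for the upgrade) and come back online.
# "Online" here just means "answers on the ssh port".
BATCHES = [
    ["w32-ix-slave05", "w32-ix-slave09", "w32-ix-slave10", "w32-ix-slave11",
     "linux-ix-slave03", "linux-ix-slave04", "linux-ix-slave05"],
    ["w32-ix-slave13", "w32-ix-slave14", "w32-ix-slave15", "w32-ix-slave16",
     "w32-ix-slave17", "linux-ix-slave07", "linux-ix-slave09"],
    ["w32-ix-slave18", "w32-ix-slave19", "w32-ix-slave20", "w32-ix-slave21",
     "linux-ix-slave10", "linux-ix-slave11", "linux-ix-slave15"],
]
SUFFIX = ".build.mtv1.mozilla.com"  # assumed: names unchanged across the move

def is_up(host):
    try:
        with socket.create_connection((host + SUFFIX, 22), timeout=5):
            return True
    except OSError:
        return False

for n, batch in enumerate(BATCHES, 1):
    print("Batch %d can be pulled now" % n)
    while any(is_up(h) for h in batch):      # wait for every host to be pulled
        time.sleep(300)
    while not all(is_up(h) for h in batch):  # wait for every host to come back
        time.sleep(300)
print("all batches done")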
Comment 6•13 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #5)
> Coop, does this seem reasonable? If so, we'll start scheduling with iX in
> January.
Batches look good to me.
Let buildduty know when you need to start pulling these slaves (cc-ing Lukas who is buildduty this week).
Reporter
Updated•13 years ago
Assignee: arich → mlarrain
Reporter
Comment 7•13 years ago
We should wire up the rack spaces in scl1 with network and power before these machines move, just to make sure things go smoothly and we have allocations for everything.
Comment 8•13 years ago
Jake Watkins can do that tomorrow while I prep the machines that need to move in 3/MDF. Adding Jake to this bug now.
Status: NEW → ASSIGNED
Comment 9•13 years ago
Jake will actually go and get the network and power ready today so we are set for tomorrow's move. I have spoken with iX and they will be at SCL1 to assist in the memory and heatsink upgrades as the machines arrive in each batch. I will be editing DHCP/DNS/Nagios for the machines as they move, and arr said she will assist if she is around. Will speak with lsblakk to coordinate this fully tomorrow.
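(For illustration only: the DHCP edits mentioned above amount to regenerating host entries for the moved slaves. The sketch below shows the general shape of an ISC dhcpd host stanza; the MAC addresses and fixed addresses are placeholders, not the real values, and the actual edits were made directly in the configs.)

# Illustrative sketch of regenerating ISC dhcpd host entries for moved slaves.
STANZA = """host %(name)s {
  hardware ethernet %(mac)s;
  fixed-address %(ip)s;
}"""

moved = [
    ("w32-ix-slave05", "00:00:00:00:00:01", "10.0.0.11"),    # hypothetical values
    ("linux-ix-slave03", "00:00:00:00:00:02", "10.0.0.12"),  # hypothetical values
]

for name, mac, ip in moved:
    print(STANZA % {"name": name, "mac": mac, "ip": ip})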
Comment 10•13 years ago
We will not be doing this tomorrow since iX is not available. Will work on figuring out a good time to perform this task.
Updated•13 years ago
Whiteboard: [downtime 1/13 9am-12pm PST]
Reporter
Comment 11•13 years ago
All machines had their fans and heatsinks upgraded. The following machines did NOT have their memory upgraded (well, they were upgraded, but then didn't even output a VGA signal and wouldn't boot):
w32-ix-{05,21,10,18,11,16} and linux-ix-{05,09,11}
The bad memory upgrades were removed and the machines are all online and functional minus the added RAM now. We will schedule with iX to come out and fix these at a later date.
The rest of the machines had their RAM upgraded.
Summary: fan/heatsink/memory upgrades for ix machines in mtv1 → fan/heatsink/memory upgrades for ix machines in scl1
Comment 12•13 years ago
Making inventory changes today
Comment 13•13 years ago
per relops meeting today, asking matthew to ping me before he goes to the colo so I can downtime the slaves
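(For illustration only: downtiming the slaves in Nagios can be done through its external command interface, roughly as sketched below. The command-file path and host list are assumptions, and the web UI works just as well.)

import time

CMD_FILE = "/var/spool/nagios/cmd/nagios.cmd"       # path varies per install (assumption)
HOSTS = ["w32-ix-slave05.build.mtv1.mozilla.com"]   # example subset

start = int(time.time())
end = start + 3 * 3600    # matches the 9am-12pm window in the whiteboard

with open(CMD_FILE, "w") as cmd:
    for host in HOSTS:
        # Format: [ts] SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
        cmd.write("[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;buildduty;bug 672969 colo trip\n"
                  % (start, host, start, end, end - start))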
Whiteboard: [downtime 1/13 9am-12pm PST] → [downtime 1/13 9am-12pm PST][buildduty]
Comment 14•13 years ago
Passing this over to dividehex as he goes to SCL1 more often. He will be able to work with bear to get these taken care of.
Assignee: mlarrain → jwatkins
Assignee
Comment 15•13 years ago
IX has confirmed they will be onsite at SCL1 on Tues 1/31 at 10am to investigate/fix the 9 systems that didn't take the ram upgrades well.
Comment 16•13 years ago
(In reply to Jake Watkins [:dividehex] from comment #15)
> IX has confirmed they will be onsite at SCL1 on Tues 1/31 at 10am to
> investigate/fix the 9 systems that didn't take the ram upgrades well.
Jake, are these the machines from comment 11?
Assignee
Comment 17•13 years ago
(In reply to John Ford [:jhford] from comment #16)
> (In reply to Jake Watkins [:dividehex] from comment #15)
> > IX has confirmed they will be onsite at SCL1 on Tues 1/31 at 10am to
> > investigate/fix the 9 systems that didn't take the ram upgrades well.
>
> Jake, are these the machines from comment 11?
That is correct. For your convenience:
w32-ix-{05,21,10,18,11,16} and linux-ix-{05,09,11}
Comment 18•13 years ago
Turned off:
w32-ix-slave05
w32-ix-slave21
w32-ix-slave10
w32-ix-slave18
linux-ix-slave05
linux-ix-slave09
Can be turned off (I couldn't ssh in):
w32-ix-slave16
Gracefully shutdown (waiting for last build to finish):
w32-ix-slave11
linux-ix-slave11
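(For illustration only: the triage above, sketched as a script. The reachability test, hostname suffix, and shutdown command are assumptions; the real graceful shutdowns went through buildbot so running builds could finish, and the Windows slaves need different handling than shown here.)

import socket
import subprocess

HOSTS = ["w32-ix-slave05", "w32-ix-slave21", "w32-ix-slave10", "w32-ix-slave18",
         "w32-ix-slave16", "w32-ix-slave11",
         "linux-ix-slave05", "linux-ix-slave09", "linux-ix-slave11"]
SUFFIX = ".build.mtv1.mozilla.com"   # assumed suffix

def reachable(host):
    try:
        with socket.create_connection((host, 22), timeout=5):
            return True
    except OSError:
        return False

for short in HOSTS:
    fqdn = short + SUFFIX
    if not reachable(fqdn):
        print("%s unreachable - flag for a manual power-off at the colo" % fqdn)
    elif short.startswith("linux-"):
        # Halt the Linux slaves over ssh; the w32 slaves are handled separately.
        subprocess.call(["ssh", fqdn, "sudo", "shutdown", "-h", "now"])
    else:
        print("%s reachable - shut down via the usual Windows procedure" % fqdn)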
Assignee
Comment 19•13 years ago
w32-ix-slave05 | RAM upgrade fixed
w32-ix-slave21 | RAM upgrade fixed
w32-ix-slave10 | RAM upgrade fixed
w32-ix-slave18 | RAM upgrade fixed
w32-ix-slave16 | RAM upgrade fixed - then sent for repair - see bug 720167
w32-ix-slave11 | RAM upgrade fixed
linux-ix-slave11 | RAM upgrade fixed
These two did not have their RAM removed as originally reported and should not have been on this list:
linux-ix-slave05 | no fix needed - booted ok
linux-ix-slave09 | no fix needed - booted ok
Assignee
Updated•13 years ago
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations