Closed Bug 774829 Opened 12 years ago Closed 12 years ago

upgrade heatsink/fan/memory and move mw32-ix-slave13 - mw32-ix-slave26 to scl1

Categories

(Infrastructure & Operations :: DCOps, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Unassigned)

Details

(Whiteboard: mtv to scl1 [reit])

Derek, you asked for a heads up on needing to upgrade and move these boxes.  Coop says that we'll be ready to go soon, so I'm filing this bug to coordinate.  He'll update this bug when the machines are disabled and ready to move.

The following machines are slated to be upgraded and moved:

mw32-ix-slave11
mw32-ix-slave12
mw32-ix-slave13
mw32-ix-slave14
mw32-ix-slave15
mw32-ix-slave16
mw32-ix-slave17
mw32-ix-slave18
mw32-ix-slave19
mw32-ix-slave20
mw32-ix-slave21
mw32-ix-slave22
mw32-ix-slave23
mw32-ix-slave24
mw32-ix-slave25
mw32-ix-slave26

They will be changing hostnames when they reach the new datacenter, so if they have hostnames on the labels, please let me know and we can get you an updated list.

Matt or Jake can tell you where the parts for the upgrades are, and give any special instructions.
The parts are stored in boxes in haxxor.
Summary: upgrade heatfink/fan/memory and move mw32-ix-slave11 - mw32-ix-slave26 to scl1 → upgrade heatsink/fan/memory and move mw32-ix-slave11 - mw32-ix-slave26 to scl1
Whiteboard: mtv to scl1
The mapping of the old to new hostnames for this batch (new hostname on the left, old hostname on the right):

w64-ix-slave85                  mw32-ix-slave11
w64-ix-slave86                  mw32-ix-slave12
w64-ix-slave87                  mw32-ix-slave13
w64-ix-slave88                  mw32-ix-slave14
w64-ix-slave89                  mw32-ix-slave15
w64-ix-slave90                  mw32-ix-slave16
w64-ix-slave91                  mw32-ix-slave17
w64-ix-slave92                  mw32-ix-slave18
w64-ix-slave93                  mw32-ix-slave19
w64-ix-slave94                  mw32-ix-slave20
w64-ix-slave95                  mw32-ix-slave21
w64-ix-slave96                  mw32-ix-slave22
w64-ix-slave97                  mw32-ix-slave23
w64-ix-slave98                  mw32-ix-slave24
w64-ix-slave99                  mw32-ix-slave25
w64-ix-slave100                 mw32-ix-slave26

w64-ix-slave85-mgmt             mw32-ix-slave11-mgmt
w64-ix-slave86-mgmt             mw32-ix-slave12-mgmt
w64-ix-slave87-mgmt             mw32-ix-slave13-mgmt
w64-ix-slave88-mgmt             mw32-ix-slave14-mgmt
w64-ix-slave89-mgmt             mw32-ix-slave15-mgmt
w64-ix-slave90-mgmt             mw32-ix-slave16-mgmt
w64-ix-slave91-mgmt             mw32-ix-slave17-mgmt
w64-ix-slave92-mgmt             mw32-ix-slave18-mgmt
w64-ix-slave93-mgmt             mw32-ix-slave19-mgmt
w64-ix-slave94-mgmt             mw32-ix-slave20-mgmt
w64-ix-slave95-mgmt             mw32-ix-slave21-mgmt
w64-ix-slave96-mgmt             mw32-ix-slave22-mgmt
w64-ix-slave97-mgmt             mw32-ix-slave23-mgmt
w64-ix-slave98-mgmt             mw32-ix-slave24-mgmt
w64-ix-slave99-mgmt             mw32-ix-slave25-mgmt
w64-ix-slave100-mgmt            mw32-ix-slave26-mgmt
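
The table above follows a regular pattern: each mw32-ix-slaveN appears to become w64-ix-slave(N+74), with any -mgmt suffix carried over. A minimal sketch of that rule, assuming the +74 offset inferred from the table holds for every host in this batch (the function name and constants are hypothetical, not part of any inventory tooling):

```python
# Sketch only: the offset is inferred from the mapping table in this bug,
# not taken from an official naming rule.
OFFSET = 74
OLD_PREFIX = "mw32-ix-slave"
NEW_PREFIX = "w64-ix-slave"


def new_hostname(old: str) -> str:
    """Map e.g. 'mw32-ix-slave11' or 'mw32-ix-slave11-mgmt' to its w64 name."""
    if not old.startswith(OLD_PREFIX):
        raise ValueError(f"unexpected hostname: {old}")
    rest = old[len(OLD_PREFIX):]
    # Split '11-mgmt' into ('11', '-', 'mgmt'); a plain '11' gives ('11', '', '').
    number, sep, suffix = rest.partition("-")
    return f"{NEW_PREFIX}{int(number) + OFFSET}{sep}{suffix}"


# Build the full old -> new mapping for slaves 11 through 26.
mapping = {
    f"{OLD_PREFIX}{n}": new_hostname(f"{OLD_PREFIX}{n}") for n in range(11, 27)
}
```

A generated mapping like this could be diffed against the hand-written table to catch transcription errors before DHCP and DNS are updated.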
Whiteboard: mtv to scl1 → mtv to scl1 [reit]
colo-trip: --- → mtv1
The focus/purpose of this bug is on re-tasking a portion of the 32-bit builder pool from windows 32 to windows 64 to facilitate bug 758275 (increasing the size of the 64-bit builder pool).  This should have no impact on any chemspills or regular builds because the only machines that should be moved are those that are extra capacity in the 32-bit builder pool.

Changing any of these machines to 64-bit builders necessitates moving them to scl1, and moving to scl1 necessitates upgrading the hardware.

The information needed here from releng (and the reason that this bug is currently blocked) is how many servers (some number between 0 and 16) can be retasked from 32-bit builders to 64-bit builders.

* IFF we only need 10 32-bit builders, then we go ahead and finish this move in its entirety, and it doesn't matter how long it takes us (within reason, say a day) to move and reimage them.  We are only ADDING capacity to the 64-bit builder pool and removing UNUSED capacity from the 32-bit builder pool.  

* IFF we need fewer than 26 32-bit builders but more than 10, then we can move some subset of machines, and it doesn't matter how long it takes us (within reason, say a day) to move and reimage them.  We are only ADDING capacity to the 64-bit builder pool and removing UNUSED capacity from the 32-bit builder pool.  

* IFF we still need all 26 32-bit builders, then we don't do the move at all and we R/F this bug without further action.
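
The three cases above reduce to simple arithmetic. A rough sketch, assuming the 32-bit pool currently holds 26 builders and that at most the 16 machines in this bug can be moved (the function and constant names are illustrative only):

```python
# Sketch only: pool and batch sizes are taken from the cases described above.
POOL_SIZE = 26   # 32-bit builders currently in the pool
BATCH_SIZE = 16  # machines covered by this bug


def machines_to_move(needed_32bit: int) -> int:
    """Return how many machines can be retasked to the 64-bit pool."""
    spare = POOL_SIZE - needed_32bit
    # Never move more than the batch, and never a negative number.
    return max(0, min(BATCH_SIZE, spare))
```

For example, machines_to_move(10) gives 16 (the full move), machines_to_move(20) gives 6 (a subset), and machines_to_move(26) gives 0 (no move, R/F the bug).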
Hal, I wanted to check in for an update on this.  Any feedback on how many machines we want to re-purpose as 64-bit?
No longer blocks: 780022
Releng has requested that we move 14 of these 16 machines. Does dcops have a preference on which those are?  If not, I suggest we upgrade/move:

mw32-ix-slave11
mw32-ix-slave12
mw32-ix-slave13
mw32-ix-slave14
mw32-ix-slave15
mw32-ix-slave16
mw32-ix-slave17
mw32-ix-slave18
mw32-ix-slave19
mw32-ix-slave20
mw32-ix-slave21
mw32-ix-slave22
mw32-ix-slave23
mw32-ix-slave24

In terms of priority, this is behind getting the tegras fully functional, so do you guys know when you might get a chance to upgrade and move them?
Hi Amy,

Speaking with Derek before on this bug, he wants our "SLA" for inter-colo moves to be 10 business days. However, we're not that busy so we can start working on the upgrades today or tomorrow and start moving them Monday. Have these hosts been brought down and can we upgrade them at our convenience?

Thanks,
Van
Van: hwine will follow up when releng is ready for the machines to be taken out of service.
We need to leave 2 in this block so we will be working on the following: 

mw32-ix-slave13
mw32-ix-slave14
mw32-ix-slave15
mw32-ix-slave16
mw32-ix-slave17
mw32-ix-slave18
mw32-ix-slave19
mw32-ix-slave20
mw32-ix-slave21
mw32-ix-slave22
mw32-ix-slave23
mw32-ix-slave24

Hal is coordinating with RelEng to take these out of service and will update the bug when you can get started.
Actually, based on the list hal provided, the following should be moved (I didn't notice that 25 was not in his list):

mw32-ix-slave12
mw32-ix-slave13
mw32-ix-slave14
mw32-ix-slave15
mw32-ix-slave16
mw32-ix-slave17
mw32-ix-slave18
mw32-ix-slave19
mw32-ix-slave20
mw32-ix-slave21
mw32-ix-slave22
mw32-ix-slave23
mw32-ix-slave24
mw32-ix-slave25

That leaves 11 and 26 in mtv1 (no upgrade, no move).
2 corrections to comment #9:
- total count is 14 (not 12)
- the final, official, shut them down any time you want list is:
    mw32-ix-slave13
    mw32-ix-slave14
    mw32-ix-slave15
    mw32-ix-slave16
    mw32-ix-slave17
    mw32-ix-slave18
    mw32-ix-slave19
    mw32-ix-slave20
    mw32-ix-slave21
    mw32-ix-slave22
    mw32-ix-slave23
    mw32-ix-slave24
    mw32-ix-slave25
    mw32-ix-slave26

This is the official go to start unracking them! Thanks!
These hosts have been renamed in inventory (see comment 2 for the new hostname mappings).  DCops: please update the rack and switch info once the machines have moved.
Hosts removed from nagios and commented out entries made in nagios for new hosts.

Still to be done:  Remove old hostnames from DHCP and DNS once the move is complete and verified.
:arr, we spoke in #dcops and you said the hard drives were to be upgraded as well. the hds inside the machine are currently 250gb 7200 rpm SATA drives, and the ones I found inside haxxor are also 250gb 7200 rpm SATA drives. i didn't bother with swapping out the drives since they're identical. please let me know if this is not correct and I should be looking for another set of drives.

thanks,
van
The hosts have been upgraded to 8gb of memory, heat sink and fan replaced. Lisa is scheduling a pick-up from MV and delivery to SCL1 for us Thursday at 4pm. We can probably rack, cable and inventory them Friday. Please let me know if there are any issues.

Thanks,
Van
Van, we can't schedule the move sooner?
Melissa, as Van noted in Comment 6, we generally need 10 business days to schedule a move of this size. When we have more than two or three servers, it becomes necessary to involve WPR and their third-party moving service. Van has actually managed to get everyone scheduled within 6 days of the request, which is already better than the expected forecast.
Summary: upgrade heatsink/fan/memory and move mw32-ix-slave11 - mw32-ix-slave26 to scl1 → upgrade heatsink/fan/memory and move mw32-ix-slave13 - mw32-ix-slave26 to scl1
(In reply to Van Le [:van] from comment #13)
> :arr, we spoke in #dcops and you said the hard drives were to be upgraded as
> well. the hds inside the machine are currently 250gb 7200 rpm SATA drives,
> and the ones I found inside haxxor are also 250gb 7200 rpm SATA drives. i
> didn't bother with swapping out the drives since they're identical. please
> let me know if this is not correct and I should be looking for another set
> of drives.
> 
> thanks,
> van

:arr, was this question about disks ever resolved?
(In reply to Derek Moore from comment #16)
> Melissa, as Van noted in Comment 6, we generally need 10 business days to
> schedule a move of this size. When we have more than two or three servers,
> it becomes necessary to involve WPR and their third-party moving service.
> Van has actually managed to get everyone scheduled within 6 days of the
> request, which is already better than the expected forecast.

:dmoore:
1) Could we shuffle 2-3 machines at a time? :-) We're talking 12 machines here total, but each batch of 2-3 would help out as soon as they came online. 
2) Are there other things we could be doing while we wait?
** machine imaging?
** verify netflows?
** nagios?
...?

As you can probably tell, I'm looking for a way we can have these machines in production helping clear our win2008 backlog asap.
There's nothing left that we can do to these machines until they're in scl1.
No longer blocks: 712456
(In reply to Amy Rich [:arich] [:arr] from comment #19)

Actually, I should be more specific there.  If the hardware's been upgraded, there's nothing more that dcops or relops can do for these machines.  If releng has other bugs that they want to file to prep things on their end, there may be stuff to do there (buildbot, graphs, etc).  This bug is just for the hardware, though, and there's a corresponding bug that relops has for reimaging the machines (which will include monitoring) once the hardware is up.
Hosts have been moved to SCL1. We should have them up by tomorrow.
colo-trip: mtv1 → scl1
correction, by end of day tomorrow. We still need to rack, cable, update inventory and configure the switch these hosts will be residing on.
Move has been completed. A new rack, switch, and PDU have been installed and configured. All hosts should be reachable and inventory has been updated. Please let me know of any issues.

https://inventory.mozilla.org/en-US/systems/racks/?location=0&status=&rack=246&allocation=
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee: server-ops → server-ops-dcops
Product: mozilla.org → Infrastructure & Operations