upgrade heatsink/fan/memory on remaining ix builder machines and move them to scl3

RESOLVED FIXED

Status

Infrastructure & Operations
DCOps
6 years ago
3 years ago

People

(Reporter: dustin, Unassigned)

Tracking

Details

(Whiteboard: #UJU-709-53233 - replacement drive for 009)

Description

6 years ago
Low-priority at the moment since scl3 isn't built yet.  These hosts will need to be allocated space in scl3 and moved there, via iX for heatsink/fan replacement.

mw32-ix-slave01
mw32-ix-slave02
mw32-ix-slave03
mw32-ix-slave04
mw32-ix-slave05
mw32-ix-slave06
mw32-ix-slave07
mw32-ix-slave08
mw32-ix-slave09
mw32-ix-slave10
mw32-ix-slave11
mw32-ix-slave12
mw32-ix-slave13
mw32-ix-slave14
mw32-ix-slave15
mw32-ix-slave16
mw32-ix-slave17
mw32-ix-slave18
mw32-ix-slave19
mw32-ix-slave20
mw32-ix-slave21
mw32-ix-slave22
mw32-ix-slave23
mw32-ix-slave24
mw32-ix-slave25
mw32-ix-slave26

Updated

6 years ago
Blocks: 712457
Assignee: server-ops-releng → arich
colo-trip: --- → mtv1

Comment 1

6 years ago
I don't think this is *quite* ready for an mtv1 trip yet!
colo-trip: mtv1 → ---
Summary: move mw32-ix-slave* to scl3 → upgrade heatsink/fan/memory and move mw32-ix-slave* to scl3
Assigning this to Jake and we can have iX come out and do this once we are ready to move them.
Assignee: arich → jwatkins
Status: NEW → ASSIGNED
linux-ix-ref
mv-moz2-linux-ix-slave01
mv-moz2-linux-ix-slave02
mv-moz2-linux-ix-slave03
mv-moz2-linux-ix-slave04
mv-moz2-linux-ix-slave05
mv-moz2-linux-ix-slave06
mv-moz2-linux-ix-slave07
mv-moz2-linux-ix-slave08
mv-moz2-linux-ix-slave09
mv-moz2-linux-ix-slave10
mv-moz2-linux-ix-slave11
mv-moz2-linux-ix-slave12
mv-moz2-linux-ix-slave13
mv-moz2-linux-ix-slave14
mv-moz2-linux-ix-slave15
mv-moz2-linux-ix-slave16
mv-moz2-linux-ix-slave17
mv-moz2-linux-ix-slave18
mv-moz2-linux-ix-slave19
mv-moz2-linux-ix-slave20
mv-moz2-linux-ix-slave21
mv-moz2-linux-ix-slave22
mv-moz2-linux-ix-slave23
mw32-ix-slave01
mw32-ix-slave02
mw32-ix-slave03
mw32-ix-slave04
mw32-ix-slave05
mw32-ix-slave06
mw32-ix-slave07
mw32-ix-slave08
mw32-ix-slave09
mw32-ix-slave10
mw32-ix-slave11
mw32-ix-slave12
mw32-ix-slave13
mw32-ix-slave14
mw32-ix-slave15
mw32-ix-slave16
mw32-ix-slave17
mw32-ix-slave18
mw32-ix-slave19
mw32-ix-slave20
mw32-ix-slave21
mw32-ix-slave22
mw32-ix-slave23
mw32-ix-slave24
mw32-ix-slave25
mw32-ix-slave26
win32-ix-ref
Summary: upgrade heatsink/fan/memory and move mw32-ix-slave* to scl3 → upgrade heatsink/fan/memory and move mw32-ix-slave* and mv-moz2-linux-ix-slave*, and ix ref machines to scl3
Assignee: jwatkins → mlarrain
No longer blocks: 712457
Depends on: 774829
Blocks: 780022

Updated

5 years ago
Assignee: mlarrain → server-ops
Component: Server Operations: RelEng → Server Operations: DCOps
QA Contact: zandr → dmoore
This project is on hold until the hardware is released for move. Please update here when DC Ops is cleared to begin planning.
Whiteboard: [reit]

Updated

5 years ago
colo-trip: --- → mtv1
This bug is not currently actionable, so I'm making it infra-only to avoid confusion.
Group: infra
No longer blocks: 780022
Blocks: 780022
Group: infra
Summary: upgrade heatsink/fan/memory and move mw32-ix-slave* and mv-moz2-linux-ix-slave*, and ix ref machines to scl3 → upgrade heatsink/fan/memory on remaining ix builder machines and move them to scl3
Blocks: 784721
No longer blocks: 780022
No longer depends on: 774829
Please upgrade the following machines now but leave them in service in mtv1:

mv-moz2-linux-ix-slave01
mv-moz2-linux-ix-slave02
mv-moz2-linux-ix-slave03
mv-moz2-linux-ix-slave04
mv-moz2-linux-ix-slave05
mv-moz2-linux-ix-slave06
mv-moz2-linux-ix-slave07
mv-moz2-linux-ix-slave08
mv-moz2-linux-ix-slave09
mv-moz2-linux-ix-slave10
mv-moz2-linux-ix-slave11
mv-moz2-linux-ix-slave12
mv-moz2-linux-ix-slave13
mv-moz2-linux-ix-slave14
mv-moz2-linux-ix-slave15
mv-moz2-linux-ix-slave16
mv-moz2-linux-ix-slave17
mv-moz2-linux-ix-slave18
mv-moz2-linux-ix-slave19
mv-moz2-linux-ix-slave20
mv-moz2-linux-ix-slave21
mv-moz2-linux-ix-slave22
mv-moz2-linux-ix-slave23

They can be taken offline at any time.
Blocks: 847529
Hostnames have been remapped per upcoming retask:

mv-moz2-linux-ix-slave01 bld-centos6-ix-051
mv-moz2-linux-ix-slave02 bld-centos6-ix-052
mv-moz2-linux-ix-slave03 bld-centos6-ix-053
mv-moz2-linux-ix-slave04 bld-centos6-ix-054
mv-moz2-linux-ix-slave05 bld-centos6-ix-055
mv-moz2-linux-ix-slave06 bld-centos6-ix-056
mv-moz2-linux-ix-slave07 bld-centos6-ix-057
mv-moz2-linux-ix-slave08 bld-centos6-ix-058
mv-moz2-linux-ix-slave09 bld-centos6-ix-059
mv-moz2-linux-ix-slave10 bld-centos6-ix-060
mv-moz2-linux-ix-slave11 bld-centos6-ix-061
mv-moz2-linux-ix-slave12 bld-centos6-ix-062
mv-moz2-linux-ix-slave13 bld-centos6-ix-063
mv-moz2-linux-ix-slave14 bld-centos6-ix-064
mv-moz2-linux-ix-slave15 bld-centos6-ix-065
mv-moz2-linux-ix-slave16 bld-centos6-ix-066
mv-moz2-linux-ix-slave17 bld-centos6-ix-067
mv-moz2-linux-ix-slave18 bld-centos6-ix-068
mv-moz2-linux-ix-slave19 bld-centos6-ix-069
mv-moz2-linux-ix-slave20 bld-centos6-ix-070
mv-moz2-linux-ix-slave21 bld-centos6-ix-071
mv-moz2-linux-ix-slave22 bld-centos6-ix-072
mv-moz2-linux-ix-slave23 bld-centos6-ix-073
My mistake, that should be:

bld-linux64-ix-051.build.mtv1.mozilla.com
bld-linux64-ix-052.build.mtv1.mozilla.com
bld-linux64-ix-053.build.mtv1.mozilla.com
bld-linux64-ix-054.build.mtv1.mozilla.com
bld-linux64-ix-055.build.mtv1.mozilla.com
bld-linux64-ix-056.build.mtv1.mozilla.com
bld-linux64-ix-057.build.mtv1.mozilla.com
bld-linux64-ix-058.build.mtv1.mozilla.com
bld-linux64-ix-059.build.mtv1.mozilla.com
bld-linux64-ix-060.build.mtv1.mozilla.com
bld-linux64-ix-061.build.mtv1.mozilla.com
bld-linux64-ix-062.build.mtv1.mozilla.com
bld-linux64-ix-063.build.mtv1.mozilla.com
bld-linux64-ix-064.build.mtv1.mozilla.com
bld-linux64-ix-065.build.mtv1.mozilla.com
bld-linux64-ix-066.build.mtv1.mozilla.com
bld-linux64-ix-067.build.mtv1.mozilla.com
bld-linux64-ix-068.build.mtv1.mozilla.com
bld-linux64-ix-069.build.mtv1.mozilla.com
bld-linux64-ix-070.build.mtv1.mozilla.com
bld-linux64-ix-071.build.mtv1.mozilla.com
bld-linux64-ix-072.build.mtv1.mozilla.com
bld-linux64-ix-073.build.mtv1.mozilla.com
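
The remap above appears to follow a fixed offset: slave NN becomes host 0(NN+50). Here is a minimal Python sketch of that mapping. It assumes the pairing order from the first list carries over to the corrected names; the `remap` helper is hypothetical and not part of any existing tooling.

```python
# Hypothetical helper: derive the new hostname from the old slave number.
# Assumes the offset of 50 seen in the lists above (slave01 -> 051 ... slave23 -> 073).
DOMAIN = "build.mtv1.mozilla.com"
OLD_PREFIX = "mv-moz2-linux-ix-slave"


def remap(old_name: str) -> str:
    """Map mv-moz2-linux-ix-slaveNN to bld-linux64-ix-0MM, where MM = NN + 50."""
    if not old_name.startswith(OLD_PREFIX):
        raise ValueError(f"unexpected hostname: {old_name}")
    number = int(old_name[len(OLD_PREFIX):])
    return f"bld-linux64-ix-{number + 50:03d}.{DOMAIN}"


# Build the full old -> new table for slaves 01 through 23.
mapping = {
    f"{OLD_PREFIX}{n:02d}": remap(f"{OLD_PREFIX}{n:02d}") for n in range(1, 24)
}
```

A table like `mapping` could be pasted into the bug or compared against inventory before the rename.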
Blocks: 849022

Comment 9

4 years ago
Slight change of plan here. 19 of the iX boxes are going to become Linux foopies to get us off of the Mac minis there. Can we please rename a subset of these for use as foopies, specifically:

bld-linux64-ix-0[55-73]
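
For reference, the bracket shorthand above covers 19 hosts. This small Python sketch expands it, assuming the brackets denote an inclusive numeric range; the `expand` helper is hypothetical, not an existing tool.

```python
import re


def expand(pattern: str) -> list[str]:
    """Expand one inclusive [start-end] range, e.g. 'host-0[55-73]'."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracket range present: return the name unchanged.
        return [pattern]
    head, start, end, tail = match.groups()
    width = len(start)  # preserve any zero padding used in the range bounds
    return [f"{head}{n:0{width}d}{tail}" for n in range(int(start), int(end) + 1)]


hosts = expand("bld-linux64-ix-0[55-73]")
```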
coop: I'll handle that in a different bug next week since it's just a name change.  That doesn't impact the need to upgrade the hardware in this dcops bug.

Comment 11

4 years ago
So far we have upgraded the heatsinks and memory of the iX systems with asset tags 3132, 3133, 3134, 3135, 3136, 3137, 3139, 3140, and 3144. We will complete the rest of the upgrades on Monday.
The only one of those that had a working IPMI was 3139 (foopy125).  Could you please make sure that the IPMI comes up on each device so I can kickstart them?
So I may have found some secret sauce to making the IPMI lan connections recover.  From the local machine, you have to set the IP src type to static, wait till it picks that up, then switch it back to dhcp.  This seems to work MUCH more reliably than doing an mc reset.

ipmitool lan set 1 ipsrc static
ipmitool lan set 1 ipsrc dhcp

Doing this I got the connections to all but foopy122 (doesn't seem to take) and foopy124 (can't get to the host) up.  Please check on those two.
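
The static-then-dhcp toggle described above can be scripted. What follows is a hedged Python sketch, not a tested procedure. The two `ipmitool lan set` commands are taken from the comment. The 10-second pause and the injectable `run` callback are assumptions, since the comment only says to wait until the setting is picked up.

```python
import subprocess
import time


def toggle_commands(channel: int = 1) -> list[list[str]]:
    """Return the two commands from the comment as argv lists."""
    return [
        ["ipmitool", "lan", "set", str(channel), "ipsrc", "static"],
        ["ipmitool", "lan", "set", str(channel), "ipsrc", "dhcp"],
    ]


def reset_ipmi_lan(channel: int = 1, wait_seconds: float = 10.0, run=None) -> None:
    """Toggle the IP source to static and back to dhcp on the local host.

    wait_seconds is a guess; tune it to however long the BMC needs.
    run is an optional callback taking an argv list (useful for dry runs).
    """
    if run is None:
        run = lambda argv: subprocess.run(argv, check=True)
    static_cmd, dhcp_cmd = toggle_commands(channel)
    run(static_cmd)
    time.sleep(wait_seconds)  # assumed pause while the static setting takes effect
    run(dhcp_cmd)
```

Passing `run=print` gives a dry run that only shows the commands, which may be useful before running it on a production host.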
We've completed the heatsink/memory upgrade of:

bld-linux64-ix-051.build.mtv1.mozilla.com
bld-linux64-ix-052.build.mtv1.mozilla.com
bld-linux64-ix-053.build.mtv1.mozilla.com
bld-linux64-ix-054.build.mtv1.mozilla.com
bld-linux64-ix-055.build.mtv1.mozilla.com
bld-linux64-ix-056.build.mtv1.mozilla.com
bld-linux64-ix-057.build.mtv1.mozilla.com
bld-linux64-ix-058.build.mtv1.mozilla.com
bld-linux64-ix-059.build.mtv1.mozilla.com
bld-linux64-ix-060.build.mtv1.mozilla.com
bld-linux64-ix-061.build.mtv1.mozilla.com
bld-linux64-ix-062.build.mtv1.mozilla.com
bld-linux64-ix-063.build.mtv1.mozilla.com
bld-linux64-ix-064.build.mtv1.mozilla.com
bld-linux64-ix-065.build.mtv1.mozilla.com
bld-linux64-ix-066.build.mtv1.mozilla.com
bld-linux64-ix-067.build.mtv1.mozilla.com
bld-linux64-ix-068.build.mtv1.mozilla.com
bld-linux64-ix-069.build.mtv1.mozilla.com
bld-linux64-ix-070.build.mtv1.mozilla.com
bld-linux64-ix-071.build.mtv1.mozilla.com
bld-linux64-ix-072.build.mtv1.mozilla.com
bld-linux64-ix-073.build.mtv1.mozilla.com

Foopy124 is still not cooperating; the green indicator light is still unresponsive. We troubleshot the equipment by resetting it, swapping the ethernet cable, and plugging it into another port. We also opened it up to check whether anything had come loose during the upgrade.
Okay, so all of the ones done last week and this week look good except:

foopy124 - I can log into the mgmt console, but when I power the machine on, it doesn't even get a display.  That means that either something needs to be reseated, or the magic smoke got out.

foopy122 - I can get to the running OS but can't get to the IPMI.  Maybe it just needs to have the power unplugged and plugged back in, or it may be a bad cable or switch port?
Blocks: 851579

Comment 16

4 years ago
foopy122 has bad IPMI, foopy124 had a bad DIMM and is back online.
The rest of these machines are now out of warranty and will be decommissioned when their current purpose is fulfilled.  They will not be moving to scl3.
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED

Updated

4 years ago
Assignee: server-ops → server-ops-dcops
It turns out we're not going to decommission the iX machines after all, so reopening this bug to get these machines hardware upgraded and moved to scl3:

mw32-ix-slave01.build.mtv1.mozilla.com
mw32-ix-slave02.build.mtv1.mozilla.com
mw32-ix-slave03.build.mtv1.mozilla.com
mw32-ix-slave04.build.mtv1.mozilla.com
mw32-ix-slave05.build.mtv1.mozilla.com
mw32-ix-slave06.build.mtv1.mozilla.com
mw32-ix-slave07.build.mtv1.mozilla.com
mw32-ix-slave08.build.mtv1.mozilla.com
mw32-ix-slave09.build.mtv1.mozilla.com
mw32-ix-slave10.build.mtv1.mozilla.com
mw32-ix-slave11.build.mtv1.mozilla.com
mw32-ix-slave12.build.mtv1.mozilla.com
win32-ix-ref.build.mtv1.mozilla.com
linux-ix-ref.build.mtv1.mozilla.com
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 947950
Blocks: 947951
The primary nics of the machines in comment 18 should be put on vlan 236 in scl3 when they are moved.  The management nics should be on vlan 216.

Comment 20

4 years ago
replaced heat sinks, seems we're short one heat sink for asset-03107, which we'll try to procure.

Comment 21

4 years ago
ServiceNow order submitted.  REQ0021153

Comment 22

4 years ago
RITM0022787

Updated

3 years ago
colo-trip: mtv1 → scl3
van: we have three more machines that need upgrades (bug 948997).  Are we lacking the parts for those?  If so, how much will it cost to buy new parts (it may not be worth it)?
Flags: needinfo?(vle)

Comment 24

3 years ago
:arr, we are short on heat sinks so we would have to order 3 more. they're $35 each before taxes/shipping.

http://www.heatsinkfactory.com/cooljag-den-7-cpu-cjg-36.html
Flags: needinfo?(vle)
Okay, let's go ahead and order the parts in expectation that we're going to do these last 3 machines soon after we wrap up the tegra move.

Comment 26

3 years ago
Order has been placed through servicenow.  RITM0022977
Whiteboard: [reit] → 3 heatsinks ordered RITM0022977
Added the bug for the last three machines to be upgraded and moved to scl3.  That should bring us to a total of 17 machines moving to scl3 and being repurposed as windows2008r2 builders.

We're still hashing out the hostnames for these, but please put all the primary nics on VLAN236 (winbuild) https://inventory.mozilla.org/en-US/core/vlan/139/ and the oob interfaces on VLAN216 (inband) https://inventory.mozilla.org/en-US/core/vlan/138/
Blocks: 948997
I've started a spreadsheet to track the move for these:  https://docs.google.com/a/mozilla.com/spreadsheet/ccc?key=0AhyKG0L2cstIdEMzd2RoMk1UaHg4Ty10Q1NHQXNzd0E

I noticed that there are TWO switch ports listed for the primary nic, and I presume that's either an error or we mistakenly cabled up the secondary nic on those as well.  If we did the latter, that's unnecessary, since we only use the primary and not secondary nic.  Please make sure that we're not wasting cable and switches there if so.  I put in switch1.r202-10.console.scl3.mozilla.net:ge-0/0/<port> in the spreadsheet to update inventory, so if that's incorrect, please let me know and update that column.

I've also added the last three systems so that you guys can fill in switch ports, rack, rack order, and oob switch.

I'm going to work with uberj to get all of the information updated in inventory based on this ss.
Inventory/DNS/DHCP has been updated for the 15 machines that are already in scl3.  dcops: can someone verify that they're on the correct vlans and power cycle them so that the oob interfaces are pingable?

Comment 30

3 years ago
uber: can you take a look at amy's spreadsheet? i have updated the location and switch info for the last 3 hosts. please also note that i have updated the name of the switch, giving it the FQDN and the correct name. 

arr: inband mgmt switch and host switch ports have been configured. let me know if you're having any issues.

vle@switch1.r202-10.ops.releng.scl3.mozilla.net# show 
member-range ge-0/0/23 to ge-0/0/36;
member-range ge-0/0/7 to ge-0/0/9;
unit 0 {
    family ethernet-switching {
        port-mode access;
        vlan {
            members releng-winbuild;
        }
    }
}
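
As a sanity check, the two member-range lines above cover 17 ports, which matches the 17 machines mentioned earlier in this bug. This small Python sketch counts them, assuming the ranges are inclusive and stay within one FPC/PIC; the helper name is hypothetical.

```python
def ports_in_range(first: str, last: str) -> list[str]:
    """Expand an inclusive range such as ge-0/0/23 to ge-0/0/36."""
    prefix_first, _, start = first.rpartition("/")
    prefix_last, _, end = last.rpartition("/")
    if prefix_first != prefix_last:
        raise ValueError("range must stay within one FPC/PIC")
    return [f"{prefix_first}/{n}" for n in range(int(start), int(end) + 1)]


# The two member-range statements from the switch output above.
ports = ports_in_range("ge-0/0/23", "ge-0/0/36") + ports_in_range("ge-0/0/7", "ge-0/0/9")
```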
:uberj: I also had to change the FQDN of the ipmi since it was missing the "releng" atom (so the SREG, A, PTR, and CNAMEs will need to be updated for that).
Depends on: 959888
:van: I am unable to reach the ipmi interfaces for 0002, 0011, and 0014.

:uberj: might you get a chance to update the information from the spreadsheet and add the CNAMEs today?
Flags: needinfo?(juber)
:van: also, 9 doesn't seem to see its disk, and 13 doesn't look like it's powering on.
I've updated info and created cnames. Just need to track down macs for https://inventory.mozilla.org/en-US/systems/show/1645/ https://inventory.mozilla.org/en-US/systems/show/1644/ and https://inventory.mozilla.org/en-US/systems/show/1643/

Comment 35

3 years ago
:arr, [0002,0011,0014] are back online. 0013 was hung during the reimage phase and I've rebooted it. 0009 has a bad disk that no longer spins up. Can we replace it with any spare drive (we have spare 1TBs), or does it have to match the specs of the other iX hosts?

we moved [0017-0019] and i've confirmed IPMI is reachable.

Comment 36

3 years ago
>we moved [0017-0019] and i've confirmed IPMI is reachable.

I meant we moved [0015-0017].
van: the drive for 0009 should match specs, please order and replace.
and I can't reach ipmi for 0015

Comment 38

3 years ago
#UJU-709-53233 opened for drive RMA/quote. IPMI fixed for 15 and 17.
Everything but 0009 is up now, thanks!
Flags: needinfo?(juber)

Comment 40

3 years ago
following up with iX regarding hard drive for 009.

Updated

3 years ago
Whiteboard: 3 heatsinks ordered RITM0022977 → #UJU-709-53233 - replacement drive for 009

Comment 41

3 years ago
Heat sinks have been upgraded on the hosts, and the hosts have been moved to SCL3. We're running into an issue with 0009, as it won't image after the drive swap. I have opened Bug 964535 for relops to take a look at "The Dirty Environment" error. Closing this bug as we have a different bug to track 0009.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago → 3 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations