Closed Bug 642305 Opened 13 years ago Closed 12 years ago

move RelEng machines from mtv1 to scl1

Categories

(Infrastructure & Operations :: RelOps: General, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: mlarrain)

References

Details

(Whiteboard: [slaveduty][hardware])

from irc with mrz, ravi:

To avoid possible recurrences of bug#636462, we're moving all ix machines in 3rd floor server room of 650castro to real homes in SCL1/SCL2. Filing this to track the work, while also cleaning out other related bugs.

Depending on how many we move at a time, we might be able to do this without needing a tree closure. Otherwise need to find a quiet time in the release schedules.
This is to avoid any number of problems that could arise from the fact that 650 Castro is not a high-availability site. As such, we will be moving *all* RelEng infra that does not have a compelling reason to stay in MV.

The machines that are staying in MV are the N900s, due to their dependency on Haxxor, and the Tegras, which we'd like to keep close until we know that remote power is a viable solution. Any support machines that need to be co-located with the N900s and Tegras should stay, everything else should go.

Let's use this bug to do that sorting.
Summary: move ix machines from 650castro to SCL1/SCL2 → move RelEng machines from 650castro to SCL1/SCL2
Whiteboard: [slaveduty]
As I failed to articulate in comment 1, I need a list of the machines that are to be left behind because they support N900s and Tegras. Everything else must go, but aside from the w32- and linux-ix-slaves, I don't have a solid list of 'everything else'.

Move this back to ServerOps:RelEng when that list is done.
Assignee: server-ops-releng → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Priority: -- → P3
Whiteboard: [slaveduty] → [slaveduty][hardware]
OS: Mac OS X → All
Hardware: x86 → All
for mobile:

production-mobile-master (vm)
staging-mobile-master (vm)
test-master01 (vm)

2x imaging netbooks, 1x imaging powerbook (in haxxor)

foopy01 - foopy11
bm-foopy (2u machine)
bm-remote-* (3 minis) + load balancer

n900-001 through 090
tegra-001 through 093

John Ford has a list for Geriatric master.
There are 2 dell P3 1u machines in 650 for geriatric master.
(In reply to comment #4)
> There are 2 dell P3 1u machines in 650 for geriatric master.

Also, geriatric-master.build.mozilla.org
Looks like the list is here, so moving back to ServerOps: Releng as requested.
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Assignee: server-ops-releng → zandr
Assignee: zandr → arich
per meeting with IT yesterday:

* low priority; also, all remaining ix machines in mtv are win32, so they will be the last to be touched
per meeting with IT yesterday:

Watching how wait times stabilize this week now that offline win32 machines are re-attached after network hiccup. We'll (releng+relops) try to figure out fewest number of batches to move, while still avoiding tree closures. Will revisit at the relops meeting next week.
per meeting with IT today:

* releng will start planning this week how best to handle the machine move in terms of acceptable downtimes and # batches. zandr would like to free up the racks ASAP. releng will try to have a proposal ready for next week's relops meeting.
From slavealloc, I see 40 win32 ix slaves that need to move:

mw32-ix-slave[01-25]
w32-ix-slave[05,06,09-21]

These machines are currently assorted as follows:

13 build slaves, 22 try slaves, 5 preprod/dev slaves

That represents about half of our ix build capacity on Windows (16 non-mtv ix machines), and two-thirds of our ix try capacity on Windows (7 non-mtv ix machines).

I propose we move these machines in 2 batches:

Batch #1: 6 build, 11 try, 5 preprod
Batch #2: 7 build, 11 try, 0 preprod

releng: does this make sense? We'd still need "soft" downtimes for both batches due to reduced capacity.

relops: do you want/need me to specify exactly which machines go in each batch?
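(As a sanity check, the batch split proposed above can be verified in a few lines of Python. The machine counts are taken from this comment; the script itself is purely illustrative.)

```python
# Slave counts from slavealloc, per the comment above.
pools = {"build": 13, "try": 22, "preprod": 5}

# Proposed split into two batches.
batches = [
    {"build": 6, "try": 11, "preprod": 5},  # Batch #1
    {"build": 7, "try": 11, "preprod": 0},  # Batch #2
]

# Every pool should be fully covered by the two batches.
for pool, total in pools.items():
    moved = sum(b[pool] for b in batches)
    assert moved == total, f"{pool}: {moved} != {total}"

print(sum(pools.values()))  # 40 machines in total
```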
I suspect you meant to put this in bug 672969 since you're talking about the upgrades.  For the list you mention, the w32-ix machines are moving, not the mw32-ix ones.  We also have some linux-ix machines that are destined for scl1, too.  It would probably be easiest to do all of the machines headed to scl1 in one batch if that's possible (not sure how things are split out amongst the w32-ix machines).  Here's the full list of what's going to scl1:

w32-ix-slave05.build.mtv1.mozilla.com
w32-ix-slave06.build.mtv1.mozilla.com
w32-ix-slave09.build.mtv1.mozilla.com
w32-ix-slave10.build.mtv1.mozilla.com
w32-ix-slave11.build.mtv1.mozilla.com
w32-ix-slave12.build.mtv1.mozilla.com
w32-ix-slave13.build.mtv1.mozilla.com
w32-ix-slave14.build.mtv1.mozilla.com
w32-ix-slave15.build.mtv1.mozilla.com
w32-ix-slave16.build.mtv1.mozilla.com
w32-ix-slave17.build.mtv1.mozilla.com
w32-ix-slave18.build.mtv1.mozilla.com
w32-ix-slave19.build.mtv1.mozilla.com
w32-ix-slave20.build.mtv1.mozilla.com
w32-ix-slave21.build.mtv1.mozilla.com
linux-ix-slave03.build.mtv1.mozilla.com
linux-ix-slave04.build.mtv1.mozilla.com
linux-ix-slave05.build.mtv1.mozilla.com
linux-ix-slave07.build.mtv1.mozilla.com
linux-ix-slave09.build.mtv1.mozilla.com
linux-ix-slave10.build.mtv1.mozilla.com
linux-ix-slave11.build.mtv1.mozilla.com
linux-ix-slave15.build.mtv1.mozilla.com

In addition to that the ref machines and the mv-moz2-linux-ix machines also need the upgrade, but we don't need to cover those in this bug since they aren't moving, either.
(In reply to Amy Rich [:arich] from comment #11)
> I suspect you meant to put this in bug 672969 since you're talking about the
> upgrades.  For the list you mention, the w32-ix machines are moving, not the
> mw32-ix ones.  We also have some linux-ix machines that are destined for
> scl1, too.  It would probably be easiest to do all of the machines headed to
> scl1 in one batch if that's possible (not sure how things are split out
> amongst the w32-ix machines).  Here's the full list of what's going to scl1:

I thought they were all going, but thanks for the actual list. I'll figure out how those machines are distributed and see if we can move them en masse.
(In reply to Amy Rich [:arich] [:arr] from comment #11)
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com

These Windows slaves are all reporting to the try server, so we'd lose about 30% of our current Windows try capacity (13 out of 44 machines). We'll need to do 2 batches here unless it's going to be a short outage, in which case I'd be fine moving them en masse.

> w32-ix-slave06.build.mtv1.mozilla.com
> w32-ix-slave12.build.mtv1.mozilla.com

These two machines went to SeaMonkey.

> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com

These represent a mix of preproduction, try, and build slaves. These can all go in one batch.
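(For reference, the 30% figure above is just 13 of the 44 Windows try slaves; a quick illustrative check:)

```python
# Windows try slaves moving vs. total, per the comment above.
moving, total = 13, 44
loss_pct = 100 * moving / total
print(f"{loss_pct:.1f}% of Windows try capacity offline during the move")  # ~29.5%
```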
(In reply to Chris Cooper [:coop] from comment #13) 
> These Windows slaves are all reporting to the try server, so we'd lose about
> 30% of our current Windows try capacity (13 out of 44 machines). We'll need
> to do 2 batches here unless it's going to be a short outage, then I would be
> fine moving them en masse.

In case it wasn't clear, we are in no rush on our end, but understand that you want to improve dc reliability ASAP. Feel free to schedule the "downtime" (read: reduced capacity) for these with me and/or buildduty as soon as you need to.
Per discussion in the relops meeting, we won't try to do this over the holidays, since the logistics would be pretty painful.  Instead, we'll do this in early January, being more conservative in the batching.

As a point of reference, let's do three batches of 7:

Batch I
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com

Batch II
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com

Batch III
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com

We'll do these sequentially, so Batch II starts after Batch I is back online in scl1.

Coop, does this seem reasonable?  If so, we'll start scheduling with iX in January.
Assignee: arich → mlarrain
coop, ping, now that you're back?
Summary: move RelEng machines from 650castro to SCL1/SCL2 → move RelEng machines from 650castro to SCL1/SCL3
Summary: move RelEng machines from 650castro to SCL1/SCL3 → move RelEng machines from mtv1 to scl1
Sent an email to lukas, who will be buildduty tomorrow, to verify that we can move 21 servers to SCL1 tomorrow. List of servers:

Batch I
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com

Batch II
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com

Batch III
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com
The spreadsheet has been updated with the proper rack locations for each machine.

All machines have DHCP, A, and PTR records in the scl1 datacenter for both the mgmt and public interfaces.

Nagios files are called {hosts.h,services.h,hostgroups}.ix (for the updates) and {hosts.h,services.h,hostgroups}.orig (for things as they are now, in case a rollback is required) on both admin1 and bm-admin01.  The hosts have dependencies on the switch listed in the first sheet of the spreadsheet.
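(A forward/reverse DNS consistency check along these lines can confirm that the new A and PTR records agree after a move. This is an illustrative sketch only: the IP address and lookup tables below are made up, and the stub resolvers stand in for real lookups via socket.gethostbyname / socket.gethostbyaddr.)

```python
import socket

def dns_consistent(hostname,
                   resolve=socket.gethostbyname,
                   reverse=lambda ip: socket.gethostbyaddr(ip)[0]):
    """Return True if the A record and the PTR record agree for hostname."""
    ip = resolve(hostname)
    return reverse(ip) == hostname

# Offline demo with stub lookup tables (hypothetical IP, not the
# machine's real address):
fake_a = {"w32-ix-slave05.build.scl1.mozilla.com": "10.12.48.25"}
fake_ptr = {"10.12.48.25": "w32-ix-slave05.build.scl1.mozilla.com"}

print(dns_consistent("w32-ix-slave05.build.scl1.mozilla.com",
                     resolve=fake_a.__getitem__,
                     reverse=fake_ptr.__getitem__))  # True
```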
These machines were all successfully moved today.  Inventory needs to be updated to reflect their new location, but they are online and operational.
The list of machines have also had their data center updated in slavealloc.
All of the linux boxes are still alerting that buildbot is not running (presumably they haven't been added back into the pool yet). I have downtimed them until 1/18.
Making Inventory changes today.
Status: NEW → ASSIGNED
I've fixed up puppet on the linux boxes from comment #17. They already had /etc/sysconfig/puppet pointed at scl-production-puppet.build.scl1.mozilla.com, but needed another 'puppet --clean' there.

Some of the windows boxes have been doing jobs, so I think OPSI has coped fine.
Thanks nthomas, the linux slaves are doing jobs now too - we can call this FIXED if everything on the ServerOps side is done too.
All machines have been updated in inventory.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations