Closed Bug 642305 Opened 13 years ago Closed 12 years ago

move RelEng machines from mtv1 to scl1

Categories

(Infrastructure & Operations :: RelOps: General, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: mlarrain)

References

Details

(Whiteboard: [slaveduty][hardware])

from irc with mrz, ravi:

To avoid possible recurrences of bug#636462, we're moving all ix machines in 3rd floor server room of 650castro to real homes in SCL1/SCL2. Filing this to track the work, while also cleaning out other related bugs.

Depending on how many we move at a time, we might be able to do this without needing a tree closure. Otherwise need to find a quiet time in the release schedules.
This is to avoid any number of problems that could arise from the fact that 650 Castro is not a high-availability site. As such, we will be moving *all* RelEng infra that does not have a compelling reason to stay in MV.

The machines that are staying in MV are the N900s, due to their dependency on Haxxor, and the Tegras, which we'd like to keep close until we know that remote power is a viable solution. Any support machines that need to be co-located with the N900s and Tegras should stay, everything else should go.

Let's use this bug to do that sorting.
Summary: move ix machines from 650castro to SCL1/SCL2 → move RelEng machines from 650castro to SCL1/SCL2
Whiteboard: [slaveduty]
As I failed to articulate in comment 1, I need a list of the machines that are to be left behind because they support N900s and Tegras. Everything else must go, but aside from the w32- and linux-ix-slaves, I don't have a solid list of 'everything else'.

Move this back to ServerOps:RelEng when that list is done.
Assignee: server-ops-releng → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Priority: -- → P3
Whiteboard: [slaveduty] → [slaveduty][hardware]
OS: Mac OS X → All
Hardware: x86 → All
for mobile:

production-mobile-master (vm)
staging-mobile-master (vm)
test-master01 (vm)

2x imaging netbooks, 1x imaging powerbook (in haxxor)

foopy01 - foopy11
bm-foopy (2u machine)
bm-remote-* (3 minis) + load balancer

n900-001 through 090
tegra-001 through 093

John Ford has a list for Geriatric master.
There are 2 dell P3 1u machines in 650 for geriatric master.
(In reply to comment #4)
> There are 2 dell P3 1u machines in 650 for geriatric master.

Also, geriatric-master.build.mozilla.org
Looks like the list is here, so moving back to ServerOps: Releng as requested.
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Assignee: server-ops-releng → zandr
Assignee: zandr → arich
per meeting with IT yesterday:

* low priority; also, all remaining ix machines in mtv are win32, so they will be the last to be touched
per meeting with IT yesterday:

Watching how wait times stabilize this week now that offline win32 machines are re-attached after network hiccup. We'll (releng+relops) try to figure out fewest number of batches to move, while still avoiding tree closures. Will revisit at the relops meeting next week.
per meeting with IT today:

* releng will start planning this week how best to handle the machine move in terms of acceptable downtimes and # batches. zandr would like to free up the racks ASAP. releng will try to have a proposal ready for next week's relops meeting.
From slavealloc, I see 40 win32 ix slaves that need to move:

mw32-ix-slave[01-25]
w32-ix-slave[05,06,09-21]

These machines are currently assorted as follows:

13 build slaves, 22 try slaves, 5 preprod/dev slaves

That represents about half of our ix build capacity on Windows (16 non-mtv ix machines), and two-thirds of our ix try capacity on Windows (7 non-mtv ix machines).

I propose we move these machines in 2 batches:

Batch #1: 6 build, 11 try, 5 preprod
Batch #2: 7 build, 11 try, 0 preprod

releng: does this make sense? We'd still need "soft" downtimes for both batches due to reduced capacity.

relops: do you want/need me to specify exactly which machines go in each batch?
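(As a sanity check, the batch split proposed above can be verified in a few lines of Python. The machine counts are taken from this comment; the script itself is purely illustrative.)

```python
# Slave counts from slavealloc, per the comment above.
pools = {"build": 13, "try": 22, "preprod": 5}

# Proposed split into two batches.
batches = [
    {"build": 6, "try": 11, "preprod": 5},  # Batch #1
    {"build": 7, "try": 11, "preprod": 0},  # Batch #2
]

# Every pool should be fully covered by the two batches.
for pool, total in pools.items():
    moved = sum(b[pool] for b in batches)
    assert moved == total, f"{pool}: {moved} != {total}"

print(sum(pools.values()))  # 40 machines in total
```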
I suspect you meant to put this in bug 672969 since you're talking about the upgrades.  For the list you mention, the w32-ix machines are moving, not the mw32-ix ones.  We also have some linux-ix machines that are destined for scl1, too.  It would probably be easiest to do all of the machines headed to scl1 in one batch if that's possible (not sure how things are split out amongst the w32-ix machines).  Here's the full list of what's going to scl1:

w32-ix-slave05.build.mtv1.mozilla.com
w32-ix-slave06.build.mtv1.mozilla.com
w32-ix-slave09.build.mtv1.mozilla.com
w32-ix-slave10.build.mtv1.mozilla.com
w32-ix-slave11.build.mtv1.mozilla.com
w32-ix-slave12.build.mtv1.mozilla.com
w32-ix-slave13.build.mtv1.mozilla.com
w32-ix-slave14.build.mtv1.mozilla.com
w32-ix-slave15.build.mtv1.mozilla.com
w32-ix-slave16.build.mtv1.mozilla.com
w32-ix-slave17.build.mtv1.mozilla.com
w32-ix-slave18.build.mtv1.mozilla.com
w32-ix-slave19.build.mtv1.mozilla.com
w32-ix-slave20.build.mtv1.mozilla.com
w32-ix-slave21.build.mtv1.mozilla.com
linux-ix-slave03.build.mtv1.mozilla.com
linux-ix-slave04.build.mtv1.mozilla.com
linux-ix-slave05.build.mtv1.mozilla.com
linux-ix-slave07.build.mtv1.mozilla.com
linux-ix-slave09.build.mtv1.mozilla.com
linux-ix-slave10.build.mtv1.mozilla.com
linux-ix-slave11.build.mtv1.mozilla.com
linux-ix-slave15.build.mtv1.mozilla.com

In addition to that the ref machines and the mv-moz2-linux-ix machines also need the upgrade, but we don't need to cover those in this bug since they aren't moving, either.
(In reply to Amy Rich [:arich] from comment #11)
> I suspect you meant to put this in bug 672969 since you're talking about the
> upgrades.  For the list you mention, the w32-ix machines are moving, not the
> mw32-ix ones.  We also have some linux-ix machines that are destined for
> scl1, too.  It would probably be easiest to do all of the machines headed to
> scl1 in one batch if that's possible (not sure how things are split out
> amongst the w32-ix machines).  Here's the full list of what's going to scl1:

I thought they were all going, but thanks for the actual list. I'll figure out how those machines are distributed and see if we can move them en masse.
(In reply to Amy Rich [:arich] [:arr] from comment #11)
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com

These Windows slaves are all reporting to the try server, so we'd lose about 30% of our current Windows try capacity (13 out of 44 machines). We'll need to do 2 batches here unless it's going to be a short outage, in which case I'd be fine moving them en masse.

> w32-ix-slave06.build.mtv1.mozilla.com
> w32-ix-slave12.build.mtv1.mozilla.com

These two machines went to SeaMonkey.

> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com

These represent a mix of preproduction, try, and build slaves. These can all go in one batch.
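(For reference, the 30% figure above is just 13 of the 44 Windows try slaves; a quick illustrative check:)

```python
# Windows try slaves moving vs. total, per the comment above.
moving, total = 13, 44
loss_pct = 100 * moving / total
print(f"{loss_pct:.1f}% of Windows try capacity offline during the move")  # ~29.5%
```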
(In reply to Chris Cooper [:coop] from comment #13) 
> These Windows slaves are all reporting to the try server, so we'd lose about
> 30% of our current Windows try capacity (13 out of 44 machines). We'll need
> to do 2 batches here unless it's going to be a short outage, then I would be
> fine moving them en masse.

In case it wasn't clear, we are in no rush on our end, but understand that you want to improve dc reliability ASAP. Feel free to schedule the "downtime" (read: reduced capacity) for these with me and/or buildduty as soon as you need to.
Per discussion in the relops meeting, we won't try to do this over the holidays, since the logistics would be pretty painful.  Instead, we'll do this in early January, being more conservative in the batching.

As a point of reference, let's do three batches of 7:

Batch I
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com

Batch II
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com

Batch III
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com

We'll do these sequentially, so Batch II starts after Batch I is back online in scl1.

Coop, does this seem reasonable?  If so, we'll start scheduling with iX in January.
Assignee: arich → mlarrain
coop, ping, now that you're back?
Summary: move RelEng machines from 650castro to SCL1/SCL2 → move RelEng machines from 650castro to SCL1/SCL3
Summary: move RelEng machines from 650castro to SCL1/SCL3 → move RelEng machines from mtv1 to scl1
Sent an email to lukas, who will be buildduty tomorrow, to verify that we can move 21 servers to SCL1 tomorrow. List of servers:

Batch I
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com

Batch II
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com

Batch III
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com
The spreadsheet has been updated with the proper rack locations for each machine.

All machines have DHCP, A, and PTR records in the scl1 datacenter for both the mgmt and public interfaces.

Nagios files are called {hosts.h,services.h,hostgroups}.ix (for the updates) and {hosts.h,services.h,hostgroups}.orig (for things as they are now, in case a rollback is required) on both admin1 and bm-admin01.  The hosts have dependencies on the switch listed in the first sheet of the spreadsheet.
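(A forward/reverse DNS consistency check along these lines can confirm that the new A and PTR records agree after a move. This is an illustrative sketch only: the IP address and lookup tables below are made up, and the stub resolvers stand in for real lookups via socket.gethostbyname / socket.gethostbyaddr.)

```python
import socket

def dns_consistent(hostname,
                   resolve=socket.gethostbyname,
                   reverse=lambda ip: socket.gethostbyaddr(ip)[0]):
    """Return True if the A record and the PTR record agree for hostname."""
    ip = resolve(hostname)
    return reverse(ip) == hostname

# Offline demo with stub lookup tables (hypothetical IP, not the
# machine's real address):
fake_a = {"w32-ix-slave05.build.scl1.mozilla.com": "10.12.48.25"}
fake_ptr = {"10.12.48.25": "w32-ix-slave05.build.scl1.mozilla.com"}

print(dns_consistent("w32-ix-slave05.build.scl1.mozilla.com",
                     resolve=fake_a.__getitem__,
                     reverse=fake_ptr.__getitem__))  # True
```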
These machines were all successfully moved today.  Inventory needs to be updated to reflect their new location, but they are online and operational.
The list of machines have also had their data center updated in slavealloc.
All of the linux boxes are still alerting that buildbot is not running (presumably they haven't been added back into the pool yet). I have downtimed them until 1/18.
Making Inventory changes today.
Status: NEW → ASSIGNED
I've fixed up puppet on the linux boxes from comment #17. They already had /etc/sysconfig/puppet pointed at scl-production-puppet.build.scl1.mozilla.com, but needed another 'puppet --clean' there.

Some of the windows boxes have been doing jobs, so I think OPSI has coped fine.
Thanks nthomas, the linux slaves are doing jobs now too - we can call this FIXED if everything on the ServerOps side is done too.
All machines have been updated in inventory.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations