Closed
Bug 642305
Opened 13 years ago
Closed 12 years ago
move RelEng machines from mtv1 to scl1
Categories
(Infrastructure & Operations :: RelOps: General, task, P3)
Infrastructure & Operations
RelOps: General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: joduinn, Assigned: mlarrain)
References
Details
(Whiteboard: [slaveduty][hardware])
from irc with mrz, ravi: To avoid possible recurrences of bug#636462, we're moving all ix machines in 3rd floor server room of 650castro to real homes in SCL1/SCL2. Filing this to track the work, while also cleaning out other related bugs. Depending on how many we move at a time, we might be able to do this without needing a tree closure. Otherwise need to find a quiet time in the release schedules.
Comment 1•13 years ago
This is to avoid any number of problems that could arise from the fact that 650 Castro is not a high availability site. As such, we will be moving *all* RelEng infra that does not have a compelling reason to stay in MV. The machines that are staying in MV are the N900s, due to their dependency on Haxxor, and the Tegras, which we'd like to keep close until we know that remote power is a viable solution. Any support machines that need to be co-located with the N900s and Tegras should stay, everything else should go. Let's use this bug to do that sorting.
Summary: move ix machines from 650castro to SCL1/SCL2 → move RelEng machines from 650castro to SCL1/SCL2
Updated•13 years ago
Whiteboard: [slaveduty]
Comment 2•13 years ago
As I failed to articulate in comment 1, I need a list of the machines that are to be left behind because they support N900s and Tegras. Everything else must go, but aside from the w32- and linux-ix-slaves, I don't have a solid list of 'everything else'. Move this back to ServerOps:RelEng when that list is done.
Assignee: server-ops-releng → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Updated•13 years ago
Priority: -- → P3
Whiteboard: [slaveduty] → [slaveduty][hardware]
Updated•13 years ago
OS: Mac OS X → All
Hardware: x86 → All
Comment 3•13 years ago
for mobile:
* production-mobile-master (vm)
* staging-mobile-master (vm)
* test-master01 (vm)
* 2x imaging netbooks, 1x imaging powerbook (in haxxor)
* foopy01 - foopy11
* bm-foopy (2u machine)
* bm-remote-* (3 minis) + load balancer
* n900-001 through 090
* tegra-001 through 093

John Ford has a list for geriatric master.
Comment 4•13 years ago
There are 2 dell P3 1u machines in 650 for geriatric master.
Comment 5•13 years ago
(In reply to comment #4)
> There are 2 dell P3 1u machines in 650 for geriatric master.

Also, geriatric-master.build.mozilla.org
Comment 6•13 years ago
Looks like the list is complete, so moving this back to ServerOps: RelEng as requested.
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Updated•13 years ago
Assignee: server-ops-releng → zandr
Updated•13 years ago
Assignee: zandr → arich
Reporter
Comment 7•13 years ago
per meeting with IT yesterday:
* low priority; all remaining ix machines in mtv are win32, so they will be the last to be touched
Comment 8•13 years ago
per meeting with IT yesterday: watching how wait times stabilize this week now that the offline win32 machines are re-attached after a network hiccup. We'll (releng+relops) try to figure out the fewest number of batches to move, while still avoiding tree closures. Will revisit at the relops meeting next week.
Comment 9•13 years ago
per meeting with IT today:
* releng will start planning this week how best to handle the machine move in terms of acceptable downtimes and # of batches. zandr would like to free up the racks ASAP. releng will try to have a proposal ready for next week's relops meeting.
Comment 10•13 years ago
From slavealloc, I see 40 win32 ix slaves that need to move:

mw32-ix-slave[01-25]
w32-ix-slave[05,06,09-21]

These machines are currently assorted as follows: 13 build slaves, 22 try slaves, 5 preprod/dev slaves.

That represents about half of our ix build capacity on Windows (16 non-mtv ix machines), and two-thirds of our ix try capacity on Windows (7 non-mtv ix machines).

I propose we move these machines in 2 batches:

Batch #1: 6 build, 11 try, 5 preprod
Batch #2: 7 build, 11 try, 0 preprod

releng: does this make sense? We'd still need "soft" downtimes for both batches due to reduced capacity.
relops: do you want/need me to specify exactly which machines go in each batch?
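The proposed two-batch split can be sanity-checked mechanically. This is just a sketch; the role totals and per-batch counts are taken from the comment above:

```python
# Sanity-check the proposed two-batch split from the comment above.
# Role totals: 13 build, 22 try, 5 preprod/dev slaves (40 machines total).
totals = {"build": 13, "try": 22, "preprod": 5}

batches = [
    {"build": 6, "try": 11, "preprod": 5},  # Batch #1
    {"build": 7, "try": 11, "preprod": 0},  # Batch #2
]

# Every role should be fully covered across the batches, no more, no less.
for role, count in totals.items():
    moved = sum(b[role] for b in batches)
    assert moved == count, f"{role}: {moved} != {count}"

print(sum(sum(b.values()) for b in batches))  # total machines moved
```

The check confirms the two batches account for all 40 machines with no role over- or under-counted.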
Comment 11•13 years ago
I suspect you meant to put this in bug 672969 since you're talking about the upgrades. For the list you mention, the w32-ix machines are moving, not the mw32-ix ones. We also have some linux-ix machines that are destined for scl1, too. It would probably be easiest to do all of the machines headed to scl1 in one batch if that's possible (not sure how things are split out amongst the w32-ix machines). Here's the full list of what's going to scl1:

w32-ix-slave05.build.mtv1.mozilla.com
w32-ix-slave06.build.mtv1.mozilla.com
w32-ix-slave09.build.mtv1.mozilla.com
w32-ix-slave10.build.mtv1.mozilla.com
w32-ix-slave11.build.mtv1.mozilla.com
w32-ix-slave12.build.mtv1.mozilla.com
w32-ix-slave13.build.mtv1.mozilla.com
w32-ix-slave14.build.mtv1.mozilla.com
w32-ix-slave15.build.mtv1.mozilla.com
w32-ix-slave16.build.mtv1.mozilla.com
w32-ix-slave17.build.mtv1.mozilla.com
w32-ix-slave18.build.mtv1.mozilla.com
w32-ix-slave19.build.mtv1.mozilla.com
w32-ix-slave20.build.mtv1.mozilla.com
w32-ix-slave21.build.mtv1.mozilla.com
linux-ix-slave03.build.mtv1.mozilla.com
linux-ix-slave04.build.mtv1.mozilla.com
linux-ix-slave05.build.mtv1.mozilla.com
linux-ix-slave07.build.mtv1.mozilla.com
linux-ix-slave09.build.mtv1.mozilla.com
linux-ix-slave10.build.mtv1.mozilla.com
linux-ix-slave11.build.mtv1.mozilla.com
linux-ix-slave15.build.mtv1.mozilla.com

In addition to that, the ref machines and the mv-moz2-linux-ix machines also need the upgrade, but we don't need to cover those in this bug since they aren't moving, either.
Comment 12•13 years ago
(In reply to Amy Rich [:arich] from comment #11)
> I suspect you meant to put this in bug 672969 since you're talking about the
> upgrades. For the list you mention, the w32-ix machines are moving, not the
> mw32-ix ones. We also have some linux-ix machines that are destined for
> scl1, too. It would probably be easiest to do all of the machines headed to
> scl1 in one batch if that's possible (not sure how things are split out
> amongst the w32-ix machines). Here's the full list of what's going to scl1:

I thought they were all going, but thanks for the actual list. I'll figure out how those machines are distributed and see if we can move them en masse.
Comment 13•13 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #11)
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com

These Windows slaves are all reporting to the try server, so we'd lose about 30% of our current Windows try capacity (13 out of 44 machines). We'll need to do 2 batches here unless it's going to be a short outage, in which case I'd be fine moving them en masse.

> w32-ix-slave06.build.mtv1.mozilla.com
> w32-ix-slave12.build.mtv1.mozilla.com

These two machines went to SeaMonkey.

> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com

These represent a mix of preproduction, try, and build slaves. These can all go in one batch.
Comment 14•13 years ago
(In reply to Chris Cooper [:coop] from comment #13)
> These Windows slaves are all reporting to the try server, so we'd lose about
> 30% of our current Windows try capacity (13 out of 44 machines). We'll need
> to do 2 batches here unless it's going to be a short outage, then I would be
> fine moving them en masse.

In case it wasn't clear, we are in no rush on our end, but understand that you want to improve dc reliability ASAP. Feel free to schedule the "downtime" (read: reduced capacity) for these with myself and/or buildduty as soon as you need to.
Comment 15•13 years ago
Per discussion in the relops meeting, we won't try to do this over the holidays, since the logistics would be pretty painful. Instead, we'll do this in early January, being more conservative in the batching. As a point of reference, let's do three batches of 7:

Batch I
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com

Batch II
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com

Batch III
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com

We'll do these sequentially, so Batch II starts after Batch I is back online in scl1. Coop, does this seem reasonable? If so, we'll start scheduling with iX in January.
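For reference, the batch composition above can be checked mechanically. This is a sketch using shortened host names (the full FQDNs are as listed above); it verifies each batch has 7 hosts and no host appears twice:

```python
# Sanity-check the three proposed batches: 7 hosts each, no host repeated.
batch_1 = [f"w32-ix-slave{n:02d}" for n in (5, 9, 10, 11)] + \
          [f"linux-ix-slave{n:02d}" for n in (3, 4, 5)]
batch_2 = [f"w32-ix-slave{n}" for n in (13, 14, 15, 16, 17)] + \
          [f"linux-ix-slave{n:02d}" for n in (7, 9)]
batch_3 = [f"w32-ix-slave{n}" for n in (18, 19, 20, 21)] + \
          [f"linux-ix-slave{n}" for n in (10, 11, 15)]

all_hosts = batch_1 + batch_2 + batch_3
assert all(len(b) == 7 for b in (batch_1, batch_2, batch_3))
assert len(set(all_hosts)) == len(all_hosts) == 21  # no duplicates
print(len(all_hosts))
```

All 21 scl1-bound hosts are covered exactly once across the three batches.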
Updated•13 years ago
Assignee: arich → mlarrain
Comment 16•13 years ago
coop, ping, now that you're back?
Assignee
Updated•13 years ago
Summary: move RelEng machines from 650castro to SCL1/SCL2 → move RelEng machines from 650castro to SCL1/SCL3
Updated•13 years ago
Summary: move RelEng machines from 650castro to SCL1/SCL3 → move RelEng machines from mtv1 to scl1
Assignee
Comment 17•13 years ago
sent an email to lukas, who will be buildduty tomorrow, to verify that we can move 21 servers to SCL1 tomorrow. List of servers:

Batch I
> w32-ix-slave05.build.mtv1.mozilla.com
> w32-ix-slave09.build.mtv1.mozilla.com
> w32-ix-slave10.build.mtv1.mozilla.com
> w32-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave03.build.mtv1.mozilla.com
> linux-ix-slave04.build.mtv1.mozilla.com
> linux-ix-slave05.build.mtv1.mozilla.com

Batch II
> w32-ix-slave13.build.mtv1.mozilla.com
> w32-ix-slave14.build.mtv1.mozilla.com
> w32-ix-slave15.build.mtv1.mozilla.com
> w32-ix-slave16.build.mtv1.mozilla.com
> w32-ix-slave17.build.mtv1.mozilla.com
> linux-ix-slave07.build.mtv1.mozilla.com
> linux-ix-slave09.build.mtv1.mozilla.com

Batch III
> w32-ix-slave18.build.mtv1.mozilla.com
> w32-ix-slave19.build.mtv1.mozilla.com
> w32-ix-slave20.build.mtv1.mozilla.com
> w32-ix-slave21.build.mtv1.mozilla.com
> linux-ix-slave10.build.mtv1.mozilla.com
> linux-ix-slave11.build.mtv1.mozilla.com
> linux-ix-slave15.build.mtv1.mozilla.com
Comment 18•13 years ago
The spreadsheet has been updated with the proper rack locations for each machine. All machines have DHCP, A, and PTR records in the scl1 datacenter for both the mgmt and public interfaces. Nagios files are called {hosts.h,services.h,hostgroups}.ix (for the updates) and {hosts.h,services.h,hostgroups}.orig (for things as they are now, in case a rollback is required) on both admin1 and bm-admin01. The hosts have dependencies on the switch listed in the first sheet of the spreadsheet.
Comment 19•13 years ago
These machines were all successfully moved today. Inventory needs to be updated to reflect their new location, but they are online and operational.
Comment 20•13 years ago
The listed machines have also had their data center updated in slavealloc.
Comment 21•12 years ago
All of the linux boxes are still alerting that buildbot is not running (presumably they haven't been added back into the pool yet). I have downtimed them until 1/18.
Comment 23•12 years ago
I've fixed up puppet on the linux boxes from comment #17. They already had /etc/sysconfig/puppet pointed at scl-production-puppet.build.scl1.mozilla.com, but needed another 'puppet --clean' there. Some of the windows boxes have been doing jobs, so I think OPSI has coped fine.
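For context, /etc/sysconfig/puppet is the Red Hat-style sysconfig file read by the puppet init script. Pointing a slave at the scl1 master would look roughly like this (a sketch: only the server hostname is taken from the comment above; the variable name follows the stock sysconfig template of that era and is an assumption):

```shell
# /etc/sysconfig/puppet -- sketch of the relevant setting only.
# PUPPET_SERVER overrides the master the puppet agent talks to
# when started via the init script.
PUPPET_SERVER=scl-production-puppet.build.scl1.mozilla.com
```

After a move between datacenters, the client's SSL certificates typically also need to be regenerated against the new master, which is what the 'puppet --clean' mentioned above appears to address.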
Comment 24•12 years ago
Thanks nthomas, the linux slaves are doing jobs now too - we can call this FIXED if everything on the ServerOps side is done too.
Assignee
Comment 25•12 years ago
All machines have been updated in inventory.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations