Closed Bug 743828 Opened 13 years ago Closed 13 years ago

migrate opsi hosts to scl3

Categories

(Infrastructure & Operations :: Virtualization, task)

x86
macOS
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: mburns)

References

Details

The opsi VMs need to migrate from sjc1 to srv.releng.scl3.mozilla.com. We'll need:
* new IPs
* appropriate flows
* releng to add/fix references to opsi's fqdn
* the virtualization team to do the actual migration
* likely a tree closure or downtime in which to perform this work.
production-opsi.srv.releng.scl3.mozilla.com 10.26.48.38
staging-opsi.srv.releng.scl3.mozilla.com 10.26.48.39
Assignee: server-ops-releng → dustin
Heh, google tells me OPSI stands for "Overwhelming post-splenectomy infection". Sounds about right.

Windows systems use a registry key to tell them which master to hit:
https://wiki.mozilla.org/ReleaseEngineering/OPSI#Check_which_master
and it uses an IP. I imagine the easiest way to edit that will be via REG.EXE and cssh?

I think we can manage this without a tree closure as follows, assuming the vmware crew can wave some magic wands:
* snapshot or copy the existing hosts locally to sjc1 with limited downtime (a downtime for these hosts of, say, 30 minutes, will not significantly impact builds, as hosts will wait for them to come back up)
* bring them back up
* send the images to scl3 and set up the new machines at a leisurely pace
* bring up the new VMs
* start redirecting hosts to the new VMs

Dan, does that sound doable? When could we schedule this?

I don't think any new flows will be required, as the releng BU is currently any/any within itself and to the other build nets, right?
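For reference, the REG.EXE edit suggested above could be scripted along these lines. This is only a sketch: the key path and value name below are hypothetical placeholders (the real ones are on the wiki page linked in the comment and are not reproduced in this bug), and the IP is the new production-opsi address listed earlier. It only builds the command lines and does not run anything on a host.

```python
from typing import List

# ASSUMPTION: placeholder key path and value name, for illustration only.
# The real registry location is documented on the ReleaseEngineering/OPSI wiki page.
OPSI_KEY = r"HKLM\SOFTWARE\EXAMPLE\opsi-placeholder"
OPSI_VALUE = "master_ip"

# New production-opsi address in scl3, as listed in this bug.
NEW_MASTER_IP = "10.26.48.38"


def build_reg_add(key: str, value: str, data: str) -> List[str]:
    """Return the REG.EXE argument list that sets one string (REG_SZ) value.

    /f overwrites an existing value without prompting, which matters when the
    command is pushed to many hosts at once.
    """
    return ["reg", "add", key, "/v", value, "/t", "REG_SZ", "/d", data, "/f"]


def build_reg_query(key: str, value: str) -> List[str]:
    """Return the REG.EXE argument list that reads the value back for checking."""
    return ["reg", "query", key, "/v", value]
```

Joining the result of `build_reg_add(OPSI_KEY, OPSI_VALUE, NEW_MASTER_IP)` with spaces gives a single line that could be pasted into a cssh session, followed by the matching `reg query` to confirm the change on each host.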
I don't know what this means: "snapshot or copy the existing hosts locally to sjc1 with limited downtime"
Dustin: that should be correct about any/any for the releng flows.
Per Dan, we can clone the hosts inside of a 30-minute downtime. We could do this later in the day tomorrow (Thursday), with coordination from Buildduty. No tree-closure is required, as we can do this with, at worst, a slight increase in wait time (as hosts stall at the trying-to-run-OPSI stage). Dan, what does your schedule look like? OK, that's the wrong question - I'm pretty sure it looks like two hours of work for each hour from now until eternity. When should we plan to do this, and who should I look for to work the controls in vCenter?
Assignee: dustin → server-ops
Component: Server Operations: RelEng → Server Operations: Virtualization
QA Contact: arich → dparsons
Oh yeah, and this is on the OMG EVACUATE THIS WEEK list :(
Severity: normal → critical
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> Oh yeah, and this is on the OMG EVACUATE THIS WEEK list :(

Incorrect - the chassis this is in is *not* moving on Monday, so there's time.
Severity: critical → major
Assignee: server-ops → phong
What do we need to do here? If you know what is needed, one of the SREs can help with the migration.
Assignee: phong → dparsons
Phong: We need to pick a time to take these sjc1 hosts down, clone them, then bring them back up, and start the migration of the clones to scl3.
Phong - I will help coordinate with RelEng for any downtime required to make this happen. Please let me know what timeframe you are looking at for this.
The plan outlined by Dustin in comment #9 works for us - we just need to have the down, clone, restart old, bring up new steps done sooner rather than later so I can start the process of migrating the slaves to point to the new instance.
Assignee: dparsons → mburns
* staging-opsi downed
* cloned (as staging-opsi-NEW)
* started the old staging-opsi
* created https://inventory.mozilla.org/en-US/systems/show/6030/

Just need to migrate the clone to SCL3's ESX cluster and bring it up.
do we have the new staging opsi spun up?
is the staging opsi vm running?
There's a new flow required to do the migration - that request is pending. I don't recall the bug #.
bug 746858 is tracking this unexpected missing flow.
staging-opsi.srv.releng.scl3.mozilla.com is migrated and online.
mburns: please coordinate with rail (buildduty) tomorrow to shut down, clone, and migrate the production opsi vm. The downtime starts at 9:00 pacific, but rail will give the all clear in #infra when work can commence.
looking forward to it.
Blocks: 748814
[09:55:55] <mburns> bear: rail-buildduty: 64 bytes from production-opsi.srv.releng.scl3.mozilla.com (10.26.48.38): icmp_seq=1 ttl=61 time=4.90 ms :)
[09:56:05] <bear> \o/
[09:56:12] <rail-buildduty> whooo
[09:57:05] <bear> I can reach it and it looks like an opsi server
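The ping check in the log above could also be done as a quick name-resolution check against the addresses assigned earlier in this bug. A minimal sketch, assuming only that the two hostnames should resolve to the listed IPs; the `resolve` parameter is an illustrative hook so the function can be exercised without touching DNS.

```python
import socket

# Hostname to IP assignments as listed earlier in this bug.
EXPECTED = {
    "production-opsi.srv.releng.scl3.mozilla.com": "10.26.48.38",
    "staging-opsi.srv.releng.scl3.mozilla.com": "10.26.48.39",
}


def check_hosts(expected, resolve=socket.gethostbyname):
    """Return {hostname: problem} for each host that does not resolve as expected.

    An empty dict means every hostname resolved to its assigned address.
    """
    problems = {}
    for host, ip in expected.items():
        try:
            actual = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems[host] = "lookup failed: {}".format(exc)
            continue
        if actual != ip:
            problems[host] = "resolved to {}, expected {}".format(actual, ip)
    return problems
```

Calling `check_hosts(EXPECTED)` from a machine inside the releng network would use real DNS. It only confirms name resolution, not that the OPSI service itself is answering.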
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations