743828 - migrate opsi hosts to scl3

Reporter

Description

•

13 years ago

The opsi vms need to migrate from sjc1 to srv.releng.scl3.mozilla.com. We'll need:
* new IPs
* appropriate flows
* releng to add/fix references to opsi's fqdn
* the virtualization team to do the actual migration
* likely a tree closure or downtime in which to perform this work.

Amy Rich [:arr] [:arich]

Reporter

Comment 1

•

13 years ago

production-opsi.srv.releng.scl3.mozilla.com 10.26.48.38
staging-opsi.srv.releng.scl3.mozilla.com 10.26.48.39

Amy Rich [:arr] [:arich]

Reporter

Updated

•

13 years ago

Assignee: server-ops-releng → dustin

Dustin J. Mitchell [:dustin] (he/him)

Comment 2

•

13 years ago

Heh, google tells me OPSI stands for "Overwhelming post-splenectomy infection". Sounds about right.

Windows systems use a registry key to tell them which master to hit: https://wiki.mozilla.org/ReleaseEngineering/OPSI#Check_which_master and it uses an IP. I imagine the easiest way to edit that will be via REG.EXE and cssh?

I think we can manage this without a tree closure as follows, assuming the vmware crew can wave some magic wands:
* snapshot or copy the existing hosts locally to sjc1 with limited downtime (a downtime for these hosts of, say, 30 minutes, will not significantly impact builds, as hosts will wait for them to come back up)
* bring them back up
* send the images to scl3 and set up the new machines at a leisurely pace
* bring up the new VMs
* start redirecting hosts to the new VMs

Dan, does that sound doable? When could we schedule this?

I don't think any new flows will be required, as the releng BU is currently any/any within itself and to the other build nets, right?
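For reference, a minimal sketch of the REG.EXE idea above. It only builds the command line that would be pasted into a cssh session; it does not run anything. The key path, value name, and REG_SZ type below are placeholder assumptions, since the real ones live on the wiki page linked above and are not reproduced in this bug.

```python
# Hypothetical placeholders: the real key and value name are documented on the
# OPSI wiki page, not in this bug. The REG_SZ type is also an assumption.
OPSI_KEY = r"HKLM\SOFTWARE\Example\opsi"  # assumption, not the real key
OPSI_VALUE = "master"                     # assumption, not the real value name


def reg_add_command(new_master_ip, key=OPSI_KEY, value=OPSI_VALUE):
    """Build a `reg add` command line that overwrites one string value."""
    # /v = value name, /t = data type, /d = data, /f = overwrite without prompting
    return 'reg add "{}" /v {} /t REG_SZ /d {} /f'.format(key, value, new_master_ip)


if __name__ == "__main__":
    # production-opsi's scl3 address from comment 1
    print(reg_add_command("10.26.48.38"))
```

Keeping the key and value as parameters means the same helper could be pointed at the staging master first, before touching production hosts.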

Dan Parsons [:lerxst]

Comment 3

•

13 years ago

I don't know what this means: "snapshot or copy the existing hosts locally to sjc1 with limited downtime"

bhearsum@mozilla.com (:bhearsum)

Updated

•

13 years ago

Blocks: 743965

Amy Rich [:arr] [:arich]

Reporter

Comment 4

•

13 years ago

Dustin: that should be correct about all all for the releng flows.

Dustin J. Mitchell [:dustin] (he/him)

Comment 5

•

13 years ago

Per Dan, we can clone the hosts inside of a 30-minute downtime. We could do this later in the day tomorrow (Thursday), with coordination from Buildduty. No tree-closure is required, as we can do this with, at worst, a slight increase in wait time (as hosts stall at the trying-to-run-OPSI stage). Dan, what does your schedule look like? OK, that's the wrong question - I'm pretty sure it looks like two hours of work for each hour from now until eternity. When should we plan to do this, and who should I look for to work the controls in vCenter?

Assignee: dustin → server-ops

Component: Server Operations: RelEng → Server Operations: Virtualization

QA Contact: arich → dparsons

Dustin J. Mitchell [:dustin] (he/him)

Comment 6

•

13 years ago

Oh yeah, and this is on the OMG EVACUATE THIS WEEK list :(

Severity: normal → critical

Dustin J. Mitchell [:dustin] (he/him)

Comment 7

•

13 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> Oh yeah, and this is on the OMG EVACUATE THIS WEEK list :(

Incorrect - the chassis this is in is *not* moving on Monday, so there's time.

Severity: critical → major

Phong Tran [:phong]

Updated

•

13 years ago

Assignee: server-ops → phong

Phong Tran [:phong]

Comment 8

•

13 years ago

What do we need to do here? If you know what is needed, one of the SREs can help with the migration.

Assignee: phong → dparsons

Dustin J. Mitchell [:dustin] (he/him)

Comment 9

•

13 years ago

Phong: We need to pick a time to take these sjc1 hosts down, clone them, then bring them back up, and start the migration of the clones to scl3.

Mike Taylor [:bear]

Comment 10

•

13 years ago

Phong - I will help coordinate with RelEng for any downtime required to make this happen. Please let me know what timeframe you are looking at for this.

Mike Taylor [:bear]

Comment 11

•

13 years ago

The plan outlined by Dustin in comment #9 works for us - we just need to have the down, clone, restart old, bring up new steps done sooner rather than later so I can start the process of migrating the slaves to point to the new instance.

Michael Burns [:mburns]

Assignee

Updated

•

13 years ago

Assignee: dparsons → mburns

Michael Burns [:mburns]

Assignee

Comment 12

•

13 years ago

staging-opsi:
* downed
* cloned (as staging-opsi-NEW)
* started the old staging-opsi
* created https://inventory.mozilla.org/en-US/systems/show/6030/

Just need to migrate the clone to SCL3's esx cluster and bring it up.

Mike Taylor [:bear]

Comment 13

•

13 years ago

do we have the new staging opsi spun up?

Mike Taylor [:bear]

Comment 14

•

13 years ago

is the staging opsi vm running?

Dustin J. Mitchell [:dustin] (he/him)

Comment 15

•

13 years ago

There's a new flow required to do the migration - that request is pending. I don't recall the bug #.

Michael Burns [:mburns]

Assignee

Comment 16

•

13 years ago

bug 746858 is tracking this unexpected missing flow.

Michael Burns [:mburns]

Assignee

Comment 17

•

13 years ago

staging-opsi.srv.releng.scl3.mozilla.com is migrated and online.

Amy Rich [:arr] [:arich]

Reporter

Comment 18

•

13 years ago

mburns: please coordinate with rail (buildduty) tomorrow to shut down, clone, and migrate the production opsi vm. The downtime starts at 9:00 pacific, but rail will give the all clear in #infra when work can commence.

Michael Burns [:mburns]

Assignee

Comment 19

•

13 years ago

looking forward to it.

Mike Taylor [:bear]

Updated

•

13 years ago

Blocks: 748814

Michael Burns [:mburns]

Assignee

Comment 20

•

13 years ago

[09:55:55] <mburns> bear: rail-buildduty: 64 bytes from production-opsi.srv.releng.scl3.mozilla.com (10.26.48.38): icmp_seq=1 ttl=61 time=4.90 ms :)
[09:56:05] <bear> \o/
[09:56:12] <rail-buildduty> whooo
[09:57:05] <bear> I can reach it and it looks like an opsi server
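The ping reply pasted above can be checked mechanically against the addresses assigned in comment 1. This is only a sketch: the regular expression assumes the Linux-style reply format seen in the log, and `check_reply` is a hypothetical helper, not something that was used during the migration.

```python
import re

# Hostname-to-IP assignments from comment 1.
EXPECTED = {
    "production-opsi.srv.releng.scl3.mozilla.com": "10.26.48.38",
    "staging-opsi.srv.releng.scl3.mozilla.com": "10.26.48.39",
}

# Matches a Linux-style ping reply such as:
#   64 bytes from host (1.2.3.4): icmp_seq=1 ttl=61 time=4.90 ms
REPLY = re.compile(
    r"bytes from (?P<host>\S+) \((?P<ip>[\d.]+)\):.*time=(?P<ms>[\d.]+) ms"
)


def check_reply(line):
    """Return (host, rtt_ms) if the reply came from the expected IP, else None."""
    m = REPLY.search(line)
    if m is None or EXPECTED.get(m.group("host")) != m.group("ip"):
        return None
    return m.group("host"), float(m.group("ms"))
```

A reply from an unknown host, or from a known host at the wrong address, yields `None`, so a stale DNS entry would show up as a failed check rather than a false success.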

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Infrastructure & Operations