Closed Bug 719715 Opened 12 years ago Closed 12 years ago

migration plan for community bugzilla project infrastructure in sjc1

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: wicked)

https://wiki.mozilla.org/Bugzilla:Infrastructure lists the Bugzilla hosts, and all are in sjc1.

cg-bugs01.mozilla.org 
cg-bugs02.mozilla.org 
cg-bugs03.mozilla.org
landfill.bugzilla.org (VM on cg-vmware01)
windows.bugzilla.org (VM on cg-vmware01)
oracle.bugzilla.org (VM on cg-vmware01)
tinderbox.mozilla.org (VM on cg-vmware01)

tinderbox.mozilla.org is no longer used and can be deleted.  For the rest, we'll need to work out a migration plan.  From IRC, it looks like it will be best to do this in two moves, with some internal shuffling of services to keep things running.

Justdave, can you take it from here?
I haven't been involved with any of those machines in 2 or 3 years, and they're all community boxes.  wicked is probably a better choice to figure out what needs to be done.

re: tinderbox, are you sure that's not tinderbox.bugzilla.org?
Assignee: justdave → wicked
Summary: migration plan for bugzilla infrastructure in sjc1 → migration plan for community bugzilla project infrastructure in sjc1
(In reply to Dustin J. Mitchell [:dustin] from comment #0)
> From IRC, it looks like it will be
> best to do this in two moves, with some internal shuffling of services to
> keep things running.

Who was this discussion with on IRC, and are there more details from that discussion?  Should probably be recorded here before everyone forgets what was said.

There's 12 days before we can start moving, so this is coming up quick.
(In reply to Dave Miller [:justdave] from comment #1)
> re: tinderbox, are you sure that's not tinderbox.bugzilla.org?

Ack, yes, that was just muscle memory tripping me up.

(In reply to Dave Miller [:justdave] from comment #2)
> Who was this discussion with on IRC, and are there more details from that
> discussion?  Should probably be recorded here before everyone forgets what
> was said.

It was with wicked in PM.  Here's the transcript (paste cleared with wicked):

[23:23:26] <dustin> ping?
[23:23:41] <wicked> ponggg?
[23:24:37] <dustin> hey, are you teh right person to talk to about the cg-bugs* hosts?
[23:24:52] <wicked> sure
[23:25:35] <wicked> or mkanat/justdave, but I've been maintained them lately
[23:25:58] <dustin> cool
[23:26:05] <dustin> so those are in a datacenter that we're leaving soon
[23:26:13] <dustin> I'd like to figure out a plan to migrate them
[23:26:30] <dustin> how mission-critical are they?
[23:26:37] <wicked> you mean a physical move?
[23:27:22] <dustin> yes
[23:27:25] <wicked> landfill and cg-bugs01 are the most important ones for Bugzilla. cg-bugs02 hosts our bots, which are nice to keep up as well.
[23:27:32] <dustin> you're lucky - they're not mac minis (which can't go in the new DC)
[23:27:54] <wicked> cg-bugs03 is going to be taken into production as soon as we have time.. but at the moment it's only configured.
[23:27:59] <wicked> indeed :)
[23:28:15] <wicked> I've been following the SM situation with those minis :)
[23:28:23] <dustin> ah, cool
[23:28:50] <dustin> I should probably open a bug to figure this out - what is your bz ident?
[23:28:59] <wicked> we have this list as well, I've been trying to keep it uptodate -> https://wiki.mozilla.org/Bugzilla:Infrastructure
[23:29:12] <wicked> :wicked finds me in bz
[23:30:18] <dustin> ok, cool
[23:33:34] <wicked> btw, we are trying to wrap up our 4.2 release so would like to avoid downtime for the central services during that
[23:33:45] <wicked> probably still takes a month, depending on how we get things done..
[23:34:05] <dustin> ok
[23:38:20] <dustin> does updates.bz.o still exist? I don't see it on cg-vmware01
[23:39:06] <wicked> it's hosted on cg-bugs02, and yes, it does exist (and I should be migrating one of our most critical service to it)
[23:39:31] <dustin> oh, I see - virtual server in the Apache sense
[23:39:56] <wicked> or qemu+kvm sense :)
[23:40:12] <dustin> either way :)
[23:41:12] <wicked> except it has nginx, not Apache
[23:41:20] <wicked> or I just don't understand what you meant by that :)
[23:41:49] <dustin> in the sense that it's a CNAME for another host
[23:42:24] <dustin> what about tinderbox.bugzilla.org?
[23:42:32] <dustin> also in use, I assume - I just don't see it lister
[23:42:44] <wicked> ah, yeah, that CNAME of updates is the old one on landfill (currently in production)
[23:43:01] <dustin> I see :)
[23:43:22] <wicked> that tinderbox is on the cg-vmware01?
[23:43:40] <dustin> correct
[23:43:55] <wicked> I believe it's not used at the moment or no longer.. and I believe phong has our permission to delete it to give landfill more disk space
[23:44:06] <wicked> our tinderbox stuff are now on cg-bugs01
[23:44:26] <dustin> ok, good to know, thanks
[23:45:00] <wicked> are you going to move the cg-vmware01 or just migrate the virtual servers via some vmware tooling to a new host?
[23:45:37] <dustin> we will probably migrate the VMs
[23:45:44] <dustin> and move the physical hardware
[23:45:55] <dustin> how much trouble would it be to renumber?
[23:46:04] <dustin> (mostly: are you doing authoritative DNS on these hosts?)
[23:46:22] <wicked> yup, makes sense
[23:46:34] <wicked> no public DNS, only internal
[23:46:43] <dustin> ok
[23:46:58] <dustin> so if we time it right, we can do the VM migrations and put the hardware on a truck, and be down for a day or so?
[23:48:07] <wicked> that's probably the best way
[23:48:57] <wicked> I was also thinking about two separate moves (since we can also time some of our own movings there and minimize downtime for some services)
[23:49:27] <dustin> that's possible - so maybe move cg-bugs03, migrate services to it, and then move cg-bugs01/02?
[23:50:34] <wicked> actually, move cg-bugs02 and 03 and then move landfill services to them
[23:51:00] <wicked> it'll still take some time to migrate the landfill VM or is that fast move? (since it's virtual)
[23:51:34] <wicked> then we can move landfill and cg-bugs01, and those are the most critical ones for our development efforts
[23:51:39] <dustin> it will need to be renumbered
[23:51:46] <dustin> so it's more than a few seconds
[23:52:02] <wicked> renumbered means IP change? or something else?
[23:52:06] <dustin> correct, IP change
[23:52:40] <wicked> yup, then we'd appreciate if we can minimize downtime for the updates (and bugbot) by moving them to cg-bugs02 once it's moved
[23:52:59] <dustin> btw, I'm copying justdave on this - he may end up being the Mozilla person responsible for these systems, which I assume would work well for you?
[23:53:28] <wicked> we can live a while without word and new bugbot (those are the bots on bots.bugzilla.lan)
Depends on: 720553
I also see a host known as either bugzilla-api01 or dm-bugzilla-api02 in vmware, but not in bugzilla or DNS or inventory.  I'm assuming this was a one-off VM that was never used, but please speak up if that's not the case!
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> I also see a host known as either bugzilla-api01 or dm-bugzilla-api02 in
> vmware, but not in bugzilla or DNS or inventory.  I'm assuming this was a
> one-off VM that was never used, but please speak up if that's not the case!

Already discussed on IRC, but just to get it on the bug, that's actually part of the production Bugzilla infrastructure (sort of), it's api-dev.bugzilla.mozilla.org.
I'd like to put the first of these two moves on the "B train" on 3/12, for which we need to get tickets by Monday.  So, :wicked, please confirm:

cg-bugs02 and cg-bugs03 are cleared of critical services before 3/12, and will spend most of the day in transit and being connected to the new community network in scl3.  Once that's up, you can migrate services to them from cg-bugs01, freeing up cg-bugs01 to move on the C train (3/26).

If, for some reason, the community network is not ready by 3/12, we won't do the move that day.  The VMs on cg-vmware01 can be migrated any time after next week or so.  I'll check back when we're ready.

Please let me know if that won't work, or if I've missed anything, as soon as possible!
Ugh, what made me think that deadline was Monday?  It was today.  So trains C/D it is:

cg-bugs02 and cg-bugs03 move on 3/26
cg-bugs01 moves on 4/9
Okay.. I guess that'll happen then. :) FWIW, I don't think there's anything stopping us from going forward anyway. We can surely live without word, which is on cg-bugs02, for that day. The cg-bugs01 services actually can't be moved elsewhere (at least not easily), so we just have to take a day or so of downtime on our automated tests.

It was the landfill services that I was planning to move, so I don't think landfill should move until cg-bugs02/03 are moved and ready to take some of its services.
Depends on: 733962
This is next Monday!

cg-bugs02 and cg-bugs03 move on 3/26.  Plan on a day-long downtime.

These will get new public IPs.  You're free to do what you like with the private IPs.

If you can be around IRC as much as possible on Monday, that will help.  I'll post more details as I have them here.
Already, wow? :)

At what time will you start shutting down the servers and bringing them back online at the new location? Is that downtime day in the PST timezone? I'm sure you will change all old IPs to the new IPs in the public name servers as part of the move, yes?

I'll try to be online but I do have to work and sleep sometime as well. ;)
Time flies!

I'd like to shut these down on Sunday night.  I have OOB management information for them, but not login passwords, so if you can communicate such to me via GPG (7F0D15B1) that would be great.  If you'd prefer to keep them and do the shutdown and set-up yourself, that's cool too (I can use single-user mode via OOB to config the new IPs).

Yes, we will update DNS with the new IPs.  And yes, this will be PST daytime.

I don't have IPs yet, but I expect them by COB today.
Sunday night is rapidly approaching - how do you want to handle shutting these down?

I don't have IPs yet, but I'm working on it.
Hey, it's already Monday morning here. ;)

I take it you mean the root pw, which I actually don't have (I only have sudo and key-based ssh to root). I believe justdave installed these hosts so he might have them, but to be safe, I'll just reset them and send them to you.

As for shutting down and starting, there's only word and new bugbot that are in production. They should start automatically once their VM starts. And the VMs on cg-bugs02 will automatically suspend and resume when the host is shut down and started. So you should be able to handle shutting down and starting if I don't happen to be online at the moment. I'll fix any breakage this evening if needed.

This move might break bugmail to new bugbot if the vlan23 doesn't work between sites (landfill needs to talk to a cg-bugs02 VM where the bots live) but if that happens, I can fix it by using the public bugbot@ address instead (once I set it up at my end).

Do you happen to know the exact times when things are happening, so I can try to be online at those times to help with anything you need?
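The thread does not say how the automatic suspend and resume of the VMs on cg-bugs02 is configured. On hosts that manage qemu+kvm guests through libvirt, one common way to get this behavior is the libvirt-guests service. The fragment below is only a sketch under that assumption; nothing in this bug confirms that cg-bugs02 uses libvirt or this file.

```shell
# /etc/sysconfig/libvirt-guests (assumed location on RHEL-style hosts)
# Assumption: guests are managed by libvirt; not confirmed in this bug.

# Save (suspend to disk) running guests when the host shuts down.
ON_SHUTDOWN=suspend

# On host boot, start the guests that were running at shutdown,
# restoring them from their saved state.
ON_BOOT=start
```

With settings like these, a host power-down for the move would not need any per-VM action, which matches the behavior described above.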
Status: NEW → ASSIGNED
I don't know the exact times - the process is somewhat black-box for those of us not onsite - but with the information I have, I can get the machines online with their new IP addresses tomorrow, and provide that and access information to you as soon as it's ready for your TLC.  This is my focus for the day, until it's done, so I'll be in touch ASAP.

I'm taking the hosts down now.
Status: ASSIGNED → NEW
These hosts are back up, at
 bugzilla{2,3}.community.scl3.mozilla.com
replete with new hostnames, new DNS resolvers (google's..), and a new private vlan23.  That VLAN currently has the same IP space as in sjc1, but of course you're free to renumber that as you like - it's your VLAN.

Note that we still don't have out-of-band management set up for these hosts.  I will likely want to take a short downtime to set that up at some point.  Also note that I've dropped my SSH public key into root's authorized_keys.  You can remove this if and when you'd like, but for the time being it will probably make things smoother.

I think these hosts should be good to go - please find me with any problems!
I just redirected *.demo.b.o and demo.b.o (CNAMEs) to bugzilla3.  The change should "soak in" to DNS shortly.
Let's focus on getting the VMs moved over the next week or so, then move bugzilla1 on 4/23.

Those are
 landfill.bugzilla.org (VM on cg-vmware01)
 windows.bugzilla.org (VM on cg-vmware01)
 oracle.bugzilla.org (VM on cg-vmware01)
 tinderbox.mozilla.org (VM on cg-vmware01)

and if it's easiest, we can (Dan will correct me if I'm wrong) use VMware Converter to convert them to the new ESX cluster.  Depending on total size, load, and where they're stored, that can take 1-4 hours each.  I think we can overlap that work, so let's figure a combined 6-hour downtime?  We can also split that up if that works better.

When's a good time for this?
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> Let's focus on getting the VMs moved over the next week or so, then move
> bugzilla1 on 4/23.
..
>  landfill.bugzilla.org (VM on cg-vmware01)

I'll work on getting some critical landfill services migrated this weekend (to the already moved servers) and depending on how this progresses we can set a day.
 
>  windows.bugzilla.org (VM on cg-vmware01)
>  oracle.bugzilla.org (VM on cg-vmware01)

These are not that critical (or might even be more or less unused) so we can migrate these more freely. Justdave or mkanat, do you happen to know more about status of these? I don't think I even have access to them at the moment..

>  tinderbox.mozilla.org (VM on cg-vmware01)

Remember, that's tinderbox.bugzilla.org and it can just go away. But, please please, don't decommission tinderbox.mozilla.org as we need that to publish our tinderbox test results. If you do, we'll need this VM to create a new TB server for us.. :)

> and if it's easiest, we can (Dan will correct me if I'm wrong) use VMware
> Converter to convert them to the new ESX cluster.  Depending on total size,

What kind of conversion is needed? And is it likely that we'll face some problems due to the age of the landfill filesystems (or of landfill in general) that justdave noticed earlier?
(sorry, tinderbox.*mozilla*.org was a copy/paste of my typo in comment 0)

We'll have to see how the conversion works out - that's hard to predict otherwise.
oracle.bugzilla.org can probably be deleted. It backs the Oracle landfill test install, but we should probably set that up again.

windows.bugzilla.org isn't used all that often, but it is used sometimes. You can RDP to it as Administrator with the password that's stored on landfill as /root/windows-password
for my reference, wicked just mentioned that landfill isn't moved yet, but may soon
I just requested that cg-bugs01/bugzilla1 be moved on the 4/23 train.
Apparently that train is on 4/24, so it will move that day.  That should avoid me needing to do anything Sunday night :)
So, we should delete oracle.b.org in sjc1 and install a new RHEL5 Update 2 install in scl3 instead. It can't be a RHEL6 since Oracle DB doesn't support it yet. I can take care of Oracle Express install as soon as base system is there unless you want to or can do it easily. Should this be a separate bug, though?

As for windows.b.org, I managed to access it to take a look and it can move at any time you have some time. Isn't that a good place to test the VMware Converter for the landfill.b.org move or is Windows vs. Linux conversion different?

I'm still working with landfill.b.org to get it ready for migration. Nice that bz1 move wasn't this Monday like originally planned. :)
(In reply to Teemu Mannermaa (:wicked) from comment #24)
> So, we should delete oracle.b.org in sjc1 and install a new RHEL5 Update 2
> install in scl3 instead. It can't be a RHEL6 since Oracle DB doesn't support
> it yet. I can take care of Oracle Express install as soon as base system is
> there unless you want to or can do it easily. Should this be a separate bug,
> though?

Oracle totally works on RHEL6... They want you to move to OEL, but you should be able to install Oracle just fine.
(In reply to Reed Loden [:reed] (very busy) from comment #25)
> Oracle totally works on RHEL6... They want you to move to OEL, but you
> should be able to install Oracle just fine.

According to http://docs.oracle.com/cd/E17781_01/install.112/e18802/toc.htm#BABDHJHB and forum posts on the same site, RHEL6 is neither a tested nor a supported option. RHEL5 Update 2 is the best bet if we want to make sure it'll work without too much fighting (and I've heard there's already a lot of that even if we go with Oracle-sanctioned platforms).
They don't support OEL6 yet, either.  If they did, I'd say they were just trying to encourage the OEL switch.  But since OEL6 isn't supported either, it's fair to say the lack of RHEL6 support is legitimate.
DC Ops didn't have any record of cg-bugs01 on the F train list, but I just re-iterated.  I'll check again in a day or two to make sure it got on the list.
I got OOB IPs allocated for these hosts, but we don't run DHCP in the OOB network, so unless your IPMI-fu is better than mine, these will remain un-OOB-manageable until we boot them into the BIOS to assign a fixed address.

I'll get bugzilla1's OOB working when it moves.
http://lonesysadmin.net/2007/06/21/how-to-configure-ipmi-on-a-dell-poweredge-running-red-hat-enterprise-linux/ suggests this *is* possible to do on a running system.  I can do this if you're OK with me installing and running ipmi; or you can do it; or we can leave it as-is for now.  Let me know what you'd like.
Okay, both hosts should now have ipmitool, which should allow you to set the OOB network information by using its |ipmitool lan| commands.

I'd like the OOB functioning so MoCo IT can kick the servers remotely if they hang for some reason. :) Even better if I could do it myself but this might not be possible due to security.

BTW, any plans on the windows.b.org VM migration yet?
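For reference, a minimal sketch of what setting a static OOB address from the running OS with |ipmitool lan| might look like. The function name set_oob_static, the DRY_RUN switch, LAN channel 1, and the addresses in the usage note are illustrative assumptions, not values from this bug. ipmitool also needs the kernel IPMI device to be available, which on RHEL is typically provided by the ipmi service.

```shell
# Sketch: assign a static address to the BMC's LAN interface.
# Usage: set_oob_static CHANNEL IPADDR NETMASK
# Set DRY_RUN=1 to print the commands instead of running them.
set_oob_static() {
    channel=$1
    addr=$2
    mask=$3
    # When DRY_RUN is set, prefix every command with "echo".
    run=${DRY_RUN:+echo}

    # Switch the BMC from DHCP to a static address source.
    $run ipmitool lan set "$channel" ipsrc static || return 1
    $run ipmitool lan set "$channel" ipaddr "$addr" || return 1
    $run ipmitool lan set "$channel" netmask "$mask" || return 1
    # Show the resulting settings for review.
    $run ipmitool lan print "$channel"
}
```

Running `DRY_RUN=1 set_oob_static 1 192.0.2.10 255.255.255.0` prints the four ipmitool commands without touching the BMC, which is a reasonable first step before running it for real as root.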
OK, bugzilla2/3's OOB's are set up.

I've been stalling VM migrations while we iron out network issues, and while the Mozilla virtualization folks deal with a massive glut of migrations, but I'll get windows.b.o on the list.
Cool, thanks for letting me know! On both things. :)

No rush on the windows.b.org VM migration on my part. I just thought you'd want to move that (and landfill) soon to get rid of cg-vmware01. Do you have an ETA on when you need to get that killed off, so I know when I need to get landfill ready (two services still need to be moved from it, the most important ones)? And I take it there's a new VMware host for these virtual machines to run on in scl3?
Can you drop my root key on bugzilla1, and reset the password there?  Probably easiest to leave the new password in a file in /root.  Please also install ipmitool at the same time - I'll use all that to preflight the settings (both IPMI and host) that it will need when it moves on Tuesday (April 24).
Sure, should be all set now. I already had ipmitool there as I installed it same time as for the other two servers.

Any ETA on the time bugzilla1 goes down? And comes up on the other side? I do know you don't know the exact timing but was it 6h or more/less downtime for the previous moves?
The movers start at 9am pacific on Tuesday, so I'll shut it down around 8:30, to be safe.

I don't have detailed timing notes for the last moves, but if I recall they've taken around 8 hours from power-down to full network accessibility.  That was without the OOB, though, so we may be able to do better this time around.
Actually, Google Calendar just reminded me I have a doctor's appointment at 8am pacific.  I can shut cg-bugs01 down a half-hour before that, or leave it for you to do later - which would you prefer?
Shutting cg-bugs01 down at 7:30am PDT is fine for me. (That's 17:30 EEST for me, so you'll have 12h anyway until I'm online again to check/fix things.) I need to shut down the tests gracefully at the right state, so I'll be doing that a bit earlier.
wicked confirmed ready to shut down in #it.

IPMI changed:

IP Address Source       : Static Address
IP Address              : 10.22.1.40
Subnet Mask             : 255.255.248.0

Hostname changed:

[root@cg-bugs01 ~]# grep HOST /etc/sysconfig/network
HOSTNAME=bugzilla1.community.scl3.mozilla.com

Host address changed:

[root@cg-bugs01 ~]# grep -E '(IPADDR|NETMASK)' /etc/sysconfig/network-scripts/ifcfg-bond0 
IPADDR=63.245.223.24
NETMASK=255.255.255.128

and halted.
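The two netmasks above are easier to compare as prefix lengths. The helper below is a small sketch (mask_to_prefix is a hypothetical name, not something used on these hosts) that converts a dotted netmask to its prefix length. It validates each octet but does not check that the mask bits are contiguous across octets.

```shell
# Sketch: convert a dotted-quad netmask to a CIDR prefix length.
mask_to_prefix() {
    bits=0
    old_ifs=$IFS
    IFS=.
    # Split the mask on dots into the positional parameters.
    set -- $1
    IFS=$old_ifs
    for octet in "$@"; do
        case $octet in
            255) bits=$((bits + 8)) ;;
            254) bits=$((bits + 7)) ;;
            252) bits=$((bits + 6)) ;;
            248) bits=$((bits + 5)) ;;
            240) bits=$((bits + 4)) ;;
            224) bits=$((bits + 3)) ;;
            192) bits=$((bits + 2)) ;;
            128) bits=$((bits + 1)) ;;
            0) ;;
            *) echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$bits"
}
```

For the values in this comment, `mask_to_prefix 255.255.248.0` gives 21 for the IPMI network and `mask_to_prefix 255.255.255.128` gives 25 for the host address on bond0.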
Bugzilla1 is ready to go.  Derek's fixing up the IPMI right now.  Vlan23 should be plumbed correctly, but I'll verify that when the host is back up.  That just leaves ye olde VMs.
[root@bugzilla3 ~]# ping 192.168.99.25
PING 192.168.99.25 (192.168.99.25) 56(84) bytes of data.
64 bytes from 192.168.99.25: icmp_seq=1 ttl=64 time=0.659 ms

which is to say, vlan23 is working correctly.
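The same reachability check can be repeated for several vlan23 peers at once. This is a sketch only: check_peers and the PING_CMD override are hypothetical names introduced here, and the addresses in the usage note are simply the two that appear in this bug.

```shell
# Sketch: ping each given address once and report which ones answer.
# PING_CMD can be overridden (for example in tests); it defaults to ping.
check_peers() {
    failed=0
    for peer in "$@"; do
        if ${PING_CMD:-ping} -c 1 "$peer" >/dev/null 2>&1; then
            echo "ok   $peer"
        else
            echo "FAIL $peer"
            failed=1
        fi
    done
    # Non-zero exit status if any peer did not answer.
    return "$failed"
}
```

For example, `check_peers 192.168.99.25 192.168.99.2` run from a host on vlan23 prints one line per address and exits non-zero if any of them is unreachable.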
The IPMI on bugzilla1's still not working correctly.  I'm going to see if I can upgrade it.  Please hold off standing this back up in production for a bit.
OK, that was easy - apparently the password was wrong, and the web interface doesn't indicate that very well.  So, fixed, booted, ready for you to bring it back up!  And I'll stop updating this bug every five minutes now.
So, windows.b.o has a bug (bug 746209), and will hopefully get done soon.

AIUI the only other VM to migrate is landfill.  Oracle isn't moving (or will be rebuilt), and tinderbox is already off.

Should we get a bug filed for moving landfill?
The ESX host containing landfill and oracle is on the 5/14 train.  :wicked, what's the status?
cg-vmware01 is on today's train (bug 753011).  I suspect that this bug will be finished once that's done.
OK, bugzilla-landfill.community.scl3.mozilla.com is up and available.  It doesn't yet have vlan23, but mburns is working on that.

Please let me know what other hosts need vlan23 that don't have it.

I'll fix up DNS tomorrow -- I assume reverse and forward should be *.bugzilla.org, and we should drop the bugzilla-xxx.community.scl3 stuff entirely.
Added a network adapter in Inventory.

( nic0,	63.245.217.61,	00:50:56:94:00:11,	scl3-vlan23 )

There wasn't an existing vlan23 (or vlan20) yet, FYI.
vlan23 is waiting on bug 753786.
And that's done:

[root@landfill ~]# ping 192.168.99.2
PING 192.168.99.2 (192.168.99.2) 56(84) bytes of data.
64 bytes from 192.168.99.2: icmp_seq=1 ttl=64 time=2.97 ms

I *think* that's everything here.  If I'm wrong, re-open!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Blocks: 831158
Product: mozilla.org → mozilla.org Graveyard