Closed Bug 674144 Opened 14 years ago Closed 14 years ago

Add disks to kvm[1..4].scl1

Categories: Infrastructure & Operations :: RelOps: General
Platform: x86 macOS
Type: task
Priority: Not set
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: zandr, Assigned: arich)
Whiteboard: [buildduty]

Disks are in hand, and I'll install them tomorrow AM. Will reassign to bkero to get them in service once they're installed.
Blocks: 674146
Assignee: server-ops-releng → zandr
Assignee: zandr → bkero
Disks are installed. Assigning to bkero to figure out how to get them into service.
Because we're using hardware RAID on these machines, the controller does not allow us to add drives to the existing RAID1 array. Instead, we will need to rebuild the array as a 4-drive RAID10, which means reinstalling the OS. We could do this the simple way (migrate the VMs off, rsync the rootfs off, rebuild the RAID, rsync everything back), or we could use this as an opportunity to deploy RHEL 6.1 with our Ganeti puppet class.
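For the simple path, a rough sketch of the shuffle (hostnames, mount points, and the rescue environment here are illustrative assumptions, not our actual setup):

    # 1. Migrate or shut down the VMs on the node, then from a
    #    rescue/netboot environment copy the rootfs off:
    rsync -aHAX --numeric-ids /mnt/oldroot/ backuphost:/srv/kvm-rootfs/
    # 2. Rebuild the controller's array as a 4-drive RAID10,
    #    repartition, and make filesystems.
    # 3. Copy everything back and reinstall the bootloader:
    rsync -aHAX --numeric-ids backuphost:/srv/kvm-rootfs/ /mnt/newroot/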
Perhaps we can give it a shot on kvm3 if it is still empty. Is RHEL 6.1 with our Ganeti puppet class the desired long-term solution? I vote for whatever we want to support long term, but I really don't know which is better/preferred.
After talking this over with zandr, I think the downtime and possible associated risks required to move to RHEL are too significant at this point. We should formulate a plan with releng to live migrate vms with no downtime (staying with the current OS) for the time being. At some later date, we can revisit scheduling something like a day's worth of downtime to get to RHEL.
So at this point, we need someone from the releng side to work with bkero on scheduling the migration of things off the buildbot masters (to be on the safe side) as we migrate those and the other hosts on the kvm servers one by one, so that we can rebuild each of the four physical kvm/ganeti machines. Who should he coordinate with to come up with a schedule and a game plan?
Is kvm3 still empty as Armen suggests in comment #3? If we can get the unused machine ready, we can do this step-wise migration of masters. I will sign up for this if I cannot find someone else in releng to take it on.
kvm3 is not empty, but it only has a few hosts that will be easy to move. For some reason, most of the new masters are stacked up on kvm4. If we aren't changing node (host) operating systems, these can be live migrations (zero downtime). However, masters are busy enough that the hypervisor has a hard time finding a quiet moment to flip the switch, so we'll need to reduce load or pause various masters to get them to jump. Once we're done shuffling instances around to bring the disks online, we'll probably want to do a rebalance, which has the same problem. This probably boils down to a day or so of releng moving load around and bkero working on nodes.

gnt-instance info, for reference:

Instance Hypervisor OS Primary_node Status Memory
admin1.infra.scl1.mozilla.com kvm image+manual kvm1.infra.scl1.mozilla.com running 1.0G
arr-client-test.build.scl1.mozilla.com kvm image+rhel-60 kvm3.infra.scl1.mozilla.com running 512M
arr-test.build.scl1.mozilla.com kvm image+rhel-60 kvm3.infra.scl1.mozilla.com running 512M
autoland-staging01.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 4.0G
autoland-staging02.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 4.0G
buildapi01.build.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 4.0G
buildbot-master4 kvm image+centos-55 kvm2.infra.scl1.mozilla.com running 6.0G
buildbot-master6.build.scl1.mozilla.com kvm image+centos-55 kvm2.infra.scl1.mozilla.com running 6.0G
buildbot-master11.build.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 6.0G
buildbot-master12.build.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 6.0G
buildbot-master13.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 6.0G
buildbot-master14.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 6.0G
buildbot-master15.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 6.0G
buildbot-master16.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 6.0G
buildbot-master17.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 6.0G
dc1.sandbox.scl1.mozilla.com kvm image+manual kvm3.infra.scl1.mozilla.com running 2.0G
dc01.winbuild.scl1.mozilla.com kvm image+manual kvm1.infra.scl1.mozilla.com running 2.0G
dev-master01.build.scl1.mozilla.com kvm image+centos-55 kvm2.infra.scl1.mozilla.com running 6.0G
ganglia1.build.scl1.mozilla.com kvm image+rhel-60 kvm2.infra.scl1.mozilla.com running 512M
linux-hgwriter-slave03.build.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 2.0G
linux-hgwriter-slave04.build.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 2.0G
master-puppet1.build.scl1.mozilla.com kvm image+centos-55 kvm2.infra.scl1.mozilla.com running 256M
ns1.infra.scl1.mozilla.com kvm image+manual kvm2.infra.scl1.mozilla.com running 512M
ns2.infra.scl1.mozilla.com kvm image+manual kvm1.infra.scl1.mozilla.com running 512M
redis01.build.scl1.mozilla.com kvm image+centos-55 kvm2.infra.scl1.mozilla.com running 8.0G
releng-mirror01.build.scl1.mozilla.com kvm image+centos-55 kvm3.infra.scl1.mozilla.com running 512M
scl-production-puppet-new.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 2.0G
slavealloc.build.scl1.mozilla.com kvm image+rhel-60 kvm2.infra.scl1.mozilla.com running 256M
talos-addon-master1.amotest.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 4.0G
testhost1 kvm image+rhel-60 kvm3.infra.scl1.mozilla.com running 512M
wds1.sandbox.scl1.mozilla.com kvm image+manual kvm3.infra.scl1.mozilla.com running 2.0G
wds01.winbuild.scl1.mozilla.com kvm image+manual kvm2.infra.scl1.mozilla.com running 2.0G
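(For reference, the listing above is what the cluster master prints by default; field names below assume a stock Ganeti 2.x install.)

    # Default columns match the table above
    gnt-instance list
    # Or select the columns explicitly:
    gnt-instance list -o name,hypervisor,os,pnode,status,oper_ram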
Whiteboard: [buildduty]
Ben, I can handle disabling masters in preparation for migration. Let me know which one you want to start with.
catlee, I'm on call this week, so scheduling is difficult. Today I'm rebuilding another host as its own ganeti cluster, so I don't have time to do this today. Would Thursday afternoon (2PM PST) be an appropriate time for this?
Ok, on Thursday we can definitely move off kvm3, as there are no build masters there. We can probably move off kvm1, since there are only two masters there. I can start getting them idle earlier in the day so that by 2pm they're ready to migrate.
Alice: can we live migrate talos-addon-master1.amotest.scl1.mozilla.com tomorrow so we can rebuild the kvm server it's sitting on? In theory, there will only be a slight disruption as the host migrates (no connections lost). Worst case, we'd have to shut it down briefly if it hangs during the migration.
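In Ganeti terms, the two outcomes above map onto two standard commands (a sketch of typical usage, assuming a DRBD-backed instance):

    # Preferred: live migration to the secondary node, no downtime
    gnt-instance migrate talos-addon-master1.amotest.scl1.mozilla.com
    # Worst case: failover shuts the instance down and restarts it
    # on the secondary node (brief downtime)
    gnt-instance failover talos-addon-master1.amotest.scl1.mozilla.com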
These aren't currently being used and can be shut off / moved any time:

linux-hgwriter-slave03.build.scl1.mozilla.com (kvm1)
linux-hgwriter-slave04.build.scl1.mozilla.com (kvm1)
buildapi01.build.scl1.mozilla.com (kvm1)
redis01.build.scl1.mozilla.com (kvm2)
Disabled bm11 and bm12 in slavealloc; will let you know when they're idle enough to move.
All VMs are evacuated from kvm3, and only talos-addon-master1.amotest.scl1 remains on kvm1 (it hit a bug that prevents it from being live migrated). As part of the rebuild we're installing a newer minor version of KVM that should fix these issues. kvm3.infra.scl1's array has been upgraded to RAID10, but it's awaiting action from netops to switch it to vlan6 so it can be re-netinstalled.
Assignee: bkero → network-operations
Netops has nothing to do with adding disks. We do have something to do with bug 678407 though.
Assignee: network-operations → server-ops-releng
Assignee: server-ops-releng → bkero
(In reply to Ben Kero [:bkero] from comment #14)
> all VMs are evacuated from kvm3, and only talos-addon-master1.amotest.scl1
> remains on kvm1, (it encountered a bug to where it cannot be live migrated).
> As part of the rebuild we're installing a newer minor version of KVM that
> should fix these issues.
>
> kvm3.infra.scl1's array has been upgraded to RAID10, although it's awaiting
> action from netops to switch it to vlan6 so it can be re-netinstalled.

(In reply to Ravi Pina [:ravi] from comment #15)
> Netops has nothing to do with adding disks. We do have something to do with
> bug 678407 though.

What's the next step here?
Ben is working on installing kvm and getting these nodes back into the cluster this morning.
Ben is working on replicating disks to kvm1 and kvm3, at which point these nodes will be re-merged to the cluster.
kvm1 and kvm3 were rebuilt and added back into the cluster, and today the VMs' disks were replicated onto those two hosts. That work just finished. The next step is to migrate VMs from kvm2 and kvm4 onto the newly rebuilt machines, then rebuild kvm2 and kvm4.
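For the record, re-merging a rebuilt node follows the usual Ganeti pattern; a sketch (the instance name is illustrative, not necessarily what was run here):

    # Re-add the rebuilt node under its old cluster identity
    gnt-node add --readd kvm1.infra.scl1.mozilla.com
    # Re-create the secondary (DRBD) volumes for an instance whose
    # secondary disks lived on the rebuilt node
    gnt-instance replace-disks -s buildbot-master11.build.scl1.mozilla.com
    # Sanity-check cluster state afterwards
    gnt-cluster verify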
Per meeting with IT yesterday:
* bkero at conference, so work deferred to next week
I'm ready to do the rest of the migrations this week. When is a good time for build/releng for me to do the work? Will Thursday afternoon (2PM PST) be enough lead time to schedule everything? I propose that day/time.
Based on the issues we had last time, I don't think we're going to be able to do both hosts in a single day. I propose that we do kvm4 first, since it only has four hosts left on it at this point:

buildbot-master13.build.scl1.mozilla.com 6.0G
buildbot-master14.build.scl1.mozilla.com 6.0G
buildbot-master15.build.scl1.mozilla.com 6.0G
buildbot-master16.build.scl1.mozilla.com 6.0G
buildbot-master17.build.scl1.mozilla.com 6.0G

I would also like to move ns2 off to another kvm host at the same time, since right now both ns hosts are on kvm2. Then, in the second pass, we do kvm2, which currently has the following on it (but will have more once those buildbot hosts migrate):

admin1.infra.scl1.mozilla.com 1.0G
buildbot-master4 6.0G
buildbot-master6.build.scl1.mozilla.com 6.0G
buildbot-master11.build.scl1.mozilla.com 6.0G
buildbot-master12.build.scl1.mozilla.com 6.0G
dev-master01.build.scl1.mozilla.com 6.0G
ns1.infra.scl1.mozilla.com 512M
ns2.infra.scl1.mozilla.com 512M
slavealloc.build.scl1.mozilla.com 256M

At the end, I would like to balance the hosts out so that we have good redundancy and the load spread across all four servers.
Okay, kvm4 has five hosts. I can count.
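For the balancing pass at the end, Ganeti's htools are the usual mechanism; a sketch, assuming hbal is installed on the cluster master:

    # Dry run: show the moves hbal would make to even out the cluster
    hbal -L
    # Execute the proposed migrations/failovers
    hbal -L -X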
I chatted with arr on IRC, and this is what we think makes more sense:
1) Move as many masters as possible away from kvm4 on Wednesday morning EDT (before loads get high)
   * arr and I
2) bkero does his work on kvm4 on his own schedule
   * arr and I will give bkero a clear flag

For #1, there are two concerns:
* When we move the masters (even after they're shut off), they might need a reboot.
* bm15 and bm16 are the only tests-scl1-windows masters that can accept rev3 Windows testers.
  ** This means we will have to do them sequentially and when the load is low.

buildbot-master17 has already been moved since there was nothing running on it.
As discussed on IRC, Armen and I will work to clear all of the buildbot-master VMs off of kvm4 tomorrow morning (EDT), and bkero will work on upgrading kvm4 after he gets back into the office in the afternoon/evening (PDT). Armen will start migrating processes off of the buildbot masters on kvm4 at 11:00 and let me know when each one is done; then I will migrate the VM itself to a new kvm host.
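For context, a "clean shutdown" tells a buildbot 0.8.x master to stop accepting new jobs and exit once running builds finish; it is commonly triggered by signalling the daemon (a sketch; the master basedir path here is an assumption):

    # On the master host; twistd.pid lives in the master's basedir
    kill -USR1 $(cat /builds/buildbot/master/twistd.pid)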
I got mid-aired:

> I have marked masters 13, 14 & 15 to be disabled from slavealloc and marked
> them for "clean shutdown".
>
> I will do #16 once we have migrated #15.

arr and I have already moved #14, and #15 is ready to be moved once a ganeti job times out.
All VMs have been migrated off of kvm4, and it's ready for you to begin work whenever you like, bkero. I have not put the host into downtime, so you may want to do that before you begin.
The upgrade of kvm4 is complete. It's now back in the cluster, and VMs that use it as a secondary are now replicating their disks there. As of right now the following disks are left to replicate:

autoland-staging02.build.scl1.mozilla.com
buildapi01.build.scl1.mozilla.com
buildbot-master13.build.scl1.mozilla.com
buildbot-master14.build.scl1.mozilla.com
buildbot-master15.build.scl1.mozilla.com
buildbot-master16.build.scl1.mozilla.com
dc02.winbuild.scl1.mozilla.com
releng-mirror01.build.scl1.mozilla.com
scl-production-puppet-new.build.scl1.mozilla.com
talos-addon-master1.amotest.scl1.mozilla.com
testhost1

The next step is to migrate some hosts back onto it and migrate hosts off of kvm2 to finish the upgrade. Buildduty: I recommend scheduling the move of the following VMs sometime tomorrow or Monday (giving the cluster a chance to finish replicating the disks today):

admin1.infra.scl1.mozilla.com
buildbot-master4
buildbot-master6.build.scl1.mozilla.com
buildbot-master11.build.scl1.mozilla.com
buildbot-master12.build.scl1.mozilla.com
dev-master01.build.scl1.mozilla.com
ns1.infra.scl1.mozilla.com
slavealloc.build.scl1.mozilla.com
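Since Ganeti's replicated disks are DRBD devices, resync progress can be watched on the node itself (a sketch; assumes the standard DRBD status interface on these kernels):

    # On kvm4: per-device DRBD state, including sync percentage
    watch -n 30 cat /proc/drbd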
All vms are migrated off of kvm2.
bkero, any updates? On another note, how can we measure the improvement from this bug, before vs. after? Should this be by checking cpu_wio over time? Is there a way to check the performance of the actual kvm hosts rather than the VMs they host? Not sure if all of these questions can be answered, but it would be great to know how we can measure the improvement. I would like to measure after kvm2 is back together and we've spread the load across the hosts.
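One low-tech way to get at the host-side (not VM-side) numbers, as a complement to ganglia's cpu_wio graphs — a sketch, assuming sysstat is installed on the kvm nodes:

    # Extended per-device I/O stats every 5 seconds: compare %iowait,
    # await, and %util on the array before and after the RAID10 rebuild
    iostat -x 5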
kvm2 was rebuilt last night/this morning, and I validated/tested the configuration this morning and am in the process of resyncing the secondary disks back over. Once that is complete, we'll be ready to move the following hosts back onto kvm2:

buildbot-master6.build.scl1.mozilla.com
buildbot-master16.build.scl1.mozilla.com
autoland-staging01.build.scl1.mozilla.com
dc01.winbuild.scl1.mozilla.com
linux-hgwriter-slave03.build.scl1.mozilla.com
ns1.infra.scl1.mozilla.com
wds01.winbuild.scl1.mozilla.com

If buildduty could coordinate with me to do a clean shutdown of buildbot on buildbot-master6 and buildbot-master16 today, that would be great.
Assignee: bkero → arich
All nodes have been upgraded, and the cluster is back to fully operational at this point. Thanks to bkero for his help.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations