Closed
Bug 674144
Opened 14 years ago
Closed 14 years ago
Add disks to kvm[1..4].scl1
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: zandr, Assigned: arich)
References
Details
(Whiteboard: [buildduty])
Disks are in hand, and I'll install them tomorrow AM.
Will reassign to bkero to get them in service once they're installed.
Updated•14 years ago
Assignee: server-ops-releng → zandr
Updated•14 years ago
Assignee: zandr → bkero
Comment 1•14 years ago
Disks are installed. Assigning to bkero to figure out how to get them into service.
Comment 2•14 years ago
So because we're using hardware RAID on this, the controller does not allow us to add additional drives to the RAID1 array. Instead, we will need to rebuild the RAID array using a 4-drive RAID10 setup. This will mean a need to reinstall the OS.
We could do this with a simple migrate-the-VMs-off, then rsync-the-rootfs-off, then rebuild the RAID, then rsync-everything-back, or we could use this as an opportunity to deploy RHEL6.1 with our Ganeti puppet class.
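The first option can be sketched as a rough runbook. This is a hypothetical outline rather than the exact procedure used here; the hostnames and backup path are illustrative, and the commands are printed instead of executed so the plan can be reviewed before anyone runs it for real.

```shell
#!/bin/sh
# Hypothetical sketch of the rsync-based rebuild plan from this comment.
# Commands are echoed for review, not run against the cluster.
plan() { echo "WOULD RUN: $*"; }

NODE=kvm1.infra.scl1.mozilla.com        # node being rebuilt (illustrative)
BACKUP=admin1.infra.scl1.mozilla.com    # temporary holding host (illustrative)

# 1. Live-migrate all primary instances off the node
plan gnt-node migrate -f "$NODE"

# 2. Copy the root filesystem aside before destroying the array
plan rsync -aHAX --numeric-ids "$NODE":/ "$BACKUP":/backups/"$NODE"/

# 3. Rebuild the array as RAID10 in the controller, then restore
plan echo "rebuild RAID1 -> 4-drive RAID10 in the controller firmware"
plan rsync -aHAX --numeric-ids "$BACKUP":/backups/"$NODE"/ "$NODE":/
```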
Comment 3•14 years ago
Perhaps we can give it a shot on kvm3 if it is still empty.
Is RHEL6.1 with our Ganeti puppet class the long term desired solution? I vote for whatever we want to support long term but I really have no knowledge of what is better/preferred.
Comment 4•14 years ago
After talking this over with zandr, I think the downtime and possible associated risks required to move to RHEL are too significant at this point. We should formulate a plan with releng to live migrate vms with no downtime (staying with the current OS) for the time being. At some later date, we can revisit scheduling something like a day's worth of downtime to get to RHEL.
Comment 5•14 years ago
So at this point, we need someone from the releng side to work with bkero on scheduling. To be on the safe side, we'll migrate things off the buildbot masters as we move those and the other hosts on the kvm servers one by one, so that we can rebuild each of the four physical kvm/ganeti machines. Who should he coordinate with to come up with a schedule and a game plan?
Comment 6•14 years ago
Is kvm3 still empty as Armen suggests in comment #3?
If we can get the unused machine ready, we can do this step-wise migration of masters.
I will sign up for this if I cannot find someone else in releng to take it on.
Comment 7•14 years ago
kvm3 is not empty, but it only has a few hosts that will be easy to move. For some reason, most of the new masters are stacked up on kvm4.
If we aren't changing node (host) operating systems, these can be live migrations (zero downtime). However, masters are busy enough that the hypervisor has a hard time finding a quiet moment to flip the switch. So we'll need to reduce load or pause various masters to get them to jump.
Once we're done shuffling instances around to bring the disks online, we'll probably want to do a rebalance, which has the same problem. This probably boils down to a day or so of releng moving load around and bkero working on nodes.
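The per-instance live migration and the follow-up rebalance described above can be sketched with Ganeti's standard tools. The instance name is taken from the list below; as elsewhere, the commands are printed rather than executed so they can be sanity-checked first.

```shell
#!/bin/sh
# Sketch of a single live migration plus the post-rebuild rebalance.
# Commands are echoed for review, not run against the cluster.
plan() { echo "WOULD RUN: $*"; }

# Live-migrate one busy master once its load has been paused/reduced
plan gnt-instance migrate buildbot-master11.build.scl1.mozilla.com

# After all nodes are rebuilt, have hbal propose a balanced layout
# (-L uses the local luxi socket; --print-commands only prints moves)
plan hbal -L --print-commands
```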
gnt-instance info, for reference:
Instance Hypervisor OS Primary_node Status Memory
admin1.infra.scl1.mozilla.com kvm image+manual kvm1.infra.scl1.mozilla.com running 1.0G
arr-client-test.build.scl1.mozilla.com kvm image+rhel-60 kvm3.infra.scl1.mozilla.com running 512M
arr-test.build.scl1.mozilla.com kvm image+rhel-60 kvm3.infra.scl1.mozilla.com running 512M
autoland-staging01.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 4.0G
autoland-staging02.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 4.0G
buildapi01.build.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 4.0G
buildbot-master4 kvm image+centos-55 kvm2.infra.scl1.mozilla.com running 6.0G
buildbot-master6.build.scl1.mozilla.com kvm image+centos-55 kvm2.infra.scl1.mozilla.com running 6.0G
buildbot-master11.build.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 6.0G
buildbot-master12.build.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 6.0G
buildbot-master13.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 6.0G
buildbot-master14.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 6.0G
buildbot-master15.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 6.0G
buildbot-master16.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 6.0G
buildbot-master17.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 6.0G
dc1.sandbox.scl1.mozilla.com kvm image+manual kvm3.infra.scl1.mozilla.com running 2.0G
dc01.winbuild.scl1.mozilla.com kvm image+manual kvm1.infra.scl1.mozilla.com running 2.0G
dev-master01.build.scl1.mozilla.com kvm image+centos-55 kvm2.infra.scl1.mozilla.com running 6.0G
ganglia1.build.scl1.mozilla.com kvm image+rhel-60 kvm2.infra.scl1.mozilla.com running 512M
linux-hgwriter-slave03.build.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 2.0G
linux-hgwriter-slave04.build.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 2.0G
master-puppet1.build.scl1.mozilla.com kvm image+centos-55 kvm2.infra.scl1.mozilla.com running 256M
ns1.infra.scl1.mozilla.com kvm image+manual kvm2.infra.scl1.mozilla.com running 512M
ns2.infra.scl1.mozilla.com kvm image+manual kvm1.infra.scl1.mozilla.com running 512M
redis01.build.scl1.mozilla.com kvm image+centos-55 kvm2.infra.scl1.mozilla.com running 8.0G
releng-mirror01.build.scl1.mozilla.com kvm image+centos-55 kvm3.infra.scl1.mozilla.com running 512M
scl-production-puppet-new.build.scl1.mozilla.com kvm image+centos-55 kvm4.infra.scl1.mozilla.com running 2.0G
slavealloc.build.scl1.mozilla.com kvm image+rhel-60 kvm2.infra.scl1.mozilla.com running 256M
talos-addon-master1.amotest.scl1.mozilla.com kvm image+centos-55 kvm1.infra.scl1.mozilla.com running 4.0G
testhost1 kvm image+rhel-60 kvm3.infra.scl1.mozilla.com running 512M
wds1.sandbox.scl1.mozilla.com kvm image+manual kvm3.infra.scl1.mozilla.com running 2.0G
wds01.winbuild.scl1.mozilla.com kvm image+manual kvm2.infra.scl1.mozilla.com running 2.0G
Updated•14 years ago
Whiteboard: [buildduty]
Comment 8•14 years ago
Ben, I can handle disabling masters in preparation for migration. Let me know which one you want to start with.
Comment 9•14 years ago
catlee,
I'm on call this week so scheduling is difficult. Today I'm rebuilding another host as its own ganeti cluster, so I don't have time to do this today. Would Thursday afternoon (2PM PST) be an appropriate time for this?
Comment 10•14 years ago
Ok, on Thursday we can definitely move off kvm3, as there are no build masters there. We can probably move off kvm1, since there are only two masters there. I can start getting them idle earlier in the day so that by 2pm they're ready to migrate.
Comment 11•14 years ago
Alice: can we live migrate talos-addon-master1.amotest.scl1.mozilla.com tomorrow so we can rebuild the kvm server it's sitting on? In theory, there will only be a slight disruption as the host migrates (no connections lost). Worst case, we'd have to shut it down briefly if it hangs during the migration.
Comment 12•14 years ago
These aren't currently being used and can be shut off / moved any time:
linux-hgwriter-slave03.build.scl1.mozilla.com (kvm1)
linux-hgwriter-slave04.build.scl1.mozilla.com (kvm1)
buildapi01.build.scl1.mozilla.com (kvm1)
redis01.build.scl1.mozilla.com (kvm2)
Comment 13•14 years ago
disabled bm11, bm12 in slavealloc; will let you know when they're idle enough to move
Comment 14•14 years ago
All VMs are evacuated from kvm3, and only talos-addon-master1.amotest.scl1 remains on kvm1 (it hit a bug that prevents it from being live-migrated). As part of the rebuild we're installing a newer minor version of KVM that should fix these issues.
kvm3.infra.scl1's array has been upgraded to RAID10, although it's awaiting action from netops to switch it to vlan6 so it can be re-netinstalled.
Assignee: bkero → network-operations
Comment 15•14 years ago
Netops has nothing to do with adding disks. We do have something to do with bug 678407 though.
Assignee: network-operations → server-ops-releng
Updated•14 years ago
Assignee: server-ops-releng → bkero
Comment 16•14 years ago
(In reply to Ben Kero [:bkero] from comment #14)
> all VMs are evacuated from kvm3, and only talos-addon-master1.amotest.scl1
> remains on kvm1, (it encountered a bug to where it cannot be live migrated).
> As part of the rebuild we're installing a newer minor version of KVM that
> should fix these issues.
>
> kvm3.infra.scl1's array has been upgraded to RAID10, although it's awaiting
> action from netops to switch it to vlan6 so it can be re-netinstalled.
(In reply to Ravi Pina [:ravi] from comment #15)
> Netops has nothing to do with adding disks. We do have something to do with
> bug 678407 though.
What's next step here?
Comment 17•14 years ago
Ben is working on installing kvm and getting these nodes back into the cluster this morning.
Comment 18•14 years ago
Ben is working on replicating disks to kvm1 and kvm3, at which point these nodes will be re-merged to the cluster.
Comment 19•14 years ago
kvm1 and kvm3 were rebuilt and added back into the cluster today, and the VMs were replicated back onto those two hosts. That work just finished.
The next step is to migrate VMs from kvm2 and kvm4 onto the newly built machines, then rebuild kvm2 and kvm4.
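The rebuild-and-rejoin step for each node follows the same pattern. A hedged sketch of the flow for one node (kvm2 used as the example; commands printed, not executed):

```shell
#!/bin/sh
# Sketch of re-adding a freshly reinstalled node to the cluster.
# Commands are echoed for review, not run against the cluster.
plan() { echo "WOULD RUN: $*"; }

# Re-add the reinstalled node; --readd keeps its existing cluster
# entry and kicks off DRBD secondary-disk resync for its instances
plan gnt-node add --readd kvm2.infra.scl1.mozilla.com

# Confirm all DRBD pairs are healthy before migrating VMs back
plan gnt-cluster verify-disks
```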
Comment 20•14 years ago
per meeting with IT yesterday:
* bkero at conference so work deferred to next week
Comment 21•14 years ago
I'm ready to do the rest of the migrations this week. When is good for build/releng for me to do the work? Will thursday afternoon (2PM PST) be enough lead time to schedule everything? I propose that day/time.
Comment 22•14 years ago
Based on the issues we had last time, I don't think we're going to be able to do both hosts in a single day. I propose that we do kvm4 first since it only has four hosts left on it at this point:
buildbot-master13.build.scl1.mozilla.com 6.0G
buildbot-master14.build.scl1.mozilla.com 6.0G
buildbot-master15.build.scl1.mozilla.com 6.0G
buildbot-master16.build.scl1.mozilla.com 6.0G
buildbot-master17.build.scl1.mozilla.com 6.0G
I would also like to move ns2 off to another kvm host at the same time, since right now both ns hosts are on kvm2.
Then, in the second pass, we do kvm2, which currently has the following on it (but will have more once those buildbot hosts migrate):
admin1.infra.scl1.mozilla.com 1.0G
buildbot-master4 6.0G
buildbot-master6.build.scl1.mozilla.com 6.0G
buildbot-master11.build.scl1.mozilla.com 6.0G
buildbot-master12.build.scl1.mozilla.com 6.0G
dev-master01.build.scl1.mozilla.com 6.0G
ns1.infra.scl1.mozilla.com 512M
ns2.infra.scl1.mozilla.com 512M
slavealloc.build.scl1.mozilla.com 256M
At the end, I would like to balance out the hosts so that we have a good set of redundancy + spread load across all four servers.
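Splitting the two ns hosts onto different physical nodes can be done without changing OSes: point one instance's DRBD secondary at a different node, then fail it over. A hedged sketch (the target node is illustrative):

```shell
#!/bin/sh
# Sketch of moving ns2 away from ns1's node for redundancy.
# Commands are echoed for review, not run against the cluster.
plan() { echo "WOULD RUN: $*"; }

# Give ns2 a secondary on a node other than ns1's primary
plan gnt-instance replace-disks -n kvm3.infra.scl1.mozilla.com \
    ns2.infra.scl1.mozilla.com

# Fail ns2 over so its primary swaps to the new node
plan gnt-instance failover -f ns2.infra.scl1.mozilla.com
```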
Comment 23•14 years ago
Okay, kvm4 has five hosts. I can count.
Comment 24•14 years ago
I chatted with arr in IRC and this is what we think makes more sense:
1) move as many masters away from kvm4 Wednesday EDT morning (before high loads)
* arr and I
2) bkero does his work at his own time on kvm4
* arr and I will give bkero the all-clear
For #1, there are 2 concerns:
* when we move the masters (even after being shut off) they might need a reboot
* bm15 and bm16 are the only tests-scl1-windows masters that can accept rev3 Windows testers
** this means that we will have to do them sequentially and when the load is low.
buildbot-master17 has already been moved since there was nothing running on it.
Comment 25•14 years ago
As discussed on irc, armen and I will work to clear off all of the buildbot-master vms on kvm4 tomorrow morning (EDT), and bkero will work on upgrading kvm4 after he gets back into the office in the afternoon/evening (PDT). Armen will start migrating processes off of the buildbot masters on kvm4 at 11:00 and let me know when each is done. Then I will migrate the vm itself to a new kvm host.
Comment 26•14 years ago
I got mid-aired:
> I have marked masters 13, 14 & 15 to be disabled from slavealloc and marked them
> for "clean shutdown".
>
> I will do #16 once we have migrated #15.
arr and I have already moved #14 and #15 is ready to be moved once a ganeti job times out.
Comment 27•14 years ago
All vms have been migrated off of kvm4, and it's ready for you to begin work whenever you're ready, bkero. I have not put the host into downtime, so you may want to do that before you begin.
Comment 28•14 years ago
The upgrade of kvm4 is complete. It's now back in the cluster and vms that use it as a secondary are now replicating their disks there.
As of right now the following disks are left to replicate:
autoland-staging02.build.scl1.mozilla.com
buildapi01.build.scl1.mozilla.com
buildbot-master13.build.scl1.mozilla.com
buildbot-master14.build.scl1.mozilla.com
buildbot-master15.build.scl1.mozilla.com
buildbot-master16.build.scl1.mozilla.com
dc02.winbuild.scl1.mozilla.com
releng-mirror01.build.scl1.mozilla.com
scl-production-puppet-new.build.scl1.mozilla.com
talos-addon-master1.amotest.scl1.mozilla.com
testhost1
The next step is to migrate some hosts back onto it and migrate hosts off of kvm2 to finish the upgrade.
Buildduty: I recommend scheduling the move of the following vms sometime tomorrow or Monday (giving the cluster a chance to finish replicating the disks today):
admin1.infra.scl1.mozilla.com
buildbot-master4
buildbot-master6.build.scl1.mozilla.com
buildbot-master11.build.scl1.mozilla.com
buildbot-master12.build.scl1.mozilla.com
dev-master01.build.scl1.mozilla.com
ns1.infra.scl1.mozilla.com
slavealloc.build.scl1.mozilla.com
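Before scheduling those moves, it's worth confirming the replication has actually finished; a hedged sketch of the checks (commands printed, not executed):

```shell
#!/bin/sh
# Sketch of verifying DRBD resync before migrating VMs back.
# Commands are echoed for review, not run against the cluster.
plan() { echo "WOULD RUN: $*"; }

# Cluster-wide check; degraded DRBD pairs are flagged here
plan gnt-cluster verify-disks

# Per-instance detail; 'gnt-instance info' reports each disk's
# sync status, so a still-replicating disk is visible here
plan gnt-instance info buildbot-master13.build.scl1.mozilla.com
```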
Comment 29•14 years ago
All vms are migrated off of kvm2.
Comment 30•14 years ago
bkero any updates?
On another note, how can we measure the improvements to before/after this bug?
Should this be by checking cpu_wio overtime? Is there a way to check the performance of the actual kvm hosts rather than the VMs that they host?
Not sure if all the questions can be answered but it would be awesome to know how we can measure the improvement.
I would like to measure after kvm2 is back together and we spread the load across.
Comment 31•14 years ago
kvm2 was rebuilt last night/this morning, and I validated/tested the configuration this morning and am in the process of resyncing the secondary disks back over. Once that is complete, we'll be ready to move the following hosts back onto kvm2:
buildbot-master6.build.scl1.mozilla.com
buildbot-master16.build.scl1.mozilla.com
autoland-staging01.build.scl1.mozilla.com
dc01.winbuild.scl1.mozilla.com
linux-hgwriter-slave03.build.scl1.mozilla.com
ns1.infra.scl1.mozilla.com
wds01.winbuild.scl1.mozilla.com
If buildduty could coordinate with me to do a clean shutdown of buildbot on buildbot-master6 and buildbot-master16 today, that would be great.
Updated•14 years ago
Assignee: bkero → arich
Comment 32•14 years ago
All nodes have been upgraded and the cluster is back in fully operational mode at this point. Thanks to bkero for his help.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations