Bug 676560 (Closed) · Opened 13 years ago · Closed 13 years ago

increase capacity of mtv1 kvm cluster

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: arich, Assigned: arich)

Details

In order to move hosts off of the corp vmware cluster in mtv1, we need to increase the capacity of the mtv1 kvm cluster. Right now we have two nodes with 4 cores and 12G of RAM each; since one node's worth of resources has to stay free to absorb a failover, the total usable capacity is 4 cores and 12G. The following VMs already run on this cluster:

host                                             OS image         node  RAM
ganglia3.build.mtv1.mozilla.com                  image+rhel-60    kvm2  512M
mv-production-puppet-new.build.mtv1.mozilla.com  image+centos-55  kvm1  2.0G
ns1.build.mtv1.mozilla.com                       image+centos-55  kvm2  768M
ns2.build.mtv1.mozilla.com                       image+centos-55  kvm2  768M
tools-staging-master02.mv.mozilla.com            image+centos-55  kvm2  4.0G

ns1 and ns2 are designed to replace mv-buildproxy01, and mv-production-puppet-new is designed to replace mv-production-puppet.
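
For reference, a quick Python sketch of the arithmetic behind that load (the instance sizes come from the list above; "usable" follows the one-node failover reading):

    # Back-of-the-envelope check of the current load, using the instance
    # sizes listed above (MiB units so the 512M/768M entries stay exact).
    GIB = 1024

    instances = {
        "ganglia3.build.mtv1.mozilla.com": 512,
        "mv-production-puppet-new.build.mtv1.mozilla.com": 2 * GIB,
        "ns1.build.mtv1.mozilla.com": 768,
        "ns2.build.mtv1.mozilla.com": 768,
        "tools-staging-master02.mv.mozilla.com": 4 * GIB,
    }

    allocated = sum(instances.values())  # 8192 MiB = 8.0G
    usable = 12 * GIB                    # one node's worth
    print(f"{allocated / GIB:.1f}G of {usable // GIB}G usable, "
          f"{(usable - allocated) // GIB}G free")
    # -> 8.0G of 12G usable, 4G free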


Assuming that the geriatric master is going to stay on the vmware cluster, we definitely need to move the following host, which is not already accounted for above:

test-master01 (which will become buildbot-master01): 2 cores/6G

We may also need to move the following hosts that support the n900s:

staging-mobile-master: 2 cores/6G
production-mobile-master: 2 cores/6G
mobile-image03

And there is one vmware vm that's likely mislabeled and might also need to move:

test-master02 (is really a production buildbot master?)

At minimum (to move test-master01 and still have any CPU allocation left), we will need another node with 4 cores and 12G of RAM, matching the existing servers.

If we move any of the other hosts, we will need the additional node plus a RAM upgrade, probably to at least 24G per node.
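
A quick sanity check of those sizing numbers, summing the candidate moves listed above (test-master02 and mobile-image03 have no specs in this bug, so they are omitted):

    # Candidate moves and their specs, as listed in this bug.
    candidates = {
        "test-master01": (2, 6),             # (cores, GiB RAM) -- definite
        "staging-mobile-master": (2, 6),     # possible, n900 support
        "production-mobile-master": (2, 6),  # possible, n900 support
    }

    cores = sum(c for c, _ in candidates.values())
    ram = sum(r for _, r in candidates.values())
    print(f"worst case: +{cores} cores / +{ram}G")
    # -> worst case: +6 cores / +18G

    # One extra 4-core/12G node covers test-master01 alone; adding the two
    # mobile masters (+18G on top of the ~8G already allocated) overruns
    # even three 12G nodes' usable capacity, hence the >=24G/node upgrade.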
(In reply to comment #0)

> 
> And there is one vmware vm that's likely mislabeled and might also need to
> move:
> 
> test-master02 (is really a production buildbot master?)
> 

test-master02 is not an active buildmaster
(In reply to comment #1)
> test-master02 is not an active buildmaster

For the record, it is - see bug 675793.
(In reply to comment #1)
> (In reply to comment #0)
> 
> > 
> > And there is one vmware vm that's likely mislabeled and might also need to
> > move:
> > 
> > test-master02 (is really a production buildbot master?)
> > 
> 
> test-master02 is not an active buildmaster

ok, so i'm being told that, for some silly reason, the name of the vm does not map to the name of the build master

so vm test-master02 == buildbot-master3

OMGWTF
We should purchase:

1x IX Systems IX1204R with 24G of RAM (4G DIMMs)
6x 4G DIMMs to upgrade the other two KVM servers to 24G

Optionally, if we decide that RAM is cheap and we'll want to reuse these machines later, purchase the new machine with 48G and 18 more 4G DIMMs to bring the other two servers to 48G as well, fully populating every DIMM slot.
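
The DIMM counts check out; a minimal sketch of the arithmetic, assuming the existing DIMMs stay in place:

    # DIMM math for the two options above: two existing servers at 12G each,
    # upgraded with 4G DIMMs (assumes the existing DIMMs stay in place).
    def dimms_needed(current_g, target_g, servers=2, dimm_g=4):
        return (target_g - current_g) // dimm_g * servers

    print(dimms_needed(12, 24))  # -> 6  (the 24G option)
    print(dimms_needed(12, 48))  # -> 18 (the 48G option)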
This cluster will be one of the stragglers in mtv1, and may even be stuck there forever with the mobile devices.  However, if the nodes have more RAM, that means we can probably pull all but 2 nodes out and move them to scl3 when the time comes, rather than being stuck with more, lower-RAM systems propping up mobile.

IMHO, we should make the nodes in this cluster interchangeable with the nodes in the scl1 cluster, for optimal fungibility when moving to scl3.
Should we upgrade disk while we're at it?
We need the CPU power of the nodes as well as the memory capacity.  If we want to make these interchangeable with the scl1 nodes, though, we should upgrade the RAM to 48G and also add disks, yes.
per meeting with IT yesterday:

* hardware not ordered yet. The mtv cluster needs to be increased so IT can migrate from ESX to KVM in 650castro.
To be clear, this is so releng VMs can be migrated off of the (non-releng) corp vmware servers onto the kvm servers in mtv1.  IT needs to reclaim space on the corp vmware servers.
We should be able to retire production-mobile-master and staging-mobile-master (and all n900s) after 7.0 ships in 6 weeks.
By my count, we need to migrate test-master01 and test-master02 (which is actually buildbot-master3).  The rest of the vms on the IT vmware cluster can be retired this year and are not worth moving (this includes finishing ns1 and ns2 so they can replace mv-buildproxy01).
The replacement hardware hasn't been ordered yet.  We're also waiting on 7.0 to ship as per comment 10.
Assignee: zandr → arich
Hardware on order and tracked in the ordering spreadsheet.
Hardware has arrived and is sitting next to LOL.  Memory upgrades done to the two existing machines.  Need to rack, cable, and install the new machine.
New machine is installed and in the cluster, and cluster verify passes.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations