Closed Bug 867593 Opened 11 years ago Closed 10 years ago

Provision enough in-house master capacity

Categories

(Release Engineering :: General, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: Callek)

References

Details

(Keywords: sheriffing-P1)

As much as possible, we should be able to build, test and ship our supported products relying only on in-house infrastructure.

We migrated most of our buildbot masters off of KVM to AWS in bug 864364. However, we still need some in-house capacity.

Here's a list to get us started:
- 1 scheduler master (to replace bm36)
- 1 or 2 build masters (for builds and releases)
- 0 or 1 try masters
- 5 test masters (linux, windows, mac, tegra, panda)

Anything else?
A heads up to the virtualization team that a request for 7-9 servers is coming.  Buildbot masters on KVM have the following specs:

OS: CentOS (presumably we'll be moving to 6.2 to match the AWS masters, releng?)
RAM: 6G
CPU: 2
Disk: 30G (but high I/O load, so SATA might be problematic)
Network: will need to be plumbed to the srv.releng.scl3.mozilla.com VLAN

Releng: is there any specific way the disk needs to be partitioned, or do any of these specs need adjusting? I'm not sure what these look like on AWS now.
Blocks: 867602
(In reply to Amy Rich [:arich] [:arr] from comment #1)
> OS: CentOS (presumably we'll be moving to 6.2 to match the AWS masters,
> releng?)

Correct. We use the same OS for our mock builders.
Blocks: 864364
No longer depends on: 864364
How about this:
buildbot-master81 - scheduler (build and try)
buildbot-master82 - build
buildbot-master83 - build
buildbot-master83 - try
buildbot-master84 - linux tests
buildbot-master85 - mac tests
buildbot-master86 - windows tests
buildbot-master87 - tegra
buildbot-master88 - panda
Flags: needinfo?(catlee)
WFM
Flags: needinfo?(catlee)
Amy, is there anything else you need to know before you proceed?
Assignee: nobody → server-ops-releng
Component: Release Engineering: Automation (General) → Server Operations: RelEng
QA Contact: catlee → arich
I've created DNS and stub inventory entries for these.  VMWare folks, please create the above 9 vms with the following specs:

OS: CentOS 6
RAM: 6G
CPU: 2
Disk: 30G (but high I/O load when in use, so SATA might be problematic)
Network: SCL3 VLAN 248 (.srv.releng.scl3.mozilla.com)
Assignee: server-ops-releng → server-ops-virtualization
Component: Server Operations: RelEng → Server Operations: Virtualization
QA Contact: arich → dparsons
Note, we'll wind up re-kickstarting these, so no need to puppetize.  Just make sure we know what the root pw is  :}

If that means you need to reinstall vmware tools after the fact, please let us know.
Small nitpick: comment 4 has a doubleup on 83, so is a buildbot-master89 needed?
Oh, good catch... rail?

I only made entries for up to 88.
Flags: needinfo?(rail)
Oh, right. 89 is needed.
Flags: needinfo?(rail)
Greg took care of this.  I need to work on bug 870456, then I'll get these KS'd.
Assignee: server-ops-virtualization → dustin
Component: Server Operations: Virtualization → Server Operations: RelEng
QA Contact: dparsons → arich
Made buildbot-master8[1-9].srv.releng.scl3.mozilla.com from the template (just to have something).  Fixed to 6G RAM, 2 vCPU, 40G disk, vlan248.  Added MAC addresses for 81-88 to inventory stubs.

Added 89 to inventory and DNS (change 65363).
greg, I don't seem to have permission to edit 89 to set it to boot to BIOS (so I can netboot).  Can you take a look?
Flags: needinfo?(gcox)
Hrm.  Not sure what that's all about.  As admin-me, I did pretty much a noop (changed the guest's OS label from CentOS to CentOS).  Then relops-me was able to set the boot-to-bios flag.

Funky.  Try again?
Flags: needinfo?(gcox)
OK, these are puppetized and ready to rock.  They're running 3.2.0.  They're in moco-nodes.pp as

# temporary node defs for these hosts
node /buildbot-master8[1-9].srv.releng.scl3.mozilla.com/ {
    include toplevel::server
}

so that will need to be fleshed out with master instances.

Let me know if there are any problems with 3.2.0.
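
To make the "fleshed out" step concrete, the idea is that each host's node def will eventually pull in the puppet classes for the master type it runs. A minimal sketch only; the buildmaster::* class names below are hypothetical placeholders, not the real classes in the releng puppet repo:

# hypothetical sketch -- real class names live in the releng puppet repo
node 'buildbot-master81.srv.releng.scl3.mozilla.com' {
    include toplevel::server
    include buildmaster::scheduler    # hypothetical: scheduler master (build + try)
}

node 'buildbot-master84.srv.releng.scl3.mozilla.com' {
    include toplevel::server
    include buildmaster::tests        # hypothetical: linux test master
}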
Assignee: dustin → nobody
Component: Server Operations: RelEng → Release Engineering: Machine Management
QA Contact: arich → armenzg
I added `hostname` to /etc/hosts manually on these hosts (see bug 891333)
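(For reference, a minimal puppet sketch of the same workaround using the built-in host resource, in case it's ever worth managing instead of hand-editing; the address below is illustrative, not the real one:)

host { 'buildbot-master81.srv.releng.scl3.mozilla.com':
    ensure       => present,
    ip           => '10.26.48.81',    # illustrative address only
    host_aliases => ['buildbot-master81'],
}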
Is there an eta on releng moving over to these new servers so we can decommission the older buildbot masters still on kvm?

We're still waiting to delete:

buildbot-master10.build.mtv1.mozilla.com
buildbot-master12.build.scl1.mozilla.com
buildbot-master19.build.mtv1.mozilla.com
buildbot-master20.build.mtv1.mozilla.com
buildbot-master22.build.mtv1.mozilla.com
buildbot-master24.build.scl1.mozilla.com
buildbot-master28.build.mtv1.mozilla.com
buildbot-master29.build.scl1.mozilla.com
buildbot-master35.srv.releng.scl3.mozilla.com
buildbot-master36.srv.releng.scl3.mozilla.com
buildbot-master37.srv.releng.scl3.mozilla.com
buildbot-master38.srv.releng.scl3.mozilla.com
buildbot-master42.build.scl1.mozilla.com
buildbot-master43.build.scl1.mozilla.com
buildbot-master44.build.scl1.mozilla.com
buildbot-master45.build.scl1.mozilla.com
buildbot-master47.build.scl1.mozilla.com
buildbot-master48.build.scl1.mozilla.com
buildbot-master49.srv.releng.scl3.mozilla.com
(In reply to Amy Rich [:arich] [:arr] from comment #18)
> Is there an eta on releng moving over to these new servers so we can
> decommission the older buildbot masters still on kvm?

Without looking into which are which, we should be off the tegra and panda masters within the month, if all goes well. Slavealloc for tegras and pandas should make that VERY easy for us!

If slavealloc doesn't go well, we can look at doing it manually, but I expect a few days straight just for moving devices from one master to another, all manual work. That's not counting setup/etc. of the new masters (if needed).
Product: mozilla.org → Release Engineering
Found in triage, moving to "General Automation" to be with the rest of the KVM-related bugs.


(In reply to Dustin J. Mitchell [:dustin] from comment #16)
> OK, these are puppetized and ready to rock.  They're running 3.2.0.  They're
> in moco-nodes.pp as
> 
> # temporary node defs for these hosts
> node /buildbot-master8[1-9].srv.releng.scl3.mozilla.com/ {
>     include toplevel::server
> }
> 
> so that will need to be fleshed out with master instances.
> 
> Let me know if there are any problems with 3.2.0.

dustin: Why the change to 3.2.0, instead of whatever is currently used on the other production masters? (Keeping masters identical seems righteous, so what am I missing? If a change is needed, what testing have you done to verify it?)

(In reply to Amy Rich [:arich] [:arr] from comment #18)
> Is there an eta on releng moving over to these new servers so we can
> decommission the older buildbot masters still on kvm?
> 
> We're still waiting to delete:
> 
> buildbot-master10.build.mtv1.mozilla.com
> buildbot-master12.build.scl1.mozilla.com
> buildbot-master19.build.mtv1.mozilla.com
> buildbot-master20.build.mtv1.mozilla.com
> buildbot-master22.build.mtv1.mozilla.com
> buildbot-master24.build.scl1.mozilla.com
> buildbot-master28.build.mtv1.mozilla.com
> buildbot-master29.build.scl1.mozilla.com
> buildbot-master35.srv.releng.scl3.mozilla.com
> buildbot-master36.srv.releng.scl3.mozilla.com
> buildbot-master37.srv.releng.scl3.mozilla.com
> buildbot-master38.srv.releng.scl3.mozilla.com
> buildbot-master42.build.scl1.mozilla.com
> buildbot-master43.build.scl1.mozilla.com
> buildbot-master44.build.scl1.mozilla.com
> buildbot-master45.build.scl1.mozilla.com
> buildbot-master47.build.scl1.mozilla.com
> buildbot-master48.build.scl1.mozilla.com
> buildbot-master49.srv.releng.scl3.mozilla.com

arr: are you still waiting for formal "ok to kill these KVM masters"?
Component: Buildduty → General Automation
Flags: needinfo?(dustin)
Flags: needinfo?(arich)
QA Contact: armenzg → catlee
joduinn: if you look at dustin's update, that was back in May (when that was the version we were putting on new machines). These are running the same version of puppet as everything else now. And, yep, we're still waiting for the okay to remove the old buildbot masters on KVM.
Flags: needinfo?(dustin)
Flags: needinfo?(arich)
(In reply to Amy Rich [:arich] [:arr] from comment #18)

> buildbot-master10.build.mtv1.mozilla.com
> buildbot-master19.build.mtv1.mozilla.com
> buildbot-master20.build.mtv1.mozilla.com
> buildbot-master22.build.mtv1.mozilla.com

We have bm[95-99] waiting in AWS to pick up this slack, but I don't imagine we can pull that trigger until *after* the watcher upgrade is performed this weekend (bug 914302). Don't want too many variables changing at once.

> buildbot-master29.build.scl1.mozilla.com
> buildbot-master42.build.scl1.mozilla.com
> buildbot-master43.build.scl1.mozilla.com
> buildbot-master44.build.scl1.mozilla.com
> buildbot-master45.build.scl1.mozilla.com

These are panda masters. We have bm[90-94] waiting in AWS to replace them, but still need to pull the switch. Not blocked on bug 914302 per se, but will benefit when the people working on that bug are freed up afterwards.

> buildbot-master12.build.scl1.mozilla.com
> buildbot-master24.build.scl1.mozilla.com
> buildbot-master28.build.mtv1.mozilla.com
> buildbot-master35.srv.releng.scl3.mozilla.com
> buildbot-master36.srv.releng.scl3.mozilla.com
> buildbot-master37.srv.releng.scl3.mozilla.com
> buildbot-master38.srv.releng.scl3.mozilla.com
> buildbot-master47.build.scl1.mozilla.com
> buildbot-master48.build.scl1.mozilla.com
> buildbot-master49.srv.releng.scl3.mozilla.com

These masters have all been disabled for months. They can be safely deleted if they haven't been already.
I've removed the following hosts from nagios and shut them down:

buildbot-master12.build.scl1.mozilla.com
buildbot-master24.build.scl1.mozilla.com
buildbot-master28.build.mtv1.mozilla.com
buildbot-master35.srv.releng.scl3.mozilla.com
buildbot-master36.srv.releng.scl3.mozilla.com
buildbot-master37.srv.releng.scl3.mozilla.com
buildbot-master38.srv.releng.scl3.mozilla.com
buildbot-master47.build.scl1.mozilla.com
buildbot-master48.build.scl1.mozilla.com
buildbot-master49.srv.releng.scl3.mozilla.com

If I do not hear of any issues, I will delete them from kvm and inventory tomorrow morning.
buildbot-master36.srv.releng.scl3.mozilla.com is still needed, AFAIK. We run self-serve agent on it.
Depends on: 894133
Brought 36 back up
Seems buildbot-master36.srv.releng.scl3.mozilla.com had the following hostgroup assignments:
"selfserve"
"release-runner"
"bm36-bouncer"

which, when the host was removed, broke nagios because the service names no longer had a host they belonged to.


For reference, the checks tied to them did the following:
selfserve
check_command => 'check_nrpe_selfserve',

release-runner-procs
check_command => 'check_nrpe_procs_regex!release-runner.py!1!1',

bm36-bouncer
check_command => 'check_bouncer',


Keep those in mind if buildbot-master36.srv.releng.scl3.mozilla.com gets added back to monitoring.
I did an svn diff -rPREV and didn't see any mention of bm36-bouncer in the node definition. I put back what was there before I made the change (plus the addition of the core parent, which I see had been added after things had been split out into multiple files):

        'buildbot-master36.srv.releng.scl3.mozilla.com' => {
            contact_groups => 'build',
            parents => 'core1.scl1.mozilla.net',            
            hostgroups => [
                'selfserve',
                'release-runner'
            ]
        },
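
If the bouncer check ever needs to come back on this host, it would presumably just mean adding that hostgroup to the entry above. A sketch only; the exact hostgroup list depends on what's still hosted on bm36:

        'buildbot-master36.srv.releng.scl3.mozilla.com' => {
            contact_groups => 'build',
            parents => 'core1.scl1.mozilla.net',
            hostgroups => [
                'selfserve',
                'release-runner',
                'bm36-bouncer'   # maps to the check_bouncer command listed above
            ]
        },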
BTW, release-runner has been moved to bm81 in bug 836289.
(In reply to Chris Cooper [:coop] from comment #22)

Checking back in on this since some things were supposed to have landed before the summit.
Did bm[95-99] get stood up so that we are ready to decommission these masters:

buildbot-master10.build.mtv1.mozilla.com
buildbot-master19.build.mtv1.mozilla.com
buildbot-master20.build.mtv1.mozilla.com
buildbot-master22.build.mtv1.mozilla.com

And for scl1, did we spin up bm[90-94] so that we can decommission:

buildbot-master29.build.scl1.mozilla.com
buildbot-master42.build.scl1.mozilla.com
buildbot-master43.build.scl1.mozilla.com
buildbot-master44.build.scl1.mozilla.com
buildbot-master45.build.scl1.mozilla.com


And the last buildbot master... Has a replacement (or set of replacements) been stood up in AWS to replace dev-master01.build.scl1.mozilla.com?
Flags: needinfo?(coop)
Oh, and I forgot the one in scl3: buildbot-master36.srv.releng.scl3.mozilla.com since self-serve was still on there.
Allow me to somewhat undo my statements in comment #22.

Cross-colo links are hard, especially those that go from our colos to the AWS US-East region. Bug 710942, bug 844648, and bug 898317 contain evidence as to the frequency of these network events knocking builds and tests offline.

It would be great if we could rely on the network to just work, but until we can, we'll need in-house masters to control our in-house slaves.

I'm reluctant to have *more* master<->slave traffic crossing the US-East boundary until we have a better handle on the network issues. We won't be shunting the mobile test traffic to AWS masters unless we have no other choice.
Flags: needinfo?(coop)
Depends on: 927129
The options are to move them to AWS or to move them to VMware in SCL3, but they can't stay on kvm in scl1 or mtv1 since those are going away. If you want to go the scl3 route, can you please submit a bug specifying how many you want and with what specs so we can get those base vms built for you (assuming you don't want to take over the ones that are already built and unused)?
(In reply to Amy Rich [:arich] [:arr] from comment #32)
> The options are to move them to AWS or to move them to VMware in SCL3, but
> they can't stay on kvm in scl1 or mtv1 since those are going away. If you
> want to go the scl3 route, can you please submit a bug specifying how many
> you want and with what specs so we can get those base vms built for you
> (assuming you don't want to take over the ones that are already built and
> unused)?

We'll absolutely use the ones already set up. Callek is signed up for bug 927129 (windows builders), with an eye towards how much extra setup is still required. We'll move on to test masters for Windows and mobile after that.

I'm hoping the existing 10 masters waiting in scl3 will get us most of the way there, but we may need a few more for mobile. Mac and Linux don't see nearly as many connection issues as Windows and mobile.

Out of curiosity, how many master VMs _could_ we run in scl3? Since the majority of our test hardware is still in-house, that may be pertinent for standing up, e.g., 10.9 minis.
Keywords: sheriffing-P1
We're in the process of rebuilding the vmware infrastructure this quarter (all new hosts), so we should have increased capacity by Q1.  Part of the problem is that we don't know what resources the buildbot-masters will take since none of them have been set up yet (to my knowledge).  Once that's done, we can get some kind of baseline on usage and be better able to extrapolate how many more the existing infrastructure can hold.  Greg or Dan might be able to give us better numbers then.  I've flagged them for needinfo to give them a heads up and see if they have any ballpark info now.

As far as order/criticality of moving things, the kvm servers in mtv1 are slated to be decommissioned first, so if you think there's a chance that not everything can be done this quarter, you may want to focus your energy there earlier.
Flags: needinfo?(gcox)
Flags: needinfo?(dparsons)
(In reply to Amy Rich [:arich] [:arr] from comment #30)
> Oh, and I forgot the one in scl3:
> buildbot-master36.srv.releng.scl3.mozilla.com since self-serve was still on
> there.

I'm working on this in bug 870414 and hope to have a replacement ready this or next week.
I agree with :arr; without a usage baseline, I can't really advise on how many we can support. Reading over this bug, it's not clear to me how many new or repurposed VMs are going to become high-I/O VMs. Can anyone give me a number?
Flags: needinfo?(dparsons)
Depends on: 931805
Flags: needinfo?(gcox)
Depends on: 942206
Do we have metrics from the KVM hosts we could use as a baseline? Alternatively can we set up a single VM and see how it behaves before provisioning more?
There are four set up now - 99..102 - if you want to experiment (bug 940659)
Assignee: nobody → bugspam.Callek
Priority: -- → P1
The only buildbot master left on kvm is dev-master01.
No longer blocks: 918677
[09:53am] catlee: why do we have windows test masters in AWS?
[09:54am] coop: catlee: huh, I thought Callek fixed that
[09:54am] coop: or are they idle?
[09:55am] catlee: this one looks busy
[09:55am] catlee: they're in slavealloc
[09:55am] catlee: as active
[09:56am] catlee: tests-scl1-*
[09:56am] catlee: are all in aws
[09:57am] coop: Callek^^
[09:57am] bhearsum: https://bugzilla.mozilla.org/show_bug.cgi?id=867593
[09:58am] bhearsum: i think only the panda/tegra stuff masters were moved in house
[09:59am] catlee: we have scl3 masters though
[10:01am] rail: we should fix that
[10:01am] rail: iirc that was a temporary solution because our old masters sucked
[10:01am] rail: and aws masters were much faster
[10:02am] rail: catlee: you had some timings for simple steps like set property, they were taking 20s or something on the old masters. IIRC
[10:07am] coop: rail, catlee: is that something we still need to worry about? i don't want to push Callek to get us switched if we're worse off
[10:07am] coop: or is in-house still better than saturating the network pipe?
[10:08am] rail: if we already have windows masters in scl3 we can just disable the ones in amazon and see what happens
[10:08am] rail: if it sticks we can kill the amazon masters
[10:08am] bhearsum: even just moving some machines to scl3 masters would help
[10:12am] rail: oh, we don't have windows masters in scl3
[10:12am] rail: (per production-masters.json)
[10:12am] rail: 6 in amazon

Callek: all the dependencies here are resolved, but it sounds like the work isn't done. Can we get bugs on file for the remaining cross-colo cases, or a comment as to why we're not, please?
Depends on: 971780
So, to summarize: we didn't move the physical-machine testers over to in-house masters. The delay here was because things looked good in terms of random disconnects for jobs and related infra on these, so the work got deprioritized.

Due to the confusion this has caused, I was asked to pick this back up and drive it through to completion; it shouldn't be too hard. New bug(s) filed today, expected to be finalized within a week of IT handing the new VMs over.
Depends on: 976281
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General