Closed
Bug 867593
Opened 11 years ago
Closed 10 years ago
Provision enough in-house master capacity
Categories
(Release Engineering :: General, defect, P1)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: Callek)
References
Details
(Keywords: sheriffing-P1)
As much as possible, we should be able to build, test and ship our supported products relying only on in-house infrastructure. We migrated most of our buildbot masters off of KVM to AWS in bug 864364. However, we still need some in-house capacity. Here's a list to get us started:

- 1 scheduler master (to replace bm36)
- 1 or 2 build masters (for builds and releases)
- 0 or 1 try masters
- 5 test masters (linux, windows, mac, tegra, panda)

Anything else?
Comment 1•11 years ago
A heads up to the virtualization team that a request for 7-9 servers is coming. Buildbot masters on KVM have the following specs:

OS: CentOS (presumably we'll be moving to 6.2 to match the AWS masters, releng?)
RAM: 6G
CPU: 2
Disk: 30G (but high I/O load, so SATA might be problematic)
Network: will need to be plumbed to the srv.releng.scl3.mozilla.com VLAN

Releng: is there any specific way the disk needs to be partitioned, or any of these specs that need adjusting? I'm not sure what these look like on AWS now.
Comment 3•11 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #1)
> OS: CentOS (presumably we'll be moving to 6.2 to match the AWS masters,
> releng?)

Correct. We use the same OS for our mock builders.
Updated•11 years ago
Comment 4•11 years ago
How about this:

buildbot-master81 - scheduler (build and try)
buildbot-master82 - build
buildbot-master83 - build
buildbot-master83 - try
buildbot-master84 - linux tests
buildbot-master85 - mac tests
buildbot-master86 - windows tests
buildbot-master87 - tegra
buildbot-master88 - panda
Flags: needinfo?(catlee)
Comment 6•11 years ago
Amy, is there anything else you need to know before you proceed?
Assignee: nobody → server-ops-releng
Component: Release Engineering: Automation (General) → Server Operations: RelEng
QA Contact: catlee → arich
Comment 7•11 years ago
I've created DNS and stub inventory entries for these. VMWare folks, please create the above 9 vms with the following specs:

OS: CentOS 6
RAM: 6G
CPU: 2
Disk: 30G (but high I/O load when in use, so SATA might be problematic)
Network: SCL3 VLAN 248 (.srv.releng.scl3.mozilla.com)
Assignee: server-ops-releng → server-ops-virtualization
Component: Server Operations: RelEng → Server Operations: Virtualization
QA Contact: arich → dparsons
Comment 8•11 years ago
Note, we'll wind up re-kickstarting these, so no need to puppetize. Just make sure we know what the root pw is :} If that means you need to reinstall vmware tools after the fact, please let us know.
Comment 9•11 years ago
Small nitpick: comment 4 has a doubleup on 83, so is a buildbot-master89 needed?
Comment 10•11 years ago
Oh, good catch... rail? I only made entries for up to 88.
Flags: needinfo?(rail)
Comment 12•11 years ago
Greg took care of this. I need to work on bug 870456, then I'll get these KS'd.
Assignee: server-ops-virtualization → dustin
Component: Server Operations: Virtualization → Server Operations: RelEng
QA Contact: dparsons → arich
Comment 13•11 years ago
Made buildbot-master8[1-9].srv.releng.scl3.mozilla.com from the template (just to have something). Fixed to 6G RAM, 2 vCPU, 40G disk, VLAN 248. Added MACs for 81-88 to inventory stubs. Added 89 to inventory and DNS (change 65363).
Comment 14•11 years ago
Greg, I seem not to have permission to edit 89 to boot to BIOS (so I can netboot). Can you take a look?
Flags: needinfo?(gcox)
Comment 15•11 years ago
Hrm. Not sure what that's all about. As admin-me, I did pretty much a noop (changed the guest's OS label from CentOS to CentOS). Then relops-me was able to set the boot-to-bios flag. Funky. Try again?
Flags: needinfo?(gcox)
Comment 16•11 years ago
OK, these are puppetized and ready to rock. They're running 3.2.0. They're in moco-nodes.pp as

# temporary node defs for these hosts
node /buildbot-master8[1-9].srv.releng.scl3.mozilla.com/ {
    include toplevel::server
}

so that will need to be fleshed out with master instances. Let me know if there are any problems with 3.2.0.
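For illustration only, the "fleshed out with master instances" step might look roughly like the following. This is a hypothetical sketch: toplevel::server is taken from the comment above, but the per-role class names (toplevel::server::buildmaster, toplevel::server::testmaster) are assumed names, not the real releng puppet modules.

```puppet
# Hypothetical sketch: replace the temporary catch-all node def with
# per-role defs. Class names below are assumptions for illustration.
node "buildbot-master81.srv.releng.scl3.mozilla.com" {
    include toplevel::server
    include toplevel::server::buildmaster    # assumed scheduler/build role class
}

node /^buildbot-master8[4-8]\.srv\.releng\.scl3\.mozilla\.com$/ {
    include toplevel::server
    include toplevel::server::testmaster     # assumed test-master role class
}
```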
Assignee: dustin → nobody
Component: Server Operations: RelEng → Release Engineering: Machine Management
QA Contact: arich → armenzg
Comment 17•11 years ago
I added `hostname` to /etc/hosts manually on these hosts (see bug 891333)
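A guarded version of that manual step might look like the snippet below. This is a hedged sketch, not the exact command used: the HOSTS_FILE variable is an illustration-only parameter (on the real hosts it would be /etc/hosts and would need root), and the loopback address is an assumption.

```shell
# Sketch of the manual /etc/hosts fix described above (see bug 891333):
# append this machine's own hostname so local name lookups resolve.
# HOSTS_FILE is parameterized here for illustration; real target is /etc/hosts.
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"
HN="$(hostname)"
# Only append if the hostname is not already present, keeping the edit idempotent.
if ! grep -q "$HN" "$HOSTS_FILE"; then
    echo "127.0.0.1   $HN" >> "$HOSTS_FILE"
fi
```

Running it a second time is a no-op, which matters if puppet or a later kickstart re-runs provisioning steps.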
Comment 18•11 years ago
Is there an eta on releng moving over to these new servers so we can decommission the older buildbot masters still on kvm? We're still waiting to delete:

buildbot-master10.build.mtv1.mozilla.com
buildbot-master12.build.scl1.mozilla.com
buildbot-master19.build.mtv1.mozilla.com
buildbot-master20.build.mtv1.mozilla.com
buildbot-master22.build.mtv1.mozilla.com
buildbot-master24.build.scl1.mozilla.com
buildbot-master28.build.mtv1.mozilla.com
buildbot-master29.build.scl1.mozilla.com
buildbot-master35.srv.releng.scl3.mozilla.com
buildbot-master36.srv.releng.scl3.mozilla.com
buildbot-master37.srv.releng.scl3.mozilla.com
buildbot-master38.srv.releng.scl3.mozilla.com
buildbot-master42.build.scl1.mozilla.com
buildbot-master43.build.scl1.mozilla.com
buildbot-master44.build.scl1.mozilla.com
buildbot-master45.build.scl1.mozilla.com
buildbot-master47.build.scl1.mozilla.com
buildbot-master48.build.scl1.mozilla.com
buildbot-master49.srv.releng.scl3.mozilla.com
Assignee
Comment 19•11 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #18)
> Is there an eta on releng moving over to these new servers so we can
> decommission the older buildbot masters still on kvm?

Without looking into which are which, we should be off the tegra and panda masters within the month, if all goes well. Slavealloc for tegras and pandas should make that VERY easy for us! If slavealloc doesn't go well, we can look at doing it manually, but I expect a few days straight just for moving devices from one master to another, all manual work. Not counting setup/etc. of the new masters (if needed).
Updated•11 years ago
Product: mozilla.org → Release Engineering
Comment 20•11 years ago
Found in triage, moving to "General Automation" to be with the rest of the KVM-related bugs.

(In reply to Dustin J. Mitchell [:dustin] from comment #16)
> OK, these are puppetized and ready to rock. They're running 3.2.0. They're
> in moco-nodes.pp as
>
> # temporary node defs for these hosts
> node /buildbot-master8[1-9].srv.releng.scl3.mozilla.com/ {
>     include toplevel::server
> }
>
> so that will need to be fleshed out with master instances.
>
> Let me know if there are any problems with 3.2.0.

dustin: Why change to use 3.2.0, instead of whatever is currently used on the other production masters? (Keeping masters identical seems righteous, so what am I missing? If a change is needed, what testing have you done to verify it?)

(In reply to Amy Rich [:arich] [:arr] from comment #18)
> Is there an eta on releng moving over to these new servers so we can
> decommission the older buildbot masters still on kvm?
>
> We're still waiting to delete:
>
> buildbot-master10.build.mtv1.mozilla.com
> buildbot-master12.build.scl1.mozilla.com
> buildbot-master19.build.mtv1.mozilla.com
> buildbot-master20.build.mtv1.mozilla.com
> buildbot-master22.build.mtv1.mozilla.com
> buildbot-master24.build.scl1.mozilla.com
> buildbot-master28.build.mtv1.mozilla.com
> buildbot-master29.build.scl1.mozilla.com
> buildbot-master35.srv.releng.scl3.mozilla.com
> buildbot-master36.srv.releng.scl3.mozilla.com
> buildbot-master37.srv.releng.scl3.mozilla.com
> buildbot-master38.srv.releng.scl3.mozilla.com
> buildbot-master42.build.scl1.mozilla.com
> buildbot-master43.build.scl1.mozilla.com
> buildbot-master44.build.scl1.mozilla.com
> buildbot-master45.build.scl1.mozilla.com
> buildbot-master47.build.scl1.mozilla.com
> buildbot-master48.build.scl1.mozilla.com
> buildbot-master49.srv.releng.scl3.mozilla.com

arr: are you still waiting for formal "ok to kill these KVM masters"?
Component: Buildduty → General Automation
Flags: needinfo?(dustin)
Flags: needinfo?(arich)
QA Contact: armenzg → catlee
Comment 21•11 years ago
joduinn: if you look at dustin's update, that was back in May (when that was the version we were putting on new machines). These are running the same version of puppet as everything else now. And, yep, still waiting for the okay to remove the old buildbot masters on KVM.
Flags: needinfo?(dustin)
Flags: needinfo?(arich)
Comment 22•11 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #18)
> buildbot-master10.build.mtv1.mozilla.com
> buildbot-master19.build.mtv1.mozilla.com
> buildbot-master20.build.mtv1.mozilla.com
> buildbot-master22.build.mtv1.mozilla.com

We have bm[95-99] waiting in AWS to pick up this slack, but I don't imagine we can pull that trigger until *after* the watcher upgrade is performed this weekend (bug 914302). Don't want too many variables changing at once.

> buildbot-master29.build.scl1.mozilla.com
> buildbot-master42.build.scl1.mozilla.com
> buildbot-master43.build.scl1.mozilla.com
> buildbot-master44.build.scl1.mozilla.com
> buildbot-master45.build.scl1.mozilla.com

These are panda masters. We have bm[90-94] waiting in AWS to replace them, but still need to pull the switch. Not blocked on bug 914302 per se, but will benefit when the people working on that bug are freed up afterwards.

> buildbot-master12.build.scl1.mozilla.com
> buildbot-master24.build.scl1.mozilla.com
> buildbot-master28.build.mtv1.mozilla.com
> buildbot-master35.srv.releng.scl3.mozilla.com
> buildbot-master36.srv.releng.scl3.mozilla.com
> buildbot-master37.srv.releng.scl3.mozilla.com
> buildbot-master38.srv.releng.scl3.mozilla.com
> buildbot-master47.build.scl1.mozilla.com
> buildbot-master48.build.scl1.mozilla.com
> buildbot-master49.srv.releng.scl3.mozilla.com

These masters have all been disabled for months. They can be safely deleted if they haven't been already.
Comment 23•11 years ago
I've removed the following hosts from nagios and shut them down:

buildbot-master12.build.scl1.mozilla.com
buildbot-master24.build.scl1.mozilla.com
buildbot-master28.build.mtv1.mozilla.com
buildbot-master35.srv.releng.scl3.mozilla.com
buildbot-master36.srv.releng.scl3.mozilla.com
buildbot-master37.srv.releng.scl3.mozilla.com
buildbot-master38.srv.releng.scl3.mozilla.com
buildbot-master47.build.scl1.mozilla.com
buildbot-master48.build.scl1.mozilla.com
buildbot-master49.srv.releng.scl3.mozilla.com

If I do not hear of any issues, I will delete them from kvm and inventory tomorrow morning.
Comment 24•11 years ago
buildbot-master36.srv.releng.scl3.mozilla.com is still needed, AFAIK. We run self-serve agent on it.
Comment 25•11 years ago
Brought 36 back up
Comment 26•11 years ago
Seems buildbot-master36.srv.releng.scl3.mozilla.com had the following hostgroup assignments: "selfserve", "release-runner", "bm36-bouncer". When the host was removed, this broke nagios, since the service names no longer had a host they belonged to. For reference, the checks tied to them did the following:

selfserve: check_command => 'check_nrpe_selfserve',
release-runner-procs: check_command => 'check_nrpe_procs_regex!release-runner.py!1!1',
bm36-bouncer: check_command => 'check_bouncer',

Keep those in mind if buildbot-master36.srv.releng.scl3.mozilla.com gets added back to monitoring.
Comment 27•11 years ago
I did an svn diff -rPREV and didn't see any mention of bm36-bouncer in the node definition. I put back what was there before I made the change (plus the addition of the core parent, which I see had been added after things had been split out into multiple files):

'buildbot-master36.srv.releng.scl3.mozilla.com' => {
    contact_groups => 'build',
    parents        => 'core1.scl1.mozilla.net',
    hostgroups     => [ 'selfserve', 'release-runner' ]
},
Comment 28•11 years ago
BTW, release-runner has been moved to bm81 in bug 836289.
Comment 29•11 years ago
(In reply to Chris Cooper [:coop] from comment #22)

Checking back in on this since some things were supposed to have landed before the summit.

Did bm[95-99] get stood up so that we are ready to decommission these masters:

buildbot-master10.build.mtv1.mozilla.com
buildbot-master19.build.mtv1.mozilla.com
buildbot-master20.build.mtv1.mozilla.com
buildbot-master22.build.mtv1.mozilla.com

And for scl1, did we spin up bm[90-94] so that we can decommission:

buildbot-master29.build.scl1.mozilla.com
buildbot-master42.build.scl1.mozilla.com
buildbot-master43.build.scl1.mozilla.com
buildbot-master44.build.scl1.mozilla.com
buildbot-master45.build.scl1.mozilla.com

And the last buildbot master... has a replacement (or set of replacements) been stood up in AWS to replace dev-master01.build.scl1.mozilla.com?
Flags: needinfo?(coop)
Comment 30•11 years ago
Oh, and I forgot the one in scl3: buildbot-master36.srv.releng.scl3.mozilla.com since self-serve was still on there.
Comment 31•11 years ago
Allow me to somewhat undo my statements in comment #22. Cross-colo links are hard, especially those that go from our colos to the AWS US-East region. Bug 710942, bug 844648, and bug 898317 contain evidence as to the frequency of these network events knocking builds and tests offline. It would be great if we could rely on the network to just work, but until we can, we'll need in-house masters to control our in-house slaves. I'm reluctant to have *more* master<->slave traffic crossing the US-East boundary until we have a better handle on the network issues. We won't be shunting the mobile test traffic to AWS masters unless we have no other choice.
Flags: needinfo?(coop)
Comment 32•11 years ago
The options are to move them to AWS or to move them to VMware in SCL3, but they can't stay on kvm in scl1 or mtv1 since those are going away. If you want to go the scl3 route, can you please submit a bug specifying how many you want and with what specs so we can get those base vms built for you (assuming you don't want to take over the ones that are already built and unused)?
Comment 33•11 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #32)
> The options are to move them to AWS or to move them to VMware in SCL3, but
> they can't stay on kvm in scl1 or mtv1 since those are going away. If you
> want to go the scl3 route, can you please submit a bug specifying how many
> you want and with what specs so we can get those base vms built for you
> (assuming you don't want to take over the ones that already built and
> unused)?

We'll absolutely use the ones already set up. Callek is signed up for bug 927129 (windows builders), with an eye towards how much extra setup is still required. We'll move on to test masters for Windows and mobile after that. I'm hoping the existing 10 masters waiting in scl3 will get us most of the way there, but we may need a few more for mobile. Mac and Linux don't see nearly as many connection issues as Windows and mobile.

Out of curiosity, how many master VMs _could_ we run in scl3? Since the majority of our test hardware is still in-house, that may be pertinent for standing up, e.g., 10.9 minis.
Updated•11 years ago
Keywords: sheriffing-P1
Comment 34•11 years ago
We're in the process of rebuilding the vmware infrastructure this quarter (all new hosts), so we should have increased capacity by Q1. Part of the problem is that we don't know what resources the buildbot-masters will take since none of them have been set up yet (to my knowledge). Once that's done, we can get some kind of baseline on usage and be better able to extrapolate how many more the existing infrastructure can hold. Greg or Dan might be able to give us better numbers then. I've flagged them for needinfo to give them a heads up and see if they have any ballpark info now. As far as order/criticality of moving things, the kvm servers in mtv1 are slated to be decommissioned first, so if you think there's a chance that not everything can be done this quarter, you may want to focus your energy there earlier.
Flags: needinfo?(gcox)
Flags: needinfo?(dparsons)
Comment 35•11 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #30)
> Oh, and I forgot the one in scl3:
> buildbot-master36.srv.releng.scl3.mozilla.com since self-serve was still on
> there.

I'm working on this in bug 870414 and hope to have a replacement ready this week or next.
Comment 36•11 years ago
I agree with :arr; without a usage baseline, I can't really advise on how many we can support. Reading over this bug, it is not clear how many new or repurposed VMs are going to become high-I/O VMs. Can anyone give me a number?
Flags: needinfo?(dparsons)
Updated•11 years ago
Flags: needinfo?(gcox)
Reporter
Comment 37•11 years ago
Do we have metrics from the KVM hosts we could use as a baseline? Alternatively can we set up a single VM and see how it behaves before provisioning more?
Comment 38•11 years ago
There are four set up now - 99..102 - if you want to experiment (bug 940659)
Updated•10 years ago
Assignee: nobody → bugspam.Callek
Priority: -- → P1
Comment 39•10 years ago
The only buildbot master left on kvm is dev-master01.
Comment 40•10 years ago
[09:53am] catlee: why do we have windows test masters in AWS?
[09:54am] coop: catlee: huh, I thought Callek fixed that
[09:54am] coop: or are they idle?
[09:55am] catlee: this one looks busy
[09:55am] catlee: they're in slavealloc
[09:55am] catlee: as active
[09:56am] catlee: tests-scl1-*
[09:56am] catlee: are all in aws
[09:57am] coop: Callek^^
[09:57am] bhearsum: https://bugzilla.mozilla.org/show_bug.cgi?id=867593
[09:58am] bhearsum: i think only the panda/tegra stuff masters were moved in house
[09:59am] catlee: we have scl3 masters though
[10:01am] rail: we should fix that
[10:01am] rail: iirc that was a temporary solution because our old masters sucked
[10:01am] rail: and aws masters were much faster
[10:02am] rail: catlee: you had some timings for simple steps like set property, they were taking 20s or something on the old masters. IIRC
[10:07am] coop: rail, catlee: is that something we still need to worry about? i don't want to push Callek to get us switched if we're worse off
[10:07am] coop: or is in-house still better than saturating the network pipe?
[10:08am] rail: if we already have windows masters in scl3 we can just disable the ones in amazon and see what happens
[10:08am] rail: if it sticks we can kill the amazon masters
[10:08am] bhearsum: even just moving some machines to scl3 masters would help
[10:12am] rail: oh, we don't have windows masters in scl3
[10:12am] rail: (per production-masters.json)
[10:12am] rail: 6 in amazon

Callek: all the dependencies here are resolved, but it sounds like the work isn't done. Can we get bugs on file for the remaining cross-colo cases, or comment as to why we're not, please?
Assignee
Comment 41•10 years ago
So, to summarize: we didn't move physical-machine testers over to in-house masters. The delay here was because things looked good in terms of random disconnects for jobs and related infra on these machines, so the work got deprioritized. Due to the confusion this has caused, I was asked to pick this back up and drive it through to completion; it shouldn't be too hard. New bug(s) filed today; I expect this to be finalized within a week of IT handing the new VMs over.
Assignee
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
Component: General Automation → General