Closed Bug 1112437 Opened 10 years ago Closed 2 years ago

[meta] Bugzilla Project infrastructure reorganization project

Categories

(Bugzilla :: bugzilla.org, defect)

normal

Tracking


RESOLVED FIXED

People

(Reporter: justdave, Assigned: mhoye)

References

Details

(Whiteboard: current plan at comment 9)

We have a few hardware machines in the community space in SCL3 which are out of warranty and have rapidly failing hard drives that aren't on RAID. We need to vacate and dispose of those machines. We also have one machine hosting email services for the project inside Mozilla's corporate space that they'd like us to get rid of for them. Our virtualization folks tell us it will be cheaper for them to allocate us new VMs on their ESX infrastructure than to maintain or replace our physical hardware, and our main CPU-intensive task that needed physical hardware (Tinderbox tests) has moved to travis-ci and no longer runs on our boxes, so we should just migrate whole-hog to VMs. This is a meta-bug for the project and will collect individual bugs for the tasks to be performed.
So, things we need to figure out here... Here's my starting point: https://wiki.mozilla.org/Bugzilla:Infrastructure

It sounds like two of the existing ESX VMs can just go away:

- windows.bugzilla.lan is running Windows Server 2003 SP2, which I think is the server version of XP, i.e. ancient and unsafe. And we claim we're no longer using it.
- oracle.bugzilla.lan is up and running RHEL 6, but there's no Oracle in sight; it looks like it never got installed after the machine was deployed.

As for the rest:

- bugzilla1 is our database server. This we need, but it can be a VM.
- bugzilla2 is nothing but a KVM host. The VMs hosted here should be migrated directly to ESX if we intend to keep them (I think they can probably be consolidated).
- bugzilla3 is dead. Its hard drive toasted itself (and is what prompted all this).

I propose merging bots and updates; they can be on the same server with proper privilege separation. MX services for bugzilla.org should move there, too (from mailman4 within Mozilla's IT infrastructure in PHX1, so we can get it off their plate). infra should stay separate: building stuff is a high-CPU task that could interfere with the update and mail services. bzr can be on the same server with bots, updates, and mail (bug 968636).
Depends on: 968636
So to summarize, here are the services we need to be able to provide within the Bugzilla project infrastructure:

VM1: Internal Services
- DNS for internal VLAN
- RPM build
- puppetmaster

VM2: Public Services
- IRC bot hosting (bzbot and word)
- Mail services (mailman mailing lists, @bugzilla.org email forwarding addresses)
- Bugzilla update service
- BZR repository (until Bugzilla 4.4 EOLs - because Mozilla doesn't want to host it anymore)

VM3: Database Server
- MySQL
- Postgres

VM4: Bugzilla Test Server (landfill)
- Shell Server
- Web Server

VM5: (optional) Bugzilla Demo server (what bugzilla3 used to be)
- Automatically created and disposed-of instances of Bugzilla, used for two-week trials to let people try out Bugzilla without having to install it somewhere themselves. This has been gone for over a year because of bugzilla3's hard drives getting toasted (twice), and I don't think anyone's missed it.

Did I miss anything? So we just need to figure out CPU, RAM, and disk specs for these systems then.
Existing specs:

bugzilla1: (phys) cores: 4, RAM: 4GB, HD: 1TB, in use: 16GB
bugzilla2: (phys) cores: 4, RAM: 4GB, HD: 250GB, in use: 24GB
bugzilla3: (phys) cores: 4, RAM: 4GB, HD: ?? (dead)
landfill: (virt) cores: 2, RAM: 4GB, HD: 150GB, in use: 42GB
updates: (virt) cores: 1, RAM: 512MB, HD: 7GB, in use: 2GB
bots: (virt) cores: 1, RAM: 1GB, HD: 7GB, in use: 3GB
infra: (virt) cores: 2, RAM: 512MB, HD: 35GB, in use: ?? (can't get into it)
mailman4: (phys) cores: 4, RAM: 4GB, HD: 40GB, in use: 15GB
Proposals:

landfill: already in ESX, no-op.
bugzilla3 (existing): immediate disposal
oracle: immediate disposal
windows: immediate disposal
infra: p2v into ESX as-is, assuming we can figure out how to get into it, then dispose of old VM
bugzilla1: p2v into ESX as-is, but with 2 cores, 2GB RAM, and 60GB HD space, then dispose of old hardware
bugzilla3 (new): 1 new from-scratch VM to handle public-facing services. 2 cores, 2GB RAM, 40GB disk. This will be handling the combined functions of the existing bots + updates + mailman4.
bots: move services to new bugzilla3, then dispose of old VM
updates: move services to new bugzilla3, then dispose of old VM
mailman4: move services to new bugzilla3, then dispose of old hardware/return to IT
bugzilla2: disposal, after infra + bots + updates are cleared off of it

Anyone see any problems with that plan?
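Side note: the cores/RAM/"in use" numbers in the specs table are easy to regenerate when sizing the new VMs. A minimal sketch (ordinary Linux tools, nothing specific to our boxes; run it on each host to get one table row):

```shell
#!/bin/sh
# Print one "specs table" row for this host: core count, RAM in MB,
# and root filesystem size/usage. Assumes Linux with coreutils and awk.

cores=$(nproc)
ram_mb=$(awk '/^MemTotal:/ {printf "%d", $2/1024}' /proc/meminfo)
disk_line=$(df -h / | awk 'NR==2 {print $2 " total, " $3 " in use"}')

echo "$(hostname): cores: $cores, RAM: ${ram_mb}MB, root HD: $disk_line"
```
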
wicked: any objections? most of this stuff would be relevant to you.
Flags: needinfo?(wicked)
(In reply to Dave Miller [:justdave] (justdave@bugzilla.org) from comment #1)

> windows.bugzilla.lan is running Windows Server 2003SP2, which I think is the
> server version of XP, i.e. ancient and unsafe. And we claim we're no longer
> using it.

True, it's ancient now, and I haven't heard of anything or anybody using it either. So I agree that this can just die.

> oracle.bugzilla.lan is there and up and running RHEL 6 on it. But no Oracle
> in sight, like it never got installed after the machine was deployed.

That's my bad, as I just haven't had time to deploy Oracle on this. :( I do think we still need to, both for a new Oracle-based permanent demo/testing install and for any Oracle development needs we have. As long as we support Oracle as a DB, both are necessary for that famous "somebody" to figure out bugs in our DB implementation. So I wouldn't decommission this, but rather work on getting Oracle installed (Puppetized). I promise I'll get to this some year! :)

> bugzilla1 is database servers. This we need. It can be in a VM though.

True, and I'd like to have a separate DB server that is used by both landfill and the demo server. However, since Tinderbox has died from the server, it doesn't have many things on it we need to preserve (rather, we need to clean it), so I wouldn't migrate it as-is but rather install a new pure (Puppetized) DB server for us. Any DBs from bugzilla1 we can back up and move to the new server, as well as back up any built RPMs/Tinderbox stuff we might want to archive for some reason.

When it comes to resource allocation, I would give this way more than 2 vCPU and 2GB, though. This is both a MySQL and Pg server, with many DBs running on it. 4 vCPU and 8GB would be the minimum, IMO. However, since landfill/demo wouldn't run their own DB, their requirements could be bumped down if needed (landfill-new is an 8GB server, but could simply be a 4GB one this way).

> bugzilla2 is nothing but a KVM host.
> The VMs hosted here should be migrated directly to ESX if we intend to keep
> them (I think they can probably be consolidated).

It also serves as an inbound email conduit to our bots, since the current bots VM doesn't have a public IP. Currently the bugzilla.org MX is handled by Mozilla servers* that route bugbot@ to bugzilla2 (additionally, some inbound bugspam is still routed via landfill, as I haven't changed all bugbot accounts in various Bugzilla installs). This routing can change when/if bots becomes directly accessible, though.

*MX handlers are (mx1|mx2).corp.phx1.mozilla.com, to be exact. Are these going to die with mailman4?

> I propose merging bots and updates, they can be on the same server with
> proper privilege separation. MX services for bugzilla.org should move

Updates is the most critical part of our infra, since all Bugzilla installs out there fetch the update XML from it. There will be a visible error on the admin interface of Bugzilla installs all over the world if it is inaccessible (there have been a few incidents like this in the past). This is why updates has been a separate VM and runs with an nginx web server behind it. Nginx also complicates combining this with other services since, for example, Puppet Dashboard can't be deployed on nginx (technically possible, but way too complicated and requires rebuilding nginx itself) but rather needs Apache. So I'd like to keep updates as a separate VM, or find out if Mozilla can provide something else to serve this static XML snippet to the world. Additionally, our Release Manager (at the moment dkl) needs access to change the file during releases. Can this file be hosted on a CDN or some static resource cluster? A separate updates VM can have modest resources allocated to it, but it would be nice for it to have high virtual priority in ESX.
We can move the activity graph from the updates VM to some other server if Mozilla is worried that we'd use too much CPU on a high-priority VM (the graph generation is the hardest thing the server does), or if the file ends up being hosted somewhere else. Although that graph needs access to the access logs, which has been a problem in the past for Mozilla to provide us..

> infra should stay separate, building stuff is a high-cpu task that could
> interfere with the update and mail services.

Our need for actually building custom RPMs has gone down significantly since Tinderbox is gone, and will drop further if we can get all our servers updated to RHEL 6 (with the Perl 5.10.1 that the latest Bugzilla requires). Also, landfill has previously had access to custom Perl interpreters for our devs to use for testing different Perl versions. Not sure how widespread that usage has been, though. There might also be a much better way to install custom Perl interpreters these days than to painstakingly rebuild RPMs. What we mostly need now is to host third-party RPMs (mostly just repo RPMs) for our Puppet to deploy automatically.

> bzr can be on the same server with bot, updates, and mail (bug 968636).

Hmm.. other than updates (see above), I think this might be possible. Especially if that's purely a bzr server without any web front end and only a handful of committing users (we obviously can't use Mozilla LDAP to grant access to it).

(In reply to Dave Miller [:justdave] (justdave@bugzilla.org) from comment #2)

> VM1: Internal Services
> - DNS for internal VLAN

I did mention once on irc that our need for an internal DNS zone might just complicate things unnecessarily these days. We could forget it and just give public IPs to all our servers from our pool; I believe we might have enough to go around.. An additional piece to this puzzle is Puppet deployment, so we can easily deploy firewall and other rules to limit access to services.
> - puppetmaster

Since an ideal Puppet install would include a Puppet Dashboard, and that needs to be accessible via web, I think this should rather be on our public services VM. We could have only the Puppet Master in its own VM, but I'm not sure our simple usage patterns warrant that. Also, PuppetDB needs to be somewhere in addition, for it to use a Pg database. These comments mean we probably don't need a separate internal services VM, but the public VM should be a bit beefier than what we currently have, mostly because our Puppet usage hasn't peaked yet since it's not fully deployed yet.

> VM5: (optional) Bugzilla Demo server (what bugzilla3 used to be)
...
> bugzilla3's hard drives getting toasted (twice) and I don't think anyone's
> missed it.

I've seen one or two requests/questions about demo installs in the past and, IMO, this is a nice service to provide, so I'd like to finally resurrect it. However, if we want, we could combine this back into landfill itself. I believe the main reasons to separate it were to get a newer OS, to make it not depend on what devs do to landfill, and.. I believe we did have an "extra" physical server to make use of. :)

> Did I miss anything? So we just need to figure out CPU, RAM, and disk specs
> for these systems then.

I'd be more liberal with these than to skimp out and run into problems. However, I do understand that Mozilla would need some justification from us for demanding resources. Let's think about these once we have otherwise finalized the service migration plan.
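For what it's worth, the "static XML snippet" case described above is about the simplest thing nginx can serve. A hypothetical sketch of such a server block (the server name, paths, and cache times are made up for illustration, not our actual config):

```nginx
# Sketch only: serve the static Bugzilla update XML with aggressive caching,
# so downstream caches can mask a brief outage. Names/paths are hypothetical.
server {
    listen 80;
    server_name updates.bugzilla.org;

    root /var/www/updates;

    location ~ \.xml$ {
        expires 1h;
        add_header Cache-Control "public";
    }
}
```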
Flags: needinfo?(wicked)
(In reply to Teemu Mannermaa (:wicked) from comment #6)

> > bugzilla2 is nothing but a KVM host. The VMs hosted here should be migrated
> > directly to ESX if we intend to keep them (I think they can probably be
>
> It also serves as an inbound email conduit to our bots since current bots VM
> doesn't have a public IP. Currently the bugzilla.org MX is handled by
> Mozilla servers* that route bugbot@ to bugzilla2 (additionally some inbound
> bugspam is still routed via landfill as I haven't changed all bugbot
> accounts in various Bugzilla installs). This routing can change when/if bots
> becomes directly accessible, though.
>
> *MX handlers are (mx1|mx2).corp.phx1.mozilla.com to be exact. Are these
> going to die with mailman4?

mx1/mx2 only relay to mailman4, and mailman4 handles all the bugzilla.org email. The entirety of mailman4's mail handling will be moving onto the Bugzilla Project's servers (and likely the MX record with it). This was why I was proposing having it on the same box as the update service: it's of similar criticality, and people would probably be as upset if the mailing lists quit as they would be if they got an error checking for updates.

> > bzr can be on the same server with bot, updates, and mail (bug 968636).
>
> Hmm.. other than updates (see above), I think this might be possible.
> Especially if that's purely a bzr server without any web front end and only
> handful of committing users (we obviously can't use Mozilla LDAP to grant
> access to it).

Yeah, since the only thing that needs write access to it is the git->bzr mirror script, it can just use local accounts instead of needing Mozilla's LDAP backing for the accounts.

> (In reply to Dave Miller [:justdave] (justdave@bugzilla.org) from comment #2)
> > VM1: Internal Services
> > - DNS for internal VLAN
>
> I did mention once on irc that our need for an internal DNS zone might just
> complicate things unnecessarily these days.
> We could forget it and just give public IP for all our servers from our
> pool. I believe we might have enough to go around.. Additional piece to
> this puzzle is Puppet deployment so we can easily deploy firewall and other
> rules to limit access to services.

I think the primary reason for the internal VLAN was the Windows server, because everyone is paranoid about having a Windows box directly connected to the Internet. IIRC we had apache mod_proxy somewhere pointing at its webserver, and you had to tunnel through landfill to get to the RDP. That said, there's something to be said for having the database servers separated from the internet. Good iptables rules can enforce this equally well, though.

> > VM5: (optional) Bugzilla Demo server (what bugzilla3 used to be)
> ...
> > bugzilla3's hard drives getting toasted (twice) and I don't think anyone's
> > missed it.
>
> I've seen one or two requests/questions about demo installs in the past and,
> IMO, this is a nice service to provide so I'd like to finally resurrect it.
> However, if we want, we could combine this back to landfill itself. I
> believe the main reason to separate this was to get newer OS, make it not
> depend on what devs do to landfill and.. I believe we did have an "extra"
> physical server to make use of. :)

Sounds good, so we set up a new VM for that, too, I guess, to keep it off landfill's hackiness. :)

I was going to propose CentOS 7 for all the newly-spun VMs.
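To make the "good iptables rules" idea concrete, here's a hedged sketch in iptables-restore format of keeping the database ports off the public internet; the source subnet below is a placeholder from the reserved documentation range, not our real allocation:

```
# Sketch only (iptables-restore format). 203.0.113.0/28 stands in for
# "our own servers"; substitute the real subnet.
*filter
:INPUT DROP [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# MySQL (3306) and Postgres (5432) reachable only from our own boxes
-A INPUT -p tcp -s 203.0.113.0/28 --dport 3306 -j ACCEPT
-A INPUT -p tcp -s 203.0.113.0/28 --dport 5432 -j ACCEPT
# SSH stays open for admins
-A INPUT -p tcp --dport 22 -j ACCEPT
COMMIT
```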
(In reply to Dave Miller [:justdave] (justdave@bugzilla.org) from comment #7)

> mx1/mx2 only relay to mailman4, and mailman4 handles all the bugzilla.org
> email. The entirety of mailman4's mail handling will be moving into the
> Bugzilla Project's servers (and likely the MX record with it). This was why

Ah, right, although mx1/mx2 relay bugbot@bugzilla.org directly to bugzilla2, since that's where the relayed mails go (they never visit mailman4, AFAICS). This doesn't change things for the migration, though; we just have to remember to check and handle the rules in them as well (for our domain).

> I was proposing having it on the same box with the update service because
> it's of similar criticality and people would be probably as upset if the
> mailing lists quit as they would be if they get an error checking for
> updates.

Oh, right, definitely. Hmm.. Do we need to limit the number of VMs we request? The more we cram into the same VM, the more likely it is that services conflict during install, that we can't optimize them as much as we'd like, or that a single problem takes them all down. I could see Puppet-related services and possibly the bots (which are not that important) running in their own "normal criticality" VM, and the critical services in their own VMs. Not sure if simpler email routing to the bots warrants them being on the MX server; they are not at the moment either. :)

However, bzr could be seen as one of those critical services: think of somebody trying to do an upgrade just when it's down (also think about Bugzilla security updates getting delayed due to bzr being down). Or what do you think? I don't know off the top of my head if bzr requires the HTTP/HTTPS ports for its own usage (the non-loggerhead one), which would conflict with updates' nginx.. unless we can host it behind nginx too.

> I was going to propose CentOS 7 for all the newly-spun VMs.
Hmm, I'd feel safer if we went with CentOS 6 still, as I'm not entirely sure Puppet modules and the like fully support CentOS 7 yet. Of course, we could try and see, since we can configure the new systems before we take them into production.. although it might delay the migration due to unforeseen problems, and we do have an end-of-year deadline to reach for some of this. However, it would save us an upgrade to C7 in a few years. :)
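As a sketch of the kind of special-casing the mx hosts would need carried over: in Postfix this could be done with a transport map that short-circuits bugbot@ past the list server. This is a guess at the mechanism, not mx1/mx2's actual configuration; the bugzilla2 hostname is from this bug, everything else is hypothetical:

```
# /etc/postfix/transport (sketch). Activate with "postmap /etc/postfix/transport"
# and "transport_maps = hash:/etc/postfix/transport" in main.cf.
# Route the bot's address straight to bugzilla2; everything else for the
# domain goes to the mailman host.
bugbot@bugzilla.org    smtp:[bugzilla2.community.scl3.mozilla.com]
bugzilla.org           smtp:[mailman4]
```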
Depends on: 1124279
OK, so revised based on the above comments:

== EXISTING SYSTEMS ==

bugzilla1: (phys) cores: 4, RAM: 4GB, HD: 1TB, in use: 16GB
bugzilla2: (phys) cores: 4, RAM: 4GB, HD: 250GB, in use: 24GB
bugzilla3: (phys) cores: 4, RAM: 4GB, HD: ?? (dead)
landfill: (esx) cores: 2, RAM: 4GB, HD: 150GB, in use: 42GB
landfill2: (esx) cores: 4, RAM: 4GB, HD: 150GB, in use: 2GB
updates: (kvm) cores: 1, RAM: 512MB, HD: 7GB, in use: 2GB
bots: (kvm) cores: 1, RAM: 1GB, HD: 7GB, in use: 3GB
infra: (kvm) cores: 2, RAM: 512MB, HD: 35GB, in use: 3GB
oracle: (esx) cores: 2, RAM: 4GB, HD: 100GB, in use: 2GB
mailman4: (esx) cores: 4, RAM: 4GB, HD: 40GB, in use: 15GB
windows: (esx) cores: 1, RAM: 1.5GB, HD: 40GB, in use: 12GB

== PLANNED FINISHED STATE ==

VM1: infra.bugzilla.org - Internal Services
- DNS for internal VLAN (if we do an internal vlan)
- puppetmaster
- IRC bot hosting (bugbot and word)
- RPM build?

VM2: cps.bugzilla.org - Critical Public Services
- Mail services (mailman mailing lists, @bugzilla.org email forwarding addresses)
- Bugzilla update service
- BZR repository (until Bugzilla 4.4 EOLs - because Mozilla doesn't want to host it anymore)

VM3: db1.bugzilla.org - Database Server
- MySQL
- Postgres

VM4: oracle.bugzilla.org - Database Server
- Oracle

VM5: landfill.bugzilla.org - Bugzilla Test Server
- Shell Server
- Web Server

VM6: demo.bugzilla.org - Bugzilla Demo server (what bugzilla3 used to be)
- Web Server

== HOW TO GET THERE ==

Nothing to do, already how we want it:
- oracle.bugzilla.org

Immediately dispose of the following systems:
- bugzilla3.community.scl3.mozilla.com
- bugzilla-windows.community.scl3.mozilla.com

New VMs in ESX:
- infra.bugzilla.org
  - 2 cores
  - 4 GB RAM (justification = puppetmaster will use it harshly)
  - 40 GB HD
- cps.bugzilla.org
  - 2 cores
  - 4 GB RAM (for bzr)
  - 40 GB HD
- db1.bugzilla.org
  - 2 cores
  - 8 GB RAM (for mysql, lots of copies of large databases)
  - 100 GB HD
- demo.bugzilla.org
  - 2 cores
  - 4 GB RAM
  - 60 GB HD

Migrations:
- infra(old) -> infra(new)
- bugzilla1 -> db1.bugzilla.org - should be set up from scratch; I understand nothing of value is on bugzilla1 currently, but we'll call it a migration just to be safe
- mailman4 (and related stuff from mx1.mail.scl3) -> cps.bugzilla.org
- updates -> move services to cps.bugzilla.org
- bots -> move services to infra.bugzilla.org
- bzr.mozilla.org -> cps.bugzilla.org
- landfill(old) -> landfill(new)

Disposal after migrations are completed:
- bugzilla1.community.scl3.mozilla.com
- bugzilla2.community.scl3.mozilla.com (infra, updates, and bots are KVM VMs hosted on this machine and will go away with it)
- bzr2.dmz.scl3.mozilla.com
- landfill(old).bugzilla.org

Any objections?
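One small piece implied by the planned finished state: if we skip the internal VLAN, each of these names needs a public A record, and cps takes over the MX when mailman4's role moves. A hypothetical zone fragment for bugzilla.org, with placeholder addresses from the reserved documentation range (not our real pool):

```
; bugzilla.org zone fragment -- sketch only, IPs are placeholders
infra     IN  A   203.0.113.10
cps       IN  A   203.0.113.11
db1       IN  A   203.0.113.12
oracle    IN  A   203.0.113.13
landfill  IN  A   203.0.113.14
demo      IN  A   203.0.113.15
; mail for the domain lands on cps once mailman4's MX role moves over
@         IN  MX  10 cps.bugzilla.org.
```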
Flags: needinfo?(wicked)
Whiteboard: current plan at comment 9
I'm not sure if the bugzilla.org website is included in any of these machines, but given the lack of movement in bug 971411, perhaps we should set up our own WordPress. Something more flexible would be nicer than the proposed multisite one anyway.
The plan in comment 9 looks good to me.
Flags: needinfo?(wicked)
Depends on: 1133178
Depends on: 628085
Depends on: 1155525
Depends on: 1155526
No longer depends on: 1155526
A (relatively) new wrinkle in this - the bug hasn't been getting updated - is that Mozilla is moving out of the datacenter we're currently hosted in, and currently has no plans to provide hosting for community services in the new datacenter. So we need to move out and find a new home. There's a table at the top of https://wiki.mozilla.org/Bugzilla:Infrastructure which has been updated with a current list of our infrastructure and my proposed disposition for it. We have until the 3rd quarter of 2018, so there's plenty of time yet. I'm waiting for a few more details from Mozilla on their future plans (I have a meeting on Dec 7 with Mozilla folks where I can grill them on it) before I start shopping for an ISP or soliciting offers for donated hosting.
Okay, had the meeting on Dec 7. It turns out Mozilla's still happy to pay for hosting for their officially supported community projects; they just don't want to physically do the hosting themselves anymore. So we can still move elsewhere that costs money and they'll pay for it (they need to approve of it, of course, and they want to provide the minimum required to meet our needs). Our needs (from the above list) have been communicated, and they're supposed to be getting back to me with options.

This was sorta done a long time ago... and by sorta, I mean the stuff moved to Linode (as described on the wiki page mentioned in comment 12), but it's not getting paid for yet. :mhoye is personally covering one of the machines and I'm personally covering the other. The one I'm covering is about $60/year, which isn't a big deal to me, so I'm not in a hurry.

Assignee: justdave → mhoye
No longer depends on: 628085

not needed anymore

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

(In reply to Sylvestre Ledru [:Sylvestre] from comment #15)

> not needed anymore

Or rather: already done. The funding issue is about to be resolved also.
