Closed Bug 658743 Opened 13 years ago Closed 11 years ago

Upgrade staging cluster for Bugzilla

Categories

(Developer Services :: General, task, P2)

All
Other

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: justdave, Assigned: fox2mike)

References

Details

(Whiteboard: [2013Q1])

We need to set up a staging environment for Bugzilla that reasonably matches production so we can do proper testing.

If it lives in SJC1 we already have sufficient database for it that reasonably matches production.  We would only need 2 additional webservers (so we have 2 each for bugzilla-stage and bugzilla-stage-tip) and set up some Zeus vIPs.  Probably a VM to run the equivalent of mradm02/ip-admin02 on its behalf.

If it lives in PHX1 we'll need to add database servers, too.
Assignee: server-ops → nobody
Component: Server Operations → Server Operations: Projects
Depends on: 565897
Group: infra
Could we possibly merge this or pull in the VM/machine I am requesting in bug 716641 as well? I assume it would be better to manage the staging/development instances together.

dkl
OK, this has now become a priority because of bug 716641 (which just got merged with this).

Here's specifically what we need:

4 bugzilla-class webheads, should be the same specs as the existing production webheads in PHX.  I believe that's blades with 12 GB of RAM with quad-core processors (or dual-core with HT?).
  - 2 of them will be used for bugzilla-stage (pre-production staging)
  - 1 of them will be used for bugzilla-stage-tip (development instance of bmo fork)
  - 1 of them will be used for bugzilla-stage-next (pre-upgrade testing for major upgrades)

2 or 3 (up to DBAs?) database servers, similar specs to production databases, although they can live with the SSD and just use real disk.
  - 1 master
  - 1 or 2 slaves

Database servers need enough disk capacity to host at least 3 copies of Bugzilla (perhaps 4 for good measure?).  Each instance of Bugzilla currently occupies about 42GB on disk, and growing.  300GB of disk is probably a good choice.
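A quick sanity check of that disk figure, as a minimal sketch: the 42GB and 300GB numbers come from the paragraph above, the function name is made up for illustration, and growth over time is not modelled.

```python
# Figures taken from the comment above; everything else is illustrative.
INSTANCE_SIZE_GB = 42   # current on-disk size of one Bugzilla instance
PROPOSED_DISK_GB = 300  # suggested disk size per database server


def disk_headroom(copies, instance_gb=INSTANCE_SIZE_GB, disk_gb=PROPOSED_DISK_GB):
    """Return (used_gb, free_gb) when hosting `copies` Bugzilla databases."""
    used = copies * instance_gb
    return used, disk_gb - used


if __name__ == "__main__":
    for copies in (3, 4):
        used, free = disk_headroom(copies)
        print(f"{copies} copies: {used} GB used, {free} GB free")
```

Even with 4 copies, well over 100GB stays free at today's size, which is the room for growth the 300GB suggestion is buying.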
Assignee: nobody → server-ops
Severity: minor → major
Component: Server Operations: Projects → Server Operations
QA Contact: mrz → cshields
while the severity was probably correct at major, this doesn't need to be paging oncall...  sorry.
Severity: major → normal
(In reply to Dave Miller [:justdave] from comment #3)
> although they can live with the SSD and just use real disk.

Er, they can live *without* the SSD. :)
Database servers are also going to need a LOT of RAM.  Forgot about that part...  Bugzilla's all InnoDB, technically you need enough RAM to fit it in memory.  48GB of RAM would only fit one.  Maybe we need separate DB servers for each instance?  (yikes, getting expensive now).
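Rough arithmetic behind the "48GB would only fit one" remark, as a sketch only: it assumes the in-memory working set is about the same as the 42GB on-disk size quoted earlier in this bug, and it ignores whatever RAM the OS and MySQL itself need outside the buffer pool. The helper names are hypothetical.

```python
INSTANCE_SIZE_GB = 42  # assumption: working set is roughly the on-disk size


def instances_fitting_in_ram(ram_gb, instance_gb=INSTANCE_SIZE_GB):
    """How many whole Bugzilla databases fit in `ram_gb` of memory."""
    return ram_gb // instance_gb


def ram_needed_gb(copies, instance_gb=INSTANCE_SIZE_GB):
    """Memory needed to hold `copies` databases entirely in RAM."""
    return copies * instance_gb


if __name__ == "__main__":
    print(instances_fitting_in_ram(48))  # one instance on a 48GB box
    print(ram_needed_gb(3))              # GB needed for three instances
```

So one 48GB server holds a single instance fully in memory, and keeping three instances fully cached on one box would take about 126GB.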
FYI, we have an immediate need for the servers related to "bugzilla-stage-next", the others can be a longer-term project.  So we need at least 2 DB servers and 1 app server for short-term.  But if the hardware is similarly difficult to obtain all around, then doing it all at once won't hurt. :)
and I'm guessing a single database server will probably be fine...  there won't be thousands of people hitting this cluster probably so the likelihood of the innodb buffer pool filling up isn't that high anyway :)
can we do the db server on a blade and the rest on VMs?
(In reply to Corey Shields [:cshields] from comment #9)
> can we do the db server on a blade and the rest on VMs?

We're specifically not using a VM in sjc1 right now because it used to be on a VM and the performance sucked.  Maybe ESX is better these days, I dunno.  On the other hand, there was a desire to replicate production, at least for the final staging site.
(In reply to Dave Miller [:justdave] from comment #10)
> (In reply to Corey Shields [:cshields] from comment #9)
> > can we do the db server on a blade and the rest on VMs?
> 
> We're specifically not using a VM in sjc1 right now because it used to be on
> a VM and the performance sucked.  Maybe ESX is better these days, I dunno. 
> On the other hand, there was a desire to replicate production, at least for
> the final staging site.

comparing our new environment in phx1 to sjc1 is apples and oranges.  Let's try VMs for now as we don't have hardware on the horizon, pushing this off to Q2 at the earliest.
Whiteboard: 2012 q2 at earliest
Assignee: server-ops → shyam
(In reply to Corey Shields [:cshields] from comment #11)
> pushing this off to Q2 at the earliest.

Can't wait that long, the existing stuff is in SJC1 and will go away before then.  VMs will probably be fine, the new ESX stuff is bloody fast when I've messed with it so far. :)
Blocks: scl3-move
Whiteboard: 2012 q2 at earliest → before SJC1 move-out
Any updates here? I would like to confirm that the virtual machine dm-bugstage02.vmx does not need to be migrated and that this will get done before we're out of sjc1.
According to Corey:

[16:07:40] <cshields> we moved the bugzilla stage VMs
[16:08:01] <cshields> so that bug should be invalid
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
(In reply to Sheeri Cabral [:sheeri] from comment #14)
> According to Corey:
> 
> [16:07:40] <cshields> we moved the bugzilla stage VMs
> [16:08:01] <cshields> so that bug should be invalid

while the bugzilla stage VMs have been moved to scl3, they still don't reasonably match production.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
No worries - I have fixed the whiteboard so it's not an sjc move thing anymore.

It would be awesome if someone could write out next steps for this.
Assignee: shyam → server-ops
Whiteboard: before SJC1 move-out
No longer blocks: scl3-move
Assignee: server-ops → server-ops-devservices
Component: Server Operations → Server Operations: Developer Services
QA Contact: cshields → shyam
Whiteboard: [after scl3]
In Comment 8 justdave mentioned a single machine would be fine, for performance issues. However, I'd like to still have reads/writes separated, like they are in production. Right now we're still building the 2nd slave (it needed to be rebuilt) so it's not ready yet, but when it is, I'd like to make sure that is used.
the 2nd slave is built and replicating the first.  (physical machine)

What are the next steps?
(In reply to Sheeri Cabral [:sheeri] from comment #18)
> the 2nd slave is built and replicating the first.  (physical machine)
> 
> What are the next steps?

I think this bug also hinges on the work being done in bug 716641 so basically:

https://bugzilla.mozilla.org/show_bug.cgi?id=716641#c52

dkl
Assignee: server-ops-devservices → shyam
glob/dkl,

As we're starting Q3 in a while, let's list out what we need here and have this as a Q3 goal?
Whiteboard: [after scl3] → Q32012
we'll need something closer to production; so, for *both* pre-production staging (bugzilla-stage) and development staging (bugzilla-stage-tip) i'd expect:

  - webhead cluster, two nodes, with some sort of zeus balancing action
  - database cluster with functional master/slave

the webheads ideally should have the same amount of ram as production, as the ram configuration is in code (grr).  however as bugzilla makes such poor use of ram, we should be good with about 6g per webhead (half of production).
Sweet. I'm thinking Seamicro Xeons for all this. I'll flesh it out a bit more and then file dependent bugs etc.
Whiteboard: Q32012 → [2012q3]
Severity: normal → critical
Priority: -- → P2
Updated plan of action here :

1) Create 3 webheads - 1 for bugzilla-stage-tip and 2 for bugzilla-stage
2) These will be "built from scratch" - with RHEL 6 + Bugzilla 4.2 
3) These will match Bugzilla production as they move into 4.2 and further ahead.
4) This is semi-dependent on Bugzilla production moving to 4.2: the hardware bits of it aren't, but the servers themselves going live and replacing the current stage environments are.

glob, r?
Summary: Deploy staging cluster for Bugzilla → Upgrade staging cluster for Bugzilla
Depends on: 786734
assuming there's a db cluster behind that to allow the DBAs to test changes to clustering, i'm happy with the plan in comment 23.
(In reply to Byron Jones ‹:glob› from comment #24)
> assuming there's a db cluster behind that to allow the DBAs to test changes
> to clustering, i'm happy with the plan in comment 23.

Yup. The current DB cluster behind bugzilla-stage and bugzilla-stage-tip is production grade hardware. It's already on RHEL 6, we want to upgrade it to MariaDB 5.5 when we can (bug 786734) and we should be good to continue using this :)
Severity: critical → normal
This goes along with prod Bugzilla, and will be completed this quarter.
Whiteboard: [2012q3] → [2012q4]
Depends on: 815048
Whiteboard: [2012q4] → [2013Q1]
So, we are planning to release 4.2 the first week in February. This needs to be completed and ready to go before that week. What is its ETA and when can we expect these machines to be live and ready for use?
Flags: needinfo?(shyam)
(In reply to Clint Talbert ( :ctalbert ) from comment #27)
> So, we are planning to release 4.2 the first week in February. This needs to
> be completed and ready to go before that week. What is its ETA and when can
> we expect these machines to be live and ready for use?

No later than 24 January 2013. I was hoping to have them ready, but ran into other production issues that preempted this.
Flags: needinfo?(shyam)
(In reply to Shyam Mani [:fox2mike] from comment #28)
> (In reply to Clint Talbert ( :ctalbert ) from comment #27)
> > So, we are planning to release 4.2 the first week in February. This needs to
> > be completed and ready to go before that week. What is its ETA and when can
> > we expect these machines to be live and ready for use?
> 
> No later than 24 January 2013. I was hoping to have them ready, but ran into
> other production issues that preempted this.

Great, that would be excellent. Thanks Shyam.
Okay please do everything in your power to make this happen by the 24th.  We'll need at least two weeks of testing, so now we're looking at no earlier than the week of February 11th for the launch.  This is assuming we have production set up by then (see bug 726710).
Update here : the new stage - bugzilla.allizom.org has been configured and is "up". I've had a chat with glob and have started a checksetup.pl run on this, we'll see how long it takes to complete.
https://bugzilla.allizom.org/ is up and running now.
(In reply to Shyam Mani [:fox2mike] from comment #32)
> https://bugzilla.allizom.org/ is up and running now.

\o/ shyam! \o/
Depends on: 838494
Depends on: 838498
Just another short update here. The new behind-auth stage environment is fully functional and is now auto-updating every 5 mins. Once bug 840879 is done (I'll work with glob on that in the morning tomorrow), we'll have a public test system ready to go as well.
Over the past few weeks we've gotten the new stage environments online and tested them. Everything looks good. These environments mirror production, including its layout (DBs, load balancers and the like), and should help the bugzilla.mozilla.org team continue their work.

https://bugzilla.allizom.org/ and https://bugzilla-dev.allizom.org/ are the new environments. We will be killing bugzilla-stage and bugzilla-stage-tip after we move to 4.2 and SCL3.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services