658743 - Upgrade staging cluster for Bugzilla

Reporter

Description

•

13 years ago

We need to set up a staging environment for Bugzilla that reasonably matches production so we can do proper testing.

If it lives in SJC1 we already have sufficient database capacity for it that reasonably matches production.  We would only need 2 additional webservers (so we have 2 each for bugzilla-stage and bugzilla-stage-tip) and to set up some Zeus vIPs.  We would probably also need a VM to run the equivalent of mradm02/ip-admin02 on its behalf.

If it lives in PHX1 we'll need to add database servers, too.

Dave Miller [:justdave]

Reporter

Updated

•

13 years ago

Assignee: server-ops → nobody

Component: Server Operations → Server Operations: Projects

Dave Miller [:justdave]

Reporter

Updated

•

13 years ago

Depends on: 565897

Dave Miller [:justdave]

Reporter

Updated

•

12 years ago

Group: infra

David Lawrence [:dkl]

Comment 1

•

12 years ago

Could we possibly merge this or pull in the VM/machine I am requesting in bug 716641 as well? I assume it would be better to manage the staging/development instances together.

dkl

Dave Miller [:justdave]

Reporter

Comment 3

•

12 years ago

OK, this has now become a priority because of bug 716641 (which just got merged with this).

Here's specifically what we need:

4 bugzilla-class webheads, which should be the same specs as the existing production webheads in PHX.  I believe that's blades with 12 GB of RAM and quad-core processors (or dual-core with HT?).
  - 2 of them will be used for bugzilla-stage (pre-production staging)
  - 1 of them will be used for bugzilla-stage-tip (development instance of bmo fork)
  - 1 of them will be used for bugzilla-stage-next (pre-upgrade testing for major upgrades)

2 or 3 (up to DBAs?) database servers, similar specs to production databases, although they can live with the SSD and just use real disk.
  - 1 master
  - 1 or 2 slaves

Database servers need enough disk capacity to host at least 3 copies of Bugzilla (perhaps 4 for good measure?).  Each instance of Bugzilla currently occupies about 42GB on disk, and growing.  300GB of disk is probably a good choice.
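As a rough check of the sizing above, the arithmetic can be sketched in Python. The 42 GB per instance and the 300 GB target are taken from this comment; the function and constant names are made up for illustration, and future growth is not modeled.

```python
# Rough disk-sizing check for the staging database servers.
# Figures come from the comment above; names are illustrative only.

INSTANCE_SIZE_GB = 42   # current on-disk size of one Bugzilla instance
PROPOSED_DISK_GB = 300  # disk size proposed per database server


def required_disk_gb(copies: int, instance_size_gb: int = INSTANCE_SIZE_GB) -> int:
    """Return the disk needed to hold `copies` Bugzilla databases at today's size."""
    return copies * instance_size_gb


for copies in (3, 4):
    needed = required_disk_gb(copies)
    headroom = PROPOSED_DISK_GB - needed
    print(f"{copies} copies: {needed} GB needed, {headroom} GB headroom on {PROPOSED_DISK_GB} GB")
```

Even with 4 copies, 300 GB leaves 132 GB spare for the databases to keep growing.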

Assignee: nobody → server-ops

Severity: minor → major

Component: Server Operations: Projects → Server Operations

QA Contact: mrz → cshields

Dave Miller [:justdave]

Reporter

Comment 4

•

12 years ago

while the severity was probably correct at major, this doesn't need to be paging oncall...  sorry.

Severity: major → normal

Dave Miller [:justdave]

Reporter

Comment 5

•

12 years ago

(In reply to Dave Miller [:justdave] from comment #3)
> although they can live with the SSD and just use real disk.

Er, they can live *without* the SSD. :)

Dave Miller [:justdave]

Reporter

Comment 6

•

12 years ago

Database servers are also going to need a LOT of RAM.  Forgot about that part...  Bugzilla's all InnoDB, so technically you need enough RAM to fit it in memory.  48GB of RAM would only fit one.  Maybe we need separate DB servers for each instance?  (yikes, getting expensive now).
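The "only fits one" estimate can be checked with a small Python sketch. It assumes each instance needs about the 42 GB quoted in comment 3 held in memory, and it ignores RAM used by the OS and by MySQL outside the buffer pool, so the real figure would be somewhat tighter. The function names are illustrative.

```python
import math

# Assumption: the in-memory working set per instance is roughly the
# on-disk size quoted in comment 3 (42 GB). OS and MySQL overhead are ignored.
DATASET_GB = 42


def instances_that_fit(ram_gb: float, dataset_gb: float = DATASET_GB) -> int:
    """How many whole Bugzilla datasets fit in `ram_gb` of memory."""
    return math.floor(ram_gb / dataset_gb)


def ram_needed_gb(instances: int, dataset_gb: float = DATASET_GB) -> float:
    """RAM needed to hold `instances` datasets fully in memory."""
    return instances * dataset_gb


print(instances_that_fit(48))  # one instance fits in 48 GB
print(ram_needed_gb(3))        # RAM to hold all three staging instances
```

Holding three instances fully in memory on one server would take about 126 GB under these assumptions, which is why separate servers come up as an option.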

Dave Miller [:justdave]

Reporter

Comment 7

•

12 years ago

FYI, we have an immediate need for the servers related to "bugzilla-stage-next", the others can be a longer-term project.  So we need at least 2 DB servers and 1 app server for short-term.  But if the hardware is similarly difficult to obtain all around, then doing it all at once won't hurt. :)

Dave Miller [:justdave]

Reporter

Comment 8

•

12 years ago

and I'm guessing a single database server will probably be fine...  there won't be thousands of people hitting this cluster probably so the likelihood of the innodb buffer pool filling up isn't that high anyway :)

Corey Shields [:cshields]

Comment 9

•

12 years ago

can we do the db server on a blade and the rest on VMs?

Dave Miller [:justdave]

Reporter

Comment 10

•

12 years ago

(In reply to Corey Shields [:cshields] from comment #9)
> can we do the db server on a blade and the rest on VMs?

We're specifically not using a VM in sjc1 right now because it used to be on a VM and the performance sucked.  Maybe ESX is better these days, I dunno.  On the other hand, there was a desire to replicate production, at least for the final staging site.

Corey Shields [:cshields]

Comment 11

•

12 years ago

(In reply to Dave Miller [:justdave] from comment #10)
> (In reply to Corey Shields [:cshields] from comment #9)
> > can we do the db server on a blade and the rest on VMs?
> 
> We're specifically not using a VM in sjc1 right now because it used to be on
> a VM and the performance sucked.  Maybe ESX is better these days, I dunno. 
> On the other hand, there was a desire to replicate production, at least for
> the final staging site.

comparing our new environment in phx1 to sjc1 is apples and oranges.  Let's try VMs for now as we don't have hardware on the horizon, pushing this off to Q2 at the earliest.

Sheeri Cabral [:sheeri]

Updated

•

12 years ago

Whiteboard: 2012 q2 at earliest

Shyam Mani [:fox2mike]

Assignee

Updated

•

12 years ago

Assignee: server-ops → shyam

Dave Miller [:justdave]

Reporter

Comment 12

•

12 years ago

(In reply to Corey Shields [:cshields] from comment #11)
> pushing this off to Q2 at the earliest.

Can't wait that long, the existing stuff is in SJC1 and will go away before then.  VMs will probably be fine, the new ESX stuff is bloody fast when I've messed with it so far. :)

Blocks: scl3-move

Dave Miller [:justdave]

Reporter

Updated

•

12 years ago

Whiteboard: 2012 q2 at earliest → before SJC1 move-out

Rob Tucker [:rtucker]

Comment 13

•

12 years ago

Any updates here? I would like to confirm that the virtual machine dm-bugstage02.vmx does not need to be migrated and that this will get done before we're out of sjc1.

Sheeri Cabral [:sheeri]

Comment 14

•

12 years ago

According to Corey:

[16:07:40] <cshields> we moved the bugzilla stage VMs
[16:08:01] <cshields> so that bug should be invalid

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → INVALID

:glob ✱

Comment 15

•

12 years ago

(In reply to Sheeri Cabral [:sheeri] from comment #14)
> According to Corey:
> 
> [16:07:40] <cshields> we moved the bugzilla stage VMs
> [16:08:01] <cshields> so that bug should be invalid

while the bugzilla stage VMs have been moved to scl3, they still don't reasonably match production.

Status: RESOLVED → REOPENED

Resolution: INVALID → ---

Sheeri Cabral [:sheeri]

Comment 16

•

12 years ago

No worries - I have fixed the whiteboard so it's not an sjc move thing anymore.

It would be awesome if someone could write out next steps for this.

Assignee: shyam → server-ops

Whiteboard: before SJC1 move-out

Sheeri Cabral [:sheeri]

Updated

•

12 years ago

No longer blocks: scl3-move

Phong Tran [:phong]

Updated

•

12 years ago

Assignee: server-ops → server-ops-devservices

Component: Server Operations → Server Operations: Developer Services

QA Contact: cshields → shyam

Whiteboard: [after scl3]

Sheeri Cabral [:sheeri]

Comment 17

•

12 years ago

In comment 8, justdave mentioned that a single machine would be fine performance-wise. However, I'd still like to have reads and writes separated, like they are in production. Right now we're still building the 2nd slave (it needed to be rebuilt) so it's not ready yet, but when it is, I'd like to make sure that it is used.
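To illustrate what separating reads from writes means in practice, here is a minimal Python sketch that sends SELECT statements to a slave and everything else to the master. This is not Bugzilla's own mechanism nor the production setup; the hostnames and the `pick_host` helper are hypothetical.

```python
import itertools

# Hypothetical hostnames, for illustration only.
MASTER = "stage-db-master.example"
SLAVES = ["stage-db-slave1.example", "stage-db-slave2.example"]

# Rotate through the slaves so reads are spread across them.
_slave_cycle = itertools.cycle(SLAVES)


def pick_host(sql: str) -> str:
    """Route plain SELECT statements to a slave; send everything else to the master."""
    words = sql.split(None, 1)
    verb = words[0].upper() if words else ""
    if verb == "SELECT":
        return next(_slave_cycle)
    return MASTER


print(pick_host("SELECT bug_id FROM bugs WHERE bug_id = 658743"))
print(pick_host("UPDATE bugs SET priority = 'P2' WHERE bug_id = 658743"))
```

A staging cluster with a master and at least one working slave lets this kind of split be exercised the same way it would be in production.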

Sheeri Cabral [:sheeri]

Comment 18

•

12 years ago

the 2nd slave is built and replicating the first.  (physical machine)

What are the next steps?

David Lawrence [:dkl]

Comment 19

•

12 years ago

(In reply to Sheeri Cabral [:sheeri] from comment #18)
> the 2nd slave is built and replicating the first.  (physical machine)
> 
> What are the next steps?

I think this bug also hinges on the work being done in bug 716641 so basically:

https://bugzilla.mozilla.org/show_bug.cgi?id=716641#c52

dkl

Shyam Mani [:fox2mike]

Assignee

Updated

•

12 years ago

Assignee: server-ops-devservices → shyam

Shyam Mani [:fox2mike]

Assignee

Comment 20

•

12 years ago

glob/dkl,

As we're starting Q3 in a while, let's list out what we need here and have this as a Q3 goal?

Whiteboard: [after scl3] → Q32012

:glob ✱

Comment 21

•

12 years ago

we'll need something closer to production; so, for *both* pre-production staging (bugzilla-stage) and development staging (bugzilla-stage-tip) i'd expect:

  - webhead cluster, two nodes, with some sort of zeus balancing action
  - database cluster with functional master/slave

the webheads ideally should have the same amount of ram as production, as the ram configuration is in code (grr).  however as bugzilla makes such poor use of ram, we should be good with about 6g per webhead (half of production).

Shyam Mani [:fox2mike]

Assignee

Comment 22

•

12 years ago

Sweet. I'm thinking Seamicro Xeons for all this. I'll flesh it out a bit more and then file dependent bugs etc.

Shyam Mani [:fox2mike]

Assignee

Updated

•

12 years ago

Whiteboard: Q32012 → [2012q3]

Sheeri Cabral [:sheeri]

Updated

•

12 years ago

Severity: normal → critical

Priority: -- → P2

Shyam Mani [:fox2mike]

Assignee

Comment 23

•

12 years ago

Updated plan of action here :

1) Create 3 webheads - 1 for bugzilla-stage-tip and 2 for bugzilla-stage
2) These will be "built from scratch" - with RHEL 6 + Bugzilla 4.2 
3) These will match Bugzilla production as they move into 4.2 and further ahead.
4) This is semi dependent on Bugzilla production moving to 4.2, the hardware bits of it aren't, but the servers themselves going live/replacing the current stage environments are.

glob, r?

Summary: Deploy staging cluster for Bugzilla → Upgrade staging cluster for Bugzilla

Shyam Mani [:fox2mike]

Assignee

Updated

•

12 years ago

Depends on: 786734

:glob ✱

Comment 24

•

12 years ago

assuming there's a db cluster behind that to allow the DBAs to test changes to clustering, i'm happy with the plan in comment 23.

Shyam Mani [:fox2mike]

Assignee

Comment 25

•

12 years ago

(In reply to Byron Jones ‹:glob› from comment #24)
> assuming there's a db cluster behind that to allow the DBAs to test changes
> to clustering, i'm happy with the plan in comment 23.

Yup. The current DB cluster behind bugzilla-stage and bugzilla-stage-tip is production grade hardware. It's already on RHEL 6, we want to upgrade it to MariaDB 5.5 when we can (bug 786734) and we should be good to continue using this :)

Shyam Mani [:fox2mike]

Assignee

Updated

•

12 years ago

Severity: critical → normal

Shyam Mani [:fox2mike]

Assignee

Comment 26

•

12 years ago

This goes along with prod Bugzilla, and will be completed this quarter.

Whiteboard: [2012q3] → [2012q4]

Shyam Mani [:fox2mike]

Assignee

Updated

•

12 years ago

Depends on: 815048

Melissa O'Connor [:melissa]

Updated

•

12 years ago

Whiteboard: [2012q4] → [2013Q1]

cmtalbert

Comment 27

•

11 years ago

So, we are planning to release 4.2 the first week in February. This needs to be completed and ready to go before that week. What is its ETA and when can we expect these machines to be live and ready for use?

Flags: needinfo?(shyam)

Shyam Mani [:fox2mike]

Assignee

Comment 28

•

11 years ago

(In reply to Clint Talbert ( :ctalbert ) from comment #27)
> So, we are planning to release 4.2 the first week in February. This needs to
> be completed and ready to go before that week. What is its ETA and when can
> we expect these machines to be live and ready for use?

No later than 24 January 2013. I was hoping to have them ready, but ran into other production issues that preempted this.

Flags: needinfo?(shyam)

cmtalbert

Comment 29

•

11 years ago

(In reply to Shyam Mani [:fox2mike] from comment #28)
> (In reply to Clint Talbert ( :ctalbert ) from comment #27)
> > So, we are planning to release 4.2 the first week in February. This needs to
> > be completed and ready to go before that week. What is its ETA and when can
> > we expect these machines to be live and ready for use?
> 
> No later than 24 January 2013. I was hoping to have them ready, but ran into
> other production issues that preempted this.

Great, that would be excellent. Thanks Shyam.

Mark Côté [:mcote]

Comment 30

•

11 years ago

Okay, please do everything in your power to make this happen by the 24th.  We'll need at least two weeks of testing, so now we're looking at no earlier than the week of February 11th for the launch.  This is assuming we have production set up by then (see bug 726710).

Shyam Mani [:fox2mike]

Assignee

Comment 31

•

11 years ago

Update here : the new stage - bugzilla.allizom.org has been configured and is "up". I've had a chat with glob and have started a checksetup.pl run on this, we'll see how long it takes to complete.

Shyam Mani [:fox2mike]

Assignee

Comment 32

•

11 years ago

https://bugzilla.allizom.org/ is up and running now.

David Lawrence [:dkl]

Comment 33

•

11 years ago

(In reply to Shyam Mani [:fox2mike] from comment #32)
> https://bugzilla.allizom.org/ is up and running now.

\o/ shyam! \o/

:glob ✱

Updated

•

11 years ago

Depends on: 838494

:glob ✱

Updated

•

11 years ago

Depends on: 838498

Shyam Mani [:fox2mike]

Assignee

Comment 34

•

11 years ago

Just another short update here. The new behind-auth stage environment is fully functional and is now auto-updating every 5 minutes. Once bug 840879 is done (I'll work with glob on that tomorrow morning), we'll have a public test system ready to go as well.

Shyam Mani [:fox2mike]

Assignee

Comment 35

•

11 years ago

Over the past few weeks we've gotten the new stage environments online and tested them. Everything looks good. These environments mirror production, matching its layout (DBs, load balancers, and the like), and should help the bugzilla.mozilla.org team continue their work.

https://bugzilla.allizom.org/ and https://bugzilla-dev.allizom.org/ are the new environments. We will be killing bugzilla-stage and bugzilla-stage-tip after we move to 4.2 and SCL3.

Status: REOPENED → RESOLVED

Closed: 12 years ago → 11 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

10 years ago

Component: Server Operations: Developer Services → General

Product: mozilla.org → Developer Services