Bug 622752 (Closed): opened 13 years ago, closed 13 years ago

Build out SUMO in PHX

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86
OS: Other
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: cshields, Assigned: cshields)

References

Details

(Whiteboard: [02/10/2011 @ 4pm][tracker])

1 - we will build the following systems in PHX:

5 app servers
2 db servers
1 celery server

2 - puppetize them

3 - install SUMO, copy data

4 - zeus config

5 - test
A status update to this: all of the servers for step #1 are physically present in PHX. Two of the app servers were installed today; the other three will be reinstalled tomorrow (replacing RHEL5.5 with RHEL6), along with the celery server. The db servers are already set up and will remain as-is on RHEL5.5.
Is going from 4 database servers (1 master + 3 slaves) to 2 going to increase the risk of downtime? That seems like a lot less redundancy and load balancing.
Ahh, this is my mistake. I took the servers we already had allocated for sumo in phx (3 app servers and 2 db servers) and added 2 more app servers and a celery server. I think we will be okay, though; I believe we still have another 2 systems matching the db specs down there. I will double-check tomorrow and add them to the list.

Thanks for pointing that out; I was going on the assumption that the current setup in phx was duplicating what we already had.
We only added the third slave fairly recently, but the total, for the record, now is:

4 app servers (one running celery) (pm-app-sumo0{1-4})
1+3 database servers (tm-sumo01-master01, tm-sumo01-slave0{1-3})

and we depend on pm-app-sphinx01/02 and pm-app-memcache(01-03, I think).

pm-app-sphinx01/02 serves a number of sites, and seems to be doing fine, but is there a plan to move some shared sphinx/memcache infrastructure to PHX so we're not doing the 1400-mile round-trip during requests? Or is that not worth worrying about at this point?
(In reply to comment #4)
> pm-app-sphinx01/02 serves a number of sites, and seems to be doing fine, but is
> there a plan to move some shared sphinx/memcache infrastructure to PHX so we're
> not doing the 1400-mile round-trip during requests? Or is that not worth
> worrying about at this point?

We already have sphinx in phx, spinning up memcache servers shouldn't be too hard.
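
For reference, once memcache boxes are spun up in phx, a rough smoke test is just to hit the stats command over the wire (the hostname below is a placeholder, not an actual PHX node name):

# Hedged sketch: poke a new memcached instance directly.
# "memcache1.phx.example" is a placeholder hostname.
printf 'stats\nquit\n' | nc memcache1.phx.example 11211 | head -5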
(In reply to comment #5)
> We already have sphinx in phx, spinning up memcache servers shouldn't be too
> hard.

Awesome! Just making sure everything's recorded.
App servers are kickstarted, and the memcache server has been kickstarted as well. I'll kickstart the db and celery hosts tomorrow and start to puppetize them.
OS: Mac OS X → Other
Missing packages from rhel6:

pylibmc
libmemcached 0.38-1 (only 0.31-1 is available)
php-eaccelerator
php-pear-HTML-Common
php-pear-HTML-QuickForm
PIL

I've put conditionals in the puppet manifests to distinguish between rhel5 and rhel6 and have removed the offending packages for now, to make puppet stop complaining. Some of these might not be needed anymore, but if they are, they will need to be built from SRPM, dropped into mrepo, and then re-added to puppet.
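
As an aside, a quick way to see which branch of those conditionals a given host will take is to query the facts on the node itself; which fact the manifests actually key on is an assumption here:

# Hedged sketch: print the facts a rhel5-vs-rhel6 conditional would typically key on.
# The exact fact names used in the manifests are an assumption.
facter operatingsystem operatingsystemrelease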
Removed php-pecl-fileinfo as well, since it was conflicting with php-common. There also seem to be some Apache issues. I've tried replacing httpd.conf with a rhel6 version, but I see lots of wsgi-related issues in error_log; it probably needs some debugging and testing from someone who knows about wsgi.
If it saves you time, we don't need PHP at all.
That saves loads of time, actually. And I think it will be best to rewrite the puppet class altogether. Is there a comprehensive list of all the requirements for SUMO, including Apache modules and such?
Requirements (from our docs/requirements files):

Python and compiled Python packages (we should have RPMs already):

* Python 2.6
* MySQL-Python 1.2.3
* Jinja2 >= 2.5.2
* PIL 1.1.7
* lxml 2.2.6

Apache:

* mod_wsgi
* mod_rewrite
* mod_expires

Honestly I think those are our only Apache requirements.

Misc:

* RabbitMQ (assuming it runs on the celery box, it only needs to be on that one).
* Sphinx Search (already exists in PHX [comment 5]).
* Memcached (comment 5).

All the other requirements that I can think of are pure Python and in our vendor library.
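
For the Apache side, a quick check on a new webhead is just to grep the loaded-module list for those three (a hedged sketch; this confirms the modules are loaded, not that the vhost config is right):

# Hedged sketch: confirm mod_wsgi, mod_rewrite and mod_expires are loaded.
httpd -M 2>/dev/null | grep -E 'wsgi|rewrite|expires'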
Oh, I lied a little: PIL has additional requirements:

* libjpeg
* zlib (and/or whatever it takes for PIL to compile with PNG support)
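
In practice that means having the JPEG and zlib development headers on whatever host builds the PIL RPM, roughly (a hedged sketch):

# Hedged sketch: build-time headers PIL needs for JPEG and PNG (zlib) support.
yum install -y libjpeg-devel zlib-devel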
Is libmemcached required to be version 0.38-1? That is what the current puppet manifest calls for, and I suspect it is an RPM that we built ourselves for RHEL5. Also, PIL doesn't seem to be in the RHEL6 repos, so we'll have to build that one as well. Not sure yet about Jinja. I need to check versions of the other stuff.
Strictly speaking, I don't believe we need libmemcached, either. We aren't using pylibmc, we're using python-memcached, which is pure-Python.

If we do go with pylibmc, which AMO is using, I think we need >= 0.34. They should have RPMs that are at least the correct version of libmemcached, if not the correct RHEL.
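
If it helps to confirm which client is in play, the pure-Python client imports as "memcache", so a quick check from the app environment is roughly (hedged; assumes the module is on the default python path):

# Hedged sketch: confirm python-memcached (module name "memcache") resolves and where it loads from.
python -c 'import memcache; print memcache.__file__'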
Great, pylibmc isn't in the repo either, so I'm happy skipping that dependency and libmemcached...
[root@support1 ~]# rpm -qa |grep MySQL
MySQL-python-1.2.3-0.3.c1.1.el6.x86_64
[root@support1 ~]# rpm -qa |grep Jinja
Jinja2-2.5.2-2.x86_64
[root@support1 ~]# rpm -qa |grep lxml
python-lxml-2.2.3-1.1.el6.x86_64
[root@support1 ~]# rpm -qa |grep mod_wsgi
mod_wsgi-3.2-1.el6.x86_64

is that version of lxml going to work?
I'll need time to test that version. Are you sure 2.2.6 isn't available anywhere?
Depends on: 624819
There is a 2.2.6 rpm in fedora 13.  Sounds like if you need 2.2.6 we should build our own, which might be easier than troubleshooting against an older version.
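
If we do build our own, a hedged sketch of rebuilding the Fedora 13 source RPM on RHEL6 follows; the SRPM filename is a guess, and lxml needs the libxml2/libxslt headers at build time:

# Hedged sketch: rebuild the Fedora 13 lxml SRPM on RHEL6. Filename is hypothetical.
yum install -y gcc python-devel libxml2-devel libxslt-devel
rpmbuild --rebuild python-lxml-2.2.6-1.fc13.src.rpm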
(In reply to comment #20)
> There is a 2.2.6 rpm in fedora 13.  Sounds like if you need 2.2.6 we should
> build our own, which might be easier than troubleshooting against an older
> version.

What version are we running on the pm-app-sumo cluster? Clearly it works.

(In reply to comment #21)
> What version are we running on the pm-app-sumo cluster? Clearly it works.

[root@pm-app-sumo01 ~]# rpm -qa |grep lxml
lxml-2.2.6-1

so yeah...
Depends on: 615547
Quick update: database servers are all kickstarted and puppetized.  Replication is going between 1 and 2; will set up 3 and 4 tomorrow.

James, we should chat soon about what we will need to do DB wise when it comes time to cutover.
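
For completeness, the usual check that replication is actually flowing on each new slave is standard MySQL, nothing SUMO-specific (a hedged sketch):

# Hedged sketch: confirm the slave threads are running and the lag looks sane.
mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'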
Depends on: 625646
James,

we are going to redo the webheads with rhel5.5 like the others, since it will mean a quicker migration.  We don't have the bandwidth to fix and test a new stack with everything else going on right now.
The current plan is for SUMO to move to PHX during the 2/10 maintenance window.  We have the RHEL5.5 environment "up" but untested, and the admin functions still need to be moved over.  As discussed on a call this morning, jabba is going to keep working on a RHEL6 + puppet environment for SUMO in PHX.  The drop-dead date for that to be working is EOB Friday (2/4).  If we get it working in time, we will change over the other webheads, which should be no problem using puppet.
Flags: needs-downtime+
Whiteboard: [tracker] → [tracker][02/10/2011]
Whiteboard: [tracker][02/10/2011] → [02/10/2011 @ 4pm][tracker]
Steps for today:

1)  set the following in settings_local.py to set read_only:  read_only_mode(globals())  then /data/sumo/deploy
 
2)  dump the openfire_chat and support_mozilla_com databases from tm-sumo01-master01.mozilla.org
 
3)  copy and import those databases to support1.db.phx1.mozilla.com (see the sketch after this list)
 
4)  change the rw and ro db VIPs in SJC to point to the 4 servers in PHX (necessary for dm-chat01 and metrics)
 
5)  Have QA test and check phx1 setup
 
6)  change support.mozilla.com dns from 63.245.209.132 to 63.245.217.50
 
7)  un-set read_only_mode(globals()) in settings_local.py then /data/sumo/deploy
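
A hedged sketch of what steps 2 and 3 might look like from the shell; credentials and exact dump options are assumptions, and the hostnames are the ones listed above:

# Hedged sketch of steps 2-3: dump both databases from the SJC master, load them in PHX.
# --single-transaction assumes InnoDB tables; add credentials as appropriate.
mysqldump -h tm-sumo01-master01.mozilla.org --single-transaction \
  --databases openfire_chat support_mozilla_com > sumo-migration.sql
mysql -h support1.db.phx1.mozilla.com < sumo-migration.sql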
(In reply to comment #26)
> 1)  set the following in settings_local.py to set read_only: 
> read_only_mode(globals())  then /data/sumo/deploy

Note that this must be the *LAST* line of settings_local.py.
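
Concretely, steps 1 and 7 boil down to toggling that one line and redeploying; a hedged sketch follows (the settings_local.py path is a guess, /data/sumo/deploy is as written in the list above):

# Hedged sketch of step 1: append the read-only toggle as the LAST line of settings_local.py
# (per the note above), then run the deploy step. The settings path is a guess.
echo 'read_only_mode(globals())' >> /data/sumo/settings_local.py
/data/sumo/deploy
# Step 7 is the reverse: delete that line again and re-run /data/sumo/deploy.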
SUMO is now live in phx1 and tested successfully.

We will have a postmortem for the move tomorrow at 16:00.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
All dependent bugs are fixed, and the move has long since been verified.

Verified FIXED.
Status: RESOLVED → VERIFIED
Product: mozilla.org → mozilla.org Graveyard