Bug 622752 - Build out SUMO in PHX
Status: VERIFIED FIXED
Whiteboard: [02/10/2011 @ 4pm][tracker]
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Hardware: x86 Other
Importance: -- normal
Target Milestone: ---
Assigned To: Corey Shields [:cshields]
QA Contact: matthew zeier [:mrz]
Mentors:
Duplicates: 613323
Depends on: 615547 623267 624819 625646 626819
Blocks: 613323

Reported: 2011-01-03 18:12 PST by Corey Shields [:cshields]
Modified: 2015-03-12 08:17 PDT
CC: 11 users
Flags: mzeier: needs-downtime+
See Also:
QA Whiteboard:
Iteration: ---
Points: ---

Attachments

Description Corey Shields [:cshields] 2011-01-03 18:12:18 PST
1 - we will build the following systems in PHX:

5 app servers
2 db servers
1 celery server

2 - puppetize them

3 - install SUMO, copy data

4 - zeus config

5 - test
Comment 1 Corey Shields [:cshields] 2011-01-03 18:16:27 PST
A status update to this...  All of the servers for step #1 are physically present in PHX.  2 of the app servers have been installed today, the other 3 will be reinstalled tomorrow (replacing RHEL5.5 with RHEL6) along with the celery server.  The db servers are already set up and will remain as-is on RHEL5.5
Comment 2 James Socol [:jsocol, :james] 2011-01-03 18:29:16 PST
Is going from 4 database servers (1 master + 3 slaves) to 2 going to increase the risk of downtime? That seems like a lot less redundancy and load balancing.
Comment 3 Corey Shields [:cshields] 2011-01-03 19:05:13 PST
Ahh..  This is my mistake, I took what servers we already had allocated for sumo in phx (being 3 app servers and 2 db servers) and added 2 more app servers and a celery server..  I think we will be okay though, I believe we still have another 2 systems matching the db specs down there.  I will double check tomorrow and add them to the list.

Thanks for pointing that out, I was going on the assumption that the current setup in phx was duplicating what we already had.
Comment 4 James Socol [:jsocol, :james] 2011-01-03 19:25:44 PST
We only added the third slave fairly recently, but the total, for the record, now is:

4 app servers (one running celery) (pm-app-sumo0{1-4})
1+3 database servers (tm-sumo01-master01, tm-sumo01-slave0{1-3})

and we depend on pm-app-sphinx01/02 and pm-app-memcache(01-03, I think).

pm-app-sphinx01/02 serves a number of sites, and seems to be doing fine, but is there a plan to move some shared sphinx/memcache infrastructure to PHX so we're not doing the 1400-mile round-trip during requests? Or is that not worth worrying about at this point?
Comment 5 Shyam Mani [:fox2mike] 2011-01-03 20:40:10 PST
(In reply to comment #4)
> pm-app-sphinx01/02 serves a number of sites, and seems to be doing fine, but is
> there a plan to move some shared sphinx/memcache infrastructure to PHX so we're
> not doing the 1400-mile round-trip during requests? Or is that not worth
> worrying about at this point?

We already have sphinx in phx, spinning up memcache servers shouldn't be too hard.
Comment 6 James Socol [:jsocol, :james] 2011-01-04 06:41:10 PST
(In reply to comment #5)
> We already have sphinx in phx, spinning up memcache servers shouldn't be too
> hard.

Awesome! Just making sure everything's recorded.
Comment 7 Corey Shields [:cshields] 2011-01-04 14:49:03 PST
app servers are kickstarted..  memcache server has been kickstarted.  I'll kickstart the db and celery hosts tomorrow, and start to puppetize them.
Comment 8 Phong Tran [:phong] 2011-01-05 11:05:09 PST
*** Bug 613323 has been marked as a duplicate of this bug. ***
Comment 9 Justin Dow [:jabba] 2011-01-10 14:18:28 PST
Missing packages from rhel6:

pylibmc
libmemcached 0.38-1 (only 0.31-1 is available)
php-eaccelerator
php-pear-HTML-Common
php-pear-HTML-QuickForm
PIL

I've added conditionals to the puppet manifests to distinguish between RHEL5 and RHEL6 and have removed the offending packages for now to make puppet stop complaining. Some of these might not be needed anymore, but if they are, they will need to be built from SRPM, dropped into mrepo, and then re-added to puppet.
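The RHEL5/RHEL6 conditionals described above might look something like this in a Puppet manifest. This is an illustrative sketch only: the class name, fact, and exact package split are assumptions, not SUMO's actual manifests.

```puppet
# Hypothetical sketch; real class/package names in the SUMO manifests may differ.
class sumo::packages {
  case $operatingsystemrelease {
    /^5/: {
      # Full RHEL5 set, including packages that have no RHEL6 build yet.
      package { ['pylibmc', 'libmemcached', 'php-eaccelerator', 'PIL']:
        ensure => installed,
      }
    }
    /^6/: {
      # RHEL6: omit the packages missing from the repos until SRPM
      # rebuilds land in mrepo.
      package { ['python-memcached']:
        ensure => installed,
      }
    }
  }
}
```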
Comment 10 Justin Dow [:jabba] 2011-01-10 15:30:25 PST
Removed php-pecl-fileinfo as well, since it was conflicting with php-common. There also seem to be some Apache issues. I've tried replacing httpd.conf with a RHEL6 version, but I see lots of wsgi-related errors in error_log; it probably needs some debugging and testing from someone who knows wsgi.
Comment 11 James Socol [:jsocol, :james] 2011-01-11 07:22:55 PST
If it saves you time, we don't need PHP at all.
Comment 12 Justin Dow [:jabba] 2011-01-11 07:29:43 PST
That saves loads of time actually. And I think it will be best to re-write the puppet class altogether. Is there a comprehensive list of all the requirements for SUMO, including apache modules and such?
Comment 13 James Socol [:jsocol, :james] 2011-01-11 07:47:38 PST
Requirements (from our docs/requirements files):

Python and compiled Python packages (we should have RPMs already):

* Python 2.6
* MySQL-Python 1.2.3
* Jinja2 >= 2.5.2
* PIL 1.1.7
* lxml 2.2.6

Apache:

* mod_wsgi
* mod_rewrite
* mod_expires

Honestly I think those are our only Apache requirements.

Misc:

* RabbitMQ (assuming it runs on the celery box, it only needs to be on that one).
* Sphinx Search (already exists in PHX [comment 5]).
* Memcached (comment 5).

All the other requirements that I can think of are pure Python and in our vendor library.
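Collected into a pip-style requirements file, the Python portion of that list would look roughly like this (pins inferred from the versions above; SUMO's actual requirements files may differ):

```
MySQL-python==1.2.3
Jinja2>=2.5.2
PIL==1.1.7
lxml==2.2.6
```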
Comment 14 James Socol [:jsocol, :james] 2011-01-11 07:51:56 PST
Oh, I lied a little: PIL has additional requirements:

* libjpeg
* zlib (and/or whatever it takes for PIL to compile with PNG support)
Comment 15 Justin Dow [:jabba] 2011-01-11 08:26:54 PST
Is libmemcached required to be version 0.38-1? That is what the current puppet manifest calls for, and I suspect that is an RPM we built ourselves for RHEL5. Also, PIL doesn't seem to be in the RHEL6 repos, so we'll have to build that one as well. Not sure yet about Jinja; I need to check versions of the other stuff.
Comment 16 James Socol [:jsocol, :james] 2011-01-11 08:55:37 PST
Strictly speaking, I don't believe we need libmemcached, either. We aren't using pylibmc, we're using python-memcached, which is pure-Python.

If we do go with pylibmc, which AMO is using, I think we need >= 0.34. They should have RPMs that are at least the correct version of libmemcached, if not the correct RHEL.
Comment 17 Justin Dow [:jabba] 2011-01-11 08:58:26 PST
Great, pylibmc isn't in the repo either, so I'm happy skipping that dependency and libmemcached...
Comment 18 Justin Dow [:jabba] 2011-01-11 09:00:37 PST
[root@support1 ~]# rpm -qa |grep MySQL
MySQL-python-1.2.3-0.3.c1.1.el6.x86_64
[root@support1 ~]# rpm -qa |grep Jinja
Jinja2-2.5.2-2.x86_64
[root@support1 ~]# rpm -qa |grep lxml
python-lxml-2.2.3-1.1.el6.x86_64
[root@support1 ~]# rpm -qa |grep mod_wsgi
mod_wsgi-3.2-1.el6.x86_64

is that version of lxml going to work?
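Besides `rpm -qa`, a quick way to see which lxml a given interpreter actually picks up is to ask lxml itself. This is a generic check, not SUMO-specific:

```python
# Report the lxml version visible to this Python; useful for comparing
# a RHEL6 webhead against the existing SJC cluster.
try:
    from lxml import etree
    lxml_version = etree.__version__  # version string, e.g. "2.2.3"
except ImportError:
    lxml_version = None  # lxml missing entirely

print(lxml_version)
```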
Comment 19 James Socol [:jsocol, :james] 2011-01-11 09:08:59 PST
I'll need time to test that version. Are you sure 2.2.6 isn't available anywhere?
Comment 20 Corey Shields [:cshields] 2011-01-11 12:25:11 PST
There is a 2.2.6 rpm in fedora 13.  Sounds like if you need 2.2.6 we should build our own, which might be easier than troubleshooting against an older version.
Comment 21 James Socol [:jsocol, :james] 2011-01-11 12:27:02 PST
(In reply to comment #20)
> There is a 2.2.6 rpm in fedora 13.  Sounds like if you need 2.2.6 we should
> build our own, which might be easier than troubleshooting against an older
> version.

What version are we running on the pm-app-sumo cluster? Clearly it works.
Comment 22 Corey Shields [:cshields] 2011-01-11 12:28:57 PST

(In reply to comment #21)
> What version are we running on the pm-app-sumo cluster? Clearly it works.

[root@pm-app-sumo01 ~]# rpm -qa |grep lxml
lxml-2.2.6-1

so yeah...
Comment 23 Corey Shields [:cshields] 2011-01-13 19:49:10 PST
Quick update: database servers are all kickstarted and puppetized.  Replication is running between 1 and 2; I'll set up 3 & 4 tomorrow.

James, we should chat soon about what we will need to do DB wise when it comes time to cutover.
Comment 24 Corey Shields [:cshields] 2011-01-18 14:28:49 PST
James,

we are going to re-do the webheads with rhel5.5 like the others, since it will mean a quicker time to migration.  We don't have the bandwidth to fix and test a new stack with everything else going on right now.
Comment 25 Corey Shields [:cshields] 2011-02-02 07:35:07 PST
Current plan is for SUMO to move to PHX during the 2/10 maintenance window.  We have the RHEL5.5 environment "up" but untested, and the admin functions still need to be moved over.  As discussed on a call this morning, jabba is going to keep working on a RHEL6 + puppet environment for SUMO in PHX.  The drop-dead date for this to be working is EOB Friday (2/4).  If we get it working in time we will switch over the other webheads, which should be no problem using puppet.
Comment 26 Corey Shields [:cshields] 2011-02-10 10:07:19 PST
Steps for today:

1)  add the following to settings_local.py to enable read-only mode:  read_only_mode(globals())  then run /data/sumo/deploy
 
2)  dump the openfire_chat and support_mozilla_com databases from tm-sumo01-master01.mozilla.org
 
3)  copy and import those databases to support1.db.phx1.mozilla.com
 
4)  change the rw and ro db VIPs in SJC to point to the 4 servers in PHX (necessary for dm-chat01 and metrics)
 
5)  Have QA test and check phx1 setup
 
6)  change support.mozilla.com dns from 63.245.209.132 to 63.245.217.50
 
7)  remove read_only_mode(globals()) from settings_local.py, then run /data/sumo/deploy
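Steps 2 and 3 amount to a standard dump-and-import. A rough sketch, using the hosts and database names from the plan above; the flags and credentials handling here are assumptions, not the commands actually run:

```shell
# Illustrative only: dump both databases from the SJC master,
# then create and load them on the PHX db host.
SRC=tm-sumo01-master01.mozilla.org
DST=support1.db.phx1.mozilla.com

for db in openfire_chat support_mozilla_com; do
  # --single-transaction takes a consistent InnoDB snapshot without locking
  mysqldump -h "$SRC" --single-transaction --routines "$db" | gzip > "$db.sql.gz"
done

for db in openfire_chat support_mozilla_com; do
  mysqladmin -h "$DST" create "$db"          # create the empty database first
  gunzip < "$db.sql.gz" | mysql -h "$DST" "$db"
done
```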
Comment 27 James Socol [:jsocol, :james] 2011-02-10 10:23:11 PST
(In reply to comment #26)
> 1)  set the following in settings_local.py to set read_only: 
> read_only_mode(globals())  then /data/sumo/deploy

Note that this must be the *LAST* line of settings_local.py.
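The reason it must be last: a helper called as read_only_mode(globals()) works by mutating the settings module's global namespace, so any assignment after the call could silently override what it just set. A minimal sketch of how such a helper might behave; the names and logic here are illustrative, not SUMO's actual implementation:

```python
# Hypothetical sketch of a read_only_mode() helper like the one
# settings_local.py calls; not SUMO's real code.

def read_only_mode(namespace):
    """Flip a settings namespace into read-only mode by mutating it."""
    namespace['READ_ONLY'] = True
    # Point default DB traffic at a read-only slave, if one is configured.
    slaves = namespace.get('SLAVE_DATABASES', [])
    if slaves and 'DATABASES' in namespace:
        namespace['DATABASES']['default'] = namespace['DATABASES'][slaves[0]]

# Simulated settings_local.py: the call must come *last*, because any
# assignment after it would clobber what read_only_mode() just set.
settings = {
    'READ_ONLY': False,
    'DATABASES': {'default': {'HOST': 'master'}, 'slave': {'HOST': 'slave'}},
    'SLAVE_DATABASES': ['slave'],
}
read_only_mode(settings)
print(settings['READ_ONLY'])             # True
print(settings['DATABASES']['default'])  # {'HOST': 'slave'}
```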
Comment 28 Corey Shields [:cshields] 2011-02-10 20:20:57 PST
SUMO is now live in phx1 and tested successfully..

We will have a postmortem for the move tomorrow at 16:00.
Comment 29 Stephen Donner [:stephend] - PTO; back on 5/28 2011-02-15 18:55:22 PST
All dependent bugs are fixed, and the move has "long-since" been verified.

Verified FIXED.
