Bug 570656 - Push SUMO 2.1 Thursday, 10 June
Status: RESOLVED INCOMPLETE
06/10/2010 @ 2pm
Product: Infrastructure & Operations
Classification: Other
Component: WebOps: Other
: other
: All Other
: -- major (vote)
: ---
Assigned To: Jeremy Orem [:oremj]
: matthew zeier [:mrz]
Mentors:
http://support.mozilla.com/
Depends on: 568329 571283
Blocks:
Reported: 2010-06-07 20:55 PDT by James Socol [:jsocol, :james]
Modified: 2013-10-09 10:29 PDT
mzeier: needs‑downtime+


Attachments
migrate_forum output (25.79 KB, application/x-gzip)
2010-06-10 14:35 PDT, Jeremy Orem [:oremj]

Description James Socol [:jsocol, :james] 2010-06-07 20:55:53 PDT
Per the webdev releases calendar[1], this bug tracks the SUMO 2.1/1.5.5 release and discussion forum migration.

This involves moving some data. It takes around 15 minutes for the important part, during which we'll need an outage page. The safe thing is to assume we'll need the outage page up for around an hour, total.

SVN tag for 1.5.5 is coming, but git tag is `2.1`.

Here's the big list of steps:

* Get RabbitMQ set up, and get a git clone checked out to `2.1` up and running celeryd. (See bug 568329. Hopefully this can happen ahead of time?)

* Will need hg for `pip install`. (I know, it's gross, another VCS.)
* Will need java for `./manage.py compress_assets`.

* Outage page.
* `git co 2.1`

* Set some configuration in `settings_local.py`
** EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
** (Other EMAIL_* constants as needed [2])
** BROKER_* and CELERY_* constants as needed. (See bug 568329.)
*** Particularly CELERY_ALWAYS_EAGER = False (on both webnodes and celeryd instance).
** DEBUG = False, TEMPLATE_DEBUG = False

* Need to run a couple commands from the virtualenv:
** `pip install -Ur requirements.txt`
** `schematic migrations/` [3]
** `./manage.py migrate_forum 3 4 5` (will take 5-15 minutes).
** `./manage.py compress_assets`

* svn sw to 1.5.5 tag (coming soon)
* Outage page can come down now.

* One more command on the virtualenv:
** `./manage.py build_avatars` (Will take up to half an hour, but shouldn't affect site up-time while it runs. Needs /tmp to be writeable.)

* Add an Alias to Apache:
    Alias /admin-media/ /path/to/virtualenv/src/django/django/contrib/admin/media/
** Make sure it's readable, etc.

* Flush all caches.

I am fairly sure that's everything. If anyone remembers something I've forgotten, please add it here.

Then IT is done and we have some dev stuff to take care of.
* Update default site to support.mozilla.com.
* Make sure ForumModerators group has necessary permissions.


[1] https://mail.mozilla.com/home/morgamic@mozilla.com/Webdev%20Releases.html
[2] http://docs.djangoproject.com/en/dev/topics/email/#smtp-backend
[3] I really hope it's this easy. I have a patch to make it this easy. Otherwise, I'll walk you through the slightly worse version.
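
The `settings_local.py` values listed above could be sanity-checked before the outage window starts. A minimal sketch, assuming the settings are available as a plain dict; `check_release_settings` and `EXPECTED` are hypothetical names, not part of kitsune, and only the constants spelled out in this comment are covered (the `BROKER_*` and other `EMAIL_*` values depend on bug 568329 and the mail setup, so they are left out).

```python
# Hypothetical pre-push check; not part of kitsune.
# Only the constants given explicitly in the steps above are listed.
EXPECTED = {
    "EMAIL_BACKEND": "django.core.mail.backends.smtp.EmailBackend",
    "CELERY_ALWAYS_EAGER": False,
    "DEBUG": False,
    "TEMPLATE_DEBUG": False,
}


def check_release_settings(settings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, expected in EXPECTED.items():
        if name not in settings:
            problems.append("%s is not set" % name)
        elif settings[name] != expected:
            problems.append(
                "%s is %r, expected %r" % (name, settings[name], expected)
            )
    return problems
```

One way to build the dict is to call `vars()` on the imported `settings_local` module. The same check would need to run on both the webnodes and the celeryd instance, since `CELERY_ALWAYS_EAGER` matters on each.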
Comment 1 James Socol [:jsocol, :james] 2010-06-07 21:01:20 PDT
Forgot two things:

1) There is a new sphinx.conf in SVN (and in git, under configs/sphinx/). We'll also need to update that and reindex.

2) We'll also need to set the ADMIN_MEDIA_PREFIX in settings_local.py. I have

    MEDIA_URL = '//support.mozilla.com/media/'
    ADMIN_MEDIA_PREFIX = '//support.mozilla.com/admin-media/'
Comment 2 matthew zeier [:mrz] 2010-06-07 21:04:07 PDT
4pm?
Comment 3 James Socol [:jsocol, :james] 2010-06-07 21:50:22 PDT
Works for me.
Comment 4 matthew zeier [:mrz] 2010-06-08 09:10:02 PDT
User impacting?

oremj, can you do this?
Comment 5 Jeremy Orem [:oremj] 2010-06-08 11:40:30 PDT
Yeah, I can grab this.
Comment 6 James Socol [:jsocol, :james] 2010-06-08 13:24:15 PDT
Forgot one more (easy) bit:

Set up a cron job to run the `./manage.py build_avatars` command once a day. support-stage-new does it at 1:15am PT which seems fine. (It's much shorter after the first run.)
Comment 7 James Socol [:jsocol, :james] 2010-06-08 16:36:59 PDT
SVN tag: https://svn.mozilla.org/projects/sumo/tags/1.5/1.5.5_r68485_20100608
Comment 8 James Socol [:jsocol, :james] 2010-06-08 17:38:55 PDT
After an hour of attempting this, we've reverted. We're going to look into the errors we saw during the push, and why we never saw them with the data we had available for testing.
Comment 9 James Socol [:jsocol, :james] 2010-06-09 09:35:05 PDT
The first problem we saw was an unexpected schema:

| forums_thread | CREATE TABLE `forums_thread` (
  `id` int(11) NOT NULL auto_increment,
  `title` varchar(255) collate utf8_unicode_ci NOT NULL,
  `forum_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `creator_id` int(11) NOT NULL,
  `last_post_id` int(11) default NULL,
  `replies` int(11) NOT NULL,
  `is_locked` tinyint(1) NOT NULL,
  `is_sticky` tinyint(1) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `forums_thread_forum_id` (`forum_id`),
  KEY `forums_thread_created` (`created`),
  KEY `forums_thread_creator_id` (`creator_id`),
  KEY `forums_thread_last_post_id` (`last_post_id`),
  KEY `forums_thread_is_sticky` (`is_sticky`),
  CONSTRAINT `creator_id_refs_id_4938e584` FOREIGN KEY (`creator_id`) REFERENCES `auth_user` (`id`),
  CONSTRAINT `forum_id_refs_id_7f5fd759` FOREIGN KEY (`forum_id`) REFERENCES `forums_forum` (`id`),
  CONSTRAINT `last_post_id_refs_id_3fa89f33` FOREIGN KEY (`last_post_id`) REFERENCES `forums_post` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci | 


mysql> show create table forums_post;  

| forums_post | CREATE TABLE `forums_post` (
  `id` int(11) NOT NULL auto_increment,
  `thread_id` int(11) NOT NULL,
  `content` longtext collate utf8_unicode_ci NOT NULL,
  `author_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `updated` datetime NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `forums_post_thread_id` (`thread_id`),
  KEY `forums_post_author_id` (`author_id`),
  KEY `forums_post_created` (`created`),
  KEY `forums_post_updated` (`updated`),
  CONSTRAINT `author_id_refs_id_59fe2704` FOREIGN KEY (`author_id`) REFERENCES `auth_user` (`id`),
  CONSTRAINT `thread_id_refs_id_5646bc53` FOREIGN KEY (`thread_id`) REFERENCES `forums_thread` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
Comment 10 James Socol [:jsocol, :james] 2010-06-09 09:35:33 PDT
The problem we couldn't immediately work around was the following stack trace during our migrate_forum step:

Starting migration for forum "Contributors" (3)
Created forum "Contributors" (1)...
Processing thread 1529...
Traceback (most recent call last):
  File "./manage.py", line 36, in <module>
    execute_manager(settings)
  File "/data/virtualenvs/kitsune/src/django/django/core/management/__init__.py", line 438, in execute_manager
    utility.execute()
  File "/data/virtualenvs/kitsune/src/django/django/core/management/__init__.py", line 379, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/data/virtualenvs/kitsune/src/django/django/core/management/base.py", line 195, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/data/virtualenvs/kitsune/src/django/django/core/management/base.py", line 222, in execute
    output = self.handle(*args, **options)
  File "/data/www/support.mozilla.com/kitsune/apps/forums/management/commands/migrate_forum.py", line 193, in handle
    last_post = thread.post_set.order_by('-created')[0]
  File "/data/virtualenvs/kitsune/src/django/django/db/models/query.py", line 187, in __getitem__
    return list(qs)[0]
IndexError: list index out of range
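
The last frame shows the immediate cause: `list(qs)[0]` raises `IndexError` when the query returns no rows, so the post lookup for the thread being processed appears to have come back empty. A minimal sketch of a guarded lookup, using plain dicts in place of Django model instances; `newest_post` is a hypothetical helper, not code from migrate_forum.py.

```python
def newest_post(posts):
    """Return the post with the latest 'created' value, or None if empty.

    Mirrors order_by('-created')[0], but does not raise on an empty result.
    """
    ordered = sorted(posts, key=lambda post: post["created"], reverse=True)
    return ordered[0] if ordered else None


# Indexing an empty list is what produces the error in the traceback.
try:
    [][0]
except IndexError as exc:
    message = str(exc)
```

A guard like this only hides the symptom. It does not explain why the posts were missing at query time.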
Comment 11 Paul Craciunoiu [:paulc] 2010-06-09 10:47:36 PDT
Another problem we hadn't encountered before was mentioned by timellis:
[17:17]	<timellis> Hi. Someone killed the SUMO master in Phoenix with this statement:
[17:17]	<timellis> Error 'Error on rename of './support_mozilla_com/forums_thread' to './support_mozilla_com/#sql2-12b1-4833' (errno: 152)' on query. Default database: 'support_mozilla_com'. Query: 'alter table forums_thread drop foreign key last_post_id_refs_id_3fa89f33
[17:17]	<timellis> The reason is thus:
[17:17]	<timellis> "Cannot delete a parent row"
[17:18]	<timellis> The Phoenix SUMO master is a slave of the SJ SUMO master.

Neither James nor I was aware of this slave/master arrangement, and we're still not sure why a command that ran fine on the SJ master failed on the Phoenix one. One assumption is that the two weren't in sync (with the SJ master having the unexpected schema from comment 9).
Comment 12 Ricky Rosario [:rrosario, :r1cky] 2010-06-09 11:29:41 PDT
The migration code has been modified to keep track of each thread's last_post as its posts are created, removing the need to go ask the database for it afterwards. This *should* eliminate the race condition with the master/slaves.

http://github.com/jsocol/kitsune/commit/243433b2c6dfc08381cd0fe5bbbcf688d39cb5c7
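
The idea can be sketched as follows. This is an illustration with plain Python objects, not the code from the linked commit; `Post`, `Thread`, and `migrate_thread` are hypothetical names, and `created` is an integer stand-in for a datetime.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Post:
    content: str
    created: int  # stand-in for a datetime


@dataclass
class Thread:
    title: str
    posts: List[Post] = field(default_factory=list)
    last_post: Optional[Post] = None


def migrate_thread(title, old_posts):
    """Copy (content, created) pairs into a new Thread.

    The newest post is remembered while the posts are created, so no
    follow-up read is needed to find it.
    """
    thread = Thread(title=title)
    for content, created in old_posts:
        post = Post(content=content, created=created)
        thread.posts.append(post)
        # Track the newest post in memory instead of re-querying afterwards.
        if thread.last_post is None or post.created >= thread.last_post.created:
            thread.last_post = post
    return thread
```

Because nothing is read back after the writes, a replica that lags behind the master can no longer cause an empty result at this step.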
Comment 13 James Socol [:jsocol, :james] 2010-06-09 15:15:53 PDT
We believe we've got everything ironed out and ready to go tomorrow afternoon. Let's plan on getting on

the phone: 92, 309#
IRC: #sumodev
Comment 14 matthew zeier [:mrz] 2010-06-09 16:43:18 PDT
Duration 2 hrs?
Comment 15 James Socol [:jsocol, :james] 2010-06-09 16:47:30 PDT
(In reply to comment #14)
> Duration 2 hrs?

Yep.
Comment 16 Jeremy Orem [:oremj] 2010-06-09 16:55:16 PDT
Let's start at 2 or 3 this time.
Comment 17 James Socol [:jsocol, :james] 2010-06-09 16:55:54 PDT
(In reply to comment #16)
> Let's start at 2 or 3 this time.

2 WFM. QA?
Comment 18 Stephen Donner [:stephend] - PTO; back on 5/28 2010-06-09 16:57:50 PDT
2PM sounds _great_ to me.
Comment 19 James Socol [:jsocol, :james] 2010-06-09 17:05:45 PDT
(In reply to comment #18)
> 2PM

Moved to 2pm on the Webdev:Releases calendar.
Comment 20 James Socol [:jsocol, :james] 2010-06-10 11:38:58 PDT
UPDATED INSTRUCTIONS!

So we've got slightly updated instructions, since much of this is still done from Tuesday.

* We still need an Outage page up for the duration of the migration.

* `git co 2.1.1` for both web servers and celeryd instance. (Note the new tag)
** Reload celeryd

* Make sure settings_local.py is still configured correctly: (see comment 0)

* Clean up from yesterday:
** SQL: `TRUNCATE TABLE forums_post; TRUNCATE TABLE forums_thread; TRUNCATE TABLE forums_forum;`

* Need to run a couple commands from the virtualenv:
** `./manage.py migrate_forum 3 4 5` (will take 5-15 minutes).
** `./manage.py compress_assets`
*** Make sure that both the generated JS/CSS and the generated build.py (next to settings.py) get synced out.

* svn sw to 1.5.5 tag (see comment 7)
* run webroot/htaccess.sh in SVN.

* One more command on the virtualenv:
** `./manage.py build_avatars` (Will take up to half an hour, but shouldn't affect site up-time while it runs. Needs /tmp to be writeable.)

* Outage page can come down now.

* Make sure this alias is there.
    Alias /admin-media/ /path/to/virtualenv/src/django/django/contrib/admin/media/
** Make sure it's readable, etc.

* Flush all caches.

* Make sure to update Sphinx again as well.
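
The git and `manage.py` commands in the steps above lend themselves to a small wrapper that runs them in order and stops at the first failure. A rough sketch, assuming Python 3 and a working directory inside the kitsune checkout with the virtualenv active; `STEPS` and `run_steps` are hypothetical names. The outage page, SQL cleanup, `svn sw`, htaccess.sh, Apache alias, cache flush, and Sphinx update are not modeled.

```python
import subprocess

# Ordered (description, argv) pairs taken from the instructions above.
STEPS = [
    ("check out the new tag", ["git", "checkout", "2.1.1"]),
    ("migrate forums 3, 4 and 5", ["./manage.py", "migrate_forum", "3", "4", "5"]),
    ("compress assets", ["./manage.py", "compress_assets"]),
    ("build avatars", ["./manage.py", "build_avatars"]),
]


def run_steps(steps, dry_run=True):
    """Run each step in order and return the command lines handled.

    With dry_run=True nothing is executed. Otherwise the first command
    that exits non-zero raises CalledProcessError and later steps are skipped.
    """
    handled = []
    for description, argv in steps:
        handled.append(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
    return handled
```

Calling `run_steps(STEPS)` with the default `dry_run=True` only returns the command lines, which makes it easy to review the order before the window.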
Comment 21 Jeremy Orem [:oremj] 2010-06-10 14:09:56 PDT
 git fetch
remote: Counting objects: 186, done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 124 (delta 67), reused 12 (delta 7)
Receiving objects: 100% (124/124), 59.39 KiB, done.
Resolving deltas: 100% (67/67), completed with 28 local objects.
From http://github.com/jsocol/kitsune
 + c803754...0f0484e 561530-logging -> origin/561530-logging  (forced update)
   572ed80..4d61e5c  development -> origin/development
   93d393c..8b8e5b9  master     -> origin/master
   e291839..94279b7  questions  -> origin/questions
 * [new branch]      sphinx-doc -> origin/sphinx-doc
 * [new tag]         2.1.1      -> 2.1.1
[root@mradm02 prod]# git checkout 2.1.1
Previous HEAD position was cb19e8c... Adding WebTrends meta tags and test for them.
HEAD is now at 8b8e5b9... Merge branch 'development'
Comment 22 Jeremy Orem [:oremj] 2010-06-10 14:24:03 PDT
svn switch https://svn.mozilla.org/projects/sumo/tags/1.5/1.5.5_r68485_20100608
A    webroot/lang/ilo
A    webroot/lang/ilo/language.php
A    webroot/lang/ilo/index.php
U    webroot/lang/langmapping.php
U    webroot/lib/commentslib.php
U    webroot/tiki-login.php
A    webroot/django_utils.php
U    webroot/tiki-change_password.php
U    webroot/htaccess.dist
U    scripts/sphinx/sphinx.conf
Updated to revision 68644.
Comment 23 Jeremy Orem [:oremj] 2010-06-10 14:35:31 PDT
Created attachment 450456 [details]
migrate_forum output
Comment 24 Jeremy Orem [:oremj] 2010-06-10 14:42:58 PDT
Trevor updated the sphinx config.
Comment 25 James Socol [:jsocol, :james] 2010-06-10 15:26:25 PDT
We ran into a pretty serious architectural issue in the code related to replication. We need to re-examine and we'll take another run at this.
