Closed Bug 570656 Opened 14 years ago Closed 14 years ago

Push SUMO 2.1 Thursday, 10 June

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

All
Other
task
Not set
major

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: jsocol, Assigned: oremj)

Details

(Whiteboard: 06/10/2010 @ 2pm)

Attachments

(1 file)

Per the webdev releases calendar [1], this bug tracks the SUMO 2.1/1.5.5 release and the discussion forum migration.

This involves moving some data. It takes around 15 minutes for the important part, during which we'll need an outage page. The safe thing is to assume we'll need the outage page up for around an hour, total.

SVN tag for 1.5.5 is coming, but git tag is `2.1`.

Here's the big list of steps:

* Get RabbitMQ set up, and get a git clone checked out to `2.1` up and running celeryd. (See bug 568329. Hopefully this can happen ahead of time?)

* Will need hg for `pip install`. (I know, it's gross, another VCS.)
* Will need java for `./manage.py compress_assets`.

* Outage page.
* `git co 2.1`

* Set some configuration in `settings_local.py`
** EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
** (Other EMAIL_* constants as needed [2])
** BROKER_* and CELERY_* constants as needed. (See bug 568329.)
*** Particularly CELERY_ALWAYS_EAGER = False (on both webnodes and celeryd instance).
** DEBUG = False, TEMPLATE_DEBUG = False

* Need to run a couple commands from the virtualenv:
** `pip install -Ur requirements.txt`
** `schematic migrations/` [3]
** `./manage.py migrate_forum 3 4 5` (will take 5-15 minutes).
** `./manage.py compress_assets`

* svn sw to 1.5.5 tag (coming soon)
* Outage page can come down now.

* One more command on the virtualenv:
** `./manage.py build_avatars` (Will take up to half an hour, but shouldn't affect site up-time while it runs. Needs /tmp to be writeable.)

* Add an Alias to Apache:
    Alias /admin-media/ /path/to/virtualenv/src/django/django/contrib/admin/media/
** Make sure it's readable, etc.

* Flush all caches.
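
The virtualenv commands above have to run in order, and a failure partway through should halt the push rather than carry on. A minimal sketch of a wrapper that does this, assuming it is run from the kitsune checkout with the virtualenv active. The `run_steps` helper and its `runner` parameter are hypothetical and not part of kitsune:

```python
import subprocess

# Commands from the steps above, in the order they must run.
STEPS = [
    ["pip", "install", "-Ur", "requirements.txt"],
    ["schematic", "migrations/"],
    ["./manage.py", "migrate_forum", "3", "4", "5"],
    ["./manage.py", "compress_assets"],
]


def run_steps(steps, runner=subprocess.check_call):
    """Run each command in order and stop at the first failure.

    Returns the list of commands that completed. `runner` is injectable
    so the sequence can be dry-run without touching the system.
    """
    done = []
    for cmd in steps:
        print("running:", " ".join(cmd))
        runner(cmd)  # check_call raises CalledProcessError on non-zero exit
        done.append(cmd)
    return done


# Dry run: prints each command without executing anything.
run_steps(STEPS, runner=lambda cmd: None)
```

Calling `run_steps(STEPS)` with the default runner would execute the commands for real, so the dry run is the safer thing to try first.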

I am fairly sure that's everything. If anyone remembers something I've forgotten, please add it here.

Then IT is done and we have some dev stuff to take care of.
* Update default site to support.mozilla.com.
* Make sure ForumModerators group has necessary permissions.


[1] https://mail.mozilla.com/home/morgamic@mozilla.com/Webdev%20Releases.html
[2] http://docs.djangoproject.com/en/dev/topics/email/#smtp-backend
[3] I really hope it's this easy. I have a patch to make it this easy. Otherwise, I'll walk you through the slightly worse version.
Forgot two things:

1) There is a new sphinx.conf in SVN (and in git, under configs/sphinx/). We'll also need to update that and reindex.

2) We'll also need to set the ADMIN_MEDIA_PREFIX in settings_local.py. I have

    MEDIA_URL = '//support.mozilla.com/media/'
    ADMIN_MEDIA_PREFIX = '//support.mozilla.com/admin-media/'
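
Taken together with the settings listed in comment 0, the relevant part of `settings_local.py` might look roughly like the following. This is only a sketch: the EMAIL_* and BROKER_* values are placeholders assumed for illustration (the real ones depend on the mail setup and on bug 568329), and only the constants named in this bug come from it.

```python
# settings_local.py (sketch; placeholder values are marked as assumptions)

# Mail: deliver through SMTP.
EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
EMAIL_HOST = 'localhost'      # assumption: replace with the real SMTP host
EMAIL_PORT = 25               # assumption

# Celery/RabbitMQ: placeholders, see bug 568329 for the real values.
BROKER_HOST = 'localhost'     # assumption
BROKER_PORT = 5672            # assumption: RabbitMQ's default AMQP port
BROKER_USER = 'kitsune'       # assumption
BROKER_PASSWORD = 'changeme'  # assumption
BROKER_VHOST = 'kitsune'      # assumption
CELERY_ALWAYS_EAGER = False   # tasks go to celeryd instead of running inline

DEBUG = False
TEMPLATE_DEBUG = False

MEDIA_URL = '//support.mozilla.com/media/'
ADMIN_MEDIA_PREFIX = '//support.mozilla.com/admin-media/'
```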
4pm?
Flags: needs-downtime+
Whiteboard: 06/08/2010 @ 4pm
Works for me.
Depends on: 568329
User impacting?

oremj, can you do this?
Assignee: server-ops → jeremy.orem+bugs
Yeah, I can grab this.
Forgot one more (easy) bit:

Set up a cron job to run the `./manage.py build_avatars` command once a day. support-stage-new does it at 1:15am PT which seems fine. (It's much shorter after the first run.)
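
For reference, a crontab entry for 1:15am has `15 1` in the minute and hour fields. A small sketch that builds such a line. Both paths are placeholders, and since cron uses the server's local time this assumes the server clock is on Pacific Time:

```python
def build_avatars_cron_line(checkout="/path/to/kitsune",
                            python="/path/to/virtualenv/bin/python",
                            hour=1, minute=15):
    """Return a crontab line that runs build_avatars once a day.

    Both default paths are placeholders, not the production paths.
    """
    command = "cd {0} && {1} manage.py build_avatars".format(checkout, python)
    # crontab fields: minute hour day-of-month month day-of-week command
    return "{0} {1} * * * {2}".format(minute, hour, command)


print(build_avatars_cron_line())
```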
After an hour of attempting this, we've reverted. We're going to look into the errors we saw during the push, and into why we never saw them with the data we had available for testing.
The first problem we saw was an unexpected schema:

| forums_thread | CREATE TABLE `forums_thread` (
  `id` int(11) NOT NULL auto_increment,
  `title` varchar(255) collate utf8_unicode_ci NOT NULL,
  `forum_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `creator_id` int(11) NOT NULL,
  `last_post_id` int(11) default NULL,
  `replies` int(11) NOT NULL,
  `is_locked` tinyint(1) NOT NULL,
  `is_sticky` tinyint(1) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `forums_thread_forum_id` (`forum_id`),
  KEY `forums_thread_created` (`created`),
  KEY `forums_thread_creator_id` (`creator_id`),
  KEY `forums_thread_last_post_id` (`last_post_id`),
  KEY `forums_thread_is_sticky` (`is_sticky`),
  CONSTRAINT `creator_id_refs_id_4938e584` FOREIGN KEY (`creator_id`) REFERENCES `auth_user` (`id`),
  CONSTRAINT `forum_id_refs_id_7f5fd759` FOREIGN KEY (`forum_id`) REFERENCES `forums_forum` (`id`),
  CONSTRAINT `last_post_id_refs_id_3fa89f33` FOREIGN KEY (`last_post_id`) REFERENCES `forums_post` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci | 


mysql> show create table forums_post;  

| forums_post | CREATE TABLE `forums_post` (
  `id` int(11) NOT NULL auto_increment,
  `thread_id` int(11) NOT NULL,
  `content` longtext collate utf8_unicode_ci NOT NULL,
  `author_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `updated` datetime NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `forums_post_thread_id` (`thread_id`),
  KEY `forums_post_author_id` (`author_id`),
  KEY `forums_post_created` (`created`),
  KEY `forums_post_updated` (`updated`),
  CONSTRAINT `author_id_refs_id_59fe2704` FOREIGN KEY (`author_id`) REFERENCES `auth_user` (`id`),
  CONSTRAINT `thread_id_refs_id_5646bc53` FOREIGN KEY (`thread_id`) REFERENCES `forums_thread` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
The problem we couldn't immediately work around was the following stack trace during our migrate_forum step:

Starting migration for forum "Contributors" (3)
Created forum "Contributors" (1)...
Processing thread 1529...
Traceback (most recent call last):
  File "./manage.py", line 36, in <module>
    execute_manager(settings)
  File "/data/virtualenvs/kitsune/src/django/django/core/management/__init__.py", line 438, in execute_manager
    utility.execute()
  File "/data/virtualenvs/kitsune/src/django/django/core/management/__init__.py", line 379, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/data/virtualenvs/kitsune/src/django/django/core/management/base.py", line 195, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/data/virtualenvs/kitsune/src/django/django/core/management/base.py", line 222, in execute
    output = self.handle(*args, **options)
  File "/data/www/support.mozilla.com/kitsune/apps/forums/management/commands/migrate_forum.py", line 193, in handle
    last_post = thread.post_set.order_by('-created')[0]
  File "/data/virtualenvs/kitsune/src/django/django/db/models/query.py", line 187, in __getitem__
    return list(qs)[0]
IndexError: list index out of range
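
The last frames show the failure: `order_by('-created')[0]` is evaluated as `list(qs)[0]`, which raises IndexError when the query returns no posts for the thread. A plain-Python illustration of that failure mode and a guarded alternative, with no Django involved. `latest_post` and the dict-based posts are hypothetical stand-ins, not the migrate_forum code:

```python
def latest_post(posts):
    """Return the most recently created post, or None if there are none."""
    ordered = sorted(posts, key=lambda p: p["created"], reverse=True)
    # Indexing ordered[0] unconditionally is what fails on an empty result.
    return ordered[0] if ordered else None


posts = [
    {"id": 1, "created": "2010-06-08 16:01:00"},
    {"id": 2, "created": "2010-06-08 16:05:00"},
]
print(latest_post(posts)["id"])  # 2
print(latest_post([]))           # None

# The unguarded form reproduces the error from the traceback.
try:
    sorted([], key=lambda p: p["created"], reverse=True)[0]
except IndexError as exc:
    print("IndexError:", exc)    # IndexError: list index out of range
```

A guard like this avoids the crash but does not explain why a thread that was just migrated appeared to have no posts.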
Another problem we hadn't encountered before was mentioned by timellis:
[17:17]	<timellis> Hi. Someone killed the SUMO master in Phoenix with this statement:
[17:17]	<timellis> Error 'Error on rename of './support_mozilla_com/forums_thread' to './support_mozilla_com/#sql2-12b1-4833' (errno: 152)' on query. Default database: 'support_mozilla_com'. Query: 'alter table forums_thread drop foreign key last_post_id_refs_id_3fa89f33
[17:17]	<timellis> The reason is thus:
[17:17]	<timellis> "Cannot delete a parent row"
[17:18]	<timellis> The Phoenix SUMO master is a slave of the SJ SUMO master.

Neither James nor I was aware of this slave master, and we're still not sure why a command that ran fine on the SJ master failed on the Phoenix one. One assumption is that the two weren't in sync (with the SJ master having the unexpected schema from comment 9).
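
One way to test the out-of-sync assumption would be to capture `SHOW CREATE TABLE` output on each server and diff the two. A minimal sketch using Python's difflib. The two schema strings here are shortened, made-up examples rather than output from either master, and the server labels are only for display:

```python
import difflib


def schema_diff(a, b, a_name="sj-master", b_name="phx-master"):
    """Return unified-diff lines between two SHOW CREATE TABLE outputs."""
    return list(difflib.unified_diff(
        a.splitlines(), b.splitlines(),
        fromfile=a_name, tofile=b_name, lineterm=""))


# Illustrative inputs only; real input would be pasted from each server.
sj = """CREATE TABLE `forums_thread` (
  `id` int(11) NOT NULL auto_increment,
  `last_post_id` int(11) default NULL,
  KEY `forums_thread_last_post_id` (`last_post_id`)
)"""
phx = """CREATE TABLE `forums_thread` (
  `id` int(11) NOT NULL auto_increment,
  `last_post_id` int(11) default NULL
)"""

for line in schema_diff(sj, phx):
    print(line)
```

An empty diff would rule out schema drift between the two servers; it would say nothing about whether the row data matched.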
The migration code has been modified to keep track of each thread's last_post as its posts are created, removing the need to ask the database for it afterwards. This *should* eliminate the race condition with the master/slaves.

http://github.com/jsocol/kitsune/commit/243433b2c6dfc08381cd0fe5bbbcf688d39cb5c7
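
As an illustration of the approach (not the code from the commit above), tracking the newest post while posts are being created means the last post is already known in memory when the thread is saved. The data shapes and the `migrate_thread` name are hypothetical:

```python
def migrate_thread(old_posts):
    """Copy posts into a new thread, tracking the newest one as we go.

    Returns (new_posts, last_post). last_post is None for an empty
    thread, so no follow-up query against the database is needed.
    """
    new_posts = []
    last_post = None
    for old in old_posts:
        post = {"content": old["content"], "created": old["created"]}
        new_posts.append(post)  # stands in for saving the new post
        if last_post is None or post["created"] > last_post["created"]:
            last_post = post
    return new_posts, last_post


posts, last = migrate_thread([
    {"content": "first", "created": "2010-06-01 10:00:00"},
    {"content": "reply", "created": "2010-06-02 09:30:00"},
])
print(last["content"])  # reply
```

Because nothing is read back after the writes, the result no longer depends on whether a replica has caught up.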
We believe we've got everything ironed out and ready to go tomorrow afternoon. Let's plan on getting on

the phone: 92, 309#
IRC: #sumodev
Summary: Push SUMO 2.1 Tuesday, 8 June → Push SUMO 2.1 Tuesday, 10 June
Whiteboard: 06/08/2010 @ 4pm → 06/10/2010 @ 4pm, needs-downtime+
Summary: Push SUMO 2.1 Tuesday, 10 June → Push SUMO 2.1 Thursday, 10 June
Duration 2 hrs?
Whiteboard: 06/10/2010 @ 4pm, needs-downtime+ → 06/10/2010 @ 4pm
(In reply to comment #14)
> Duration 2 hrs?

Yep.
Let's start at 2 or 3 this time.
(In reply to comment #16)
> Let's start at 2 or 3 this time.

2 WFM. QA?
2PM sounds _great_ to me.
(In reply to comment #18)
> 2PM

Moved to 2pm on the Webdev:Releases calendar.
Whiteboard: 06/10/2010 @ 4pm → 06/10/2010 @ 2pm
Depends on: 571283
UPDATED INSTRUCTIONS!

So we've got slightly updated instructions, since much of this is still done from Tuesday.

* We still need an Outage page up for the duration of the migration.

* `git co 2.1.1` for both web servers and celeryd instance. (Note the new tag)
** Reload celeryd

* Make sure settings_local.py is still configured correctly: (see comment 0)

* Clean up from yesterday:
** SQL: `TRUNCATE TABLE forums_post; TRUNCATE TABLE forums_thread; TRUNCATE TABLE forums_forum;`

* Need to run a couple commands from the virtualenv:
** `./manage.py migrate_forum 3 4 5` (will take 5-15 minutes).
** `./manage.py compress_assets`
*** Make sure that both the generated JS/CSS and the generated build.py (next to settings.py) get synced out.

* svn sw to 1.5.5 tag (see comment 7)
* run webroot/htaccess.sh in SVN.

* One more command on the virtualenv:
** `./manage.py build_avatars` (Will take up to half an hour, but shouldn't affect site up-time while it runs. Needs /tmp to be writeable.)

* Outage page can come down now.

* Make sure this alias is there.
    Alias /admin-media/ /path/to/virtualenv/src/django/django/contrib/admin/media/
** Make sure it's readable, etc.

* Flush all caches.

* Make sure to update Sphinx again as well.
 git fetch
remote: Counting objects: 186, done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 124 (delta 67), reused 12 (delta 7)
Receiving objects: 100% (124/124), 59.39 KiB, done.
Resolving deltas: 100% (67/67), completed with 28 local objects.
From http://github.com/jsocol/kitsune
 + c803754...0f0484e 561530-logging -> origin/561530-logging  (forced update)
   572ed80..4d61e5c  development -> origin/development
   93d393c..8b8e5b9  master     -> origin/master
   e291839..94279b7  questions  -> origin/questions
 * [new branch]      sphinx-doc -> origin/sphinx-doc
 * [new tag]         2.1.1      -> 2.1.1
[root@mradm02 prod]# git checkout 2.1.1
Previous HEAD position was cb19e8c... Adding WebTrends meta tags and test for them.
HEAD is now at 8b8e5b9... Merge branch 'development'
svn switch https://svn.mozilla.org/projects/sumo/tags/1.5/1.5.5_r68485_20100608
A    webroot/lang/ilo
A    webroot/lang/ilo/language.php
A    webroot/lang/ilo/index.php
U    webroot/lang/langmapping.php
U    webroot/lib/commentslib.php
U    webroot/tiki-login.php
A    webroot/django_utils.php
U    webroot/tiki-change_password.php
U    webroot/htaccess.dist
U    scripts/sphinx/sphinx.conf
Updated to revision 68644.
Attached file migrate_forum output
Trevor updated the sphinx config.
We ran into a pretty serious architectural issue in the code related to replication. We need to re-examine and we'll take another run at this.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → INCOMPLETE
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard