Push SUMO 2.1 Thursday, 10 June

RESOLVED INCOMPLETE

Status

Infrastructure & Operations
WebOps: Other
--
major
RESOLVED INCOMPLETE
7 years ago
4 years ago

People

(Reporter: jsocol, Assigned: oremj)

Tracking

other
All
Other
Bug Flags:
needs-downtime +

Details

(Whiteboard: 06/10/2010 @ 2pm, URL)

Attachments

(1 attachment)

(Reporter)

Description

7 years ago
Per the webdev releases calendar[1], this bug covers the SUMO 2.1/1.5.5 release and the discussion forum migration.

This involves moving some data. It takes around 15 minutes for the important part, during which we'll need an outage page. The safe thing is to assume we'll need the outage page up for around an hour, total.

SVN tag for 1.5.5 is coming, but git tag is `2.1`.

Here's the big list of steps:

* Get RabbitMQ set up, and get a git clone checked out to `2.1` up and running celeryd. (See bug 568329. Hopefully this can happen ahead of time?)

* Will need hg for `pip install`. (I know, it's gross, another VCS.)
* Will need java for `./manage.py compress_assets`.

* Outage page.
* `git co 2.1`

* Set some configuration in `settings_local.py`
** EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
** (Other EMAIL_* constants as needed [2])
** BROKER_* and CELERY_* constants as needed. (See bug 568329.)
*** Particularly CELERY_ALWAYS_EAGER = False (on both webnodes and celeryd instance).
** DEBUG = False, TEMPLATE_DEBUG = False

* Need to run a couple commands from the virtualenv:
** `pip install -Ur requirements.txt`
** `schematic migrations/` [3]
** `./manage.py migrate_forum 3 4 5` (will take 5-15 minutes).
** `./manage.py compress_assets`

* svn sw to 1.5.5 tag (coming soon)
* Outage page can come down now.

* One more command on the virtualenv:
** `./manage.py build_avatars` (Will take up to half an hour, but shouldn't affect site up-time while it runs. Needs /tmp to be writeable.)

* Add an Alias to Apache:
    Alias /admin-media/ /path/to/virtualenv/src/django/django/contrib/admin/media/
** Make sure it's readable, etc.

* Flush all caches.

I am fairly sure that's everything. If anyone remembers something I've forgotten, please add it here.
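
As a rough illustration only, the four virtualenv commands listed above could be driven from a small Python script that stops at the first failure, so a broken `schematic` run never leads into `migrate_forum`. The names `VIRTUALENV_STEPS` and `run_steps` are made up for this sketch and are not part of kitsune; it assumes it is run from the checkout with the virtualenv active.

```python
import subprocess

# Commands copied from the step list above, in the order given there.
VIRTUALENV_STEPS = [
    ["pip", "install", "-Ur", "requirements.txt"],
    ["schematic", "migrations/"],
    ["./manage.py", "migrate_forum", "3", "4", "5"],
    ["./manage.py", "compress_assets"],
]


def run_steps(steps, runner=subprocess.call):
    """Run each command in order and stop at the first non-zero exit code.

    `runner` takes an argument list and returns an exit code; it is
    injectable so the ordering logic can be exercised without running
    anything. Returns the list of commands that completed successfully.
    """
    completed = []
    for cmd in steps:
        if runner(cmd) != 0:
            break
        completed.append(cmd)
    return completed
```

Calling `run_steps(VIRTUALENV_STEPS)` and comparing the length of the result with `len(VIRTUALENV_STEPS)` shows whether every step finished; the outage page should stay up if it did not.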

Then IT is done and we have some dev stuff to take care of.
* Update default site to support.mozilla.com.
* Make sure ForumModerators group has necessary permissions.


[1] https://mail.mozilla.com/home/morgamic@mozilla.com/Webdev%20Releases.html
[2] http://docs.djangoproject.com/en/dev/topics/email/#smtp-backend
[3] I really hope it's this easy. I have a patch to make it this easy. Otherwise, I'll walk you through the slightly worse version.
(Reporter)

Comment 1

7 years ago
Forgot two things:

1) There is a new sphinx.conf in SVN (and in git, under configs/sphinx/). We'll also need to update that and reindex.

2) We'll also need to set the ADMIN_MEDIA_PREFIX in settings_local.py. I have

    MEDIA_URL = '//support.mozilla.com/media/'
    ADMIN_MEDIA_PREFIX = '//support.mozilla.com/admin-media/'
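
Putting this together with the settings from comment 0, a `settings_local.py` might look roughly like the sketch below. Only `EMAIL_BACKEND`, `CELERY_ALWAYS_EAGER`, `DEBUG`, `TEMPLATE_DEBUG`, `MEDIA_URL` and `ADMIN_MEDIA_PREFIX` carry values from this bug. Every value marked PLACEHOLDER is an assumption, and the `BROKER_*` names are assumed to match the Celery version in use; the real ones belong to bug 568329.

```python
# settings_local.py (sketch; PLACEHOLDER values are assumptions, not real config)

DEBUG = False
TEMPLATE_DEBUG = False

# Send mail through SMTP instead of a development backend.
EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
EMAIL_HOST = 'localhost'        # PLACEHOLDER
EMAIL_PORT = 25                 # PLACEHOLDER

# Broker connection for celeryd; see bug 568329 for the real values.
BROKER_HOST = 'localhost'       # PLACEHOLDER
BROKER_PORT = 5672              # PLACEHOLDER (RabbitMQ's default port)
BROKER_USER = 'kitsune'         # PLACEHOLDER
BROKER_PASSWORD = 'change-me'   # PLACEHOLDER
BROKER_VHOST = 'kitsune'        # PLACEHOLDER

# Must be False on both the webnodes and the celeryd instance so that
# tasks are queued to the broker rather than run inline.
CELERY_ALWAYS_EAGER = False

MEDIA_URL = '//support.mozilla.com/media/'
ADMIN_MEDIA_PREFIX = '//support.mozilla.com/admin-media/'
```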

Comment 2

7 years ago
4pm?
Flags: needs-downtime+
Whiteboard: 06/08/2010 @ 4pm
(Reporter)

Comment 3

7 years ago
Works for me.
(Reporter)

Updated

7 years ago
Depends on: 568329

Comment 4

7 years ago
User impacting?

oremj, can you do this?
Assignee: server-ops → jeremy.orem+bugs
(Assignee)

Comment 5

7 years ago
Yeah, I can grab this.
(Reporter)

Comment 6

7 years ago
Forgot one more (easy) bit:

Set up a cron job to run the `./manage.py build_avatars` command once a day. support-stage-new does it at 1:15am PT which seems fine. (It's much shorter after the first run.)
(Reporter)

Comment 7

7 years ago
SVN tag: https://svn.mozilla.org/projects/sumo/tags/1.5/1.5.5_r68485_20100608
(Reporter)

Comment 8

7 years ago
After an hour of attempting this, we've reverted. We're going to look into the errors we saw during the push, and into why we never saw them with the data we had available for testing.
(Reporter)

Comment 9

7 years ago
The first problem we saw was an unexpected schema:

| forums_thread | CREATE TABLE `forums_thread` (
  `id` int(11) NOT NULL auto_increment,
  `title` varchar(255) collate utf8_unicode_ci NOT NULL,
  `forum_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `creator_id` int(11) NOT NULL,
  `last_post_id` int(11) default NULL,
  `replies` int(11) NOT NULL,
  `is_locked` tinyint(1) NOT NULL,
  `is_sticky` tinyint(1) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `forums_thread_forum_id` (`forum_id`),
  KEY `forums_thread_created` (`created`),
  KEY `forums_thread_creator_id` (`creator_id`),
  KEY `forums_thread_last_post_id` (`last_post_id`),
  KEY `forums_thread_is_sticky` (`is_sticky`),
  CONSTRAINT `creator_id_refs_id_4938e584` FOREIGN KEY (`creator_id`) REFERENCES `auth_user` (`id`),
  CONSTRAINT `forum_id_refs_id_7f5fd759` FOREIGN KEY (`forum_id`) REFERENCES `forums_forum` (`id`),
  CONSTRAINT `last_post_id_refs_id_3fa89f33` FOREIGN KEY (`last_post_id`) REFERENCES `forums_post` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci | 


mysql> show create table forums_post;  

| forums_post | CREATE TABLE `forums_post` (
  `id` int(11) NOT NULL auto_increment,
  `thread_id` int(11) NOT NULL,
  `content` longtext collate utf8_unicode_ci NOT NULL,
  `author_id` int(11) NOT NULL,
  `created` datetime NOT NULL,
  `updated` datetime NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `forums_post_thread_id` (`thread_id`),
  KEY `forums_post_author_id` (`author_id`),
  KEY `forums_post_created` (`created`),
  KEY `forums_post_updated` (`updated`),
  CONSTRAINT `author_id_refs_id_59fe2704` FOREIGN KEY (`author_id`) REFERENCES `auth_user` (`id`),
  CONSTRAINT `thread_id_refs_id_5646bc53` FOREIGN KEY (`thread_id`) REFERENCES `forums_thread` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
(Reporter)

Comment 10

7 years ago
The problem we couldn't immediately work around was the following stack trace during our migrate_forum step:

Starting migration for forum "Contributors" (3)
Created forum "Contributors" (1)...
Processing thread 1529...
Traceback (most recent call last):
  File "./manage.py", line 36, in <module>
    execute_manager(settings)
  File "/data/virtualenvs/kitsune/src/django/django/core/management/__init__.py", line 438, in execute_manager
    utility.execute()
  File "/data/virtualenvs/kitsune/src/django/django/core/management/__init__.py", line 379, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/data/virtualenvs/kitsune/src/django/django/core/management/base.py", line 195, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/data/virtualenvs/kitsune/src/django/django/core/management/base.py", line 222, in execute
    output = self.handle(*args, **options)
  File "/data/www/support.mozilla.com/kitsune/apps/forums/management/commands/migrate_forum.py", line 193, in handle
    last_post = thread.post_set.order_by('-created')[0]
  File "/data/virtualenvs/kitsune/src/django/django/db/models/query.py", line 187, in __getitem__
    return list(qs)[0]
IndexError: list index out of range
Another problem we hadn't encountered before was mentioned by timellis:
[17:17]	<timellis> Hi. Someone killed the SUMO master in Phoenix with this statement:
[17:17]	<timellis> Error 'Error on rename of './support_mozilla_com/forums_thread' to './support_mozilla_com/#sql2-12b1-4833' (errno: 152)' on query. Default database: 'support_mozilla_com'. Query: 'alter table forums_thread drop foreign key last_post_id_refs_id_3fa89f33
[17:17]	<timellis> The reason is thus:
[17:17]	<timellis> "Cannot delete a parent row"
[17:18]	<timellis> The Phoenix SUMO master is a slave of the SJ SUMO master.

Neither James nor I were aware of this slavemaster, and we're still not sure why a command that ran fine on the SJ master failed on the Phoenix one. One assumption is that the two weren't in sync (with the SJ master having the unexpected schema from comment 9).
The migration code has been modified to keep track of each thread's last_post as its posts are created, removing the need to go ask the database for it afterwards. This *should* eliminate the race condition with the master/slaves.

http://github.com/jsocol/kitsune/commit/243433b2c6dfc08381cd0fe5bbbcf688d39cb5c7
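
For readers who don't want to open the commit, the idea can be sketched in plain Python. This is not the code from the commit above: `migrate_thread` and `create_post` are hypothetical names, and posts are modelled as dicts with a `created` key. The point is only that the last post is remembered while writing, so nothing has to be read back (possibly from a lagging slave), and a thread with no posts yields `None` instead of the `IndexError` from comment 10.

```python
def migrate_thread(old_posts, create_post):
    """Create new posts in chronological order and return the last one.

    `create_post` is a callable that writes one post and returns the new
    object. Returns None when the thread has no posts.
    """
    last_post = None
    for old in sorted(old_posts, key=lambda p: p["created"]):
        # Remember the most recently written post instead of querying
        # for it after the loop.
        last_post = create_post(old)
    return last_post
```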
(Reporter)

Comment 13

7 years ago
We believe we've got everything ironed out and ready to go tomorrow afternoon. Let's plan on getting on

the phone: 92, 309#
IRC: #sumodev
Summary: Push SUMO 2.1 Tuesday, 8 June → Push SUMO 2.1 Tuesday, 10 June
Whiteboard: 06/08/2010 @ 4pm → 06/10/2010 @ 4pm, needs-downtime+
Summary: Push SUMO 2.1 Tuesday, 10 June → Push SUMO 2.1 Thursday, 10 June
Duration 2 hrs?
Whiteboard: 06/10/2010 @ 4pm, needs-downtime+ → 06/10/2010 @ 4pm
(Reporter)

Comment 15

7 years ago
(In reply to comment #14)
> Duration 2 hrs?

Yep.
(Assignee)

Comment 16

7 years ago
Let's start at 2 or 3 this time.
(Reporter)

Comment 17

7 years ago
(In reply to comment #16)
> Let's start at 2 or 3 this time.

2 WFM. QA?
2PM sounds _great_ to me.
(Reporter)

Comment 19

7 years ago
(In reply to comment #18)
> 2PM

Moved to 2pm on the Webdev:Releases calendar.

Updated

7 years ago
Whiteboard: 06/10/2010 @ 4pm → 06/10/2010 @ 2pm
(Reporter)

Updated

7 years ago
Depends on: 571283
(Reporter)

Comment 20

7 years ago
UPDATED INSTRUCTIONS!

So we've got slightly updated instructions, since much of this is still done from Tuesday.

* We still need an Outage page up for the duration of the migration.

* `git co 2.1.1` for both web servers and celeryd instance. (Note the new tag)
** Reload celeryd

* Make sure settings_local.py is still configured correctly: (see comment 0)

* Clean up from Tuesday's attempt:
** SQL: `TRUNCATE TABLE forums_post; TRUNCATE TABLE forums_thread; TRUNCATE TABLE forums_forum;`

* Need to run a couple commands from the virtualenv:
** `./manage.py migrate_forum 3 4 5` (will take 5-15 minutes).
** `./manage.py compress_assets`
*** Make sure that both the generated JS/CSS and the generated build.py (next to settings.py) get synced out.

* svn sw to 1.5.5 tag (see comment 7)
* run webroot/htaccess.sh in SVN.

* One more command on the virtualenv:
** `./manage.py build_avatars` (Will take up to half an hour, but shouldn't affect site up-time while it runs. Needs /tmp to be writeable.)

* Outage page can come down now.

* Make sure this alias is there.
    Alias /admin-media/ /path/to/virtualenv/src/django/django/contrib/admin/media/
** Make sure it's readable, etc.

* Flush all caches.

* Make sure to update Sphinx again as well.
(Assignee)

Comment 21

7 years ago
 git fetch
remote: Counting objects: 186, done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 124 (delta 67), reused 12 (delta 7)
Receiving objects: 100% (124/124), 59.39 KiB, done.
Resolving deltas: 100% (67/67), completed with 28 local objects.
From http://github.com/jsocol/kitsune
 + c803754...0f0484e 561530-logging -> origin/561530-logging  (forced update)
   572ed80..4d61e5c  development -> origin/development
   93d393c..8b8e5b9  master     -> origin/master
   e291839..94279b7  questions  -> origin/questions
 * [new branch]      sphinx-doc -> origin/sphinx-doc
 * [new tag]         2.1.1      -> 2.1.1
[root@mradm02 prod]# git checkout 2.1.1
Previous HEAD position was cb19e8c... Adding WebTrends meta tags and test for them.
HEAD is now at 8b8e5b9... Merge branch 'development'
(Assignee)

Comment 22

7 years ago
svn switch https://svn.mozilla.org/projects/sumo/tags/1.5/1.5.5_r68485_20100608
A    webroot/lang/ilo
A    webroot/lang/ilo/language.php
A    webroot/lang/ilo/index.php
U    webroot/lang/langmapping.php
U    webroot/lib/commentslib.php
U    webroot/tiki-login.php
A    webroot/django_utils.php
U    webroot/tiki-change_password.php
U    webroot/htaccess.dist
U    scripts/sphinx/sphinx.conf
Updated to revision 68644.
(Assignee)

Comment 23

7 years ago
Created attachment 450456 [details]
migrate_forum output
(Assignee)

Comment 24

7 years ago
Trevor updated the sphinx config.
(Reporter)

Comment 25

7 years ago
We ran into a pretty serious architectural issue in the code related to replication. We need to re-examine and we'll take another run at this.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → INCOMPLETE
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations