Closed Bug 871799 Opened 12 years ago Closed 12 years ago

Alter table add column for production browserid databases

Categories: Cloud Services :: Operations: Deployment Requests - DEPRECATED (task)
Platform: x86 Linux
Priority: Not set
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED

People

(Reporter: jrgm, Assigned: gene)

Details

Hi Sheeri,

For an upcoming release of Persona, we need to add a column to the browserid.user table. The alter is:

> alter table browserid.user add column `failedAuthTries` int(11) NOT NULL DEFAULT '0'

`user` is millions of rows, and gene says it took 15 minutes in stage. I wonder if you have some time to discuss a plan to do this in production.

1. I think we can make this a little quicker by using mysqldump to page the tables in fully before the alter (?).
2. While the alter table is in progress, all writes will wait for the alter to complete. True or false?
The user table is 108M in size:

-rw-rw---- 1 mysql mysql 108M May 14 08:22 user.ibd

I'm not sure what you're getting at with #1. Meaning that things will already be in the cache? The bottleneck is likely the I/O of writing, not of reading... although there's some reading too. Interesting.

As for #2, while the ALTER TABLE is in progress, an exclusive (write) lock is acquired on the whole table - no writes OR reads can proceed until the ALTER is done.

I would use the oak-online-alter-table tool - http://openarkkit.googlecode.com/svn/trunk/openarkkit/doc/html/oak-online-alter-table.html - or the pt-online-schema-change tool - http://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html. Each creates triggers on the old table, creates a new table with the schema you want, and backfills the data from the old table; when the backfill is complete, it drops the old table and renames the new table to the old name (though you can configure it not to drop the old table if you want).

There are many caveats with using these tools, but they're mostly for more complicated setups, such as when foreign keys are in use or when triggers already exist on the current table. The openark tool was created first and is still maintained, and is likely less buggy than the percona toolkit tool (because it has fewer features, and the features the percona toolkit has, we don't need).
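The trigger-and-backfill strategy described above can be sketched in miniature. This is an illustrative toy, not the actual implementation of either tool; the table dicts and the `trigger_write` helper are invented for the example:

```python
# Toy model of an online schema change: writes to the old table are mirrored
# into the new table by "triggers" while existing rows are backfilled.

old = {1: {'passwd': 'a'}, 2: {'passwd': 'b'}}
new = {}  # the new table, with the added failedAuthTries column

def trigger_write(row_id, row):
    # Stands in for the INSERT/UPDATE triggers the tool creates on the old
    # table: every concurrent write is copied into the new schema.
    new[row_id] = dict(row, failedAuthTries=row.get('failedAuthTries', 0))

# A write arrives for row 2 while the backfill is running; the trigger
# mirrors it into the new table first.
trigger_write(2, {'passwd': 'b2'})

# Backfill: copy the old rows across, adding the column's default, but never
# clobbering a row the triggers already wrote (that copy is newer).
for row_id, row in sorted(old.items()):
    new.setdefault(row_id, dict(row, failedAuthTries=0))
```

Once the backfill finishes, the real tools atomically rename the new table over the old one, which is why readers and writers never see the table locked for the duration of the copy.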
@sheeri Thanks for the comments. For #1, yes, I was referring to getting things into the MySQL and Linux caches to make the reads quicker. But yes, the entire table has to be written too. Actually, one bit of weirdness: while the alter on the master took 15 minutes, on the slaves the alter took something like 10 hours.

Thanks for the pointers to the oak and pt tools. The latter I've looked into before, but I'll look into the oak tool (for my reading enjoyment). It seems to me we first need to understand why the I/O with EBS is so slow (and request more IOPS if that is appropriate). (For the ALTER TABLE, I suppose there is also the option to do it offline and do a master swap.)
One other solution: upgrade to MySQL 5.6, which has online ALTER TABLE. We have a Q2 goal of upgrading to 5.5 everywhere the db engineering team manages databases, and of being on MySQL 5.6 everywhere by the end of Q4. It's probably not a solution you'll implement right now (and if you do, we're here to help), but I figured it might be important for you to know that online column add is native in MySQL 5.6.
(After a long email thread between the previous comment and now...) Steps I used to perform the ALTER TABLE:

1. On the master and *all* slaves in parallel:
> `time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null`

2. Then on the master, in the mysql shell:
> ALTER TABLE browserid.user ADD COLUMN `failedAuthTries` INT(11) NOT NULL DEFAULT '0';

While the mysqldumps run, replication is delayed (which will lead to failures for adduser, change-password, etc., where a change is written to the master and is then read from a slave). While the ALTER propagates down the replication tree, replication will again be delayed.

(Gene, we should also talk about steps to take if there is a problem with making this change, e.g., set up replication within phx1 (which has no read traffic atm) like 'master->slave1->slave2' and then stop replication on slave1 so it can become the basis of a new replication tree in the worst case.) I think it would be best if we tried this change on a slave in production (take it out of the loadbalancer, do mysqldump+ALTER and time it, then do mysqldump+DROP of the column and put it back into traffic when replication is caught up).
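The ordering constraint in the two steps above (all three dumps finish before the ALTER starts) can be sketched like this. The `run` callable is an assumption standing in for shelling out to mysqldump/mysql; here it just records commands in a list:

```python
# Sketch of the warm-up-then-alter sequence: page the tables into cache with
# mysqldump, and only afterwards issue the ALTER on the master.

def warm_then_alter(run):
    # Step 1: dump each table to /dev/null, purely to warm the caches.
    for table in ('staged', 'email', 'user'):
        run('mysqldump browserid %s > /dev/null' % table)
    # Step 2: only after every dump has completed, run the ALTER.
    run("mysql browserid -e \"ALTER TABLE browserid.user "
        "ADD COLUMN failedAuthTries INT(11) NOT NULL DEFAULT '0'\"")

log = []
warm_then_alter(log.append)  # inject a recorder instead of a real shell
```

In the real runbook the dumps are run with `time` and the ALTER is only started once all three have returned, which is exactly the ordering the function enforces.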
Should I wait for all three Step 1 statements to complete before initiating Step 2? How long should Step 2 take? What if Step 2 takes 10 hours?

> we should also talk about steps to take if there is a problem with making this change

What do you recommend?

> take it out of the loadbalancer, do mysqldump+ALTER and time it, then do mysqldump+DROP of the column and put it back into traffic when replication is caught up

Sure thing; can you provide the exact commands that you'd like run, and I'll do so on one of the DBs in us-west-2?
Yes, the Step 1 mysqldumps should be allowed to complete before starting Step 2. The `staged` table is tiny; mysqldump for `email` and `user` took about 20 seconds each in the stage database environment. For Step 2: after the mysqldump, the ALTER TABLE took 50 seconds to 3 minutes in stage. For the last part, can we find a time to chat with Sheeri? I'm not familiar with the production environment, so I can't recommend exact commands without asking some other questions (and my idea is only half-baked anyway ;-)).
As a trial run to check timing, in the PHX1 datacenter, which has no readonly traffic:

1. Check that db2 really does not have traffic. Since we don't have query logging enabled (I don't think), do `SHOW PROCESSLIST`: there may be connections from the webheads, but they should show a large value in the 'Time' column (which is how long since they were last used).

2. On db2, do
> `time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null`
and note the times (which should be ~20 seconds or less for `email` and `user`).

3. Immediately following step 2, on db2 do:
> USE browserid;
> ALTER TABLE browserid.user ADD COLUMN `failedAuthTries` INT(11) NOT NULL DEFAULT '0';
and note the time (which should be on the order of a few minutes, maybe less for the smaller table size in production).

4. Check the new column is there:
> USE browserid;
> SHOW CREATE TABLE user;
which should look something like:
> CREATE TABLE `user` (
>   `id` bigint(20) NOT NULL AUTO_INCREMENT,
>   `passwd` char(64) DEFAULT NULL,
>   `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
>   `failedAuthTries` int(11) NOT NULL DEFAULT '0',
>   PRIMARY KEY (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=12414783 DEFAULT CHARSET=latin1

5. Do 'SHOW SLAVE STATUS\G' and check 'Seconds_Behind_Master:'. Let that get back to 0-ish seconds.

6. If everything has gone well, we now clean up and drop the column. On db2, repeat the mysqldump.

7. On db2: 'ALTER TABLE browserid.user DROP COLUMN failedAuthTries;'

8. Do:
> USE browserid;
> SHOW CREATE TABLE user;
to see the column is gone.
Update to the above plan: in step 3 and step 7, before the ALTER TABLE statement, do 'SET autocommit=0;', and after the ALTER, 'SET autocommit=1;' or exit and re-enter the mysql shell.
As a trial run to check timing, in the PHX1 datacenter, which has no readonly traffic:

1. Check that db2 really does not have traffic. Since we don't have query logging enabled (I don't think), do `SHOW PROCESSLIST`: there may be connections from the webheads, but they should show a large value in the 'Time' column (which is how long since they were last used).

2. On db2, do
> `time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null`
and note the times (which should be ~20 seconds or less for `email` and `user`).

3. Immediately following step 2, on db2 do:
> USE browserid;
> SET autocommit=0;
> ALTER TABLE browserid.user ADD COLUMN `failedAuthTries` INT(11) NOT NULL DEFAULT '0';
and note the time (which should be on the order of a few minutes, maybe less for the smaller table size in production).

4. Check the new column is there:
> USE browserid;
> SHOW CREATE TABLE user;
which should look something like:
> CREATE TABLE `user` (
>   `id` bigint(20) NOT NULL AUTO_INCREMENT,
>   `passwd` char(64) DEFAULT NULL,
>   `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
>   `failedAuthTries` int(11) NOT NULL DEFAULT '0',
>   PRIMARY KEY (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=12414783 DEFAULT CHARSET=latin1

5. Do 'SHOW SLAVE STATUS\G' and check 'Seconds_Behind_Master:'. Let that get back to 0-ish seconds.

6. If everything has gone well, we now clean up and drop the column. On db2, repeat the mysqldump.

7. On db2:
> SET autocommit=0;
> ALTER TABLE browserid.user DROP COLUMN failedAuthTries;

8. Do:
> USE browserid;
> SHOW CREATE TABLE user;
to see the column is gone.
As a trial run to check timing, in the PHX1 datacenter, which has no readonly traffic:

1. Check that db3 really does not have traffic. Since we don't have query logging enabled (I don't think), do `SHOW PROCESSLIST`: there may be connections from the webheads, but they should show a large value in the 'Time' column (which is how long since they were last used).

2. On db3, do
> `time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null`
and note the times (which should be ~20 seconds or less for `email` and `user`).

3. Immediately following step 2, on db3 do:
> USE browserid;
> SET autocommit=0;
> ALTER TABLE browserid.user ADD COLUMN `failedAuthTries` INT(11) NOT NULL DEFAULT '0';
and note the time (which should be on the order of a few minutes, maybe less for the smaller table size in production).

4. Check the new column is there:
> USE browserid;
> SHOW CREATE TABLE user;
which should look something like:
> CREATE TABLE `user` (
>   `id` bigint(20) NOT NULL AUTO_INCREMENT,
>   `passwd` char(64) DEFAULT NULL,
>   `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
>   `failedAuthTries` int(11) NOT NULL DEFAULT '0',
>   PRIMARY KEY (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=12414783 DEFAULT CHARSET=latin1

5. Do 'SHOW SLAVE STATUS\G' and check 'Seconds_Behind_Master:'. Let that get back to 0-ish seconds.

6. If everything has gone well, we now clean up and drop the column. On db3, repeat the mysqldump.

7. On db3:
> SET autocommit=0;
> ALTER TABLE browserid.user DROP COLUMN failedAuthTries;

8. Do:
> USE browserid;
> SHOW CREATE TABLE user;
to see the column is gone.
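Step 5's replication check can be automated with a small parser for the `SHOW SLAVE STATUS\G` output. This helper is an assumption for illustration, not part of any tool mentioned in this bug:

```python
# Pull Seconds_Behind_Master out of SHOW SLAVE STATUS output. Returns None
# when the field is absent or when MySQL reports NULL (replication stopped).

def seconds_behind(status_text):
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith('Seconds_Behind_Master:'):
            value = line.split(':', 1)[1].strip()
            return None if value == 'NULL' else int(value)
    return None

sample = """
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Seconds_Behind_Master: 21682
"""
```

A wrapper script could poll this in a loop and declare the slave caught up once the value returns to 0-ish, which is the "let that get back to 0-ish seconds" condition above.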
Step 2-5: http://gene.pastebin.mozilla.org/2492842 (I forgot step 6.)

Step 7-8:

[root@db3.iddb.phx1.svc ~]# date;time bash -c "echo \"SET autocommit=0;ALTER TABLE browserid.user DROP COLUMN failedAuthTries;\" | mysql -BE browserid"
Thu Jun 6 14:52:56 PDT 2013

real 0m6.747s
user 0m0.013s
sys 0m0.012s

[root@db3.iddb.phx1.svc ~]# echo "SHOW CREATE TABLE user;" | mysql -BE browserid
*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `passwd` char(64) DEFAULT NULL,
  `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2306076 DEFAULT CHARSET=latin1
[root@db3.iddb.phx1.svc ~]#
Here's the output from steps 2-5 in the pastebin above (since the pastebin will expire):

[root@db3.iddb.phx1.svc ~]# date;time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null;date;time bash -c "echo \"SET autocommit=0;ALTER TABLE browserid.user ADD COLUMN failedAuthTries INT(11) NOT NULL DEFAULT '0';\" | mysql -BE browserid";echo "SHOW CREATE TABLE user;" | mysql -BE browserid;echo "SHOW SLAVE STATUS;" | mysql -BE browserid
Thu Jun 6 14:50:32 PDT 2013

real 0m2.744s
user 0m2.265s
sys 0m0.115s

real 0m4.020s
user 0m3.207s
sys 0m0.100s

real 0m3.842s
user 0m3.613s
sys 0m0.136s

Thu Jun 6 14:50:43 PDT 2013

real 0m7.285s
user 0m0.007s
sys 0m0.012s

*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `passwd` char(64) DEFAULT NULL,
  `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `failedAuthTries` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2306063 DEFAULT CHARSET=latin1
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: db2.iddb.phx1.svc.mozilla.com
Master_User: replication
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000269
Read_Master_Log_Pos: 81959899
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 9536040
Relay_Master_Log_File: mysql-bin.000269
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: browserid
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 81959899
Relay_Log_Space: 9536196
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Okay, so the mysqldumps + ADD COLUMN took 18 seconds. So, end to end to the furthest leaf in the tree, that's maybe about a minute (plus a bit more for replication to catch up) to complete the maintenance. I would like to get this same test done on an idle slave in an AWS region to confirm that those databases have about the same performance. If they do, then I figure we should just do this either Friday or Monday morning, and not target our quietest time of the night. (I would really like to get back to finishing this train-2013.05.22 for shipping on Monday.)
Alter completed after 142 minutes. I've started the revert now.

[root@ip-10-148-37-93 ~]# date;time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null;date;time bash -c "echo \"SET autocommit=0;ALTER TABLE browserid.user ADD COLUMN failedAuthTries INT(11) NOT NULL DEFAULT '0';\" | mysql -BE browserid";echo "SHOW CREATE TABLE user;" | mysql -BE browserid;echo "SHOW SLAVE STATUS;" | mysql -BE browserid
Fri Jun 7 11:37:37 PDT 2013

real 0m1.955s
user 0m1.012s
sys 0m0.043s

real 0m3.230s
user 0m1.425s
sys 0m0.050s

real 0m3.677s
user 0m1.913s
sys 0m0.066s

Fri Jun 7 11:37:46 PDT 2013

real 142m49.877s
user 0m0.003s
sys 0m0.009s

*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `passwd` char(64) DEFAULT NULL,
  `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `failedAuthTries` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2311456 DEFAULT CHARSET=latin1
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 10.148.33.176
Master_User: replication
Master_Port: 3306
Connect_Retry: 15
Master_Log_File: mysql-bin.000040
Read_Master_Log_Pos: 102699258
Relay_Log_File: mysqld-relay-bin.000034
Relay_Log_Pos: 102699403
Relay_Master_Log_File: mysql-bin.000040
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB: browserid
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 102699258
Relay_Log_Space: 102699602
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:

[root@ip-10-148-37-93 ~]# date;time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null;date;
Fri Jun 7 14:04:24 PDT 2013

real 0m1.905s
user 0m1.014s
sys 0m0.039s

real 0m3.131s
user 0m1.422s
sys 0m0.046s

real 0m3.737s
user 0m1.983s
sys 0m0.072s

Fri Jun 7 14:04:33 PDT 2013
John, given the test results how would you like to proceed on Monday morning?
Here are the drop-column results:

[root@ip-10-148-37-93 ~]# date;time bash -c "echo \"SET autocommit=0;ALTER TABLE browserid.user DROP COLUMN failedAuthTries;\" | mysql -BE browserid"
Fri Jun 7 14:04:34 PDT 2013

real 145m7.841s
user 0m0.004s
sys 0m0.008s

[root@ip-10-148-37-93 ~]# echo "slave start;SHOW SLAVE STATUS;" | mysql -BE browserid
*************************** 1. row ***************************
Slave_IO_State: Connecting to master
Master_Host: 10.148.33.176
Master_User: replication
Master_Port: 3306
Connect_Retry: 15
Master_Log_File: mysql-bin.000040
Read_Master_Log_Pos: 102699258
Relay_Log_File: mysqld-relay-bin.000034
Relay_Log_Pos: 102699403
Relay_Master_Log_File: mysql-bin.000040
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB: browserid
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 102699258
Relay_Log_Space: 102699602
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:

[root@ip-10-148-37-93 ~]# echo "SHOW SLAVE STATUS;" | mysql -BE browserid
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.148.33.176
Master_User: replication
Master_Port: 3306
Connect_Retry: 15
Master_Log_File: mysql-bin.000041
Read_Master_Log_Pos: 17061771
Relay_Log_File: mysqld-relay-bin.000035
Relay_Log_Pos: 1047957
Relay_Master_Log_File: mysql-bin.000040
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: browserid
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 103746964
Relay_Log_Space: 121920275
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 21682
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Corrections welcome; this is my understanding of the replication tree in production (I'm using logical names since I don't know the actual hostnames used in production): http://i.imgur.com/JFXpTGZ.png

The reason phx1-db3 is a child of phx1-db2 is that I requested that change when we were thinking we would not move web traffic back into phx1; the plan then would have been to stop replication on db2 before starting the ALTER, so that phx1-db2 would be a safe fallback position if things went horribly wrong in the rest of the replication tree. Anyway, trial runs in phx1 ran quickly enough, but in the aws infrastructure the alter still takes hours. So, one approach (with risks) would be:

0. Bring all web traffic back into phx1 and wait for aws to go idle.

1. Stop replication on scl2-db1 so that set of databases (and scl2 as a whole) is a fallback.

2. Do the mysqldump on all the database hosts, fwiw. Then start the alter on phx1-db1.

3. Happy path: the alter succeeds in about the same time as the trial with phx1-db3. While the ALTER propagates to phx1-db2 and then phx1-db3, users making changes will experience transient errors. The ALTER also continues to the two AWS regions, where, as long as there is no web traffic, users will be oblivious to the hours-long database unavailability.

4. Not happy path: the ALTER on phx1-db1 takes >1 minute. We can safely cancel the ALTER at this point and mysql will clean up; the ALTER statement is not written to the binary log until it successfully completes, so it won't replicate off phx1-db1. (Note: a crude way to check progress is the size of the temporary /data/mysql/.#sql... table on phx1-db1.)

5. Really unhappy path: something goes horribly, inexplicably wrong. SCL2, with replication stopped, is the fallback: declare scl2-db1 the new master, reset the slave there so it no longer replicates from phx1-db1, change the scl2 webhead and dbwriter configurations to point to scl2-db1 as the write master, restart the webheads and dbwriters, and move all web traffic to SCL2.
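The "not happy path" rule in step 4 (cancel the ALTER if it runs past about a minute, with progress gauged crudely from the temporary #sql file size) might be expressed as follows. Both helpers are invented for illustration; the clock and byte counts are passed in so the logic is testable without a live server:

```python
# Watchdog logic for a risky ALTER: abort past a deadline, and estimate
# progress from how large the temporary copy table has grown relative to
# the original table file.

def should_cancel(started_at, now, limit_seconds=60):
    # True once the ALTER has been running longer than the allowed window.
    return (now - started_at) > limit_seconds

def alter_progress(temp_bytes, table_bytes):
    # Crude fraction-complete estimate; the temp table can briefly exceed
    # the original size, so clamp at 1.0.
    return min(temp_bytes / table_bytes, 1.0)
```

Cancelling is safe here precisely because, as noted above, the ALTER only reaches the binary log on successful completion, so an aborted attempt never replicates downstream.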
0. Bring all web traffic back into phx1 and wait for the aws regions to go idle.

0b. Remove db2 and db3 from the database readonly loadbalancer so only db1 has either read or write traffic.

1. Stop replication in one aws region to act as a fallback.

2. Do the mysqldump on all the database hosts (that's *all* databases in phx1 and scl2, and the two aws regions):
> `time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null`
and note the times (which should be ~20 seconds or less for `email` and `user`).

3. Immediately following step 2, on db1 (the master) do:
> USE browserid;
> SET autocommit=0;
> ALTER TABLE browserid.user ADD COLUMN `failedAuthTries` INT(11) NOT NULL DEFAULT '0';
and note the time (which should be on the order of 30 seconds, from previous data). If this takes longer than ~ one minute, cancel the ALTER (which will roll back the changes).

4. When it completes on the master, check the new column is there:
> USE browserid;
> SHOW CREATE TABLE user;
which should look something like:
> CREATE TABLE `user` (
>   `id` bigint(20) NOT NULL AUTO_INCREMENT,
>   `passwd` char(64) DEFAULT NULL,
>   `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
>   `failedAuthTries` int(11) NOT NULL DEFAULT '0',
>   PRIMARY KEY (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=12414783 DEFAULT CHARSET=latin1

5. Monitor the slaves in phx1 as the ALTER passes down to db2 and db3. Do 'SHOW SLAVE STATUS\G' and check 'Seconds_Behind_Master:'. This may take several hours in the AWS regions.

6. When the phx1 slaves are complete and caught up, they can be returned to take read traffic again.

7. Wait hours for the aws regions to catch up.
We began work on this at 10am PDT; at 11am PDT we started the alter, which took 9 seconds on the master and a few minutes on the slaves. During this time, there were problems for write-related operations until 11:15am PDT, due to us still having hosts reading from the slaves despite their being out of the loadbalancer. (Lesson: we need to bounce the webheads and dbwriters to get them off of established db connections.) The write proceeded into us-west-2 and completed about 4pm PDT. I've pointed my DNS at it and gone through a series of use cases that trigger writes and replication, and it looks good to go back into production. But Gene, can you check SHOW CREATE TABLE on each slave in us-west-2 and phx1, and the mysql error log on phx1-db{1,2,3} and us-west-2-db{1,2,3}, for any unusual messages?
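The per-slave SHOW CREATE TABLE check requested here could be automated along these lines. This is a hypothetical helper, not a tool used in this bug; `normalize` strips only the AUTO_INCREMENT counter, which legitimately differs from host to host:

```python
import re

def normalize(create_sql):
    # AUTO_INCREMENT counters diverge per host and are harmless; ignore them
    # so the comparison only flags real schema differences.
    return re.sub(r'AUTO_INCREMENT=\d+ ?', '', create_sql)

def schemas_match(outputs):
    """outputs: host -> SHOW CREATE TABLE text. True iff all hosts agree."""
    return len({normalize(s) for s in outputs.values()}) == 1

# Toy outputs from two hypothetical hosts, differing only in AUTO_INCREMENT:
ok = schemas_match({
    'db1': 'CREATE TABLE `user` (...) ENGINE=InnoDB AUTO_INCREMENT=2346140 DEFAULT CHARSET=latin1',
    'db2': 'CREATE TABLE `user` (...) ENGINE=InnoDB AUTO_INCREMENT=2346155 DEFAULT CHARSET=latin1',
})
# And two that actually disagree (one is missing the new column):
bad = schemas_match({
    'db1': 'CREATE TABLE `user` (`failedAuthTries` int(11))',
    'db2': 'CREATE TABLE `user` ()',
})
```

Fed with the real output of `ssh $host 'echo "SHOW CREATE TABLE user;" | mysql -BE browserid'` per host, this reduces the manual eyeballing to a single pass/fail.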
I shared the SHOW CREATE TABLE output with jrgm yesterday around 4pm and everything looked good. At that time we brought us-west-2 back into the load balancer group, and I enabled slave replication at us-east-1, which should have completed by now; I'll go check it out.
us-east-1 looks good, I'll enable replication.

[admin:0525:us-east-1][root@ip-10-146-34-142 ~]$ for host in 10.146.33.213 10.146.37.43 10.146.35.96; do ssh $host 'echo "SHOW CREATE TABLE user;" | mysql -BE browserid'; done
*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `passwd` char(64) DEFAULT NULL,
  `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `failedAuthTries` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2346140 DEFAULT CHARSET=latin1
*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `passwd` char(64) DEFAULT NULL,
  `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `failedAuthTries` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2346140 DEFAULT CHARSET=latin1
*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `passwd` char(64) DEFAULT NULL,
  `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `failedAuthTries` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2346140 DEFAULT CHARSET=latin1
Sorry, my earlier comments were confusing/bogus. What I mean is: us-east-1, which has been replicating overnight, looks good. Now to get some QA from jrgm and we'll put user traffic back on it.
Gene, us-east-1 is good to go. You can re-enable traffic to us-east-1. Thanks.
[11:35:25] <gene> 11:35am phx1 traffic drained. us-east-1 traffic enabled. we're now serving out of us-west-2 and us-east-1
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee: nobody → gene