Closed
Bug 871799
Opened 12 years ago
Closed 12 years ago
Alter table add column for production browserid databases
Categories
(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jrgm, Assigned: gene)
Details
Hi Sheeri,
For an upcoming release of Persona, we need to add a column to the browserid.user table.
The alter is:
> alter table browserid.user add column `failedAuthTries` int(11) NOT NULL DEFAULT '0'
`user` is millions of rows and gene says it took 15 minutes in stage.
I wonder if you have some time to discuss a plan to do this in production.
1. I think we can make this a little quicker by using mysqldump to page in the tables fully before the alter (?).
2. While the alter table is in progress, all writes will wait for the alter to complete. True or False?
Comment 1•12 years ago
|
||
The user table is 108M in size:
-rw-rw---- 1 mysql mysql 108M May 14 08:22 user.ibd
I'm not sure what you're getting at with #1. Meaning that things will already be in the cache? The bottleneck is likely the I/O of writing, not of reading.....although there's some reading too. Interesting.
As for #2, while the ALTER TABLE is in progress, an exclusive lock is gotten (write lock) on the whole table - no writes OR reads can proceed until the ALTER is done.
I would use the oak-online-alter-table tool - http://openarkkit.googlecode.com/svn/trunk/openarkkit/doc/html/oak-online-alter-table.html - or the pt-online-schema-change tool - http://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html - it creates triggers on the old table and creates a new table with the schema you want and backfills the data from the old table, and when the backfill is complete it drops the old table and renames the new table to what the old table should be (though you can configure it not to drop the old table if you want).
There are many caveats with using these tools, but they're mostly for more complicated setups like when foriegn keys are in use or if triggers already exist on the current table. The openark tool was created first and is still maintained, and is likely less buggy than the percona toolkit tool (because it has fewer features, and the features the percona toolkit has, we don't need).
Reporter | ||
Comment 2•12 years ago
|
||
@sheeri Thanks for the comments.
Yeah, for #1, yeah I was referring to getting things in mysql and linux caches to make the reads quicker. But yes, the entire table has to be written too.
Actually, one bit of weirdness. While the alter on the master took 15 minutes, on the slaves the alter took something like 10 hours.
Thanks for the pointers to the oak and pt tools. The former I've looked into before, but I'll look into the oak tool (for my reading enjoyment).
It seems to me we first need to understand why the IO with EBS is so slow (and request more IOPS is that is appropriate).
(For, the ALTER TABLE I suppose there is also the option to do it offline and do a master swap.)
Comment 3•12 years ago
|
||
One other solution: upgrade to MySQL 5.6, it has online ALTER TABLE. We have a Q2 goal of upgrading to 5.5 everywhere the db engineering team manages, and by end of Q4 be upgraded to MySQL 5.6 everywhere.
It's probably not a solution you'll implement right now (and if you are, we're here to help) but I figured it might be important for you to know that online column add is native in MySQL 5.6.
Reporter | ||
Comment 4•12 years ago
|
||
(After a long email thread in between previous comment and now...)
Steps I used to perform the ALTER TABLE.
1. On the master and *all* slaves in parallel:
> `time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null`
2. Then on the master, in the mysql shell:
> ALTER TABLE browserid.user ADD COLUMN `failedAuthTries` INT(11) NOT NULL DEFAULT '0';
While the mysqldumps run, replication is delayed (which will lead to failures for adduser, change-password, etc. where a change is written \
to the master, and is then read from a slave).
While the ALTER propagates down the replication tree, replication will again be delayed.
(Gene, we should also talk about steps to take if there is a problem with
making this change (e.g., setup replication within phx1 (which has no read
traffic atm) like 'master->slave1->slave2' and then stopping replication on
slave1 so it can become the basis of a new replication tree in the worst case).
I think it would be best if we tried this change on a slave in production
(take it out of the loadbalancer, do mysqldump+ALTER and time it, then do
mysqldump+DROP of the column and put it back into traffic when replication is
caught up).
Assignee | ||
Comment 5•12 years ago
|
||
Should I wait for all three Step 1 statements to complete before initiating Step 2?
How long should Step 2 take?
What if Step 2 takes 10 hours?
> we should also talk about steps to take if there is a problem with making this change
What do you recommend?
> take it out of the loadbalancer, do mysqldump+ALTER and time it, then do mysqldump+DROP of the column and put it back into traffic when replication is caught up
Sure thing, can you provide the exact commands that you'd like run and I'll do so on one of the DBs in us-west-2
Reporter | ||
Comment 6•12 years ago
|
||
Yes, the Step 1 mysqldumps should be allowed to complete before starting Step 2. The `staged` table is tiny; mysqldump for `email` and `user` took about 20 seconds each in stage database environment.
Step 2, after the mysqldump, the ALTER TABLE took 50 seconds to 3 minutes in stage.
For the last part, can we find a time for us to chat with Sheeri; I'm not familiar
with the production environment, so I can't recommend exact commands without asking
some other questions (and my idea is only half-baked anyways ;-)).
Reporter | ||
Comment 7•12 years ago
|
||
As a trial run to check timing, in PHX1 datacenter which has no readonly
traffic:
1. Check that db2 really does not have traffic. Since we don't have query
logging enabled (I don't think), do `SHOW PROCESSLIST`: there may be
connections from the webheads, but they should show a large value in the
'Time' column (which is how long since they were last used).
2. On db2, do
> `time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null`
and note the times (which should be ~20 seconds (or less) for `email` and `user`.
3. Immediately following step 2., on db2 do:
> USE browserid;
> ALTER TABLE browserid.user ADD COLUMN `failedAuthTries` INT(11) NOT NULL DEFAULT '0';
and note the time (which should be on the order of a few minutes (maybe
less for the smaller table size in production).
4. Check the new column is there:
> USE browserid;
> SHOW CREATE TABLE user;
which should look something like:
> CREATE TABLE `user` (
> `id` bigint(20) NOT NULL AUTO_INCREMENT,
> `passwd` char(64) DEFAULT NULL,
> `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
> `failedAuthTries` int(11) NOT NULL DEFAULT '0',
> PRIMARY KEY (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=12414783 DEFAULT CHARSET=latin1
5. Do 'SHOW SLAVE STATUS\G' and check 'Seconds_Behind_Master:'. Let that
get back to 0-ish seconds.
6. If everything has gone well, we now clean up and drop the column. On db2,
repeat the mysqldump:
7. On db2, 'ALTER TABLE browserid.user DROP COLUMN failedAuthTries;'
8. Do:
> USE browserid;
> SHOW CREATE TABLE user;
to see the column is gone.
Reporter | ||
Comment 8•12 years ago
|
||
Update to the above plan: in step 2 and step 7, before the ALTER TABLE statement, do
SET autocommit=0;, and after the ALTER 'SET autommit=1' or exit and re-enter the mysql shell.
Reporter | ||
Comment 9•12 years ago
|
||
As a trial run to check timing, in PHX1 datacenter which has no readonly
traffic:
1. Check that db2 really does not have traffic. Since we don't have query
logging enabled (I don't think), do `SHOW PROCESSLIST`: there may be
connections from the webheads, but they should show a large value in the
'Time' column (which is how long since they were last used).
2. On db2, do
> `time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null`
and note the times (which should be ~20 seconds (or less) for `email` and `user`.
3. Immediately following step 2., on db2 do:
> USE browserid;
> SET autocommit=0;
> ALTER TABLE browserid.user ADD COLUMN `failedAuthTries` INT(11) NOT NULL DEFAULT '0';
and note the time (which should be on the order of a few minutes (maybe
less for the smaller table size in production).
4. Check the new column is there:
> USE browserid;
> SHOW CREATE TABLE user;
which should look something like:
> CREATE TABLE `user` (
> `id` bigint(20) NOT NULL AUTO_INCREMENT,
> `passwd` char(64) DEFAULT NULL,
> `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
> `failedAuthTries` int(11) NOT NULL DEFAULT '0',
> PRIMARY KEY (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=12414783 DEFAULT CHARSET=latin1
5. Do 'SHOW SLAVE STATUS\G' and check 'Seconds_Behind_Master:'. Let that
get back to 0-ish seconds.
6. If everything has gone well, we now clean up and drop the column. On db2,
repeat the mysqldump:
7. On db2,
> SET autocommit=0;
> 'ALTER TABLE browserid.user DROP COLUMN failedAuthTries;'
8. Do:
> USE browserid;
> SHOW CREATE TABLE user;
to see the column is gone.
Reporter | ||
Comment 10•12 years ago
|
||
As a trial run to check timing, in PHX1 datacenter which has no readonly
traffic:
1. Check that db3 really does not have traffic. Since we don't have query
logging enabled (I don't think), do `SHOW PROCESSLIST`: there may be
connections from the webheads, but they should show a large value in the
'Time' column (which is how long since they were last used).
2. On db3, do
> `time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null`
and note the times (which should be ~20 seconds (or less) for `email` and `user`.
3. Immediately following step 2., on db3 do:
> USE browserid;
> SET autocommit=0;
> ALTER TABLE browserid.user ADD COLUMN `failedAuthTries` INT(11) NOT NULL DEFAULT '0';
and note the time (which should be on the order of a few minutes (maybe
less for the smaller table size in production).
4. Check the new column is there:
> USE browserid;
> SHOW CREATE TABLE user;
which should look something like:
> CREATE TABLE `user` (
> `id` bigint(20) NOT NULL AUTO_INCREMENT,
> `passwd` char(64) DEFAULT NULL,
> `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
> `failedAuthTries` int(11) NOT NULL DEFAULT '0',
> PRIMARY KEY (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=12414783 DEFAULT CHARSET=latin1
5. Do 'SHOW SLAVE STATUS\G' and check 'Seconds_Behind_Master:'. Let that
get back to 0-ish seconds.
6. If everything has gone well, we now clean up and drop the column. On db3,
repeat the mysqldump:
7. On db3,
> SET autocommit=0;
> 'ALTER TABLE browserid.user DROP COLUMN failedAuthTries;'
8. Do:
> USE browserid;
> SHOW CREATE TABLE user;
to see the column is gone.
Assignee | ||
Comment 11•12 years ago
|
||
Step 2-5
http://gene.pastebin.mozilla.org/2492842
I forgot step 6
Step 7 - 8
[root@db3.iddb.phx1.svc ~]# date;time bash -c "echo \"SET autocommit=0;ALTER TABLE browserid.user DROP COLUMN failedAuthTries;\" | mysql -BE browserid"
Thu Jun 6 14:52:56 PDT 2013
real 0m6.747s
user 0m0.013s
sys 0m0.012s
[root@db3.iddb.phx1.svc ~]# echo "SHOW CREATE TABLE user;" | mysql -BE browserid
*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`passwd` char(64) DEFAULT NULL,
`lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2306076 DEFAULT CHARSET=latin1
[root@db3.iddb.phx1.svc ~]#
Assignee | ||
Comment 12•12 years ago
|
||
Here's the output from step 2-5 in the pastebin above (since the pastebin will expire)
[root@db3.iddb.phx1.svc ~]# date;time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null;date;time bash -c "echo \"SET autocommit=0;ALTER TABLE browserid.user ADD COLUMN failedAuthTries INT(11) NOT NULL DEFAULT '0';\" | mysql -BE browserid";echo "SHOW CREATE TABLE user;" | mysql -BE browserid;echo "SHOW SLAVE STATUS;" | mysql -BE browserid
Thu Jun 6 14:50:32 PDT 2013
real 0m2.744s
user 0m2.265s
sys 0m0.115s
real 0m4.020s
user 0m3.207s
sys 0m0.100s
real 0m3.842s
user 0m3.613s
sys 0m0.136s
Thu Jun 6 14:50:43 PDT 2013
real 0m7.285s
user 0m0.007s
sys 0m0.012s
*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`passwd` char(64) DEFAULT NULL,
`lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`failedAuthTries` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2306063 DEFAULT CHARSET=latin1
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: db2.iddb.phx1.svc.mozilla.com
Master_User: replication
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000269
Read_Master_Log_Pos: 81959899
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 9536040
Relay_Master_Log_File: mysql-bin.000269
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: browserid
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 81959899
Relay_Log_Space: 9536196
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Reporter | ||
Comment 13•12 years ago
|
||
Okay, so the mysqldumps + ADD COLUMN took 18 seconds. So end to end to the furthest leaf in the tree that's maybe about a minute (plus maybe a bit more for replication to catch up) to complete the maintenance.
I would like to get this same test done on an idle slave in an AWS region to confirm that those databases have about the same performance.
If they do, the I figure we should just do this either Friday or Monday morning, and
not target our quietest time of the night. (I would really like to get back to finishing with this train-2013.05.22 for shipping on Monday).
Assignee | ||
Comment 14•12 years ago
|
||
Alter completed after 142 minutes. I've started the revert now
[root@ip-10-148-37-93 ~]# date;time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null;date;time bash -c "echo \"SET autocomm
it=0;ALTER TABLE browserid.user ADD COLUMN failedAuthTries INT(11) NOT NULL DEFAULT '0';\" | mysql -BE browserid";echo "SHOW CREATE TABLE user;" | mysql -BE browserid;echo "SHOW SLAVE STATUS;" | mysql
-BE browserid
Fri Jun 7 11:37:37 PDT 2013
real 0m1.955s
user 0m1.012s
sys 0m0.043s
real 0m3.230s
user 0m1.425s
sys 0m0.050s
real 0m3.677s
user 0m1.913s
sys 0m0.066s
Fri Jun 7 11:37:46 PDT 2013
real 142m49.877s
user 0m0.003s
sys 0m0.009s
*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`passwd` char(64) DEFAULT NULL,
`lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`failedAuthTries` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2311456 DEFAULT CHARSET=latin1
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 10.148.33.176
Master_User: replication
Master_Port: 3306
Connect_Retry: 15
Master_Log_File: mysql-bin.000040
Read_Master_Log_Pos: 102699258
Relay_Log_File: mysqld-relay-bin.000034
Relay_Log_Pos: 102699403
Relay_Master_Log_File: mysql-bin.000040
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB: browserid
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 102699258
Relay_Log_Space: 102699602
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
[root@ip-10-148-37-93 ~]# date;time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null;date;
Fri Jun 7 14:04:24 PDT 2013
real 0m1.905s
user 0m1.014s
sys 0m0.039s
real 0m3.131s
user 0m1.422s
sys 0m0.046s
real 0m3.737s
user 0m1.983s
sys 0m0.072s
Fri Jun 7 14:04:33 PDT 2013
Assignee | ||
Comment 15•12 years ago
|
||
John, given the test results how would you like to proceed on Monday morning?
Assignee | ||
Comment 16•12 years ago
|
||
Here are the drop column results
[root@ip-10-148-37-93 ~]# date;time bash -c "echo \"SET autocommit=0;ALTER TABLE browserid.user DROP COLUMN failedAuthTries;\" | mysql -BE browserid"
Fri Jun 7 14:04:34 PDT 2013
real 145m7.841s
user 0m0.004s
sys 0m0.008s
[root@ip-10-148-37-93 ~]# echo "slave start;SHOW SLAVE STATUS;" | mysql -BE browserid
*************************** 1. row ***************************
Slave_IO_State: Connecting to master
Master_Host: 10.148.33.176
Master_User: replication
Master_Port: 3306
Connect_Retry: 15
Master_Log_File: mysql-bin.000040
Read_Master_Log_Pos: 102699258
Relay_Log_File: mysqld-relay-bin.000034
Relay_Log_Pos: 102699403
Relay_Master_Log_File: mysql-bin.000040
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB: browserid
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 102699258
Relay_Log_Space: 102699602
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
[root@ip-10-148-37-93 ~]# echo "SHOW SLAVE STATUS;" | mysql -BE browserid
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.148.33.176
Master_User: replication
Master_Port: 3306
Connect_Retry: 15
Master_Log_File: mysql-bin.000041
Read_Master_Log_Pos: 17061771
Relay_Log_File: mysqld-relay-bin.000035
Relay_Log_Pos: 1047957
Relay_Master_Log_File: mysql-bin.000040
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: browserid
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 103746964
Relay_Log_Space: 121920275
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 21682
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Reporter | ||
Comment 17•12 years ago
|
||
Corrections welcome; this is my understanding of the replication tree in
production. (I'm using logical names since I don't know the actual hostnames
used in production):
http://i.imgur.com/JFXpTGZ.png
The reason phx1-db3 is a child of phx1-db2 is that I requested that change
when we were thinking we would not move web traffic back into phx1, and the
plan then would have been that we would stop replication on db2 before
starting the ALTER. And so phx1-db2 would be a safe fallback position if
things went horribly wrong in the rest of the replication tree.
Anyways, trial runs in phx1 ran quick enough, but in aws infrastructure the
alter still takes hours.
So, one approach (with risks) would be:
0. Bring all web traffic back into phx1 and wait for aws to go idle.
1. Stop replication on scl2-db1 so those set of databases (and scl2 as a
whole) are a fallback.
2. Do mysqldump on all the database hosts, fwiw. And then start the alter
on phx1-db1.
3. Happy path: the alter succeeds in about the same time as the trial with
phx1-db3. During the time the ALTER propagates to phx1-db2 and then
phx1-db3, users making changes will experience transient errors. The ALTER
also continues to the two AWS regions, where, as long as there is no web
traffic, users will be oblivious to the hours-long database unavailability.
4. Not happy path: the ALTER on phx1-db1 takes >1 minute. We can safely
cancel the ALTER at this point and mysql will clean up; the ALTER statement is
not written to binary logging until it successfully completes, so it won't
replicate off phx1-db1. (Note: a crude way to check progress is the size of
the temporary /data/mysql/.#sql... table on phx1-db1).
5. Really unhappy path: something goes horribly, unexplainably wrong. SCL2, with
replication stopped is the fallback: declare scl2-db1 the new master, reset
the slave there so it no longer replicates from phx1-db1, change scl2 webhead
and dbwriter configurations to point to scl2-db1 as the write master, restart
webheads and dbwriters, move all web traffic to SCL2.
Reporter | ||
Comment 18•12 years ago
|
||
0. Bring all web traffic back into phx1 and wait for aws regions to go idle.
0b. Remove db2 and db3 from the database readonly loadbalancer so only db1 has
either read or write traffic.
1. Stop replication in an aws region to act as a fallback.
2. Do mysqldump on all the database hosts (that's *all* databases in phx1 and scl2, and two aws regions).
> `time mysqldump browserid staged > /dev/null; time mysqldump browserid email > /dev/null; time mysqldump browserid user > /dev/null`
and note the times (which should be ~20 seconds (or less) for `email` and `user`.
3. Immediately following step 2., on db3 do:
> USE browserid;
> SET autocommit=0;
> ALTER TABLE browserid.user ADD COLUMN `failedAuthTries` INT(11) NOT NULL DEFAULT '0';
and note the time (which should be on the order of a 30 seconds from previous data).
If this takes longer than ~ one minute, cancel the ALTER (which will rollback the
changes).
4. When it completes on the master Check the new column is there:
> USE browserid;
> SHOW CREATE TABLE user;
which should look something like:
> CREATE TABLE `user` (
> `id` bigint(20) NOT NULL AUTO_INCREMENT,
> `passwd` char(64) DEFAULT NULL,
> `lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
> `failedAuthTries` int(11) NOT NULL DEFAULT '0',
> PRIMARY KEY (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=12414783 DEFAULT CHARSET=latin1
5. Monitor the slaves in phx1 as it passes it down to db2 and db3. Do 'SHOW
SLAVE STATUS\G' and check 'Seconds_Behind_Master:'. This may take several
hours in AWS regions.
6. When phx1 slaves are complete and caught up, they can be returned to
take read traffic again.
7. Wait hours for the aws regions to catch up.
Reporter | ||
Comment 19•12 years ago
|
||
We began work on this at 10am PDT; at 11am PDT we started the alter which took 9 seconds on the master and a few minutes on slaves. During this time, there were
problems for write related operations til 11:15am PDT, due to us still having hosts reading from the slaves (despite being out of the loadbalancer) [Lesson: need to bounce webheads and dbwriters to get them off of established db connections).
The write proceeded into us-west-2 and completed about 4pm PDT. I've pointed my DNS and went through a series of use cases that trigger writes and replication, and looks good to go back in production, but Gene, can you check SHOW CREATE TABLE on each slave in us-west-2 and phx1 and the mysql error log on phx1-db{1,2,3} and us-west-2-db{1,2,3) for any unusual messages.
Assignee | ||
Comment 20•12 years ago
|
||
I shared the show create table with jrgm yesterday around 4pm and everything looked good. At that time we brought us-west-2 back into the load balance group and I enabled slave replication at us-east-1 which should have completed by now, i'll go check it out.
Assignee | ||
Comment 21•12 years ago
|
||
us-east-1 looks good, I'll enable replication
[admin:0525:us-east-1][root@ip-10-146-34-142 ~]$ for host in 10.146.33.213 10.146.37.43 10.146.35.96; do ssh $host 'echo "SHOW CREATE TABLE user;" | mysql -BE browserid'; done
*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`passwd` char(64) DEFAULT NULL,
`lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`failedAuthTries` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2346140 DEFAULT CHARSET=latin1
*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`passwd` char(64) DEFAULT NULL,
`lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`failedAuthTries` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2346140 DEFAULT CHARSET=latin1
*************************** 1. row ***************************
Table: user
Create Table: CREATE TABLE `user` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`passwd` char(64) DEFAULT NULL,
`lastPasswordReset` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`failedAuthTries` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2346140 DEFAULT CHARSET=latin1
Assignee | ||
Comment 22•12 years ago
|
||
sorry my comments are confusing/bogus. What I mean :
us-east-1 which has been replicating overnight looks good. Now to get some QA from jrgm and we'll put user traffic back on it.
Reporter | ||
Comment 23•12 years ago
|
||
Gene, us-east-1 is good to go. You can re-enable traffic to us-east-1. Thanks.
Reporter | ||
Comment 24•12 years ago
|
||
[11:35:25] <gene> 11:35am phx1 traffic drained. us-east-1 traffic enabled. we're now serving out of us-west-2 and us-east-1
Assignee | ||
Updated•12 years ago
|
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•12 years ago
|
Assignee: nobody → gene
You need to log in
before you can comment on or make changes to this bug.
Description
•