<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 1

•

13 years ago

I'll be working through these instructions tomorrow : https://intranet.mozilla.org/Services/Ops/BrowserID/MySQL#Schema_changes

Comment 2

•

13 years ago

I'll take this change as meaning this alter statement : ALTER TABLE user ADD COLUMN lastPasswordReset TIMESTAMP NOT NULL DEFAULT 0

Comment 3

•

13 years ago

It appears that the "browserid" user which the app uses to access the DB doesn't have rights to do what we need here : You do not have the PROCESS privilege at bin/pt-online-schema-change line 3708. And the "repl" user as defined in the master DB doesn't have rights to read the "browserid" database And neither user has rights to the "mysql" database so I can't determine what any other users are. And none of the gpg encrypted password files reveal any other browserid database users.

Comment 4

•

13 years ago

I've confirmed that I have the network ability to reach the databases # Validate that you can access all of the databases so pt-online-schema-change can monitor slave lag mpwd=PASSWORDGOESHERE for dbhost in db1.iddb db2.iddb db3.iddb; do echo $dbhost mysql --host=$dbhost.scl2.stage.svc.mozilla.com --batch --password=$mpwd --user=repl browserid -e "SELECT count(*) FROM user" done And my planned schema change command is : bin/pt-online-schema-change \ --alter 'ADD COLUMN lastPasswordReset TIMESTAMP NOT NULL DEFAULT 0' \ --alter-foreign-keys-method=rebuild_constraints --drop-old-table \ --print --recursion-method=processlist --recurse=1 \ --no-check-replication-filters --check-interval=1 --chunk-size=64 --execute \ D=browserid,t=user,u=USERNAMEGOESHERE,p=PASSWORDGOESHERE,h=db1.iddb.scl2.stage.svc.mozilla.com

James Bonacci [:jbonacci]

Comment 5

•

13 years ago

I've gotten in locally with the root user and confirmed that I'll need to get the root mysql password in order to do the pt-online-schema-change

Updated

•

13 years ago

Whiteboard: [qa+]

Comment 6

•

13 years ago

Found the location of the root password from atoll. Continuing with schema change now.

Comment 7

•

13 years ago

Schema change has completed but jrgm is reporting problems

Assignee

Comment 8

•

13 years ago

I think that this is a result of the "do-repl" statements in my.cnf for browserid. The data on the slaves is incorrect and no longer matches the master. Please consider all slaves damaged, they will need to be rebuilt from the master. I recommend the following steps: 1) on the master, ALTER TABLE browserid.user DROP COLUMN lastPasswordReset; 3) remove "replicate-do-db = browserid" from my.cnf.iddb (for stage only) 4) using puppet, deploy the new my.cnf to master and all slaves 5) restart mysql on master and all slaves 6) recreate all slaves using the mysql "restore a slave" process in the browserid wiki 7) remove "--no-check-replication-filters" from the pt-o-s-c command line 8) try the pt-o-s-c command again, with the same tests as before

John Morrison [:jrgm]

Comment 9

•

13 years ago

(In reply to Eugene Wood [:gene] from comment #7) > Schema change has completed but jrgm is reporting problems The two problems I saw were: 1. from approx. 19:32:32 to 19:35:21 the dbwriters (sweb) were returning 503 for /wsapi/complete_email_addition and /wsapi/complete_user_creation with additional detail like: > {"level":"warn","message":"Query taking more than 5000ms! reconnecting to mysql","timestamp":"2012-09-06T19:35:21.532Z"} > {"level":"warn","message":"couldn't complete email verification: database connection unavailable","timestamp":"2012-09-06T19:35:21.533Z"} > {"level":"warn","message":"database is down, cannot process request: database connection unavailable","timestamp":"2012-09-06T19:35:21.533Z"} Gene: do the logs on the pt job show a correlation to that time about what it was doing? (Damn, I miss having the general query log.) 2. On the replication slave databases this query was taking hours to complete: > ALTER TABLE `browserid`.`email` DROP FOREIGN KEY `__email_ibfk_1`, ADD CONSTRAINT `___email_ibfk_1` FOREIGN KEY (`user`) REFERENCES `browserid`.`user` (`id`) As a result, the table row counts on browserid.email and browserid.user were way off between the master and replicant, but now appear to have returned to normal (but I'm not sure I trust that they are completely in sync, so let's get this back to a known state).

John Morrison [:jrgm]

Comment 10

•

13 years ago

> 1. from approx. 19:32:32 to 19:35:21 Those are GMT times, so 12:30ish PDT.

John Morrison [:jrgm]

Comment 11

•

13 years ago

> (Damn, I miss having the general query log.) Okay, off topic to this bug, but at a previous company, we ran a minutely cronjob to do 'SHOW FULL PROCESSLIST' and dumped that into some_dir/%H%M files that would overwrite themselves every 24 hours. Useful for post mortem analysis.

Comment 12

•

13 years ago

Rolled back change on db1 : mysql --host=db1.iddb.scl2.stage.svc.mozilla.com --batch --password=$mpwd --user=$muser browserid -e "ALTER TABLE browserid.user DROP COLUMN lastPasswordReset;"

Comment 13

•

13 years ago

my.cnf modified and all staging mysql servers restarted

Comment 14

•

13 years ago

Slaves restored

Comment 15

•

13 years ago

Ok, App and schema reverted to 06.22. jrgm would you smoke test it and then start up a load test, let me know and I'll begin the schema change again

Comment 16

•

13 years ago

I've restarted the schema change now while jrgm puts load on the system : Start time : Fri Sep 7 15:04:29 PDT 2012 date;bin/pt-online-schema-change \ --alter 'ADD COLUMN lastPasswordReset TIMESTAMP NOT NULL DEFAULT 0' \ --alter-foreign-keys-method=rebuild_constraints --drop-old-table \ --print --recursion-method=processlist --recurse=1 \ --check-interval=1 --chunk-size=64 --execute \ D=browserid,t=user,u=$muser,p=$mpwd,h=db1.iddb.scl2.stage.svc.mozilla.com;date

Assignee

Comment 17

•

13 years ago

It looks like "--alter-foreign-keys-method=rebuild_constraints" creates a huge locking delay at the end of the process due to the foreign key rebuild method [1]. I recommend rolling back and then trying again with one change to the comment 16 command line "--alter-foreign-keys-method=drop_old_table", rather than method=rebuild_constraints. All other --options remain the same, including --drop-old-table. Note that the "method=drop_old_table" comes with safety warnings [1]. [1] http://www.percona.com/doc/percona-toolkit/2.0/pt-online-schema-change.html#cmdoption-pt-online-schema-change--update-foreign-keys-method

Comment 18

•

13 years ago

The schema change completed at Fri Sep 7 16:13:20 PDT 2012 but caused a 15 minute window due to foreign key constraints. I've now kicked off a rollback of the schema change ; mysql --host=db1.iddb.scl2.stage.svc.mozilla.com --batch --password=$mpwd --user=$muser browserid -e "ALTER TABLE browserid.user DROP COLUMN lastPasswordReset;"

Comment 19

•

13 years ago

@sheeri - We'd love your review and advice on our deployment strategy. Feel free to join us in #identity

Assignee

Comment 20

•

13 years ago

My worst-case scenario alternative to comment 17 is: a. ALTER TABLE on the write master, blocking all read and write queries until it finishes in 5-30 minutes. b. Confirm that it replicates to all slaves. However, we can reduce "5-30 minutes" to the time it takes to do a master/slave swap: a. Drain service from SCL2. b. Direct all PHX1 write AND read traffic to the write master in PHX1. c. ALTER TABLE on all servers in SCL2 and all slaves in PHX1. d. Master failover to an altered slave in PHX1. e. Rebuild the unaltered former master using the standard copy-from-master process. f. Confirm replication working to all servers in SCL2 and PHX1. g. Undrain service from SCL2.

Comment 21

•

13 years ago

For users and email tables... how many rows exist in stage and how many exist in production? If our stage db (due to automated tests) is much larger than prod, we should delete some rows to get it down to 2x or 3x the size of production.

Sheeri Cabral [:sheeri]

Comment 22

•

13 years ago

Your ideas on how to handle the ALTER TABLE are fine, but....if this is only to support 1 feature, I have some questions..... Why require a field for it? Why can't the password reset itself kick all the sessions without needing a field in the database? (and yes, I understand that there may be sessions on the master, and sessions on more than one slave, etc but doing a cron job every minute adds tons of overhead). Why not create a new table, which logs password resets? You may want to log other things in the future, so having a table that has user_id, an ENUM "action" field (right now the only action would be lastPasswordReset, but you can add to the end of an ENUM field without locking, and thus without needing downtime) and the timestamp seems pretty sane. And creating a table doesn't require any locking, either.

Sheeri Cabral [:sheeri]

Comment 23

•

13 years ago

There may be an issue with the workaround, because you're adding a column - if you add the column at different times on different machines, replication might break because the master/slave have different schemas.

Comment 24

•

13 years ago

(In reply to Sheeri Cabral [:sheeri] from comment #22) > Why can't the password reset itself kick all the sessions without needing a field in the database? Our sessions are client side. > Why not create a new table I agree with this solution, but we have a hard deadline and re-engineering won't be possible for beta launch.

Comment 25

•

13 years ago

ozten : select count(*) from user; Staging : 6719176 Prod : 189561

Comment 26

•

13 years ago

(In reply to Eugene Wood [:gene] from comment #25) Per IRC discussion... Please 1) reduce table size to 2x or 3x prod 2) Retest migration starting with the most desirable option.

Assignee

Comment 27

•

13 years ago

(In reply to Austin King [:ozten] from comment #26) > 1) reduce table size to 2x or 3x prod Reduced to users id 1..400000 plus 1000 loadtest users. > 2) Retest migration starting with the most desirable option. Tested using the simple option, direct ALTER TABLE on master (per comment 20), median 2.5 seconds, max 4.5 seconds. Current suggested procedure for production alter: (a) Drain all non-master sites (b) Drain all slaves from Zeus database pools, leaving only the write master (c) ALTER TABLE on the master (d) Confirm schema replicated correctly on all slaves (e) Undrain all slaves from Zeus database pools (f) Undrain all non-master sites

Sheeri Cabral [:sheeri]

Comment 28

•

13 years ago

Did you want to drain ALL slaves, or all but 1? will writes AND reads go to the master while this is going on? this should all work, depending on the replicated statements, so I recommend dong some thorough functional testing of this in stage.

Assignee

Comment 29

•

13 years ago

All slaves. The master is spec'd to support full read/write load during such an event, and stage loadtests have (unintentionally) been run in that configuration since June 2012.