Closed Bug 787707 Opened 13 years ago Closed 13 years ago

Persona: Schema change: add column to user - lastPasswordReset TIMESTAMP DEFAULT 0 NOT NULL

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lhilaiel, Assigned: Atoll)

References

Details

(Whiteboard: [qa+])

Please initiate a schema change in the persona staging environment as a pre-requisite to deploying train-2012.08.17 The change adds a column to the user table: lastPasswordReset TIMESTAMP DEFAULT 0 NOT NULL Here is the diff in source code: https://github.com/mozilla/browserid/pull/2026/files#L2R68 This change is to support a feature that causes all outstanding user sessions to be reset upon password change.
Blocks: 787708
I'll be working through these instructions tomorrow : https://intranet.mozilla.org/Services/Ops/BrowserID/MySQL#Schema_changes
I'll take this change as meaning this alter statement : ALTER TABLE user ADD COLUMN lastPasswordReset TIMESTAMP NOT NULL DEFAULT 0
It appears that the "browserid" user which the app uses to access the DB doesn't have rights to do what we need here : You do not have the PROCESS privilege at bin/pt-online-schema-change line 3708. And the "repl" user as defined in the master DB doesn't have rights to read the "browserid" database And neither user has rights to the "mysql" database so I can't determine what any other users are. And none of the gpg encrypted password files reveal any other browserid database users.
I've confirmed that I have the network ability to reach the databases # Validate that you can access all of the databases so pt-online-schema-change can monitor slave lag mpwd=PASSWORDGOESHERE for dbhost in db1.iddb db2.iddb db3.iddb; do echo $dbhost mysql --host=$dbhost.scl2.stage.svc.mozilla.com --batch --password=$mpwd --user=repl browserid -e "SELECT count(*) FROM user" done And my planned schema change command is : bin/pt-online-schema-change \ --alter 'ADD COLUMN lastPasswordReset TIMESTAMP NOT NULL DEFAULT 0' \ --alter-foreign-keys-method=rebuild_constraints --drop-old-table \ --print --recursion-method=processlist --recurse=1 \ --no-check-replication-filters --check-interval=1 --chunk-size=64 --execute \ D=browserid,t=user,u=USERNAMEGOESHERE,p=PASSWORDGOESHERE,h=db1.iddb.scl2.stage.svc.mozilla.com
I've gotten in locally with the root user and confirmed that I'll need to get the root mysql password in order to do the pt-online-schema-change
Whiteboard: [qa+]
Found the location of the root password from atoll. Continuing with schema change now.
Schema change has completed but jrgm is reporting problems
I think that this is a result of the "do-repl" statements in my.cnf for browserid. The data on the slaves is incorrect and no longer matches the master. Please consider all slaves damaged, they will need to be rebuilt from the master. I recommend the following steps: 1) on the master, ALTER TABLE browserid.user DROP COLUMN lastPasswordReset; 3) remove "replicate-do-db = browserid" from my.cnf.iddb (for stage only) 4) using puppet, deploy the new my.cnf to master and all slaves 5) restart mysql on master and all slaves 6) recreate all slaves using the mysql "restore a slave" process in the browserid wiki 7) remove "--no-check-replication-filters" from the pt-o-s-c command line 8) try the pt-o-s-c command again, with the same tests as before
(In reply to Eugene Wood [:gene] from comment #7) > Schema change has completed but jrgm is reporting problems The two problems I saw were: 1. from approx. 19:32:32 to 19:35:21 the dbwriters (sweb) were returning 503 for /wsapi/complete_email_addition and /wsapi/complete_user_creation with additional detail like: > {"level":"warn","message":"Query taking more than 5000ms! reconnecting to mysql","timestamp":"2012-09-06T19:35:21.532Z"} > {"level":"warn","message":"couldn't complete email verification: database connection unavailable","timestamp":"2012-09-06T19:35:21.533Z"} > {"level":"warn","message":"database is down, cannot process request: database connection unavailable","timestamp":"2012-09-06T19:35:21.533Z"} Gene: do the logs on the pt job show a correlation to that time about what it was doing? (Damn, I miss having the general query log.) 2. On the replication slave databases this query was taking hours to complete: > ALTER TABLE `browserid`.`email` DROP FOREIGN KEY `__email_ibfk_1`, ADD CONSTRAINT `___email_ibfk_1` FOREIGN KEY (`user`) REFERENCES `browserid`.`user` (`id`) As a result, the table row counts on browserid.email and browserid.user were way off between the master and replicant, but now appear to have returned to normal (but I'm not sure I trust that they are completely in sync, so let's get this back to a known state).
> 1. from approx. 19:32:32 to 19:35:21 Those are GMT times, so 12:30ish PDT.
> (Damn, I miss having the general query log.) Okay, off topic to this bug, but at a previous company, we ran a minutely cronjob to do 'SHOW FULL PROCESSLIST' and dumped that into some_dir/%H%M files that would overwrite themselves every 24 hours. Useful for post mortem analysis.
Rolled back change on db1 : mysql --host=db1.iddb.scl2.stage.svc.mozilla.com --batch --password=$mpwd --user=$muser browserid -e "ALTER TABLE browserid.user DROP COLUMN lastPasswordReset;"
my.cnf modified and all staging mysql servers restarted
Slaves restored
Ok, App and schema reverted to 06.22. jrgm would you smoke test it and then start up a load test, let me know and I'll begin the schema change again
I've restarted the schema change now while jrgm puts load on the system : Start time : Fri Sep 7 15:04:29 PDT 2012 date;bin/pt-online-schema-change \ --alter 'ADD COLUMN lastPasswordReset TIMESTAMP NOT NULL DEFAULT 0' \ --alter-foreign-keys-method=rebuild_constraints --drop-old-table \ --print --recursion-method=processlist --recurse=1 \ --check-interval=1 --chunk-size=64 --execute \ D=browserid,t=user,u=$muser,p=$mpwd,h=db1.iddb.scl2.stage.svc.mozilla.com;date
It looks like "--alter-foreign-keys-method=rebuild_constraints" creates a huge locking delay at the end of the process due to the foreign key rebuild method [1]. I recommend rolling back and then trying again with one change to the comment 16 command line "--alter-foreign-keys-method=drop_old_table", rather than method=rebuild_constraints. All other --options remain the same, including --drop-old-table. Note that the "method=drop_old_table" comes with safety warnings [1]. [1] http://www.percona.com/doc/percona-toolkit/2.0/pt-online-schema-change.html#cmdoption-pt-online-schema-change--update-foreign-keys-method
The schema change completed at Fri Sep 7 16:13:20 PDT 2012 but caused a 15 minute window due to foreign key constraints. I've now kicked off a rollback of the schema change ; mysql --host=db1.iddb.scl2.stage.svc.mozilla.com --batch --password=$mpwd --user=$muser browserid -e "ALTER TABLE browserid.user DROP COLUMN lastPasswordReset;"
@sheeri - We'd love your review and advice on our deployment strategy. Feel free to join us in #identity
My worst-case scenario alternative to comment 17 is: a. ALTER TABLE on the write master, blocking all read and write queries until it finishes in 5-30 minutes. b. Confirm that it replicates to all slaves. However, we can reduce "5-30 minutes" to the time it takes to do a master/slave swap: a. Drain service from SCL2. b. Direct all PHX1 write AND read traffic to the write master in PHX1. c. ALTER TABLE on all servers in SCL2 and all slaves in PHX1. d. Master failover to an altered slave in PHX1. e. Rebuild the unaltered former master using the standard copy-from-master process. f. Confirm replication working to all servers in SCL2 and PHX1. g. Undrain service from SCL2.
For users and email tables... how many rows exist in stage and how many exist in production? If our stage db (due to automated tests) is much larger than prod, we should delete some rows to get it down to 2x or 3x the size of production.
Your ideas on how to handle the ALTER TABLE are fine, but....if this is only to support 1 feature, I have some questions..... Why require a field for it? Why can't the password reset itself kick all the sessions without needing a field in the database? (and yes, I understand that there may be sessions on the master, and sessions on more than one slave, etc but doing a cron job every minute adds tons of overhead). Why not create a new table, which logs password resets? You may want to log other things in the future, so having a table that has user_id, an ENUM "action" field (right now the only action would be lastPasswordReset, but you can add to the end of an ENUM field without locking, and thus without needing downtime) and the timestamp seems pretty sane. And creating a table doesn't require any locking, either.
There may be an issue with the workaround, because you're adding a column - if you add the column at different times on different machines, replication might break because the master/slave have different schemas.
(In reply to Sheeri Cabral [:sheeri] from comment #22) > Why can't the password reset itself kick all the sessions without needing a field in the database? Our sessions are client side. > Why not create a new table I agree with this solution, but we have a hard deadline and re-engineering won't be possible for beta launch.
ozten : select count(*) from user; Staging : 6719176 Prod : 189561
(In reply to Eugene Wood [:gene] from comment #25) Per IRC discussion... Please 1) reduce table size to 2x or 3x prod 2) Retest migration starting with the most desirable option.
(In reply to Austin King [:ozten] from comment #26) > 1) reduce table size to 2x or 3x prod Reduced to users id 1..400000 plus 1000 loadtest users. > 2) Retest migration starting with the most desirable option. Tested using the simple option, direct ALTER TABLE on master (per comment 20), median 2.5 seconds, max 4.5 seconds. Current suggested procedure for production alter: (a) Drain all non-master sites (b) Drain all slaves from Zeus database pools, leaving only the write master (c) ALTER TABLE on the master (d) Confirm schema replicated correctly on all slaves (e) Undrain all slaves from Zeus database pools (f) Undrain all non-master sites
Did you want to drain ALL slaves, or all but 1? will writes AND reads go to the master while this is going on? this should all work, depending on the replicated statements, so I recommend dong some thorough functional testing of this in stage.
All slaves. The master is spec'd to support full read/write load during such an event, and stage loadtests have (unintentionally) been run in that configuration since June 2012.
This was eventually done during a deploy day a couple weeks ago.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.