Closed
Bug 787707
Opened 13 years ago
Closed 13 years ago
Persona: Schema change: add column to user - lastPasswordReset TIMESTAMP DEFAULT 0 NOT NULL
Categories
(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: lhilaiel, Assigned: Atoll)
References
Details
(Whiteboard: [qa+])
Please initiate a schema change in the persona staging environment as a pre-requisite to deploying train-2012.08.17
The change adds a column to the user table: lastPasswordReset TIMESTAMP DEFAULT 0 NOT NULL
Here is the diff in source code: https://github.com/mozilla/browserid/pull/2026/files#L2R68
This change is to support a feature that causes all outstanding user sessions to be reset upon password change.
Comment 1•13 years ago
|
||
I'll be working through these instructions tomorrow : https://intranet.mozilla.org/Services/Ops/BrowserID/MySQL#Schema_changes
Comment 2•13 years ago
|
||
I'll take this change as meaning this alter statement :
ALTER TABLE user ADD COLUMN lastPasswordReset TIMESTAMP NOT NULL DEFAULT 0
Comment 3•13 years ago
|
||
It appears that the "browserid" user which the app uses to access the DB doesn't have rights to do what we need here :
You do not have the PROCESS privilege at bin/pt-online-schema-change line 3708.
And the "repl" user as defined in the master DB doesn't have rights to read the "browserid" database
And neither user has rights to the "mysql" database so I can't determine what any other users are.
And none of the gpg encrypted password files reveal any other browserid database users.
Comment 4•13 years ago
|
||
I've confirmed that I have the network ability to reach the databases
# Validate that you can access all of the databases so pt-online-schema-change can monitor slave lag
mpwd=PASSWORDGOESHERE
for dbhost in db1.iddb db2.iddb db3.iddb; do
echo $dbhost
mysql --host=$dbhost.scl2.stage.svc.mozilla.com --batch --password=$mpwd --user=repl browserid -e "SELECT count(*) FROM user"
done
And my planned schema change command is :
bin/pt-online-schema-change \
--alter 'ADD COLUMN lastPasswordReset TIMESTAMP NOT NULL DEFAULT 0' \
--alter-foreign-keys-method=rebuild_constraints --drop-old-table \
--print --recursion-method=processlist --recurse=1 \
--no-check-replication-filters --check-interval=1 --chunk-size=64 --execute \
D=browserid,t=user,u=USERNAMEGOESHERE,p=PASSWORDGOESHERE,h=db1.iddb.scl2.stage.svc.mozilla.com
Comment 5•13 years ago
|
||
I've gotten in locally with the root user and confirmed that I'll need to get the root mysql password in order to do the pt-online-schema-change
Updated•13 years ago
|
Whiteboard: [qa+]
Comment 6•13 years ago
|
||
Found the location of the root password from atoll. Continuing with schema change now.
Comment 7•13 years ago
|
||
Schema change has completed but jrgm is reporting problems
I think that this is a result of the "do-repl" statements in my.cnf for browserid.
The data on the slaves is incorrect and no longer matches the master. Please consider all slaves damaged, they will need to be rebuilt from the master.
I recommend the following steps:
1) on the master, ALTER TABLE browserid.user DROP COLUMN lastPasswordReset;
3) remove "replicate-do-db = browserid" from my.cnf.iddb (for stage only)
4) using puppet, deploy the new my.cnf to master and all slaves
5) restart mysql on master and all slaves
6) recreate all slaves using the mysql "restore a slave" process in the browserid wiki
7) remove "--no-check-replication-filters" from the pt-o-s-c command line
8) try the pt-o-s-c command again, with the same tests as before
Comment 9•13 years ago
|
||
(In reply to Eugene Wood [:gene] from comment #7)
> Schema change has completed but jrgm is reporting problems
The two problems I saw were:
1. from approx. 19:32:32 to 19:35:21 the dbwriters (sweb) were returning 503 for /wsapi/complete_email_addition and /wsapi/complete_user_creation with additional detail like:
> {"level":"warn","message":"Query taking more than 5000ms! reconnecting to mysql","timestamp":"2012-09-06T19:35:21.532Z"}
> {"level":"warn","message":"couldn't complete email verification: database connection unavailable","timestamp":"2012-09-06T19:35:21.533Z"}
> {"level":"warn","message":"database is down, cannot process request: database connection unavailable","timestamp":"2012-09-06T19:35:21.533Z"}
Gene: do the logs on the pt job show a correlation to that time about what it was doing? (Damn, I miss having the general query log.)
2. On the replication slave databases this query was taking hours to complete:
> ALTER TABLE `browserid`.`email` DROP FOREIGN KEY `__email_ibfk_1`, ADD CONSTRAINT `___email_ibfk_1` FOREIGN KEY (`user`) REFERENCES `browserid`.`user` (`id`)
As a result, the table row counts on browserid.email and browserid.user were way off between the master and replicant, but now appear to have returned to normal (but I'm not sure I trust that they are completely in sync, so let's get this back to a known state).
Comment 10•13 years ago
|
||
> 1. from approx. 19:32:32 to 19:35:21
Those are GMT times, so 12:30ish PDT.
Comment 11•13 years ago
|
||
> (Damn, I miss having the general query log.)
Okay, off topic to this bug, but at a previous company, we ran a minutely cronjob to do 'SHOW FULL PROCESSLIST' and dumped that into some_dir/%H%M files that would overwrite themselves every 24 hours. Useful for post mortem analysis.
Comment 12•13 years ago
|
||
Rolled back change on db1 :
mysql --host=db1.iddb.scl2.stage.svc.mozilla.com --batch --password=$mpwd --user=$muser browserid -e "ALTER TABLE browserid.user DROP COLUMN lastPasswordReset;"
Comment 13•13 years ago
|
||
my.cnf modified and all staging mysql servers restarted
Comment 14•13 years ago
|
||
Slaves restored
Comment 15•13 years ago
|
||
Ok, App and schema reverted to 06.22.
jrgm would you smoke test it and then start up a load test, let me know and I'll begin the schema change again
Comment 16•13 years ago
|
||
I've restarted the schema change now while jrgm puts load on the system :
Start time : Fri Sep 7 15:04:29 PDT 2012
date;bin/pt-online-schema-change \
--alter 'ADD COLUMN lastPasswordReset TIMESTAMP NOT NULL DEFAULT 0' \
--alter-foreign-keys-method=rebuild_constraints --drop-old-table \
--print --recursion-method=processlist --recurse=1 \
--check-interval=1 --chunk-size=64 --execute \
D=browserid,t=user,u=$muser,p=$mpwd,h=db1.iddb.scl2.stage.svc.mozilla.com;date
| Assignee | ||
Comment 17•13 years ago
|
||
It looks like "--alter-foreign-keys-method=rebuild_constraints" creates a huge locking delay at the end of the process due to the foreign key rebuild method [1].
I recommend rolling back and then trying again with one change to the comment 16 command line "--alter-foreign-keys-method=drop_old_table", rather than method=rebuild_constraints. All other --options remain the same, including --drop-old-table.
Note that the "method=drop_old_table" comes with safety warnings [1].
[1] http://www.percona.com/doc/percona-toolkit/2.0/pt-online-schema-change.html#cmdoption-pt-online-schema-change--update-foreign-keys-method
Comment 18•13 years ago
|
||
The schema change completed at Fri Sep 7 16:13:20 PDT 2012 but caused a 15 minute window due to foreign key constraints.
I've now kicked off a rollback of the schema change ;
mysql --host=db1.iddb.scl2.stage.svc.mozilla.com --batch --password=$mpwd --user=$muser browserid -e "ALTER TABLE browserid.user DROP COLUMN lastPasswordReset;"
Comment 19•13 years ago
|
||
@sheeri - We'd love your review and advice on our deployment strategy.
Feel free to join us in #identity
| Assignee | ||
Comment 20•13 years ago
|
||
My worst-case scenario alternative to comment 17 is:
a. ALTER TABLE on the write master, blocking all read and write queries until it finishes in 5-30 minutes.
b. Confirm that it replicates to all slaves.
However, we can reduce "5-30 minutes" to the time it takes to do a master/slave swap:
a. Drain service from SCL2.
b. Direct all PHX1 write AND read traffic to the write master in PHX1.
c. ALTER TABLE on all servers in SCL2 and all slaves in PHX1.
d. Master failover to an altered slave in PHX1.
e. Rebuild the unaltered former master using the standard copy-from-master process.
f. Confirm replication working to all servers in SCL2 and PHX1.
g. Undrain service from SCL2.
Comment 21•13 years ago
|
||
For users and email tables... how many rows exist in stage and how many exist in production?
If our stage db (due to automated tests) is much larger than prod, we should delete some rows to get it down to 2x or 3x the size of production.
Comment 22•13 years ago
|
||
Your ideas on how to handle the ALTER TABLE are fine, but....if this is only to support 1 feature, I have some questions.....
Why require a field for it? Why can't the password reset itself kick all the sessions without needing a field in the database? (and yes, I understand that there may be sessions on the master, and sessions on more than one slave, etc but doing a cron job every minute adds tons of overhead).
Why not create a new table, which logs password resets? You may want to log other things in the future, so having a table that has user_id, an ENUM "action" field (right now the only action would be lastPasswordReset, but you can add to the end of an ENUM field without locking, and thus without needing downtime) and the timestamp seems pretty sane. And creating a table doesn't require any locking, either.
Comment 23•13 years ago
|
||
There may be an issue with the workaround, because you're adding a column - if you add the column at different times on different machines, replication might break because the master/slave have different schemas.
Comment 24•13 years ago
|
||
(In reply to Sheeri Cabral [:sheeri] from comment #22)
> Why can't the password reset itself kick all the sessions without needing a field in the database?
Our sessions are client side.
> Why not create a new table
I agree with this solution, but we have a hard deadline and re-engineering won't be possible for beta launch.
Comment 25•13 years ago
|
||
ozten :
select count(*) from user;
Staging : 6719176
Prod : 189561
Comment 26•13 years ago
|
||
(In reply to Eugene Wood [:gene] from comment #25)
Per IRC discussion... Please
1) reduce table size to 2x or 3x prod
2) Retest migration starting with the most desirable option.
| Assignee | ||
Comment 27•13 years ago
|
||
(In reply to Austin King [:ozten] from comment #26)
> 1) reduce table size to 2x or 3x prod
Reduced to users id 1..400000 plus 1000 loadtest users.
> 2) Retest migration starting with the most desirable option.
Tested using the simple option, direct ALTER TABLE on master (per comment 20), median 2.5 seconds, max 4.5 seconds.
Current suggested procedure for production alter:
(a) Drain all non-master sites
(b) Drain all slaves from Zeus database pools, leaving only the write master
(c) ALTER TABLE on the master
(d) Confirm schema replicated correctly on all slaves
(e) Undrain all slaves from Zeus database pools
(f) Undrain all non-master sites
Comment 28•13 years ago
|
||
Did you want to drain ALL slaves, or all but 1? will writes AND reads go to the master while this is going on?
this should all work, depending on the replicated statements, so I recommend dong some thorough functional testing of this in stage.
| Assignee | ||
Comment 29•13 years ago
|
||
All slaves. The master is spec'd to support full read/write load during such an event, and stage loadtests have (unintentionally) been run in that configuration since June 2012.
| Assignee | ||
Comment 30•13 years ago
|
||
This was eventually done during a deploy day a couple weeks ago.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•