Closed Bug 1176486 Opened 10 years ago Closed 9 years ago

Migrate the Treeherder prod DB to a new stage RDS instance for Heroku

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: fubar)

References

Details

Currently prod is 600GB (though after a defrag this will hopefully shrink by more than half; bug ). We're going to need to figure out the best way to transfer the data to the Amazon RDS instance used by Heroku, and estimate how long this will take (since it will affect the tree closure time). In addition, we're currently using the RDS instance set up in bug 1150986, and need to:
1) Check if we can continue to use that AWS account, or if we need to use a different team's
2) Check if the current instance type/size is adequate
3) Ensure we have a reserved instance for best price
4) Decide what to do for stage vs prod (I'm presuming we'll need two separate RDS instances?)
5) Clear out the data in the current RDS instance
6) Give access to the AWS account to the treeherder devs
(In reply to Ed Morley [:emorley] from comment #0) > Currently prod is 600GB (though after a defrag this will hopefully more than > half; bug ). Bug 1161618.
I think we should consider using Amazon's new Aurora service:
http://aws.amazon.com/rds/aurora/
http://www.theregister.co.uk/2014/11/26/inside_aurora_how_disruptive_is_amazons_mysql_clone/
Though there are one or two things that we won't get until they finish the preview phase (e.g. choice of AZ, encryption at rest, etc.).
Depends on: 1179860
We'll also want to check that we have "require SSL" enabled: https://devcenter.heroku.com/articles/amazon-rds#require-ssl
I'm fairly certain I set the require ssl bit, but feel free to double check! :-D
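For anyone double-checking: on MySQL 5.6 RDS, "require SSL" is enforced per-account via GRANT ... REQUIRE SSL, so one way to verify it is to look at the ssl_type column in mysql.user. A sketch (the endpoint hostname here is a placeholder):

```shell
# Placeholder: substitute the real RDS endpoint.
RDS_HOST=treeherder-heroku.example.us-east-1.rds.amazonaws.com

# An ssl_type of ANY (or X509) means that account must connect over SSL;
# an empty ssl_type means SSL is optional for it.
mysql -h "$RDS_HOST" -u th_admin -p \
  -e "SELECT user, host, ssl_type FROM mysql.user;"

# The flag is set by a grant like:
#   GRANT USAGE ON *.* TO 'th_admin'@'%' REQUIRE SSL;
```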
also...
1) yes, you are absolutely welcome to use our account; it was part of our budgeting plan.
3) it's not currently a reserved instance, so we should change that when we know what size we actually want.
4) definitely recommend separate instances for stage and prod. drop me a bug at any time to spin up a stage db if you haven't.
it might be worth talking to the DBAs about migrating data. I know they have some plan to replicate BMO data to AWS when we get to that point, so that might be something they can set up for TH?
Ah thank you :-)
We should also set up read only MySQL users (currently we just have one th_admin user), so as to limit exposure to things like bug 1182994.
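A read-only account could be created along these lines (sketch only: the user name and password are placeholders, and the grant scope would want narrowing to Treeherder's actual schemas):

```shell
mysql -h "$RDS_HOST" -u th_admin -p <<'SQL'
-- Hypothetical read-only account; name and password are placeholders.
CREATE USER 'th_readonly'@'%' IDENTIFIED BY 'CHANGEME';
-- SELECT-only across all schemas; restrict to specific databases
-- (e.g. the per-repo *_jobs_* schemas) if broad visibility isn't wanted.
GRANT SELECT ON *.* TO 'th_readonly'@'%' REQUIRE SSL;
FLUSH PRIVILEGES;
SQL
```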
(In reply to Ed Morley (Away 29th July -> 2nd Aug) [:emorley] from comment #2)
> I think we should consider using Amazon's new Aurora service:
> http://aws.amazon.com/rds/aurora/
> http://www.theregister.co.uk/2014/11/26/inside_aurora_how_disruptive_is_amazons_mysql_clone/
>
> Though there are one or two things that we won't get until they finish the
> preview phase (eg choice of AZ, encryption at rest etc).

Aurora is now officially released: https://aws.amazon.com/blogs/aws/now-available-amazon-aurora/
Seems like there are one or two issues they are still working through (though to be expected): https://forums.aws.amazon.com/search.jspa?threadID=&q=aurora&objID=f60&userID=&dateRange=all&numResults=30&rankBy=9
Amazon now have a preview available of their Database Migration Service (US East, N. Virginia only for now), which should make the move easier: http://aws.amazon.com/dms/
Aurora now also supports encryption at rest: http://aws.amazon.com/about-aws/whats-new/2015/12/amazon-aurora-now-supports-encryption-at-rest/ It's not mentioned if there's a perf impact, and there is also a cost per request to the key management service, but it's not clear how many would be made (on each restart of the nodes?).
Sheeri, we're looking to try and migrate the SCL3 hosted treeherder database to Amazon RDS in Q1, as part of Treeherder's migration to Heroku. Do you have any suggestions as to the best way to do this? Amazon's new database migration service looks promising (https://aws.amazon.com/dms/) - does that seem like an idea? How would we go about setting up flows etc to allow that? Many thanks :-)
Flags: needinfo?(scabral)
Matt, I don't suppose you'd be able to help with this? :-)

(In reply to Ed Morley [:emorley] from comment #12)
> Sheeri, we're looking to try and migrate the SCL3 hosted treeherder database
> to Amazon RDS in Q1, as part of Treeherder's migration to Heroku. Do you
> have any suggestions as to the best way to do this? Amazon's new database
> migration service looks promising (https://aws.amazon.com/dms/) - does that
> seem like an idea? How would we go about setting up flows etc to allow that?
Flags: needinfo?(mpressman)
I think we might want to bring in Pythian to help with this. I don't think I'll have much time this quarter. I can certainly provide some answers and point out issues that I am aware of regarding moving Bugzilla to RDS/EC2. Initially, this was going to RDS, but it had to be moved to its own instance; gozer would be able to better answer whether that was related to db config options that were limited in RDS or more to the application itself. I also recall that we took a mysqldump and that it loaded just fine initially on RDS.
Flags: needinfo?(mpressman)
Thank you. Some additional context:
* Treeherder's DB is ~130GB
* We need to move it from SCL3 to AWS this quarter, for the Heroku move (bug 1176484)

Philippe/Matt, between you would you be able to answer the following?
1) What was the reason for Bugzilla using a standalone instance rather than RDS? (comment 14)
2) How long did the mysqldump mentioned in comment 14 take, was it to the standalone instance or to RDS, and what was the process? (Did you rsync the file, or ...?)
3) If we wanted to try Amazon's new database migration service (https://aws.amazon.com/dms/), how would we go about setting up flows between them and the Treeherder DB nodes? (Or more generically: I presume between an AWS VPC and specific nodes in SCL3)

Many thanks :-)
Flags: needinfo?(gozer)
(In reply to Ed Morley [:emorley] from comment #15)
> 1) What was the reason for Bugzilla using a standalone instance rather than
> RDS? (comment 14)

A simple reason, and a bad one, really. It turned out that Bugzilla *needs* its database to be in the PST timezone, and Amazon RDS doesn't allow for tuning that knob: it's UTC or nothing. So, because of that, we've had to use MySQL on EC2.

> 2) How long did the mysqldump mentioned in comment 14 take, was it to the
> standalone instance or to RDS, and what was the process? (Did you rsync the
> file, or ...?)

I didn't keep records, but IIRC the Bugzilla DB is in the 100GB range and it took many, many hours - something along the lines of 8-10 hours. The problem is that with RDS you only have the mysqldump route to export/import your data, and that means paying the price of SQL at import time. When I did it with MySQL on EC2, I was able to just rsync the raw database files over, which was much, much faster and was easily done in incremental steps.

> 3) If we wanted to try Amazon's new database migration service
> (https://aws.amazon.com/dms/), how would we go about setting up flows
> between them and the Treeherder DB nodes? (Or more generically: I presume
> between an AWS VPC and specific nodes in SCL3)

Yes, that avenue sure sounds interesting, especially the self-healing aspects of it. Yeah, you'll need VPC/VPN connectivity between the AWS account in question and SCL3 - doable, but you'll need a big assist from NetOps to get that set up.
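For reference, the mysqldump route described above usually streams the dump straight into the RDS endpoint (RDS gives no filesystem access, so there's nothing to rsync to). A sketch - hostnames and the database name are placeholders:

```shell
# Stream a consistent dump from the SCL3 master straight into RDS.
# --single-transaction avoids locking InnoDB tables for the duration;
# --master-data=2 records the source binlog position as a comment, in
# case replication is to be pointed at the new instance afterwards.
mysqldump -h scl3-db-host -u root -p \
  --single-transaction --master-data=2 --routines --triggers \
  --databases treeherder | \
  mysql -h treeherder-stage.example.us-east-1.rds.amazonaws.com -u th_admin -p
```

This pays the "price of SQL at import time" mentioned above, which is why a ~100GB database takes many hours.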
Flags: needinfo?(gozer)
one thing - the existing RDS instance was a one-off set up to get things rolling. I'd super like to have that not be the case for the *real* RDS instance (ie use terraform or cloudformation, or something)
(In reply to Kendall Libby [:fubar] from comment #17) > one thing - the existing RDS instance was a one-off set up to get things > rolling. I'd super like to have that not be the case for the *real* RDS > instance (ie use terraform or cloudformation, or something) That's 100% incorrect. The RDS instances for bugzilla in stage and prod are and have always been driven by cloudformation.
I believe Kendall is referring to the RDS instance that is being used by the Treeherder prototype.
Treeherder Heroku prototype that is :-) Thank you for those replies - sounds like DMS is worth a shot, but that I need to ask NetOps sooner rather than later to get the ball rolling for it. And I agree CloudFormation or similar would be a good idea.
Flags: needinfo?(scabral)
(In reply to Kendall Libby [:fubar] from bug 1239660 comment #21) > I haven't gotten ANY response from AWS regarding the database migration > service, so I'm just going to reach out to our DBAs and see if we can just > do replication to the AWS RDS instance. (after I convert the new instance to > SSD, since apparently terraform defaults to 'magnetic' storage type -.-) I presume this is from filling in the DMS preview form? Have they replied since? :-) I wonder if there is any way we can chase them up about it - DMS might be easier in the long run, since it doesn't depend on the DBAs helping us out in bug 1246965? Thank you for your work on this so far :-)
(In reply to Ed Morley [:emorley] from comment #21) > I presume this is from filling in the DMS preview form? Have they replied > since? :-) > > I wonder if there is any way we can chase them up about it - DMS might be > easier in the long run, since it doesn't depend on the DBAs helping us out > in bug 1246965? > > Thank you for your work on this so far :-)
Flags: needinfo?(klibby)
(In reply to Ed Morley [:emorley] from comment #21) > > I presume this is from filling in the DMS preview form? Have they replied > since? :-) Yes, and they haven't. I am planning on trying to contact them via our IT liaison to see if that makes things happen. > I wonder if there is any way we can chase them up about it - DMS might be > easier in the long run, since it doesn't depend on the DBAs helping us out > in bug 1246965? I was also planning on poking the DBAs later this week, as they're still wrapping up glibc patching now, but I see you've already covered that! ;-)
Flags: needinfo?(klibby)
Priority: P3 → P1
(In reply to Kendall Libby [:fubar] from comment #23) > Yes, and they haven't. I am planning on trying to contact them via our IT > liaison to see if that makes things happen. That would be great, thank you :-)
Blocks: 1176487
> Yes, and they haven't. I am planning on trying to contact them via our IT liaison to see if that makes things happen. Who is our IT liaison, out of curiosity? I'm happy to chase them up if that's easier. It's just that until we get the DB sorted, it blocks the rest of the Heroku work, and it's my deliverable for this quarter. Many thanks :-)
Flags: needinfo?(klibby)
Or alternatively, if I can be given whatever permissions I need to file AWS support requests via our account (something that would likely be useful for the future regardless), I could try pinging their support about the DMS request too :-)
:r2 is the man, but it turns out DMS magically appeared in the console when I was looking on Tuesday; I'm not sure if they enabled it and didn't tell us, or if it's just more widely available. :-\ In any case, I started poking about at it, but was interrupted by other things. One thing I noticed and wanted to follow up: "AWS DMS doesn't propagate items such as indexes, users, privileges, stored procedures, and other database changes not directly related to table data." Will that be an issue?
Flags: needinfo?(klibby) → needinfo?(emorley)
additionally...

Limitations on Using a MySQL Database as a Source for AWS DMS

AWS DMS does not support the following:
- DDL statements Truncate Partition, Drop Table, Rename Table
- Using an ALTER TABLE <table> ADD COLUMN <column> statement to add columns to the beginning or the middle of a table
- Capturing changes from tables whose names contain both uppercase and lowercase characters
- The AR_H_USER header column

curious about the second-to-last caveat.
Looking at those caveats, I can't see anything that should be a problem. Shall we give this a go? :-) (Particularly given that bug 1246965 comment 11 doesn't fill me with confidence about the alternative)
Flags: needinfo?(emorley)
1) Would you be open to adding the permissions mentioned on page 23 of the DMS guide here, so I could try out the migration? http://dms-preview.s3.amazonaws.com/awsdms-ug.pdf
2) Do you have the master password for the new RDS instance?
3) Should we rename the new RDS instance from "treeherder" to "treeherder-stage"?

Many thanks :-)
Flags: needinfo?(klibby)
1) could do, but using DMS means using an intermediary instance for the migration, which means re-doing the flows - which is mainly why I haven't just done so..
2) I kept it the same, assuming we'd be cutting over to it
3) oh, is it not for prod? uh, then yes, we should rename it
Flags: needinfo?(klibby)
Could the intermediary just be added to the same VPC?

Yeah, this is for stage - intentionally cloning prod to (a) more realistically gauge migration times (the stage DB is much smaller), and (b) make the future new stage more representative of prod (which is helpful for development going forwards).
No longer blocks: 1176487
ok, I'll re-spin the RDS node as treeherder-stage and ask dcurado if we can change the ACL.
Ah I thought the way the VPC worked was that the ACLs were just between the DB nodes and our VPC, and anything in that would work? We'll need this set up for prod soon, so if each needs to be added specifically, perhaps we should ask dcurado to do the latter now too?
The VPN connects the AWS VPC and SCL3, but then there are network ACLs to contend with, because the SCL3 network isn't flat. But yes, we'll need something for prod; since this is temporary and DMS is still a likely option, I'll see if we can just allow everything in the AWS subnet to get to the treeherder dbs.
Ah I follow now, sorry for the confusion - I don't really know much about the netops side :-) Bug 1239660 comment 26 looks perfect - thank you!
Gave DMS a whirl today, but it seems that the binlogs in SCL3 are in the wrong format:

Error Code [10002] : MySQL binary logging must use ROW format

Matt, is this something we can change? (I'm really not thrilled with the Pythian support in bug 1246965, but I also really don't want to do weird things in SCL3)
Flags: needinfo?(mpressman)
Summary: Decide best way to migrate the Treeherder prod DB to the RDS instance used by Heroku → Migrate the Treeherder prod DB to new stage/prod RDS instances for Heroku
Hey Kendall, I'm really sorry for how poor Pythian's support has been. As for the binlog format, we are using MIXED, a combination of row and statement formats. It looks like you should be able to set the format to MIXED on RDS as well. Setting the parameter binlog_format=MIXED should allow replication between our existing treeherder cluster and RDS.
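On the RDS side, binlog_format lives in the DB parameter group rather than in my.cnf, so (assuming the aws CLI and the th-replication group mentioned later in this bug) changing it would look roughly like:

```shell
# Set binlog_format=MIXED in the instance's parameter group.
# ApplyMethod=immediate works because binlog_format is a dynamic
# parameter; static parameters would need a pending-reboot apply.
aws rds modify-db-parameter-group \
  --db-parameter-group-name th-replication \
  --parameters "ParameterName=binlog_format,ParameterValue=MIXED,ApplyMethod=immediate"
```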
Flags: needinfo?(mpressman)
@fubar - I will reach out to you for feedback on Pythian service. I'm currently going through a vendor review for them and would appreciate your input.
(In reply to Matt Pressman [:mpressman] from comment #38)
> It looks like you should be able to set the format to mixed on RDS as
> well. Setting the parameter binlog_format=MIXED should allow replication
> between our existing treeherder cluster and RDS

turns out RDS is "mixed" by default, so I assume that the Database Migration Service can only handle "row", rather than there being a mismatch between the sites. Though looking at the docs again, I guess it's only needed if we're trying to do "change data capture", which is AWS-ese for ongoing replication.

will go review the latest Pythian attempt and see where that gets us
I don't suppose you could PM me the credentials for the new stage & prod RDS instances? Has a non-root account been set up too, or will I need to do that? Many thanks :-)
Flags: needinfo?(klibby)
Ah there's just a stage instance so far; I'll morph this bug to just be about that, and I'll file a new one when we're ready for the new prod RDS instance.
Summary: Migrate the Treeherder prod DB to new stage/prod RDS instances for Heroku → Migrate the Treeherder prod DB to a new stage RDS instances for Heroku
Summary: Migrate the Treeherder prod DB to a new stage RDS instances for Heroku → Migrate the Treeherder prod DB to a new stage RDS instance for Heroku
pm'ed /me also makes a note to not confuse the treeherder RDS instances :-(
Flags: needinfo?(klibby)
I tried connecting to the treeherder-stage RDS instance, however I was unable to. Comparing it to the treeherder-heroku instance, I noticed that it's not marked as publicly accessible; however, I don't have permissions to change that:

User emorley is not authorized to modify database instance arn:aws:rds:us-east-1:699292812394:db:treeherder-stage (Service: AmazonRDS; Status Code: 403; Error Code: AccessDenied; Request ID: cfdb7948-12b0-11e6-b717-37922724c10a). Please check with your administrator.

I believe I was able to modify the treeherder-heroku instance without getting permission denied - would you mind adding the relevant permissions? Many thanks :-)
Flags: needinfo?(klibby)
Also, looking at the parameter group (th-replication) assigned to the treeherder-stage RDS instance, the value for `read_only` is set to `{TrueIfReplica}` (the RDS default). However, I think `{TrueIfReplica}` might only evaluate to True if the instance is a replica in the Amazon sense (ie a read replica set up alongside other RDS instances). As such, I wonder if we need to modify th-replication to set read_only to `1` explicitly? (To prevent inadvertent breakage of replication if something (eg deploy scripts running migrations) prematurely modifies the DB prior to us switching over.) Anyway, once the RDS instance is publicly accessible I'll be able to confirm on the instance itself, I guess.
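If we do want to pin it, read_only is a dynamic parameter in the same group, so (sketch, assuming the aws CLI) the change would be something like:

```shell
# Explicitly force read_only on, rather than relying on the
# {TrueIfReplica} default (which only applies to RDS read replicas).
aws rds modify-db-parameter-group \
  --db-parameter-group-name th-replication \
  --parameters "ParameterName=read_only,ParameterValue=1,ApplyMethod=immediate"
```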
(In reply to Ed Morley [:emorley] from comment #44)
> Comparing to the treeherder-heroku instance I noticed that it's not marked
> as publicly accessible, however I don't have permissions to change that:

yes, by design. the stage and prod instances will be managed more strictly and by terraform, so that we can track changes and recreate them if need be. I'll see if I can split out the treeherder bits so you can at least have read access to it, though. also, it's now publicly available!

(In reply to Ed Morley [:emorley] from comment #45)
> As such, I wonder if we need to modify th-replication to set read_only to
> `1` explicitly?

I wasn't able to find much documentation around it, so let's go ahead and try the empirical approach! :-) Poke at it and let me know if you want to change it.
Flags: needinfo?(klibby)
(In reply to Kendall Libby [:fubar] from comment #46) > yes, by design. the stage and prod instances will be managed more strictly > and by terraform, so that we can track changes and recreate them if need be. > I'll see if I can split out the treeherder bits so you can at least have > read access to it, though. Ah makes sense. I'd really really like the configs to be public at some point, otherwise it seems we're duplicating many of the frustrations of the current SCL3 infra :-) > also, it's now publicly available! I can now connect to port 3306, however the credentials (username th_admin, password the one specified the other day) don't work. Any ideas?
requires SSL; shame the error doesn't SAY that, though. try:

wget http://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem
mysql -h host --ssl_ca=rds-combined-ca-bundle.pem --ssl-verify-server-cert -u username -p
I'm using SSL, I duplicated the config from the existing RDS instance.
Flags: needinfo?(klibby)
(In reply to Ed Morley [:emorley] from comment #49) > I'm using SSL, I duplicated the config from the existing RDS instance. aha, different password, then. same one as staging in scl3.
Flags: needinfo?(klibby)
Ah I misunderstood, I thought you meant the same credentials as the existing prototype Heroku instance. Working now, thank you.
(In reply to Kendall Libby [:fubar] from comment #46)
> > As such, I wonder if we need to modify th-replication to set read_only to
> > `1` explicitly?
>
> I wasn't able to find much documentation around it, so let's go ahead and
> try the empirical approach! :-) Poke at it and let me know if you want to
> change it.

I've just tested it, and I was able to make changes :-s

Please can you modify the th-replication parameter group such that `read_only` is explicitly set to "1". Whilst we're virtually ready to turn off replication for this instance (since it's going to be morphed into the new stage, at which point we'll change away from the th-replication parameter group), we'll still want this option fixed for when we repeat this process for the prod transition.
Done.
Confirmed enabled, thank you :-)

"""
Executing: CREATE SCHEMA `test1234`;
Operation failed: There was an error while applying the SQL script to the database.
ERROR 1290: The MySQL server is running with the --read-only option so it cannot execute this statement
"""
So I now have the new Heroku stage instance successfully using the new RDS instance (which is in read-only mode), however the replication appears to have stopped working on 11th May - ie the most recent push shown here is at 14:27 UTC+1 on 11th May:
https://treeherder-stage.herokuapp.com/#/jobs?repo=mozilla-inbound
compare to the prod instance, which has pushes up to a few minutes ago:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound
Any ideas? (Not that we need the replication much longer for the new stage instance, however for the prod move we need it to work reliably for the few days prior to the weekend tree closure, when the actual DNS changes will take place.)
Flags: needinfo?(klibby)
Looking at replication status:

Last_Error: Error 'Cannot add or update a child row: a foreign key constraint fails (`mozilla_inbound_jobs_1`.`job_log_url`, CONSTRAINT `fk_job_log_url` FOREIGN KEY (`job_id`) REFERENCES `job` (`id`))' on query. Default database: 'mozilla_inbound_jobs_1'. Query: 'INSERT INTO `job_log_url` ( `job_id`, `name`, `url`, `parse_status` ) VALUES (27658695,'builds-4h','https://queue.taskcluster.net/v1/task/c61uLExRTFanpi_-BR0oTw/runs/0/artifacts/public/logs/live_backing.log','pending')'
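For reference when debugging this sort of breakage: SHOW SLAVE STATUS gives the error and binlog position, and RDS ships a stored procedure for skipping a single broken event. A sketch (endpoint is a placeholder; skipping events papers over row divergence, so only do it once the inconsistency is understood):

```shell
mysql -h "$RDS_HOST" -u th_admin -p <<'SQL'
-- Show where replication stopped and why (Last_Error, positions).
SHOW SLAVE STATUS\G
-- RDS-specific: skip exactly one errored replication event and resume.
-- Use with care; the skipped row stays missing on the replica.
CALL mysql.rds_skip_repl_error;
SQL
```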
Flags: needinfo?(klibby)
Thank you - will continue discussion in bug 1246965.
Depends on: 1276301
I'm asking for replication to be officially turned off again in bug 1276301. Please can you move this stage RDS instance out of the replication parameter group (so it loses the read_only setting and whatever replication-specific config it doesn't need) now, so we can start using the new instance standalone. (The replication parameter group will be used again shortly for the new prod RDS instance, so it shouldn't be deleted.) Thanks! :-)
Flags: needinfo?(klibby)
Assignee: nobody → klibby
(In reply to Ed Morley [:emorley] from comment #58) > Please can you move this stage RDS instance out of the replication parameter > group (so it loses the read_only and whatever replication-specific stuff it > doesn't need) now, so we can start using the new instance standalone. (The > 'replication' parameter group will be used again shortly for the new prod > RDS instance, so doesn't want to be deleted). Looks like this has now been performed as part of bug 1276301 comment 1; cancelling needinfo.
Flags: needinfo?(klibby)
oh sure, mid-air my comment when I got to update it... ;-P param group is now default.mysql5.6.
Think we're done here :-)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1277269
Blocks: 1277304