Closed
Bug 1283170
Opened 8 years ago
Closed 8 years ago
Setup Treeherder replication to AWS (prod SCL3 -> prod RDS instance)
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P2)
Tree Management
Treeherder: Infrastructure
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: fubar)
References
Details
This is the RDS production instance equivalent of bug 1176486. See also bug 1246965 and the Pythian plan attached to that bug (attachment 8729694 [details]).

We'd like to get this replication set up roughly prior to the Treeherder stage migration, so that once Treeherder stage has successfully switched over, we're ready to move Treeherder prod to Heroku without needing to wait too long for the DB migration parts. (We can leave it replicating in the meantime.)

I've written a rough draft of the plan based on the steps that were documented in the stage bugs / Pythian document:
https://public.etherpad-mozilla.org/p/treeherder-prod-rds-migration-draft

Kendall, would you mind making any changes if needed, so I can confirm the plan of action and post it in this bug? We can file a dependent bug from this one for any of the ACL or DBA-requiring parts in their respective components, as needed. Many thanks :-)
Flags: needinfo?(klibby)
Reporter | Updated • 8 years ago
Summary: Setup Treeheder replication to AWS (prod SCl3 -> prod RDS instance) → Setup Treeherder replication to AWS (prod SCl3 -> prod RDS instance)
Assignee | Comment 1 • 8 years ago
Only two minor notes added:
- pretty sure the access issue is because we flipped the new prod instance to be public, and the VPN/ACL assume it's got a 10.0.0.0/8 address (it still does, but the hostname doesn't resolve to it and there's no way, afaict, to look it up without flipping it back to private)
- when the db dump is created, we need the binlog name and position recorded for setting up replication
Flags: needinfo?(klibby)
Reporter | Comment 2 • 8 years ago
Thank you for reviewing. So the plan for the SCL3 prod -> RDS *prod* migration is:

Pre-requisites:
* Re-establish connectivity from the SCL3 treeherder prod DB to the treeherder-prod RDS instance, since this command currently fails:
      treeherder2.db.scl3$ nc -vz treeherder-prod.REDACTED.us-east-1.rds.amazonaws.com 3306
  -> (were the VPC settings or ACLs changed since bug 1246965?)
  -> this may be due to having flipped the instances to public; the VPN and ACL presume it's in the 10.0.0.0/8 space, IIRC
* Check we still have the password for the `repl` user on treeherder2.db.scl3.
* Check the `repl` user still has the necessary grants on treeherder2.db.scl3.
* Adjust `expire_logs_days` on treeherder2.db.scl3 to cover the time between the dump being performed and replication being set up.
* Check the cron job that deletes binlogs has been disabled (see bug 1246965 comment 39).

When we're ready to go:
* Get the DBAs to make a dump of the current Treeherder DB and place it in treeherder2.db.scl3:/data
  -> we need the name of the binlog and the position the dump is based on, for set_external_master below
* Copy the dump to an EC2 node.
* Apply the `th-import` parameter group to the prod RDS instance.
* Reboot the prod RDS instance.
* Import the dump into the prod RDS instance using the EC2 node.
* Wait for the import to complete.
* Run this on the RDS instance (substituting in the necessary values):
      CALL mysql.rds_set_external_master('<treeherder2.db.scl3 IP address>', 3306, 'repl', '<password>', '<binlog filename>', <binlog position>, 0);
      CALL mysql.rds_start_replication;
* Check replication is succeeding by running `show slave status\G` on the RDS instance several times, ensuring that `seconds_behind_master` is reducing.
* Stop replication by running `CALL mysql.rds_stop_replication;` on the RDS instance.
* Apply the `th-replication` parameter group to the prod RDS instance.
* Reboot the prod RDS instance.
* Restart replication on the RDS instance using `CALL mysql.rds_start_replication;`.
* Check replication is still working using `show slave status\G`.
* Run `CALL mysql.rds_set_configuration('binlog retention hours', 48);` on the RDS instance and verify with `CALL mysql.rds_show_configuration;`.
* Add the th_admin and th_user grants to the RDS instance.
* Check replication is still working using `show slave status\G`, and monitor the decreasing `seconds_behind_master`.
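The lag-checking steps above boil down to polling `show slave status\G` and watching Seconds_Behind_Master. A minimal hedged sketch of extracting that field — the sample text below is a stand-in, not real output from this instance:

```shell
# Minimal sketch: pull Seconds_Behind_Master out of `SHOW SLAVE STATUS\G`
# output. In practice $status_output would come from something like
# `mysql -h <rds-endpoint> -e 'show slave status\G'`.
status_output='
             Slave_IO_State: Waiting for master to send event
      Seconds_Behind_Master: 284102
'
# awk splits each "Name: value" line on ": " and prints the value.
behind=$(printf '%s\n' "$status_output" | awk -F': ' '/Seconds_Behind_Master/ {print $2}')
echo "slave is ${behind}s behind"
```

Run in a loop (e.g. under `watch`), this gives the "several times, ensuring it is reducing" check without eyeballing the full status output.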
Reporter | Comment 3 • 8 years ago
> Pre-requisites:
> * Re-establish connectivity from SCL3 treeherder prod db to the treeherder-prod RDS instance

Filed bug 1285995.

> * Check we still have the password for the `repl` user on treeherder2.db.scl3.
> * Check the `repl` user still has the necessary grants on treeherder2.db.scl3.

Kendall, can you confirm?

> * Adjust `expire_logs_days` on treeherder2.db.scl3, to cover the time
> between the dump being performed and replication being set up.

On treeherder2.db.scl3:

SHOW VARIABLES LIKE 'expire_logs_days';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| expire_logs_days | 7     |
+------------------+-------+
1 row

> * Check the cron job that deletes bin logs has been disabled (see bug
> 1246965 comment 39).

Kendall, can you confirm?
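A quick sanity check on the `expire_logs_days` value above: the 7-day retention only matters if it covers the gap between taking the dump and starting replication. A hedged sketch of that arithmetic — the gap and margin figures are assumptions for illustration, not values from this bug:

```shell
# Does binlog retention cover the dump-to-replication window?
expire_logs_days=7      # value reported by SHOW VARIABLES above
dump_to_repl_days=3     # assumed worst-case gap between dump and replication start
safety_margin_days=2    # assumed buffer for retries/reschedules

if [ $((dump_to_repl_days + safety_margin_days)) -le "$expire_logs_days" ]; then
    echo "retention OK"
else
    echo "increase expire_logs_days before taking the dump"
fi
```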
Flags: needinfo?(klibby)
Assignee | Comment 4 • 8 years ago
I have to defer to DBAs on both of those points. NI on Matt/Sheeri
Flags: needinfo?(scabral)
Flags: needinfo?(mpressman)
Flags: needinfo?(klibby)
Comment 5 • 8 years ago
I can confirm the purge_logs script (/etc/cron.d/purge_logs), which used to remove binlogs older than 3 hours, is no longer on any of the treeherder machines (stage or prod). I don't know what password you have, but I can confirm that the password begins with "z" and ends with "9", and I don't recall it having been changed in the last few months. The replication user still has the proper grants.
Flags: needinfo?(scabral)
Flags: needinfo?(mpressman)
Reporter | Comment 6 • 8 years ago
Thanks sheeri :-)
Reporter | Comment 7 • 8 years ago
It seems like the pre-requisites are now complete. Could I leave you to coordinate with the DBAs on actioning the plan in comment 2? :-)
Flags: needinfo?(klibby)
Reporter | Comment 8 • 8 years ago
We're going to make some invasive DB changes shortly, so would prefer the DB dump not happen during that window. I'll reopen the needinfo when we're good to proceed.
Flags: needinfo?(klibby)
Reporter | Comment 9 • 8 years ago
We're just waiting for bug 1258861 to be finished, but that should be by ~Wed next week. Kendall, would it be possible for you to have another glance over the plan here + coordinate with the DBA between now and Wed, with the hope of then getting the DB dump/replication at least in progress by the end of next week (15-16th)? Thanks :-)
Flags: needinfo?(klibby)
Assignee | Comment 10 • 8 years ago
f+ from me. I'll try to get things rolling this week, and possibly follow up off-hours while I'm in MTV for training next week. I'm on PTO the last week of September, so will do my best to have it sorted before then.
Flags: needinfo?(klibby)
Reporter | Comment 11 • 8 years ago
Please could we get this started today? :-)
Flags: needinfo?(klibby)
Assignee | Comment 12 • 8 years ago
Ball is rolling. Will work on this while I'm in MTV before/after training.
Flags: needinfo?(klibby)
Reporter | Comment 13 • 8 years ago
Amazing, thank you :-)
Reporter | Comment 14 • 8 years ago
Right, we're ready to proceed with the rest of the comment 2 steps.

The file is at: treeherder1.db.scl:/data/treeherderdb_20160920.gz

The master will be treeherder2.db.scl.mozilla.com, with log filename 'treeherder2-bin.020350' and position 190207574. Thanks!
Flags: needinfo?(klibby)
Assignee | Comment 15 • 8 years ago
Data has been copied to a temp instance and import has started.
Reporter | Comment 16 • 8 years ago
Thank you! Looking at the CloudWatch graphs, I think the initial import has now completed.
Reporter | Comment 17 • 8 years ago
The second/corrected dump by Pythian has just completed - the file is at treeherder1.db.scl:/data/treeherder_dump_09212016.dmp.gz. Could you start the copy to EC2 / the initial import? Many thanks!
Reporter | Comment 19 • 8 years ago
I'm seeing the replication error icon on the RDS dashboard - was the existing replication turned off using `CALL mysql.rds_stop_replication;`?
Reporter | Comment 20 • 8 years ago
I'm seeing replication error log entries from Sep 22 8:35 PM - please can you disable replication? I'm concerned it may corrupt the new import.
Assignee | Comment 21 • 8 years ago
son of a... yeah, it wasn't disabled. I've stopped it and restarted the import to ensure no issues.
Assignee | Comment 22 • 8 years ago
... and the errors keep appearing even after stopping replication. *sigh*
Reporter | Comment 23 • 8 years ago
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.NonRDSRepl.html#MySQL.Procedural.Importing.Redirect.App also mentions a:

CALL mysql.rds_reset_external_master;
Assignee | Comment 24 • 8 years ago
thanks; rebooting looks like it helped, but I've run that as well. *grumble* import is running again.
Reporter | Comment 25 • 8 years ago
Unfortunately the daily backup has been running for the last few hours, which is slowing down the import. I guess it's too late this time, but perhaps if we need to restart the import for some reason, we should temporarily disable the backups.
Assignee | Comment 26 • 8 years ago
Import finished and replication is underway, if a bit behind. For posterity, steps taken:

## scp data to instance
## verify RDS instance has th_import param group set
## verify mysql connection from treeherder1.db to RDS instance

## set binlog retention to 1 hour
mysql> call mysql.rds_set_configuration('binlog retention hours', 1);
mysql> call mysql.rds_show_configuration;

## import
$ gunzip -c /mnt/tmp/treeherder_dump_09212016.dmp.gz | mysql -h treeherder-prod.REDACTED.us-east-1.rds.amazonaws.com -u th_admin --ssl_ca=rds-combined-ca-bundle.pem --ssl-verify-server-cert -p > import.log 2>&1
$ bg
$ disown -h

## verify running
$ mysql -h treeherder-prod.REDACTED.us-east-1.rds.amazonaws.com -u th_admin --ssl_ca=rds-combined-ca-bundle.pem --ssl-verify-server-cert -p
mysql> show processlist;

mysql> call mysql.rds_set_external_master('10.22.70.136', 3306, 'repl', 'REDACTED', 'treeherder1-bin.022652', 165073000, 0);
mysql> call mysql.rds_start_replication;

## Check status a few times.
mysql> show slave status\G

## If all OK and seconds_behind_master is reducing, we can restart with normal parameters.
## Stop before reboot.
mysql> call mysql.rds_stop_replication;

## change param group to th_replication in terraform config and apply
git clone git@github.com:mozilla-platform-ops/devservices-aws.git
cd devservices-aws/treeherder
## aws creds need to be set up/available (eg mfa-start-session devservices)
./init.sh
vi rds.tf  ## under treeherder-prod-rds, change parameter_group_name from th-import to th-replication
terraform plan
terraform apply
git add rds.tf
git ci -m "Bug 1283170 - change treeherder-prod to th-replication param group" rds.tf
git push origin master

## reboot instance via console :-(

## Restart replication, import users
mysql> call mysql.rds_start_replication;

## Check status a few times.
mysql> show slave status\G

## Set binlog retention to at least 24 hours, recommended 48
mysql> call mysql.rds_set_configuration('binlog retention hours', 48);
## Verify by:
mysql> call mysql.rds_show_configuration;

Seconds_Behind_Master is currently *increasing*, which means that it's not yet replaying changes and catching up. Since Relay_Log_Space is also increasing, I believe it's still copying over the events to replay. But I'm keeping an eye on it. Email with instructions for handling other events forthcoming!
Reporter | Comment 27 • 8 years ago
I've connected to the RDS instance (using treeherder1.db.scl3 as a jumphost, since it's not publicly accessible), and `show slave status` reports Seconds_Behind_Master as 284102 (and rising). Pythian, any idea what's up here? Is treeherder1 set up properly for replication? (It's the slave to treeherder2, aiui.) The full output of `show slave status` is here:
https://emorley.pastebin.mozilla.org/8913193
Flags: needinfo?(team73)
Reporter | Comment 28 • 8 years ago
This seems to imply no replication events are being sent from treeherder1.db.scl3:

Slave_IO_State: Waiting for master to send event

This is really quite frustrating. Instead of being caught up over the weekend, we're now 3 days behind SCL3.
Reporter | Comment 29 • 8 years ago
So `show slave status` reports the position as having advanced past that originally set in comment 26:

Master_Log_File: treeherder1-bin.022672
Read_Master_Log_Pos: 349072372

So I'm guessing replication was initially fine, but didn't resume after switching the parameter group from th-import to th-replication and then restarting? However this was derived from the plan originally devised by Pythian in bug 1246965, so should be correct. Unfortunately fubar is now on PTO, so we have to rely on what comment 26 says.

I've tried another `call mysql.rds_start_replication;`, but that just reports replication as already started. Can anyone else help out here? It's just that it's 17 days since comment 9, and 10 days since we wanted the replication to be running and caught up, and it feels like we're still no closer to getting this complete. Any further delay will mean we'll have to do another DB dump, since we'll be more behind than the number of days of binlogs retained on treeherder1.db.scl3.
Flags: needinfo?(scabral)
Flags: needinfo?(mpressman)
Reporter | Comment 30 • 8 years ago
I ran a `call mysql.rds_stop_replication` followed by (another) `call mysql.rds_start_replication`, which appeared to result in replication proceeding, however extremely slowly (seconds behind would regularly increase).

We were using a db.m4.xlarge, which only has 16 GB RAM and 4 vCPUs, so I've changed it to a db.m4.2xlarge, which has double both. Normally we'd want to make that change using Terraform; however I don't have access and we can't wait any longer, so I've gone ahead and adjusted it using the AWS console (it turns out the permissions weren't as locked down as they were meant to be). This changed the instance IP, so I had to adjust the repl user host on SCL3 to use a wildcard on the last octet.

After resuming replication, Seconds_Behind_Master is now decreasing slightly faster, though it seems like the bottleneck might be on the SCL3 side or in the current slave config. Pythian/Sheeri/Matt, can you suggest some tweaks that might speed things up? Our deviations from the RDS MySQL 5.6 defaults are here (and come from the Pythian plan from bug 1246965):
https://github.com/mozilla-platform-ops/devservices-aws/blob/d07d83bbc207c60824bd74e7e30d592e752a9454/treeherder/rds.tf#L69-L174
Reporter | Comment 31 • 8 years ago
If I'm interpreting this correctly, the slowness catching up is not on the IO thread, but the SQL thread:
https://emorley.pastebin.mozilla.org/8913272

CPU utilisation on the RDS instance is low, and IOPS is well under (around half of) our allowance on AWS; plus queue depth and read/write latency are all low. As such, I'm somewhat puzzled as to why the slave isn't applying the relay log more quickly. (We're still over 3 days behind.)
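The IO-vs-SQL-thread diagnosis can be sketched from two field pairs of `show slave status\G`: if Relay_Master_Log_File / Exec_Master_Log_Pos trail Master_Log_File / Read_Master_Log_Pos, the relay log already holds the events and the SQL thread applying them is the bottleneck. The values below are hypothetical stand-ins, not taken from the pastebin:

```shell
# Hypothetical slave-status values; positions are only comparable when the
# two log file names match.
master_log_file="treeherder1-bin.022672"        # Master_Log_File
relay_master_log_file="treeherder1-bin.022672"  # Relay_Master_Log_File
read_pos=500000    # Read_Master_Log_Pos: how far the IO thread has fetched
exec_pos=120000    # Exec_Master_Log_Pos: how far the SQL thread has applied

if [ "$master_log_file" = "$relay_master_log_file" ] && [ "$read_pos" -gt "$exec_pos" ]; then
    gap=$((read_pos - exec_pos))
    echo "SQL thread is the bottleneck: ${gap} bytes of relay log waiting to be applied"
else
    echo "IO thread (or the network) is the bottleneck"
fi
```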
Comment 32 • 8 years ago
Treeherder is a large database, and we checksum the data twice daily to make sure the slaves match the masters. So twice a day every row gets read and checksummed... which makes replication catch-up difficult. That being said, turning off binary logging for the time being can make replication faster - up to 15% faster. If the master knows the slave is connected, it will not purge the logs, so that should not be an issue.
Flags: needinfo?(scabral)
Reporter | Comment 33 • 8 years ago
I'm not sure how the checksums are relevant here? There is no IO lag on the RDS instance, so the SCL3 instance is keeping up with sending the binlogs to it, and there will be no checksums running against the RDS instance itself.
Comment 34 • 8 years ago
The checksums replicate, which means they run on ALL the slaves of the master, including backup instances and RDS.
Reporter | Comment 35 • 8 years ago
Oh, they are calculated as an actual query? OK, that's pretty write-intensive. (I was imagining a bunch of SELECTs followed by whatever tool calculating the checksums, which then separately stored the checksum results in small writes.)

I don't see us needing to keep using checksums on RDS - and even if we do, we could re-enable them later. We can't filter these out on RDS, since they don't give us the necessary permissions, and the RDS-provided procedures (such as mysql.rds_start_replication) don't allow for customisation.

Could we just disable checksums everywhere, until replication has caught up?
Reporter | Comment 36 • 8 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #32)
> That being said, turning off binary logging for the time being can make
> replication faster - up to 15% faster.

Do you mean turn off binary logging downstream of the RDS instance? It's a multi-AZ RDS instance, so it has a slave of its own, so I presume that won't be possible?
Reporter | Comment 37 • 8 years ago
(In reply to Ed Morley [:emorley] from comment #35)
> Could we just disable checksums everywhere, until replication has caught up?

Since I didn't hear back yesterday and we need to proceed with this as soon as possible, I've gone ahead and disabled checksums temporarily. I couldn't do this via puppet/whatever, since I don't have access, so instead I've altered the `host` for the `checksum` user, so it cannot log in.

Last night I also scaled the RDS instance's storage from 500GB to 750GB, since that gives us more IOPS (we're using general storage, and whilst we didn't seem to be hitting the credit limit previously, I'm running out of ideas), plus disabled daily backups on RDS, so they don't slow down the catch-up.

I don't have the AWS permissions to change the parameter groups on RDS (and setting variables dynamically isn't supported on RDS), so I couldn't experiment with different cache sizes etc.

Seconds_Behind_Master: 285651 = 3.3 days (still)
Reporter | Comment 38 • 8 years ago
Jake, could you let me know when you're online and we can try some experiments with parameter groups? (I don't have the required AWS permissions).
Flags: needinfo?(jwatkins)
Reporter | Comment 39 • 8 years ago
(In reply to Ed Morley [:emorley] from comment #37)
> I couldn't do this via puppet/whatever, since I don't have access, so
> instead I've altered the `host` for the `checksum` user, so it cannot login.

Puppet added another checksum user with the prior host, so I've instead set password_expired=Y, since that doesn't appear to get reverted.

I'm now running a replication stats script to give a better ETA:
https://emorley.pastebin.mozilla.org/8913491

s_behind d_behind c_sec_s eta_d | O_c_sec_s O_eta_d O_eta_h
250764 2.902361 1.000 2.902 | 1.000 2.902 69.66
250488 2.899167 5.600 0.518 | 3.300 0.879 21.08
250185 2.895660 6.050 0.479 | 4.217 0.687 16.48
250156 2.895324 1.483 1.952 | 3.533 0.819 19.67
250046 2.894051 2.833 1.021 | 3.393 0.853 20.47
249831 2.891563 4.583 0.631 | 3.592 0.805 19.32
249739 2.890498 2.533 1.141 | 3.440 0.840 20.16
249711 2.890174 1.467 1.971 | 3.194 0.905 21.72
249622 2.889144 2.483 1.163 | 3.115 0.928 22.26
249512 2.887870 2.833 1.019 | 3.087 0.936 22.45
s_behind d_behind c_sec_s eta_d | O_c_sec_s O_eta_d O_eta_h
249400 2.886574 2.867 1.007 | 3.067 0.941 22.59
249291 2.885312 2.817 1.024 | 3.046 0.947 22.74
249211 2.884387 2.333 1.236 | 2.991 0.964 23.14
249175 2.883970 1.600 1.802 | 2.892 0.997 23.94
249036 2.882361 3.317 0.869 | 2.920 0.987 23.69
248971 2.881609 2.083 1.383 | 2.868 1.005 24.12
248924 2.881065 1.783 1.616 | 2.804 1.028 24.66
248732 2.878843 4.200 0.685 | 2.881 0.999 23.98
248736 2.878889 0.933 3.085 | 2.779 1.036 24.86
248518 2.876366 4.633 0.621 | 2.872 1.002 24.04

For the key, see:
https://www.percona.com/blog/2012/08/29/heres-a-quick-way-to-foresee-if-replication-slave-is-ever-going-to-catch-up-and-when/

For future reference, switching from db.m4.xlarge to db.m4.2xlarge increased observed write IOPS during replication catchup from an average of 560 to 690.
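For reference, a hedged sketch of the arithmetic behind the c_sec_s and eta_d columns, using the first two table rows and an inferred 60-second sampling interval (comment 46 later notes the net catch-up rate is really c_sec_s minus 1, so ETAs computed this way are optimistic):

```shell
# Catch-up speed: master-seconds of events applied per wall-clock second.
prev_behind=250764   # Seconds_Behind_Master, first sample (row 1 above)
cur_behind=250488    # Seconds_Behind_Master, one interval later (row 2)
interval=60          # inferred sampling interval in seconds (an assumption)

# In `interval` wall-clock seconds the lag shrank by (prev - cur), so the
# slave applied (prev - cur + interval) master-seconds of events.
c_sec_s=$(awk -v p="$prev_behind" -v c="$cur_behind" -v i="$interval" \
              'BEGIN { printf "%.3f", (p - c + i) / i }')

# ETA in days, as the script computes it (remaining lag / speed); see the
# note above about this being optimistic.
eta_d=$(awk -v c="$cur_behind" -v s="$c_sec_s" \
            'BEGIN { printf "%.3f", c / (s * 86400) }')

echo "c_sec_s=$c_sec_s eta_d=$eta_d"
```

This reproduces the 5.600 / 0.518 pair in the second row of the table above.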
Comment 40 • 8 years ago
Heh, you've thwarted the failsafes we put in place. Checksums are pretty important to make sure that the data gets to AWS OK, but if you're OK with turning them off, it's your app and you know the implications of lack of data quality. I'm also OK with it because the last time there were issues with data quality was March, and because you're going to be failing over pretty soon.
Flags: needinfo?(team73)
Flags: needinfo?(mpressman)
Comment 41 • 8 years ago
I've modified the terraform manifests to match the changes made under the console. This brings it back into a synchronized state. https://github.com/mozilla-platform-ops/devservices-aws/commit/b6693d4bed2230e77e8f78d1d4e5fc447a445529
Flags: needinfo?(jwatkins)
Reporter | Comment 42 • 8 years ago
Update 2 hours after comment 39:

s_behind d_behind c_sec_s eta_d | O_c_sec_s O_eta_d O_eta_h
206982 2.395625 33.050 0.072 | 6.570 0.365 8.75
205683 2.380590 22.650 0.105 | 6.692 0.356 8.54
205542 2.378958 3.350 0.710 | 6.667 0.357 8.56
205445 2.377836 2.617 0.909 | 6.637 0.358 8.60
205402 2.377338 1.717 1.385 | 6.600 0.360 8.64
205236 2.375417 3.767 0.631 | 6.579 0.361 8.66
205136 2.374259 2.667 0.890 | 6.551 0.362 8.70
204998 2.372662 3.300 0.719 | 6.527 0.363 8.72
204968 2.372315 1.500 1.582 | 6.491 0.365 8.77
204824 2.370648 3.400 0.697 | 6.469 0.366 8.80

We seem to be winning the battle - although likely helped by the binlog from 2.37 days ago being from the weekend, when Treeherder ingestion is lower. Hopefully yesterday's tree closure will mean we have a 3-day low-ingestion stretch to let us catch up; plus now that checksums are disabled, the binlogs from today will at least be smaller.

(In reply to Sheeri Cabral [:sheeri] from comment #40)
> Heh, you've thwarted the failsafes we put in place. Checksums are pretty
> important to make sure that the data gets to AWS OK, but if you're OK with
> turning them off, it's your app and you know the implications of lack of
> data quality.

I agree that leaving them enabled is preferable; however in the current situation (and with, if I'm being brutally honest, not much assistance from Pythian/...) there wasn't really much other choice. We can always re-enable checksums once replication has caught up.

(In reply to Jake Watkins [:dividehex] from comment #41)
> I've modified the terraform manifests to match the changes made under the
> console. This brings it back into a synchronized state.

Many thanks! As just discussed on IRC, let's hold off on any other changes, now that replication is at least not getting any further behind.
Comment 43 • 8 years ago
I have disabled monitoring for 2 weeks on the checksums, so Pythian doesn't get paged needlessly. We can do checksums manually when we need to, as they might cause replication to lag again if we turn them on fully.
Reporter | Comment 44 • 8 years ago
The catch-up has slowed down significantly, but seconds behind is still at least decreasing:

s_behind d_behind c_sec_s eta_d | O_c_sec_s O_eta_d O_eta_h
119421 1.382187 1.583 0.873 | 4.768 0.290 6.96
119435 1.382350 0.767 1.803 | 4.761 0.290 6.97
119414 1.382106 1.350 1.024 | 4.755 0.291 6.98
119408 1.382037 1.100 1.256 | 4.749 0.291 6.98
119420 1.382176 0.800 1.728 | 4.742 0.291 7.00
119365 1.381539 1.917 0.721 | 4.737 0.292 7.00
119384 1.381759 0.683 2.022 | 4.730 0.292 7.01
119335 1.381192 1.817 0.760 | 4.725 0.292 7.02
119339 1.381238 0.933 1.480 | 4.719 0.293 7.02
119290 1.380671 1.817 0.760 | 4.714 0.293 7.03

If we're lucky, we may be caught up sometime tomorrow, or if not, Thursday.
Comment 45 • 8 years ago
There might be some ebb and flow to the replication. I've been watching it for most of the day, and I can see blocks of time where c_sec_s increases for a while and then decreases again. As long as c_sec_s stays above 1.0, things are moving forward. For instance, about 90m ago there was a burst:

s_behind d_behind c_sec_s eta_d | O_c_sec_s O_eta_d O_eta_h
109105 1.262789 5.917 0.213 | 1.878 0.672 16.14
108969 1.261215 3.267 0.386 | 1.883 0.670 16.08
108701 1.258113 5.467 0.230 | 1.894 0.664 15.94
108556 1.256435 3.417 0.368 | 1.899 0.662 15.88
108316 1.253657 5.000 0.251 | 1.909 0.657 15.76
107962 1.249560 6.900 0.181 | 1.925 0.649 15.58
107618 1.245579 6.733 0.185 | 1.940 0.642 15.41
107179 1.240498 8.317 0.149 | 1.960 0.633 15.19
106985 1.238252 4.233 0.293 | 1.967 0.630 15.11
106577 1.233530 7.800 0.158 | 1.985 0.621 14.91
s_behind d_behind c_sec_s eta_d | O_c_sec_s O_eta_d O_eta_h
105918 1.225903 11.983 0.102 | 2.016 0.608 14.59
105082 1.216227 14.933 0.081 | 2.056 0.591 14.19
103878 1.202292 21.067 0.057 | 2.115 0.568 13.64
103730 1.200579 3.467 0.346 | 2.119 0.566 13.60
103716 1.200417 1.233 0.973 | 2.117 0.567 13.61
102987 1.191979 13.150 0.091 | 2.151 0.554 13.30
100302 1.160903 45.750 0.025 | 2.284 0.508 12.20
98801 1.143530 26.017 0.044 | 2.356 0.485 11.65
98095 1.135359 12.767 0.089 | 2.388 0.475 11.41
97465 1.128067 11.500 0.098 | 2.416 0.467 11.21
s_behind d_behind c_sec_s eta_d | O_c_sec_s O_eta_d O_eta_h
97154 1.124468 6.183 0.182 | 2.427 0.463 11.12
97102 1.123866 1.867 0.602 | 2.425 0.463 11.12
97062 1.123403 1.667 0.674 | 2.423 0.464 11.13
96881 1.121308 4.017 0.279 | 2.428 0.462 11.09
96673 1.118900 4.467 0.251 | 2.434 0.460 11.03
96494 1.116829 3.983 0.280 | 2.438 0.458 10.99
96310 1.114699 4.067 0.274 | 2.443 0.456 10.95
96119 1.112488 4.183 0.266 | 2.448 0.454 10.91
95847 1.109340 5.533 0.200 | 2.457 0.451 10.83
95585 1.106308 5.367 0.206 | 2.466 0.449 10.77
Reporter | Comment 46 • 8 years ago
Replication failed earlier with:

Last_SQL_Error: Error 'Access denied for user ''@'' (using password: NO)' on query. Default database: ''. Query: 'GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE, EVENT, TRIGGER, CREATE TABLESPACE ON *.* TO 'checksum'@'REDACTED' IDENTIFIED BY PASSWORD '*REDACTED''

This is just from when puppet was trying to re-establish the checksum user (comment 39), so is safe to skip. After a `CALL mysql.rds_skip_repl_error;`, replication has resumed fine.

I've also found a bug in the script from the Percona blog, which resulted in incorrect ETAs. A fixed version (with some other tweaks) is here:
https://gist.github.com/edmorley/f8589f4d6fa84df4cce1a13cbac83d98

Note: with the new script, negative `c_sec_s` means "falling even further behind", and 0 means "keeping up but not making headway into the deficit" (ie: subtract 1 from the c_sec_s in previous comments to compare).

s_behind h_behind | c_sec_s eta_h | O_c_sec_s O_eta_h
80618 22.39 | 1.283 17.4 | 0.441 50.8
80613 22.39 | 0.083 268.7 | 0.411 54.5
80557 22.38 | 0.933 24.0 | 0.451 49.6
80547 22.37 | 0.167 134.2 | 0.431 51.9
80501 22.36 | 0.767 29.2 | 0.453 49.3
80493 22.36 | 0.133 167.7 | 0.433 51.6
80451 22.35 | 0.700 31.9 | 0.449 49.8
80476 22.35 | -0.417 Inf | 0.401 55.8
80404 22.33 | 1.200 18.6 | 0.443 50.4
80418 22.34 | -0.233 Inf | 0.409 54.6
Reporter | Comment 47 • 8 years ago
ETA has now climbed up to 5 days :-/

s_behind h_behind | c_sec_s eta_h | O_c_sec_s O_eta_h
79207 22.00 | -0.233 Inf | 0.201 109.4
79201 22.00 | 0.100 220.0 | 0.200 109.7
79211 22.00 | -0.167 Inf | 0.198 111.2
79215 22.00 | -0.067 Inf | 0.196 112.2
79216 22.00 | -0.017 Inf | 0.195 113.1
79238 22.01 | -0.367 Inf | 0.191 115.4
79253 22.01 | -0.250 Inf | 0.188 117.3
79235 22.01 | 0.300 73.4 | 0.189 116.8
79241 22.01 | -0.100 Inf | 0.187 118.0
79252 22.01 | -0.183 Inf | 0.184 119.6
Comment 48 • 8 years ago
Merged and applied https://github.com/mozilla-platform-ops/devservices-aws/pull/1

This sets the Enhanced Monitoring metrics back to being polled at 60s:

aws_db_instance.treeherder-prod-rds: Modifying...
  monitoring_interval: "0" => "60"
aws_db_instance.treeherder-prod-rds: Still modifying... (10s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (20s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (30s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (40s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (50s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (1m0s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (1m10s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (1m20s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (1m30s elapsed)
aws_db_instance.treeherder-prod-rds: Modifications complete
Comment 49 • 8 years ago
As discussed in IRC, I've set the sync_binlog parameter to '0' under the th_replication parameter group:
https://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#sysvar_sync_binlog

aws_db_parameter_group.th_replication-pg: Modifying...
  parameter.#: "24" => "25"
  ...
  parameter.881816039.apply_method: "" => "immediate"
  parameter.881816039.name: "" => "sync_binlog"
  parameter.881816039.value: "" => "0"
  ...
aws_db_parameter_group.th_replication-pg: Modifications complete
Comment 50 • 8 years ago
I've set innodb_flush_log_at_trx_commit => 0:
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit

aws_db_parameter_group.th_replication-pg: Modifying...
  ...
  parameter.4012389720.apply_method: "" => "immediate"
  parameter.4012389720.name: "" => "innodb_flush_log_at_trx_commit"
  parameter.4012389720.value: "" => "0"
  ...
aws_db_parameter_group.th_replication-pg: Modifications complete
Comment 51 • 8 years ago
Changing innodb_flush_log_at_trx_commit => 0 definitely opened the flood gates. At this rate, replication should catch up sometime today.

commit da8e29f455882425b0b17f33ec37691c690da8c8
Author: Jake Watkins <jwatkins@mozilla.com>
Date:   Wed Sep 28 10:21:37 2016 -0700

    Bug 1283170 - Set innodb_flush_log_at_trx_commit => 0

commit 461e9902b4d0cc17824f1efe7a032f2e69bdbdf8
Author: Jake Watkins <jwatkins@mozilla.com>
Date:   Wed Sep 28 08:48:49 2016 -0700

    Bug 1283170 - Set sync_binlog => 0
Comment 52 • 8 years ago
Replication is completely caught up!
Reporter | Comment 53 • 8 years ago
Awesome :-)

Could we revert innodb_flush_log_at_trx_commit back to '1' now, and see if it holds steady (just to give more failure protection)? (I'm pretty sure it will)
Comment 54 • 8 years ago
(In reply to Ed Morley [:emorley] from comment #53)
> Awesome :-)
>
> Could we revert innodb_flush_log_at_trx_commit back to '1' now, and see if
> it holds steady (just to give more failure protection)? (I'm pretty sure it
> will)

Sure :-) I've reverted the commit. innodb_flush_log_at_trx_commit should be back to the default value '1':

aws_db_parameter_group.th_replication-pg: Modifying...
  parameter.#: "26" => "25"
  ...
  parameter.4012389720.apply_method: "immediate" => ""
  parameter.4012389720.name: "innodb_flush_log_at_trx_commit" => ""
  parameter.4012389720.value: "0" => ""
  ...
aws_db_parameter_group.th_replication-pg: Modifications complete
Comment 55 • 8 years ago
For some reason, removing the parameter didn't seem to apply the change; subsequent 'terraform apply' runs would always want to change the parameter group. I've instead set the parameter explicitly to 1. With that done, terraform state is now properly sync'd with reality.
Reporter | Comment 56 • 8 years ago
Please could you re-enable backups for this instance now? :-)
Comment 57 • 8 years ago
(In reply to Ed Morley [:emorley] from comment #56)
> Please could you re-enable backups for this instance now? :-)

Backups have been re-enabled with 1 day retention.

aws_db_instance.treeherder-prod-rds: Modifying...
  backup_retention_period: "0" => "1"
aws_db_instance.treeherder-prod-rds: Still modifying... (10s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (20s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (30s elapsed)
aws_db_instance.treeherder-prod-rds: Modifications complete

commit 2bfae726d85af3463191b4a36021a07133b657b7
Author: Jake Watkins <jwatkins@mozilla.com>
Date:   Thu Sep 29 08:29:40 2016 -0700

    Bug 1283170 - Re-enable rds backups with 1 day retention
Reporter | Comment 58 • 8 years ago
Many thanks! Think we're all done here for now. Switching off replication/making non-readonly etc will occur as part of deps of bug 1176484.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Reporter | Comment 59 • 8 years ago
(In reply to Jake Watkins [:dividehex] from comment #57)
> Backups have been re-enabled with 1 day retention.

I don't suppose you could apply this change now, rather than at the next maintenance window?
Flags: needinfo?(jwatkins)
Reporter | Comment 60 • 8 years ago
And sorry to be a pain, but replication is falling behind slightly again (~5000s behind) - could you set:

innodb_flush_log_at_trx_commit => 2

(which is slightly safer than the value of '0' we used before, but still faster than '1')

Thanks!
Comment 61 • 8 years ago
(In reply to Ed Morley [:emorley] from comment #60)
> And sorry to be a pain, but replication is falling behind slightly again
> (~5000s behind) - could you set:
> innodb_flush_log_at_trx_commit => 2
> (which is slightly safer than the value of '0' we used before, but still
> faster than '1')
>
> Thanks!

aws_db_parameter_group.th_replication-pg: Modifications complete
aws_db_instance.treeherder-prod-rds: Still modifying... (10s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (20s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (30s elapsed)
aws_db_instance.treeherder-prod-rds: Modifications complete
Flags: needinfo?(jwatkins)
Comment 62 • 8 years ago
Replication has caught up with the innodb_flush_log_at_trx_commit => 2 change.

s_behind d_behind c_sec_s eta_d | O_c_sec_s O_eta_d O_eta_h
5969 0.069086 0.467 0.148 | 0.924 0.075 1.79
5980 0.069213 0.817 0.085 | 0.910 0.076 1.82
5975 0.069155 1.083 0.064 | 0.930 0.074 1.79
4495 0.052025 25.667 0.002 | 3.403 0.015 0.37
2824 0.032685 28.850 0.001 | 5.717 0.006 0.14
696 0.008056 36.467 0.000 | 8.279 0.001 0.02
0 0.000000 12.600 0.000 | 8.612 0.000 0.00
0 0.000000 1.000 0.000 | 1.000 0.000 0.00
Comment 63 • 8 years ago
(In reply to Ed Morley [:emorley] from comment #59)
> (In reply to Jake Watkins [:dividehex] from comment #57)
> > Backups have been re-enabled with 1 day retention.
>
> I don't suppose you could apply this change now, rather than at the next
> maintenance window?

I've reapplied this with apply_immediately set to 'True'. Although terraform is still waiting for the modification to complete (it's been over ~40 mins), the console logs show the backup_retention has been properly applied and the instance is in fact running a backup. I'll check back on this later to ensure it finishes without issue.
Comment 64 • 8 years ago
~5 hours later, and the RDS instance state is still 'modifying'; Terraform timed out. Replication still looks fine, with no lag. I'm fairly confident this is due to the backup snapshot that is still being created. The snapshot is at 65% (as of 0610 UTC).

aws_db_instance.treeherder-prod-rds: Still modifying... (1h55m20s elapsed)
Error applying plan:

1 error(s) occurred:

* aws_db_instance.treeherder-prod-rds: timeout while waiting for state to become 'available' (last state: 'modifying', timeout: 1h20m0s)
Reporter | Comment 65 • 8 years ago
All looks good! Many thanks for your (super-responsive) help Jake :-)