Closed Bug 1283170 Opened 8 years ago Closed 8 years ago

Setup Treeherder replication to AWS (prod SCl3 -> prod RDS instance)

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: fubar)

References

Details

This is the RDS production instance equivalent of bug 1176486. See also bug 1246965 and the Pythian plan attached to that bug (attachment 8729694 [details]).

We'd like to get this replication set up roughly prior to the Treeherder stage migration, such that when Treeherder stage has successfully switched over, we're ready to move Treeherder prod to Heroku without needing to wait too long for the DB migration parts. (We can leave it replicating in the meantime).

I've written a rough draft of the plan based on the steps that were documented in the stage bugs / Pythian document:
https://public.etherpad-mozilla.org/p/treeherder-prod-rds-migration-draft

Kendall, would you mind making any changes if needed, so I can confirm the plan of action and post in this bug? We can file a dependent bug from this one for any of the ACL or DBA-requiring parts in their respective components, as needed.

Many thanks :-)
Flags: needinfo?(klibby)
Summary: Setup Treeheder replication to AWS (prod SCl3 -> prod RDS instance) → Setup Treeherder replication to AWS (prod SCl3 -> prod RDS instance)
Only two minor notes added:
- pretty sure the access issue is because we flipped the new prod instance to be public, and the VPN/ACL assume it has a 10.0.0.0/8 address (it still does, but the hostname doesn't resolve to it and there's no way, afaict, to look it up without flipping it back to private)

- when the db dump is created we need the binlog name and position recorded for setting up replication
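
For reference, a sketch of one way to capture those coordinates at dump time (assuming mysqldump is used - the DBAs may well do this differently; the filename is illustrative):

$ mysqldump --single-transaction --master-data=2 --all-databases | gzip > treeherder_dump.sql.gz
## --master-data=2 writes the master's binlog file/position near the top of the
## dump as a commented-out CHANGE MASTER TO statement:
$ zcat treeherder_dump.sql.gz | head -n 40 | grep 'CHANGE MASTER TO'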
Flags: needinfo?(klibby)
Thank you for reviewing.

So the plan for the SCL3 prod -> RDS *prod* migration is:

Pre-requisites:
* Re-establish connectivity from SCL3 treeherder prod db to the treeherder-prod RDS instance, since this command currently fails:
     treeherder2.db.scl3$ nc -vz treeherder-prod.REDACTED.us-east-1.rds.amazonaws.com 3306
   -> (were the VPC settings or ACLs changed since bug 1246965 ?)
   -> this may be due to having flipped the instances to public; the VPN and ACL presume it's in the 10.0.0.0/8 space, IIRC
* Check we still have the password for the `repl` user on treeherder2.db.scl3.
* Check the `repl` user still has the necessary grants on treeherder2.db.scl3.
* Adjust `expire_logs_days` on treeherder2.db.scl3, to cover the time between the dump being performed and replication being set up.
* Check the cron job that deletes bin logs has been disabled (see bug 1246965 comment 39).

When we're ready to go:
* Get the DBAs to make a dump of the current Treeherder DB and place it in treeherder2.db.scl3:/data
   -> we need the binlog filename and position the dump is based on, for set_external_master below
* Copy the dump to an EC2 node.
* Apply the `th-import` parameter group to the prod RDS instance.
* Reboot the prod RDS instance.
* Import the dump into the prod RDS instance using the EC2 node.
* Wait for import to complete.
* Run this on the RDS instance (substituting in the necessary values):
  CALL mysql.rds_set_external_master ('<treeherder2.db.scl3 IP address>', 3306, 'repl', '<password>', '<binlog filename>', <binlog position>, 0);
  CALL mysql.rds_start_replication;
* Check replication is succeeding by running `show slave status\G` on the RDS instance several times and ensuring that `seconds_behind_master` is decreasing.
* Stop replication by running `CALL mysql.rds_stop_replication;` on the RDS instance.
* Apply the `th-replication` parameter group to the prod RDS instance.
* Reboot the prod RDS instance.
* Restart replication on the RDS instance using `CALL mysql.rds_start_replication;`.
* Check replication still working using `show slave status\G`.
* Run this on the RDS instance `CALL mysql.rds_set_configuration('binlog retention hours', 48);` & verify with `CALL mysql.rds_show_configuration;`.
* Add th_admin and th_user grants to the RDS instance.
* Check replication still working using `show slave status\G` & monitor the decreasing `seconds_behind_master`.
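
For the `show slave status\G` checks above, the fields worth watching (names as in MySQL 5.6; values below are made up):

  mysql> show slave status\G
              Slave_IO_Running: Yes       <- IO thread: fetching binlog events from SCL3
             Slave_SQL_Running: Yes       <- SQL thread: applying the relay log
         Seconds_Behind_Master: 12345     <- should trend towards 0
           Read_Master_Log_Pos: 93145000  <- how far the IO thread has read
           Exec_Master_Log_Pos: 93012345  <- how far the SQL thread has applied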
> Pre-requisites:
> * Re-establish connectivity from SCL3 treeherder prod db to the treeherder-prod RDS instance

-> Filed bug 1285995.

> * Check we still have the password for the `repl` user on treeherder2.db.scl3.
> * Check the `repl` user still has the necessary grants on treeherder2.db.scl3.

Kendall, can you confirm?

> * Adjust `expire_logs_days` on treeherder2.db.scl3, to cover the time
> between the dump being performed and replication being set up.

On treeherder2.db.scl3:

SHOW VARIABLES LIKE 'expire_logs_days'

+ ------------------ + ---------- +
| Variable_name      | Value      |
+ ------------------ + ---------- +
| expire_logs_days   | 7          |
+ ------------------ + ---------- +
1 row in set
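
If the gap between the dump and replication being set up could exceed those 7 days, a sketch of the adjustment (14 is an illustrative value; it would also need adding to my.cnf to survive a mysqld restart):

SET GLOBAL expire_logs_days = 14;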

> * Check the cron job that deletes bin logs has been disabled (see bug
> 1246965 comment 39).

Kendall, can you confirm?
Flags: needinfo?(klibby)
I have to defer to DBAs on both of those points. NI on Matt/Sheeri
Flags: needinfo?(scabral)
Flags: needinfo?(mpressman)
Flags: needinfo?(klibby)
I can confirm the purge_logs script (/etc/cron.d/purge_logs), which used to remove binlogs older than 3 hours, is no longer on any of the treeherder machines (stage or prod).

I don't know what password you have, but I can confirm that the password begins with "z" and ends with "9", and I don't recall it having been changed in the last few months. The replication user still has the proper grants.
Flags: needinfo?(scabral)
Flags: needinfo?(mpressman)
Thanks sheeri :-)
It seems like the pre-requisites are now complete. Could I leave you to coordinate with the DBAs on actioning the plan in comment 2? :-)
Flags: needinfo?(klibby)
We're going to make some invasive DB changes shortly, so would prefer the DB dump not happen during that window.

I'll reopen needinfo when we're good to proceed.
Flags: needinfo?(klibby)
We're just waiting for bug 1258861 to be finished, but that should be by ~Wed next week.

Kendall, would it be possible for you to have another glance over the plan here + coordinate with the DBA between now and Wed, with the hope of then getting the DB dump/replication at least in progress by the end of next week (15-16th)?

Thanks :-)
Flags: needinfo?(klibby)
f+ from me. I'll try to get things rolling this week, and possibly follow up off-hours while I'm in MTV for training next week. I'm on PTO the last week of September, so will do my best to have it sorted before then.
Flags: needinfo?(klibby)
Depends on: 1302790
Please could we get this started today? :-)
Flags: needinfo?(klibby)
Ball is rolling. Will work on this while I'm in MTV before/after training.
Flags: needinfo?(klibby)
Amazing, thank you :-)
Right, we're ready to proceed with the rest of the comment 2 steps.

The file is at treeherder1.db.scl:/data/treeherderdb_20160920.gz
The master will be treeherder2.db.scl.mozilla.com, with log filename 'treeherder2-bin.020350' and position 190207574.

Thanks!
Flags: needinfo?(klibby)
Data has been copied to a temp instance and import has started.
Blocks: 1277304
Thank you!

Looking at the CloudWatch graphs, I think the initial import has now completed.
The second/corrected dump by Pythian has just completed - the file is at treeherder1.db.scl:/data/treeherder_dump_09212016.dmp.gz

Could you start the copy to EC2 / initial import?

Many thanks!
new dump is loading
Flags: needinfo?(klibby)
I'm seeing the replication error icon on the RDS dashboard - was the existing replication turned off using `CALL mysql.rds_stop_replication;`?
I'm seeing replication error log entries from Sep 22 8:35 PM - please can you disable replication? I'm concerned it may corrupt the new import.
son of a...  yeah, it wasn't disabled. I've stopped it and restarted the import to ensure no issues.
... and the errors keep appearing even after stopping replication. *sigh*
thanks; rebooting looks like it helped, but I've run that as well. *grumble* import is running again.
Unfortunately the daily backup has been running for the last few hours, which is slowing down the import. I guess it's too late this time, but perhaps if we need to restart the import for some reason, we should temporarily disable the backups.
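
For future reference, a sketch of temporarily disabling the automated backups from the CLI (the instance identifier is illustrative; the same is achievable via backup_retention_period in the terraform config):

$ aws rds modify-db-instance \
    --db-instance-identifier treeherder-prod \
    --backup-retention-period 0 \
    --apply-immediately
## setting the retention period back to a non-zero value re-enables backups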
Import finished and replication is underway, if a bit behind. 

For posterity, steps taken:

## scp data to instance
## verify RDS instance has th_import param group set
## verify mysql connection from treeherder1.db to RDS instance
## set binlog retention hours to 1
mysql> call mysql.rds_set_configuration('binlog retention hours', 1);
mysql> call mysql.rds_show_configuration;

## import
$ gunzip -c /mnt/tmp/treeherder_dump_09212016.dmp.gz | mysql -h treeherder-prod.REDACTED.us-east-1.rds.amazonaws.com -u th_admin --ssl_ca=rds-combined-ca-bundle.pem --ssl-verify-server-cert -p > import.log 2>&1
## (the pipeline is suspended with Ctrl-Z first, then resumed in the background and detached)
$ bg
$ disown -h
## verify running
$ mysql -h treeherder-prod.REDACTED.us-east-1.rds.amazonaws.com -u th_admin --ssl_ca=rds-combined-ca-bundle.pem --ssl-verify-server-cert -p
mysql> show processlist;
mysql> call mysql.rds_set_external_master ('10.22.70.136', 3306, 'repl', 'REDACTED', 'treeherder1-bin.022652', 165073000, 0);
mysql> call mysql.rds_start_replication;
## Check status a few times.
mysql> show slave status\G
## If all OK and seconds_behind_master is decreasing, we can restart with normal parameters.
## Stop before reboot.
mysql> call mysql.rds_stop_replication;

## change param group to th_replication in terraform config and apply
git clone git@github.com:mozilla-platform-ops/devservices-aws.git
cd devservices-aws/treeherder
## aws creds need to be setup/available (eg mfa-start-session devservices)
./init.sh
vi rds.tf
## under treeherder-prod-rds, change parameter_group_name from th-import to th-replication
terraform plan
terraform apply
git add rds.tf
git ci -m "Bug 1283170 - change treeherder-prod to th-replication param group" rds.tf
git push origin master
## reboot instance via console :-(

## Restart replication, import users
mysql> call mysql.rds_start_replication;
## Check status a few times.
mysql> show slave status\G
## Set binlog retention hours to at least 24, recommended 48
mysql> call mysql.rds_set_configuration('binlog retention hours', 48);
## Verify by:
mysql> call mysql.rds_show_configuration;


Seconds_Behind_Master is currently *increasing*, which means that it's not yet replaying changes and catching up. Since Relay_Log_Space is also increasing, I believe it's still copying over the events to replay. But I'm keeping an eye on it.

Email with instructions for handling other events forthcoming!
I've connected to the RDS instance (using treeherder1.db.scl3 as a jumphost, since it's not publicly accessible), and `show slave status` reports Seconds_Behind_Master as 284102 (and rising).
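
For anyone else needing to connect, one way is an SSH tunnel through the SCL3 host (a sketch; the local port is chosen arbitrarily):

$ ssh -N -L 3307:treeherder-prod.REDACTED.us-east-1.rds.amazonaws.com:3306 treeherder1.db.scl3.mozilla.com &
$ mysql -h 127.0.0.1 -P 3307 -u th_admin -p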

Pythian, any idea what's up here? Is treeherder1 set up properly for replication? (It's the slave to treeherder2 aiui).

The full output of `show slave status` is here:
https://emorley.pastebin.mozilla.org/8913193
Flags: needinfo?(team73)
This seems to imply no replication events are being sent from treeherder1.db.scl3...

Slave_IO_State: Waiting for master to send event

This is really quite frustrating. Instead of being caught up over the weekend, we're now 3 days behind SCL3.
So `show slave status` reports the position as having advanced past that originally set in comment 26:

              Master_Log_File: treeherder1-bin.022672
          Read_Master_Log_Pos: 349072372

So I'm guessing replication was initially fine, but didn't resume after the parameter group was switched from th-import to th-replication and the instance rebooted? However, this was derived from the plan originally devised by Pythian in bug 1246965, so it should be correct. Unfortunately fubar is now on PTO, so we have to rely on what comment 26 says.

I've tried another `call mysql.rds_start_replication;` but that just reports replication as already started.

Can anyone else help out here?

It's just that it's 17 days since comment 9, and 10 days since we wanted the replication to be running and caught up, and it feels like we're still no closer to getting this complete. Any further delay will mean we'll have to do another DB dump, since we'll be further behind than the number of days of binlogs retained on treeherder1.db.scl3.
Flags: needinfo?(scabral)
Flags: needinfo?(mpressman)
I ran a `call mysql.rds_stop_replication` followed by (another) `call mysql.rds_start_replication`, which appeared to get replication proceeding again, though extremely slowly (seconds behind would regularly increase).

We were using a db.m4.xlarge, which has only 16GB RAM and 4 vCPUs, so I've changed it to a db.m4.2xlarge, which has double of both. Normally we'd want to make that change using Terraform, however I don't have access and we can't wait any longer, so I've gone ahead and adjusted it using the AWS console (it turns out the permissions weren't as locked down as they were meant to be).

This changed the instance IP, so I had to adjust the repl user host on SCL3 to use a wildcard on the last octet.
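
A sketch of that host adjustment on the SCL3 side (IP values are placeholders; RENAME USER preserves the existing grants):

RENAME USER 'repl'@'<old RDS instance IP>' TO 'repl'@'<first three octets>.%';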

After resuming replication, Seconds_Behind_Master is now decreasing slightly faster, though it seems like the bottleneck might be on the SCL3 side or in the current slave config.

Pythian/Sheeri/Matt, can you suggest some tweaks that might speed things up? Our deviations from the RDS MySQL 5.6 defaults are here (and come from the Pythian plan from bug 1246965):
https://github.com/mozilla-platform-ops/devservices-aws/blob/d07d83bbc207c60824bd74e7e30d592e752a9454/treeherder/rds.tf#L69-L174
If I'm interpreting this correctly, the slowness in catching up is not on the IO thread but the SQL thread:
https://emorley.pastebin.mozilla.org/8913272

CPU utilisation on the RDS instance is low, and the IOPS is well under (like half) of our allowance on AWS, plus queue depth and read/write latency are all low.

As such, I'm somewhat puzzled as to why the slave isn't applying the relay log more quickly? (We're still over 3 days behind)
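
Roughly, the distinction can be read straight off `show slave status\G`:

## IO thread lag: Master_Log_File / Read_Master_Log_Pos trail the coordinates
##   shown by `SHOW MASTER STATUS` on treeherder1.db.scl3 (events not yet copied over)
## SQL thread lag: Exec_Master_Log_Pos trails Read_Master_Log_Pos and
##   Relay_Log_Space keeps growing (events copied but not yet applied)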
Treeherder is a large database and we checksum the data twice daily, to make sure the slaves match the masters. So twice a day every row gets read and checksum'd... which makes replication catch-up difficult.

That being said, turning off binary logging for the time being can speed up replication - by up to 15%.

If the master knows the slave is connected, it will not purge the logs, so that should not be an issue.
Flags: needinfo?(scabral)
I'm not sure how the checksums are relevant here? There is no IO lag on the RDS instance, so the SCL3 instance is keeping up with sending the binlogs to it, and there will be no checksums running against the RDS instance itself.
The checksums replicate, which means they run on ALL the slaves of the master, including backup instances and RDS.
Oh, they're calculated as an actual query? OK, that's pretty expensive. (I was imagining a bunch of SELECTs, with whatever tool calculating the checksums and then separately storing the results in small writes.)
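
For anyone following along, a rough sketch of the kind of statement a pt-table-checksum-style tool issues (table, columns, chunk boundaries and the percona.checksums results table are all illustrative here); because it's a single REPLACE ... SELECT, the chunk is re-read on every slave that replicates the statement:

REPLACE INTO percona.checksums (db, tbl, chunk, this_cnt, this_crc)
SELECT 'treeherder', 'job', 42, COUNT(*),
       LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', id, state, result)) AS UNSIGNED)), 10, 16))
  FROM treeherder.job
 WHERE id BETWEEN 4200001 AND 4300000;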

I don't see us needing to keep on using checksums on RDS - and even if we do, we could re-enable later.

We can't filter these on RDS, since they don't give us the necessary permissions, and the RDS-provided procedures (such as mysql.rds_start_replication) don't allow for customisation.

Could we just disable checksums everywhere, until replication has caught up?
(In reply to Sheeri Cabral [:sheeri] from comment #32)
> That being said, turning off binary logging for the time being can make
> replication faster - up to 15% faster.

Do you mean turning off binary logging downstream of the RDS instance? It's a multi-AZ RDS instance, so it has a slave of its own - I presume that won't be possible?
(In reply to Ed Morley [:emorley] from comment #35)
> Could we just disable checksums everywhere, until replication has caught up?

Since I didn't hear back yesterday and we need to proceed with this as soon as possible, I've gone ahead and disabled checksums temporarily.

I couldn't do this via puppet/whatever, since I don't have access, so instead I've altered the `host` for the `checksum` user, so it cannot log in.

Last night I also scaled the RDS instance's storage from 500GB to 750GB, since that gives us more IOPS (we're using general purpose (gp2) storage, and whilst we didn't seem to be hitting the credit limit previously, I'm running out of ideas), plus disabled daily backups on RDS, so they don't slow down the catchup.
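
(For the record, gp2 baseline IOPS scale at 3 IOPS per GB if I have the numbers right, so 500GB -> 750GB takes the baseline from ~1500 to ~2250 IOPS, with bursts up to 3000 either way.)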

I don't have AWS permissions to change the parameter groups on RDS (and setting variables dynamically isn't supported on RDS), so I couldn't experiment with different cache sizes etc.

Seconds_Behind_Master: 285651

= 3.3 days (still)
Jake, could you let me know when you're online and we can try some experiments with parameter groups? (I don't have the required AWS permissions).
Flags: needinfo?(jwatkins)
(In reply to Ed Morley [:emorley] from comment #37)
> I couldn't do this via puppet/whatever, since I don't have access, so
> instead I've altered the `host` for the `checksum` user, so it cannot login.

Puppet added another checksum user with the prior host, so I've instead set password_expired=Y, since that doesn't appear to get reverted.
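
For posterity, a sketch of that change (the direct-update route, since ALTER USER would need the exact host value):

UPDATE mysql.user SET password_expired = 'Y' WHERE User = 'checksum';
FLUSH PRIVILEGES;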

I'm now running a replication stats script to give a better eta:
https://emorley.pastebin.mozilla.org/8913491

s_behind d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
  250764 2.902361     1.000   2.902 |     1.000   2.902   69.66
  250488 2.899167     5.600   0.518 |     3.300   0.879   21.08
  250185 2.895660     6.050   0.479 |     4.217   0.687   16.48
  250156 2.895324     1.483   1.952 |     3.533   0.819   19.67
  250046 2.894051     2.833   1.021 |     3.393   0.853   20.47
  249831 2.891563     4.583   0.631 |     3.592   0.805   19.32
  249739 2.890498     2.533   1.141 |     3.440   0.840   20.16
  249711 2.890174     1.467   1.971 |     3.194   0.905   21.72
  249622 2.889144     2.483   1.163 |     3.115   0.928   22.26
  249512 2.887870     2.833   1.019 |     3.087   0.936   22.45
s_behind d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
  249400 2.886574     2.867   1.007 |     3.067   0.941   22.59
  249291 2.885312     2.817   1.024 |     3.046   0.947   22.74
  249211 2.884387     2.333   1.236 |     2.991   0.964   23.14
  249175 2.883970     1.600   1.802 |     2.892   0.997   23.94
  249036 2.882361     3.317   0.869 |     2.920   0.987   23.69
  248971 2.881609     2.083   1.383 |     2.868   1.005   24.12
  248924 2.881065     1.783   1.616 |     2.804   1.028   24.66
  248732 2.878843     4.200   0.685 |     2.881   0.999   23.98
  248736 2.878889     0.933   3.085 |     2.779   1.036   24.86
  248518 2.876366     4.633   0.621 |     2.872   1.002   24.04

For key, see: https://www.percona.com/blog/2012/08/29/heres-a-quick-way-to-foresee-if-replication-slave-is-ever-going-to-catch-up-and-when/

For future reference, switching from db.m4.xlarge to db.m4.2xlarge increased observed write IOPS during replication catchup from an average of 560 to 690.
Heh, you've thwarted the failsafes we put in place. Checksums are pretty important to make sure that the data gets to AWS OK, but if you're OK with turning them off, it's your app and you know the implications of lack of data quality.

I'm also OK with it because the last time there were issues with data quality was March, and because you're going to be failing over pretty soon.
Flags: needinfo?(team73)
Flags: needinfo?(mpressman)
I've modified the terraform manifests to match the changes made under the console. This brings it back into a synchronized state.

https://github.com/mozilla-platform-ops/devservices-aws/commit/b6693d4bed2230e77e8f78d1d4e5fc447a445529
Flags: needinfo?(jwatkins)
Update 2 hours after comment 39:

s_behind d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
  206982 2.395625    33.050   0.072 |     6.570   0.365    8.75
  205683 2.380590    22.650   0.105 |     6.692   0.356    8.54
  205542 2.378958     3.350   0.710 |     6.667   0.357    8.56
  205445 2.377836     2.617   0.909 |     6.637   0.358    8.60
  205402 2.377338     1.717   1.385 |     6.600   0.360    8.64
  205236 2.375417     3.767   0.631 |     6.579   0.361    8.66
  205136 2.374259     2.667   0.890 |     6.551   0.362    8.70
  204998 2.372662     3.300   0.719 |     6.527   0.363    8.72
  204968 2.372315     1.500   1.582 |     6.491   0.365    8.77
  204824 2.370648     3.400   0.697 |     6.469   0.366    8.80

We seem to be winning the battle - although likely helped by the binlog from 2.37 days ago being from the weekend, when Treeherder ingestion is lower. Hopefully yesterday's tree closure will mean we have a 3-day stretch of low data ingestion to allow us to catch up, plus, now that checksums are disabled, the binlogs from today will at least be smaller.

(In reply to Sheeri Cabral [:sheeri] from comment #40)
> Heh, you've thwarted the failsafes we put in place. Checksums are pretty
> important to make sure that the data gets to AWS OK, but if you're OK with
> turning them off, it's your app and you know the implications of lack of
> data quality.

I agree that leaving them enabled is preferable, however in the current situation (and with, if I'm being brutally honest, not much assistance from Pythian/...) there wasn't really much other choice. We can always re-enable checksums once replication has caught up.

(In reply to Jake Watkins [:dividehex] from comment #41)
> I've modified the terraform manifests to match the changes made under the
> console. This brings it back into a synchronized state.

Many thanks! As just discussed on IRC, let's hold off on any other changes, now that replication is at least not getting any further behind.
I have disabled monitoring for 2 weeks on the checksums, so Pythian doesn't get paged needlessly. We can do checksums manually when we need to, as they might cause replication to lag again if we turn them on fully.
The catch-up has slowed down significantly, but seconds behind is still at least decreasing:

s_behind d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
  119421 1.382187     1.583   0.873 |     4.768   0.290    6.96
  119435 1.382350     0.767   1.803 |     4.761   0.290    6.97
  119414 1.382106     1.350   1.024 |     4.755   0.291    6.98
  119408 1.382037     1.100   1.256 |     4.749   0.291    6.98
  119420 1.382176     0.800   1.728 |     4.742   0.291    7.00
  119365 1.381539     1.917   0.721 |     4.737   0.292    7.00
  119384 1.381759     0.683   2.022 |     4.730   0.292    7.01
  119335 1.381192     1.817   0.760 |     4.725   0.292    7.02
  119339 1.381238     0.933   1.480 |     4.719   0.293    7.02
  119290 1.380671     1.817   0.760 |     4.714   0.293    7.03

If we're lucky, we may be caught up sometime tomorrow, or if not, Thursday.
There might be some ebb and flow to the replication. I've been watching it for most of the day and I can see blocks of time where c_sec_s increases for a while and then decreases again. As long as c_sec_s stays above 1.0, things are moving forward.

For instance, about 90m ago there was a burst:
s_behind d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
  109105 1.262789     5.917   0.213 |     1.878   0.672   16.14
  108969 1.261215     3.267   0.386 |     1.883   0.670   16.08
  108701 1.258113     5.467   0.230 |     1.894   0.664   15.94
  108556 1.256435     3.417   0.368 |     1.899   0.662   15.88
  108316 1.253657     5.000   0.251 |     1.909   0.657   15.76
  107962 1.249560     6.900   0.181 |     1.925   0.649   15.58
  107618 1.245579     6.733   0.185 |     1.940   0.642   15.41
  107179 1.240498     8.317   0.149 |     1.960   0.633   15.19
  106985 1.238252     4.233   0.293 |     1.967   0.630   15.11
  106577 1.233530     7.800   0.158 |     1.985   0.621   14.91
s_behind d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
  105918 1.225903    11.983   0.102 |     2.016   0.608   14.59
  105082 1.216227    14.933   0.081 |     2.056   0.591   14.19
  103878 1.202292    21.067   0.057 |     2.115   0.568   13.64
  103730 1.200579     3.467   0.346 |     2.119   0.566   13.60
  103716 1.200417     1.233   0.973 |     2.117   0.567   13.61
  102987 1.191979    13.150   0.091 |     2.151   0.554   13.30
  100302 1.160903    45.750   0.025 |     2.284   0.508   12.20
   98801 1.143530    26.017   0.044 |     2.356   0.485   11.65
   98095 1.135359    12.767   0.089 |     2.388   0.475   11.41
   97465 1.128067    11.500   0.098 |     2.416   0.467   11.21
s_behind d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
   97154 1.124468     6.183   0.182 |     2.427   0.463   11.12
   97102 1.123866     1.867   0.602 |     2.425   0.463   11.12
   97062 1.123403     1.667   0.674 |     2.423   0.464   11.13
   96881 1.121308     4.017   0.279 |     2.428   0.462   11.09
   96673 1.118900     4.467   0.251 |     2.434   0.460   11.03
   96494 1.116829     3.983   0.280 |     2.438   0.458   10.99
   96310 1.114699     4.067   0.274 |     2.443   0.456   10.95
   96119 1.112488     4.183   0.266 |     2.448   0.454   10.91
   95847 1.109340     5.533   0.200 |     2.457   0.451   10.83
   95585 1.106308     5.367   0.206 |     2.466   0.449   10.77
Replication failed earlier with:

Last_SQL_Error: Error 'Access denied for user ''@'' (using password: NO)' on query. Default database: ''. Query: 'GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE, EVENT, TRIGGER, CREATE TABLESPACE ON *.* TO 'checksum'@'REDACTED' IDENTIFIED BY PASSWORD '*REDACTED''

This is just from when puppet was trying to re-establish the checksum user (comment 39), so is safe to skip. After a `CALL mysql.rds_skip_repl_error;`, replication has resumed fine.

I've also found a bug in the script from the Percona blog, which resulted in incorrect ETAs. A fixed version (with some other tweaks) is here:
https://gist.github.com/edmorley/f8589f4d6fa84df4cce1a13cbac83d98

Note: With the new script, negative `c_sec_s` means "falling even further behind", and 0 means "keeping up but not making headway into the deficit". (ie: subtract 1 from the c_sec_s in previous comments to compare)

s_behind h_behind | c_sec_s   eta_h | O_c_sec_s O_eta_h
   80618    22.39 |   1.283    17.4 |     0.441    50.8
   80613    22.39 |   0.083   268.7 |     0.411    54.5
   80557    22.38 |   0.933    24.0 |     0.451    49.6
   80547    22.37 |   0.167   134.2 |     0.431    51.9
   80501    22.36 |   0.767    29.2 |     0.453    49.3
   80493    22.36 |   0.133   167.7 |     0.433    51.6
   80451    22.35 |   0.700    31.9 |     0.449    49.8
   80476    22.35 |  -0.417     Inf |     0.401    55.8
   80404    22.33 |   1.200    18.6 |     0.443    50.4
   80418    22.34 |  -0.233     Inf |     0.409    54.6
ETA has now climbed up to 5 days :-/

s_behind h_behind | c_sec_s   eta_h | O_c_sec_s O_eta_h
   79207    22.00 |  -0.233     Inf |     0.201   109.4
   79201    22.00 |   0.100   220.0 |     0.200   109.7
   79211    22.00 |  -0.167     Inf |     0.198   111.2
   79215    22.00 |  -0.067     Inf |     0.196   112.2
   79216    22.00 |  -0.017     Inf |     0.195   113.1
   79238    22.01 |  -0.367     Inf |     0.191   115.4
   79253    22.01 |  -0.250     Inf |     0.188   117.3
   79235    22.01 |   0.300    73.4 |     0.189   116.8
   79241    22.01 |  -0.100     Inf |     0.187   118.0
   79252    22.01 |  -0.183     Inf |     0.184   119.6
Depends on: 1306010
Merged and applied https://github.com/mozilla-platform-ops/devservices-aws/pull/1

This sets Enhanced Monitoring metrics back to being polled at a 60s interval

aws_db_instance.treeherder-prod-rds: Modifying...
  monitoring_interval: "0" => "60"
aws_db_instance.treeherder-prod-rds: Still modifying... (10s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (20s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (30s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (40s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (50s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (1m0s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (1m10s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (1m20s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (1m30s elapsed)
aws_db_instance.treeherder-prod-rds: Modifications complete
As discussed in IRC, I've set the sync_binlog parameter to '0' under the th_replication parameter group

https://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#sysvar_sync_binlog

aws_db_parameter_group.th_replication-pg: Modifying...
  parameter.#:                       "24" => "25"
  ...
  parameter.881816039.apply_method:  "" => "immediate"
  parameter.881816039.name:          "" => "sync_binlog"
  parameter.881816039.value:         "" => "0"
  ...
aws_db_parameter_group.th_replication-pg: Modifications complete
I've set innodb_flush_log_at_trx_commit => 0

https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit

aws_db_parameter_group.th_replication-pg: Modifying...
  ...
  parameter.4012389720.apply_method: "" => "immediate"
  parameter.4012389720.name:         "" => "innodb_flush_log_at_trx_commit"
  parameter.4012389720.value:        "" => "0"
  ...
aws_db_parameter_group.th_replication-pg: Modifications complete
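
For context on the tradeoff: with sync_binlog=0 the binlog flush is left to the OS, and with innodb_flush_log_at_trx_commit=0 the redo log is written and flushed roughly once per second rather than at every commit - much faster for the single-threaded replication apply, at the cost of possibly losing the last ~second of transactions if the instance crashes. A quick sanity check that the parameter group change actually took effect on the instance:

mysql> SHOW VARIABLES LIKE 'sync_binlog';
mysql> SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';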
Changing innodb_flush_log_at_trx_commit => 0 definitely opened the flood gates.  At this rate, replication should catch up sometime today.

commit da8e29f455882425b0b17f33ec37691c690da8c8
Author: Jake Watkins <jwatkins@mozilla.com>
Date:   Wed Sep 28 10:21:37 2016 -0700

    Bug 1283170 - Set innodb_flush_log_at_trx_commit => 0

commit 461e9902b4d0cc17824f1efe7a032f2e69bdbdf8
Author: Jake Watkins <jwatkins@mozilla.com>
Date:   Wed Sep 28 08:48:49 2016 -0700

    Bug 1283170 - Set sync_binlog => 0
Replication is completely caught up!
Awesome :-)

Could we revert innodb_flush_log_at_trx_commit back to '1' now, and see if it holds steady (just to give more failure protection)? (I'm pretty sure it will)
(In reply to Ed Morley [:emorley] from comment #53)
> Awesome :-)
> 
> Could we revert innodb_flush_log_at_trx_commit back to '1' now, and see if
> it holds steady (just to give more failure protection)? (I'm pretty sure it
> will)

Sure :-)  I've reverted the commit.  innodb_flush_log_at_trx_commit should be back to the default value '1'

aws_db_parameter_group.th_replication-pg: Modifying...
  parameter.#:                       "26" => "25"
  ...
  parameter.4012389720.apply_method: "immediate" => ""
  parameter.4012389720.name:         "innodb_flush_log_at_trx_commit" => ""
  parameter.4012389720.value:        "0" => ""
  ...
aws_db_parameter_group.th_replication-pg: Modifications complete
For some reason, removing the parameter didn't seem to apply the change. Subsequent 'terraform apply' runs would always want to change the parameter group. I've instead set the parameter explicitly to 1. With that done, terraform state is now properly sync'd with reality.
Please could you re-enable backups for this instance now? :-)
(In reply to Ed Morley [:emorley] from comment #56)
> Please could you re-enable backups for this instance now? :-)

Backups have been re-enabled with 1 day retention.

aws_db_instance.treeherder-prod-rds: Modifying...
  backup_retention_period: "0" => "1"
aws_db_instance.treeherder-prod-rds: Still modifying... (10s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (20s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (30s elapsed)
aws_db_instance.treeherder-prod-rds: Modifications complete

commit 2bfae726d85af3463191b4a36021a07133b657b7
Author: Jake Watkins <jwatkins@mozilla.com>
Date:   Thu Sep 29 08:29:40 2016 -0700

    Bug 1283170 - Re-enable rds backups with 1 day retention
Many thanks!

Think we're all done here for now. Switching off replication, making the instance non-readonly, etc. will occur as part of the deps of bug 1176484.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
(In reply to Jake Watkins [:dividehex] from comment #57)
> Backups have been re-enabled with 1 day retention.

I don't suppose you could apply this change now, rather than at the next maintenance window?
Flags: needinfo?(jwatkins)
And sorry to be a pain, but replication is falling behind slightly again (~5000s behind) - could you set:
innodb_flush_log_at_trx_commit => 2
(slightly safer than the '0' we used before, but still faster than the default '1')

Thanks!
(In reply to Ed Morley [:emorley] from comment #60)
> And sorry to be a pain, but replication is falling behind slightly again
> (~5000s behind) - could you set:
> innodb_flush_log_at_trx_commit => 2
> (slightly safer than the '0' we used before, but still faster than the
> default '1')
> 
> Thanks!

aws_db_parameter_group.th_replication-pg: Modifications complete
aws_db_instance.treeherder-prod-rds: Still modifying... (10s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (20s elapsed)
aws_db_instance.treeherder-prod-rds: Still modifying... (30s elapsed)
aws_db_instance.treeherder-prod-rds: Modifications complete
Flags: needinfo?(jwatkins)
Replication has caught up with the innodb_flush_log_at_trx_commit => 2 change.

s_behind d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
    5969 0.069086     0.467   0.148 |     0.924   0.075    1.79
    5980 0.069213     0.817   0.085 |     0.910   0.076    1.82
    5975 0.069155     1.083   0.064 |     0.930   0.074    1.79
    4495 0.052025    25.667   0.002 |     3.403   0.015    0.37
    2824 0.032685    28.850   0.001 |     5.717   0.006    0.14
     696 0.008056    36.467   0.000 |     8.279   0.001    0.02
       0 0.000000    12.600   0.000 |     8.612   0.000    0.00
       0 0.000000     1.000   0.000 |     1.000   0.000    0.00
(In reply to Ed Morley [:emorley] from comment #59)
> (In reply to Jake Watkins [:dividehex] from comment #57)
> > Backups have been re-enabled with 1 day retention.
> 
> I don't suppose you could apply this change now, rather than at the next
> maintenance window?

I've reapplied this with apply_immediately set to 'True'
Although terraform is still waiting for the modification to complete (it's been over ~40mins), the console logs show the backup_retention has been properly applied and the instance is in fact running a backup.

I'll check back on this later to ensure it finishes without issue.
~5 hours later and the RDS instance state is still 'modifying'. Terraform timed out.
Replication still looks fine, with no lag.

I'm fairly confident this is due to the backup snapshot that is still being created.  The snapshot is at 65% (as of 0610UTC)

aws_db_instance.treeherder-prod-rds: Still modifying... (1h55m20s elapsed)
Error applying plan:
1 error(s) occurred:
* aws_db_instance.treeherder-prod-rds: timeout while waiting for state to become 'available' (last state: 'modifying', timeout: 1h20m0s)
All looks good! Many thanks for your (super-responsive) help Jake :-)