Closed Bug 712770 Opened 13 years ago Closed 12 years ago

re-synchronize of tb-b01-{master,slave}01

Categories

(Data & BI Services Team :: DB: MySQL, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: scabral)

Details

The slave has been alerting occasionally over the last 48h, e.g.,

16:07 < nagios-sjc1> [34] tm-b01-slave01:MySQL Replication Lag is CRITICAL: Replication Stopped - Last error: Illegal error code: 1062

Sheeri's already working on it, so this is just a pointer for others to look at.

I synced the auth_user table, then checksummed the database that contains it:

./pt-table-checksum --chunk-size=1000 --modulo=70 --offset=1 h=localhost,u=root --databases developer_mozilla_org_django --ask-pass --chunk-size-limit=10 --replicate mysql.checksum --algorithm=BIT_XOR

This found the users_registrationprofile table to be out of sync, so I'm syncing that now.
I ran the checksum again and found the django_session table out of sync, so I synced that as well.

Now there are no tables out of sync in the developer_mozilla_org_django db.
Status: NEW → ASSIGNED
For reference, the full checksum of developer_mozilla_org_django (not just the offset=1, modulo=70 slice) was run with:

time ./pt-table-checksum --chunk-size=1000 h=localhost,u=root --databases developer_mozilla_org_django --ask-pass --chunk-size-limit=10 --replicate mysql.checksum --algorithm=BIT_XOR

Running this now:

./pt-table-checksum --chunk-size=1000 --modulo=70 --offset=1 h=localhost,u=root --databases developer_mozilla_org_django --ask-pass --chunk-size-limit=10 --replicate mysql.checksum --algorithm=BIT_XOR

I will work on making that a regular job. Each run covers 1/70th of the data, so at 5 runs per day we'll go through the entire data set in 14 days (2 weeks).
I've set up a cron job on tm-b01-master01 that runs a checksum script for 1/70th of the data and, when it finishes, updates the offset. Each run takes about 30 minutes (it tries to limit its impact on replication), so I've scheduled it for 23:00, 00:00, 01:00, 02:00 and 03:00, which gives 5 runs per day.
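
For anyone who wants to reproduce the rotation, here is a rough sketch of what such a wrapper could look like. It is not the script that was actually deployed: the state-file path, the function names and the crontab path are hypothetical, and the pt-table-checksum flags are simply copied from the command above. --ask-pass is dropped because cron cannot answer a prompt, so the sketch assumes credentials are supplied some other way (for example an option file).

```shell
#!/bin/sh
# Hypothetical sketch of a rotating checksum wrapper; NOT the deployed script.
# Assumed crontab entry (path is made up), 5 runs per day:
#   0 23,0,1,2,3 * * * /path/to/checksum_slice.sh

MODULO=70                                             # number of slices, as in this bug
STATE_FILE="${STATE_FILE:-/var/tmp/checksum_offset}"  # assumed location of the saved offset

# Print the offset to use for this run (0 if nothing has been saved yet).
read_offset() {
    if [ -r "$STATE_FILE" ]; then
        cat "$STATE_FILE"
    else
        echo 0
    fi
}

# Print the offset that follows $1, wrapping around at $MODULO.
next_offset() {
    echo $(( ($1 + 1) % MODULO ))
}

# Checksum one slice, then advance the saved offset only if the run succeeded.
run_slice() {
    offset=$(read_offset)
    ./pt-table-checksum --chunk-size=1000 --modulo="$MODULO" --offset="$offset" \
        h=localhost,u=root --databases developer_mozilla_org_django \
        --chunk-size-limit=10 --replicate mysql.checksum --algorithm=BIT_XOR \
        && next_offset "$offset" > "$STATE_FILE"
}
```

The cron entry would call run_slice at the end of the script. Advancing the offset only on success means a failed run is retried on the same slice rather than skipped, at the cost of stretching the 14-day cycle.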

After a few days I'll review the results and put a Nagios check in place to monitor the output.

The nagios check should:

Check that mysql.checksum has rows from within the last day:

SELECT COUNT(*) FROM mysql.checksum WHERE ts>=NOW()-INTERVAL 1 DAY;

If there are none, raise an error. If there are, see whether any rows are mismatched:

SELECT db,tbl,boundaries FROM mysql.checksum WHERE this_crc!=master_crc AND db!='mysql' LIMIT 10;

If there are any such rows, e-mail them to infra-dbnotices. The LIMIT 10 is there to keep the output short.

There should also be a hook to ignore arbitrary tables, so the query could be:

SELECT db,tbl,boundaries FROM mysql.checksum WHERE this_crc!=master_crc AND db!='mysql' AND tbl NOT IN ('foo','bar','baz') LIMIT 10;
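
A rough sketch of how the check logic above could be wired together follows. Nothing here is an existing plugin: the function names and the IGNORE_TABLES variable are hypothetical, and the mapping to exit codes (2 = CRITICAL when there are no recent checksums, 1 = WARNING when rows are mismatched) is my assumption on top of the standard Nagios return codes. The ignore list is assumed to hold trusted table names, since it is pasted into the SQL as-is. Mailing the rows to infra-dbnotices is left out.

```shell
#!/bin/sh
# Hypothetical Nagios-style check for mysql.checksum; a sketch, not a deployed plugin.

IGNORE_TABLES="${IGNORE_TABLES:-}"   # comma-separated, e.g. "foo,bar,baz" (assumed hook)

# Turn "foo,bar" into "'foo','bar'" for use inside NOT IN (...).
quote_list() {
    echo "$1" | sed "s/,/','/g; s/^/'/; s/\$/'/"
}

# Build the mismatch query, adding the ignore clause only when tables are listed.
mismatch_query() {
    q="SELECT db,tbl,boundaries FROM mysql.checksum WHERE this_crc!=master_crc AND db!='mysql'"
    if [ -n "$IGNORE_TABLES" ]; then
        q="$q AND tbl NOT IN ($(quote_list "$IGNORE_TABLES"))"
    fi
    echo "$q LIMIT 10;"
}

# Map the two query results to a status line and exit code.
#   $1 = number of checksum rows from the last day, $2 = mismatched rows (may be empty)
check_status() {
    if [ "$1" -eq 0 ]; then
        echo "CRITICAL: no checksums recorded in the last day"
        return 2
    fi
    if [ -n "$2" ]; then
        echo "WARNING: checksum mismatches found"
        echo "$2"
        return 1
    fi
    echo "OK: $1 recent checksum rows, none mismatched"
    return 0
}

# Intended use (commented out so the sketch has no side effects):
#   recent=$(mysql -N -B -e "SELECT COUNT(*) FROM mysql.checksum WHERE ts>=NOW()-INTERVAL 1 DAY;")
#   bad=$(mysql -N -B -e "$(mismatch_query)")
#   check_status "$recent" "$bad"; exit $?
```

Keeping the query building and the status mapping in separate functions means the ignore list can change without touching the alerting logic.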

Unfortunately the checksum runs caused slowness on the backup server. I am working on changing the memory parameters on tm-backup01 so that the checksum can run nightly without paging us. For now the nightly cron job has been disabled.
tm-backup01 is old and slow and hopefully will be replaced with something new very soon.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team