Closed Bug 712770 (opened 13 years ago, closed 12 years ago)

re-synchronize tm-b01-{master,slave}01

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: dustin, Assigned: scabral)

Details

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Description

13 years ago

The slave has been alerting occasionally over the last 48h, e.g.,

16:07 < nagios-sjc1> [34] tm-b01-slave01:MySQL Replication Lag is CRITICAL: Replication Stopped - Last error: Illegal error code: 1062

Sheeri's already working on it, so this is just a pointer for others to look at.

Sheeri Cabral [:sheeri]

Assignee

Comment 1

13 years ago

I synced the auth_user table, then checksummed the database it lives in:

./pt-table-checksum --chunk-size=1000 --modulo=70 --offset=1 h=localhost,u=root --databases developer_mozilla_org_django --ask-pass --chunk-size-limit=10 --replicate mysql.checksum --algorithm=BIT_XOR

and found the users_registrationprofile table out of sync, so I'm syncing that now.

Sheeri Cabral [:sheeri]

Assignee

Comment 2

13 years ago

Ran the checksum again and found the django_session table out of sync, so I synced that.

Now there are no tables out of sync in the developer_mozilla_org_django db.

Status: NEW → ASSIGNED

Sheeri Cabral [:sheeri]

Assignee

Comment 3

13 years ago

The command for the full developer_mozilla_org_django checksum (not just the --offset=1 --modulo=70 slice) was:

time ./pt-table-checksum --chunk-size=1000 h=localhost,u=root --databases developer_mozilla_org_django --ask-pass --chunk-size-limit=10 --replicate mysql.checksum --algorithm=BIT_XOR

Sheeri Cabral [:sheeri]

Assignee

Comment 4

12 years ago

Running this now:

./pt-table-checksum --chunk-size=1000 --modulo=70 --offset=1 h=localhost,u=root --databases developer_mozilla_org_django --ask-pass --chunk-size-limit=10 --replicate mysql.checksum --algorithm=BIT_XOR

I will work on making this a regular job. With --modulo=70, each run covers 1/70th of the data, so at 5 runs per day we'll go through the entire data set in 14 days (70 / 5), i.e. 2 weeks.
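
For anyone checking the schedule arithmetic, here is a minimal sketch. The constant names are made up for illustration; the values 70 and 5 come from the comment above.

```python
import math

MODULO = 70        # slices per full pass, from --modulo=70
RUNS_PER_DAY = 5   # planned checksum runs per day

# Each run checksums one slice, so a full pass needs MODULO runs.
days_for_full_pass = math.ceil(MODULO / RUNS_PER_DAY)
print(days_for_full_pass)  # 14
```

Changing either constant shows how the cycle length would shift if the run count or slice count were adjusted.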

Sheeri Cabral [:sheeri]

Assignee

Comment 5

12 years ago

I've set cron on tm-b01-master01 to run a checksum script for 1/70th of the data and, when it finishes, update the offset. Each run takes about 30 minutes (it tries to avoid slowing replication), so I've scheduled it at 23:00, 00:00, 01:00, 02:00, and 03:00, i.e. 5 times per day.
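
The actual cron script is not attached to this bug, so the following is only a rough sketch of the "run one slice, then advance the offset" idea. The function names, the state file, and the starting offset of 0 are assumptions. It also leaves out --ask-pass, on the assumption that a cron job gets its credentials some other way, which is not shown here.

```python
from pathlib import Path

MODULO = 70  # each run covers 1/70th of the chunks, as described above


def build_command(offset, modulo=MODULO):
    """Return the pt-table-checksum argument list for one slice."""
    if not 0 <= offset < modulo:
        raise ValueError(f"offset must be in [0, {modulo}), got {offset}")
    return [
        "./pt-table-checksum",
        "--chunk-size=1000",
        f"--modulo={modulo}",
        f"--offset={offset}",
        "h=localhost,u=root",
        "--databases", "developer_mozilla_org_django",
        "--chunk-size-limit=10",
        "--replicate", "mysql.checksum",
        "--algorithm=BIT_XOR",
    ]


def next_offset(offset, modulo=MODULO):
    """Advance to the next slice, wrapping around after the last one."""
    return (offset + 1) % modulo


def load_offset(state_file):
    """Read the saved offset; fall back to 0 (assumed) if missing or invalid."""
    try:
        return int(Path(state_file).read_text().strip()) % MODULO
    except (FileNotFoundError, ValueError):
        return 0


def save_offset(state_file, offset):
    """Persist the offset for the next cron run."""
    Path(state_file).write_text(f"{offset}\n")
```

A wrapper would call load_offset, run the list from build_command (for example with subprocess.run), and call save_offset(next_offset(...)) only after the checksum exits successfully, so a failed run is retried on the same slice.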

After a few days I'll review the results and put a nagios check in place to monitor the output.

The nagios check should first confirm that mysql.checksum has rows from the last day:

SELECT COUNT(*) FROM mysql.checksum WHERE ts>=NOW()-INTERVAL 1 DAY;

If not, raise an error. If so, look for mismatched rows:

SELECT db,tbl,boundaries FROM mysql.checksum WHERE this_crc!=master_crc AND db!='mysql' LIMIT 10;

If any rows come back, e-mail them to infra-dbnotices. The LIMIT 10 is there to keep the output short.

There should also be a hook to ignore arbitrary tables, so the query could be:

SELECT db,tbl,boundaries FROM mysql.checksum WHERE this_crc!=master_crc AND db!='mysql' AND tbl NOT IN ('foo','bar','baz') LIMIT 10;
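
The decision logic described above could be sketched roughly as follows. This is not the real check: the function names are hypothetical, reporting mismatches as WARNING is an assumption (the comment only says to e-mail the rows), and running the queries and sending mail are left out. The exit codes are the standard nagios plugin codes, and the %s placeholders assume a DB-API driver that uses the "format" parameter style.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard nagios plugin exit codes


def build_mismatch_query(ignored_tables=(), limit=10):
    """Return (sql, params) for the mismatch query, with an optional ignore list."""
    sql = (
        "SELECT db,tbl,boundaries FROM mysql.checksum "
        "WHERE this_crc!=master_crc AND db!='mysql'"
    )
    params = list(ignored_tables)
    if params:
        # One placeholder per ignored table, so the driver does the quoting.
        placeholders = ",".join(["%s"] * len(params))
        sql += f" AND tbl NOT IN ({placeholders})"
    sql += f" LIMIT {int(limit)}"
    return sql, params


def evaluate(recent_count, mismatched_rows):
    """Map the two query results to a nagios (exit_code, message) pair."""
    if recent_count == 0:
        return CRITICAL, "CRITICAL: no rows in mysql.checksum from the last day"
    if mismatched_rows:
        lines = [f"{db}.{tbl} {boundaries}" for db, tbl, boundaries in mismatched_rows]
        return WARNING, "WARNING: checksum mismatches:\n" + "\n".join(lines)
    return OK, f"OK: {recent_count} recent checksum rows, no mismatches"
```

Keeping the ignore list as query parameters rather than pasting names into the SQL string makes the "hook to ignore arbitrary tables" a matter of passing a different list.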

Sheeri Cabral [:sheeri]

Assignee

Comment 6

12 years ago

Unfortunately this slowed down the backup server. I am working on changing memory parameters on tm-backup01 so that the checksum can run nightly without paging us. For now the nightly cron job has been disabled.

Sheeri Cabral [:sheeri]

Assignee

Comment 7

12 years ago

tm-backup01 is old and slow and hopefully will be replaced with something new very soon.

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

10 years ago

Product: mozilla.org → Data & BI Services Team

Categories

(Data & BI Services Team :: DB: MySQL, task)
