Closed Bug 921395 Opened 11 years ago Closed 11 years ago

Persona master DB crashed. Had to fix replication

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gene, Assigned: scabral)

References

Details

Tonight our global master mysql database restarted (the hardware rebooted). This resulted in the two slaves in AWS reporting "Client requested master to start replication from impossible position" and replication being broken.

I went to the slaves and determined that "Relay_Master_Log_File: mysql-bin.000253".


I went to the master and ran "mysqlbinlog mysql-bin.000253 | tail -n 100" and found that the last log position was 4866089.

On the slaves the last position was 4868197

The slaves showed:

*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 10.18.20.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 15
              Master_Log_File: mysql-bin.000253
          Read_Master_Log_Pos: 4868197
               Relay_Log_File: mysqld-relay-bin.000249
                Relay_Log_Pos: 4868342
        Relay_Master_Log_File: mysql-bin.000253
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: browserid
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4868197
              Relay_Log_Space: 4868541
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position'
               Last_SQL_Errno: 0
               Last_SQL_Error: 

I then went to both slaves and did

stop slave;
change master to master_log_pos=4866089;
start slave;

This re-established replication.
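
A quick sanity check after restarting the slaves (a sketch, using the same SHOW SLAVE STATUS output captured above) is to confirm both replication threads are running again:

-- On each slave, after START SLAVE:
--   Slave_IO_Running and Slave_SQL_Running should both say "Yes",
--   Last_IO_Errno / Last_SQL_Errno should be 0, and
--   Seconds_Behind_Master should be a number rather than NULL.
SHOW SLAVE STATUS\G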

Does this mean that there were 4868197 - 4866089 = 2108 transactions which were lost? Does anything else need to be done regarding DB consistency?
For the record, replication was broken from 1:28:57 to 1:56:00.
Gene - have you looked into why the master crashed?

First things first: from the mysql error log:
120711 22:17:37 [ERROR] /usr/sbin/mysqld: Table './browserid/heartbeat' is marked as crashed and should be repaired
130927  1:32:26 [ERROR] /usr/sbin/mysqld: Table './mysql/user' is marked as crashed and should be repaired
130927  1:32:26 [ERROR] /usr/sbin/mysqld: Table './mysql/db' is marked as crashed and should be repaired
130927  1:32:54 [ERROR] /usr/sbin/mysqld: Table './browserid/heartbeat' is marked as crashed and should be repaired


mysql> repair table mysql.user;
+------------+--------+----------+----------+
| Table      | Op     | Msg_type | Msg_text |
+------------+--------+----------+----------+
| mysql.user | repair | status   | OK       |
+------------+--------+----------+----------+
1 row in set (0.00 sec)

mysql> repair table mysql.db;
+----------+--------+----------+----------+
| Table    | Op     | Msg_type | Msg_text |
+----------+--------+----------+----------+
| mysql.db | repair | status   | OK       |
+----------+--------+----------+----------+
1 row in set (0.01 sec)

mysql> repair table browserid.heartbeat;
+---------------------+--------+----------+----------+
| Table               | Op     | Msg_type | Msg_text |
+---------------------+--------+----------+----------+
| browserid.heartbeat | repair | status   | OK       |
+---------------------+--------+----------+----------+
1 row in set (0.01 sec)
Summary: Persona master DB restarted. Had to fix replication → Persona master DB crashed. Had to fix replication
The best way to check for data loss is to run checksums. We run those twice daily on our systems automatically and have Nagios checks to alert us if checksums either aren't running or if they come out different.
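
As a rough illustration (a sketch only - the twice-daily runs mentioned above are automated, but a manual spot-check can be done from the mysql client), you compare per-table checksums taken on the master and on each slave:

-- Run the same statement on the master and on each slave, then compare
-- the Checksum column; matching values mean the table contents agree.
-- Only meaningful once the slave has caught up and writes are quiet.
-- (browserid.heartbeat is just the example table from this bug; in
--  practice every replicated table would be checksummed.)
CHECKSUM TABLE browserid.heartbeat;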

I tried to run checksums, but it looks like replication is still broken???

Not sure why you left this in a state where replication is still broken... this is on db2:
              Master_Log_File: mysql-bin.000253
......
          Exec_Master_Log_Pos: 4868197
              Relay_Log_Space: 4868541
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position'

So, one thing to note here is that when MySQL restarts, it creates a new binary log file. The "position" numbers are byte offsets within that file.
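
You can see both facts from inside MySQL as well (a sketch; run these on the master):

-- Current binlog file and write position (the position is a byte offset).
SHOW MASTER STATUS;

-- Every binlog file with its size in bytes; a slave can never validly
-- point past the end of the corresponding file.
SHOW BINARY LOGS;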

[root@db1.iddb.phx1.svc mysql]# ls -l mysql-bin.000253 
-rw-rw---- 1 mysql mysql 4866089 Sep 27 01:28 mysql-bin.000253


The binlog file only goes up to 4866089, and the db2 slave is at 4868197 - so you haven't lost any information. In fact, you're trying to set the db2 slave to a position that *does not exist* - it's higher than the number of bytes in the file.

What you want to do is change the position to be byte 4 in the next log file, #254, then replication will start again properly.
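
Concretely, on each slave that would look something like this (a sketch; it assumes the master's next binlog is mysql-bin.000254, as the numbering above implies):

STOP SLAVE;

-- Position 4 skips the binlog file header and is the first valid
-- event position in a fresh binlog file.
CHANGE MASTER TO
    MASTER_LOG_FILE='mysql-bin.000254',
    MASTER_LOG_POS=4;

START SLAVE;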

Since replication has been broken all night, I see no harm in leaving it until you can confirm this is what you want to do.

What you want to do is start replication from the new file, not from a position inside the old one.
This problem has occurred again. The master DB rebooted and slave replication was broken. This time the two log positions were 99867239 and 99868062, a difference of 823 bytes. I used the same method to resolve the issue. The problem started at 23:56 and was resolved at 00:11 PDT.
And in response to Sheeri's comment above: she was looking at deprecated slaves in PHX1 that are not in use and which I thought had been recycled. The actual slaves are in AWS.
Depends on: 923388
This problem has recurred. The master was at master_log_pos=20650193 and the slaves were at 20651071.

http://personastatus.org/#1380940860
Master: 20650193
Slaves: 20651071

Note that the slaves' positions are *past* the master's position. That's how you know nothing in the master's binlog failed to reach the slaves - no transactions were lost.
Hi Gene, 

After we talked during the summit, you mentioned you'd found a hardware issue with the master DB. I'm going to set this to resolved, but feel free to reopen if there's something outstanding we need to do.
Assignee: server-ops-database → scabral
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Yup, we replaced a bad memory DIMM and the machine appears stable now. Thanks!
Product: mozilla.org → Data & BI Services Team