Closed
Bug 921395
Opened 11 years ago
Closed 11 years ago
Persona master DB crashed. Had to fix replication
Categories
(Data & BI Services Team :: DB: MySQL, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gene, Assigned: scabral)
References
Details
Tonight our global master mysql database restarted (the hardware rebooted). This resulted in the two slaves in AWS reporting "Client requested master to start replication from impossible position" and replication being broken.

I went to the slaves and determined that "Relay_Master_Log_File: mysql-bin.000253". I went to the master and did "mysqlbinlog mysql-bin.000253 | tail -n 100" and found the last log position was 4866089. On the slaves the last position was 4868197.

The slaves showed:

*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 10.18.20.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 15
              Master_Log_File: mysql-bin.000253
          Read_Master_Log_Pos: 4868197
               Relay_Log_File: mysqld-relay-bin.000249
                Relay_Log_Pos: 4868342
        Relay_Master_Log_File: mysql-bin.000253
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: browserid
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4868197
              Relay_Log_Space: 4868541
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position'
               Last_SQL_Errno: 0
               Last_SQL_Error:

I then went to both slaves and did:

stop slave;
change master to master_log_pos=4866089;
start slave;

This re-established replication.

Does this mean that there were 4868197 - 4866089 = 2108 transactions which were lost? Does anything else need to be done regarding DB consistency?
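The positions reported by SHOW SLAVE STATUS are byte offsets into the master's binlog file, so the arithmetic in the question can be checked mechanically. A minimal sketch, assuming we parse the plain-text `SHOW SLAVE STATUS\G` output shown above (`parse_slave_status` is a hypothetical helper, not a MySQL tool):

```python
# Hypothetical helper: pull fields out of the text that
# "SHOW SLAVE STATUS\G" prints, as captured in the report above.
def parse_slave_status(text):
    status = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            status[key.strip()] = value.strip()
    return status

# Abbreviated excerpt of the output quoted in this report.
status_text = """\
              Master_Log_File: mysql-bin.000253
          Read_Master_Log_Pos: 4868197
          Exec_Master_Log_Pos: 4868197
                Last_IO_Errno: 1236
"""

status = parse_slave_status(status_text)
read_pos = int(status["Read_Master_Log_Pos"])

# Gap between what the slave read and the master's last binlog position.
# Note: positions are byte offsets into the binlog, not transaction counts.
gap = read_pos - 4866089
print(gap)  # 2108
```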
Reporter
Comment 1•11 years ago
For the record, replication was broken from 1:28:57 to 1:56:00.
Assignee
Comment 2•11 years ago
Gene - have you looked into why the master crashed?

First things first: from the mysql error log:

120711 22:17:37 [ERROR] /usr/sbin/mysqld: Table './browserid/heartbeat' is marked as crashed and should be repaired
130927  1:32:26 [ERROR] /usr/sbin/mysqld: Table './mysql/user' is marked as crashed and should be repaired
130927  1:32:26 [ERROR] /usr/sbin/mysqld: Table './mysql/db' is marked as crashed and should be repaired
130927  1:32:54 [ERROR] /usr/sbin/mysqld: Table './browserid/heartbeat' is marked as crashed and should be repaired

mysql> repair table mysql.user;
+------------+--------+----------+----------+
| Table      | Op     | Msg_type | Msg_text |
+------------+--------+----------+----------+
| mysql.user | repair | status   | OK       |
+------------+--------+----------+----------+
1 row in set (0.00 sec)

mysql> repair table mysql.db;
+----------+--------+----------+----------+
| Table    | Op     | Msg_type | Msg_text |
+----------+--------+----------+----------+
| mysql.db | repair | status   | OK       |
+----------+--------+----------+----------+
1 row in set (0.01 sec)

mysql> repair table browserid.heartbeat;
+---------------------+--------+----------+----------+
| Table               | Op     | Msg_type | Msg_text |
+---------------------+--------+----------+----------+
| browserid.heartbeat | repair | status   | OK       |
+---------------------+--------+----------+----------+
1 row in set (0.01 sec)
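The repair statements above follow mechanically from the "marked as crashed" lines in the error log. A minimal sketch of that derivation, assuming the error-log format quoted above (this is an illustration, not a tool from this incident):

```python
import re

# Match "Table './<db>/<table>' is marked as crashed" error-log lines.
CRASHED = re.compile(r"Table '\./([^/]+)/([^']+)' is marked as crashed")

log = """\
130927  1:32:26 [ERROR] /usr/sbin/mysqld: Table './mysql/user' is marked as crashed and should be repaired
130927  1:32:26 [ERROR] /usr/sbin/mysqld: Table './mysql/db' is marked as crashed and should be repaired
130927  1:32:54 [ERROR] /usr/sbin/mysqld: Table './browserid/heartbeat' is marked as crashed and should be repaired
"""

# Deduplicate (a table can be flagged more than once) and emit the
# REPAIR TABLE statements to run against the server.
stmts = sorted({"REPAIR TABLE %s.%s;" % m.groups()
                for m in CRASHED.finditer(log)})
for s in stmts:
    print(s)
```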
Summary: Persona master DB restarted. Had to fix replication → Persona master DB crashed. Had to fix replication
Assignee
Comment 3•11 years ago
The best way to check for data loss is to run checksums. We run those twice daily on our systems automatically and have Nagios checks to alert us if checksums either aren't running or if they come out different.

I tried to run checksums, but it looks like replication is still broken??? Not sure why you left this in a state where replication is still broken....this is on db2:

              Master_Log_File: mysql-bin.000253
......
          Exec_Master_Log_Pos: 4868197
              Relay_Log_Space: 4868541
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position'

So, one thing to note here is that when MySQL restarts, it creates a new binary log file, and the "position" number is a number of bytes:

[root@db1.iddb.phx1.svc mysql]# ls -l mysql-bin.000253
-rw-rw---- 1 mysql mysql 4866089 Sep 27 01:28 mysql-bin.000253

The binlog file only goes up to 4866089, and the db2 slave is at 4868197 - so you haven't lost any information. In fact, you're trying to set the db2 slave to a position that *does not exist* - it's higher than the number of bytes in the file.

What you want to do is start replication from the new file: change the position to byte 4 in the next log file, #254, and replication will start again properly. Since replication has been broken all night, I see no harm in leaving it until you can confirm this is what you want to do.
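The recovery rule described above can be sketched as a small helper: if the slave's executed position is beyond the end of the master's surviving binlog file, resume from byte 4 of the next file (binlog events begin at offset 4, after the file's magic header). The function names are hypothetical; the numbers are the ones from this incident:

```python
def next_binlog(name):
    """Given 'mysql-bin.000253', return 'mysql-bin.000254'."""
    base, _, num = name.rpartition(".")
    return "%s.%06d" % (base, int(num) + 1)

def restart_point(master_log_file, exec_master_log_pos, binlog_size):
    # If the slave's position is beyond the end of the master's binlog
    # (the file ends early because the server crashed), resume from
    # byte 4 of the next file; otherwise the current position is fine.
    if exec_master_log_pos > binlog_size:
        return next_binlog(master_log_file), 4
    return master_log_file, exec_master_log_pos

# db2 in this bug: slave at 4868197, but the file is only 4866089 bytes.
file_, pos = restart_point("mysql-bin.000253", 4868197, 4866089)
print(file_, pos)  # mysql-bin.000254 4
```

The resulting pair maps directly onto `CHANGE MASTER TO MASTER_LOG_FILE=..., MASTER_LOG_POS=...` on the slave.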
Reporter
Comment 4•11 years ago
This problem has occurred again. The master DB rebooted and slave replication was broken. This time the two log positions were 99867239 and 99868062, or 823 records. I used the same method to resolve the issue. The problem started at 23:56 and was resolved at 00:11 PDT.
Reporter
Comment 5•11 years ago
And in response to Sheeri's comment above: she was looking at deprecated slaves in PHX1 that are not in use and which I thought had been recycled. The actual slaves are in AWS.
Reporter
Comment 6•11 years ago
This problem has recurred. The master was at master_log_pos=20650193 and the slaves at 20651071. http://personastatus.org/#1380940860
Assignee
Comment 7•11 years ago
Master: 20650193, slaves: 20651071. Note that the slaves' positions are *past* the master's position. That's how you know the slaves don't have any transactions that aren't also on the master.
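The check described above reduces to a single comparison. A minimal sketch (hypothetical function name, numbers taken from this recurrence):

```python
def slave_missing_events(slave_exec_pos, master_binlog_size):
    # A slave can only be missing events if it stopped *before* the end
    # of the master's surviving binlog. If its position is at or past
    # the end of the file, it had already received everything the
    # master logged before the crash.
    return slave_exec_pos < master_binlog_size

# Slaves at 20651071, master's binlog ends at 20650193.
print(slave_missing_events(20651071, 20650193))  # False
```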
Assignee
Comment 8•11 years ago
Hi Gene,

After we talked during the summit, you mentioned you'd found a hardware issue with the master DB. I'm going to set this to resolved, but feel free to reopen if there's something outstanding we need to do.
Assignee: server-ops-database → scabral
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter
Comment 9•11 years ago
Yup, we replaced a bad memory DIMM and the machine appears stable now. Thanks!
Updated•10 years ago
Product: mozilla.org → Data & BI Services Team