Closed Bug 785930 Opened 12 years ago Closed 12 years ago

node37.seamicro.phx1.mozilla.com MySQL crashing

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86_64
FreeBSD
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: afernandez, Assigned: scabral)

Details

Creating bug as per sheeri.

MySQL keeps crashing;
....
InnoDB: End of page dump
120827 10:06:24  InnoDB: Page checksum 1351490099 (32bit_calc: 35087108), prior-to-4.0.14-form checksum 4233694457
InnoDB: stored checksum 587866912, prior-to-4.0.14-form stored checksum 3396583943
InnoDB: Page lsn 1919027747 168455535, low 4 bytes of lsn at page end 4000231286
InnoDB: Page number (if stored to page already) 1718707310,
InnoDB: space id (if created with >= MySQL-4.1.1 and stored already) 1949973072
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 7.
....
yes, the important part there is:

120827 10:06:03  InnoDB: Error: space id and page n:o stored in the page
InnoDB: read in are 1949973072:1718707310, should be 1392:7!
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 7.

looks like corruption. This is a slave, and there are 2 other slaves that are OK, so it's likely a hardware thing. I'll see if I can recover.
Started with innodb_force_recovery=1, but it still crashed, worked my way up to where it didn't, which was innodb_force_recovery=4....

120827 10:13:20 Percona XtraDB (http://www.percona.com) 1.0.17-13.4 started; log sequence number 14573800889
InnoDB: !!! innodb_force_recovery is set to 4 !!!
120827 10:13:20 [Note] Recovering after a crash using /var/lib/mysql/node37-bin
120827 10:13:20 [Note] Starting crash recovery...
120827 10:13:20 [Note] Crash recovery finished.
120827 10:13:21 [Note] Event Scheduler: Loaded 0 events
120827 10:13:21 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.1.63-rel13.4-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Percona Server (GPL), 13.4, Revision 443
120827 10:13:21 [Note] Slave SQL thread initialized, starting replication in log 'node36-bin.000001' at position 258927923, relay log '/var/lib/mysql/node37-relay-bin.000002' position: 258928069
120827 10:13:21 [Note] Slave I/O thread: connected to master 'repl@node36.seamicro.phx1.mozilla.com:3306',replication started in log 'node36-bin.000001' at position 260995688
InnoDB: A new raw disk partition was initialized or
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that newraw is replaced
InnoDB: with raw, and innodb_force_... is removed.
120827 10:13:21 [ERROR] Slave SQL: Error 'Got error -1 from storage engine' on query. Default database: 'reps_dev_allizom_org'. Query: 'INSERT INTO `events_event` (`name`, `slug`, `start`, `end`, `timezone`, `venue`, `city`, `region`, `country`, `lat`, `lon`, `external_link`, `owner_id`, `planning_pad_url`, `estimated_attendance`, `description`, `extra_content`, `mozilla_event`, `hashtag`, `converted_visitors`, `swag_bug_id`, `budget_bug_id`) VALUES ('test66', 'test66', '2012-02-01 12:00:00', '2012-02-01 13:00:00', 'Africa/Abidjan', 'wdfw', 'rewe', 'werwer', 'Albania', 54.9776136706963, -29.53125, '', 9, 'http://etherpad.mozilla.org/remo-test66', 10, 'osdfnhosid hfoisd hfosdihf osdfih sdofhsd f', '', 0, '', 0, NULL, NULL)', Error_code: 1030
120827 10:13:21 [Warning] Slave: Got error -1 from storage engine Error_code: 1030
120827 10:13:21 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'node36-bin.000001' position 258927923
Most importantly, the crash recovery is OK, but it looks like the first SQL statement in the slave is causing the problems:

120827 10:13:16 [Note] Slave SQL thread initialized, starting replication in log 'node36-bin.000001' at position 258927923, relay log '/var/lib/mysql/node37-relay-bin.000002' position: 258928069
120827 10:13:16 [Note] Slave I/O thread: connected to master 'repl@node36.seamicro.phx1.mozilla.com:3306',replication started in log 'node36-bin.000001' at position 260995688
120827 10:13:16  InnoDB: Error: space id and page n:o stored in the page
InnoDB: read in are 1949973072:1718707310, should be 1392:7!
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 7.

So we can export that table, and hopefully export/reimport of that table will work.
[root@node37.seamicro.phx1 mysql]# du -sh reps_dev_allizom_org/
27M     reps_dev_allizom_org/
[root@node37.seamicro.phx1 mysql]# mysqldump reps_dev_allizom_org > reps_dev_allizom_org.sql

Then I restarted with innodb_force_recovery commented out, and skip_slave_start enabled. That started OK (yay!)

But trying to drop the reps_dev_allizom_org database or using that database caused issues. seeing if I can do an individual table drop of events_event.
Dropping the table is also causing crashing. :( As is truncating, dropping the database, etc.

Going to re-sync from the master. 

[13:25:52] <Aj> sheeri, yea seeing some errors; EXT4-fs error
[13:26:04] <sheeri> *nod*
[13:26:12] <sheeri> FWIW Linux node37.seamicro.phx1.mozilla.com 2.6.32-71.el6.i686 #1 SMP Wed Sep 1 01:26:34 EDT 2010 i686 i686 i386 GNU/Linux
[13:26:21] <Aj> want me to reboot and run a fsck ?
[13:26:52] <sheeri> sure, I have deleted the db files, I'll restore from backup
[13:26:59] <sheeri> but I'm logged out now.
[13:27:18] <Aj> ok, want me to update the kernel while at it ?
[13:27:34] <sheeri> yes please :D
[13:27:39] <sheeri> and why not do a yum update :D
As per sheeri, ran;
yum -y update

kernel: 2.6.32-71.el6.i686 -> 2.6.32-279.2.1.el6.i686

Then performed a fsck -y which fixed lots of file system errors.

Host is now back online.
resynced from node36, all is happy now.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.