Per justdave, we're having some database issues that may be caused by the MDC databases needing a resync after the last patch we applied to fix database problems. Let's do that resync and see if it resolves the issues.
This is to resolve problems created by a recently patched bug in Dekiwiki that caused frequent replication failures and, as a result, data integrity issues on the slaves. The only databases whose integrity is suspect are the MDC and Library databases. Rather than re-syncing the entire cluster (which would mean putting 30 or 40 production applications in read-only mode for about 2 hours), my plan is to take MDC and Library completely offline, dump those two databases, then drop them and re-import from the dump. Because the drop and the re-import both replicate, they should clean up the state of those two databases on the slaves as well.
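For illustration only, here is a minimal sketch of the dump, drop, and re-import sequence as shell commands built in Python. The database names `mdc` and `library` and the dump path are hypothetical (the thread only calls them "MDC" and "Library"), and connection options (host, user, password) are left out. It assumes all three commands are run against the master so the replicated statements rebuild the slaves.

```python
import shlex

# Hypothetical database names and dump path; the real names are not given here.
DATABASES = ["mdc", "library"]
DUMP_FILE = "mdc_library.sql"


def build_resync_plan(databases, dump_file):
    """Return the shell commands for dump -> drop -> re-import, in order.

    Intended to run against the master only: the DROP and the re-import
    replicate to the slaves and replace whatever inconsistent state they hold.
    """
    # --databases makes the dump include CREATE DATABASE and USE statements,
    # so the re-import recreates the databases that the DROP removes.
    # --routines is needed because mysqldump omits stored procedures by default.
    dump = shlex.join(
        ["mysqldump", "--routines", "--triggers", "--databases", *databases]
    ) + " > " + shlex.quote(dump_file)
    drop_sql = " ".join(f"DROP DATABASE `{db}`;" for db in databases)
    drop = shlex.join(["mysql", "-e", drop_sql])
    restore = "mysql < " + shlex.quote(dump_file)
    return [dump, drop, restore]


for step in build_resync_plan(DATABASES, DUMP_FILE):
    print(step)
```

Printing the commands rather than executing them keeps the sketch safe to run; in practice each step would be run by hand during the outage window.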
Can this be squeezed into Tuesday's window?
I did a trial run of this on the backup server, and dumping both databases combined takes under 5 minutes, so this should go quickly. I'd allow 30 minutes for the outage window, just to be safe.
This is now complete in production. It'd be nice if someone could sanity-check the site and make sure everything is working properly.
The mysqldump command we use for backups was missing the flag to dump stored procedures, which Deki uses heavily. MDC is down until we find a way to recover them. The quickest fix would be to reinstall them from the install scripts, but those contain variables and look like they need to be parsed before being sent to mysql. Until I track down someone who can tell me how to do that, the backup plan is to restore last night's binary backup of the c01 cluster, dump only the stored procedures from it, and apply them to production. Unfortunately, the c01 cluster is huge, so the restore will probably take a while.
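The flag in question is mysqldump's `--routines` (short form `-R`), which is off by default, so a plain dump silently leaves stored procedures and functions out. As a sketch only, a backup script could guard against this with a check like the one below; the function name and the example argument lists are hypothetical, and it only inspects exact flag tokens rather than parsing every mysqldump option form.

```python
def dumps_routines(argv):
    """Return True if a mysqldump argument list will include stored routines.

    mysqldump omits stored procedures and functions unless --routines (-R)
    is given; a later --skip-routines turns it back off, so the last one wins.
    """
    enabled = False
    for arg in argv:
        if arg in ("--routines", "-R"):
            enabled = True
        elif arg == "--skip-routines":
            enabled = False
    return enabled


# Hypothetical backup command lacking the flag.
backup_cmd = ["mysqldump", "--single-transaction", "--all-databases"]
if not dumps_routines(backup_cmd):
    print("warning: backup will not include stored procedures")
```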
OK, we're back up and running. We got the c01 backup restored, dumped the procedures out of it, and restored them to production before we found anyone who could tell us how to run the install scripts.
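A routines-only dump like the one described here can be produced with a standard combination of mysqldump flags that suppress table definitions, row data, and triggers while keeping stored procedures and functions. The sketch below builds the two commands; the host names, database name, and file name are hypothetical, and the actual commands used in this incident aren't recorded in the thread.

```python
import shlex


def routines_only_dump(host, database, out_file):
    """Build a mysqldump command that emits only stored procedures/functions."""
    argv = [
        "mysqldump",
        "--host", host,
        "--routines",        # include stored procedures and functions
        "--no-create-info",  # skip CREATE TABLE statements
        "--no-data",         # skip row data
        "--no-create-db",    # skip CREATE DATABASE statements
        "--skip-triggers",   # leave triggers out; only routines are wanted
        database,
    ]
    return shlex.join(argv) + " > " + shlex.quote(out_file)


def apply_to_production(host, database, in_file):
    """Build the mysql command that loads the routines into the target database."""
    return shlex.join(["mysql", "--host", host, database]) + " < " + shlex.quote(in_file)


# Hypothetical hosts: the restored backup server, then the production master.
print(routines_only_dump("backup-restore.example", "mdc", "routines.sql"))
print(apply_to_production("c01-master.example", "mdc", "routines.sql"))
```

Because the dump is taken without `--databases`, it carries no `USE` statement, which is why the target database is named explicitly on the `mysql` command line.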