Closed
Bug 774162
Opened 12 years ago
Closed 12 years ago
db consistency checks for bugzilla
Categories
(Data & BI Services Team :: DB: MySQL, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cshields, Assigned: scabral)
References
Details
(Whiteboard: [2012q3])
- start doing regular consistency checks of the databases - just the checksum. Actually monitoring those checksums and fixing them probably can't all be done in Q3. - I recommend starting with bugzilla, since I know that checksums don't cause slave lag on these, and have a checksum user for this - after bugzilla, addons, then sumo - document this procedure
Comment 1•12 years ago
|
||
How do we measure success and/or progress on this task?
Assignee | ||
Comment 2•12 years ago
|
||
Well, there are 2 parts - doing the check itself, and checking the results. We would use an automated script to do the checking, and put a Nagios monitor in place to check the results. So for each cluster (bugzilla, then addons, then sumo, then others) there are 2 parts....if the official goal is "get this done on all 3" then there are 6 "milestones", and progress can be measured that way.
Comment 3•12 years ago
|
||
Corey, is that the goal here all 3 this quarter?
Assignee | ||
Comment 4•12 years ago
|
||
Let's make the goal easy to attain - just bugzilla for this quarter. We may do more than that....
Reporter | ||
Updated•12 years ago
|
Summary: db consistency checks → db consistency checks for bugzilla
Assignee | ||
Comment 5•12 years ago
|
||
I am running the consistency checks on bugzilla and have made a /usr/local/bin/checksum.sh with all the info we need. If this is successful, I'll work on a cron entry and getting a once-a-day nagios check for each slave to check the output.
Assignee | ||
Comment 6•12 years ago
|
||
The good news is that the slaves have all matching data from the one-time run. The bad news is that there are some issues: [root@tp-bugs01-master01 bin]# ./checksum.sh Cannot connect to h=10.8.70.152,p=...,u=checksum 08-10T08:52:52 Error checksumming table bugs.bugs: Error executing checksum query: DBD::mysql::st execute failed: Deadlock found when trying to get lock; try restarting transaction [for Statement "REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `bug_id`, `assigned_to`, `bug_file_loc`, `bug_severity`, `bug_status`, `creation_ts`, `delta_ts`, `short_desc`, `op_sys`, `priority`, `rep_platform`, `reporter`, `version`, `resolution`, `target_milestone`, `qa_contact`, `status_whiteboard`, `votes`, `lastdiffed`, `everconfirmed`, `reporter_accessible`, `cclist_accessible`, `alias`, `estimated_time`, `remaining_time`, `product_id`, `component_id`, `deadline`, `cf_blocking_fennec`, `cf_blocking_191`, `cf_status_191`, `cf_status_192`, `cf_blocking_20`, `cf_blocking_thunderbird30`, `cf_status_thunderbird30`, `cf_blocking_192`, `cf_blocking_thunderbird31`, `cf_status_thunderbird31`, `cf_status_20`, `cf_status_seamonkey21`, `cf_blocking_seamonkey21`, `cf_blocking_thunderbird32`, `cf_status_thunderbird32`, `cf_blocking_thunderbird33`, `cf_status_thunderbird33`, `cf_blocking_fx`, `cf_tracking_firefox5`, `cf_status_firefox5`, `cf_tracking_firefox6`, `cf_colo_site`, `cf_tracking_firefox7`, `cf_status_firefox6`, `cf_status_firefox7`, `cf_crash_signature`, `cf_tracking_thunderbird6`, `cf_tracking_thunderbird7`, `cf_status_thunderbird6`, `cf_status_thunderbird7`, `cf_tracking_seamonkey22`, `cf_tracking_seamonkey23`, `cf_tracking_seamonkey24`, `cf_tracking_firefox8`, `cf_status_firefox8`, `cf_tracking_seamonkey25`, `cf_status_seamonkey22`, `cf_status_seamonkey23`, `cf_status_seamonkey24`, `cf_status_seamonkey25`, `cf_tracking_thunderbird8`, `cf_status_thunderbird8`, `cf_tracking_firefox9`, `cf_status_firefox9`, `cf_tracking_seamonkey26`, `cf_status_seamonkey26`, `cf_tracking_thunderbird9`, `cf_status_thunderbird9`, `cf_office`, `cf_due_date`, `cf_tracking_firefox10`, `cf_status_firefox10`, `cf_tracking_thunderbird10`, `cf_status_thunderbird10`, `cf_tracking_seamonkey27`, `cf_status_seamonkey27`, `cf_tracking_firefox11`, `cf_status_firefox11`, `cf_tracking_thunderbird11`, `cf_status_thunderbird11`, `cf_tracking_seamonkey28`, `cf_status_seamonkey28`, `cf_tracking_firefox12`, `cf_status_firefox12`, `cf_tracking_thunderbird12`, `cf_status_thunderbird12`, `cf_tracking_seamonkey29`, `cf_status_seamonkey29`, `cf_tracking_esr10`, `cf_status_esr10`, `cf_tracking_firefox13`, `cf_status_firefox13`, `cf_tracking_thunderbird13`, `cf_status_thunderbird13`, `cf_tracking_seamonkey210`, `cf_status_seamonkey210`, `cf_tracking_thunderbird_esr10`, `cf_status_thunderbird_esr10`, `cf_blocking_fennec10`, `cf_tracking_firefox14`, `cf_status_firefox14`, `cf_tracking_thunderbird14`, `cf_status_thunderbird14`, `cf_tracking_seamonkey211`, `cf_status_seamonkey211`, `cf_last_resolved`, `cf_blocking_kilimanjaro`, `cf_tracking_firefox15`, `cf_status_firefox15`, `cf_tracking_thunderbird15`, `cf_status_thunderbird15`, `cf_tracking_seamonkey212`, `cf_status_seamonkey212`, `cf_blocking_basecamp`, `cf_tracking_firefox16`, `cf_status_firefox16`, `cf_tracking_thunderbird16`, `cf_status_thunderbird16`, `cf_tracking_seamonkey213`, `cf_status_seamonkey213`, `cf_tracking_firefox17`, `cf_status_firefox17`, `cf_tracking_thunderbird17`, `cf_status_thunderbird17`, `cf_tracking_seamonkey214`, `cf_status_seamonkey214`, CONCAT(ISNULL(`bug_file_loc`), ISNULL(`creation_ts`), ISNULL(`qa_contact`), ISNULL(`lastdiffed`), ISNULL(`alias`), ISNULL(`deadline`), ISNULL(`cf_crash_signature`), ISNULL(`cf_due_date`), ISNULL(`cf_last_resolved`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `bugs`.`bugs` FORCE INDEX(`PRIMARY`) WHERE ((`bug_id` >= ?)) AND ((`bug_id` <= ?)) /*checksum chunk*/" with ParamValues: 0='bugs', 1='bugs', 2=25, 3='PRIMARY', 4='514824', 5='538635', 6='514824', 7='538635'] at /usr/local/percona/pt-table-checksum line 6932. TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 08-10T08:52:52 1 0 514653 25 0 12.591 bugs.bugs 08-10T08:58:34 Error checksumming table bugs.user_group_map: Possible infinite loop detected! The lower boundary for chunk 2 is <69> and the lower boundary for chunk 3 is also <69>. This usually happens when using a non-unique single column index. The current chunk index for table bugs.user_group_map is fk_user_group_map_group_id_groups_id which is not unique and covers 1 column. 08-10T08:58:34 1 0 0 2 2 0.091 bugs.user_group_map 08-10T08:58:35 Error checksumming table bugs.votes: Error executing checksum query: DBD::mysql::st execute failed: Deadlock found when trying to get lock; try restarting transaction [for Statement "REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `who`, `bug_id`, `vote_count`)) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `bugs`.`votes` FORCE INDEX(`votes_bug_id_idx`) WHERE ((`bug_id` >= ?)) AND ((`bug_id` <= ?)) /*checksum chunk*/" with ParamValues: 0='bugs', 1='votes', 2=2, 3='votes_bug_id_idx', 4='142898', 5='719764', 6='142898', 7='719764'] at /usr/local/percona/pt-table-checksum line 6932. 08-10T08:58:35 1 0 64626 2 0 0.421 bugs.votes 08-10T08:58:35 0 1 0 1 0 0.011 mysql.proc 08-10T08:58:35 0 1 300 1 0 0.012 mysql.tables_priv 08-10T08:58:35 0 1 30 1 0 0.010 mysql.user real 16m19.751s user 0m3.520s sys 0m0.506s Tried a 2nd run, to see if there were the same issues, and for the most part, yeah. [root@tp-bugs01-master01 bin]# ./checksum.sh Cannot connect to h=10.8.70.152,p=...,u=checksum 08-10T10:44:12 Error checksumming table bugs.bug_group_map: Possible infinite loop detected! The lower boundary for chunk 4 is <26> and the lower boundary for chunk 5 is also <26>. This usually happens when using a non-unique single column index. The current chunk index for table bugs.bug_group_map is bug_group_map_group_id_idx which is not unique and covers 1 column. TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 08-10T10:44:12 1 0 0 4 4 0.015 bugs.bug_group_map 08-10T10:50:11 Error checksumming table bugs.user_group_map: Possible infinite loop detected! The lower boundary for chunk 2 is <69> and the lower boundary for chunk 3 is also <69>. This usually happens when using a non-unique single column index. The current chunk index for table bugs.user_group_map is fk_user_group_map_group_id_groups_id which is not unique and covers 1 column. 08-10T10:50:11 1 0 0 2 2 0.092 bugs.user_group_map 08-10T10:50:12 0 1 0 1 0 0.010 mysql.proc 08-10T10:50:13 0 1 300 1 0 0.012 mysql.tables_priv 08-10T10:50:13 0 1 30 1 0 0.010 mysql.user real 18m47.019s user 0m3.645s sys 0m0.467s
Assignee | ||
Comment 7•12 years ago
|
||
Here's the bugs.bug_group_map table: Table: bug_group_map Create Table: CREATE TABLE `bug_group_map` ( `bug_id` mediumint(9) NOT NULL, `group_id` mediumint(9) NOT NULL, UNIQUE KEY `bug_group_map_bug_id_idx` (`bug_id`,`group_id`), KEY `bug_group_map_group_id_idx` (`group_id`), CONSTRAINT `fk_bug_group_map_bug_id_bugs_bug_id` FOREIGN KEY (`bug_id`) REFERENCES `bugs` (`bug_id`) ON DELETE CASCADE ON UPDATE CASCADE, CONSTRAINT `fk_bug_group_map_group_id_groups_id` FOREIGN KEY (`group_id`) REFERENCES `groups` (`id`) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB DEFAULT CHARSET=utf8 1 row in set (0.03 sec) and the bugs.user_group_map table: Table: user_group_map Create Table: CREATE TABLE `user_group_map` ( `user_id` mediumint(9) NOT NULL, `group_id` mediumint(9) NOT NULL, `isbless` tinyint(4) NOT NULL DEFAULT '0', `grant_type` tinyint(4) NOT NULL DEFAULT '0', UNIQUE KEY `user_group_map_user_id_idx` (`user_id`,`group_id`,`grant_type`,`isbless`), KEY `fk_user_group_map_group_id_groups_id` (`group_id`), CONSTRAINT `fk_user_group_map_group_id_groups_id` FOREIGN KEY (`group_id`) REFERENCES `groups` (`id`) ON DELETE CASCADE ON UPDATE CASCADE, CONSTRAINT `fk_user_group_map_user_id_profiles_userid` FOREIGN KEY (`user_id`) REFERENCES `profiles` (`userid`) ON DELETE CASCADE ON UPDATE CASCADE ) ENGINE=InnoDB DEFAULT CHARSET=utf8 1 row in set (0.00 sec) There's over 40,000 in bug_group_map and over 400,000 in user_group_map.
Assignee | ||
Comment 8•12 years ago
|
||
I found a way to prefer the PRIMARY key by adding --chunk-index=PRIMARY, and now I'm getting: 08-28T13:44:36 Skipping chunk 3 of bugs.bugs_activity because it is oversized. The current chunk size limit is 126522 rows (chunk size=63261 * chunk size limit=2.0), but MySQL estimates that there are 133178 rows in the chunk. 08-28T13:45:54 Skipping chunk 1 of bugs.dependencies because it is oversized. The current chunk size limit is 25994 rows (chunk size=12997 * chunk size limit=2.0), but MySQL estimates that there are 28724 rows in the chunk. 08-28T13:48:05 Skipping chunk 1 of bugs.user_group_map because MySQL used only 6 bytes of the user_group_map_user_id_idx index instead of 8. See the --[no]check-plan documentation for more information. real 24m16.891s user 0m4.350s sys 0m0.530s We can fix the first 2 problems by setting --chunk-size-limit=5, the last one is a mystery.
Assignee | ||
Comment 9•12 years ago
|
||
setting --chunk-size-limit=7, here's the output: TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 09-04T12:36:20 0 2 9442441 117 0 94.215 bugs.bugs_activity 09-04T12:36:45 Skipping table bugs.duplicates because on the master it would be checksummed in one chunk but on these replicas it has too many rows: 188726 rows on tp-bugs01-slave02.phx.mozilla.com 196235 rows on tp-bugs01-slave03.phx.mozilla.com 162682 rows on tp-bugs01-master01.phx.mozilla.com The current chunk size limit is 114044 rows (chunk size=16292 * chunk size limit=7). 09-04T12:40:15 Skipping table bugs.profile_search because on the master it would be checksummed in one chunk but on these replicas it has too many rows: 521446 rows on tp-bugs01-master01.phx.mozilla.com The current chunk size limit is 338345 rows (chunk size=48335 * chunk size limit=7). 09-04T12:40:34 Skipping chunk 1 of bugs.user_group_map because MySQL used only 6 bytes of the user_group_map_user_id_idx index instead of 8. See the --[no]check-plan documentation for more information. 09-04T12:40:36 0 1 1 1 0 0.010 mysql.proc 09-04T12:40:36 0 1 305 1 0 0.011 mysql.tables_priv 09-04T12:40:36 0 1 23 1 0 0.011 mysql.user real 15m59.729s user 0m4.065s sys 0m0.492s Looks like I can't win for trying. I don't like the skipping. :( Other than the 3 skipped tables, there's only 1 table with differences, and that only has 2 out of 117 chunks different.
Assignee | ||
Comment 10•12 years ago
|
||
FINALLY got it working: [root@tp-bugs01-slave01 bin]# time ./checksum.sh Cannot connect to h=10.8.70.152,p=...,u=checksum TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 09-05T13:43:21 0 2 9447127 82 0 47.044 bugs.bugs_activity 09-05T13:47:53 0 1 1 1 0 0.009 mysql.proc 09-05T13:47:53 0 1 305 1 0 0.014 mysql.tables_priv 09-05T13:47:53 0 1 23 1 0 0.009 mysql.user real 15m23.631s user 0m2.361s sys 0m0.340s [root@tp-bugs01-slave01 bin]# more checksum.sh #!/bin/sh /usr/bin/pt-table-checksum --user checksum --password **ELIDED** --lock-wait-time=50 --quiet --chunk-size-limit=0 --no-check-plan Looks like there are 5 differences, not bad. I'll check those soon and see about fixing them.
Assignee | ||
Comment 11•12 years ago
|
||
Resolving the differences themselves will be in a different bug. The next step is running this checksum regularly, and putting a nagios check in place to let us know if there are issues. There is already a nagios check at: https://github.com/palominodb/PalominoDB-Public-Code-Repository/blob/master/nagios/table_checksums/check_table_checksums.pl But it uses an older version of pt-table-checksum, and some fields are different. I've done some updates and asked for a sanity check in https://bugzilla.mozilla.org/show_bug.cgi?id=789058. After that, we'll put it in place on the bugzilla machines.
Assignee | ||
Comment 12•12 years ago
|
||
resolved, monitoring is in place. BOOYAH.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•10 years ago
|
Product: mozilla.org → Data & BI Services Team
You need to log in
before you can comment on or make changes to this bug.
Description
•