Closed Bug 774162 Opened 12 years ago Closed 12 years ago

db consistency checks for bugzilla

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cshields, Assigned: scabral)

References

Details

(Whiteboard: [2012q3])

- start doing regular consistency checks of the databases - just the checksum. Actually monitoring those checksums and fixing them probably can't all be done in Q3.
  - I recommend starting with bugzilla, since I know that checksums don't cause slave lag on these, and have a checksum user for this
  - after bugzilla, addons, then sumo
  - document this procedure
How do we measure success and/or progress on this task?
Well, there are 2 parts - doing the check itself, and checking the results. We would use an automated script to do the checking, and put a Nagios monitor in place to check the results.

So for each cluster (bugzilla, then addons, then sumo, then others) there are 2 parts....if the official goal is "get this done on all 3" then there are 6 "milestones", and progress can be measured that way.
Corey, is that the goal here all 3 this quarter?
Let's make the goal easy to attain - just bugzilla for this quarter. We may do more than that....
Summary: db consistency checks → db consistency checks for bugzilla
I am running the consistency checks on bugzilla and have made a /usr/local/bin/checksum.sh with all the info we need. If this is successful, I'll work on a cron entry and getting a once-a-day nagios check for each slave to check the output.
The good news is that the slaves have all matching data from the one-time run. The bad news is that there are some issues:

[root@tp-bugs01-master01 bin]# ./checksum.sh 
Cannot connect to h=10.8.70.152,p=...,u=checksum
08-10T08:52:52 Error checksumming table bugs.bugs: Error executing checksum query: DBD::mysql::st execute failed: Deadlock found when trying to get lock; try restarting transaction [for Statement "REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `bug_id`, `assigned_to`, `bug_file_loc`, `bug_severity`, `bug_status`, `creation_ts`, `delta_ts`, `short_desc`, `op_sys`, `priority`, `rep_platform`, `reporter`, `version`, `resolution`, `target_milestone`, `qa_contact`, `status_whiteboard`, `votes`, `lastdiffed`, `everconfirmed`, `reporter_accessible`, `cclist_accessible`, `alias`, `estimated_time`, `remaining_time`, `product_id`, `component_id`, `deadline`, `cf_blocking_fennec`, `cf_blocking_191`, `cf_status_191`, `cf_status_192`, `cf_blocking_20`, `cf_blocking_thunderbird30`, `cf_status_thunderbird30`, `cf_blocking_192`, `cf_blocking_thunderbird31`, `cf_status_thunderbird31`, `cf_status_20`, `cf_status_seamonkey21`, `cf_blocking_seamonkey21`, `cf_blocking_thunderbird32`, `cf_status_thunderbird32`, `cf_blocking_thunderbird33`, `cf_status_thunderbird33`, `cf_blocking_fx`, `cf_tracking_firefox5`, `cf_status_firefox5`, `cf_tracking_firefox6`, `cf_colo_site`, `cf_tracking_firefox7`, `cf_status_firefox6`, `cf_status_firefox7`, `cf_crash_signature`, `cf_tracking_thunderbird6`, `cf_tracking_thunderbird7`, `cf_status_thunderbird6`, `cf_status_thunderbird7`, `cf_tracking_seamonkey22`, `cf_tracking_seamonkey23`, `cf_tracking_seamonkey24`, `cf_tracking_firefox8`, `cf_status_firefox8`, `cf_tracking_seamonkey25`, `cf_status_seamonkey22`, `cf_status_seamonkey23`, `cf_status_seamonkey24`, `cf_status_seamonkey25`, `cf_tracking_thunderbird8`, `cf_status_thunderbird8`, `cf_tracking_firefox9`, `cf_status_firefox9`, `cf_tracking_seamonkey26`, `cf_status_seamonkey26`, `cf_tracking_thunderbird9`, `cf_status_thunderbird9`, `cf_office`, `cf_due_date`, `cf_tracking_firefox10`, `cf_status_firefox10`, `cf_tracking_thunderbird10`, `cf_status_thunderbird10`, `cf_tracking_seamonkey27`, `cf_status_seamonkey27`, `cf_tracking_firefox11`, `cf_status_firefox11`, `cf_tracking_thunderbird11`, `cf_status_thunderbird11`, `cf_tracking_seamonkey28`, `cf_status_seamonkey28`, `cf_tracking_firefox12`, `cf_status_firefox12`, `cf_tracking_thunderbird12`, `cf_status_thunderbird12`, `cf_tracking_seamonkey29`, `cf_status_seamonkey29`, `cf_tracking_esr10`, `cf_status_esr10`, `cf_tracking_firefox13`, `cf_status_firefox13`, `cf_tracking_thunderbird13`, `cf_status_thunderbird13`, `cf_tracking_seamonkey210`, `cf_status_seamonkey210`, `cf_tracking_thunderbird_esr10`, `cf_status_thunderbird_esr10`, `cf_blocking_fennec10`, `cf_tracking_firefox14`, `cf_status_firefox14`, `cf_tracking_thunderbird14`, `cf_status_thunderbird14`, `cf_tracking_seamonkey211`, `cf_status_seamonkey211`, `cf_last_resolved`, `cf_blocking_kilimanjaro`, `cf_tracking_firefox15`, `cf_status_firefox15`, `cf_tracking_thunderbird15`, `cf_status_thunderbird15`, `cf_tracking_seamonkey212`, `cf_status_seamonkey212`, `cf_blocking_basecamp`, `cf_tracking_firefox16`, `cf_status_firefox16`, `cf_tracking_thunderbird16`, `cf_status_thunderbird16`, `cf_tracking_seamonkey213`, `cf_status_seamonkey213`, `cf_tracking_firefox17`, `cf_status_firefox17`, `cf_tracking_thunderbird17`, `cf_status_thunderbird17`, `cf_tracking_seamonkey214`, `cf_status_seamonkey214`, CONCAT(ISNULL(`bug_file_loc`), ISNULL(`creation_ts`), ISNULL(`qa_contact`), ISNULL(`lastdiffed`), ISNULL(`alias`), ISNULL(`deadline`), ISNULL(`cf_crash_signature`), ISNULL(`cf_due_date`), ISNULL(`cf_last_resolved`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `bugs`.`bugs` FORCE INDEX(`PRIMARY`) WHERE ((`bug_id` >= ?)) AND ((`bug_id` <= ?)) /*checksum chunk*/" with ParamValues: 0='bugs', 1='bugs', 2=25, 3='PRIMARY', 4='514824', 5='538635', 6='514824', 7='538635'] at /usr/local/percona/pt-table-checksum line 6932.


            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
08-10T08:52:52      1      0   514653      25       0  12.591 bugs.bugs
08-10T08:58:34 Error checksumming table bugs.user_group_map: Possible infinite loop detected!  The lower boundary for chunk 2 is <69> and the lower boundary for chunk 3 is also <69>.  This usually happens when using a non-unique single column index.  The current chunk index for table bugs.user_group_map is fk_user_group_map_group_id_groups_id which is not unique and covers 1 column.

08-10T08:58:34      1      0        0       2       2   0.091 bugs.user_group_map
08-10T08:58:35 Error checksumming table bugs.votes: Error executing checksum query: DBD::mysql::st execute failed: Deadlock found when trying to get lock; try restarting transaction [for Statement "REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `who`, `bug_id`, `vote_count`)) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `bugs`.`votes` FORCE INDEX(`votes_bug_id_idx`) WHERE ((`bug_id` >= ?)) AND ((`bug_id` <= ?)) /*checksum chunk*/" with ParamValues: 0='bugs', 1='votes', 2=2, 3='votes_bug_id_idx', 4='142898', 5='719764', 6='142898', 7='719764'] at /usr/local/percona/pt-table-checksum line 6932.


08-10T08:58:35      1      0    64626       2       0   0.421 bugs.votes
08-10T08:58:35      0      1        0       1       0   0.011 mysql.proc
08-10T08:58:35      0      1      300       1       0   0.012 mysql.tables_priv
08-10T08:58:35      0      1       30       1       0   0.010 mysql.user

real    16m19.751s
user    0m3.520s
sys     0m0.506s




Tried a 2nd run, to see if there were the same issues, and for the most part, yeah.


[root@tp-bugs01-master01 bin]# ./checksum.sh 
Cannot connect to h=10.8.70.152,p=...,u=checksum
08-10T10:44:12 Error checksumming table bugs.bug_group_map: Possible infinite loop detected!  The lower boundary for chunk 4 is <26> and the lower boundary for chunk 5 is also <26>.  This usually happens when using a non-unique single column index.  The current chunk index for table bugs.bug_group_map is bug_group_map_group_id_idx which is not unique and covers 1 column.

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
08-10T10:44:12      1      0        0       4       4   0.015 bugs.bug_group_map
08-10T10:50:11 Error checksumming table bugs.user_group_map: Possible infinite loop detected!  The lower boundary for chunk 2 is <69> and the lower boundary for chunk 3 is also <69>.  This usually happens when using a non-unique single column index.  The current chunk index for table bugs.user_group_map is fk_user_group_map_group_id_groups_id which is not unique and covers 1 column.

08-10T10:50:11      1      0        0       2       2   0.092 bugs.user_group_map
08-10T10:50:12      0      1        0       1       0   0.010 mysql.proc
08-10T10:50:13      0      1      300       1       0   0.012 mysql.tables_priv
08-10T10:50:13      0      1       30       1       0   0.010 mysql.user

real    18m47.019s
user    0m3.645s
sys     0m0.467s
Here's the bugs.bug_group_map table:

       Table: bug_group_map
Create Table: CREATE TABLE `bug_group_map` (
  `bug_id` mediumint(9) NOT NULL,
  `group_id` mediumint(9) NOT NULL,
  UNIQUE KEY `bug_group_map_bug_id_idx` (`bug_id`,`group_id`),
  KEY `bug_group_map_group_id_idx` (`group_id`),
  CONSTRAINT `fk_bug_group_map_bug_id_bugs_bug_id` FOREIGN KEY (`bug_id`) REFERENCES `bugs` (`bug_id`) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `fk_bug_group_map_group_id_groups_id` FOREIGN KEY (`group_id`) REFERENCES `groups` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.03 sec)


and the bugs.user_group_map table:
       Table: user_group_map
Create Table: CREATE TABLE `user_group_map` (
  `user_id` mediumint(9) NOT NULL,
  `group_id` mediumint(9) NOT NULL,
  `isbless` tinyint(4) NOT NULL DEFAULT '0',
  `grant_type` tinyint(4) NOT NULL DEFAULT '0',
  UNIQUE KEY `user_group_map_user_id_idx` (`user_id`,`group_id`,`grant_type`,`isbless`),
  KEY `fk_user_group_map_group_id_groups_id` (`group_id`),
  CONSTRAINT `fk_user_group_map_group_id_groups_id` FOREIGN KEY (`group_id`) REFERENCES `groups` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `fk_user_group_map_user_id_profiles_userid` FOREIGN KEY (`user_id`) REFERENCES `profiles` (`userid`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

There's over 40,000 in bug_group_map and over 400,000 in user_group_map.
I found a way to prefer the PRIMARY key by adding --chunk-index=PRIMARY, and now I'm getting:

08-28T13:44:36 Skipping chunk 3 of bugs.bugs_activity because it is oversized.  The current chunk size limit is 126522 rows (chunk size=63261 * chunk size limit=2.0), but MySQL estimates that there are 133178 rows in the chunk.

08-28T13:45:54 Skipping chunk 1 of bugs.dependencies because it is oversized.  The current chunk size limit is 25994 rows (chunk size=12997 * chunk size limit=2.0), but MySQL estimates that there are 28724 rows in the chunk.

08-28T13:48:05 Skipping chunk 1 of bugs.user_group_map because MySQL used only 6 bytes of the user_group_map_user_id_idx index instead of 8.  See the --[no]check-plan documentation for more information.

real    24m16.891s
user    0m4.350s
sys     0m0.530s

We can fix the first 2 problems by setting --chunk-size-limit=5, the last one is a mystery.
setting --chunk-size-limit=7, here's the output:

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
09-04T12:36:20      0      2  9442441     117       0  94.215 bugs.bugs_activity
09-04T12:36:45 Skipping table bugs.duplicates because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
  188726 rows on tp-bugs01-slave02.phx.mozilla.com
  196235 rows on tp-bugs01-slave03.phx.mozilla.com
  162682 rows on tp-bugs01-master01.phx.mozilla.com
The current chunk size limit is 114044 rows (chunk size=16292 * chunk size limit=7).
09-04T12:40:15 Skipping table bugs.profile_search because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
  521446 rows on tp-bugs01-master01.phx.mozilla.com
The current chunk size limit is 338345 rows (chunk size=48335 * chunk size limit=7).
09-04T12:40:34 Skipping chunk 1 of bugs.user_group_map because MySQL used only 6 bytes of the user_group_map_user_id_idx index instead of 8.  See the --[no]check-plan documentation for more information.
09-04T12:40:36      0      1        1       1       0   0.010 mysql.proc
09-04T12:40:36      0      1      305       1       0   0.011 mysql.tables_priv
09-04T12:40:36      0      1       23       1       0   0.011 mysql.user

real    15m59.729s
user    0m4.065s
sys     0m0.492s

Looks like I can't win for trying. I don't like the skipping. :( Other than the 3 skipped tables, there's only 1 table with differences, and that only has 2 out of 117 chunks different.
FINALLY got it working:

[root@tp-bugs01-slave01 bin]# time ./checksum.sh 
Cannot connect to h=10.8.70.152,p=...,u=checksum
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
09-05T13:43:21      0      2  9447127      82       0  47.044 bugs.bugs_activity
09-05T13:47:53      0      1        1       1       0   0.009 mysql.proc
09-05T13:47:53      0      1      305       1       0   0.014 mysql.tables_priv
09-05T13:47:53      0      1       23       1       0   0.009 mysql.user

real    15m23.631s
user    0m2.361s
sys     0m0.340s

[root@tp-bugs01-slave01 bin]# more checksum.sh 
#!/bin/sh

/usr/bin/pt-table-checksum --user checksum --password **ELIDED**  --lock-wait-time=50 --quiet --chunk-size-limit=0  --no-check-plan

Looks like there are 5 differences, not bad. I'll check those soon and see about fixing them.
Depends on: 789058
Resolving the differences themselves will be in a different bug.

The next step is running this checksum regularly, and putting a nagios check in place to let us know if there are issues.

There is already a nagios check at:
https://github.com/palominodb/PalominoDB-Public-Code-Repository/blob/master/nagios/table_checksums/check_table_checksums.pl

But it uses an older version of pt-table-checksum, and some fields are different. I've done some updates and asked for a sanity check in https://bugzilla.mozilla.org/show_bug.cgi?id=789058. After that, we'll put it in place on the bugzilla machines.
resolved, monitoring is in place. BOOYAH.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Blocks: 790749
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.