Closed
Bug 822733
Opened 12 years ago
Closed 12 years ago
apparent data inconsistency between aus4-dev rw and ro DBs
Categories
(Data & BI Services Team :: DB: MySQL, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Unassigned)
We started having some trouble with the aus4-dev environment a couple of weeks ago. Today I managed to dig into it more and I think I've discovered that the rw and ro databases are out of sync. Here's the same query run against each, showing the mismatch:

mysql> select md5(data) from releases where name='Firefox-mozilla-central-nightly-latest';
+----------------------------------+
| md5(data)                        |
+----------------------------------+
| 3af58a9bbae1c935915d9e4546cf5da6 |
+----------------------------------+
1 row in set (0.01 sec)

mysql> select md5(data) from releases where name='Firefox-mozilla-central-nightly-latest';
+----------------------------------+
| md5(data)                        |
+----------------------------------+
| a82adf54e67a6ddcf9bddc541182f542 |
+----------------------------------+
1 row in set (0.00 sec)

The first result is from the rw db, which has the correct data. The second is from the ro db, which doesn't appear to have picked up anything new since 2012-12-06.
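[Editor's note: a minimal spot-check along these lines, run identically against the rw and ro hosts, compares both the md5 of the blob and the row's data_version counter (the second release name is just an illustrative choice; any frequently-updated row would do):

-- Run the identical statement on the rw and ro hosts and compare the output;
-- data_version is the releases table's update counter (see comment 13).
SELECT name,
       data_version,
       MD5(data) AS data_md5
FROM releases
WHERE name IN ('Firefox-mozilla-central-nightly-latest',
               'Firefox-elm-nightly-latest')
ORDER BY name;
]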
Comment 1•12 years ago
Replication is running just fine; no dbs or tables are ignored. I am running a checksum to see what's out of sync. This is on the dev cluster in phx, with the aus4_dev database (last written to today). I think I have the right DB, but can you confirm what hostname you use to connect to the aus4-dev environment?
Reporter
Comment 2•12 years ago
The rw host is dev-zeus-rw.db.phx1.mozilla.com.
The ro host is 10.8.70.88 (which appears to be dev-zeus-ro.db.phx1.mozilla.com).
Comment 3•12 years ago
OK, so I'm looking at the proper servers. I ran a checksum on just the aus4_dev database. This tells us *what* is different, though not *why*:

mysql> select * from percona.checksums where this_Crc!=master_crc\G
*************************** 1. row ***************************
            db: aus4_dev
           tbl: releases
         chunk: 1
    chunk_time: 2.75618
   chunk_index: NULL
lower_boundary: NULL
upper_boundary: NULL
      this_crc: 70ccc187
      this_cnt: 1290
    master_crc: fb5fc47f
    master_cnt: 1299
            ts: 2012-12-18 12:27:09
1 row in set (0.00 sec)

So there are 9 more entries on the master than on the slave. I will see which entries are different and add them in. This won't fix the cause, but it will fix the symptom, and the missing data may give us insight into what happened.
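[Editor's note: a follow-up query against the same percona.checksums table summarizes the damage per table rather than per chunk; a sketch, not necessarily the exact query used here:

-- Run on the slave after the checksum pass: list each table whose checksum
-- or row count disagrees with the master, with aggregate row counts.
SELECT db, tbl,
       SUM(this_cnt)   AS slave_rows,
       SUM(master_cnt) AS master_rows,
       COUNT(*)        AS mismatched_chunks
FROM percona.checksums
WHERE this_crc <> master_crc
   OR this_cnt <> master_cnt
GROUP BY db, tbl;
]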
Reporter
Comment 4•12 years ago
Sounds good to me. Let me know what else I can do to help debug.
Comment 5•12 years ago
There are thousands of teeny tiny differences. For instance, in the row for 'Firefox-mozilla-central-nightly-latest':

876794c876794
< \"et\": {\"buildID\": \"20120505030510\"
---
> \"zu\": {\"buildID\": \"20120505030510\"

and the corresponding et/zu difference here:

< \"partial\": {\"fileUrl\": \"http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/05/2012-05-05-03-05-10-mozilla-central-l10n/firefox-15.0a1.et.win32.partial.20120504122939-20120505030510.mar\"
---
> \"partial\": {\"fileUrl\": \"http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/05/2012-05-05-03-05-10-mozilla-central-l10n/firefox-15.0a1.zu.win32.partial.20120504122939-20120505030510.mar\"

and a different hash value in that row too. A date discrepancy in the same row:

877531c877531
< \"pa-IN\": {\"buildID\": \"20121218030803\"
---
> \"pa-IN\": {\"buildID\": \"20121206030737\"
877533,877537c877533,877537
< \"partial\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-18-03-08-03-mozilla-central-l10n/firefox-20.0a1.pa-IN.linux-x86_64.partial.20121217030850-20121218030803.mar\"
< \"from\": \"Firefox-mozilla-central-nightly-20121217030850\"
< \"hashValue\": \"fbfe122d38d4b7efeef45e04454dafc0b8d8e37744a626f93033802f565470eeb178f55f8f7770afe79d4bed86c2fd43b82ed5e6da4ca9788d0ff623de558df8\"
< \"filesize\": \"6279575\"}
< \"complete\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-18-03-08-03-mozilla-central-l10n/firefox-20.0a1.pa-IN.linux-x86_64.complete.mar\"
---
> \"partial\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-06-03-07-37-mozilla-central-l10n/firefox-20.0a1.pa-IN.linux-x86_64.partial.20121205030759-20121206030737.mar\"
> \"from\": \"Firefox-mozilla-central-nightly-20121205030759\"
> \"hashValue\": \"fa52858666d251d0ebb7a80a809b0915cf007d09ea7ae63f957db0d05c86dc3903bae3d44d03f31f9d4de3615bcff600240b4b0c1ec5c02fe433c97144ed0708\"
> \"filesize\": \"5077205\"}
> \"complete\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-06-03-07-37-mozilla-central-l10n/firefox-20.0a1.pa-IN.linux-x86_64.complete.mar\"

######################################################################

Another sample row with a problem; the name is 'Firefox-mozilla-aurora-nightly-20121218042014':

2368658c2360355
< \"sv-SE\": {\"buildID\": \"20121218042014\"
---
> \"et\": {\"buildID\": \"20121218042014\"
2368660c2360357
< \"partial\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-18-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.sv-SE.mac.partial.20121217042016-20121218042014.mar\"
---
> \"partial\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-18-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.et.mac.partial.20121217042016-20121218042014.mar\"

######################################################################

Given the previous entry I'm not convinced recent entries aren't getting in, but there is definitely a data integrity issue here. :( Can you spot-check and see if you can divine a source of truth? We can sync to that.
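[Editor's note: one plausible way to produce a row-level diff like the above (the actual method isn't stated in the comment) is to dump the data blob for a single release name on each server and compare the two files with diff; the OUTFILE path below is a placeholder:

-- Run once on the master and once on the slave (adjust the path per host),
-- then diff the two resulting files to see how the blobs differ.
SELECT data
INTO OUTFILE '/tmp/Firefox-mozilla-central-nightly-latest.rw.txt'
FROM releases
WHERE name = 'Firefox-mozilla-central-nightly-latest';
]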
Comment 6•12 years ago
Note - it's possible the entire rows are OK, with just different ordering of values (say, in the first one, the et and zu values are both there but ordered differently). I didn't check the full rows like that, as they are quite large. I did a diff of just the names, i.e. just the unique keys that differ between the master and slave. Everything on the slave was also on the master, but there were a few rows on the master that were not on the slave:

Firefox-birch-nightly-20121207040206
Firefox-elm-nightly-20121207040205
Firefox-ionmonkey-nightly-20121207040201
Firefox-mozilla-aurora-nightly-20121207042017
Firefox-mozilla-central-nightly-20121207030741
Firefox-mozilla-esr10-nightly-20121207034501
Firefox-mozilla-esr17-nightly-20121207034502
Firefox-ux-nightly-20121207040202
Thunderbird-comm-central-nightly-20121207030322
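[Editor's note: a cheaper way to confirm the "missing names" picture without diffing full dumps would be a count plus an order-insensitive checksum of the primary key on each server; a sketch, not what was actually run here:

-- Run on both master and slave; if any names are missing on one side,
-- the count and/or the XOR'd CRC of the names will disagree.
SELECT COUNT(*)             AS row_count,
       BIT_XOR(CRC32(name)) AS names_checksum
FROM releases;
]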
Reporter
Comment 7•12 years ago
Echoing from IRC:

16:19 < bhearsum> one thing we do that might be a bit abnormal is do a ton of updates to rows around the same time (after the en-US builds finish, l10n repacks happen in parallel, and they all update the same set of rows)
16:19 < bhearsum> so if a bunch of updates close together could cause this, that could be why
16:21 < bhearsum> it's interesting to me that the only things missing entirely from the slave are from 20121207030322 though -- because that's the day after we noticed it broke
16:21 < bhearsum> i wonder if there was a network or other server event that put things into a bad state or something?

I'm not sure what else to think... this seems to be happening below the app level, so I'm not sure what else to do to debug.
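[Editor's note: one low-level way to chase the "what happened on the 6th/7th" question would be to walk the master's binary logs from that window and look for events touching the releases table; the log file name below is a placeholder (SHOW BINARY LOGS lists the real ones around that date):

-- List the binlogs, pick the one covering 2012-12-06/07, then page through it.
SHOW BINARY LOGS;
SHOW BINLOG EVENTS IN 'dev1-bin.000240' LIMIT 200;
]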
Comment 8•12 years ago
I will add in the missing rows; for the longer term, we should figure out exactly what is going wrong so we know whether it's a problem or not.
Comment 9•12 years ago
[root@dev1.db.phx1 ~]# mysqldump --no-create-info --skip-opt aus4_dev releases --where "name in
> ('Firefox-birch-nightly-20121207040206',
> 'Firefox-elm-nightly-20121207040205',
> 'Firefox-ionmonkey-nightly-20121207040201',
> 'Firefox-mozilla-aurora-nightly-20121207042017',
> 'Firefox-mozilla-central-nightly-20121207030741',
> 'Firefox-mozilla-esr10-nightly-20121207034501',
> 'Firefox-mozilla-esr17-nightly-20121207034502',
> 'Firefox-ux-nightly-20121207040202',
> 'Thunderbird-comm-central-nightly-20121207030322')" > importme.sql

I copied the file over to dev2 and imported it:

[root@dev2.db.phx1 ~]# mysql aus4_dev < /home/scabral/importme.sql
[root@dev2.db.phx1 ~]#

Are things better? The missing rows are there, but it also sounds like the updates aren't going through. :(
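[Editor's note: a quick spot-check of the import on dev2 might look like this sketch (names taken from the list above; not necessarily how it was actually verified):

-- On the slave, confirm a couple of the re-imported rows are present and look sane.
SELECT name, data_version, MD5(data) AS data_md5
FROM releases
WHERE name IN ('Firefox-birch-nightly-20121207040206',
               'Thunderbird-comm-central-nightly-20121207030322');
]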
Reporter
Comment 10•12 years ago
So, those rows seem to be in both places now. However, Firefox-mozilla-central-nightly-latest is still inconsistent between the two, which is one of the key entries :(. I'm not sure if "it also sounds like the updates aren't going through" means that you expected that or not, though.
Comment 11•12 years ago
Well, the inconsistency exists, which is why I deduced that "it also sounds like the updates aren't going through". Meaning "there's still a problem to debug and fix."
Comment 12•12 years ago
Fascinating stuff... the Firefox-mozilla-central-nightly-latest build had 3,915 JSON objects in it, and was 317K. Ben updated it by putting 9999999999 into the buildID and it didn't replicate to the slave. I deleted the row and added a dummy row, which quickly got updated with real data, and that's flowing nicely from master to slave.

Both master and slave seem to be OK with sizes:

mysql> show variables like '%packet%';
+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| max_allowed_packet       | 33554432   |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+
2 rows in set (0.00 sec)

So that's 32M max_allowed_packet and 1G slave_max_allowed_packet. Also, no errors in the error log, and replication is flowing.

And just so it's logged here, this is on dev2 (the slave); note that nothing is being filtered:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: dev1.db.phx1.mozilla.com
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: dev1-bin.000255
          Read_Master_Log_Pos: 88682759
               Relay_Log_File: dev2-relay-bin.000127
                Relay_Log_Pos: 88680653
        Relay_Master_Log_File: dev1-bin.000255
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 88680509
              Relay_Log_Space: 88683099
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 1
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
1 row in set (0.00 sec)

Nothing filtered in show master status on dev1 either:

mysql> show master status\G
*************************** 1. row ***************************
            File: dev1-bin.000255
        Position: 89477530
    Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)
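[Editor's note: for completeness, the row-size-vs-packet-limit check hinted at above can be made explicit; a sketch (the 317K figure is nowhere near the 32M limit, so packet size is ruled out):

-- Compare the problem row's size to the server's max_allowed_packet.
SELECT name,
       LENGTH(data)                AS data_bytes,
       @@global.max_allowed_packet AS max_allowed_packet
FROM releases
WHERE name = 'Firefox-mozilla-central-nightly-latest';
]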
Comment 13•12 years ago
So, we know something happened on the 6th or 7th that caused us to lose the rows listed in comment 6. We also lost updates to other rows, and since the updates end with:

data_version=153 WHERE releases.name = 'Firefox-elm-nightly-latest' AND releases.data_version = 152

once the data_version on the slave is out of sync (in this case it was at version 126 instead of 152), it stays that way. So the symptom fix is to sync the releases table, and the way to catch this in the future is checksums!
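[Editor's note: spelled out, the failure mode looks like this sketch; the literal values and the release name come from the comment above, the full SET list is abbreviated, and statement-based replication is assumed (which is what the described behaviour implies):

-- Application-style optimistic-locking update issued on the master, where the
-- row is currently at data_version = 152: one row matches and is updated.
UPDATE releases
   SET data = '...new blob...',
       data_version = 153
 WHERE releases.name = 'Firefox-elm-nightly-latest'
   AND releases.data_version = 152;

-- Replayed on a slave whose copy of the row is stuck at data_version = 126,
-- the same statement matches zero rows, so this update (and every later one)
-- is silently lost and the slave can never catch up on its own.
]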
Comment 14•12 years ago
Table sync'd, and a checksum shows that it is in fact in sync. Ben says not to worry about the releases_history table.
Reporter
Comment 15•12 years ago
I think we're all done here now?
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 16•12 years ago
FWIW yesterday I did a checksum and it took 204 minutes, or 3.4 hours. So that's OK to do 12 hours apart.
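[Editor's note: if checksums end up running on a fixed 12-hour schedule, a small monitoring query against the checksum results table can flag both staleness and drift; a sketch, assuming the same percona.checksums table as above:

-- When did the last checksum pass write results, and does anything still differ?
SELECT MAX(ts) AS last_checksum_at,
       SUM(this_crc <> master_crc OR this_cnt <> master_cnt) AS chunks_out_of_sync
FROM percona.checksums;
]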
Updated•10 years ago
Product: mozilla.org → Data & BI Services Team