Closed Bug 822733 Opened 12 years ago Closed 12 years ago

apparent data inconsistency between aus4-dev rw and ro DBs

Categories

(Data & BI Services Team :: DB: MySQL, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References

Details

We started having some trouble with the aus4-dev environment a couple of weeks ago. Today I managed to dig into it more, and I think I've discovered that the rw and ro databases are out of sync. Here's a query from each (rw first, then ro) showing that:
mysql> select md5(data) from releases where name='Firefox-mozilla-central-nightly-latest';
+----------------------------------+
| md5(data)                        |
+----------------------------------+
| 3af58a9bbae1c935915d9e4546cf5da6 |
+----------------------------------+
1 row in set (0.01 sec)


mysql> select md5(data) from releases where name='Firefox-mozilla-central-nightly-latest';
+----------------------------------+
| md5(data)                        |
+----------------------------------+
| a82adf54e67a6ddcf9bddc541182f542 |
+----------------------------------+
1 row in set (0.00 sec)


The first one is from the rw db, which has the correct data. The second is from the ro db, which doesn't seem to have anything newer than 2012-12-06.
Replication is running just fine, and no databases or tables are ignored. I am running a checksum to see what's out of sync.

This is on the dev cluster in phx, with the aus4_dev database (last written to today). I think I have the right DB. Can you confirm which hostname you use to connect to the aus4-dev environment?
The rw host is dev-zeus-rw.db.phx1.mozilla.com
The ro host is 10.8.70.88 (which appears to be dev-zeus-ro.db.phx1.mozilla.com)
OK, so I'm looking at the proper servers.

I ran a checksum just on the aus4_dev database. This tells us *what* is different, though not *why*...

mysql> select * from percona.checksums where this_crc!=master_crc\G
*************************** 1. row ***************************
            db: aus4_dev
           tbl: releases
         chunk: 1
    chunk_time: 2.75618
   chunk_index: NULL
lower_boundary: NULL
upper_boundary: NULL
      this_crc: 70ccc187
      this_cnt: 1290
    master_crc: fb5fc47f
    master_cnt: 1299
            ts: 2012-12-18 12:27:09
1 row in set (0.00 sec)
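
For reference, the percona.checksums table is what Percona Toolkit's pt-table-checksum writes its results into; a minimal sketch of the kind of run that produces the row above, pointed at the master (the user here is a placeholder, not the account actually used):

pt-table-checksum --replicate=percona.checksums --databases=aus4_dev \
    h=dev1.db.phx1.mozilla.com,u=checksum_user --ask-pass

The tool compares the per-chunk CRCs on the replicas itself; querying percona.checksums on the slave, as above, shows the same information.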

So there are 9 more entries on the master than on the slave... I will see which entries are different and add them in. This won't fix the cause, but it will fix the symptom, and the missing data may give us insight into what happened.
Sounds good to me. Let me know what else I can do to help debug.
There are thousands of teeny tiny differences.

For instance, the row for:
'Firefox-mozilla-central-nightly-latest'

876794c876794
<  \"et\": {\"buildID\": \"20120505030510\"
---
>  \"zu\": {\"buildID\": \"20120505030510\"

and the corresponding et/zu difference here:

<  \"partial\": {\"fileUrl\": \"http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/05/2012-05-05-03-05-10-mozilla-central-l10n/firefox-15.0a1.et.win32.partial.20120504122939-20120505030510.mar\"
---
>  \"partial\": {\"fileUrl\": \"http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012/05/2012-05-05-03-05-10-mozilla-central-l10n/firefox-15.0a1.zu.win32.partial.20120504122939-20120505030510.mar\"

and a different hash value in that row too.

A date discrepancy in the same row:
877531c877531
<  \"pa-IN\": {\"buildID\": \"20121218030803\"
---
>  \"pa-IN\": {\"buildID\": \"20121206030737\"
877533,877537c877533,877537
<  \"partial\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-18-03-08-03-mozilla-central-l10n/firefox-20.0a1.pa-IN.linux-x86_64.partial.20121217030850-20121218030803.mar\"
<  \"from\": \"Firefox-mozilla-central-nightly-20121217030850\"
<  \"hashValue\": \"fbfe122d38d4b7efeef45e04454dafc0b8d8e37744a626f93033802f565470eeb178f55f8f7770afe79d4bed86c2fd43b82ed5e6da4ca9788d0ff623de558df8\"
<  \"filesize\": \"6279575\"}
<  \"complete\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-18-03-08-03-mozilla-central-l10n/firefox-20.0a1.pa-IN.linux-x86_64.complete.mar\"
---
>  \"partial\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-06-03-07-37-mozilla-central-l10n/firefox-20.0a1.pa-IN.linux-x86_64.partial.20121205030759-20121206030737.mar\"
>  \"from\": \"Firefox-mozilla-central-nightly-20121205030759\"
>  \"hashValue\": \"fa52858666d251d0ebb7a80a809b0915cf007d09ea7ae63f957db0d05c86dc3903bae3d44d03f31f9d4de3615bcff600240b4b0c1ec5c02fe433c97144ed0708\"
>  \"filesize\": \"5077205\"}
>  \"complete\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-06-03-07-37-mozilla-central-l10n/firefox-20.0a1.pa-IN.linux-x86_64.complete.mar\"


######################################################################
and another sample row with a problem, the name is 'Firefox-mozilla-aurora-nightly-20121218042014':


2368658c2360355
<  \"sv-SE\": {\"buildID\": \"20121218042014\"
---
>  \"et\": {\"buildID\": \"20121218042014\"
2368660c2360357
<  \"partial\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-18-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.sv-SE.mac.partial.20121217042016-20121218042014.mar\"
---
>  \"partial\": {\"fileUrl\": \"http://ftp-scl3.mozilla.com/pub/mozilla.org/firefox/nightly/2012/12/2012-12-18-04-20-14-mozilla-aurora-l10n/firefox-19.0a2.et.mac.partial.20121217042016-20121218042014.mar\"

######################################################################
Given the previous entry, I'm not convinced recent entries aren't getting in at all. But there is definitely a data integrity issue here. :( Can you spot-check and see if you can divine a source of truth? We can sync to that.
Note - it's possible the entire rows are OK, with just a different ordering of values (say, in the first one, the et and zu values are both there but ordered differently). I didn't check the full rows for that, as they are quite large.
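
A cheap way to spot-check a whole row without eyeballing the full JSON blob is to compare a per-row checksum (and the data_version) on each host, the same idea as the md5() query at the top of this bug; a sketch:

select name, md5(data), data_version
  from releases
 where name = 'Firefox-mozilla-central-nightly-latest';

Running that on both the rw and ro hosts and comparing the two result rows flags a drifted row without having to diff the blob itself.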

I did a diff of just the names (i.e. just the unique keys) to see which differ between the master and slave; a sketch of the commands appears after the list. Everything on the slave was also on the master, but there were a few rows on the master that were not on the slave:

Firefox-birch-nightly-20121207040206
Firefox-elm-nightly-20121207040205
Firefox-ionmonkey-nightly-20121207040201
Firefox-mozilla-aurora-nightly-20121207042017
Firefox-mozilla-central-nightly-20121207030741
Firefox-mozilla-esr10-nightly-20121207034501
Firefox-mozilla-esr17-nightly-20121207034502
Firefox-ux-nightly-20121207040202
Thunderbird-comm-central-nightly-20121207030322
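
A minimal sketch of how a name-only diff like this can be produced (hostnames as used elsewhere in this bug; credentials omitted):

mysql -h dev1.db.phx1.mozilla.com -N -e "select name from aus4_dev.releases order by name" > names.master
mysql -h dev2.db.phx1.mozilla.com -N -e "select name from aus4_dev.releases order by name" > names.slave
diff names.master names.slave
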
Echoing from IRC:
16:19 < bhearsum> one thing we do that might be a bit abnormal is do a ton of updates to 
                  rows around the same time (after the en-US builds finish, l10n repacks 
                  happen in parallel, and they all update the same set of rows)
16:19 < bhearsum> so if a bunch of updates close together could cause this, that could be 
                  why
16:21 < bhearsum> it's interesting to me that the only things missing entirely from the 
                  slave are from 20121207030322 though -- because that's the day after we 
                  noticed it broke
16:21 < bhearsum> i wonder if there was a network or other server event that put things 
                  into a bad state or something?


I'm not sure what else to think... this seems to be happening below the app level, so I'm not sure how else to debug it.
I will add in the missing rows; for the longer term, we should figure out exactly what is going wrong so we know whether it's a problem or not.
[root@dev1.db.phx1 ~]# mysqldump --no-create-info --skip-opt aus4_dev releases --where "name in 
> ('Firefox-birch-nightly-20121207040206',
> 'Firefox-elm-nightly-20121207040205',
> 'Firefox-ionmonkey-nightly-20121207040201',
> 'Firefox-mozilla-aurora-nightly-20121207042017',
> 'Firefox-mozilla-central-nightly-20121207030741',
> 'Firefox-mozilla-esr10-nightly-20121207034501',
> 'Firefox-mozilla-esr17-nightly-20121207034502',
> 'Firefox-ux-nightly-20121207040202',
> 'Thunderbird-comm-central-nightly-20121207030322')"  > importme.sql
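
(--no-create-info and --skip-opt keep the dump to plain per-row INSERT statements, with no DROP/CREATE TABLE statements, so the file can be applied on top of the existing table on the slave.)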

copied the file over to dev2 and imported:
[root@dev2.db.phx1 ~]# mysql aus4_dev < /home/scabral/importme.sql 
[root@dev2.db.phx1 ~]# 

Are things better? The missing rows are there, but it also sounds like the updates aren't going through. :(
So, those rows seem to be in both places now. However, Firefox-mozilla-central-nightly-latest is still inconsistent between the two, which is one of the key entries :(. I'm not sure if "it also sounds like the updates aren't going through" means that you expected that or not, though.
Well, the inconsistency exists, which is why I deduced that "it also sounds like the updates aren't going through". Meaning "there's still a problem to debug and fix."
Fascinating stuff... the Firefox-mozilla-central-nightly-latest row had 3,915 JSON objects in it and was 317K. Ben updated it by putting 9999999999 into the buildID, and it didn't replicate to the slave.

I deleted the row and added a dummy row, which quickly got updated with real data, and that's flowing nicely from master to slave.
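
A hypothetical reconstruction of that reset, using only the columns visible elsewhere in this bug (the real table has more columns, so treat this as a sketch of the idea rather than the exact statements that were run):

-- drop the wedged row and seed a dummy one for the nightly submissions to overwrite
delete from releases where name = 'Firefox-mozilla-central-nightly-latest';
insert into releases (name, data, data_version)
values ('Firefox-mozilla-central-nightly-latest', '{}', 1);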

Both master and slave seem to be OK with sizes:

mysql>  show variables like '%packet%';
+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| max_allowed_packet       | 33554432   |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+
2 rows in set (0.00 sec)

So that's 32M max_allowed_packet and 1G slave_max_allowed_packet. Also, no errors in the error log, and replication is flowing.

And just so it's logged here, this is on dev2 (the slave); note that nothing is being filtered:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: dev1.db.phx1.mozilla.com
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: dev1-bin.000255
          Read_Master_Log_Pos: 88682759
               Relay_Log_File: dev2-relay-bin.000127
                Relay_Log_Pos: 88680653
        Relay_Master_Log_File: dev1-bin.000255
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 88680509
              Relay_Log_Space: 88683099
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 1
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
1 row in set (0.00 sec)

Nothing filtered in show master status on dev1 either:

mysql> show master status\G
*************************** 1. row ***************************
            File: dev1-bin.000255
        Position: 89477530
    Binlog_Do_DB: 
Binlog_Ignore_DB: 
1 row in set (0.00 sec)
So, we know something happened on the 6th or 7th that caused us to lose the rows listed in comment 6. We also lost updates to other rows, and since the updates end with:

 data_version=153 WHERE releases.name = 'Firefox-elm-nightly-latest'
 AND releases.data_version = 152

once the data_version on the slave was out of sync (in this case it was at version 126 instead of version 152), the replicated UPDATE no longer matches anything there, so the slave silently stays that way.
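
To make that concrete, a sketch (assuming statement-based replication, which is consistent with the silent failure described here): the same UPDATE that succeeds on the master matches nothing on a slave whose copy of the row is stale, and no error is raised.

-- on the master: the row is at data_version=152, so this matches one row and bumps it to 153
UPDATE releases SET data = '<new blob>', data_version = 153
 WHERE releases.name = 'Firefox-elm-nightly-latest'
   AND releases.data_version = 152;

-- on the slave: the row is still at data_version=126, so the replicated statement
-- matches 0 rows, completes without error, and the stale copy is left in place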

So the symptom fix is to sync the releases table, and the way to catch this in the future is checksums!
Table sync'd, and a checksum shows that it is in fact in sync. 

Ben says not to worry about the releases_history table.
I think we're all done here now?
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
FWIW yesterday I did a checksum and it took 204 minutes, or 3.4 hours. So that's OK to do 12 hours apart.
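
If that were wired up as a cron job against the master, it might look something like this (user and exact options are placeholders):

0 */12 * * * pt-table-checksum --replicate=percona.checksums --databases=aus4_dev h=dev1.db.phx1.mozilla.com,u=checksum_user
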
Blocks: 823096
Product: mozilla.org → Data & BI Services Team