Closed Bug 1227868 Opened 9 years ago Closed 9 years ago

production bedrock database replication is broken

Component: Infrastructure & Operations Graveyard :: WebOps: Other
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: jgmize
Assignee: Unassigned
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2256]
Attachments: 2 files

Steps to replicate:

ssh bedrockadm
source /data/bedrock/www/www.mozilla.org-django/venv/bin/activate
cd /data/bedrock/src/www.mozilla.org-django/bedrock
./manage.py shell_plus

>>> settings.DATABASES.keys()
['default', 'readonly']
>>> settings.DATABASES['default']['HOST']
'db-prod-rw'
>>> settings.DATABASES['readonly']['HOST']
'db-prod-ro'
>>> ro = ProductDetailsFile.objects.using('readonly')
>>> ro_versions = ro.get(name='firefox_versions.json')
>>> rw = ProductDetailsFile.objects.using('default')
>>> rw_versions = rw.get(name='firefox_versions.json')
>>> import json
>>> json.loads(ro_versions.content)['LATEST_FIREFOX_DEVEL_VERSION']
u'43.0b5'
>>> json.loads(rw_versions.content)['LATEST_FIREFOX_DEVEL_VERSION']
u'43.0b6'


Because bedrock uses the 'readonly' db for all select queries by default, content on www.mozilla.org that is served from the DB in the SCL3 datacenter is out of date. For example, when a request is directed to the SCL3 datacenter, the beta download button on http://www.mozilla.org/firefox/channel points to version 43.0b5 instead of 43.0b6. Bedrock deployments in AWS use independent databases that pull from product-details and other data sources directly, so they are unaffected by this issue.
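
The comparison in the shell session above can be generalized to report every key that differs between the two copies, not just LATEST_FIREFOX_DEVEL_VERSION. A minimal sketch, assuming the `content` field always holds a JSON object as it does for firefox_versions.json; the function name `diff_json_content` is hypothetical and is not part of bedrock:

```python
import json


def diff_json_content(ro_content, rw_content):
    """Return {key: (readonly_value, default_value)} for keys whose values differ.

    Both arguments are JSON strings, for example the `content` field of the
    ProductDetailsFile rows fetched from the 'readonly' and 'default' databases.
    A key that is missing on one side is reported with None for that side.
    """
    ro = json.loads(ro_content)
    rw = json.loads(rw_content)
    return {
        key: (ro.get(key), rw.get(key))
        for key in sorted(set(ro) | set(rw))
        if ro.get(key) != rw.get(key)
    }
```

In the shell_plus session this could be called as `diff_json_content(ro_versions.content, rw_versions.content)`; an empty dict would mean the two copies agree.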
See Also: → 1227755
So, replication is not broken from bedrock1 in scl3 to bedrock2 in scl3. bedrock in phx1 was decommissioned, so any connections to it would fail rather than return wrong data.

Latest checksum data:

mysql> select max(ts),count(*) from percona.checksums;
+---------------------+----------+
| max(ts)             | count(*) |
+---------------------+----------+
| 2015-11-23 18:00:31 |       54 |
+---------------------+----------+
1 row in set (0.00 sec)

mysql> select * from percona.checksums where this_crc!=master_crc;
Empty set (0.00 sec)

But that's from 2 days ago. I will re-run checksums from bedrock in scl3 (and make sure they're automated).

If there are discrepancies, I will have Pythian take a look.
Hrm, checksums are scheduled to run twice daily automatically. And I re-ran checksums, and replication is indeed not working, even though SHOW SLAVE STATUS reports that it's fine. This may be a GTID issue. I will look into having Pythian sync the slave to the master.
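
An empty mismatch query only means something if the checksum rows are recent. A minimal sketch of a staleness check on the `max(ts)` value from percona.checksums; the function name `checksums_are_stale` is hypothetical, and the 12-hour default is an assumption derived from the twice-daily schedule mentioned above:

```python
from datetime import datetime, timedelta


def checksums_are_stale(max_ts, now, max_age=timedelta(hours=12)):
    """Return True if the newest checksum row is older than max_age.

    max_ts is the string returned by
    `select max(ts) from percona.checksums`, in MySQL's default
    'YYYY-MM-DD HH:MM:SS' format. `now` must be a naive datetime in the
    same time zone as the database server (an assumption of this sketch).
    """
    latest = datetime.strptime(max_ts, "%Y-%m-%d %H:%M:%S")
    return now - latest > max_age
```

With the value shown earlier, '2015-11-23 18:00:31', and a current time two days later, this returns True, which would flag the "Empty set" result as untrustworthy.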
Load balancer updated; all traffic is now going to the master, so end users should see no difference.
We will work on re-syncing the slave in the background.
While Pythian's doing the technical work, I've been looking at how we could have been notified of this issue. We have a heartbeat running, but we're apparently not monitoring it:

mysql> select max(ts) from heartbeat_scl3;
+----------------------------+
| max(ts)                    |
+----------------------------+
| 2015-11-23T22:18:32.000750 |
+----------------------------+
1 row in set (0.00 sec)

So that's something we can monitor. We do monitor checksums but that doesn't page; we should make it page and fix any stragglers that show up.
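
As a rough illustration of what monitoring the heartbeat could look like, here is a minimal sketch that turns the `max(ts)` value into a lag in seconds and decides whether to page. The function names and the 300-second threshold are hypothetical, and the sketch assumes the heartbeat timestamp and `now` are in the same time zone:

```python
from datetime import datetime


def heartbeat_lag_seconds(heartbeat_ts, now):
    """Return the seconds elapsed between the last heartbeat and `now`.

    heartbeat_ts is the string returned by
    `select max(ts) from heartbeat_scl3`, e.g. '2015-11-23T22:18:32.000750'.
    """
    latest = datetime.strptime(heartbeat_ts, "%Y-%m-%dT%H:%M:%S.%f")
    return (now - latest).total_seconds()


def should_page(heartbeat_ts, now, threshold_seconds=300):
    """Return True if the heartbeat is older than the paging threshold."""
    return heartbeat_lag_seconds(heartbeat_ts, now) > threshold_seconds
```

Run on the slave, a heartbeat that stopped advancing on 2015-11-23 would exceed any reasonable threshold within minutes, even while SHOW SLAVE STATUS still looks healthy.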
Confirmed that I now get:

>>> json.loads(ro_versions.content)['LATEST_FIREFOX_DEVEL_VERSION']
u'43.0b6'
>>> json.loads(rw_versions.content)['LATEST_FIREFOX_DEVEL_VERSION']
u'43.0b6'
Found the following misconfiguration on the master (bedrock1):

<pre>
mysql> select @@hostname, now(), @@sql_log_bin;
+------------------------------+---------------------+---------------+
| @@hostname                   | now()               | @@sql_log_bin |
+------------------------------+---------------------+---------------+
| bedrock1.db.scl3.mozilla.com | 2015-11-25 21:13:32 |             0 |
+------------------------------+---------------------+---------------+
1 row in set (0.00 sec)
</pre>


So, after confirming that there weren't any entries in the master's binlog, I set sql_log_bin to 1 and updated /etc/my.cnf to persist the change.

Once it was enabled, I took a backup from the master and resynchronized bedrock2 and backup4 from it. I tested replication by running pt-table-checksum on the master and verifying the checksums table.
DBC: Emanuel Calvo
Confirmed working and back in the load balancer today.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Might this be happening again?

https://www.mozilla.org/en-US/thunderbird/38.4.0/releasenotes/ was updated per bug 1228895 comments 2 and 3, but what's being served still has an incorrect link for "Various security fixes". The link was updated to https://www.mozilla.org/en-US/security/known-vulnerabilities/thunderbird/#thunderbird38.4
Reopening; Pythian should look into this.
Status: RESOLVED → REOPENED
Flags: needinfo?(team73)
Resolution: FIXED → ---
I just took a look at
https://www.mozilla.org/en-US/thunderbird/38.4.0/releasenotes/
and while it says 'Check out "What’s New" and "Known Issues" for this version of Thunderbird below' there are no sections labeled "What's New" or "Known Issues", and no text that links to
https://www.mozilla.org/en-US/security/known-vulnerabilities/thunderbird/#thunderbird38.4
Agreed, it has gotten worse.
I see the same thing as Steve.
See Also: → 1228895
AFAICT, the Firefox release notes are not currently affected, only Thunderbird's.
Hello
I routed read traffic to the master, so now you should not see the issue, and we are investigating the data differences on the other server

DBC: Janos Ruszo
Flags: needinfo?(team73)
Strange that I still see the issue even when the data is pulled from the RW server. That should mean that this data is either not present in the DB at all or is not being displayed on the application side.
Wayne, can you give us a hint about where we should look for the data (which table, or a query), so we can check whether it is there on the DB side?
Thank you
Flags: needinfo?(vseerror)
I don't know anything about the database, but rkent, who made all the changes, was pretty clear that what he saw in staging was correct [1]. Perhaps his multiple attempts to "force" the updates to be public broke something? fallen, any thoughts?

FWIW, last week we had Bug 1228083, which was related to this replication. 


[1] email exchange with Kent about relnotes

90 minutes have passed, and still no update. I wonder if bug 1227868 is still an issue?

On 12/1/2015 4:13 PM, R Kent James wrote:
> After several hours, the staging copy was updated with the correct link, but not the public copy. I tried disabling and re-enabling public access to see if that kicks it into submission.
>
> On 12/1/2015 1:18 PM, R Kent James wrote:
>> I've now updated the security link using the reference below. Thanks for the link!
>>
>> On 11/27/2015 5:45 PM, Al Billings wrote:
>>> I'd open a bug but I'm not sure who to assign it to at this point.
>>>
>>> The Thunderbird release notes are pointing to Firefox ESR advisories for
>>> Thunderbird "security fixes." The release notes should point to the
>>> Thunderbird advisories at
>>> https://www.mozilla.org/en-US/security/known-vulnerabilities/thunderbird/ .
>>>
>>> These generally go up after the release since I'm not informed when
>>> releases are live and need to update them after the fact. It would be
Flags: needinfo?(vseerror) → needinfo?(philipp)
Thank you Wayne!
I sent needinfo for rkent, maybe he can shed some light on this

Janos Ruszo
kohei might also have insight. 
ref bug 1228895 comment 4

(I would not expect kent to know anything about the tables)
Pythian,

Note that during and after rkent's attempts, I was checking frequently to see whether the release notes had been corrected. They did change a couple of times. For example, once in IE I found the security fixes line item listed FIRST. But they are CORRECT at https://www-dev.allizom.org/en-US/thunderbird/38.4.0/releasenotes/ (where the security fixes line item is SECOND).

Then, about half a day later, we now see what is shown in attachment 8694969.
Just a nit, and perhaps this is deliberate, but the font used by
https://www.allizom.org/en-US/thunderbird/38.4.0/releasenotes/
and the font used by
https://www.mozilla.org/en-US/thunderbird/38.4.0/releasenotes/
are slightly different.
(Open them in adjacent tabs and then switch back and forth.)
At 12:54 PDT kent "did a minor change (added Windows 10) and saved". That is, "Windows 10" was added to the system requirements page, and the change was immediately reflected in all 3 locations:
https://www-dev.allizom.org/en-US/thunderbird/38.4.0/system-requirements/
https://www.allizom.org/en-US/thunderbird/38.4.0/system-requirements/
https://www.mozilla.org/en-US/thunderbird/38.4.0/system-requirements/

However, the release notes pages are still screwed up. Both of these lack any content:
https://www.allizom.org/en-US/thunderbird/38.4.0/releasenotes/
https://www.mozilla.org/en-US/thunderbird/38.4.0/releasenotes/

We are not knowledgeable about these systems, and defer to you as to where the problem is and how to resolve.
Flags: needinfo?(team73)
Flags: needinfo?(scabral)
Flags: needinfo?(philipp)
The current thunderbird release notes issue is unrelated to the issue this bug was originally opened for, and no further action is needed from :scabral or Pythian. The original issue was resolved, and I am working on the new issue now and will file a separate bug with more details as to the cause and resolution as soon as I have them, and will "see also" this bug to keep everyone in the loop.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Flags: needinfo?(team73)
Flags: needinfo?(scabral)
Resolution: --- → FIXED
See Also: → 1230451
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard