Open Bug 1562017 Opened 5 years ago Updated 3 years ago

Test out idea to use database replica for get requests

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

defect

Tracking

(Not tracked)

REOPENED

People

(Reporter: armenzg, Unassigned)

Details

The concerns about this idea was a lag between insertions and reads (the UI needing to read something it wrote not long after).
According to ckolos we're talking about a lag of miliseconds.

This has the potential to isolate UI slowdowns due to MySql inserts slowing down MySql select operations (see bug 1553199 for details).

Some docs:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html

Some info from #django channel

r1cky [10:33 AM]
@Armen (armenzg) SUMO used to (probably still does) have the DB set up as one master (RW) and multiple slaves (RO). So all the GETs were handled by the slaves


Armen (armenzg) [10:36 AM]
@r1cky have you been involved with the work to switch to that model? or you know someone who has?
I'm interested to know what problems I could encounter when switching over

r1cky [10:37 AM]
I was but that was years ago... looks like they are using https://github.com/jbalogh/django-multidb-router now
jbalogh/django-multidb-router
Round-robin master/slave db router for Django.
Website
http://jbalogh.me/projects/multidb-router
Stars
263
jbalogh/django-multidb-router | Feb 9th, 2010 | Added by GitHub
https://github.com/mozilla/kitsune/blob/master/kitsune/settings.py#L72-L80
kitsune/settings.py:72-80
```if DATABASES['default']['ENGINE'] == 'django.db.backends.mysql':
    DATABASES['default']['CONN_MAX_AGE'] = DB_CONN_MAX_AGE
    DATABASES['default']['OPTIONS'] = {'init_command': 'SET storage_engine=InnoDB'}
    DATABASE_ROUTERS = ('multidb.PinningMasterSlaveRouter',)```

 Show more
mozilla/kitsune | Added by GitHub
The Pinning router is so that a client that just did a WRITE gets "pinned" to the master database for the next few READs/GETs. otherwise they will see stale data if there is a delay in master/slave replication (edited) 

pmac [10:40 AM]
I think we were using that when SUMO was in the data center, but not in AWS, but no reason it wouldn't work there

mat [11:18 AM]
AMO is using that as well ; similar strategy, primary db for writes and reads coming from a request that did some writes recently + some replicas for rest of the reads
unless something is very wrong the replica delay is always < 1 second. only potential issue is when you write something, then trigger a celery task, you need to have the task use the primary db through a decorator or something, even if it's just reading, to ensure the data is there
(on a similar note, not related to replicas but you need to make sure any transaction is committed before triggering a task if you're wrapping views in a transaction - we do that with https://github.com/mozilla/django-post-request-task/)

I think this is worth trying. I do think we'll want to try what mat suggested in the dialog you posted - to use the "primary db for writes and reads coming from a request that did some writes". A potential area of concern is with Perfherder - because of how the backend logic is structured for the alert data, an updated alert summary is requested if an alert status has changed and any delay in that would affect the UI. So I'd want to test the change thoroughly to see how it affects that part of the app.

Priority: -- → P3
Blocks: 1553199

This might actually be really helpful in alleviating some of the UI slowdown. Perhaps it should be a higher priority?

I was just trying to get it out of the queue of "untriaged" bugs.

It's important. We should try it soon.

Priority: P3 → P1

I discussed this problem with a DB expert and it was clear that we would just be moving the problem to another place.

The replications from the master DB + the reads from the slave DB would be the same amount of IOPS.

In any case, the database IOPS capping is solved.

No longer blocks: 1067846, 1504990
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
No longer blocks: 1553199
Assignee: nobody → cdawson

De-prioritizing Treeherder work in new reorg. So removing myself.

Assignee: cdawson → nobody
You need to log in before you can comment on or make changes to this bug.