Try sending as many SELECT queries to the replica database as possible
Categories
(Tree Management :: Treeherder, enhancement)
Tracking
(Not tracked)
People
(Reporter: wezhou, Unassigned)
Details
While investigating the 100% database CPU usage issue today, we found that most SELECT queries are sent to the primary database, including some time-consuming slow queries (see https://bugzilla.mozilla.org/show_bug.cgi?id=1814312). During the incident, the primary database reached 100% CPU while the replica used only 2% CPU.
We should send as many READ queries as possible to the replicas so that WRITE queries on the primary can return quickly. Another benefit of sending READ queries to replicas is that replicas can be scaled horizontally, which helps improve the overall throughput of the system.
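For illustration only, here is a minimal sketch of how the read/write split could look, assuming the application goes through Django's ORM and its database router hooks. The class name `ReadReplicaRouter` and the aliases `replica1` and `replica2` are hypothetical and are not taken from the Treeherder codebase.

```python
import random


class ReadReplicaRouter:
    """Hypothetical Django database router: reads go to replicas, writes to the primary."""

    # Assumed aliases; they would have to match keys in settings.DATABASES.
    primary_alias = "default"
    replica_aliases = ["replica1", "replica2"]

    def db_for_read(self, model, **hints):
        # Spread SELECT queries across the replicas at random.
        return random.choice(self.replica_aliases)

    def db_for_write(self, model, **hints):
        # INSERT/UPDATE/DELETE always go to the primary.
        return self.primary_alias

    def allow_relation(self, obj1, obj2, **hints):
        # Every alias serves the same data set, so relations are allowed.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Schema changes run on the primary only; replication carries them over.
        return db == self.primary_alias
```

A router like this would be enabled by listing its dotted path in the `DATABASE_ROUTERS` setting. One caveat with this approach is replication lag: a read issued right after a write may not see that write yet, so code paths that need read-your-writes consistency would still have to target the primary explicitly (for example with `QuerySet.using("default")`).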
I just found out that the existing replica is meant for use by Redash only, not by the application itself (as hinted in https://github.com/mozilla-services/cloudops-infra/pull/3120).
If we were to go down this route (i.e. sending READ queries to read replicas), we would probably need to set up new replicas. However, it feels like this ticket can wait until bug #1814312 is resolved; if that solves the problem, this one becomes less urgent (considering we doubled the number of vCPUs of the primary today).