Determine the source of slowdowns for treeherder-prod
Categories: Tree Management :: Treeherder: Infrastructure (task, P3)
Tracking: Not tracked
People: Reporter: armenzg; Unassigned
Attachments: 4 files, 1 obsolete file
When we see a large increase in Read IOPS, we start having trouble returning responses from Treeherder's APIs.
On IRC I suggested that perhaps we're reaching our IOPS cap; however, on closer inspection we're most likely below the 3,000 IOPS allowed by the 1 TB of General Purpose (gp2) storage we have (3 IOPS per GiB).
Perhaps we could bump the storage by another 500 GB, which would give us another 1,500 IOPS, or we could look into switching to Provisioned IOPS storage. Pricing is found here. db.m5.2xlarge costs $1.368/hour. We might be able to reduce the instance type if we can guarantee the IOPS.
This is all assuming it is IOPS that is getting in the way.
:dividehex, :ckolos, do you have suggestions on how to find out what could be causing the issues? Or how we could even be alerted? Or someone who could help me investigate live?
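For reference, here is a minimal sketch of the gp2 baseline math assumed above (3 IOPS per GiB, with AWS's documented 100 IOPS floor); the volume sizes are just the figures mentioned in this comment, not measured values.

# gp2 baseline IOPS sketch: 3 IOPS per GiB, with a 100 IOPS floor.
def gp2_baseline_iops(size_gib: int) -> int:
    return max(100, 3 * size_gib)

print(gp2_baseline_iops(1000))  # current ~1 TB volume -> 3000 IOPS
print(gp2_baseline_iops(1500))  # after adding 500 GB  -> 4500 IOPS (+1500)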
Reporter | Comment 1 • 5 years ago
Reporter | Comment 2 • 5 years ago
Reporter | Comment 3 • 5 years ago
Reporter | Updated • 5 years ago
Comment 4 • 5 years ago
Armen, how well is Treeherder utilizing the treeherder-prod-ro read replica? I don't see very many connections to that RDS instance, so I wonder whether read queries are being routed there at all.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html
Also, have you looked at the CloudWatch UI? If not, you might find more graphing options for the data being produced by RDS.
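For context on what routing reads to the replica would look like: Treeherder is a Django application, and Django handles this with a database router. The sketch below is generic and assumes a hypothetical settings.DATABASES alias named "read_replica" pointing at treeherder-prod-ro; it is not Treeherder's actual configuration.

# Generic Django database router sketch. Assumes settings.DATABASES has a
# second alias, "read_replica", pointing at treeherder-prod-ro (hypothetical),
# and that settings.DATABASE_ROUTERS lists this class.
class ReadReplicaRouter:
    """Send read queries to the replica and writes to the primary."""

    def db_for_read(self, model, **hints):
        return "read_replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same data set, so relations are allowed.
        return True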
Reporter | Updated • 5 years ago
Reporter | Comment 5 • 5 years ago
No read queries are being routed there.
I filed bug 1562017 to experiment with the idea.
Is it fine to experiment with Provisioned IOPS storage now, while resources are allocated for that work?
I've played a lot with CloudWatch graphs. I was wondering if you had a specific metric in mind that might be useful and that I have not yet thought of.
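For what it's worth, switching to Provisioned IOPS is a storage modification on the RDS instance; a minimal boto3 sketch, where the region, the instance identifier, and the 6,000 IOPS figure are placeholders rather than agreed-upon values:

import boto3

rds = boto3.client("rds", region_name="us-east-1")  # region assumed

# Switch the instance's storage to Provisioned IOPS (io1).
# "treeherder-prod" and the IOPS value are placeholders for illustration.
rds.modify_db_instance(
    DBInstanceIdentifier="treeherder-prod",
    StorageType="io1",
    Iops=6000,
    ApplyImmediately=False,  # apply during the next maintenance window
)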
Reporter | Comment 6 • 5 years ago
I've found the full CloudWatch console, where I can get down to 1-second intervals and SUM the two metrics. We can hit the 3,000 IOPS cap easily.
We can only look 3 hours into the past.
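The same Read + Write sum can also be pulled programmatically with CloudWatch metric math; a boto3 sketch at the standard 1-minute resolution, with the region and instance identifier assumed:

import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch", region_name="us-east-1")  # region assumed

def rds_iops_query(metric_name, query_id):
    # One RDS metric (ReadIOPS or WriteIOPS) for treeherder-prod (name assumed).
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/RDS",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "DBInstanceIdentifier", "Value": "treeherder-prod"},
                ],
            },
            "Period": 60,        # standard RDS metrics are published per minute
            "Stat": "Average",
        },
        "ReturnData": False,     # only return the summed expression below
    }

resp = cw.get_metric_data(
    MetricDataQueries=[
        rds_iops_query("ReadIOPS", "read"),
        rds_iops_query("WriteIOPS", "write"),
        {"Id": "total", "Expression": "read + write", "Label": "Total IOPS"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
)
print(resp["MetricDataResults"][0]["Values"][:5])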
Reporter | Comment 7 • 5 years ago
treeherder-dev only has 100 GiB more storage than the other instances, yet it is able to reach higher IOPS throughput. That extra 100 GiB should only allow for another 300 IOPS, but the graph shows close to 8,000 IOPS.
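One way to explain the gap would be to compare what each instance's storage actually allows; a boto3 sketch (region and instance identifiers assumed):

import boto3

rds = boto3.client("rds", region_name="us-east-1")  # region assumed

# Instance identifiers assumed for illustration.
for name in ("treeherder-prod", "treeherder-stage", "treeherder-dev"):
    db = rds.describe_db_instances(DBInstanceIdentifier=name)["DBInstances"][0]
    print(
        name,
        db["StorageType"],       # gp2 vs io1
        db["AllocatedStorage"],  # GiB
        db.get("Iops"),          # only set for Provisioned IOPS storage
    )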
Reporter | Comment 8 • 5 years ago
We fixed this in bug 1567257.
We determined that we were hitting the IOPS cap.
Reporter | Updated • 5 years ago