Closed Bug 1564584 Opened 5 years ago Closed 5 years ago

Determine source of slowdowns for treeherder-prod

Categories

(Tree Management :: Treeherder: Infrastructure, task, P3)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References

Details

Attachments

(4 files, 1 obsolete file)

When we have a sharp increase in Read IOPS, we start having trouble returning responses from treeherder's APIs.

On IRC, I suggested that perhaps we're reaching our IOPS cap; however, upon closer inspection we're most likely below the 3,000 IOPS provided by the 1 TB of General Purpose storage we have (3 IOPS per GiB).

Perhaps we can bump the storage by another 500 GB, which would give us another 1,500 IOPS, or we can look into switching to Provisioned IOPS storage. Pricing is found here. db.m5.2xlarge costs $1.368/hour. We might be able to move to a smaller instance type if we can guarantee the IOPS.
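
As a sanity check, here is a minimal sketch of the gp2 arithmetic behind those numbers (assuming the documented 3 IOPS per GiB baseline; the sizes below are rounded, not exact volume figures):

# Back-of-the-envelope gp2 baseline IOPS (3 IOPS per GiB of allocated storage).
# Sizes are rounded approximations of the volumes discussed above.

def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a General Purpose (gp2) volume."""
    return 3 * size_gib

print(gp2_baseline_iops(1000))  # current ~1 TB volume        -> ~3,000 IOPS
print(gp2_baseline_iops(1500))  # after adding another 500 GB -> ~4,500 IOPS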

This all assumes that IOPS is what is getting in the way.

:dividehex, :ckolos, do you have suggestions on how to find out what could be causing the issues? Or how we could be alerted when it happens? Or someone who could help me investigate live?

Comment on attachment 9076934 [details] Showing Read IOPS with Write IOPS of treeherder-prod https://irccloud.mozilla.com/file/S1u2hEFR/image.png
Attachment #9076934 - Attachment filename: file_1564584.txt → file_1564584.png
Attachment #9076934 - Attachment mime type: text/plain → image/png
Attachment #9076934 - Attachment is obsolete: true
Flags: needinfo?(jwatkins)
Flags: needinfo?(ckolos)

Armen, how well is treeherder utilizing the treeherder-prod-ro read replica? I don't see very many connections to that RDS instance, so I wonder whether read queries are being routed there at all.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html

Also, have you looked at the CloudWatch UI? If not, you might find more graphing options for the data being produced by RDS.

Flags: needinfo?(jwatkins)
Severity: normal → critical
Priority: -- → P3

No read queries are being routed there.
I filed bug 1562017 to experiment with the idea.
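
As a rough illustration of what that experiment might look like, here is a minimal Django database-router sketch. The "read_replica" alias, module path, and settings shown are hypothetical, not Treeherder's actual configuration:

# Hypothetical sketch: route ORM reads to a replica, writes to the primary.
# Alias names and module path are made up for illustration.

class ReadReplicaRouter:
    def db_for_read(self, model, **hints):
        return "read_replica"   # hypothetical alias for treeherder-prod-ro

    def db_for_write(self, model, **hints):
        return "default"        # the primary RDS instance

    def allow_relation(self, obj1, obj2, **hints):
        return True             # both aliases point at the same data

# In settings.py (illustrative):
# DATABASES = {"default": {...}, "read_replica": {...}}
# DATABASE_ROUTERS = ["treeherder.config.routers.ReadReplicaRouter"]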

Is it fine to experiment with IOPS storage now while resources are allocated for that work?

I've played a lot with the CloudWatch graphs. I was wondering if you had a specific metric in mind that might be useful and that I have not yet thought of.

I've found the actual CloudWatch console, where I can get down to 1-second intervals and SUM the two metrics (Read IOPS and Write IOPS). We hit the 3,000 IOPS mark easily.
At that resolution we can only look 3 hours into the past.
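
For reference, here is a sketch of pulling the combined Read+Write IOPS with boto3 and CloudWatch metric math. The region, DB instance identifier, and 60-second period are assumptions; standard RDS CloudWatch metrics are published at one-minute resolution, so this uses a 60-second period rather than the 1-second view mentioned above:

# Sketch (assumptions noted above): sum ReadIOPS and WriteIOPS for an RDS
# instance via CloudWatch metric math over the last 3 hours.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region assumed

def rds_metric(metric_name, query_id):
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/RDS",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "DBInstanceIdentifier", "Value": "treeherder-prod"},
                ],
            },
            "Period": 60,        # one-minute datapoints
            "Stat": "Average",
        },
        "ReturnData": False,     # only the summed expression is returned
    }

now = datetime.datetime.utcnow()
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        rds_metric("ReadIOPS", "read_iops"),
        rds_metric("WriteIOPS", "write_iops"),
        {"Id": "total_iops", "Expression": "read_iops + write_iops",
         "Label": "Total IOPS", "ReturnData": True},
    ],
    StartTime=now - datetime.timedelta(hours=3),
    EndTime=now,
)

total = next(r for r in resp["MetricDataResults"] if r["Id"] == "total_iops")
for timestamp, value in zip(total["Timestamps"], total["Values"]):
    print(timestamp, round(value))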

treeherder-dev only has 100 GiB more storage than the other instances, yet it is able to reach higher IOPS throughput. That extra storage should only allow for another 300 IOPS, but the graph shows close to 8,000 IOPS.

We fixed this in bug 1567257.
We determined that we were hitting the IOPS cap.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Type: defect → task