Closed Bug 1564584 Opened 5 years ago Closed 5 years ago

Determine source of slowdowns for treeherder-prod

Categories

(Tree Management :: Treeherder: Infrastructure, task, P3)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References

Details

Attachments

(4 files, 1 obsolete file)

When we have a sharp increase in Read IOPS, we start having trouble returning responses from treeherder's APIs.

On IRC, I suggested that perhaps we're reaching our IOPS cap; however, upon closer inspection we're most likely below the 3,000 IOPS provided by the 1 TB of General Purpose storage we have (3 IOPS per GiB).

Perhaps we can bump the storage by another 500 GB, which would give us another 1,500 IOPS, or we can look into switching to Provisioned IOPS storage. Pricing is found here. db.m5.2xlarge costs $1.368/hour. We might be able to move to a smaller instance type if we can guarantee the IOPS.
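
As a sanity check, here is a minimal sketch of the gp2 arithmetic behind those numbers (assuming the documented 3 IOPS per GiB baseline; the sizes below are rounded, not exact volume figures):

# Back-of-the-envelope gp2 baseline IOPS (3 IOPS per GiB of allocated storage).
# Sizes are rounded approximations of the volumes discussed above.

def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a General Purpose (gp2) volume."""
    return 3 * size_gib

print(gp2_baseline_iops(1000))  # current ~1 TB volume        -> ~3,000 IOPS
print(gp2_baseline_iops(1500))  # after adding another 500 GB -> ~4,500 IOPS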

This all assumes that IOPS is what is getting in the way.

:dividehex, :ckolos, do you have suggestions on how to find out what could be causing the issues? Or how we could be alerted when it happens? Or someone who could help me investigate live?

Comment on attachment 9076934 [details] Showing Read IOPS with Write IOPS of treeherder-prod https://irccloud.mozilla.com/file/S1u2hEFR/image.png
Attachment #9076934 - Attachment filename: file_1564584.txt → file_1564584.png
Attachment #9076934 - Attachment mime type: text/plain → image/png
Attachment #9076934 - Attachment is obsolete: true
Flags: needinfo?(jwatkins)
Flags: needinfo?(ckolos)

Armen, how well is treeherder utilizing the treeherder-prod-ro read replica? I don't see very many connections to that RDS instance, so I wonder whether read queries are being routed there at all.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html

Also, have you looked at the CloudWatch UI? If not, you might find more graphing options for the data being produced by RDS.

Flags: needinfo?(jwatkins)
Severity: normal → critical
Priority: -- → P3

No read queries are being routed there.
I filed bug 1562017 to experiment with the idea.
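
As a rough illustration of what that experiment might look like, here is a minimal Django database-router sketch. The "read_replica" alias, module path, and settings shown are hypothetical, not Treeherder's actual configuration:

# Hypothetical sketch: route ORM reads to a replica, writes to the primary.
# Alias names and module path are made up for illustration.

class ReadReplicaRouter:
    def db_for_read(self, model, **hints):
        return "read_replica"   # hypothetical alias for treeherder-prod-ro

    def db_for_write(self, model, **hints):
        return "default"        # the primary RDS instance

    def allow_relation(self, obj1, obj2, **hints):
        return True             # both aliases point at the same data

# In settings.py (illustrative):
# DATABASES = {"default": {...}, "read_replica": {...}}
# DATABASE_ROUTERS = ["treeherder.config.routers.ReadReplicaRouter"]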

Is it fine to experiment with IOPS storage now while resources are allocated for that work?

I've played a lot with the CloudWatch graphs. I was wondering if you had a specific metric in mind that might be useful and that I have not yet thought of.

I've found the actual CloudWatch console, where I can get down to 1-second intervals and SUM the two metrics (Read IOPS and Write IOPS). We hit the 3,000 IOPS mark easily.
At that resolution we can only look 3 hours into the past.
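
For reference, here is a sketch of pulling the combined Read+Write IOPS with boto3 and CloudWatch metric math. The region, DB instance identifier, and 60-second period are assumptions; standard RDS CloudWatch metrics are published at one-minute resolution, so this uses a 60-second period rather than the 1-second view mentioned above:

# Sketch (assumptions noted above): sum ReadIOPS and WriteIOPS for an RDS
# instance via CloudWatch metric math over the last 3 hours.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region assumed

def rds_metric(metric_name, query_id):
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/RDS",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "DBInstanceIdentifier", "Value": "treeherder-prod"},
                ],
            },
            "Period": 60,        # one-minute datapoints
            "Stat": "Average",
        },
        "ReturnData": False,     # only the summed expression is returned
    }

now = datetime.datetime.utcnow()
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        rds_metric("ReadIOPS", "read_iops"),
        rds_metric("WriteIOPS", "write_iops"),
        {"Id": "total_iops", "Expression": "read_iops + write_iops",
         "Label": "Total IOPS", "ReturnData": True},
    ],
    StartTime=now - datetime.timedelta(hours=3),
    EndTime=now,
)

total = next(r for r in resp["MetricDataResults"] if r["Id"] == "total_iops")
for timestamp, value in zip(total["Timestamps"], total["Values"]):
    print(timestamp, round(value))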

treeherder-dev only has 100 GiB more storage than the other instances, yet it is able to reach higher IOPS throughput. That extra storage should only allow for another 300 IOPS, but the graph shows close to 8,000 IOPS.

We fixed this in bug 1567257.
We determined that we were hitting the IOPS cap.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Type: defect → task