Bug 1330738 (Closed): log parser queue backlogs on stage and prod
Opened 8 years ago · Closed 8 years ago
Categories: Tree Management :: Treeherder: Infrastructure (defect, P1)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: emorley, Assigned: emorley
Attachments: 1 file
Over the last few days I've had multiple CloudAMQP queue alerts for the various stage log parser queues.
Bumping the number of workers didn't appear to help (bug 1292720 comment 6).
Looking at the AWS console it seems that stage RDS is under much higher load than prod, even taking into account stage being an m4.xlarge vs prod's m4.2xlarge (we'd intentionally kept stage smaller since it doesn't have much API load from users).
The cause is likely a combination of:
1) our data ingestion causing most of the DB load (and more than previously, post ORM migration), which means stage RDS carries almost the same load as prod even though no one is hitting its API (so the m4.xlarge vs m4.2xlarge reasoning doesn't hold as well)
2) stage RDS having 500GB provisioned vs prod's 750GB, so it has a baseline performance of 1500 IOPS rather than 2250 IOPS (see http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html#Concepts.Storage.GeneralSSD)
3) some queries no longer fitting in RAM on stage RDS, since it has half the RAM of prod's m4.2xlarge (16GB vs 32GB)
Bug 1330728 will look into #1.
In this bug I'll also try raising the stage storage from 500GB to 750GB to get the increased baseline IOPS performance (see the sketch below). This would have happened at some point in the future anyway, the next time we reset stage to prod (since restoring from a snapshot means inheriting the snapshot's allocated storage size).
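For reference, gp2 baseline performance scales at 3 IOPS per provisioned GiB, which is where the 1500 and 2250 figures come from. A minimal Terraform illustration of that arithmetic (not taken from the devservices-aws config; names here are made up):

# gp2 baseline IOPS is 3 IOPS per provisioned GiB (see the AWS docs linked above).
locals {
  stage_allocated_storage = 750                               # GiB; previously 500
  stage_baseline_iops     = 3 * local.stage_allocated_storage # 2250; previously 1500
}

output "stage_baseline_iops" {
  value = local.stage_baseline_iops
}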
Failing that, we can increase stage to an m4.2xlarge to match prod; however:
* it means we're less likely to catch DB perf issues on stage, since it will then have more headroom than prod (given no user API load)
* bug 1315329 converted dev+stage+prod to reserved instances, which may end up being wasted (hopefully another project or the read-only replica can use them instead)
Comment 1 • 8 years ago (Assignee)
Attachment #8826410 - Flags: review?(klibby)
Comment 2 • 8 years ago (Assignee)
I've had 10+ alerts from stage today, and a handful from prod too.
Updated • 8 years ago (Assignee)
Summary: log parser queue backlogs on stage → log parser queue backlogs on stage and prod
Comment 3 • 8 years ago (Assignee)
Comment on attachment 8826410 [details] [review]
devservices-aws PR #31: Increase stage RDS storage to 750GB
https://github.com/mozilla-platform-ops/devservices-aws/commit/ca99123da94fafb4ec222354d66c5021cdf8c077
Attachment #8826410 - Flags: checkin+
Comment 4 • 8 years ago
sekrit (master)$ terraform apply
data.terraform_remote_state.base: Refreshing state...
...
aws_db_instance.treeherder-stage-rds: Modifying...
allocated_storage: "500" => "750"
...
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
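The corresponding resource definition would look roughly like this (a hypothetical sketch inferred from the apply output above; the real definition lives in the devservices-aws repository):

resource "aws_db_instance" "treeherder-stage-rds" {
  allocated_storage = 750             # raised from 500; lifts gp2 baseline IOPS from 1500 to 2250
  storage_type      = "gp2"           # general purpose SSD, baseline 3 IOPS/GiB
  instance_class    = "db.m4.xlarge"  # stage stays one size below prod's db.m4.2xlarge
  # engine, identifier, credentials, networking, etc. omitted
}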
Updated • 8 years ago (Assignee)
Attachment #8826410 - Flags: review?(klibby) → review+
Updated • 8 years ago (Assignee)
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED