Figure out why the Heroku DB performance is worse than production

RESOLVED FIXED

Status

Product: Tree Management
Component: Treeherder
Priority: P3
Severity: normal
Status: RESOLVED FIXED
Reported: 2 years ago
Modified: 2 years ago

People

(Reporter: emorley, Assigned: emorley)

Description

2 years ago
I'm happy to dig into this.
We'll need access to the RDS instance on the IT AWS account.

Updated

2 years ago
Depends on: 1179860

Comment 1

2 years ago
If you look at, say:
https://rpm.newrelic.com/accounts/677903/applications/7385291/transactions?type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e61727469666163743a4172746966616374566965775365742e637265617465222c22225d

...then the DB does seem to be the slowest link (vs app).

We're using db.m3.xlarge, which has 4 vCPUs and 15 GiB of RAM:
http://aws.amazon.com/rds/details/

Now that I have access to the RDS instance settings, I can confirm it's in us-east-1e (secondary zone is us-east-1d) - which should be fine, given Heroku US is in us-east-1:
https://devcenter.heroku.com/articles/regions#data-center-locations

And we're definitely using the US Heroku (rather than EU):
https://dashboard.heroku.com/apps/treeherder-heroku/settings

RDS CPU utilisation is under 20%, and there is rarely less than 3 GB of free RAM.

Wonder if IOPS might be the problem? We're using General Purpose (SSD) storage, which gives us 3 IOPS per 1 GB of storage, so 300 IOPS - and it looks like we're typically using 250 IOPS:
https://console.aws.amazon.com/rds/home?region=us-east-1#dbinstances:id=treeherder-heroku;sf=all;v=mm
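For reference, a back-of-the-envelope check of the IOPS figures above - a minimal sketch assuming the 100 GB of storage implied by the 300 IOPS figure, and the 2015-era gp2 baseline of 3 IOPS per GB with a 100-IOPS floor (the function names here are hypothetical helpers, not part of any AWS API):

```python
# Hypothetical sketch of the gp2 IOPS maths in this comment.
# Assumption: gp2 ("General Purpose SSD") provisions a baseline of
# 3 IOPS per GB, with a floor of 100 IOPS (per the 2015-era gp2 docs;
# exact limits may differ today).

def gp2_baseline_iops(storage_gb: int) -> int:
    """Baseline IOPS for a gp2 volume of the given size."""
    return max(100, 3 * storage_gb)

def iops_headroom(storage_gb: int, observed_iops: float) -> float:
    """Fraction of the baseline still unused (negative => over budget)."""
    baseline = gp2_baseline_iops(storage_gb)
    return (baseline - observed_iops) / baseline

if __name__ == "__main__":
    # 100 GB of storage -> 300 IOPS baseline, as stated above.
    print(gp2_baseline_iops(100))             # 300
    # ~250 observed IOPS leaves only ~17% headroom before throttling.
    print(round(iops_headroom(100, 250), 2))  # 0.17
```

With only about 17% headroom, bursty load could plausibly hit the IOPS ceiling and queue, which would show up as the elevated DB latency seen in New Relic.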

Comment 2

2 years ago
Things seem much better post bug 1182201; many of the metrics (see last link in comment 8) improved by 30-60% around the 10th July (comparing weekday to weekday; the major dips are weekends).

e.g.:

Network transmit throughput: ~6.5 MB/s -> ~3.5 MB/s
Network receive throughput: ~1.0 MB/s -> ~0.4 MB/s
Write throughput: ~5 MB/s -> ~3 MB/s
Write latency: ~30ms -> ~10ms
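For reference, the before/after figures above work out to the following percentage drops (a standalone sketch; the metric names are just labels, not CloudWatch identifiers):

```python
# Percentage improvements behind the "30-60%" claim, computed from the
# before/after figures listed in this comment.
metrics = {
    "Network transmit throughput (MB/s)": (6.5, 3.5),
    "Network receive throughput (MB/s)":  (1.0, 0.4),
    "Write throughput (MB/s)":            (5.0, 3.0),
    "Write latency (ms)":                 (30.0, 10.0),
}

for name, (before, after) in metrics.items():
    drop = (before - after) / before * 100
    print(f"{name}: {before} -> {after} ({drop:.0f}% drop)")
```

This prints drops of roughly 46%, 60%, 40% and 67% respectively, in line with the 30-60% range quoted above (write latency improves a bit more than that).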

I'm guessing we were just overloading the network/disk IO quotas.

You can see the massive drop around the 10th July here:
https://rpm.newrelic.com/accounts/677903/applications/7385291/datastores?tw%5Bend%5D=1436959663&tw%5Bstart%5D=1436354863#/overview/All?value=total_call_time_per_minute

Comparing the DB response times for Heroku vs prod shows them to be much more similar now:

Heroku: https://rpm.newrelic.com/accounts/677903/applications/7385291/datastores#/overview/MySQL?value=average_response_time
Prod: https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores#/overview/MySQL?value=average_response_time

Removing the objectstore (bug 1140349) should reduce the DB write churn and rate of reads even more.

As is, I think we can call this fixed for now :-)
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Depends on: 1182201, 1140349
Resolution: --- → FIXED