I'm happy to dig into this. We'll need access to the RDS instance on the IT AWS account.
If you look at, say:
https://rpm.newrelic.com/accounts/677903/applications/7385291/transactions?type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e61727469666163743a4172746966616374566965775365742e637265617465222c22225d
...then the DB does seem to be the slowest link (vs the app).

We're using db.m3.xlarge, which has 4 vCPU and 15GiB of RAM:
http://aws.amazon.com/rds/details/

Now that I have access to the RDS instance settings, I can confirm it's in us-east-1e (secondary zone is us-east-1d) - which should be fine, given Heroku US is in us-east-1:
https://devcenter.heroku.com/articles/regions#data-center-locations

And we're definitely using the US Heroku (rather than EU):
https://dashboard.heroku.com/apps/treeherder-heroku/settings

RDS CPU utilisation is under 20%, and there rarely seems to be less than 3GB of free RAM.

Wonder if IOPS might be the problem? We're using General Purpose (SSD) storage, which gives us a baseline of 3 IOPS per 1GB of storage, so 300 IOPS - and it looks like we're typically using 250 IOPS:
https://console.aws.amazon.com/rds/home?region=us-east-1#dbinstances:id=treeherder-heroku;sf=all;v=mm
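For reference, the IOPS headroom arithmetic above can be sketched as follows. This is only a rough illustration: the 100GB allocation is an assumption inferred from "3 IOPS per 1GB, so 300 IOPS" (I haven't quoted the actual allocated size), and the function names are made up for the example.

```python
# Rough sketch of the IOPS headroom arithmetic for General Purpose storage.
# Assumption: 100GB allocated, inferred from the 300 IOPS figure (not confirmed).

def baseline_iops(storage_gb, iops_per_gb=3):
    """Baseline IOPS for a volume that earns a fixed number of IOPS per GB."""
    return storage_gb * iops_per_gb


def utilisation(observed_iops, available_iops):
    """Fraction of the baseline IOPS that is typically in use."""
    return observed_iops / available_iops


storage_gb = 100      # assumed allocation
observed_iops = 250   # typical figure read off the RDS console graphs

available = baseline_iops(storage_gb)
print(f"baseline: {available} IOPS")
print(f"typical utilisation: {utilisation(observed_iops, available):.0%}")
```

With those numbers we'd be sitting at roughly 83% of the baseline, which doesn't leave much headroom for spikes.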
Things seem much better post bug 1182201; many of the metrics (see last link in comment 8) improved by 30-60% around the 10th July (comparing weekday to weekday; the major dips are weekends), e.g.:

Network transmit throughput: ~6.5 MB/s -> ~3.5 MB/s
Network receive throughput: ~1.0 MB/s -> ~0.4 MB/s
Write throughput: ~5 MB/s -> ~3 MB/s
Write latency: ~30ms -> ~10ms

I'm guessing we were just overloading the network/disk IO quotas.

You can see the massive drop around the 10th July here:
https://rpm.newrelic.com/accounts/677903/applications/7385291/datastores?tw%5Bend%5D=1436959663&tw%5Bstart%5D=1436354863#/overview/All?value=total_call_time_per_minute

Comparing the DB response times for Heroku vs prod shows them to be much more similar now:
Heroku: https://rpm.newrelic.com/accounts/677903/applications/7385291/datastores#/overview/MySQL?value=average_response_time
Prod: https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores#/overview/MySQL?value=average_response_time

Removing the objectstore (bug 1140349) should reduce the DB write churn and rate of reads even more. As is, I think we can call this fixed for now :-)
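If anyone wants to redo the before/after comparison, here's a small sketch of the percentage arithmetic. The inputs are the approximate ("~") values quoted above, eyeballed from the graphs, so the results are only rough; the helper name is just for illustration.

```python
# Rough before/after comparison using the approximate figures quoted above.
# The inputs were read off graphs, so treat the percentages as ballpark only.

def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


metrics = {
    "Network transmit throughput (MB/s)": (6.5, 3.5),
    "Network receive throughput (MB/s)": (1.0, 0.4),
    "Write throughput (MB/s)": (5, 3),
    "Write latency (ms)": (30, 10),
}

for name, (before, after) in metrics.items():
    print(f"{name}: ~{percent_reduction(before, after):.0f}% lower")
```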