Closed Bug 1245079 Opened 9 years ago Closed 8 years ago

Heroku RDS instance has <20GB storage remaining

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

Details

Attachments

(2 files)

I have a CloudWatch alert that has just triggered. There's 20GB out of 100GB general storage remaining on the prototype Heroku RDS instance. Current consumption is at ~3GB/week.

Execute:

> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),2) size_gb, table_name
> FROM information_schema.tables
> GROUP BY table_name
> ORDER BY size_gb DESC
> LIMIT 10

+---------+---------------------------+
| size_gb | table_name                |
+---------+---------------------------+
|   24.78 | job_artifact              |
|   15.06 | job                       |
|   11.60 | failure_line              |
|    8.13 | performance_datum         |
|    5.53 | job_log_url               |
|    0.16 | revision                  |
|    0.05 | reference_data_signatures |
|    0.05 | revision_map              |
|    0.03 | result_set                |
|    0.02 | failure_match             |
+---------+---------------------------+
10 rows
Summary: Heroku RDS instances has <20GB storage remaining → Heroku RDS instance has <20GB storage remaining
Depends on: 1245080
This is now down to 6GB.

So it turns out the table size calculation query in comment 0 relies on the table having been recently analysed. I've done an ANALYZE TABLE on all Heroku RDS instance tables, using this to generate the list of statements:

SELECT CONCAT('ANALYZE TABLE ',table_schema,'.',table_name,';')
FROM information_schema.tables
WHERE table_schema = 'treeherder' OR table_schema LIKE '%_jobs%';

After running the ANALYZE statements in this attachment, the new breakdown is:

> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),2) size_gb, table_name
> FROM information_schema.tables
> GROUP BY table_name
> ORDER BY size_gb DESC
> LIMIT 10

+---------+---------------------------+
| size_gb | table_name                |
+---------+---------------------------+
|   24.10 | job_artifact              |
|   22.30 | failure_line              |
|   16.06 | job                       |
|   13.00 | performance_datum         |
|    9.21 | job_log_url               |
|    0.19 | revision                  |
|    0.06 | reference_data_signatures |
|    0.05 | revision_map              |
|    0.04 | result_set                |
|    0.03 | failure_match             |
+---------+---------------------------+
10 rows
I've modified the RDS instance from 100GB general storage to 150GB general storage. Performance will be reduced whilst this resize takes place.

We may also be able to free space by performing an OPTIMIZE on failure_line (since Cameron truncated it a while ago, at least for stage/prod - not sure if Heroku's RDS instance received the same). However, with only 6-7GB left there was not enough free disk to run this OPTIMIZE.

Please can we all try to keep an eye on data usage due to auto-classification (eg bug 1245080)? If I hadn't happened to come back to this, the instance would have run out of disk space.
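For reference, a minimal sketch of the reclaim step mentioned above - assuming failure_line lives in the treeherder schema (the schema name here is an assumption) and uses InnoDB:

-- On InnoDB, OPTIMIZE TABLE maps to a full table rebuild (ALTER TABLE ... FORCE),
-- so it temporarily needs free disk space roughly the size of the table + indexes.
OPTIMIZE TABLE treeherder.failure_line;

That rebuild requirement is why it couldn't be run with only 6-7GB free.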
The Cloudwatch alert just triggered again. Free space has gone down from 45GB to 18GB in the last 24 hours (presumably partly due to failed migrations or ...?). Total disk is 150GB. James, can I leave this in your hands? (I only have one full working day left between now and 4th April, and unfortunately don't have any more cycles for classification/perfherder infra issues) Ask fubar to add you to the treeherder IAM group on the devservices AWS account if you don't already have access :-)
Flags: needinfo?(james)
+---------+-----------------------+
| size_gb | table_name            |
+---------+-----------------------+
|   24.61 | job_artifact          |
|   23.63 | failure_line          |
|   17.40 | job                   |
|   13.00 | performance_datum     |
|    9.22 | job_log_url           |
|    0.33 | text_log_summary      |
|    0.29 | text_log_summary_line |
|    0.21 | celery_taskmeta       |
|    0.20 | revision              |
|    0.17 | performance_signature |
+---------+-----------------------+

It doesn't look like the size of anything exploded recently.
Was an analyze performed prior? (comment 1)
After ANALYZEing all the things:

+---------+-----------------------+
| size_gb | table_name            |
+---------+-----------------------+
|   27.60 | failure_line          |
|   24.78 | job_artifact          |
|   17.48 | job                   |
|   15.85 | performance_datum     |
|   10.22 | job_log_url           |
|    0.37 | text_log_summary      |
|    0.33 | text_log_summary_line |
|    0.21 | revision              |
|    0.19 | celery_taskmeta       |
|    0.18 | performance_signature |
+---------+-----------------------+

So, bigger, but not tens of GB bigger.
It's a small win, but I guess we can `DROP TABLE celery_taskmeta` now?
You are receiving this email because your Amazon CloudWatch Alarm "treeherder-heroku-RDS-disk-space-low" in the US - N. Virginia region has entered the ALARM state, because "Threshold Crossed: 2 datapoints were less than the threshold (2.097152E10). The most recent datapoints: [2.0970786816E10, 2.09591590912E10]." at "Thursday 05 May, 2016 23:56:59 UTC".

View this alarm in the AWS Management Console:
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#s=Alarms&alarm=treeherder-heroku-RDS-disk-space-low

Alarm Details:
- Name: treeherder-heroku-RDS-disk-space-low
- Description:
- State Change: OK -> ALARM
- Reason for State Change: Threshold Crossed: 2 datapoints were less than the threshold (2.097152E10). The most recent datapoints: [2.0970786816E10, 2.09591590912E10].
- Timestamp: Thursday 05 May, 2016 23:56:59 UTC
- AWS Account: 699292812394

Threshold:
- The alarm is in the ALARM state when the metric is LessThanThreshold 2.097152E10 for 300 seconds.

Monitored Metric:
- MetricNamespace: AWS/RDS
- MetricName: FreeStorageSpace
- Dimensions: [DBInstanceIdentifier = treeherder-heroku]
- Period: 300 seconds
- Statistic: Average
- Unit: not specified

State Change Actions:
- OK:
- ALARM: [arn:aws:sns:us-east-1:699292812394:treeherder-devs]
- INSUFFICIENT_DATA:
James, do you have RDS admin panel access?

(In reply to Ed Morley [:emorley] from comment #3)
> Ask fubar to add you to the treeherder IAM group on the devservices AWS
> account if you don't already have access :-)
Ah, no. fubar, can you help?
Flags: needinfo?(james) → needinfo?(klibby)
argh, sorry for forgetting to add your account. that end of March was a bit crazy. account created and info sent.
Flags: needinfo?(klibby)
Seems like we maybe should enable table compression on the very large tables? [1] Might be good to get a DBA to confirm that's not a terrible idea.

[1] http://dev.mysql.com/doc/refman/5.6/en/innodb-compression-usage.html
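To make that concrete, a minimal sketch of what enabling compression could look like (assuming InnoDB tables with innodb_file_per_table enabled and the Barracuda file format, as the linked 5.6 docs require; the schema/table names and KEY_BLOCK_SIZE here are illustrative assumptions, not a tested recommendation):

-- Compressed row format; KEY_BLOCK_SIZE=8 targets 8KB compressed pages,
-- trading extra CPU on reads/writes for a smaller on-disk footprint.
ALTER TABLE treeherder.failure_line ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
ALTER TABLE treeherder.job_artifact ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

Note that each ALTER is itself a table rebuild, so it would need the same free-space headroom as an OPTIMIZE.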
Flags: needinfo?(scabral)
(In reply to James Graham [:jgraham] from comment #12)
> Seems like we maybe should enable table compression on the very large
> tables? [1] Might be good to get a DBA to confirm that's not a terrible idea.
>
> [1] http://dev.mysql.com/doc/refman/5.6/en/innodb-compression-usage.html

We've talked about this a bit before:
http://logs.glob.uno/?c=mozilla%23treeherder&s=25+Feb+2016&e=25+Feb+2016&h=GB#c92350

I'm fine with completely truncating the performance data tables for the Heroku prototype, which I assume will be going away once the migration finishes. I'm not using them for anything. IMO spending time micro-optimizing database storage on a prototype is not a good use of our time.
I agree that "have enough storage space" is a good baseline, and we shouldn't spend too much time trying to cram everything under some largely artificial limit. But it sounds like compressing the table data may be a performance win (trading less I/O for more CPU usage), so it could be worthwhile even once we have more storage available.
(In reply to William Lachance (:wlach) from comment #13)
> IMO spending time micro-optimizing database storage on a prototype is not a
> good use of our time.

I agree, and I'm not against us increasing the prototype storage (again). I'd just like to avoid yet another infra issue eating into my goals again :-)
Sadly, infra issues don't go away on AWS, they are just shifted. Compression is tricky because there is the I/O vs. CPU tradeoff, so it depends on how fast/slow each is. Try it in test, and run benchmarks, is my advice.
Flags: needinfo?(scabral)
This is now alerting again (I've just updated the alert so it goes to the new treeherder-internal list not just me). It has 10GB remaining out of 200GB.

A new breakdown:

> SELECT ROUND(SUM((data_length+index_length)/power(1024,3)),2) size_gb, table_name
> FROM information_schema.tables
> GROUP BY table_name
> ORDER BY size_gb DESC
> LIMIT 10

+---------+-----------------------+
| size_gb | table_name            |
+---------+-----------------------+
|   44.04 | failure_line          |
|   31.03 | job_artifact          |
|   24.46 | job                   |
|   21.38 | performance_datum     |
|   14.27 | job_log_url           |
|    1.39 | text_log_summary_line |
|    1.23 | text_log_summary      |
|    0.64 | performance_signature |
|    0.28 | revision              |
|    0.19 | celery_taskmeta       |
+---------+-----------------------+
10 rows

Though the recent spike is due to the table copy for the latest migration.
The instance was down to 24GB remaining (after having dropped to 16GB during a DB migration). I've bumped provisioned storage from 250GB to 400GB (it's only $0.230 per GB-month, so not worth keeping on doing this incrementally). This takes the prototype instance closer to the 500GB that stage/prod RDS have.
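(For scale, at that rate the extra 150GB works out to roughly 150 x $0.230 = $34.50/month.)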
Fixed after the reset to prod, since it's now matching prod's storage.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED