Closed Bug 1303150 Opened 5 years ago Closed 4 years ago
Make TC AWS costs less eye-watering
## Total Cost

Looking just at the TC accounts, the services costing us >1% of our spend are

- EC2 - Compute
- S3 - Storage
- CloudFront - Transfer
- EBS - Storage
- EC2 - Transfer
- S3 - Transfer

so anything beyond that is unlikely to move the needle. The first two items are the big ones, and if we want to make a big dent in this cost, those are the places to start.

## EC2 - Compute ($96k in August)

About $6k of that is on-demand instances. I see only four running: two for docker cloud, one for elastic container service (I emailed about that a few months ago..), and one that belongs to pete. My devel host is in there too. There are a bunch of stopped windows systems as well -- looks like just the base machines in use1 and usw1, but a bunch more in usw2. So it's hard to see why that's $6k. Breaking it down by instance type shows a patchwork, but mostly c4.*, which seems to be what docker cloud is using. So I'm guessing that's cloud-mirror.

The remainder is spot. I've attached a mapping from instance type to workerType. The top offenders are

- m3.xlarge - $28k - desktop-test-xlarge / gecko-decision
- c3.xlarge - $15k - desktop-test-xlarge / gecko-decision
- c3.2xlarge - $20k - windows build / android
- m1.medium - $10k - desktop-test
- c4.4xlarge - $7k - linux build
- m3.2xlarge - $6k - flame-kk / android

I've included a graph of those instance types by week.

## S3 - Storage ($47k)

I've attached a shot of our S3 usage by week. The elephant in the room is "taskcluster-public-artifacts" at about $45k/mo. "taskcluster-private-artifacts" and "taskcluster-artifacts" are distant second and third at around $1k/mo, and the rest is noise. The worrying bit is the obvious trend in S3 costs -- $2k/wk to $12k/wk in the last year.

Could we get rid of the "taskcluster-artifacts" bucket?

Our S3 transfer rates look to be almost entirely regional, which I assume is the "free" category :)

## CloudFront - Transfer ($9k)

Per Travis, these are not our responsibility.
Considered remediations, with monthly savings

S3 (max of $47k)
- delete unused taskcluster-artifacts bucket (bug 1303147) ($1k)
- delete old try artifacts (bug 1303153) (??)
- configure more branches (e.g., integration) with shorter retention (??)

EC2 (max of $96k)
- audit hidden jobs: maybe kill whole swathes of permaorange? (small - ?? $4k)
- run tier-2 jobs on fewer branches (maybe just central and not integration?) (temporary savings only)
- look at the cost/performance tradeoff of the various desktop-test-* instances (?? $20k)
Some other things that could (and should eventually) be cleaned are listed below. They do not amount to much though.

- Unused AMIs - > 1300 snapshots costing around $300/month
- Unused EBS volumes - 61 created before August 30th that are "available" but not attached ($300/month)
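For the unattached EBS volumes, a minimal sketch of the filter might look like the following. It assumes the volume records are dictionaries shaped like the `Volumes` entries of an EC2 DescribeVolumes response (`VolumeId`, `State`, `Size`, `CreateTime`). The sample records and the year of the cutoff date are hypothetical and only illustrate the logic.

```python
from datetime import datetime, timezone


def unattached_volumes(volumes, created_before):
    """Return volumes that are 'available' (not attached) and older than the cutoff."""
    return [
        v for v in volumes
        if v["State"] == "available" and v["CreateTime"] < created_before
    ]


# Hypothetical sample data, shaped like a DescribeVolumes "Volumes" list.
sample_volumes = [
    {"VolumeId": "vol-aaa", "State": "available", "Size": 120,
     "CreateTime": datetime(2016, 7, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-bbb", "State": "in-use", "Size": 120,
     "CreateTime": datetime(2016, 7, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 60,
     "CreateTime": datetime(2016, 9, 5, tzinfo=timezone.utc)},
]

# The year is an assumption; the comment above only says "August 30th".
cutoff = datetime(2016, 8, 30, tzinfo=timezone.utc)

stale = unattached_volumes(sample_volumes, cutoff)
stale_ids = [v["VolumeId"] for v in stale]
stale_gib = sum(v["Size"] for v in stale)  # total size of the candidates, in GiB
print(stale_ids, stale_gib)
```

Anything this returns is only a candidate for deletion; someone would still need to confirm that a volume is really unused before removing it.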
Another idea for S3 - switch to the infrequent access storage class.
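As a rough illustration of the infrequent access idea, the sketch below builds a lifecycle configuration that moves objects to the `STANDARD_IA` storage class after a given number of days. The helper name, rule ID, and day counts are hypothetical. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, and the 30-day floor reflects S3's minimum age for lifecycle transitions to Standard-IA, as far as I know.

```python
def lifecycle_config(prefix, ia_after_days, expire_after_days=None, rule_id="move-to-ia"):
    """Build an S3 lifecycle configuration that transitions objects to STANDARD_IA.

    All parameter values are illustrative; nothing here is taken from the
    actual bucket configuration.
    """
    if ia_after_days < 30:
        # S3 does not allow lifecycle transitions to STANDARD_IA before 30 days.
        raise ValueError("ia_after_days must be at least 30")
    rule = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
        "Status": "Enabled",
        "Transitions": [{"Days": ia_after_days, "StorageClass": "STANDARD_IA"}],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


# Hypothetical values: move everything to IA after 30 days, expire after a year.
config = lifecycle_config("", ia_after_days=30, expire_after_days=365)
print(config)
```

Note that artifacts expiring within 14 days would never reach the transition, so this would only help for longer-lived artifacts.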
Travis is looking into moving our CloudFront cost ($8k/month) to another cost center, because it should have been billed differently.
This will not solve an immediate problem, but I opened bug 1303214 to lower the default artifact expiration for jobs running on Try in Buildbot.
Also, once we determine the total size used by try jobs (from buildbot and taskcluster), I will send an email to dev-platform suggesting that we remove try artifacts that are older than 14 days.
In bug 1303153 I estimated that we can save about $25k in S3 storage without much effort, with diminishing returns after that. So the total savings available so far is about $35k. We already have a stated retention policy of 14 days for try jobs, so I don't think we need to re-request that permission. However, we do need to ask about integration branches. I think the place to look for further cost savings is in EC2, and in particular at the utilization of the test instances. If we can get another $20k savings there, then that just leaves $30k for releng to shave off and we are at the $85k combined goal.
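The arithmetic behind that last sentence, written out as a quick check (all figures are the monthly amounts quoted above; the variable names are mine):

```python
# Monthly figures from the comment above, in dollars.
combined_goal = 85_000       # stated combined savings goal
savings_so_far = 35_000      # total savings identified so far
ec2_target = 20_000          # hoped-for savings from test instance utilization

remaining_for_releng = combined_goal - savings_so_far - ec2_target
print(remaining_for_releng)
```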
I put a bunch of useful data on all tasks (durations, branches, workerType, tiers, create date, and S3 storage) at https://s3.amazonaws.com/taskcluster-bug1303153/tasks.csv. Please have a look and analyze the heck out of it.
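One possible starting point for that analysis is to total task count, duration, and S3 storage per workerType. This is only a sketch: the column names (`workerType`, `duration_seconds`, `s3_bytes`) and the sample rows are assumptions, since the real header of tasks.csv is not shown here, so they would need to be adjusted to match the file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample with assumed column names; not taken from tasks.csv.
SAMPLE = """workerType,duration_seconds,s3_bytes
desktop-test,600,1000
desktop-test-xlarge,1200,5000
desktop-test,300,2000
"""


def totals_by_worker_type(csv_file):
    """Sum task count, duration, and S3 bytes for each workerType."""
    totals = defaultdict(lambda: {"tasks": 0, "duration_seconds": 0.0, "s3_bytes": 0})
    for row in csv.DictReader(csv_file):
        entry = totals[row["workerType"]]
        entry["tasks"] += 1
        entry["duration_seconds"] += float(row["duration_seconds"])
        entry["s3_bytes"] += int(row["s3_bytes"])
    return dict(totals)


totals = totals_by_worker_type(io.StringIO(SAMPLE))

# Show the workerTypes using the most S3 storage first.
for worker_type, entry in sorted(
    totals.items(), key=lambda kv: kv[1]["s3_bytes"], reverse=True
):
    print(worker_type, entry)
```

To run it against the real data, pass an open file handle for the downloaded CSV instead of the in-memory sample.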
This has tracked a lot of work, and we're no longer terribly concerned with cutting costs.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED