Closed Bug 1303150 Opened 5 years ago Closed 4 years ago

Make TC AWS costs less eye-watering

Categories

(Taskcluster :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Unassigned)

Attachments

(3 files)

Attached file workerTypes.txt
## Total Cost

Looking just at the TC accounts, the services costing us >1% of our spend are

EC2 - Compute  
S3 - Storage       
CloudFront - Transfer  
EBS - Storage  
EC2 - Transfer 
S3 - Transfer

so anything beyond that is unlikely to move the needle.  The first two items are the big ones, and if we want to make a big dent in this cost, those are the places to start.

## EC2 - Compute ($96k in August)

About $6k of that is on-demand instances.  I see only four running: two for Docker Cloud, one for Elastic Container Service (I emailed about that a few months ago), and one that belongs to pete.  My devel host is in there too.  There are a bunch of stopped Windows systems as well -- looks like just the base machines in use1 and usw1, but a bunch more in usw2.  So it's hard to see why that's $6k.  Breaking it down by instance type shows a patchwork, but mostly c4.*, which seems to be what Docker Cloud is using.  So I'm guessing that's cloud-mirror.

The remainder is spot.  I've attached a mapping from instance type to workerType.  The top offenders are

m3.xlarge - $28k - desktop-test-xlarge / gecko-decision
c3.xlarge - $15k - desktop-test-xlarge / gecko-decision
c3.2xlarge - $20k - windows build / android
m1.medium - $10k - desktop-test
c4.4xlarge - $7k - linux build
m3.2xlarge - $6k - flame-kk / android

I've included a graph of those instance types by week.
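
As a rough sanity check, the instance types listed above can be summed against the $96k EC2 total to see how much spot spend is left unexplained. This is only a sketch using the rounded figures quoted in this comment; the function and variable names are made up for illustration.

```python
# Rounded August figures quoted in this comment, in thousands of USD.
EC2_TOTAL_K = 96
ONDEMAND_K = 6
SPOT_TOP_K = {
    "m3.xlarge": 28,
    "c3.xlarge": 15,
    "c3.2xlarge": 20,
    "m1.medium": 10,
    "c4.4xlarge": 7,
    "m3.2xlarge": 6,
}


def spot_summary(total_k, ondemand_k, top_k):
    """Return (spot total, listed total, unlisted remainder) in $k."""
    spot_k = total_k - ondemand_k
    listed_k = sum(top_k.values())
    return spot_k, listed_k, spot_k - listed_k


spot_k, listed_k, other_k = spot_summary(EC2_TOTAL_K, ONDEMAND_K, SPOT_TOP_K)
print(f"spot ${spot_k}k, listed ${listed_k}k, other ${other_k}k")
```

With these rounded numbers, the six listed instance types account for nearly all of the spot spend.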

## S3 - Storage ($47k)

I've attached a shot of our S3 usage by week.  The elephant in the room is "taskcluster-public-artifacts" at about $45k/mo.  "taskcluster-private-artifacts" and "taskcluster-artifacts" are distant second and third at around $1k/mo and the rest is noise.

The worrying bit is the obvious trend in S3 costs -- $2k/wk to $12k/wk in the last year.

Could we get rid of the "taskcluster-artifacts" bucket?

Our S3 transfer rates look to be almost entirely regional, which I assume is the "free" category :)

## CloudFront - Transfer ($9k)

Per Travis, these are not our responsibility.
Attached image s3-overall.png
Depends on: 1303153
Considered remediations, with monthly savings

 S3 (max of $47k)
  - delete unused taskcluster-artifacts bucket (bug 1303147) ($1k)
  - delete old try artifacts (bug 1303153) (??)
  - configure more branches (e.g., integration) with shorter retention (??)

 EC2 (max of $96k)
  - audit hidden jobs: maybe kill whole swathes of permaorange? (small - ?? $4k)
  - run tier-2 jobs on fewer branches (maybe just central and not integration?) (temporary savings only)
  - look at the cost/performance tradeoff of the various desktop-test-* instances (?? $20k)
Some other things that could (and should eventually) be cleaned are listed below.  They do not amount to much though.

Unused AMIs
  - more than 1300 snapshots costing around $300/month

Unused EBS volumes
  - 61 created before August 30th that are "available" but not attached ($300/month)
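
For reference, a minimal sketch of how the unattached volumes could be picked out. It assumes volume records shaped like the "Volumes" entries returned by boto3's EC2 `describe_volumes` call; the helper name `unattached_volumes` and the cutoff parameter are hypothetical, and the live call is shown only as a comment because it needs credentials.

```python
from datetime import datetime, timezone


def unattached_volumes(volumes, created_before):
    """Return volumes in the "available" state created before the cutoff.

    `volumes` is a list of dicts with at least "VolumeId", "State" and
    "CreateTime" (a timezone-aware datetime), as describe_volumes returns.
    """
    return [
        v for v in volumes
        if v["State"] == "available" and v["CreateTime"] < created_before
    ]


# Possible use against a live account (not run here; the region is an
# assumption, and large accounts would need pagination):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-west-2")
#   vols = ec2.describe_volumes(
#       Filters=[{"Name": "status", "Values": ["available"]}])["Volumes"]
#   stale = unattached_volumes(vols, cutoff)
```

The state check is repeated in Python so the helper also works on an unfiltered listing.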
Another idea for S3 - change to use the infrequent access policy.
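
A sketch of what such a policy might look like as an S3 lifecycle rule, built as the dict that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix, rule ID, and helper name are assumptions for illustration, not our actual configuration. S3 does not allow a transition to STANDARD_IA before 30 days, so the helper rejects shorter values.

```python
def infrequent_access_rule(prefix, transition_days=30, expire_days=None):
    """Build a lifecycle configuration moving objects to STANDARD_IA."""
    if transition_days < 30:
        raise ValueError("STANDARD_IA transitions need at least 30 days")
    rule = {
        "ID": f"ia-after-{transition_days}d",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": transition_days, "StorageClass": "STANDARD_IA"},
        ],
    }
    if expire_days is not None:
        rule["Expiration"] = {"Days": expire_days}
    return {"Rules": [rule]}


config = infrequent_access_rule("", transition_days=30, expire_days=365)

# Applying it would look roughly like this (not run here; note that the
# call replaces any lifecycle configuration already on the bucket):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="taskcluster-public-artifacts",
#       LifecycleConfiguration=config)
```

Infrequent access carries retrieval charges and minimum-duration billing, so whether it saves money depends on how often old artifacts are actually fetched.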
Travis is looking into moving our CloudFront cost ($8k/month) to another cost center because it should have been billed differently.
This will not solve an immediate problem, but I opened up bug 1303214 to lower the default artifact expiration for jobs running on Try in Buildbot.
Also, once we determine the total size used by try jobs (from buildbot and taskcluster), I will send an email to dev-platform suggesting that we remove try artifacts that are older than 14 days.
Depends on: 1303319
In bug 1303153 I estimated that we can save about $25k in S3 storage without much effort, with diminishing returns after that.  So the total savings available so far is about $35k.

We already have a stated retention policy of 14 days for try jobs, so I don't think we need to re-request that permission.  However, we do need to ask about integration branches.

I think the place to look for further cost savings is in EC2, and in particular at the utilization of the test instances.  If we can get another $20k savings there, then that just leaves $30k for releng to shave off and we are at the $85k combined goal.
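
The arithmetic behind that, as a small sketch using only the monthly figures from this comment (names are illustrative):

```python
# Monthly figures from this comment, in thousands of USD.
GOAL_K = 85          # combined savings goal
IDENTIFIED_K = 35    # savings identified so far
EC2_TARGET_K = 20    # hoped-for savings from test instance utilization


def remaining_to_goal(goal_k, *savings_k):
    """Return how much of the goal is left after the given savings."""
    return goal_k - sum(savings_k)


left_k = remaining_to_goal(GOAL_K, IDENTIFIED_K, EC2_TARGET_K)
print(f"left to find: ${left_k}k")
```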
I put a bunch of useful data on all tasks, durations, branches, workerType, tiers, create date, and S3 storage at
  https://s3.amazonaws.com/taskcluster-bug1303153/tasks.csv
please have a look and analyze the heck out of it.
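
As a starting point, here is a minimal sketch of totalling one column per group with the standard `csv` module. The column names ("workerType", "branch", "duration") and the sample rows are made up; check the real header of tasks.csv and adjust before relying on it.

```python
import csv
import io
from collections import defaultdict


def total_by(rows, key, value):
    """Sum the numeric `value` column for each distinct `key` value."""
    totals = defaultdict(float)
    for row in rows:
        # Treat an empty cell as zero.
        totals[row[key]] += float(row[value] or 0)
    return dict(totals)


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "workerType,branch,duration\n"
    "desktop-test,try,120\n"
    "desktop-test-xlarge,try,300\n"
    "desktop-test,mozilla-central,60\n"
)
by_worker = total_by(csv.DictReader(sample), "workerType", "duration")
print(by_worker)
```

For the real data, replace `sample` with `open("tasks.csv", newline="")` after downloading the file.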
Depends on: 1303839
Depends on: 1303841
Depends on: 1304177
Depends on: 1304180
Depends on: 1304181
Depends on: 1304182
Depends on: 1304723
Duplicate of this bug: 1268670
Depends on: 1268668
Assignee: garndt → nobody
This has tracked a lot of work, and we're no longer terribly concerned with cutting costs.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED