Bug 1431262 (Closed)
Opened 7 years ago, closed 6 years ago
Prune EBS snapshots belonging to unknown AMIs
Categories: Taskcluster :: Operations and Service Requests, task, P5
Tracking: Not tracked
Status: RESOLVED DUPLICATE of bug 1373754
People: gps (Reporter), Unassigned
Attachments (1 file, 2 obsolete files): 6.04 KB, text/plain
I was poking through the mozilla-taskcluster AWS account and noticed that us-west-2 had a ton of EBS snapshots. Terabytes from visual inspection. Many of them appeared to be associated with AMIs.
I wrote a script (attached) that iterates through all EC2 regions and finds all EBS snapshots that are associated with unknown AMIs. The output of that script is as follows (I pruned EC2 regions without orphaned snapshots):
eu-central-1:
48139 AMIs
2633 Snapshots
160588 GB Total snapshot storage
2605 AMI snapshots
160364 GB Total AMI snapshot storage
2519 Orphaned AMI snapshots
153644 GB Orphaned AMI snapshot storage
eu-west-1:
66348 AMIs
2 Snapshots
240 GB Total snapshot storage
2 AMI snapshots
240 GB Total AMI snapshot storage
0 Orphaned AMI snapshots
0 GB Orphaned AMI snapshot storage
us-east-1:
101279 AMIs
3204 Snapshots
171783 GB Total snapshot storage
3097 AMI snapshots
170703 GB Total AMI snapshot storage
2595 Orphaned AMI snapshots
154238 GB Orphaned AMI snapshot storage
us-east-2:
23102 AMIs
670 Snapshots
52730 GB Total snapshot storage
670 AMI snapshots
52730 GB Total AMI snapshot storage
636 Orphaned AMI snapshots
50000 GB Orphaned AMI snapshot storage
us-west-1:
67612 AMIs
3120 Snapshots
164321 GB Total snapshot storage
3005 AMI snapshots
163401 GB Total AMI snapshot storage
2535 Orphaned AMI snapshots
149103 GB Orphaned AMI snapshot storage
us-west-2:
75367 AMIs
3962 Snapshots
191147 GB Total snapshot storage
3957 AMI snapshots
191107 GB Total AMI snapshot storage
2646 Orphaned AMI snapshots
155718 GB Orphaned AMI snapshot storage
Annual cost of orphaned AMI snapshots: ~$174953
Assuming the logic in the script is sound, running the script with --prune will save Mozilla ~$175,000 annually.
It's worth noting that we'll need to run this cleanup periodically because some process in the wild is creating these orphaned snapshots. We should probably get this installed as a periodic task somewhere. Given the potential cost savings, it is well worth our time to do that.
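For reference, the core orphan-detection idea can be sketched like this. This is a minimal sketch, not the attached script itself: the `find_orphans` helper and the description-parsing regex are assumptions about how the snapshot-to-AMI association is recovered. In practice, `snapshots` would come from paginated `describe_snapshots` calls (owner "self") and `known_amis` from `describe_images`, per region.

```python
import re

# EC2 writes descriptions like
#   "Created by CreateImage(i-0abc...) for ami-0def... from vol-0123..."
# on snapshots backing AMIs, so the AMI ID can be parsed back out.
AMI_RE = re.compile(r"for (ami-[0-9a-f]+)")

def find_orphans(snapshots, known_amis):
    """Return snapshots whose description names an AMI that no longer exists."""
    orphans = []
    for snap in snapshots:
        m = AMI_RE.search(snap.get("Description", ""))
        if m and m.group(1) not in known_amis:
            orphans.append(snap)
    return orphans
```

With `--prune`, the script would then issue a `delete_snapshot` call for each returned snapshot.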
Attachment #8943442 - Flags: review?(dustin)
Comment 1•7 years ago
Comment on attachment 8943442 [details]
Script to identify and remove orphaned AMI snapshots
The script looks solid to me.
Grenade, do you think these are from something in OCC? Or, wcosta, are we just not cleaning up any of the snapshots associated with AMIs we create?
Reporter
Comment 2•7 years ago
The other thing here is we have tons of old AMIs sitting around that will likely never be used. Most (all?) belong to docker and windows workers. We could prune old AMIs and snapshots to save even more.
Reporter
Comment 3•7 years ago
Now with concurrent.futures for faster execution.
Also tweaked the output a bit to display totals at the bottom. It now yields:
10931 Total orphaned AMI snapshots
662703 GB Total orphaned snapshot storage
$174953 Estimated annual storage cost
662 PB of storage (~60 GB/snapshot). Good times.
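The concurrency change can be sketched roughly as follows. This is a hypothetical `scan_regions` helper, not the attachment itself; `scan_one` stands in for the per-region work, which in the real script wraps the EC2 API calls.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_regions(regions, scan_one):
    """Scan every region concurrently.

    scan_one(region) returns that region's (orphan_count, orphan_gb) tuple;
    threads suit this workload since it is I/O-bound on API calls.
    """
    with ThreadPoolExecutor(max_workers=max(len(regions), 1)) as pool:
        results = dict(zip(regions, pool.map(scan_one, regions)))
    total_count = sum(c for c, _ in results.values())
    total_gb = sum(g for _, g in results.values())
    return results, total_count, total_gb
```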
Attachment #8943442 - Attachment is obsolete: true
Reporter
Comment 4•7 years ago
That last comment should obviously have been 662 TB, not PB. Still nothing to sneeze at!
Comment 5•7 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Comment on attachment 8943442 [details]
> Script to identify and remove orphaned AMI snapshots
>
> The script looks solid to me.
>
> Grenade, do you think these are from something in OCC? Or, wcosta, are we
> just not cleaning up any of the snapshots associated with AMIs we create?
Honestly, I thought ec2-manager was already doing this. We can use gps' script as a hook.
Flags: needinfo?(wcosta)
Comment 6•7 years ago
(In reply to Wander Lairson Costa [:wcosta] from comment #5)
> Honestly, I thought ec2-manager was already doing this. We can use gps'
> script as a hook.
AMI management has never been in the purview of the ec2-manager or provisioner. We had a UCOSP project in place to monitor ebs volume usage, but that was not snapshots and never landed in a completed form.
If we have a set of scripts relevant to monitoring our EC2 account, I'm happy to include them in the hourly sweep over our EC2 account that's built into the EC2-Manager, but those would be new requirements.
Comment 7•7 years ago
Greg, the EC2-Manager does hourly sweeps of the EC2 account for reconciling state drift from the eventually consistent nature of the EC2 API. Since this is a state reconciliation, where differences are expected, there's no notification system. If you think checks like this one would be useful, I'm happy to use the EC2-Manager infrastructure to do so. We could probably hook up SNS when a threshold is hit.
There's a lot of activity in the EC2-Manager codebase right now, dealing with the new spot model, but if you'd like to check out the situation, the repository is https://github.com/taskcluster/ec2-manager. The most relevant file is lib/housekeeping.js. My only request is that the checks use the runaws wrapper, the checks are all tested with fully offline mocks, and they don't do list-the-world-in-one-call style API requests.
Let me know if there's anything that I can do to help with integration, if we want it!
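The "no list-the-world-in-one-call" constraint amounts to walking the NextToken pagination that EC2's describe calls expose. A generic sketch, where `fetch_page` is a stand-in for a wrapped, mockable API call (e.g. something built on the runaws wrapper mentioned above) and the `"Snapshots"` key is an assumption matching `describe_snapshots`:

```python
def iter_pages(fetch_page):
    """Walk a NextToken-paginated AWS-style API one page at a time.

    fetch_page(token) returns a dict like
    {"Snapshots": [...], "NextToken": "..."} where NextToken is absent
    or None on the final page.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("Snapshots", [])
        token = page.get("NextToken")
        if not token:
            break
```

Because `fetch_page` is injected, the check is trivially testable with fully offline mocks.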
Comment 8•7 years ago
occ does clean up snapshots associated with amis that it creates:
https://github.com/mozilla-releng/OpenCloudConfig/blob/master/ci/update-workertype.sh#L241
i don't know how to explain the terabytes of orphaned snapshots but i'm glad there's a plan to nuke them.
Flags: needinfo?(rthijssen)
Reporter
Comment 9•7 years ago
jhford: I view the problem as a garbage collection problem. I think we want to purge orphaned EBS snapshots attributed to unknown AMIs if they are more than N hours/days old. As long as the threshold is above the eventual consistency window for the EC2 API, it should be safe. If a snapshot belongs to an AMI that no longer exists, I can't think of a good reason to keep it around. That sounds pretty cut and dried to me.
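A minimal sketch of that age guard. The two-day threshold is an arbitrary assumption for illustration; any value comfortably larger than the EC2 API's eventual-consistency window would do.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: prune only snapshots older than this, so we never
# delete a snapshot whose parent AMI simply hasn't appeared in
# describe_images yet due to eventual consistency.
MIN_AGE = timedelta(days=2)

def safe_to_prune(snapshot_start, now=None):
    """True if the snapshot is old enough that its orphan status is trustworthy."""
    now = now or datetime.now(timezone.utc)
    return now - snapshot_start >= MIN_AGE
```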
I'll look at porting this code to work in taskcluster/ec2-manager.
There is a related problem of purging old AMIs periodically so we don't accumulate AMIs and their snapshots. I think that is worth discussing in another bug, since strictly speaking it is a separate problem. And, it's not as big a problem. From the latest run of the script:
740849 GB Total snapshot storage
10931 Total orphaned AMI snapshots
662703 GB Total orphaned snapshot storage
$174953 Estimated annual storage cost
So "only" 78,146 GB belong to non-orphan snapshots. (This includes non-AMI snapshots though.) That's ~$20k/year. A significant sum. But small in the grand scheme of things. Still worth someone's time to look into though.
Reporter
Comment 10•7 years ago
I ran the script and deleted all orphaned snapshots, freeing up ~662,703 GB in the process. The script now reports:
75890 GB Total AMI snapshot storage
0 Total orphaned AMI snapshots
0 GB Total orphaned snapshot storage
$0 Estimated annual storage cost
I also changed the script to report AMI snapshot totals instead of all EBS snapshots. That's why 78,146 from comment #9 disagrees with the current 75,890. But that's still ~76 TB of AMI snapshots worth cleaning up.
Reporter
Comment 11•7 years ago
I think I incorrectly measured the cost impact here. Looking at the AWS bill in detail, it appears we only pay for the actual EBS snapshot storage used, not the listed size of the snapshot. So e.g. a 120 GB snapshot may only use 4 GB of storage. I also estimated the billing rate incorrectly. It actually varies a bit by region. And the billing rate is 2-4x what I estimated. But since we use far less than the listed size, we still come out under.
How far under?
In us-west-2 in December, we were billed for 3,203 GB-months. Contrast that with the 191,147 GB-months I thought we were being billed for. So I may have over-estimated the monetary impact by ~1.5 orders of magnitude.
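As a quick sanity check on that revision (numbers from this comment; the 2-4x rate factor is the billing-rate correction described above), combining the roughly 60x size over-estimate with the 2-4x under-estimated rate does land in the ~1.5-orders-of-magnitude range:

```python
import math

size_ratio = 191_147 / 3_203          # listed GB vs. actually billed GB-months, ~60x
rate_low, rate_high = 2, 4            # billing rate was 2-4x the original estimate
net_low = size_ratio / rate_high      # best case: ~15x over-estimate of cost
net_high = size_ratio / rate_low      # worst case: ~30x over-estimate of cost
magnitudes = math.log10(net_high)     # ~1.5 orders of magnitude
```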
That still comes out to thousands of dollars per year. But not the win I thought it would be. Good thing I found another 6 figure bug yesterday (bug 1431291) to help save face :)
Summary: Prune EBS snapshots belonging to unknown AMIs to save Mozilla ~$175,000 → Prune EBS snapshots belonging to unknown AMIs
Reporter
Comment 12•7 years ago
Here is the latest version of the script. I used this version to delete orphans.
I had to tweak it a bit to not run into API request throttling.
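The throttling workaround presumably amounts to retrying with exponential backoff when EC2 rejects a request with a rate-limit error (EC2 reports these as `RequestLimitExceeded`). A generic sketch, with the `with_backoff` helper and its parameters being illustrative rather than the attachment's actual code:

```python
import time

def with_backoff(call, is_throttle, tries=5, base=1.0, sleep=time.sleep):
    """Invoke call(); on a throttling error, back off exponentially and retry.

    is_throttle(exc) decides whether an exception is a rate-limit error;
    any other error, or exhausting all tries, re-raises.
    """
    for attempt in range(tries):
        try:
            return call()
        except Exception as exc:
            if not is_throttle(exc) or attempt == tries - 1:
                raise
            sleep(base * 2 ** attempt)  # 1s, 2s, 4s, ... between retries
```

Injecting `sleep` keeps the retry logic testable without real delays.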
Attachment #8943464 - Attachment is obsolete: true
Reporter
Comment 13•7 years ago
jhford: I was going to have a go at implementing the cleanup functionality in ec2-manager. But given the reduced monetary impact and the fact I have some urgent <normal day job> things to work on, I'm going to hold off. If you have any questions or want me to review the JS code in ec2-manager, you know how to reach me.
Comment 15•7 years ago
The problem here is that any sort of automatic system here is going to be rather scary. We have the last used dates for the EC2-Manager managed instances:
https://ec2-manager.taskcluster.net/v1/internal/ami-usage
which is currently behind a very restrictive scope. I could make this a less-restricted scope so that we can make this information known. Between that and the view-worker-type endpoints, we should be able to get a list of all AMIs which are either configured or used in the provisioner-managed world. This is of course dangerous for those AMIs which are in our account but aren't managed by the provisioner.
Until we have a dedicated account for the provisioner managed instances, any sort of automated management of AMI resources is probably not something we should consider.
Updated•7 years ago
Priority: -- → P5
Assignee
Updated•6 years ago
Component: Operations → Operations and Service Requests
Updated•6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE