Closed Bug 1431262 Opened 7 years ago Closed 6 years ago

Prune EBS snapshots belonging to unknown AMIs

Categories

(Taskcluster :: Operations and Service Requests, task, P5)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1373754

People

(Reporter: gps, Unassigned)

References

Details

Attachments

(1 file, 2 obsolete files)

I was poking through the mozilla-taskcluster AWS account and noticed that us-west-2 had a ton of EBS snapshots. Terabytes from visual inspection. Many of them appeared to be associated with AMIs.

I wrote a script (attached) that iterates through all EC2 regions and finds all EBS snapshots that are associated with unknown AMIs. The output of that script is as follows (I pruned EC2 regions without orphaned snapshots):

eu-central-1:
  48139 AMIs
  2633 Snapshots
  160588 GB Total snapshot storage
  2605 AMI snapshots
  160364 GB Total AMI snapshot storage
  2519 Orphaned AMI snapshots
  153644 GB Orphaned AMI snapshot storage

eu-west-1:
  66348 AMIs
  2 Snapshots
  240 GB Total snapshot storage
  2 AMI snapshots
  240 GB Total AMI snapshot storage
  0 Orphaned AMI snapshots
  0 GB Orphaned AMI snapshot storage

us-east-1:
  101279 AMIs
  3204 Snapshots
  171783 GB Total snapshot storage
  3097 AMI snapshots
  170703 GB Total AMI snapshot storage
  2595 Orphaned AMI snapshots
  154238 GB Orphaned AMI snapshot storage

us-east-2:
  23102 AMIs
  670 Snapshots
  52730 GB Total snapshot storage
  670 AMI snapshots
  52730 GB Total AMI snapshot storage
  636 Orphaned AMI snapshots
  50000 GB Orphaned AMI snapshot storage

us-west-1:
  67612 AMIs
  3120 Snapshots
  164321 GB Total snapshot storage
  3005 AMI snapshots
  163401 GB Total AMI snapshot storage
  2535 Orphaned AMI snapshots
  149103 GB Orphaned AMI snapshot storage

us-west-2:
  75367 AMIs
  3962 Snapshots
  191147 GB Total snapshot storage
  3957 AMI snapshots
  191107 GB Total AMI snapshot storage
  2646 Orphaned AMI snapshots
  155718 GB Orphaned AMI snapshot storage

Annual cost of orphaned AMI snapshots: ~$174953

Assuming the logic in the script is sound, running the script with --prune will save Mozilla ~$175,000 annually.

It's worth noting that we'll need to run this cleanup periodically, because some process in the wild is creating these orphaned snapshots. We should probably get this installed as a periodic task somewhere. Given the potential cost savings, it is well worth our time to do that.
Attachment #8943442 - Flags: review?(dustin)
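For readers who can't see the attachment, here is a minimal sketch (not the attached script) of the idea, assuming boto3 and treating the ami- id that CreateImage writes into a snapshot's Description as the link back to its AMI. That is a heuristic, not an authoritative association:

import re

import boto3

AMI_RE = re.compile(r'\bami-[0-9a-f]+\b')


def orphaned_ami_snapshots(region):
    """Return EBS snapshots in `region` whose Description references an AMI
    that is no longer registered in this account."""
    ec2 = boto3.client('ec2', region_name=region)
    # AMIs currently registered by this account in this region.
    known_amis = {image['ImageId']
                  for image in ec2.describe_images(Owners=['self'])['Images']}
    orphans = []
    for page in ec2.get_paginator('describe_snapshots').paginate(OwnerIds=['self']):
        for snap in page['Snapshots']:
            # CreateImage records the source AMI in the snapshot Description,
            # e.g. "Created by CreateImage(i-...) for ami-... from vol-...".
            match = AMI_RE.search(snap.get('Description', ''))
            if match and match.group(0) not in known_amis:
                orphans.append(snap)
    return orphans


if __name__ == '__main__':
    for region in boto3.session.Session().get_available_regions('ec2'):
        snaps = orphaned_ami_snapshots(region)
        print(region, len(snaps), 'orphaned AMI snapshots,',
              sum(s['VolumeSize'] for s in snaps), 'GB')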
Comment on attachment 8943442 [details]
Script to identify and remove orphaned AMI snapshots

The script looks solid to me.

Grenade, do you think these are from something in OCC? Or, wcosta, are we just not cleaning up any of the snapshots associated with AMIs we create?
Flags: needinfo?(wcosta)
Flags: needinfo?(rthijssen)
Attachment #8943442 - Flags: review?(dustin)
The other thing here is we have tons of old AMIs sitting around that will likely never be used. Most (all?) belong to docker and windows workers. We could prune old AMIs and snapshots to save even more.
Attached file prune-ami-snapshots.py (obsolete) —
Now with concurrent.futures for faster execution. Also tweaked the output a bit to display totals at the bottom. It now yields:

10931 Total orphaned AMI snapshots
662703 GB Total orphaned snapshot storage
$174953 Estimated annual storage cost

662 PB of storage (~60 GB/snapshot). Good times.
Attachment #8943442 - Attachment is obsolete: true
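For illustration, a sketch of what the concurrent.futures change looks like, assuming an orphaned_ami_snapshots(region) helper along the lines of the one sketched in the description above:

from concurrent.futures import ThreadPoolExecutor

import boto3


def scan_all_regions():
    regions = boto3.session.Session().get_available_regions('ec2')
    # One thread per region; each scan is dominated by EC2 API latency.
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        per_region = dict(zip(regions, pool.map(orphaned_ami_snapshots, regions)))
    total = sum(len(snaps) for snaps in per_region.values())
    total_gb = sum(s['VolumeSize'] for snaps in per_region.values() for s in snaps)
    print(total, 'Total orphaned AMI snapshots')
    print(total_gb, 'GB Total orphaned snapshot storage')
    return per_region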
That last comment should obviously have been 662 TB, not PB. Still nothing to sneeze at!
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Comment on attachment 8943442 [details]
> Script to identify and remove orphaned AMI snapshots
>
> The script looks solid to me.
>
> Grenade, do you think these are from something in OCC? Or, wcosta, are we
> just not cleaning up any of the snapshots associated with AMIs we create?

Honestly, I thought ec2-manager was already doing this. We can use gps' script as a hook.
Flags: needinfo?(wcosta)
(In reply to Wander Lairson Costa [:wcosta] from comment #5)
> Honestly, I thought ec2-manager was already doing this. We can use gps'
> script as a hook.

AMI management has never been in the purview of the ec2-manager or provisioner. We had a UCOSP project in place to monitor EBS volume usage, but that covered volumes rather than snapshots and never landed in a completed form. If we have a set of scripts relevant to monitoring our EC2 account, I'm happy to include them in the hourly sweep over our EC2 account that's built into the EC2-Manager, but those would be new requirements.
Greg, the EC2-Manager does hourly sweeps of the EC2 account to reconcile state drift arising from the eventually consistent nature of the EC2 API. Since this is a state reconciliation, where differences are expected, there's no notification system. If you think checks like this one would be useful, I'm happy to use the EC2-Manager infrastructure to do so. We could probably hook up SNS when a threshold is hit.

There's a lot of activity in the EC2-Manager codebase right now, dealing with the new spot model, but if you'd like to check out the situation, the repository is https://github.com/taskcluster/ec2-manager. The most relevant file is lib/housekeeping.js.

My only requests are that the checks use the runaws wrapper, that they are all tested with fully offline mocks, and that they don't do list-the-world-in-one-call style API requests. Let me know if there's anything I can do to help with integration, if we want it!
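(EC2-Manager itself is JavaScript and would go through the runaws wrapper, but the "no list-the-world-in-one-call" request translates directly. In boto3 terms, to stay consistent with the rest of this thread, it means leaning on paginators with a bounded page size rather than a single unbounded describe call; a sketch:)

import boto3


def iter_snapshots(region, page_size=200):
    """Yield snapshots one page at a time instead of listing the world."""
    ec2 = boto3.client('ec2', region_name=region)
    paginator = ec2.get_paginator('describe_snapshots')
    # Each underlying API call returns at most page_size results; the
    # paginator follows NextToken for us.
    pages = paginator.paginate(OwnerIds=['self'],
                               PaginationConfig={'PageSize': page_size})
    for page in pages:
        for snapshot in page['Snapshots']:
            yield snapshot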
occ does clean up snapshots associated with amis that it creates: https://github.com/mozilla-releng/OpenCloudConfig/blob/master/ci/update-workertype.sh#L241

i don't know how to explain the terabytes of orphaned snapshots but i'm glad there's a plan to nuke them.
Flags: needinfo?(rthijssen)
jhford: I view the problem as a garbage collection problem. I think we want to purge orphaned EBS snapshots attributed to unknown AMIs if they are more than N hours/days old. As long as the threshold is above the eventual consistency window for the EC2 API, it should be safe. If a snapshot belongs to an AMI that no longer exists, I can't think of a good reason to keep it around. That sounds pretty cut and dry to me. I'll look at porting this code to work in taskcluster/ec2-manager.

There is a related problem of purging old AMIs periodically so we don't accumulate AMIs and their snapshots. I think that is worth discussing in another bug, since strictly speaking it is a separate problem. And it's not as big a problem. From the latest run of the script:

740849 GB Total snapshot storage
10931 Total orphaned AMI snapshots
662703 GB Total orphaned snapshot storage
$174953 Estimated annual storage cost

So "only" 78,146 GB belong to non-orphan snapshots. (This includes non-AMI snapshots though.) That's ~$20k/year. A significant sum, but small in the grand scheme of things. Still worth someone's time to look into though.
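A sketch of that age threshold, with N chosen arbitrarily as 7 days and ami_referenced_by() standing in for whatever the real port uses to tie a snapshot back to its AMI (both are assumptions on my part, not decisions):

from datetime import datetime, timedelta, timezone


def old_enough(snapshot, min_age_days=7):
    # boto3 returns StartTime as a timezone-aware datetime.
    age = datetime.now(timezone.utc) - snapshot['StartTime']
    return age > timedelta(days=min_age_days)


def collectable(snapshot, known_amis, min_age_days=7):
    # ami_referenced_by() is a hypothetical helper that extracts the ami- id
    # from the snapshot (e.g. from its Description).
    ami = ami_referenced_by(snapshot)
    return (ami is not None
            and ami not in known_amis
            and old_enough(snapshot, min_age_days))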
I ran the script and deleted all orphaned snapshots, freeing up ~662,703 GB in the process. The script now reports:

75890 GB Total AMI snapshot storage
0 Total orphaned AMI snapshots
0 GB Total orphaned snapshot storage
$0 Estimated annual storage cost

I also changed the script to report AMI snapshot totals instead of all EBS snapshots. That's why the 78,146 from comment #9 disagrees with the current 75,890. But that's still ~76 TB of AMI snapshots worth cleaning up.
I think I incorrectly measured the cost impact here. Looking at the AWS bill in detail, it appears we only pay for the actual EBS snapshot storage used, not the listed size of the snapshot. So e.g. a 120 GB snapshot may only use 4 GB of storage.

I also estimated the billing rate incorrectly. It actually varies a bit by region, and it is 2-4x what I estimated. But since we use far less than the listed size, we still come out under.

How far under? In us-west-2 in December, we were billed for 3,203 GB-months. Contrast that with the 191,147 I thought we were being billed for. So I may have over-estimated the monetary impact by ~1.5 orders of magnitude. That still comes out to thousands of dollars per year, but it's not the win I thought it would be.

Good thing I found another 6 figure bug yesterday (bug 1431291) to help save face :)
Summary: Prune EBS snapshots belonging to unknown AMIs to save Mozilla ~$175,000 → Prune EBS snapshots belonging to unknown AMIs
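A rough back-of-the-envelope check of that revision, using only the two us-west-2 figures quoted above (my arithmetic, not additional numbers from the bill):

listed_gb = 191_147       # snapshot sizes I originally summed for us-west-2
billed_gb_months = 3_203  # what us-west-2 was actually billed in December

storage_ratio = listed_gb / billed_gb_months  # ~60x less storage billed than listed
# The per-GB rate was also underestimated by roughly 2-4x, so the net
# overestimate is roughly 15-30x, i.e. about 1.2-1.5 orders of magnitude.
print(round(storage_ratio))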
Attached file prune-ami-snapshots.py
Here is the latest version of the script. I used this version to delete orphans. I had to tweak it a bit to not run into API request throttling.
Attachment #8943464 - Attachment is obsolete: true
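One way to soften that throttling, sketched here rather than taken from the attachment: let botocore retry throttled calls (the adaptive retry mode is an assumption on my part, not necessarily what the script does) and pace the DeleteSnapshot calls ourselves:

import time

import boto3
from botocore.config import Config

RETRYING = Config(retries={'max_attempts': 10, 'mode': 'adaptive'})


def delete_snapshots(region, snapshot_ids, pause=0.2):
    ec2 = boto3.client('ec2', region_name=region, config=RETRYING)
    for snapshot_id in snapshot_ids:
        ec2.delete_snapshot(SnapshotId=snapshot_id)
        time.sleep(pause)  # crude pacing between DeleteSnapshot calls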
jhford: I was going to have a go at implementing the cleanup functionality in ec2-manager. But given the reduced monetary impact and the fact I have some urgent <normal day job> things to work on, I'm going to hold off. If you have any questions or want me to review the JS code in ec2-manager, you know how to reach me.
The problem here is that any sort of automatic system is going to be rather scary. We have the last-used dates for the EC2-Manager-managed instances at https://ec2-manager.taskcluster.net/v1/internal/ami-usage, which is currently behind a very restrictive scope. I could make this a less-restricted scope so that we can make this information known. Between that and the view-worker-type endpoints, we should be able to get a list of all AMIs which are either configured or used in the provisioner-managed world.

This is of course dangerous for those AMIs which are in our account but aren't managed by the provisioner. Until we have a dedicated account for the provisioner-managed instances, any sort of automated management of AMI resources is probably not something we should consider.
Priority: -- → P5
Component: Operations → Operations and Service Requests
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE