Closed Bug 1373754 Opened 8 years ago Closed 3 years ago

Investigate AWS snapshots in each region

Categories

(Infrastructure & Operations :: RelOps: General, task, P5)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: garndt, Unassigned)

References

Details

Attachments

(1 file)

Attached file abandoned_snapshots.py
We have a large number of snapshots in each of our regions. From what I understand these are snapshots of volumes for AMIs, not volumes for running instances, but I could be wrong. I wrote up a script that (hacky as it may be) should show all snapshots that are currently not associated with an active AMI, which leads me to believe that AMIs are being removed/deregistered without their corresponding snapshots being deleted. At what I believe is the current storage price of $0.049/GB/month, this adds up to quite a bit of money if these snapshots are lingering around unused. We need to verify that they really are leftovers that are no longer used and, if so, remove them. The script is attached; here is the output from running it today:

us-west-1 rogue snapshots: 2341 size: 132851 GiB
us-west-2 rogue snapshots: 2149 size: 124610 GiB
us-east-1 rogue snapshots: 2363 size: 134546 GiB
us-east-2 rogue snapshots: 449 size: 35260 GiB
eu-central-1 rogue snapshots: 2271 size: 134888 GiB

total rogue snapshots: 9573
total size: 562155 GiB
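The attached abandoned_snapshots.py is not reproduced here; a minimal sketch of the same idea, assuming boto3 with credentials configured and using the regions from the output above, might look like this:

```python
# Sketch only (not the attached script): count snapshots owned by this account
# that are not referenced by any of the account's AMIs, per region.
import boto3

REGIONS = ["us-west-1", "us-west-2", "us-east-1", "us-east-2", "eu-central-1"]

for region in REGIONS:
    ec2 = boto3.client("ec2", region_name=region)

    # Snapshot IDs referenced by the block device mappings of AMIs we own.
    referenced = set()
    for image in ec2.describe_images(Owners=["self"])["Images"]:
        for mapping in image.get("BlockDeviceMappings", []):
            snap_id = mapping.get("Ebs", {}).get("SnapshotId")
            if snap_id:
                referenced.add(snap_id)

    # Snapshots we own that no current AMI references ("rogue").
    rogue = []
    paginator = ec2.get_paginator("describe_snapshots")
    for page in paginator.paginate(OwnerIds=["self"]):
        rogue.extend(s for s in page["Snapshots"] if s["SnapshotId"] not in referenced)

    total_gib = sum(s["VolumeSize"] for s in rogue)
    print(f"{region} rogue snapshots: {len(rogue)} size: {total_gib} GiB")
```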
Based on the counts, I'm guessing these are associated with building Windows AMIs?
I believe so. Pete took a look at them, and based on their sizes they also seem to be for Windows: Linux uses 8 GB snapshots for its AMIs, and most of these are much larger than that.
I think OCC only deletes one snapshot per AMI[1], but there may be more than one created per AMI (due to having multiple drives). I haven't dug deeper into it yet, but that could be a possible cause. -- [1] https://github.com/mozilla-releng/OpenCloudConfig/blob/769bc87944edaefbdc41328ab64dcd656a9a478f/ci/update-workertype.sh#L205
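For illustration, a hedged sketch of a cleanup that collects every snapshot referenced by an AMI's block device mappings before deregistering it (this is not the OCC script; the function name and identifiers are hypothetical):

```python
import boto3

def deregister_ami_and_snapshots(ec2, ami_id):
    """Deregister an AMI and delete *all* of its backing EBS snapshots."""
    image = ec2.describe_images(ImageIds=[ami_id])["Images"][0]
    snapshot_ids = [
        m["Ebs"]["SnapshotId"]
        for m in image.get("BlockDeviceMappings", [])
        if m.get("Ebs", {}).get("SnapshotId")
    ]
    ec2.deregister_image(ImageId=ami_id)
    # A multi-drive (e.g. Windows) image has one snapshot per mapped volume;
    # deleting only the first one leaves the rest behind.
    for snap_id in snapshot_ids:
        ec2.delete_snapshot(SnapshotId=snap_id)
    return snapshot_ids

# Example usage (identifiers are placeholders):
# ec2 = boto3.client("ec2", region_name="us-west-2")
# deregister_ami_and_snapshots(ec2, "ami-0123456789abcdef0")
```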
Another thing to check: when AMIs are copied across regions, the snapshots are presumably copied too, so we should verify whether the cleanup script also purges the snapshots in those other regions.
Found in triage. These are still peanuts compared to artifact storage and active EBS volumes, but for the sake of hygiene we should clean this up periodically. I'd like to see a monthly report that contains the manual commands to delete rogue snapshots if there are no anomalies after inspection.
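A sketch of what such a monthly report could emit, assuming boto3: it prints the manual aws ec2 delete-snapshot commands for review rather than deleting anything itself.

```python
# Sketch of a "monthly report": list the manual CLI commands to delete rogue
# snapshots so a human can inspect them before running anything.
import boto3

for region in ["us-west-1", "us-west-2", "us-east-1", "us-east-2", "eu-central-1"]:
    ec2 = boto3.client("ec2", region_name=region)
    referenced = {
        m["Ebs"]["SnapshotId"]
        for image in ec2.describe_images(Owners=["self"])["Images"]
        for m in image.get("BlockDeviceMappings", [])
        if m.get("Ebs", {}).get("SnapshotId")
    }
    for page in ec2.get_paginator("describe_snapshots").paginate(OwnerIds=["self"]):
        for snap in page["Snapshots"]:
            if snap["SnapshotId"] not in referenced:
                print(
                    f"aws ec2 delete-snapshot --region {region} "
                    f"--snapshot-id {snap['SnapshotId']}  "
                    f"# {snap['VolumeSize']} GiB, started {snap['StartTime']:%Y-%m-%d}"
                )
```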
Priority: -- → P5
Found in triage. The worker build process is changing, i.e. images may not even be created this way for much longer, which will make it easier to clean this up at that point.
Component: Operations → Operations and Service Requests

This is still an issue.

I think it would be worthwhile for Pete, Wander, and Rob to sit down (perhaps virtually) and go through the list of existing EBS snapshots again and see what can be purged. Maybe that even leads to a heuristic that can become a cleanup script? (One possible heuristic is sketched after this comment.)

Here is some investigation we did earlier this year around instance cleanup: https://docs.google.com/spreadsheets/d/1IOOLikW3ms1yEs8HNzr6YetSU77U5jdy8RZQ6fFkvJU/edit#gid=0

Still lots of EBS snapshots to go through.
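
A sketch of one possible purge heuristic, under the assumption that EC2 writes "Created by CreateImage(i-...) for ami-... from vol-..." into the descriptions of AMI-backing snapshots; the cutoff and function name are placeholders, not an agreed policy:

```python
# Heuristic sketch: a snapshot is a purge candidate if no current AMI
# references it, it is older than a cutoff, and its description points at an
# AMI that has since been deregistered.
import re
from datetime import datetime, timedelta, timezone

CUTOFF = datetime.now(timezone.utc) - timedelta(days=90)  # example cutoff
AMI_RE = re.compile(r"for (ami-[0-9a-f]+)")

def is_purge_candidate(snapshot, referenced_snapshot_ids, existing_ami_ids):
    # snapshot: one dict from describe_snapshots; the other two are sets of IDs.
    if snapshot["SnapshotId"] in referenced_snapshot_ids:
        return False
    if snapshot["StartTime"] > CUTOFF:
        return False
    match = AMI_RE.search(snapshot.get("Description", ""))
    # Only flag snapshots that were clearly created for an AMI that is gone.
    return bool(match) and match.group(1) not in existing_ami_ids
```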

Assignee: nobody → relops
Component: Operations and Service Requests → RelOps: General
Product: Taskcluster → Infrastructure & Operations
QA Contact: klibby

We'll focus on migration and revisit this if needed.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
