Closed Bug 1071724 Opened 5 years ago Closed 4 years ago

Store symbols in S3

Categories

(Socorro :: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: rhelmer)

Over the years we've looked at a few different options to change our symbol store (which is currently a giant NFS mount) to something more manageable. I think the best thing we can do at this point is to stuff them all into S3.

Advantages:
* Probably cheaper than the NetApp we're currently using
* Don't have to worry so much about disk usage
* Already available via HTTP for symbol server
* Makes life pretty simple if we decide to move processing into EC2 some day

Disadvantages:
* HTTP requests will almost certainly be slower than NFS
* Adds more complexity to the system
* Need to migrate symbol uploaders to use the symbol upload API to get them uploading to the right place (this is not totally a disadvantage)

I think we can mitigate the extra slowness of HTTP by just using a local disk cache for each processor. We gathered some data in bug 981021, and bug 981079 comment 1 has the summary: a 99% hit rate for a symbol cache would only require ~35GB of space. Seems totally doable.
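
A minimal sketch of what such a per-processor cache could look like, assuming a size-bounded LRU policy. The class name, the `fetch` callable (standing in for an HTTP GET against the symbol store), and the hashed on-disk layout are all hypothetical and not existing Socorro code:

```python
import hashlib
import os
from collections import OrderedDict


class SymbolDiskCache:
    """Size-bounded LRU cache of symbol files on local disk (illustrative only)."""

    def __init__(self, cache_dir, max_bytes, fetch):
        self.cache_dir = cache_dir
        self.max_bytes = max_bytes
        self.fetch = fetch          # hypothetical callable: key -> bytes
        self.sizes = OrderedDict()  # key -> size in bytes, least recently used first
        self.total = 0
        self.hits = 0
        self.misses = 0

    def _path(self, key):
        # Hash the key so any symbol path maps to a flat, safe file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, name)

    def get(self, key):
        if key in self.sizes:
            # Cache hit: mark as most recently used and read from disk.
            self.hits += 1
            self.sizes.move_to_end(key)
            with open(self._path(key), "rb") as f:
                return f.read()
        # Cache miss: fetch from the remote store and keep a local copy.
        self.misses += 1
        data = self.fetch(key)
        with open(self._path(key), "wb") as f:
            f.write(data)
        self.sizes[key] = len(data)
        self.total += len(data)
        self._evict()
        return data

    def _evict(self):
        # Drop least recently used files until the cache fits its budget.
        # The newest entry is always kept, even if it alone exceeds the budget.
        while self.total > self.max_bytes and len(self.sizes) > 1:
            old_key, old_size = self.sizes.popitem(last=False)
            os.remove(self._path(old_key))
            self.total -= old_size
```

With `max_bytes` set to roughly 35GB, the `hits` and `misses` counters would let us check whether the 99% figure holds up in practice. The in-memory index is lost on restart, so a real version would need to rebuild it from the cache directory.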

We'll almost certainly need to make some changes to the symbol upload API to make it suitable for use by all of our build machines. For one, we'll need to add some sort of metadata table to store information about what symbols were uploaded so that we can do proper cleanup (I filed this as bug 914106 while we were working on the Postgres symbol db; I think it's still valid).
I think this sounds like an excellent idea. NFS has, for the upload part, been hell. Not because it's been unreliable, but because NFS permissions have been a mess.

I'd be delighted to rewrite the upload API part to go straight to S3 instead. We'd be able to stop using a cron job to move uploads from the webhead to the final location. When partners upload, we'll synchronously issue the S3 upload straight from the webhead's temp buffer and report done only when the file has made it all the way to its final destination.
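
Roughly, the synchronous flow could look like the sketch below. `InMemoryStore` is a stand-in for the real S3 client, and `handle_symbol_upload` and the response shape are hypothetical names for illustration, not the existing webapp code:

```python
import io


class InMemoryStore:
    """Stand-in for the real object store; hypothetical, for illustration only."""

    def __init__(self):
        self.objects = {}

    def put(self, key, fileobj):
        # A real implementation would stream the file object to S3 here.
        self.objects[key] = fileobj.read()


def handle_symbol_upload(store, key, temp_buffer):
    """Push the buffered upload to the store, and only then report success."""
    temp_buffer.seek(0)
    try:
        store.put(key, temp_buffer)  # blocks until the store has the bytes
    except Exception as exc:
        # The uploader learns about the failure immediately, not from a cron log.
        return {"status": "error", "detail": str(exc)}
    return {"status": "done", "key": key}


store = InMemoryStore()
result = handle_symbol_upload(store, "example.pdb/ABC123/example.sym", io.BytesIO(b"MODULE"))
print(result["status"])
```

The point of doing it in-line is that "done" means the file is at its final destination, so there is no intermediate state for a cron job to clean up.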

Considering that you've phrased this bug as a "maybe it would be nice", I'm not going to add a dependent bug for the webapp part yet.
I'm pretty confident that this is what we want to do; I just want to suss out all the requirements first.

I think we might need to keep an NFS mount around for at least the Adobe symbols. We're not supposed to make those public, so we will probably want to keep them on a private mount.
I like this idea. I like it very much.

A couple of complications:

1. There will be some bandwidth charges if we implement this before the processors are in AWS. Given estimates of total processor inbound bandwidth (which should include NetApp traffic), I think that's an acceptable cost compared with what we pay for the NetApp. It will shift the cost between cost centers, which I'll start communicating now.

2. Once we move processing to AWS, local disk is expensive and ephemeral. If we use spot instances for processing, we don't want to wait to pull 35GB over the network before processing starts. However, with spot instances we probably don't need to preheat the local disk cache: we can pull from S3 and let processors scale with queue volume to absorb the additional processing time per crash.
Using an estimated 50TB for symbols (how far off is that?), we're looking at ~$1500 to store it. With a high cache hit rate, the bandwidth out for additional symbols is probably under $500. This seems acceptable.
For #1, I think caching will mitigate the bandwidth usage pretty well, although we could certainly stand to do some testing on that. Symbols will cache extremely well, so we could even layer an HTTP cache in front of S3 if needed.

For #2, I don't think it's too terrible. 35GB is what you'd cache for 99% of a full day's worth of symbols, so you could probably just let the cache fill up as the processor works. An HTTP cache like the one I mentioned above might also make that simpler (although it's one more moving piece, certainly).
We're at ~4TB total after 7 years. Even if we're really lazy about things I can't imagine we'll break 10TB in the next 5 years.
Wow, was I off. 4TB is easy, money-wise.
I was wrong too, but I was closer:
x.x.x.x:/symbols              7.5T  6.6T 1021G  87% /mnt/netapp/breakpad

(we're also doing a terrible job of cleanup right now, but I can't find that bug at the moment)
Depends on: 981079
Depends on: 1083546
Depends on: 1084544
Depends on: 1085530
Depends on: 1085557
Depends on: 1097178
The df output above hides the fact that the filer is deduplicating that volume, saving ~28%. Actual use (as of today) is 10430GB.

Growth is (ballpark) 600GB per month, under current cleanup conditions.
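
For a rough sense of where that goes, here is a back-of-the-envelope projection. It assumes growth stays linear, and the per-GB rate is simply back-derived from the "50TB for ~$1500" estimate earlier in this bug (taking 1TB = 1000GB); it is an assumption, not a quoted AWS price. The undeduplicated figure is used as the starting point since S3 would not dedup for us.

```python
ACTUAL_USE_GB = 10430       # actual (undeduplicated) use reported in this bug
GROWTH_GB_PER_MONTH = 600   # ballpark growth under current cleanup conditions

# Assumption: rate implied by the earlier "50TB for ~$1500" estimate,
# taking 1TB = 1000GB. Not a quoted AWS price.
ASSUMED_USD_PER_GB = 1500 / 50000


def projected_use_gb(months):
    """Linear projection of stored symbols, assuming growth stays constant."""
    return ACTUAL_USE_GB + GROWTH_GB_PER_MONTH * months


def projected_cost_usd(months):
    """Storage cost at the assumed rate for the projected size."""
    return projected_use_gb(months) * ASSUMED_USD_PER_GB


for months in (0, 12, 24):
    print(months, projected_use_gb(months), round(projected_cost_usd(months), 2))
```

Better cleanup would change the slope, so this is an upper-ish bound under today's behaviour rather than a forecast.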
Depends on: 1097209
Depends on: 1097210
Depends on: 1097216
Blocks: 1118288
Depends on: 1119369
Depends on: 1119372
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
Blocks: 528092
Blocks: 607831
No longer depends on: 1097210
Depends on: 1124155
Blocks: 943492
No longer depends on: 943492
Depends on: 1130138
Depends on: 1131083
No longer depends on: 1131083
I wasn't sure where in the hierarchy to put this, but we may need to move symbol storage to us-east to support Heroku Postgres. Relevant details are in bug 1144179.
Depends on: 1162060
No longer blocks: 528092
No longer depends on: 1144179
There's more work to do in the dependent bugs, but in its current state this does not block the AWS move.

It does block turning off symbolpush and decommissioning the NetApp.
No longer blocks: 1118288
Blocks: 1170212
Depends on: 1170253
Depends on: 1155013
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED