Closed Bug 1129515 Opened 10 years ago Closed 10 years ago

Estimate Elasticsearch metrics and stats

Categories

(Socorro :: General, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: lonnen, Unassigned)

Details

ES is going to be expensive to run in the cloud. I'd like to better understand what we're storing and how much space it's taking up, roughly, so we can make decisions about costs per feature.

* how much are we storing, in TB?
* how many weeks of data does that represent?
* how much of that, percentage or actual, is occupied by raw crashes, and how much by processed crashes?
* what do people look at in raw crashes that is not in processed crashes?
* are there cheap optimizations we're not using (data compression, de-dupe, etc.)?
Flags: needinfo?(adrian)
I'm also curious -- what is our best estimate, in $$/GB or $$/TB to run ES ourselves in the cloud, in AWS costs alone?
Flags: needinfo?(dmaher)
(In reply to Chris Lonnen :lonnen from comment #1)
> I'm also curious -- what is our best estimate, in $$/GB or $$/TB to run ES
> ourselves in the cloud, in AWS costs alone?

See this gist: https://gist.github.com/phrawzty/8b4c6bf611a6939177e6

The tl;dr:
* On-demand: $9328.68 per month ($111944.16 per year)
* 1 yr upfront reserved: $5769.63 per month ($69235.56 per year)

I really must stress that these are _highly_ ballpark figures - notably because we just don't know how these particular instances will handle our use case. I'm sure there's a way to get a better fix on which instances (and how many thereof) would be perfect, but that would require much deeper knowledge about AWS (and our ES usage profile) than I currently possess.
Flags: needinfo?(dmaher)
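[Editorial sketch] A minimal Python sketch of the on-demand vs. reserved comparison, using only the per-month figures quoted from the gist above; the rest is simple arithmetic and is illustrative only:

# Illustrative only: the two per-month figures are copied from the gist
# summary above; everything else is straightforward arithmetic.
ON_DEMAND_MONTHLY = 9328.68   # USD/month, on-demand
RESERVED_MONTHLY = 5769.63    # USD/month, 1 yr upfront reserved

on_demand_yearly = ON_DEMAND_MONTHLY * 12   # 111944.16
reserved_yearly = RESERVED_MONTHLY * 12     # 69235.56
savings = on_demand_yearly - reserved_yearly

print(f"On-demand: ${on_demand_yearly:,.2f}/year")
print(f"Reserved:  ${reserved_yearly:,.2f}/year")
print(f"Reserving saves ${savings:,.2f}/year ({savings / on_demand_yearly:.0%})")

Under these figures, reserving for a year comes out roughly 38% cheaper than on-demand.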
(In reply to Chris Lonnen :lonnen from comment #1)
> I'm also curious -- what is our best estimate, in $$/GB or $$/TB to run ES
> ourselves in the cloud, in AWS costs alone?

Addendum: If we were to use "i2.2xlarge" instances for the Data nodes (which would more closely mimic our existing infra), the cost would jump up to:
* On-demand: $12115.35 per month ($145384.20 per year)
* 1 yr upfront reserved: $61917.00 per year
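[Editorial sketch] To get at the $$/TB part of the question in comment 1, a rough back-of-the-envelope sketch. The monthly costs are the ones quoted above; the ~12 TB denominator is the current production disk usage mentioned in a later comment on this bug and is an assumption for this calculation, which also ignores storage, transfer, and replication overhead:

# Rough $/TB/month: quoted monthly instance costs divided by an assumed
# ~12 TB data footprint (figure taken from a later comment in this bug).
CLUSTER_TB = 12.0  # assumed usable data footprint, in TB

estimates = [
    ("gist estimate, on-demand",          9328.68),
    ("gist estimate, 1 yr reserved",      5769.63),
    ("i2.2xlarge data nodes, on-demand",  12115.35),
    ("i2.2xlarge data nodes, 1 yr reserved", 61917.00 / 12),
]

for label, monthly_usd in estimates:
    print(f"{label}: ~${monthly_usd / CLUSTER_TB:,.0f} per TB per month")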
In an Elasticsearch 1.4 instance, one-node cluster, no replication, with our current production mapping:

Data         | 100 crashes | average crash | 1 week estimate | 26 weeks estimate
=============+=============+===============+=================+==================
all          |     15932 K |         159 K |           364 G |             9.2 T
no json_dump |      6828 K |          68 K |           156 G |             4.0 T
no raw_crash |     15296 K |         152 K |           350 G |             8.8 T

I estimate that removing duplicates would save about as much as removing the raw_crash, so the gain is really low. As expected, the dump is the biggest part of a crash report, but it also has a lot of value. I hope this will help make a decision!
Flags: needinfo?(adrian)
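[Editorial sketch] A short sketch of the extrapolation behind the table in the previous comment: the average document size (measured on a 100-crash sample) multiplied by an assumed weekly crash volume. The ~2.4M crashes/week value is what the 1-week estimates imply (and roughly matches the ~9.8M crashes over 4 weeks in the next comment); it is an assumption, not a measured constant.

CRASHES_PER_WEEK = 2_400_000  # assumed weekly crash volume (see lead-in)

def weekly_size_gib(avg_crash_kib):
    """Estimated index size, in GiB, for one week of crashes."""
    return avg_crash_kib * CRASHES_PER_WEEK / (1024 * 1024)

for label, avg_kib in [("all", 159), ("no json_dump", 68), ("no raw_crash", 152)]:
    week = weekly_size_gib(avg_kib)
    print(f"{label:>12}: ~{week:.0f} GiB/week, ~{week * 26 / 1024:.1f} TiB over 26 weeks")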
Adding more numbers (computed on the last 4 full weeks of data in Elasticsearch):

Crashes        | Number  | %
===============+=========+====
Total          | 9778805 | 100
Firefox        | 6658083 |  68
FF Nightly     |  144097 |   1
FF Aurora      |  119345 |   1
FF Beta        | 2132704 |  22
FF Release     | 3849771 |  39
FF Others      |  412166 |   4
Other products | 3120722 |  32

What those numbers mean is: if we remove all Firefox Beta crashes from an index, we gain 22% of disk space. Removing all crashes for products other than Firefox gains 32% of disk space. We can also see from this that removing Nightly and Aurora crashes has almost no impact (about 2% of disk space gained).

So here's a summary of the options we have:

1. Reducing the number of weeks. Linear savings: removing one week frees ~365 GB of disk space.
2. Removing the json dump. Estimated savings: 55% of disk space.
3. Removing products other than Firefox. Estimated savings: 32% of disk space.
4. Removing channels other than Release for Firefox. Estimated savings: 28% of disk space.
5. Removing the raw crash (4% savings) or removing duplicates (less than 4% savings): the impact is too small to matter.

Options 2 to 4 can be done on a subset of indices. For example, we could remove the dump from our documents after 4 weeks. That would be a bit computation-heavy and require some code, but it's doable. Here are example use cases:

1. Keep 4 weeks of full data, keep 12 weeks of data without the json dump (total 16 weeks)
   Full week = 365 GB
   Small week = 365 GB - 55% = 165 GB
   Total space = 4 * 365 + 12 * 165 = 3.4 TB
2. Keep 2 weeks of full data, keep 10 weeks of data without the json dump (total 12 weeks)
   Full week = 365 GB
   Small week = 365 GB - 55% = 165 GB
   Total space = 2 * 365 + 10 * 165 = 2.3 TB
3. Keep 3 weeks of full data, keep 9 weeks of data without the json dump, without non-release Firefox crashes, without non-Firefox crashes (total 12 weeks)
   Full week = 365 GB
   Small week = ( 365 GB - (32% + 28%) ) - 55% = 66 GB
   Total space = 3 * 365 + 9 * 66 = 1.7 TB

Note that these numbers are rough estimates based on quite a small dataset, so they are prone to error, and they represent disk space consumption only: we should in any case always have more disk space available than our estimates. For reference, the current disk space usage in our production cluster is approximately 12 TB (slightly less, but going up). The estimated disk space needed with our current data retention policy and no replication would be approximately 10 TB (slightly less, but we need more just in case).
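[Editorial sketch] For convenience, the retention-scenario math above as a short Python sketch. The 365 GB full week and the percentage savings are the rough estimates quoted in this bug; binary units (1 TB = 1024 GB) are assumed so the totals line up with the figures above (~3.4 TB, ~2.3 TB, and ~1.7 TB once rounded):

FULL_WEEK_GB = 365  # estimated size of one week of full data, from the previous comment

def small_week_gb(drop_dump=True, firefox_release_only=False):
    """Estimated size of a 'trimmed' week under the savings listed above."""
    size = FULL_WEEK_GB
    if firefox_release_only:
        size *= 1 - (0.32 + 0.28)  # drop non-Firefox and non-release Firefox crashes
    if drop_dump:
        size *= 1 - 0.55           # drop the json dump
    return size

def total_tb(full_weeks, small_weeks, **kwargs):
    """Total disk space for a retention policy mixing full and trimmed weeks."""
    total_gb = full_weeks * FULL_WEEK_GB + small_weeks * small_week_gb(**kwargs)
    return total_gb / 1024

print(f"Use case 1 (4 full + 12 trimmed weeks): ~{total_tb(4, 12):.2f} TB")
print(f"Use case 2 (2 full + 10 trimmed weeks): ~{total_tb(2, 10):.2f} TB")
print(f"Use case 3 (3 full + 9 trimmed weeks, FF release only): "
      f"~{total_tb(3, 9, firefox_release_only=True):.2f} TB")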
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME