Bug 1271790 (Closed)
Opened 9 years ago • Closed 9 years ago
estimate processed crash storage metrics
Categories: Socorro :: General, task
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: lonnen, Assigned: adrian
I'd like to be able to answer the following questions about our processed crash storage. Some of it we can get precisely; for the hard stuff, sampling is good enough:
* how much are we storing in TB?
* how many weeks of data does that represent?
* how many objects?
* what is the five number summary over the size of the objects?
Comment 1•9 years ago
Assigned this to adrian who knows so much more about the ES storage.
What does "the five number summary over the size of the objects" mean?
Don't we also still store processed crashes in PG?
Assignee: nobody → adrian
Comment 2 (Reporter)•9 years ago
The five-number summary, or Tukey's five numbers, summarizes a distribution:
* the sample minimum (smallest observation)
* the lower quartile, or first quartile
* the median (middle value)
* the upper quartile, or third quartile
* the sample maximum (largest observation)
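The five numbers above can be computed directly from a sample of object sizes. A minimal sketch with NumPy; the sample sizes here are made up for illustration:

```python
# Tukey's five-number summary over a sample of object sizes (in KB).
# The sizes below are illustrative, not real Socorro data.
import numpy as np

sizes_kb = np.array([1, 177, 210, 250, 290, 348, 512, 18944])

five_numbers = {
    "min": np.min(sizes_kb),                  # sample minimum
    "q1": np.percentile(sizes_kb, 25),        # lower quartile
    "median": np.median(sizes_kb),            # middle value
    "q3": np.percentile(sizes_kb, 75),        # upper quartile
    "max": np.max(sizes_kb),                  # sample maximum
}
print(five_numbers)
```

As the comments below note, this can be run over a day's or a week's worth of objects; the summary is cheap once the sizes are collected.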
We could do this for a day or a week or whatever and that would be good enough for me.
I'm interested in the whole processed crash, rather than the partially redacted stuff.
Comment 3 (Assignee)•9 years ago
I downloaded 10,000 processed crashes from our stage S3 bucket onto my file system on May 12. Hopefully that's a big enough sample, because it took a full night to download all that data. :)
Note that these documents include sensitive data and the full JSON dump. Raw crash data is not included.
Here's an analysis of the size of those documents (numbers in kilobytes):

count  10000
mean     486
std     1170
min        1
25%      177
50%      250
75%      348
max    18944
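The summary above looks like pandas' `describe()` output. The same numbers can be reproduced with plain Python over a directory of downloaded crash files; a sketch, where the directory name is hypothetical and the quartiles use a simple nearest-rank rule (so they may differ slightly from interpolated ones):

```python
# Sketch: summarize per-file sizes (in KB) for a directory of downloaded
# processed-crash JSON documents. Directory name is hypothetical.
import os

def size_stats_kb(directory):
    """Return count/mean/min/quartiles/max of file sizes under directory."""
    sizes = sorted(
        os.path.getsize(os.path.join(root, name)) / 1024.0
        for root, _, names in os.walk(directory)
        for name in names
    )
    n = len(sizes)

    def pct(p):  # nearest-rank percentile on the sorted sizes
        return sizes[min(n - 1, int(p * n))]

    return {
        "count": n,
        "mean": sum(sizes) / n,
        "min": sizes[0],
        "25%": pct(0.25),
        "50%": pct(0.50),
        "75%": pct(0.75),
        "max": sizes[-1],
    }

# size_stats_kb("processed_crashes/")  # hypothetical download directory
```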
Then, to find the average number of processed crashes per week, I used this SuperSearch query (count of documents per week between the beginning of the year and now): https://crash-stats.mozilla.com/api/SuperSearch/?_results_number=0&_facets=_histogram.date&_histogram_interval.date=1w&date=%3E2016-01-04&date=%3C2016-05-16 Using the mean document size above, I got these results:
Average number of documents per week: 2,533,611
Estimated number of documents for 26 weeks: 65,873,886
Estimated size of 1 week of data: 1,174 GB
Estimated size of 26 weeks of data: 29 TB
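The estimate arithmetic can be checked in a few lines. This sketch takes the mean document size and weekly count from the numbers above and uses binary units (1 GB = 1024² KB):

```python
# Back-of-envelope storage estimate from the figures in this comment.
mean_size_kb = 486           # mean processed-crash size from the sample
docs_per_week = 2_533_611    # average weekly count from SuperSearch
weeks = 26

docs_26_weeks = docs_per_week * weeks
week_gb = docs_per_week * mean_size_kb / 1024 ** 2   # KB -> GB
total_tb = week_gb * weeks / 1024                    # GB -> TB

print(docs_26_weeks)           # 65873886 documents
print(round(week_gb))          # ~1174 GB per week
print(round(total_tb, 1))      # ~29.8 TB for 26 weeks
```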
Lonnen, is that what you were expecting? Is something missing?
Comment 4 (Reporter)•9 years ago
Thanks. This is exactly what I was after!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED