Closed Bug 841441 Opened 11 years ago Closed 11 years ago

Increase retention time in Graphite

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ericz, Assigned: ericz)

Details

We need to figure out what a good retention time is for Graphite based on how much storage we have and projected number of metrics.
Of course the best case scenario would be forever ;) But that's probably not practical.

What are our options here? Can we have a year or two of data live or more? Can we "archive" old data and bring it back online if needed?
I'll see how many metrics we have and try to project disk usage in a few scenarios.  To give you an idea of production use, in my experience the typical scenario was to store most metrics for a week, some for longer (e.g. a month) and some for a shorter time.
(In reply to Shyam Mani [:fox2mike] from comment #1)
> What are our options here? Can we have a year or two of data live or more?
> Can we "archive" old data and bring it back online if needed?
that's not really how round robin databases work. they allocate the entire size of the file they are going to use as they are meant to write over old data when retention time completes.  rrd (ala rrdtool) is very rigid (and fast) but once you write the data, it's a bit painful to change the rollups/retention time/etc. whisper files (ala carbon/graphite) are more flexible (and slower), but i haven't done much on resizing those and what you lose when doing so.

Imo, we should shoot for a minimum of 18 months with as much granularity as we can spare. We can fudge some on the granularity if we need more space, but having a year+ lets you see seasonal changes/aberrations and makes some growth more visible.
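To illustrate the fixed-allocation behaviour described above, here is a minimal toy sketch of a round-robin archive. It is not rrdtool's or whisper's actual on-disk layout; the class `RoundRobinArchive` and its method names are made up for illustration. It only shows that all slots are allocated up front and that new points overwrite the oldest ones once the retention window is full.

```python
class RoundRobinArchive:
    """Toy fixed-size archive: one slot per interval, oldest data overwritten."""

    def __init__(self, seconds_per_point, points):
        self.seconds_per_point = seconds_per_point
        self.points = points
        # The whole archive is allocated up front and never grows.
        self.slots = [None] * points

    def update(self, timestamp, value):
        # Align the timestamp to the start of its interval.
        interval = timestamp - (timestamp % self.seconds_per_point)
        # Wrap around: a new point lands on the slot of the oldest one.
        index = (interval // self.seconds_per_point) % self.points
        self.slots[index] = (interval, value)

    def fetch(self, from_time, until_time):
        # Return the stored points inside the window, oldest first.
        stored = [slot for slot in self.slots if slot is not None]
        return sorted((t, v) for t, v in stored if from_time <= t <= until_time)


# Retention of 3 minutely points, then 5 minutes of writes.
archive = RoundRobinArchive(seconds_per_point=60, points=3)
for minute in range(5):
    archive.update(minute * 60, float(minute))

# Only the 3 most recent points survive; minutes 0 and 1 were overwritten.
print(archive.fetch(0, 300))
```

This is why "archiving" old data is not a natural operation here: the file has no older data to move aside, only slots that have already been reused.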
(In reply to casey ransom [:casey] from comment #3)
> (In reply to Shyam Mani [:fox2mike] from comment #1)
> > What are our options here? Can we have a year or two of data live or more?
> > Can we "archive" old data and bring it back online if needed?
> that's not really how round robin databases work. they allocate the entire
> size of the file they are going to use as they are meant to write over old
> data when retention time completes.  rrd (ala rrdtool) is very rigid (and
> fast) but once you write the data, it's a bit painful to change the
> rollups/retention time/etc. whisper files (ala carbon/graphite) are more
> flexible (and slower), but i haven't done much on resizing those and what
> you lose when doing so.

Spot on here.  When we turn the retention up, we will see what happens to the whisper files.  I /think/ it will just pad the files with zeros.

> Imo, we should shoot for a minimum of 18 months with as much granularity as
> we can spare. We can fudge some on the granularity if we need more space,
> but having a year+ lets you see seasonal changes/aberrations and makes some
> growth more visible.
Good point about seasonal changes/aberrations.  Eric and I discussed 12 months earlier today.  That should be no problem, and 18 months should be doable.  We need to get a better count of hosts in each data center and the average number of metrics per host first.

Eric:  I say we turn the retention up to 12 months and unleash all the collectd hosts.  From there we can measure the storage and make a better decision about an 18-month retention.
(In reply to Rick Bryce [:rbryce] from comment #4)
> Spot on here.  When we turn the retention up, we will see what happens to
> the whisper files.  I /think/ it will just pad the files with zeros.

I was morbidly curious so I looked at whisper-resize.py. It renames the old file, creates the new one, then backfills the old data into the new file.  That may take a while to complete but should be an online operation.  YMMV in a cluster though.
This has mostly been completed, with a little cleanup still to do all around.  We are fairly disk constrained at the moment, so as a compromise we have instituted the following retention scheme:

Minutely data stored for 30 days
Hourly data stored for 2 years
We're getting a lot more metrics from collectd than expected so we're going to have to reduce retention time further until we get some bigger disks.  For now, I've chosen:

Minutely data for 20 days
Hourly data for 1 year
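
As a rough way to compare the two schemes above, the per-metric whisper file size can be estimated from the retention settings. The sketch below assumes whisper's usual on-disk layout (a 16-byte file header, 12 bytes of header per archive, and 12 bytes per data point) and treats a year as 365 days. The metric count is a made-up placeholder, not a number from this bug. In carbon's storage-schemas.conf the second scheme would look roughly like `retentions = 1m:20d,1h:1y`.

```python
METADATA_BYTES = 16      # whisper file header
ARCHIVE_INFO_BYTES = 12  # one header entry per archive
POINT_BYTES = 12         # 4-byte timestamp + 8-byte double value

MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR


def whisper_file_bytes(archives):
    """Estimate file size for a list of (seconds_per_point, retention_seconds)."""
    total_points = sum(retention // step for step, retention in archives)
    return (METADATA_BYTES
            + ARCHIVE_INFO_BYTES * len(archives)
            + POINT_BYTES * total_points)


# Minutely for 30 days, hourly for 2 years.
first_scheme = [(MINUTE, 30 * DAY), (HOUR, 2 * 365 * DAY)]
# Minutely for 20 days, hourly for 1 year.
second_scheme = [(MINUTE, 20 * DAY), (HOUR, 365 * DAY)]

first_bytes = whisper_file_bytes(first_scheme)    # 728680 bytes per metric
second_bytes = whisper_file_bytes(second_scheme)  # 450760 bytes per metric

# Hypothetical metric count, for illustration only.
HYPOTHETICAL_METRIC_COUNT = 100_000
for name, size in (("first", first_bytes), ("second", second_bytes)):
    total_gib = size * HYPOTHETICAL_METRIC_COUNT / 2**30
    print(f"{name} scheme: {size} bytes/metric, {total_gib:.1f} GiB total")
```

Because every whisper file is fully allocated at creation, disk usage scales with the number of metrics rather than with how much data has actually been written, which is why an unexpected jump in collectd metrics forces the retention down.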
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
are there any intentions of going further than 1 year? hourly data from 20 days to 1 year is really harsh.
:casey yes, see 847994.  What would be a good trade-off between disk space and retention time for you?  I'm hoping to retain more, but I don't think this is bad either.
Product: mozilla.org → mozilla.org Graveyard