Closed Bug 775600 Opened 12 years ago Closed 12 years ago

Delete briarpatch-graphite-stage1.private.scl3

Categories

(Infrastructure & Operations :: Virtualization, task)

Platform: x86 Linux
Type: task
Priority: Not set
Severity: minor

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: gcox)

References

Details

(Whiteboard: [briarpatch])

The whisper db on briarpatch-graphite-stage1 is growing without bounds. I shut off the carbon collector this morning. If this is truly just a staging instance, we should have some kind of data culling in place to make sure we don't fill up the disk.
Until the production VM is set up in PHX (which was supposed to happen sometime this week), :lonnen was using the staging instance to test with. We did not know how large the on-disk database was going to get, and we wanted to find out what the size would be given the graphite configuration.
Whisper should be downsampling the values to lower precision. It's unclear to me what the difference is between specifying multiple retentions and including the storage-aggregation.conf file; I will investigate further tonight. Downsampling should work. If we need to cull data, we can change the retentions in storage-schemas to hold the data for a shorter period of time. The docs mention a tool called whisper-resize.py for modifying existing whisper dbs, but a search for it doesn't turn up much more info. I'll also look into using that to trim what we already have.
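For reference, the distinction works roughly like this under Graphite 0.9.x conventions (a sketch; the section names, pattern, and metric names below are illustrative, not taken from this box): multiple retentions in storage-schemas.conf define how long each downsampled archive is kept, while storage-aggregation.conf only controls how points are rolled up into the coarser archives, not how long anything is retained.

```ini
# storage-schemas.conf -- retention/downsampling steps per metric pattern
[briarpatch]
pattern = ^briarpatch\.
retentions = 60s:12h,1h:10d,1d:3y

# storage-aggregation.conf -- rollup method only (sum vs. average),
# not retention length
[counts]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
```

Existing .wsp files keep the retentions they were created with; whisper-resize.py rewrites a file in place, e.g. `whisper-resize.py some.metric.wsp 60s:12h 1h:10d 1d:3y`, run per file and ideally with carbon stopped.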
I modified the storage-schemas file to retain the data binned at the same resolution for less time:

- 60 second resolution for 12 hours
- 1 hour resolution for 10 days
- 1 day resolution for 3 years

I also created the storage-aggregation file, correcting a bug, so carbon should aggregate values with sum instead of mean. I believe carbon will need to be restarted for these changes to take effect, but they should help with the growth problem. It looks like the service has been more or less up for 10 days, which is reassuring from a disk space perspective.
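As a sanity check on what those retentions mean for disk space, here is a back-of-envelope sketch assuming whisper's documented fixed-size on-disk layout (16-byte metadata header, 12 bytes per archive header, 12 bytes per datapoint); the function name is mine, not anything from this box:

```python
def whisper_file_size(retentions):
    """Approximate size in bytes of one whisper file.

    retentions: list of (seconds_per_point, seconds_retained) tuples.
    Whisper preallocates every archive at creation time, so each
    metric's file is fixed-size; total disk usage grows with the
    number of metrics, not with time.
    """
    points_per_archive = [retained // step for step, retained in retentions]
    return 16 + 12 * len(retentions) + 12 * sum(points_per_archive)

# The scheme from this comment: 60s:12h, 1h:10d, 1d:3y
scheme = [(60, 12 * 3600), (3600, 10 * 86400), (86400, 3 * 365 * 86400)]
print(whisper_file_size(scheme))  # 24712 bytes, roughly 24 KB per metric
```

Under that assumption each metric costs only about 24 KB, which suggests the earlier unbounded growth came from the number of metric files being created rather than from individual files growing.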
QA Contact: lsblakk → hwine
The hard drive is full again on this box and it's mostly graphite data that's the problem.
(In reply to Chris Lonnen :lonnen from comment #3)
> 60 second resolution for 12 hours
> 1 hour resolution for 10 days
> 1 day resolution for 3 years

:lonnen: can you tell which of these buckets is responsible and adjust accordingly? I don't think we need 60 second resolution at all, frankly.
Coop: could you (or someone else from releng) recommend more sensible retention rates? If we don't need 60 seconds, we can drop that entirely for now.
(In reply to Chris Lonnen :lonnen from comment #6)
> Coop: could you (or someone else from releng) recommend more sensible
> retention rates? If we don't need 60 seconds, we can drop that entirely for
> now.

I will poll the group and see what we want.
The disk has filled up again, FYI:

[root@briarpatch-graphite-stage1.private.scl3 ~]# df -h /dev/sda3
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       194G  186G     0 100% /
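An aside on why df reports 0 available even though only 186G of 194G is used: ext filesystems typically reserve about 5% of blocks for root (an assumption here; the actual ratio on this box would show up in `tune2fs -l /dev/sda3`), and df's Avail column excludes that reservation. A quick check of the arithmetic:

```python
size_gb, used_gb = 194, 186
reserved_gb = 0.05 * size_gb  # default ext reserved-blocks ratio (assumed)
avail_to_users = max(0.0, size_gb - used_gb - reserved_gb)
print(round(reserved_gb, 1), avail_to_users)  # 9.7 0.0
```

So with ~9.7G reserved and only 8G of raw space left, non-root users see 0 available, which is consistent with the output above.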
Blocks: 786712
[12:48pm] mburns: coop: ping re: https://bugzilla.mozilla.org/show_bug.cgi?id=775600
[12:51pm] coop: mburns: pong
[12:52pm] mburns: Have you gotten a chance to poll the group on retention policies for briarpatch-graphite-stage1, or what older stuff can safely be rm'd until you can?
[12:55pm] coop: mburns: we've decided to just shut it off
[12:56pm] mburns: so want to decommission the whole server, or just turn it down for a while?
[12:56pm] armenzg_lunch is now known as armenzg.
[12:56pm] coop: we don't have anyone available to take over that work, and the data is not helping us at present
[12:57pm] coop: we'll turn off the collectors for now, and will most likely decommission the server soon
Per comment 9, we have no use for this disk. Please decommission.
Assignee: nobody → server-ops-storage
Component: Release Engineering: Developer Tools → Server Operations: Storage
QA Contact: hwine → dparsons
The disk is attached to a VM, so decommissioning the disk implies getting rid of the VM. Please pass the VM info to :jhopkins and he'll make the final call.
Flags: needinfo?(jhopkins)
Please proceed with deleting the briarpatch-graphite-stage1.private.scl3 VM.
Flags: needinfo?(jhopkins)
Assignee: server-ops-storage → gcox
Component: Server Operations: Storage → Server Operations: Virtualization
Powered off VM, in case there's screaming.
Severity: normal → minor
Summary: Disk on briarpatch-graphite-stage1 is full → Delete briarpatch-graphite-stage1.private.scl3
Removed from RHN and inventory. Pulled from DNS (change 65310) and puppet (change 65313). Deleted VM from disk.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations