Bug 1131603 (Closed) · Opened 10 years ago · Closed 10 years ago

Reduce the stage data lifecycle since disk space is extremely low

Categories: Tree Management :: Treeherder: Infrastructure, defect, P1

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: emorley; Assigned: fubar

We ran out of disk space on stage, since it only has a 300GB disk vs prod's 700GB, yet the data lifecycle is the same (4 months). We hadn't hit this problem until now because stage was so new that we hadn't yet been ingesting for a full 4 months. Let's shorten the lifecycle to something like 2 months and see how we go.
Assignee: nobody → emorley
fubar, am I correct in thinking I can't just modify treeherder-service/treeherder/settings/local.py, since the changes will be overwritten by puppet? If so, could you append the following line to /data/treeherder-stage/src/treeherder.allizom.org/treeherder-service/treeherder/settings/local.py:

DATA_CYCLE_INTERVAL = timedelta(days=30*2)

Thanks :-)
Flags: needinfo?(klibby)
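For context, a minimal sketch of what the appended override in local.py would need to look like; this assumes timedelta is imported from the standard library at the top of the settings module, which the bug doesn't show:

    # Stage-only override in treeherder/settings/local.py (puppet-managed).
    # Minimal sketch; the rest of the settings module isn't shown here.
    from datetime import timedelta

    # Expire ingested data after ~2 months instead of the current 4,
    # since stage has a 300GB disk vs prod's 700GB.
    DATA_CYCLE_INTERVAL = timedelta(days=30 * 2)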
You are correct. Added to puppet and kicked off puppet runs on staging.
Flags: needinfo?(klibby)
Thank you :-)
Assignee: emorley → klibby
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Unfortunately, we're already using too much even though we just upped the disk space on stage to 400G:

/dev/sdb1 394G 338G 37G 91% /data

Here are the biggest databases, which make up over 70% of the used disk space:

11G   mozilla_aurora_jobs_1
19G   b2g_inbound_jobs_1
22G   mozilla_central_jobs_1
38G   try_jobs_1
40G   fx_team_jobs_1
110G  mozilla_inbound_jobs_1

We are keeping binary logs for only *2* days right now, which is itself over 80G, but why has the growth been so explosive? Two nights ago I defragmented all the tables just in case, but that did not reclaim enough space. Can you double-check why we're still having disk space issues? Maybe the data purge isn't working as expected?
Status: RESOLVED → REOPENED
Flags: needinfo?(emorley)
Resolution: FIXED → ---
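One way to reproduce a per-database size listing like the one above is to sum data and index lengths from information_schema; a hedged Python sketch, assuming the MySQLdb driver and placeholder credentials (the actual query used isn't recorded in this bug). Note this reflects logical table size, so it can undercount on-disk fragmentation:

    # Sketch: list MySQL databases by logical size, largest last.
    import MySQLdb  # assumes the MySQL-python driver is installed

    conn = MySQLdb.connect(host="localhost", user="root", passwd="...")  # placeholders
    cur = conn.cursor()
    cur.execute(
        "SELECT table_schema,"
        "       ROUND(SUM(data_length + index_length) / POW(1024, 3), 1)"
        "  FROM information_schema.tables"
        " GROUP BY table_schema"
        " ORDER BY 2 ASC"
    )
    for schema, size_gb in cur.fetchall():
        print("%6sG  %s" % (size_gb, schema))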
I have reduced the binary logs to keeping only 1 day's worth and we're still in a bad place:

/dev/sdb1 394G 329G 46G 88% /data

We cannot reduce binary logs any more without compromising our backups.
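For reference, the runtime equivalent of that retention change, sketched in Python reusing the cursor from the sketch above; this assumes a pre-8.0 MySQL where expire_logs_days applies, and the persistent value would really live in the puppet-managed my.cnf:

    # Sketch: cap binary log retention at 1 day and purge older logs.
    cur.execute("SET GLOBAL expire_logs_days = 1")
    cur.execute("PURGE BINARY LOGS BEFORE NOW() - INTERVAL 1 DAY")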
Good timing - I just opened bug 1134621 to look into the explosive growth - disk usage has gone up 23% in 28 hours.

Just to confirm: we're already only keeping logs for 1 day on prod, and now stage matches that too?
Blocks: 1134621
Flags: needinfo?(emorley)
fubar, could you adjust the puppet-controlled copy of /data/treeherder-stage/src/treeherder.allizom.org/treeherder-service/treeherder/settings/local.py, changing its current line:

DATA_CYCLE_INTERVAL = timedelta(days=30*2)

to:

DATA_CYCLE_INTERVAL = timedelta(days=45)

(Just for stage.) Thanks :-)
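The resulting puppet-managed file would differ by one line; a sketch of the stage copy after the change (again assuming the timedelta import is already present):

    # treeherder/settings/local.py on stage, after the change.
    from datetime import timedelta

    # Was timedelta(days=30 * 2); 45 days keeps stage inside its
    # smaller disk while prod keeps the longer lifecycle.
    DATA_CYCLE_INTERVAL = timedelta(days=45)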
Updated; stage nodes will pick it up w/in ~60. Ping if you want a manual update.
Thank you :-) (auto is fine) Will move the rest of the discussion to bug 1134621.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Ed Morley [:edmorley] from comment #6)
> Good timing - I just opened bug 1134621 to look into the explosive growth -
> disk usage has gone up 23% in 28 hours.
>
> Just to confirm: we're already only keeping logs for 1 day on prod, and now
> stage matches that too?

Correct. Ideally we'd keep 7-10 days of logs so we can trace issues like "when did X change happen?". One day of logs is the minimum we can do, but it is not recommended *at all*. Now that there's more disk space on production, we can increase that to 2 days of logs (we could do 3, but that would put us right at the edge of the paging threshold). We'd love to do that on both prod and stage, but stage can't handle it.