Closed
Bug 1131603
Opened 10 years ago
Closed 10 years ago
Reduce the stage data lifecycle since disk space is extremely low
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: fubar)
References
Details
We ran out of disk space on stage, since it only has a 300GB disk vs prod's 700GB, yet the data lifecycle is the same (4 months).
We hadn't hit this problem until now, since stage was so new we hadn't yet been ingesting for the 4 months.
Let's shorten it to something like 2 months and see how we go.
Reporter
Updated•10 years ago
Assignee: nobody → emorley
Reporter
Comment 1•10 years ago
fubar, am I correct in thinking I can't just modify treeherder-service/treeherder/settings/local.py, since the changes will be overwritten by puppet?
If so, could you append the following line to /data/treeherder-stage/src/treeherder.allizom.org/treeherder-service/treeherder/settings/local.py
DATA_CYCLE_INTERVAL = timedelta(days=30*2)
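(For completeness, a minimal sketch of the relevant local.py lines, assuming the file doesn't already import timedelta - the setting needs that import to be valid Python:)

    from datetime import timedelta

    # Keep roughly 2 months of ingested data on stage instead of the default 4 months.
    DATA_CYCLE_INTERVAL = timedelta(days=30*2)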
Thanks :-)
Flags: needinfo?(klibby)
Assignee
Comment 2•10 years ago
you are correct. added to puppet and have kicked off puppet runs on staging.
Flags: needinfo?(klibby)
Reporter
Comment 3•10 years ago
Thank you :-)
Assignee: emorley → klibby
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 4•10 years ago
Unfortunately, we're already using too much even though we just upped the disk space on stage to 400G:
/dev/sdb1 394G 338G 37G 91% /data
Here are the biggest databases, which make up over 70% of the used disk space:
11G mozilla_aurora_jobs_1
19G b2g_inbound_jobs_1
22G mozilla_central_jobs_1
38G try_jobs_1
40G fx_team_jobs_1
110G mozilla_inbound_jobs_1
We are keeping binary logs for only *2* days right now, and even that amounts to over 80G, but... why has the growth been so explosive?
2 nights ago I defragmented all the tables just in case, but that did not reclaim enough space.
Can you double check to see why we're still having disk space issues? Maybe the data purge isn't working as expected?
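(For reference, a hedged sketch of how the per-database totals above can be pulled from MySQL's information_schema, e.g. via Django's default DB connection from a shell on the stage node; any MySQL client with access to information_schema would work just as well:)

    from django.db import connection

    # Sum data + index size per schema, in GB, largest first.
    SIZE_QUERY = """
        SELECT table_schema,
               ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
        FROM information_schema.tables
        GROUP BY table_schema
        ORDER BY size_gb DESC
    """

    cursor = connection.cursor()
    cursor.execute(SIZE_QUERY)
    for schema, size_gb in cursor.fetchall():
        print(schema, size_gb)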
Status: RESOLVED → REOPENED
Flags: needinfo?(emorley)
Resolution: FIXED → ---
Comment 5•10 years ago
I have reduced the binary logs to only keeping 1 day's worth of logs and we're still in a bad place:
/dev/sdb1 394G 329G 46G 88% /data
We cannot reduce binary logs any more without compromising our backups.
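(Hedged sketch: expire_logs_days is the standard MySQL 5.x global controlling how many days of binary logs mysqld keeps before purging, so the retention change above should be visible like this - shown via Django's DB connection to keep the example in Python:)

    from django.db import connection

    cursor = connection.cursor()
    cursor.execute("SHOW GLOBAL VARIABLES LIKE 'expire_logs_days'")
    # Expect something like ('expire_logs_days', '1') after this change.
    print(cursor.fetchone())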
Reporter
Comment 6•10 years ago
Good timing - I just opened bug 1134621 to look at why the growth has been so explosive; disk usage has gone up 23% in 28 hours.
Just to confirm - we're already only keeping logs for 1 day on prod - and now stage matches that too?
Blocks: 1134621
Flags: needinfo?(emorley)
Reporter
Comment 7•10 years ago
fubar, could you adjust the puppet-controlled copy of /data/treeherder-stage/src/treeherder.allizom.org/treeherder-service/treeherder/settings/local.py's current line:
DATA_CYCLE_INTERVAL = timedelta(days=30*2)
To be:
DATA_CYCLE_INTERVAL = timedelta(days=45)
(Just for stage)
Thanks :-)
Assignee
Comment 8•10 years ago
updated; stage nodes will pick it up w/in ~60. ping if you want a manual update.
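(Once puppet has run, a quick sanity check that the new value is live - e.g. from python manage.py shell on a stage node; this just reads the setting, nothing more:)

    from django.conf import settings

    # Should now be a 45-day timedelta on stage.
    print(settings.DATA_CYCLE_INTERVAL)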
Reporter
Comment 9•10 years ago
Thank you :-) (auto is fine)
Will move the rest of the discussion to bug 1134621.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Comment 10•10 years ago
(In reply to Ed Morley [:edmorley] from comment #6)
> Good timing - I just opened bug 1134621 to look at why the growth has been
> so explosive; disk usage has gone up 23% in 28 hours.
>
> Just to confirm - we're already only keeping logs for 1 day on prod - and
> now stage matches that too?
Correct. Ideally we keep 7-10 days of logs, so we can trace issues like "when did X change happen"? 1 day of logs is the minimum we can do, but not recommended *at all*.
Now that there's more disk space on production, we can increase that number to 2 days of logs (we could do 3 but that would put us right at the edge of the paging threshold). We'd love to do that on prod and stage, but.....stage can't handle it.