Bug 1131603 (Closed) · Opened 10 years ago · Closed 10 years ago

Reduce the stage data lifecycle since disk space is extremely low

Categories: Tree Management :: Treeherder: Infrastructure, defect, P1

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: emorley; Assigned: fubar

We ran out of disk space on stage, since it only has a 300GB disk vs prod's 700GB, yet the data lifecycle is the same (4 months). We hadn't hit this problem until now because stage was so new that we hadn't yet been ingesting for a full 4 months. Let's shorten the lifecycle to something like 2 months and see how we go.
Assignee: nobody → emorley
fubar, am I correct in thinking I can't just modify treeherder-service/treeherder/settings/local.py, since the changes will be overwritten by puppet? If so, could you append the following line to /data/treeherder-stage/src/treeherder.allizom.org/treeherder-service/treeherder/settings/local.py:

DATA_CYCLE_INTERVAL = timedelta(days=30*2)

Thanks :-)
Flags: needinfo?(klibby)
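For context, a minimal sketch of what the appended override in local.py would need to look like; this assumes timedelta is imported from the standard library at the top of the settings module, which the bug doesn't show:

    # Stage-only override in treeherder/settings/local.py (puppet-managed).
    # Minimal sketch; the rest of the settings module isn't shown here.
    from datetime import timedelta

    # Expire ingested data after ~2 months instead of the current 4,
    # since stage has a 300GB disk vs prod's 700GB.
    DATA_CYCLE_INTERVAL = timedelta(days=30 * 2)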
You are correct. Added to puppet and kicked off puppet runs on staging.
Flags: needinfo?(klibby)
Thank you :-)
Assignee: emorley → klibby
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Unfortunately, we're already using too much even though we just upped the disk space on stage to 400G:

/dev/sdb1 394G 338G 37G 91% /data

Here are the biggest databases, which make up over 70% of the used disk space:

11G   mozilla_aurora_jobs_1
19G   b2g_inbound_jobs_1
22G   mozilla_central_jobs_1
38G   try_jobs_1
40G   fx_team_jobs_1
110G  mozilla_inbound_jobs_1

We are keeping binary logs for only *2* days right now, which is itself over 80G, but why has the growth been so explosive? Two nights ago I defragmented all the tables just in case, but that did not reclaim enough space. Can you double-check why we're still having disk space issues? Maybe the data purge isn't working as expected?
Status: RESOLVED → REOPENED
Flags: needinfo?(emorley)
Resolution: FIXED → ---
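One way to reproduce a per-database size listing like the one above is to sum data and index lengths from information_schema; a hedged Python sketch, assuming the MySQLdb driver and placeholder credentials (the actual query used isn't recorded in this bug). Note this reflects logical table size, so it can undercount on-disk fragmentation:

    # Sketch: list MySQL databases by logical size, largest last.
    import MySQLdb  # assumes the MySQL-python driver is installed

    conn = MySQLdb.connect(host="localhost", user="root", passwd="...")  # placeholders
    cur = conn.cursor()
    cur.execute(
        "SELECT table_schema,"
        "       ROUND(SUM(data_length + index_length) / POW(1024, 3), 1)"
        "  FROM information_schema.tables"
        " GROUP BY table_schema"
        " ORDER BY 2 ASC"
    )
    for schema, size_gb in cur.fetchall():
        print("%6sG  %s" % (size_gb, schema))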
I have reduced the binary logs to keeping only 1 day's worth and we're still in a bad place:

/dev/sdb1 394G 329G 46G 88% /data

We cannot reduce binary logs any more without compromising our backups.
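For reference, the runtime equivalent of that retention change, sketched in Python reusing the cursor from the sketch above; this assumes a pre-8.0 MySQL where expire_logs_days applies, and the persistent value would really live in the puppet-managed my.cnf:

    # Sketch: cap binary log retention at 1 day and purge older logs.
    cur.execute("SET GLOBAL expire_logs_days = 1")
    cur.execute("PURGE BINARY LOGS BEFORE NOW() - INTERVAL 1 DAY")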
Good timing - I just opened bug 1134621 to look into the explosive growth - disk usage has gone up 23% in 28 hours.

Just to confirm: we're already only keeping logs for 1 day on prod, and now stage matches that too?
Blocks: 1134621
Flags: needinfo?(emorley)
fubar, could you adjust the puppet-controlled copy of /data/treeherder-stage/src/treeherder.allizom.org/treeherder-service/treeherder/settings/local.py, changing its current line:

DATA_CYCLE_INTERVAL = timedelta(days=30*2)

to:

DATA_CYCLE_INTERVAL = timedelta(days=45)

(Just for stage.) Thanks :-)
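The resulting puppet-managed file would differ by one line; a sketch of the stage copy after the change (again assuming the timedelta import is already present):

    # treeherder/settings/local.py on stage, after the change.
    from datetime import timedelta

    # Was timedelta(days=30 * 2); 45 days keeps stage inside its
    # smaller disk while prod keeps the longer lifecycle.
    DATA_CYCLE_INTERVAL = timedelta(days=45)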
Updated; stage nodes will pick it up w/in ~60. Ping if you want a manual update.
Thank you :-) (auto is fine) Will move the rest of the discussion to bug 1134621.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Ed Morley [:edmorley] from comment #6)
> Good timing - I just opened bug 1134621 to look into the explosive growth -
> disk usage has gone up 23% in 28 hours.
>
> Just to confirm: we're already only keeping logs for 1 day on prod, and now
> stage matches that too?

Correct. Ideally we'd keep 7-10 days of logs so we can trace issues like "when did X change happen?". One day of logs is the minimum we can do, but it is not recommended *at all*. Now that there's more disk space on production, we can increase that to 2 days of logs (we could do 3, but that would put us right at the edge of the paging threshold). We'd love to do that on both prod and stage, but stage can't handle it.