Closed Bug 963768: opened 10 years ago, closed 10 years ago

stage NFS volume about to run out of space

Categories

(Release Engineering :: General, defect)

Priority: Not set
Severity: major

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: gcox, Assigned: nthomas)

References

Details

(Whiteboard: [reit-ops] [closed-trees])

The /stage and /ftp_stage volumes are on a single aggregate (because of the move to our new filer, they had to share space).

Here they were kinda full:
/vol/ftp_stage/        17500GB    15217GB      401GB      98%  /ftp_stage
/vol/stage/            15000GB    13523GB      401GB      97%  /stage

A few minutes later:
/vol/ftp_stage/        17500GB    15220GB      361GB      98%  /ftp_stage
/vol/stage/            15000GB    13555GB      361GB      98%  /stage

At this growth rate they probably have less than an hour before they fill up (rough arithmetic sketched below).
Because of where these volumes sit on the filer, I can't throw more space at them in the short term.

They need their growth curtailed asap.
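
For reference, the "less than an hour" figure falls out of the two snapshots above: roughly 40G of free space disappeared between them. A minimal sketch of the arithmetic in Python (the sample interval is an assumption; the snapshots are only described as "a few minutes" apart):

# Rough time-to-fill estimate from the two df samples of the shared aggregate.
# The 5-minute sample interval is an assumption; the snapshots above are only
# described as "a few minutes" apart.
free_before_gb = 401   # free space at the first snapshot
free_after_gb = 361    # free space at the second snapshot
interval_min = 5       # assumed gap between the two snapshots

growth_gb_per_min = (free_before_gb - free_after_gb) / interval_min
minutes_to_full = free_after_gb / growth_gb_per_min

print(f"~{growth_gb_per_min:.0f} GB/min incoming, "
      f"~{minutes_to_full:.0f} minutes until the aggregate is full")
# With these numbers: 8 GB/min and ~45 minutes, hence "less than an hour".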
There are some very large directories in /pub/mozilla.org/b2g/tinderbox-builds/, and we started uploading more data there in the last couple of days.
I'm syncing older b2g builds onto a 'cm-ixstore01' partition, and looking for other cleanup to buy time because the sync throughput isn't high enough on its own.
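
For illustration only, that sync boils down to something like the following; the paths and rsync options are assumptions, not the command actually being run:

# Hypothetical sketch of offloading older b2g tinderbox builds onto the
# cm-ixstore01 partition; paths and options are assumptions for illustration.
import subprocess

SRC = "/pub/mozilla.org/b2g/tinderbox-builds/"   # assumed source tree
DST = "/mnt/cm-ixstore01/b2g/tinderbox-builds/"  # assumed destination mount

# Copy the data across first; the originals only get removed once the copy
# is verified, which is part of why this alone frees space slowly.
subprocess.run(["rsync", "-a", SRC, DST], check=True)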
Don't know where the 125G drop came from in the hour after 2100 UTC; only about 30G/hour was coming into the ffxbld/mobile/b2g tinderbox-builds directories an hour or so later. The cleanup from comment #2 (the b2g/tinderbox-builds -> old/ bind mount on cm-ixstore, which gcox added for me) is chomping through the b2g-inbound-*_gecko dirs, which total about 700G. We're hovering at about 300G free at the moment.
:nthomas and I worked on IRC for a good while. As he noted, I added some rather hacky bind mounts (not my best puppet work) via changes 81295, 81296, and 81301 against:
modules/productdelivery/manifests/ftp.pp
modules/productdelivery/manifests/rsync.pp
modules/productdelivery/manifests/upload.pp
modules/productdelivery/manifests/upload_cron.pp
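
(The puppet changes themselves aren't reproduced here; at the filesystem level a bind mount of that shape boils down to roughly the following, with both paths being hypothetical rather than the ones in the manifests.)

# Hypothetical illustration of what the bind mounts amount to: exposing a
# directory on another volume inside the /pub tree so data can be shuffled
# off the full aggregate. Paths are assumptions, not the real ones.
import subprocess

SOURCE = "/mnt/cm-ixstore01/b2g/tinderbox-builds-old"
TARGET = "/pub/mozilla.org/b2g/tinderbox-builds/old"

subprocess.run(["mount", "--bind", SOURCE, TARGET], check=True)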

With those bind mounts in place, he started moving some data across to another volume, and I started the weekly dedupe run early (it normally fires on Saturday). Between the two we stemmed the tide, but were still teetering on the brink. We got DCops to move a shelf (bug 963802) from the old filer to this filer to give us some options, and I added it as a new aggregate on the new filer.

After nthomas / aki got bug 914111 backed out (comments 54-56), we started making real headway. The dedupe run took around 12 hours, and between not taking on data at the old rate, the dedupe making progress, and his old-data moves, we came back from 300G / ~1% free to 1800G / ~10% free.

At 0346 PST on Saturday 25Jan I started the volume move of 'stage' to the new aggregate.  Further deletes won't really help at this point due to the mechanics of volume moves.  Estimated completion time is around midday Tuesday when the volume should (if all goes right) flop over onto the new aggregate.
See Also: → 914111
Thanks for summarizing. I would tweak it a bit: the backout of bug 914111 and the real progress with moving data around kicked in at about the same time, when I split the copy job into 4 running in parallel. One of those failed out, so there's still some data to move, and the regular cron jobs will keep running too (e.g. ffxbld@upload-cron); hopefully that doesn't complicate the volume move. I'm going to work up something to analyze the disk usage and get some age distributions, which should help confirm or deny whether bug 914111 was the whole problem.

As of now, 2.25TB of data has been moved, and we have 1.8T free for /pub/mozilla.org/firefox and 2.7T free on /pub/mozilla.org. Does that mean the volume move has completed?
stage and ftp_stage share an aggregate. ftp_stage is 17500GB, stage is 15000GB, and the containing aggregate is 29802GB, so with both volumes full individually the aggregate can't hold both (we're oversubscribed; see the arithmetic sketch below). When they get into that state, both volumes show the same amount of free space: whatever is left in the aggregate. That's where we were. Now that there has been a lot of cleanup, usage has fallen to the point that stage's limiting factor is the space left in the volume itself rather than the aggregate; ftp_stage, being bigger, shows more free space.

The vol move is going to run until Monday or Tuesday (the estimate fluctuates a bit); it's a lot of data to move across.
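
To make the oversubscription arithmetic concrete (using the sizes from the comment above):

# Oversubscription arithmetic for the shared aggregate, using the sizes above.
ftp_stage_gb = 17500
stage_gb = 15000
aggr_gb = 29802

provisioned_gb = ftp_stage_gb + stage_gb
print(f"provisioned {provisioned_gb}GB in a {aggr_gb}GB aggregate, "
      f"oversubscribed by {provisioned_gb - aggr_gb}GB")

# While the aggregate is the bottleneck, both volumes report the same free
# space: whatever is left in the aggregate. Once cleanup brings usage down,
# each volume's own quota becomes the limit, so the larger ftp_stage volume
# shows more free space than stage.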
Ah, ok. I added a cron to puppet in revision 81315, which will keep the size of /pub/mozilla.org/b2g/tinderbox-builds under control, at the expense of 10.22.75.117:/tinderbox_builds (still more than 3T free); a rough, hypothetical sketch of that kind of cleanup is below.
I've got a start on an age-distribution analysis but need to spend more time on it to get useful conclusions out of it. In the meantime there's some work for 27.0 that needs doing, so I'll loop back to this bug in a day or two.
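
(The cron from r81315 isn't reproduced here. As a rough illustration only, a size-capping cleanup of that shape looks something like the sketch below; the path, the cap, and the delete-oldest-first policy are all assumptions.)

# Hypothetical sketch of a size-capping cleanup for b2g/tinderbox-builds.
# The real job was added to puppet in r81315; the path, cap, and deletion
# policy here are assumptions for illustration only.
import os
import shutil

ROOT = "/pub/mozilla.org/b2g/tinderbox-builds"  # assumed path
CAP_BYTES = 3 * 1024**4                         # assumed 3 TiB cap

def dir_size(path):
    """Total size of the regular files under path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass
    return total

def prune_oldest(root, cap):
    """Delete the oldest build directories until the tree fits under cap."""
    builds = sorted(
        (os.path.join(root, d) for d in os.listdir(root)
         if os.path.isdir(os.path.join(root, d))),
        key=os.path.getmtime,
    )
    used = dir_size(root)
    for build in builds:
        if used <= cap:
            break
        used -= dir_size(build)
        shutil.rmtree(build, ignore_errors=True)

if __name__ == "__main__":
    prune_oldest(ROOT, CAP_BYTES)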
Severity: blocker → major
Whiteboard: [reit-ops] [closed-trees]
The initial sync finished, which means the filer has gone into a keep-up/chase mode, looking for an opportunity where so little data has changed that it thinks it can pause the source, do a final sync, and cut over to the new volume in under 3 minutes (a rough model of this chase/cutover loop is sketched below).

That pause of the source shows up as a load spike on ftp (since NFS connections stall while it happens).

Unfortunately, after doing the pause it didn't complete the cutover within the 3 minutes, which ended up closing the trees (whee). And I didn't have any tool in my chest to stop it once it started trying, other than the abort-everything option, which would have thrown away everything we'd worked on; by then the trees had already been affected anyway.

Going to gripe to NetApp about the lack of options, and we'll circle back and try this again, assuming bug 957502 takes the trees down tonight.
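
A rough model of that chase/cutover loop, for anyone following along; this is an illustration of the behaviour described above, not NetApp's actual algorithm, and the numbers are made up:

# Rough mental model of the keep-up/chase behaviour during a vol move; an
# illustration only, not NetApp's actual algorithm. Inputs are made-up.
CUTOVER_WINDOW_SEC = 3 * 60  # the "under 3 minutes" cutover budget

def chase_and_cutover(sync_samples):
    """sync_samples: (estimated_sec, actual_sec) pairs for successive
    candidate cutover windows."""
    for estimated, actual in sync_samples:
        if estimated >= CUTOVER_WINDOW_SEC:
            continue                 # too much changed data; keep chasing
        # The source is paused for the attempt, which is what shows up as an
        # NFS stall and a load spike on ftp, whether or not the cutover lands.
        if actual <= CUTOVER_WINDOW_SEC:
            return "cutover complete"
        # The attempt timed out: the source resumes and the chase continues.
    return "still chasing"

# What happened here: the estimate looked fine, but the final sync overran
# the 3-minute window, so the source got paused (and the trees closed)
# without the cutover completing.
print(chase_and_cutover([(600, 650), (150, 240)]))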
Filed case 2004811821 with NetApp over the vol move being... unkind.
The vol move failed with read locks, even during the tree-closing window. Had to abort it and will have to set this up again for another try later.
Blocks: 914111
Depends on: 965907
Depends on: 971684
The volumes in question have been variously broken apart and moved around to the point where this is no longer a blocker. We're still tight on space until I get some behind-the-scenes work done, but we'll have options going forward.

Thanks for keeping the wolves at bay.  I'm happy with a closeout if you are.
RIP, bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED