Closed Bug 716441 Opened 12 years ago Closed 12 years ago

Disk usage report problem on 10.253.0.10:/vol/ftp_stage (mpt-netapp-a)

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: dparsons)

Attachments

(1 file)

If I create a 1G file and then delete it, the free space does not recover. E.g. on surf:

$ cd /mnt/netapp/stage
$ df -m .
Filesystem           1M-blocks      Used Available Use% Mounted on
10.253.0.10:/vol/ftp_stage
                       6291457   5714983    576474  91% /mnt/netapp/stage

$ dd if=/dev/zero of=testfile bs=1K count=1M
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 20.5196 seconds, 52.3 MB/s
$ df -m .
Filesystem           1M-blocks      Used Available Use% Mounted on
10.253.0.10:/vol/ftp_stage
                       6291457   5716018    575439  91% /mnt/netapp/stage

$ rm testfile 
rm: remove regular file `testfile'? y
$ df -m .
Filesystem           1M-blocks      Used Available Use% Mounted on
10.253.0.10:/vol/ftp_stage
                       6291457   5715988    575469  91% /mnt/netapp/stage

The 30M improvement in free space is likely due to other changes happening on this partition, but I was expecting 1G's worth.
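The manual test above could be scripted roughly as follows. This is only a sketch: the helper names avail_mb, delta and probe are made up for illustration, and the probe file name is arbitrary. It uses df -P so a long NFS device name cannot wrap the numbers onto a second line, and rm -f to skip the interactive prompt.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, not from the bug.

# avail_mb DIR: available space on DIR's filesystem, in MB.
# -P forces one output line per filesystem; -k reports 1K blocks.
avail_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# delta BEFORE AFTER: signed change (AFTER minus BEFORE).
delta() {
    echo $(( $2 - $1 ))
}

# probe DIR SIZE_MB: write SIZE_MB of zeros into DIR, delete the file,
# and report how the available space moved at each step.
probe() {
    dir=$1
    size=$2
    file="$dir/df-probe.$$"
    before=$(avail_mb "$dir")
    dd if=/dev/zero of="$file" bs=1024k count="$size" 2>/dev/null
    sync
    during=$(avail_mb "$dir")
    rm -f "$file"
    sync
    after=$(avail_mb "$dir")
    echo "after write:  $(delta "$before" "$during") MB"
    echo "after delete: $(delta "$during" "$after") MB"
}

# Example call (path taken from the transcript above):
#   probe /mnt/netapp/stage 1024
```

Plugging in the Available figures from the transcript, delta 576474 575439 gives -1035 for the write and delta 575439 575469 gives 30 for the delete, i.e. about 1G consumed and only 30M returned.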

I've noticed the same problem when deleting many files in bug 715840: a steady deletion of more than 300G of files didn't result in any change in the free space reported by df. However, some hours/days later the free space did jump up by about the right value. Nagios has had quite a bit to say about CPU usage on this netapp too. Are there known issues with it right now?
FWIW, bug 715706 recently added a bind mount from another NFS mount into that share, and bug 715026 tracks potential corruption of the / partition on surf.
Trending is at http://people.mozilla.com/~nthomas/trend-recent.png, where "Everything else" is /mnt/netapp/stage. There seems to be an upward spike every Pacific midnight. Is this from data deduplication, or is the partition too big for that?

Also, I can't reproduce the issue on mpt-netapp-b using dd.
Dedupe starts at 12AM Pacific time and is usually finished before 1AM.
FWIW, I've experienced this type of thing in the past as well, and it wasn't specific to any particular share or netapp unit, or even to netapp in general.

I think what's happening is that the delete operation is "succeeding" on the client far more quickly than the operation actually completes on the filer, hence you see a slow regain of the used space. "df" has no way of knowing about this, so it reports whatever happens to be free right at that time.

I don't understand why this would happen with a single 1GB file, as your example illustrates; single files generally delete very quickly. I do think it explains why deleting many files would take hours or days to fully regain all of the space.
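One way to check the delayed-reclaim idea would be to sample free space at a fixed interval right after a delete and see whether it creeps back up. A minimal sketch, assuming a POSIX shell; the function name sample_free is hypothetical:

```sh
#!/bin/sh
# Sketch only: sample_free is a made-up name.

# sample_free DIR COUNT INTERVAL_SECONDS
# Prints COUNT lines of "timestamp free_MB" for DIR's filesystem,
# sleeping INTERVAL_SECONDS between samples.
sample_free() {
    i=0
    while [ "$i" -lt "$2" ]; do
        printf '%s    %s\n' "$(date '+%b %d %H:%M:%S %Y')" \
            "$(df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }')"
        i=$((i + 1))
        if [ "$i" -lt "$2" ]; then
            sleep "$3"
        fi
    done
}

# Example: one sample a minute for an hour after the delete.
#   sample_free /mnt/netapp/stage 60 60
```

If the numbers stay flat until the next dedupe run or snapshot change, the space is being held by something on the filer rather than by a slow delete.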

We do have some known load issues on certain NetApp filers. I don't know off-hand if this is one of the affected ones or how severely it is affected, but I suspect there are some issues.

Almost all NetApp-related performance issues (at least in SJC1, where surf is) should be resolved in Q1 or early-Q2 as we are moving to a new datacenter which has newer, much more powerful NetApp units. Way more CPU horsepower, IOPS capacity, and network bandwidth.

If that's not soon enough, we'll need to dig further into this for a cause and treatment options.
Assignee: server-ops → dparsons
Attached image trend
We're on track to fill the partition, and from here it seems that either something is wrong with the netapp, or there are things going on behind the curtain which make it very difficult for me to keep nagios happy.

Backstory: for the last couple of days nagios has been WARNING or CRITICAL on surf:disk - /mnt/netapp/stage, which is mpt-netapp-a:/vol/ftp_stage. Recently we had > 550G free on this partition after granting it more space and moving some bits off it.

On Jan 14 the free space drops suddenly:
Jan 14 12:15:01 2012    561438 
Jan 14 12:45:01 2012    591861 
Jan 14 13:15:01 2012    590922 
Jan 14 13:45:01 2012    589807 
Jan 14 14:15:01 2012    400940 
Jan 14 14:45:01 2012    587656 
Jan 14 15:15:01 2012    586672 
Jan 14 15:45:02 2012    586319 
Jan 14 16:15:01 2012    587198 
Jan 14 16:45:01 2012    588884 
Jan 14 17:15:01 2012    588500 
Jan 14 17:45:01 2012    115746 
Jan 14 18:15:01 2012    115261 
Jan 14 18:45:01 2012    125088 
Jan 14 19:15:01 2012    273832 
Jan 14 19:45:01 2012    321371 
Jan 14 20:15:01 2012    319750 
Jan 14 20:45:02 2012    318705 
Jan 14 21:15:01 2012    317690 
(Pacific times, free space in MB from df). The increase at 19:15 and 19:45 corresponds to a 200G snapshot getting deleted, IIRC.
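To pick the sudden drops out of a log like the one above without eyeballing it, something along these lines could work. It is a sketch that assumes the exact "Mon DD HH:MM:SS YYYY free_MB" layout shown; the function name find_drops and the threshold are illustrative.

```sh
#!/bin/sh
# Sketch only: find_drops is a made-up name.

# find_drops THRESHOLD_MB
# Reads "Mon DD HH:MM:SS YYYY  FREE_MB" lines on stdin and prints each
# sample whose free space fell by more than THRESHOLD_MB since the
# previous sample.
find_drops() {
    awk -v threshold="$1" '
        NF >= 5 && $NF ~ /^[0-9]+$/ {
            free = $NF
            if (seen && prev - free > threshold)
                printf "%s %s %s: %d -> %d (-%d MB)\n",
                    $1, $2, $3, prev, free, prev - free
            prev = free
            seen = 1
        }
    '
}

# Example (log file name is hypothetical):
#   find_drops 100000 < free-space.log
```

Run over the samples above with a 100000 MB threshold, this flags two intervals: 14:15 (589807 to 400940) and 17:45 (588500 to 115746).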

Currently we're down to 115G free, and we don't see any improvement when deleting files until the next dedupe runs at midnight. I am looking for recent additions of big files, but I'm pretty sure something is happening on the netapp too.
From IRC on the 19th, lerxst says the space was taken up in 313GB of snapshots, which were automatically created by the netapp to allow files to be restored to earlier states. Those snapshots have now been removed, and new ones disabled. Thanks!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard