Closed Bug 716441 Opened 13 years ago Closed 13 years ago

Disk usage report problem on 10.253.0.10:/vol/ftp_stage (mpt-netapp-a)

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: dparsons)

Attachments

(1 file)

If I create a 1G file then delete it again, the free space does not recover. E.g. on surf:

$ cd /mnt/netapp/stage
$ df -m .
Filesystem                 1M-blocks    Used Available Use% Mounted on
10.253.0.10:/vol/ftp_stage   6291457 5714983    576474  91% /mnt/netapp/stage
$ dd if=/dev/zero of=testfile bs=1K count=1M
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 20.5196 seconds, 52.3 MB/s
$ df -m .
Filesystem                 1M-blocks    Used Available Use% Mounted on
10.253.0.10:/vol/ftp_stage   6291457 5716018    575439  91% /mnt/netapp/stage
$ rm testfile
rm: remove regular file `testfile'? y
$ df -m .
Filesystem                 1M-blocks    Used Available Use% Mounted on
10.253.0.10:/vol/ftp_stage   6291457 5715988    575469  91% /mnt/netapp/stage

The 30M improvement in free space is likely due to other changes happening on this partition, but I was expecting 1G's worth.

I've noticed the same problem deleting many files in bug 715840: a steady deletion of more than 300G of files didn't result in any change in the free space reported by df. However, some hours/days later the free space did jump up by about the right value.

Nagios has had quite a bit to say about CPU usage on this netapp too. Are there known issues with it right now?
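For anyone wanting to repeat the create/delete check without typing the df and dd steps by hand, here is a minimal Python sketch. It is not what was run above; the helper names `free_mb` and `measure_reclaim` are made up for illustration, and it relies on `shutil.disk_usage`, which reports the space available to the calling user (roughly the "Available" column of `df -m`).

```python
import os
import shutil
import tempfile


def free_mb(path):
    """Free space in MiB as the client sees it for the filesystem holding path."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def measure_reclaim(directory, size_bytes, chunk=1024 * 1024):
    """Write a zero-filled file of size_bytes, delete it, and record free space
    before the write, after the write, and after the delete (hypothetical helper)."""
    before = free_mb(directory)
    fd, name = tempfile.mkstemp(dir=directory, prefix="reclaim-test-")
    try:
        with os.fdopen(fd, "wb") as f:
            block = b"\0" * chunk
            remaining = size_bytes
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(block[:n])
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # push the data to the server before sampling
        after_write = free_mb(directory)
    finally:
        os.remove(name)  # always clean up the test file
    after_delete = free_mb(directory)
    return {
        "path": name,
        "before": before,
        "after_write": after_write,
        "after_delete": after_delete,
    }


if __name__ == "__main__":
    # Small size by default; pass a larger size and the real mount point to
    # mimic the 1G test described in this bug.
    result = measure_reclaim(tempfile.gettempdir(), 8 * 1024 * 1024)
    print(result)
```

On a busy shared volume the three numbers will also move because of other writers, so a single run only gives a rough indication.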
FWIW, bug 715706 recently added a bind mount from another NFS mount into that share, and bug 715026 tracks potential corruption of the / partition on surf.
Trending is at http://people.mozilla.com/~nthomas/trend-recent.png, where "Everything else" is /mnt/netapp/stage. There seems to be an upward spike every Pacific midnight. Is this from data deduplication, or is the partition too big for that? Also, I can't reproduce the issue on mpt-netapp-b using dd.
Dedupe starts at 12AM Pacific time and is usually finished before 1AM.
FWIW, I've experienced this type of thing in the past as well, and it was not specific to any particular share or netapp unit, or even to netapp in general. I think what's happening is that the delete operation is "succeeding" on the client far more quickly than the operation actually happens on the filer, hence you see a slow regain of the used space. "df" has no way of knowing about this, so it reports whatever happens to be free right at that time. I don't understand why this would happen with a single 1GB file as your example illustrates, since those generally delete very quickly. I think it does explain why deleting many files would take hours or days to fully regain all of the space.

We do have some known load issues on certain NetApp filers. I don't know off-hand if this is one of the affected ones or how severely it is affected, but I suspect there are some issues. Almost all NetApp-related performance issues (at least in SJC1, where surf is) should be resolved in Q1 or early Q2, as we are moving to a new datacenter which has newer, much more powerful NetApp units: way more CPU horsepower, IOPS capacity, and network bandwidth. If that's not soon enough, we'll need to dig further into this for a cause and treatment options.
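If the space really is reclaimed lazily after the client-side delete returns, one way to observe that is to keep sampling free space until it climbs back to an expected level. The sketch below assumes that explanation; `wait_for_reclaim` and its parameters are hypothetical names, and `get_free` is whatever callable returns the current free space in MB (for example a wrapper around `shutil.disk_usage`).

```python
import time


def wait_for_reclaim(get_free, target_mb, timeout_s=3600, interval_s=60,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_free() until it reports at least target_mb or the timeout expires.

    Returns (reached, samples), where samples is every reading taken, in order.
    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    start = clock()
    samples = []
    while True:
        free = get_free()
        samples.append(free)
        if free >= target_mb:
            return True, samples
        if clock() - start >= timeout_s:
            return False, samples
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated readings standing in for repeated df calls.
    readings = iter([575439, 575469, 575900, 576500])
    reached, seen = wait_for_reclaim(lambda: next(readings), 576474,
                                     sleep=lambda s: None)
    print(reached, seen)
```

The list of samples shows whether the space comes back gradually or in one jump, which helps separate a slow background delete from something else holding the blocks.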
Assignee: server-ops → dparsons
Attached image trend
We're on track to fill the partition, and from here it seems that either something is wrong with the netapp, or there are things going on behind the curtain which make it very difficult for me to keep nagios happy.

Backstory: for the last couple of days nagios has been WARNING or CRITICAL on surf:disk - /mnt/netapp/stage, which is mpt-netapp-a:/vol/ftp_stage. Recently we had > 550G free on this partition after granting it more space and moving some bits off it. On Jan 14 the space drops suddenly:

Jan 14 12:15:01 2012 561438
Jan 14 12:45:01 2012 591861
Jan 14 13:15:01 2012 590922
Jan 14 13:45:01 2012 589807
Jan 14 14:15:01 2012 400940
Jan 14 14:45:01 2012 587656
Jan 14 15:15:01 2012 586672
Jan 14 15:45:02 2012 586319
Jan 14 16:15:01 2012 587198
Jan 14 16:45:01 2012 588884
Jan 14 17:15:01 2012 588500
Jan 14 17:45:01 2012 115746
Jan 14 18:15:01 2012 115261
Jan 14 18:45:01 2012 125088
Jan 14 19:15:01 2012 273832
Jan 14 19:45:01 2012 321371
Jan 14 20:15:01 2012 319750
Jan 14 20:45:02 2012 318705
Jan 14 21:15:01 2012 317690

(Pacific times, free space in MB from df.) The increase at 19:15 and 19:45 corresponds to a 200G snapshot getting deleted, IIRC.

Currently we're down to 115G free, and don't see any improvement when deleting files until the next dedupe runs at midnight. I am looking for recent additions of big files, but I'm pretty sure something is happening on the netapp too.
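To make the sudden changes in the samples above easier to pick out, here is a small Python sketch that computes the change between consecutive readings and lists the swings above a threshold. The function names and the 100000 MB threshold are arbitrary choices for illustration; the data is copied from the comment above.

```python
SAMPLES = """\
Jan 14 12:15:01 2012 561438
Jan 14 12:45:01 2012 591861
Jan 14 13:15:01 2012 590922
Jan 14 13:45:01 2012 589807
Jan 14 14:15:01 2012 400940
Jan 14 14:45:01 2012 587656
Jan 14 15:15:01 2012 586672
Jan 14 15:45:02 2012 586319
Jan 14 16:15:01 2012 587198
Jan 14 16:45:01 2012 588884
Jan 14 17:15:01 2012 588500
Jan 14 17:45:01 2012 115746
Jan 14 18:15:01 2012 115261
Jan 14 18:45:01 2012 125088
Jan 14 19:15:01 2012 273832
Jan 14 19:45:01 2012 321371
Jan 14 20:15:01 2012 319750
Jan 14 20:45:02 2012 318705
Jan 14 21:15:01 2012 317690
"""


def parse_samples(text):
    """Turn 'timestamp free_mb' lines into (timestamp, free_mb) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stamp, value = line.rsplit(None, 1)  # last field is the free-space value
        rows.append((stamp, int(value)))
    return rows


def changes(rows):
    """Change in free MB at each sample relative to the previous one."""
    return [(stamp, cur - prev)
            for (_, prev), (stamp, cur) in zip(rows, rows[1:])]


def big_swings(rows, threshold_mb):
    """Samples whose change from the previous reading is at least threshold_mb."""
    return [(stamp, delta) for stamp, delta in changes(rows)
            if abs(delta) >= threshold_mb]


if __name__ == "__main__":
    for stamp, delta in big_swings(parse_samples(SAMPLES), 100000):
        print(f"{stamp}  {delta:+d} MB")
```

On this data the largest single change is the drop of 472754 MB between the 17:15 and 17:45 samples, which is far more than ordinary uploads would explain in half an hour.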
From IRC on the 19th, lerxst says the space was taken up in 313GB of snapshots, which were automatically created by the netapp to allow files to be restored to earlier states. Those snapshots have now been removed, and new ones disabled. Thanks!
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard