Closed Bug 836014 Opened 11 years ago Closed 11 years ago

save space on puppetagain /data

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Unassigned)


We have a bunch of files in the various repo mirrors that are duplicates of one another.

For example:
> dmitchell@releng-puppet1 /tmp $ find /data -name zd1211-firmware-1.4-5.fc15.noarch.rpm | xargs md5sum
> d69bbecaca5191b089f1f30efe5fd8ab  /data/repos/yum/mirrors/fedora/16/2012-03-07/releases/Everything/x86_64/os/Packages/zd1211-firmware-1.4-5.fc15.noarch.rpm
> d69bbecaca5191b089f1f30efe5fd8ab  /data/repos/yum/mirrors/fedora/16/2012-03-07/releases/Everything/i386/os/Packages/zd1211-firmware-1.4-5.fc15.noarch.rpm

There are 28,545 such files.  At a few MB apiece, deduplicating them could save us ~50G!
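A rough sketch of how duplicates like these could be listed by checksum. This is not necessarily how the count above was produced; the function name `list_dupes` is made up for illustration, and it assumes GNU coreutils and findutils (`md5sum`, `xargs -r`, `uniq -w`).

```shell
#!/usr/bin/env bash
# list_dupes DIR: print groups of regular files under DIR that share an MD5 checksum.
# Hypothetical helper for illustration; assumes GNU find, xargs, md5sum, sort and uniq.
list_dupes() {
    local dir="$1"
    # md5sum prints "<32 hex chars>  <path>", so sorting puts equal checksums next to
    # each other and uniq -w32 compares only the checksum column.
    find "$dir" -type f -print0 \
        | xargs -0 -r md5sum \
        | sort \
        | uniq -w32 --all-repeated=separate
}
```

Something like `list_dupes /data | grep -c .` would then count the files that belong to a duplicate group. A matching checksum is only a strong hint, so a byte-for-byte comparison is still a good idea before linking anything.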

http://premium.caribe.net/~adrian2/fdupes.html seems to be the app to use for this, but it's horrendously slow.  So I'd like to find a way to run it only after actually mirroring a repo, and then have rsync preserve the hard-links when syncing.  Rsync's -H flag does so.
fdupes seems ridiculously slow, and it only *finds* dupes - it doesn't link them (except in version 1.50, which isn't in RPM form).

'hardlink' does - http://pkgs.fedoraproject.org/cgit/hardlink.git/ - and it's in RPM form and already mirrored.
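For anyone unfamiliar with the idea, here is a minimal sketch of what hard-linking a pair of duplicates amounts to. It is an illustration only, not how the 'hardlink' tool is implemented; the function name `link_if_identical` is hypothetical, and the sketch ignores ownership, mode and timestamps, which a real tool may take into account.

```shell
#!/usr/bin/env bash
# link_if_identical KEEP DUP: replace DUP with a hard link to KEEP when the bytes match.
# Hypothetical helper for illustration; both paths must be on the same filesystem.
link_if_identical() {
    local keep="$1" dup="$2"
    # Already the same inode: nothing to do.
    [ "$keep" -ef "$dup" ] && return 0
    # cmp -s prints nothing and exits 0 only for byte-identical files.
    cmp -s "$keep" "$dup" || return 1
    # -f removes the existing DUP before creating the link in its place.
    ln -f "$keep" "$dup"
}
```

After a successful call both names point at one inode, so the data is stored once and `stat -c %h` reports a link count of 2.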

It ain't especially fast, either, but what can you do.  I'm running it on releng-puppet1.srv.releng.scl3.mozilla.com now, using 'time'.  I'll post the results tomorrow.
Summary: save space on puppetagain /data with fdupes → save space on puppetagain /data
Well, that didn't take long, but it didn't save much, either:

[root@releng-puppet1.srv.releng.scl3 data]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      342G  253G   72G  78% /
tmpfs                1004M     0 1004M   0% /dev/shm
/dev/sda1              97M   73M   19M  80% /boot
[root@releng-puppet1.srv.releng.scl3 data]# time hardlink -v /data/repos/yum


Directories 179
Objects 109395
IFREG 109210
Mmaps 6187
Comparisons 6186
Linked 6186
saved 4604907520

real    9m9.004s
user    0m1.342s
sys     0m7.526s
[root@releng-puppet1.srv.releng.scl3 data]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      342G  249G   77G  77% /
tmpfs                1004M     0 1004M   0% /dev/shm
/dev/sda1              97M   73M   19M  80% /boot

So that's roughly 4.6 GB saved (df shows used space going from 253G to 249G).  The Ubuntu repos are arranged by filename, so I don't expect a lot of savings there - but I'll run it to see.
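To put the hardlink output above into more familiar units, here is a quick arithmetic sketch using its 'saved' and 'Linked' figures. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
saved_bytes=4604907520   # "saved" line from the hardlink -v output
linked=6186              # "Linked" line from the hardlink -v output

# Convert a byte count to GiB (the power-of-1024 units df -h uses), two decimals.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

# Average size per linked file in KiB, using integer division.
avg_kib() {
    echo $(( $1 / $2 / 1024 ))
}

bytes_to_gib "$saved_bytes"        # prints 4.29
avg_kib "$saved_bytes" "$linked"   # prints 726
```

So the average linked file was well under 1 MB, and only 6186 files were linked under /data/repos/yum, which goes some way toward explaining why the result fell so far short of the ~50G estimate.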
The Ubuntu run saved another 45M in 10 minutes.  Not worth it.

I've adjusted the rsync invocations to use -H and updated the docs, but this doesn't save an appreciable amount of space, unfortunately.
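For reference, a sketch of the kind of rsync call this refers to. The actual invocations and paths aren't shown in this bug, so the function name and arguments here are placeholders. Note that -a on its own does not imply -H.

```shell
#!/usr/bin/env bash
# mirror_with_hardlinks SRC DEST: copy the contents of SRC into DEST, keeping
# files that are hard-linked together in SRC hard-linked in DEST as well.
# Hypothetical wrapper for illustration; the real sync commands are not shown here.
mirror_with_hardlinks() {
    local src="$1" dest="$2"
    # -a: archive mode (recursion, permissions, times, and so on)
    # -H: preserve hard links among the transferred files
    rsync -aH "${src%/}/" "${dest%/}/"
}
```

Without -H, each name of a hard-linked file would be transferred as a separate copy, so the destination would lose whatever space the source had saved.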
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations