We have a bunch of files in the various repo mirrors that are duplicates of one another. For example:

> dmitchell@releng-puppet1 /tmp $ find /data -name zd1211-firmware-1.4-5.fc15.noarch.rpm | xargs md5sum
> d69bbecaca5191b089f1f30efe5fd8ab /data/repos/yum/mirrors/fedora/16/2012-03-07/releases/Everything/x86_64/os/Packages/zd1211-firmware-1.4-5.fc15.noarch.rpm
> d69bbecaca5191b089f1f30efe5fd8ab /data/repos/yum/mirrors/fedora/16/2012-03-07/releases/Everything/i386/os/Packages/zd1211-firmware-1.4-5.fc15.noarch.rpm

There are 28,545 such files. At a few MB apiece, that can save us ~50G!

http://premium.caribe.net/~adrian2/fdupes.html seems to be the app to use for this, but it's horrendously slow. So I'd like to find a way to run it only after actually mirroring a repo, and then have rsync preserve the hard links when syncing. Rsync's -H flag does so.
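A rough way to sanity-check the ~50G estimate before committing to a tool is to group files by checksum and add up the bytes held by every copy after the first. The sketch below is only an illustration: `dup_savings` is a hypothetical helper name, it assumes GNU find and md5sum, a single filesystem, and paths without tabs or newlines, and it hashes every file, so it will not be fast on a tree this size.

```shell
# Sketch: estimate bytes reclaimable by hard-linking identical files.
# dup_savings is a hypothetical helper, not part of fdupes or hardlink.
dup_savings() {
    ds_tab=$(printf '\t')
    # List inode, size and path for every regular file under $1.
    find "$1" -type f -printf '%i\t%s\t%p\n' |
    # Keep one entry per inode, so files that are already linked count once.
    awk -F'\t' '!seen[$1]++' |
    # Replace each entry with "<md5> <size>".
    while IFS="$ds_tab" read -r ds_inode ds_size ds_path; do
        printf '%s\t%s\n' "$(md5sum < "$ds_path" | cut -d' ' -f1)" "$ds_size"
    done |
    # Every copy after the first one with a given checksum is reclaimable.
    awk -F'\t' 'count[$1]++ { saved += $2 } END { printf "%.0f\n", saved }'
}
```

Calling `dup_savings /data/repos/yum` would print a single number of bytes. Checksum equality is treated as content equality here, which is good enough for an estimate but is not how a linking tool should decide to merge files.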
fdupes seems ridiculously slow, and only *finds* dupes - it doesn't link them (except in version 1.50, which isn't in RPM form). 'hardlink' does - http://pkgs.fedoraproject.org/cgit/hardlink.git/ - and it's in RPM form and already mirrored. It ain't especially fast, either, but what can you do. I'm running it on releng-puppet1.srv.releng.scl3.mozilla.com now, under 'time'. I'll post the results tomorrow.
Well, that wasn't long, but it wasn't much savings, either:

[firstname.lastname@example.org data]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00  342G  253G   72G  78% /
tmpfs                           1004M     0 1004M   0% /dev/shm
/dev/sda1                         97M   73M   19M  80% /boot

[email@example.com data]# time hardlink -v /data/repos/yum
Directories 179
Objects 109395
IFREG 109210
Mmaps 6187
Comparisons 6186
Linked 6186
saved 4604907520

real 9m9.004s
user 0m1.342s
sys 0m7.526s

[firstname.lastname@example.org data]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00  342G  249G   77G  77% /
tmpfs                           1004M     0 1004M   0% /dev/shm
/dev/sda1                         97M   73M   19M  80% /boot

So roughly 5G savings (4,604,907,520 bytes per hardlink; df shows Used dropping from 253G to 249G). The Ubuntu repos are arranged by filename, so I don't expect a lot of savings there - but I'll run it to see.
The Ubuntu run saved another 45M in 10 minutes. Not worth it. I've adjusted the rsync invocations to use -H and updated the docs, but this doesn't save an appreciable amount of space, unfortunately.
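For reference, a minimal sketch of what an rsync call with -H looks like. The actual rsync command lines used for these mirrors are not shown in this bug, so `sync_mirror` and its two path arguments are hypothetical placeholders. Note that -a on its own does not imply -H.

```shell
# Sketch: copy a mirror tree while keeping hard links intact.
# sync_mirror is a hypothetical wrapper; source and destination are placeholders.
sync_mirror() {
    # -a: archive mode (recursive, keeps permissions, times, symlinks)
    # -H: recreate hard links between files that are in the transferred set
    rsync -aH "$1/" "$2/"
}
```

rsync only detects links between files that are part of the same transfer, so the whole tree that 'hardlink' was run on would need to be synced in one invocation for the links to carry over.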