Duff
Description
Duff is a Unix command-line utility for quickly finding duplicates in a given set of files. Duff is written in C and should compile on most modern Unices.
Disambiguation
This is not DUFF, the Windows program. This is duff, the Unix command-line utility.
If you are using Windows and wish to find duplicate files, use DUFF.
FAQ
1. Hey, that is O(n²), right?
Well, sort of. Lots of very intelligent people keep telling me it should be O(n log n), and present complex solutions that, when implemented, don't give any significant advantage in real-world use. Thus, my particular implementation is still O(n²).
I won't even pretend to call this a benchmark, but duff running cold on
a directory tree with over 2000 images (190MB) found the 13 duplicate
clusters in 4 seconds, and this on my ageing laptop.
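To make the complexity question concrete, here is a minimal Python sketch of clustering by pairwise comparison. It is not duff's C code, and the names `cluster_pairwise` and `same` are made up for illustration. It only shows where a quadratic bound comes from: when every item is distinct, item number i is compared against i existing clusters.

```python
def cluster_pairwise(items, same):
    """Group items into clusters using only a pairwise equality test."""
    clusters = []      # each cluster is a list of mutually equal items
    comparisons = 0    # number of calls made to same()
    for item in items:
        for cluster in clusters:
            comparisons += 1
            if same(item, cluster[0]):
                cluster.append(item)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([item])
    return clusters, comparisons


# Worst case: all 100 items are distinct, giving 0 + 1 + ... + 99 comparisons.
clusters, comparisons = cluster_pairwise(range(100), lambda a, b: a == b)
print(len(clusters), comparisons)  # prints: 100 4950
```

In practice the count is far lower when cheap tests (such as file size) rule out most pairs early, which fits the point made above about real-world use.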
2. How does it work, then?
The basic idea (as of version 0.3) is:
- Only compare files if they're of equal size.
- Compare a few bytes before checksumming large files.
- Compare checksums before actual contents.
- Don't compare actual contents unless explicitly asked.
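The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not duff's implementation: it groups files into dictionary buckets instead of comparing them pairwise, it applies the prefix test to every file and not only to large ones, and `PREFIX_BYTES`, `find_duplicates` and the `thorough` flag are hypothetical names chosen for this example.

```python
import hashlib
import os
from collections import defaultdict

PREFIX_BYTES = 16  # assumed size of "a few bytes"; not taken from duff


def file_size(path):
    return os.path.getsize(path)


def file_prefix(path):
    with open(path, "rb") as handle:
        return handle.read(PREFIX_BYTES)


def file_sha1(path):
    with open(path, "rb") as handle:
        return hashlib.sha1(handle.read()).hexdigest()


def file_contents(path):
    with open(path, "rb") as handle:
        return handle.read()


def find_duplicates(paths, thorough=False):
    """Return lists of paths whose files appear to be identical."""
    # Cheapest test first, so later stages only see the survivors.
    stages = [file_size, file_prefix, file_sha1]
    if thorough:
        # Only look at the actual contents when explicitly asked.
        stages.append(file_contents)
    groups = [list(paths)]
    for stage in stages:
        refined = []
        for group in groups:
            buckets = defaultdict(list)
            for path in group:
                buckets[stage(path)].append(path)
            # A file alone in its bucket cannot have a duplicate; drop it.
            refined.extend(b for b in buckets.values() if len(b) > 1)
        groups = refined
    return groups
```

Calling `find_duplicates(paths)` stops at the checksum stage, while `find_duplicates(paths, thorough=True)` also compares the full contents.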
3. How do you calculate the checksum?
It is a regular SHA1 message digest, calculated using sha.
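For readers who want to reproduce such a digest, here is a small sketch using Python's standard `hashlib` module, not the sha code that duff uses. The function name `sha1_file` and the chunk size are assumptions made for this example. The result is a 40-digit hexadecimal string, the same length as the checksums in the sample output further down.

```python
import hashlib


def sha1_file(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use constant, which matters for large files.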
4. What is it good for?
Getting a list of, and then usually removing or joining, duplicates in a given set of files. Note that duff itself never modifies any files, but it's designed to play nice with tools that do.
Some people find this ability useful. If you don't, feel free not to use duff.
5. Is duff named after Tom Duff?
As much as I like Duff's device, no, it isn't. Duff stands for DUplicate File Finder.
(No, it's not named after the beer, either.)
6. Shouldn't duff also do x?
I don't know, but you're welcome to write a patch to make it do x, or send me an email explaining why you think duff should do x. If I like your patch or email, a future version of duff will probably do x.
Screenshots
Err... duff is a command line utility,
but here are a few sample outputs.
1. Recursive search
Shows normal reporting of duplicate
clusters. Note that the cluster header can be customised or omitted, if
desired.
camilla@sharon~$ duff -r images
2 files in cluster 1 (43935 bytes, checksum ea1a856854c166ebfc95ff96735ae3d03dd551a2)
images/strips/Nemi/n102.png
images/strips/Nemi/n58.png
3 files in cluster 2 (32846 bytes, checksum 00c819053a711a2f216a94f2a11a202e5bc604aa)
images/strips/Nemi/n386.png
images/strips/Nemi/n491.png
images/strips/Nemi/n512.png
2 files in cluster 3 (26596 bytes, checksum b26a8fd15102adbb697cfc6d92ae57893afe1393)
images/strips/Nemi/n389.png
images/strips/Nemi/n465.png
2 files in cluster 4 (30332 bytes, checksum 11ff80677c85005a5ff3e12199c010bfe3dc2608)
images/strips/Nemi/n380.png
images/strips/Nemi/n451.png
2. Excess mode
Reports all but one file from each
cluster of duplicates, useful for example when piping to `xargs rm'.
camilla@sharon~$ duff -e -r images
images/strips/Nemi/n58.png
images/strips/Nemi/n491.png
images/strips/Nemi/n512.png
images/strips/Nemi/n465.png
images/strips/Nemi/n451.png
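The relationship between the two sample outputs can be sketched in a few lines of Python. This assumes that the file kept from each cluster is the first one listed, which matches the samples here but is not something the text above guarantees. The function name `excess` is hypothetical.

```python
def excess(clusters):
    """Return every file except the first from each cluster."""
    return [path for cluster in clusters for path in cluster[1:]]


# The four clusters from the recursive search sample.
clusters = [
    ["images/strips/Nemi/n102.png", "images/strips/Nemi/n58.png"],
    ["images/strips/Nemi/n386.png", "images/strips/Nemi/n491.png",
     "images/strips/Nemi/n512.png"],
    ["images/strips/Nemi/n389.png", "images/strips/Nemi/n465.png"],
    ["images/strips/Nemi/n380.png", "images/strips/Nemi/n451.png"],
]

for path in excess(clusters):
    print(path)
```

Run on those clusters, this prints the same five paths as the excess mode sample.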
License
Duff is licensed under the zlib/libpng license, which is similar to the new BSD license.
Download
Duff is currently in beta. The latest
release of duff is 0.4, which was released on January 13, 2006.
duff-0.4.tar.gz (source archive)
duff-0.4.tar.bz2 (source archive)
See the CVS repository if you are looking for a development snapshot. Note that this may be quite unstable.
Author
Duff is developed by Camilla Berglund, with bugfixes contributed by various people.
