Duff

Description

Duff is a Unix command-line utility for quickly finding duplicates in a given set of files. Duff is written in C and should compile on most modern Unices.

Disambiguation

This is not DUFF, the Windows program. This is duff, the Unix command-line utility.

If you are using Windows and wish to find duplicate files, use DUFF.

FAQ

1. Hey, that is O(n²) right?

Well, sort of. Lots of very intelligent people keep telling me it should be O(n log n), and present complex solutions that, when implemented, don't give any significant advantage in real-world use. Thus, my particular implementation is still O(n²).

I won't even pretend to call this a benchmark, but duff running cold on a directory tree with over 2000 images (190MB) found the 13 duplicate clusters in 4 seconds, and this on my ageing laptop.

2. How does it work, then?

The basic idea (as of version 0.3) is:

3. How do you calculate the checksum?

It is a regular SHA1 message digest, calculated using sha.

4. What is it good for?

Getting a list of, and then usually removing or joining, duplicates in a given set of files. Note that duff itself never modifies any files, but it's designed to play nice with tools that do.

Some people find this ability useful. If you don't, feel free not to use duff.

5. Is duff named after Tom Duff?

As much as I like Duff's device, no, it isn't. Duff stands for DUplicate File Finder.

(No, it's not named after the beer, either.)

6. Shouldn't duff also do x?

I don't know, but you're welcome to write a patch to make it do x, or send me an email explaining why you think duff should do x. If I like your patch or email, a future version of duff will probably do x.

Screenshots

Err... duff is a command line utility, but here are a few sample outputs.

1. Recursive search

Shows normal reporting of duplicate clusters. Note that the cluster header can be customised or omitted, if desired.

camilla@sharon~$ duff -r images
2 files in cluster 1 (43935 bytes, checksum ea1a856854c166ebfc95ff96735ae3d03dd551a2)
images/strips/Nemi/n102.png
images/strips/Nemi/n58.png
3 files in cluster 2 (32846 bytes, checksum 00c819053a711a2f216a94f2a11a202e5bc604aa)
images/strips/Nemi/n386.png
images/strips/Nemi/n491.png
images/strips/Nemi/n512.png
2 files in cluster 3 (26596 bytes, checksum b26a8fd15102adbb697cfc6d92ae57893afe1393)
images/strips/Nemi/n389.png
images/strips/Nemi/n465.png
2 files in cluster 4 (30332 bytes, checksum 11ff80677c85005a5ff3e12199c010bfe3dc2608)
images/strips/Nemi/n380.png
images/strips/Nemi/n451.png

2. Excess mode

Reports all but one file from each cluster of duplicates, which is useful, for example, when piping to `xargs rm'.

camilla@sharon~$ duff -e -r images
images/strips/Nemi/n58.png
images/strips/Nemi/n491.png
images/strips/Nemi/n512.png
images/strips/Nemi/n465.png
images/strips/Nemi/n451.png
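
The relationship between the two sample outputs can be sketched in Python. This is only an illustration, not duff's implementation: the function name `excess_files` is hypothetical, and it assumes the file kept from each cluster is the first one listed, as the outputs above suggest.

```python
def excess_files(clusters):
    """Return every file except the first from each cluster of duplicates."""
    excess = []
    for cluster in clusters:
        # Keep the first file of the cluster; report the rest as excess.
        excess.extend(cluster[1:])
    return excess


# Clusters taken from the recursive search sample above.
clusters = [
    ["images/strips/Nemi/n102.png", "images/strips/Nemi/n58.png"],
    ["images/strips/Nemi/n386.png", "images/strips/Nemi/n491.png",
     "images/strips/Nemi/n512.png"],
    ["images/strips/Nemi/n389.png", "images/strips/Nemi/n465.png"],
    ["images/strips/Nemi/n380.png", "images/strips/Nemi/n451.png"],
]

for path in excess_files(clusters):
    print(path)
```

Run on the four clusters from the recursive search sample, this prints the same five paths as the excess mode sample.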

License

Duff is licensed zlib/libpng, which is similar to the new BSD license.

Download

Duff is currently in beta. The latest release of duff is 0.4, which was released on January 13, 2006.

duff-0.4.tar.gz (source archive)
duff-0.4.tar.bz2 (source archive)

See the CVS repository if you are looking for a development snapshot. Note that this may be quite unstable.

Author

Duff is developed by Camilla Berglund, with bugfixes contributed by various people.

