Duff is a Unix command-line utility for quickly finding duplicates in a given set of files.
Duff is written in C, uses gettext where available, is licensed under the zlib/libpng license and should compile on most modern Unices.
This is not DUFF, the Windows program. This is duff, the Unix command-line utility.
If you are using Windows and wish to find duplicate files, use DUFF.
Duff is documented in its man page.
Duff is currently considered stable. The latest release of duff is 0.5.2, which was released on January 29, 2012.
duff-0.5.2.tar.gz (source archive)
duff-0.5.2.tar.bz2 (source archive)
See the Git repository if you are looking for a development snapshot. Note that this may be quite unstable.
No, optimally it is O(n log n) and as of 0.5.2, duff is O(n log n).
The basic idea (as of version 0.5.2) is:
Various versions of duff have been successfully built on the following systems:
The following distributions and package systems already carry duff packages:
There is also an Ubuntu PPA for duff.
Please let me know if your system should be listed.
It is a regular SHA message digest, calculated using (by default) SHA-1, or if you prefer, SHA-256, SHA-384 or SHA-512.
The actual calculations are done using sha by Allan Saddi.
Getting a list of duplicates in a given set of files, which can then be acted upon by you or programs other than duff. Note that duff itself never modifies any files, but it's designed to play nice with tools that do.
As much as I like Duff's device, no, it isn't. Duff stands for DUplicate File Finder.
(No, it's not named after the beer, either.)
I don't know, but you're welcome to write a patch to make it do x, or send me an email explaining why you think duff should do x. If I like your patch or email, a future version of duff will probably do x. Several people have already had success using this method.
Note, however, that this method may exhibit some lag.
Shows normal output, with a header before each cluster of duplicate files,
in this case using recursive search into the images
directory.
elmindreda@balthasar~$ duff -r comics 2 files in cluster 1 (43935 bytes, digest ea1a856854c166ebfc95ff96735ae3d03dd551a2) comics/Nemi/n102.png comics/Nemi/n58.png 3 files in cluster 2 (32846 bytes, digest 00c819053a711a2f216a94f2a11a202e5bc604aa) comics/Nemi/n386.png comics/Nemi/n491.png comics/Nemi/n512.png 2 files in cluster 3 (26596 bytes, digest b26a8fd15102adbb697cfc6d92ae57893afe1393) comics/Nemi/n389.png comics/Nemi/n465.png 2 files in cluster 4 (30332 bytes, digest 11ff80677c85005a5ff3e12199c010bfe3dc2608) comics/Nemi/n380.png comics/Nemi/n451.png
The header can be customized for example outputing only the number of files that follow:
elmindreda@balthasar~$ duff -r -f '%n' comics 2 comics/Nemi/n102.png comics/Nemi/n58.png 3 comics/Nemi/n386.png comics/Nemi/n491.png comics/Nemi/n512.png 2 comics/Nemi/n389.png comics/Nemi/n465.png 2 comics/Nemi/n380.png comics/Nemi/n451.png
This mode reports all but one file from each cluster of duplicates. This can
be used in combination with for example rm to remove duplicates,
but should only be done if you don't care which duplicates are
removed.
elmindreda@balthasar~$ duff -re comics comics/Nemi/n58.png comics/Nemi/n491.png comics/Nemi/n512.png comics/Nemi/n465.png comics/Nemi/n451.png
Duff can read file and directory names from stdin and so is able to function as a filter in pipelines.
To handle file names containing whitespace and other special characters, duff can use null-terminated input and output. This should be used whenever duff is used as a filter and the other programs in the pipeline are able to output and/or parse null-terminated text.
Here, echo is used for demonstration purposes, but it can of
course be replaced by another program like mv or rm.
elmindreda@balthasar~$ duff -re0 documents | xargs -n1 -0 echo documents/Thinking in C++ - Second Edition - Volume 2/code/g++295.mac documents/Master Class Assembly Language/CH12/COMMON/GRAPH.H documents/Master Class Assembly Language/CH18/DEMO/COMMON/GRAPH.H
File and directory names are read from stdin if none are specified
on the command line. Here, find is used for searching and duff
on the resulting list of file names is used as a filter on the resulting list
of file names. Also note the use of xargs and echo
to turn the null characters into newlines:
elmindreda@balthasar~$ find . -name '*.pdf' -print0 | duff -0 | xargs -0 -n1 echo 2 files in cluster 1 (1070259 bytes, digest a0a6806859ce209e7ecd3191b0d3343438801f93) documents/A Real-Time Cloud Modeling, Rendering and Animation System.pdf documents/A Real-Time Cloud Modeling, Rendering, and Animation System.pdf 2 files in cluster 2 (32213588 bytes, digest b308ec6165679f210516ea802263964ca8a6b73c) documents/Terminators and Iron Men: Image-based lighting and physical shading at ILM.pdf documents/sigg2010_physhadcourse_ILM.pdf
This also works for directory names:
elmindreda@balthasar~$ find projects -name .git -print0 | duff -r0 | xargs -0 -n1 echo 2 files in cluster 1 (4061 bytes, digest 652f2780792e17f37797a56d56865b51753c8ab6) projects/duff-qsort/.git/objects/4d/4a9519eaf88b18fb157dfe5fae59c1c5d005c7 projects/duff-jc/.git/objects/4d/4a9519eaf88b18fb157dfe5fae59c1c5d005c7 2 files in cluster 2 (4088 bytes, digest e607f80e94b5b72fb6f444c208985b49b25d1186) projects/duff-qsort/.git/objects/89/4e786e16c1d0d94dfc08d6b475270fe1418d6a projects/duff-jc/.git/objects/89/4e786e16c1d0d94dfc08d6b475270fe1418d6a 2 files in cluster 3 (4088 bytes, digest 6c4a7a408b98a4d678cba146d5a33d811a6be677) projects/duff-qsort/.git/objects/23/e5f25d0e5f85798dcfb368ecb2f04f59777f61 projects/duff-jc/.git/objects/23/e5f25d0e5f85798dcfb368ecb2f04f59777f61
Duff is licensed zlib/libpng, which is similar in spirit to the new BSD license.
No program is perfect for everyone and duff is no exception. Here are a few stable alternatives to duff:
Duff is developed by Camilla Berglund, with bugfixes, suggestions, patches and encouragement contributed by various people.