# Deduper

A utility to detect and handle duplicate files, particularly PDFs in my collection of research articles.

This little script finds duplicate files based on their contents and not their file names, so you can even eliminate multiple copies saved under different names. Files are first compared by size, and only files with exactly the same size are checked with a hash (which you can specify; the default is sha1) to make sure their contents match. Optionally, you can specify further hashes to make sure that the files are truly duplicates. If you're feeling lucky or have just a few massive files (say experimental data) that are unlikely to be the exact same size unless they are indeed the same file, then you can enable a size-only comparison for files larger than a certain size.
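
The size-then-hash idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in deduper.py: the names `file_digest` and `find_duplicates` and the `size_only` parameter (a stand-in for `--size-only`, in bytes) are assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths, algorithm="sha1", size_only=None):
    """Return lists of paths whose contents appear to be identical.

    size_only is a hypothetical stand-in for --size-only: files larger
    than this many bytes are matched on size alone.
    """
    # Pass 1: group by size, which needs only a stat() call per file.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicate_sets = []
    for size, candidates in by_size.items():
        if len(candidates) < 2:
            continue  # a unique size cannot have a duplicate
        if size_only is not None and size > size_only:
            duplicate_sets.append(sorted(candidates))
            continue
        # Pass 2: hash only the files that share a size.
        by_hash = defaultdict(list)
        for path in candidates:
            by_hash[file_digest(path, algorithm)].append(path)
        duplicate_sets.extend(
            sorted(group) for group in by_hash.values() if len(group) > 1
        )
    return duplicate_sets
```

Because most files have a unique size, the expensive hashing step usually runs on only a small fraction of the collection.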

The code portions I contributed are released under the GPLv2, but I initially started from some public domain code. If you look hard enough, you can probably still recognize bits and pieces (and some of the design is clearly descended from that early script), but I've modified the majority of that original code and expanded upon it. More information can be found in the copyright notice.

## Usage

The code here is developed and tested on Python 2.7 (see below). No promises of functionality are made on anything else. Installation is currently "download the script to somewhere convenient and run it", but installation via distribute and the Cheese Shop is in development. If you want to play it safe, keep to the release versions. That said, any development version I've pushed to the develop branch on Bitbucket generally works reasonably well, at least in my small tests; the other branches are often far from fully functional.

From the automatically generated help in the current development branch:

    usage: deduper.py [-h] [--size-only SIZE] [--use-hash USE_HASH]
                      [--extra-hashes EXTRA_HASHES [EXTRA_HASHES ...]]
                      [--dupe-cost] [-b BASE] [--max-size MAX_SIZE]
                      [--min-size MIN_SIZE] [-v] [-c] [-a]
                      [-e EXTENSION [EXTENSION ...]] [--invert]
                      path [path ...]

    A utility for finding and dealing with duplicate files

    positional arguments:
      path                  paths to search

    optional arguments:
      -h, --help            show this help message and exit
      --size-only SIZE      Only use size comparison on files larger than SIZE
      --use-hash USE_HASH   Cryptographic hash to use (must be in hashlib!)
      --extra-hashes EXTRA_HASHES [EXTRA_HASHES ...]
                            List of hashes to be carried out in further passes but
                            only upon an initial match.
      --dupe-cost           Calculate the cost of duplicated data in terms of
                            wasted space.
      -b BASE               Make file sizes human readable in base BASE
      --max-size MAX_SIZE   Ignore files larger than MAX_SIZE
      --min-size MIN_SIZE   Ignore files smaller than MIN_SIZE
      -v, --verbose         Display progress information on STDERR
      -c, --summary-only    Display only summary information, i.e. without a list
                            of duplicates. Can be used with --verbose to display
                            progress without listing files.
      -a, --prompt-for-action
                            Prompt for action by duplicate sets.
      -e EXTENSION [EXTENSION ...], --extension EXTENSION [EXTENSION ...]
                            Limit search to files of a certain extension.
      --invert              Invert selection of extensions, i.e. negative match.

    How much disk space can you save?


Valid suffixes for size arguments are KiB, MiB, GiB, TiB (base 2) or KB, MB, GB, TB (base 10). The B is required and the suffix is case sensitive. (The idea is to force you to be unambiguous about both your intended base and the bits/bytes distinction.) No suffix is taken to indicate a size in bytes.
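
To make the suffix rules concrete, here is one way such a size argument could be parsed. This is a sketch under assumptions, not the parser in deduper.py: the name `parse_size` is made up, and accepting decimal fractions and whitespace before the suffix is a choice made for the example.

```python
import re

# Exponent for each prefix; the base (1000 or 1024) depends on the "i".
_EXPONENTS = {"K": 1, "M": 2, "G": 3, "T": 4}
# Number, then optionally a prefix, an optional "i" and a mandatory "B".
_SIZE_RE = re.compile(r"^(\d+(?:\.\d+)?)\s*(?:([KMGT])(i?)B)?$")


def parse_size(text):
    """Convert a string such as '10', '5MB' or '2GiB' to a byte count."""
    match = _SIZE_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognised size: %r" % text)
    number, prefix, binary = match.groups()
    if prefix is None:
        return int(float(number))  # no suffix: plain bytes
    base = 1024 if binary else 1000
    return int(float(number) * base ** _EXPONENTS[prefix])
```

With this sketch, `parse_size("1KiB")` gives 1024 and `parse_size("1KB")` gives 1000, while `"1kb"` and `"1K"` are rejected with a `ValueError` because the match is case sensitive and the B is mandatory.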

Currently, I've tested the hash argument with md5 and sha1, but this should work with any hash in hashlib. You can use md5 as your first hash and then use sha1 as the extra hash to speed things up: in most cases, you'll only have to compute the md5 (very fast), but you're still protected from most hash collisions. (I can't prove this, but if md5 and sha1 are more or less independent, then the odds of both colliding by accident would truly be astronomical. Deliberately crafted collisions have been demonstrated for both md5 and sha1, though, so pick a stronger hashlib algorithm such as sha256 if you need to guard against attacks.)
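
The "cheap hash first, extra hash only on a match" idea might look like the sketch below. The function names are hypothetical, and the parameters `use_hash` and `extra_hashes` merely mirror the command-line options; this is not taken from deduper.py.

```python
import hashlib
from collections import defaultdict


def digest(path, algorithm, chunk_size=1 << 20):
    """Return the hex digest of a file using any hashlib algorithm name."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def refine(groups, algorithm):
    """Split each candidate group by one more hash, dropping singletons."""
    refined = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[digest(path, algorithm)].append(path)
        refined.extend(b for b in buckets.values() if len(b) > 1)
    return refined


def confirm_duplicates(candidates, use_hash="md5", extra_hashes=("sha1",)):
    """Apply use_hash first, then each extra hash to the surviving groups."""
    groups = [list(candidates)]
    for algorithm in (use_hash,) + tuple(extra_hashes):
        groups = refine(groups, algorithm)
    return groups
```

A file whose md5 matches nothing else is dropped after the first pass, so the sha1 is only ever computed for files that already look like duplicates.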

## Development

Development and testing are done on Python 2.7, and Python 3 support is planned (see below). Things may work on earlier versions of Python 2.x, but I will not support anything below Python 2.6, and even that is a relatively low priority.

### Branches

Current development work is focused on generally making it easier to use and useful for common tasks:

- config-file: Specify and save your preferences so that you don't have to keep typing all those options in at the command line. This should also make more complicated options possible, like specifying actions or even commands to execute for matches.
- distribute: Package everything up nice so that it can be installed via distribute (the successor to setuptools), possibly even via easy_install and the Cheese Shop. Part of the work here will be making sure that the code plays nice with 2to3 so that we can support both Python 2 and Python 3.
- media-Extensions: Add more advanced matching features for media files, i.e. matching even when the meta-data (e.g. ID3 tags for MP3s) doesn't match, or possibly matching files whose meta-data but not the content matches. That way you can find two copies of the same media and pick the one that has the quality and/or file size you're looking for.
- multiple-passes: Add more fine-tuned control over the matching process via multiple passes. The idea is that you can specify successive checks, each more costly than the last, so that you can quickly rule out obvious non-matches, yet still achieve a given level of certainty that two files really do match. This is already partly implemented with the --extra-hashes option, but the idea here is eventually to even make byte-for-byte comparisons available to the end user.
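
As an illustration of what the multiple-passes idea could look like, the sketch below chains checks of increasing cost and ends with a byte-for-byte comparison via the standard library's `filecmp.cmp(..., shallow=False)`. None of this is code from the branch: the pass functions, the 4096-byte head length and the name `multi_pass` are all assumptions for the example.

```python
import filecmp
import hashlib
import os


def size_key(path):
    """Cheapest pass: the file size."""
    return os.path.getsize(path)


def head_key(path, length=4096):
    """Hash only the first few KiB: cheap, and rules out most non-matches."""
    with open(path, "rb") as handle:
        return hashlib.sha1(handle.read(length)).hexdigest()


def split_by(groups, key):
    """Partition every group by key(path); keep only groups of two or more."""
    result = []
    for group in groups:
        buckets = {}
        for path in group:
            buckets.setdefault(key(path), []).append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def split_by_content(groups):
    """Final pass: byte-for-byte comparison, no hashing involved."""
    result = []
    for group in groups:
        clusters = []
        for path in group:
            for cluster in clusters:
                if filecmp.cmp(cluster[0], path, shallow=False):
                    cluster.append(path)
                    break
            else:
                clusters.append([path])  # matches no existing cluster
        result.extend(c for c in clusters if len(c) > 1)
    return result


def multi_pass(paths, passes=(size_key, head_key)):
    """Run the cheap passes in order, then confirm byte for byte."""
    groups = [list(paths)]
    for key in passes:
        groups = split_by(groups, key)
    return split_by_content(groups)
```

Each pass only sees the files that survived the previous one, so the expensive full comparison is reserved for the few candidates that the cheap checks could not tell apart.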

### General development

One day, when I feel like I have the command-line interface stable enough, I'll add some measure of graphical interface. In general development, I would also like to add more advanced options for dealing with matches -- perhaps instead of "keep some, delete the rest", add an option to link them all together. Hard link is the obvious option here, but perhaps some users could also benefit from soft links. We also want to keep things as portable as possible here.
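
For the hard-link idea, a minimal sketch could look like the following. It is not an existing deduper feature: the function name and the `.deduper-tmp` scratch suffix are made up, and it assumes a POSIX-like system where all files live on the same filesystem (hard links cannot cross filesystems, and `os.rename` does not overwrite existing files on Windows).

```python
import os


def link_duplicates(keep, duplicates):
    """Replace each path in duplicates with a hard link to keep.

    Returns the number of files that were replaced.
    """
    replaced = 0
    for duplicate in duplicates:
        if os.path.samefile(keep, duplicate):
            continue  # already the same file on disk
        temporary = duplicate + ".deduper-tmp"  # hypothetical scratch name
        os.link(keep, temporary)  # raises OSError across filesystems
        try:
            os.rename(temporary, duplicate)  # atomic replacement on POSIX
        except OSError:
            os.remove(temporary)  # do not leave the scratch link behind
            raise
        replaced += 1
    return replaced
```

Creating the link under a scratch name and renaming it over the duplicate means the duplicate's path never disappears, even if the process is interrupted halfway through. A soft-link variant would use `os.symlink` instead of `os.link`, at the cost of the links breaking if the kept file is later moved.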