Find duplicate files fast using a randomized hashing method

rdupfind is a checksum-based duplicate file finder that aims to be fast in every case. The program maintains a top-level hashmap keyed by file size. Each value in the map is a tree that initially contains a single filename.

For each new file encountered, if its size is not yet in the hashmap, the file is simply added under that size. If the slot is already occupied, the program gets or computes a random sequence of byte offsets and the checksums (say, SHA-1) of the corresponding file blocks for both the new file and the conflicting file. If the checksums match, the trial is repeated, up to a fixed number of times (configurable via the --ntrials option). If the checksums differ at any point, a new entry is made in the hashmap at the current level. All computed checksums are stored in the tree structure for possible reuse, and the random sequences are stored in the top-level hashmap, so that at most ntrials sequences are generated per size.

If the new file matches in every trial, the program gets or computes the full hash of both files and compares them as a final check. This step can be disabled with the --noverify option. Files that pass all the checks are output along with their matching counterparts.
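The procedure described above can be illustrated with a short Python sketch. This is not rdupfind's actual code: the function names, the 4096-byte block size, the flat list of groups per size (standing in for the tree), and the single offset per trial are all assumptions made to keep the example small.

```python
import hashlib
import os
import random

BLOCK_SIZE = 4096  # assumed block size; the description above does not give one


def block_checksum(path, offset, length=BLOCK_SIZE):
    """SHA-1 of up to `length` bytes of the file, starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha1(f.read(length)).hexdigest()


def full_checksum(path):
    """SHA-1 of the whole file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths, ntrials=4, verify=True, seed=0):
    """Return groups (lists) of paths whose contents appear identical."""
    rng = random.Random(seed)
    offsets = {}     # size -> random offsets generated so far (at most ntrials)
    block_cache = {} # (path, trial) -> block checksum, reused across comparisons
    full_cache = {}  # path -> full-file checksum
    groups = {}      # size -> list of groups of matching paths

    def trial_sum(path, size, trial):
        seq = offsets.setdefault(size, [])
        while len(seq) <= trial:          # offsets are generated lazily
            seq.append(rng.randrange(size))
        key = (path, trial)
        if key not in block_cache:
            block_cache[key] = block_checksum(path, seq[trial])
        return block_cache[key]

    def full_sum(path):
        if path not in full_cache:
            full_cache[path] = full_checksum(path)
        return full_cache[path]

    def same(a, b, size):
        if size > 0:                      # empty files have no blocks to sample
            for trial in range(ntrials):
                if trial_sum(a, size, trial) != trial_sum(b, size, trial):
                    return False          # mismatch: stop at the first failing trial
        return not verify or full_sum(a) == full_sum(b)

    for path in paths:
        size = os.path.getsize(path)
        candidates = groups.setdefault(size, [])
        for group in candidates:
            if same(group[0], path, size):
                group.append(path)
                break
        else:
            candidates.append([path])     # no match: new entry for this size
    return [g for by_size in groups.values() for g in by_size if len(g) > 1]
```

Passing verify=False in this sketch plays the role of --noverify: files are then grouped on the sampled blocks alone.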



rdupfind

Find duplicate files in the current directory.

rdupfind Dir1 Dir2 file1 ...

Find duplicates in Dir1, Dir2, file1, etc. Note that copies among the command-line arguments themselves are also reported: if file1 is a copy of a file in Dir1, it will be listed.

rdupfind --noverify Dir1 ...

The same as above, but the final full-hash verification is skipped, so the run is potentially much faster. However, there is a small probability that files which are not identical but agree on every sampled block are also reported as copies.

rdupfind -z+100m -z-1g

Find duplicates in the current directory, ignoring files that are smaller than 100MB or larger than 1GB. Note that there must not be any space between -z and -1g; otherwise -1g will be treated (and rejected) as a separate command-line option.
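To make the size filter concrete, here is a hypothetical Python sketch of how specifications such as +100m and -1g could be interpreted. The helper names are invented for illustration, and the set of suffixes and the use of binary multiples (1m = 1024 * 1024 bytes) are assumptions; rdupfind's actual parsing may differ.

```python
import re

# Assumption: suffixes are binary multiples; the tool may use decimal ones.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size_filter(spec):
    """Turn '+100m' into ('min', bytes) and '-1g' into ('max', bytes)."""
    m = re.fullmatch(r"([+-])(\d+)([kmg]?)", spec.lower())
    if m is None:
        raise ValueError(f"bad size filter: {spec!r}")
    sign, number, unit = m.groups()
    kind = "min" if sign == "+" else "max"
    return kind, int(number) * _UNITS[unit]


def size_allowed(size, specs):
    """True if a file of `size` bytes passes every filter in `specs`."""
    for kind, limit in map(parse_size_filter, specs):
        if kind == "min" and size < limit:
            return False   # smaller than the lower bound: ignored
        if kind == "max" and size > limit:
            return False   # larger than the upper bound: ignored
    return True
```

With these assumptions, size_allowed(n, ["+100m", "-1g"]) accepts exactly the files between the two bounds, mirroring the command shown above.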

rdupfind -s file1 Dir1 Dir2 file2 ...

Do a content-based search for file1 in Dir1, Dir2, file2, etc. Note that this is potentially much faster than running rdupfind file1 Dir1 Dir2 file2 ... and then looking for file1 in the output, because with -s, files are only ever compared with file1 and can therefore be culled at an early stage.
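The early culling in search mode can be sketched in Python as follows. This is an illustration under assumptions, not the tool's implementation: search_copies is a hypothetical name, the candidates are given as a flat list of file paths (directory traversal is omitted), and the block size and trial count are arbitrary.

```python
import hashlib
import os
import random


def search_copies(target, candidates, ntrials=4, block=4096, seed=0):
    """Return the candidate paths whose contents equal those of `target`."""
    size = os.path.getsize(target)
    rng = random.Random(seed)
    offsets = [rng.randrange(size) for _ in range(ntrials)] if size else []

    def block_sums(path):
        sums = []
        with open(path, "rb") as f:
            for off in offsets:
                f.seek(off)
                sums.append(hashlib.sha1(f.read(block)).digest())
        return sums

    def full_sum(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.digest()

    def sampled_match(path, expected):
        with open(path, "rb") as f:
            for off, want in zip(offsets, expected):
                f.seek(off)
                if hashlib.sha1(f.read(block)).digest() != want:
                    return False      # culled at the first differing block
        return True

    target_blocks = block_sums(target)
    target_full = None                # computed only if a candidate gets this far
    matches = []
    for path in candidates:
        if os.path.getsize(path) != size:
            continue                  # culled by size alone, contents never read
        if not sampled_match(path, target_blocks):
            continue
        if target_full is None:
            target_full = full_sum(target)
        if full_sum(path) == target_full:
            matches.append(path)
    return matches
```

Because every candidate is checked against a single target, most files are rejected on size or on the first sampled block, and no candidate is ever compared with another candidate.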

rdupfind -s file1 -s file2 -s Dir1 Dir2 Dir3 file3 ...

Same as above, but search for file1, file2, and every file in Dir1 (recursively) among the files of Dir2, Dir3, and file3.

rdupfind -s file1

Similar to the above, but search for duplicates of file1 in the current directory.