rdupfind

Find duplicate files fast using a randomized hashing method

rdupfind is a checksum-based duplicate file finder that aims to be fast in all cases. The program maintains a top-level hashmap in which files are keyed by their sizes. The values in the map are trees, each initially containing only a single filename.

For each new file encountered, if its size is not already in the hashmap, the file is simply added to the hashmap under that size. If, on the other hand, the slot is already occupied, the program gets or computes a random sequence of byte offsets and the checksums (say, SHA-1) of the corresponding file blocks for both the new file and the conflicting files. If the checksums match, the trial is repeated, up to a fixed number of times (which can be changed with the --ntrials option). If at any point the checksums do not match, a new entry is made in the hashmap at the current level.

All computed checksums are stored in the tree structure for possible future reuse. The random sequences are stored in the top-level hashmap, so that at most ntrials sequences are generated per size.

If the new file matches in all trials, the program gets or computes the full hash of both files involved and compares them as a final check. This check can be disabled with the --noverify option. Files that match in all checks are output, along with their matching counterparts.
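The idea above can be sketched in a few lines of Python. This is only an illustration, not rdupfind's actual code: it processes all files in one batch instead of inserting them one by one into a tree, and the names find_duplicates, block_checksum, full_checksum, and the 4096-byte block size are assumptions made for the example.

```python
import hashlib
import os
import random
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block length; the README does not state one


def block_checksum(path, offset, length=BLOCK_SIZE):
    """SHA-1 of one block of `path` starting at byte `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha1(f.read(length)).hexdigest()


def full_checksum(path):
    """SHA-1 of the whole file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths, ntrials=4, verify=True, seed=None):
    """Return lists of paths whose contents appear identical (hypothetical helper)."""
    rng = random.Random(seed)
    by_size = defaultdict(list)  # top-level map: file size -> candidate paths
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for size, candidates in by_size.items():
        if size == 0 or len(candidates) < 2:
            continue  # skip empty files; a unique size cannot have a copy
        # At most ntrials random offsets are drawn per size and shared by all files.
        offsets = [rng.randrange(size) for _ in range(ntrials)]
        buckets = [candidates]
        for offset in offsets:
            next_buckets = []
            for bucket in buckets:
                split = defaultdict(list)
                for path in bucket:
                    split[block_checksum(path, offset)].append(path)
                # Files left alone after a trial are culled early.
                next_buckets.extend(b for b in split.values() if len(b) > 1)
            buckets = next_buckets
        for bucket in buckets:
            if not verify:  # corresponds to --noverify
                groups.append(bucket)
                continue
            by_hash = defaultdict(list)
            for path in bucket:
                by_hash[full_checksum(path)].append(path)
            groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Most files that differ are separated after reading only a few blocks, so the full hash is computed only for files that survive every trial.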

USAGE EXAMPLES:

rdupfind

Find duplicate files in the current directory.

rdupfind Dir1 Dir2 file1 ...

Find duplicates in Dir1, Dir2, file1, etc. Note that copies are reported across all command-line arguments: if file1 is a copy of a file in Dir1, it will be reported.

rdupfind --noverify Dir1 ...

The same as above, but the final full-hash verification is skipped, which makes the run potentially much faster. However, there is a small probability that files that are nearly, but not exactly, identical are also reported as copies.

rdupfind -z+100m -z-1g

Find duplicates in the current directory while ignoring files smaller than 100MB or larger than 1GB. Note that there must not be any space between -z and -1g, or else -1g will be treated (and rejected) as a command-line option.
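As a rough illustration of how such size bounds could be interpreted, here is a small Python sketch. It is not taken from rdupfind: the helper names parse_size_bound and size_allowed are hypothetical, and treating k, m, and g as multiples of 1024 is an assumption, since the README does not say whether the units are decimal or binary.

```python
import re

# Assumed binary multiples; rdupfind itself may use decimal units instead.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size_bound(spec):
    """Turn a spec such as '+100m' or '-1g' into ('min' or 'max', bytes)."""
    match = re.fullmatch(r"([+-])(\d+)([kmg]?)", spec.lower())
    if match is None:
        raise ValueError("bad size spec: %r" % spec)
    sign, number, unit = match.groups()
    limit = int(number) * _UNITS[unit]
    return ("min" if sign == "+" else "max", limit)


def size_allowed(size, specs):
    """True if `size` (in bytes) satisfies every bound in `specs`."""
    for spec in specs:
        kind, limit = parse_size_bound(spec)
        if kind == "min" and size < limit:
            return False  # smaller than the lower bound
        if kind == "max" and size > limit:
            return False  # larger than the upper bound
    return True
```

With these helpers, size_allowed(size, ["+100m", "-1g"]) mirrors the command above: a file is considered only if it is neither below the lower bound nor above the upper bound.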

rdupfind -s file1 Dir1 Dir2 file2 ...

Do a content-based search for file1 in Dir1, Dir2, file2, etc. Note that this is potentially much faster than running rdupfind file1 Dir1 Dir2 file2 ... and then looking for file1 in the output, because with -s, files are only ever compared with file1 and can therefore be culled at an early stage.

rdupfind -s file1 -s file2 -s Dir1 Dir2 Dir3 file3 ...

Same as above, but search for file1, file2, and all files in Dir1 (recursively) among the files of Dir2, Dir3, and file3.

rdupfind -s file1

Similar to above but search for duplicates of file1 in the current directory.
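The reason the -s mode can cull files early is easiest to see in code. The following Python sketch is an assumption-laden illustration, not rdupfind's implementation: search_copies and its helpers are hypothetical names, the block size of 4096 bytes is invented for the example, and directory traversal is left out.

```python
import hashlib
import os
import random


def _sha1_block(path, offset, length=4096):
    """SHA-1 of one block of `path` (block length is an assumption)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha1(f.read(length)).hexdigest()


def _sha1_file(path):
    """SHA-1 of the whole file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def search_copies(target, candidates, ntrials=4, seed=None):
    """Return the candidate paths whose contents match `target`."""
    size = os.path.getsize(target)
    if size == 0:
        return []  # assumption: empty files are skipped
    rng = random.Random(seed)
    offsets = [rng.randrange(size) for _ in range(ntrials)]
    target_sums = [_sha1_block(target, o) for o in offsets]
    target_full = None  # computed lazily, only if some candidate gets this far
    matches = []
    for path in candidates:
        if os.path.samefile(path, target):
            continue  # do not report the target as its own copy
        if os.path.getsize(path) != size:
            continue  # culled by size alone, without reading the file
        if any(_sha1_block(path, o) != s for o, s in zip(offsets, target_sums)):
            continue  # culled by a random block checksum
        if target_full is None:
            target_full = _sha1_file(target)
        if _sha1_file(path) == target_full:
            matches.append(path)
    return matches
```

Each candidate is compared only against the target, so a file with a different size or a single differing block is dropped without ever being compared to the other candidates.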

Recent activity

Commits by Jyothis V pushed to jyothisv/rdupfind:

5018b10 - Added the -z (--size) option which can be used to restrict sizes of the files considered.

3000204 - Fixed a nasty bug which caused noverify to change in between iterations. Skipping over empty files is now more modular.

ff3810a - Added the option -s which searches for the argument file in the other argument files and/or directories. Escape characters in printf argument string works as ...

7d2ab06 - Now for each size only at most 'ntrials' random sequences are generated in total. All the sequences are stored in the top level alone.