Commits

Phillip Alday committed d28bb00 Merge

Merged in latest changes in the development branch

Files changed (3)

 219c430e3c495ce301db4e25994969c55e8a7331 Deduplicator v-1.0
 2ab4adeeadf4acb7329c3d4764ec32725e1908fb v0.1
+372de071276175a3325bf62882e8dda181c970ec v0.2
+                    GNU GENERAL PUBLIC LICENSE
+                       Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users.  This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it.  (Some other Free Software Foundation software is covered by
+the GNU Lesser General Public License instead.)  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have.  You must make sure that they, too, receive or can get the
+source code.  And you must show them these terms so they know their
+rights.
+
+  We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+  Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software.  If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+  Finally, any free program is threatened constantly by software
+patents.  We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary.  To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+                    GNU GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License.  The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language.  (Hereinafter, translation is included without limitation in
+the term "modification".)  Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+  1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+  2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) You must cause the modified files to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    b) You must cause any work that you distribute or publish, that in
+    whole or in part contains or is derived from the Program or any
+    part thereof, to be licensed as a whole at no charge to all third
+    parties under the terms of this License.
+
+    c) If the modified program normally reads commands interactively
+    when run, you must cause it, when started running for such
+    interactive use in the most ordinary way, to print or display an
+    announcement including an appropriate copyright notice and a
+    notice that there is no warranty (or else, saying that you provide
+    a warranty) and that users may redistribute the program under
+    these conditions, and telling the user how to view a copy of this
+    License.  (Exception: if the Program itself is interactive but
+    does not normally print such an announcement, your work based on
+    the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+    a) Accompany it with the complete corresponding machine-readable
+    source code, which must be distributed under the terms of Sections
+    1 and 2 above on a medium customarily used for software interchange; or,
+
+    b) Accompany it with a written offer, valid for at least three
+    years, to give any third party, for a charge no more than your
+    cost of physically performing source distribution, a complete
+    machine-readable copy of the corresponding source code, to be
+    distributed under the terms of Sections 1 and 2 above on a medium
+    customarily used for software interchange; or,
+
+    c) Accompany it with the information you received as to the offer
+    to distribute corresponding source code.  (This alternative is
+    allowed only for noncommercial distribution and only if you
+    received the program in object code or executable form with such
+    an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it.  For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable.  However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License.  Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+  5. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Program or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+  6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+  7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded.  In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+  9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time.  Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation.  If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+  10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission.  For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this.  Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+                            NO WARRANTY
+
+  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+                     END OF TERMS AND CONDITIONS
+
+            How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, write to the Free Software Foundation, Inc.,
+    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) year name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License.  Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs.  If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library.  If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
 #! /usr/bin/env python
+# -*- coding: UTF-8 -*-
+#
+# Copyright (C) 2012 Phillip Alday <phillip.alday@staff.uni-marburg.de>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+#
+# This program incorporates work by David Mertz and Martin Blais,
+# previously released as CC0 (public domain).
+# The original CC0 code can be accessed as find-duplicate-contents.py
+# in earlier revisions and was "released" as version "-1.0"
+
+""" deduper.py: find (and remove) duplicate files.
+
+    Given a root directory, recurse in it and find all the duplicate files:
+    files that have the same contents, but not necessarily the same filename.
 """
-    Given a root directory, recurse in it and find all the duplicate files, files
-    that have the same contents, but not necessarily the same filename.
-"""
-# based on work by David Mertz and Martin Blais, previously released as CC0 (public domain)
-# the original CC0 code can be accessed as find-duplicate-contents.py in earlier revisions
-# and was "released" as version "-1.0"
-
-# all subsequent modifications are licensed under the GPLv2.
+from __future__ import print_function
+from __future__ import division
 
 import sys
 import os
 import sqlite3
 import argparse
+import hashlib
 
-from sys import stderr, stdout
+from sys import stderr
 from itertools import groupby
 from math import log10
 
+# file size units for each supported base, keyed by the unit suffix,
+# i.e. decimal (KB, MB, ...) vs. binary (KiB, MiB, ...)
+SIZE_BASES = {
+    10:{
+        'TB' : 10**12,
+        'GB' : 10**9,
+        'MB' : 10**6,
+        'KB' : 10**3,
+    },
+    2:{
+        'TiB': 2**40,
+        'GiB': 2**30,
+        'MiB': 2**20,
+        'KiB': 2**10,
+    }
+}
+
 def main(argv=None):
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--size-only', type=int, default=sys.maxint,
-                        help="Only use hashing (no secondary checks) on files larger than SIZE_ONLY")
-    parser.add_argument('--use-hash',type=str,default="sha1",
+    parser = argparse.ArgumentParser(
+                        description="A utility for finding and dealing with duplicate files",
+                        epilog = "How much disk space can you save?")
+    parser.add_argument('--size-only', metavar="SIZE", type=str,
+                        default=str(sys.maxint),
+                        help="Only use size comparison on files "
+                        "larger than SIZE")
+    parser.add_argument('--use-hash', type=str, default="sha1",
                         help="Cryptographic hash to use (must be in hashlib!)")
-    parser.add_argument('--max-size', type=int, default=sys.maxint,
+    parser.add_argument('--extra-hashes', type=str, default="", nargs="+",
+                        help="List of hashes to be carried out in further passes "
+                        "but only upon an initial match.")
+    parser.add_argument('--dupe-cost', action="store_true", default=False,
+                        help="Calculate the cost of duplication, i.e. the "
+                        "potentially duplicated disk space.")
+    parser.add_argument('-b','--human-readable',metavar="BASE",
+                        type=int, default=0, choices=SIZE_BASES.keys(),
+                        help="Make file sizes human readable in base BASE")
+    parser.add_argument('--max-size', type=str, default=str(sys.maxint),
                         help="Ignore files larger than MAX_SIZE")
-    parser.add_argument('--min-size', type=int, default=0,
+    parser.add_argument('--min-size', type=str, default="0",
                         help="Ignore files smaller than MIN_SIZE")
     parser.add_argument('-v', '--verbose', action="store_true", default=False,
                         help="Display progress information on STDERR")
-    parser.add_argument('-c', '--summary-only', action="store_true", default=False,
-                        help="Display only summary information, i.e. without a list of duplicates."
-                             "Can be used with --verbose to display progress without listing files.")
-    parser.add_argument('-a','--prompt-for-action', action="store_true",default=False,
+    parser.add_argument('-c', '--summary-only', action="store_true",
+                        default=False,
+                        help="Display only summary information, i.e. without a "
+                        "list of duplicates. Can be used with --verbose to "
+                        "display progress without listing files.")
+    parser.add_argument('-a', '--prompt-for-action', action="store_true",
+                        default=False,
                         help="Prompt for action by duplicate sets.")
-    parser.add_argument('path', type=str,nargs='+',
+    parser.add_argument('path', type=str, nargs='+',
                         help="paths to search")
-    parser.add_argument('-e','--extension', type=str,default=None,
+    parser.add_argument('-e','--extension', type=str, default=None, nargs='+',
                         help="Limit search to files of a certain extension.")
+    parser.add_argument('--invert', action="store_true",default=False,
+                        help="Invert selection of extensions, i.e. negative match.")
 
-    args = parser.parse_args()
+    args = parser.parse_args(argv)
+    args.final_byte_check = False
+    args.size_only, args.max_size, args.min_size = map(size_to_int, [args.size_only, args.max_size, args.min_size])
     find_duplicates(args.path, args)
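Note on the size arguments: --size-only, --max-size and --min-size now arrive as strings and are passed through size_to_int, whose body is not part of the lines shown here. The following is only a sketch of how such a conversion, and the pretty_size counterpart used for output, could work with the suffixes from SIZE_BASES. The accepted input syntax ("10MB", "1.5GiB") and the output format are assumptions, not the committed implementation.

```python
# Hypothetical helpers: the commit calls size_to_int() and pretty_size(),
# but their bodies are not shown, so everything below is an assumption.
SIZE_BASES = {
    10: {'TB': 10**12, 'GB': 10**9, 'MB': 10**6, 'KB': 10**3},
    2: {'TiB': 2**40, 'GiB': 2**30, 'MiB': 2**20, 'KiB': 2**10},
}

def size_to_int(text):
    """Convert '512', '10MB' or '1.5GiB' to a number of bytes."""
    text = text.strip()
    for units in SIZE_BASES.values():
        for suffix, factor in units.items():
            if text.endswith(suffix):
                return int(float(text[:-len(suffix)]) * factor)
    # no known suffix: treat the value as a plain byte count
    return int(text)

def pretty_size(size, base=0):
    """Format a byte count with the largest fitting unit of the given base.

    A base that is not in SIZE_BASES (such as the default 0) returns the
    raw number as a string.
    """
    if base not in SIZE_BASES:
        return str(size)
    units = sorted(SIZE_BASES[base].items(), key=lambda kv: kv[1], reverse=True)
    for suffix, factor in units:
        if size >= factor:
            return "{0:.1f} {1}".format(size / float(factor), suffix)
    return "{0} B".format(size)
```

With this sketch, `--min-size 1KiB --max-size 10MB` would restrict the search to files between 1024 and 10000000 bytes.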
 
-def find_files(args, ext=None):
+def find_files(args, ext=None, invert=False):
+    """Find all files in the search path optionally matching the extension.
+
+    Keyword arguments:
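+    ext -- list of filename extensions, including the leading dot
+           (default None = all extensions)
+    invert -- if True, select files whose extension is NOT in ext
+              (default False)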
+    """
     for ddir in args:
         if os.path.isdir(ddir):
             for root, dirs, fnames in os.walk(ddir):
                 for f in fnames:
-                    if ext is None or os.path.splitext(f)[1] == ext:
-                        yield os.path.join(root, f)
+                    if ext is None or ((os.path.splitext(f)[1] in ext) != invert):
+                        yield os.path.join(root, f)
         else:
-            if ext is None or os.path.splitext(f)[1] == ext:
-                yield ddir
+            if ext is None or ((os.path.splitext(ddir)[1] in ext) != invert):
+                yield ddir
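Note on the extension filter: the condition `(extension in ext) != invert` acts as an exclusive or, so a file is kept when it matches and invert is off, or when it does not match and invert is on. A minimal sketch of that test in isolation, using a hypothetical helper name keep_file that does not exist in the commit:

```python
import os

def keep_file(fname, ext=None, invert=False):
    """Return True if fname passes the extension filter.

    ext is a list of extensions as returned by os.path.splitext, i.e.
    including the leading dot ('.txt', not 'txt'). None disables the filter.
    """
    if ext is None:
        return True
    matches = os.path.splitext(fname)[1] in ext
    # '!=' on two booleans is an exclusive or
    return matches != invert
```

Because os.path.splitext returns the extension with its leading dot, the values given to -e presumably need to include the dot as well.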
 
 def group_pairs(pairs):
-    """This function is passed an interable each of whose values is a pair;
+    """ Group key-value pairs based on identical first item (key).
+
+       This function is passed an iterable each of whose values is a pair;
        it yields a sequence of pairs whose first element is the identical first
        element from the original pairs, and whose second element is a list of
        second elements corresponding to the same first element. Only adjacent
         yield (idx, [v[1] for v in vals])
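Note on group_pairs: most of its body falls outside the lines shown, so the following is a reconstruction from the docstring, the groupby import and the visible yield line rather than the committed code. It illustrates why the input must already be sorted: only adjacent pairs with the same key end up in one group.

```python
from itertools import groupby
from operator import itemgetter

def group_pairs(pairs):
    """Group adjacent (key, value) pairs that share the same key."""
    for idx, vals in groupby(pairs, key=itemgetter(0)):
        yield (idx, [v[1] for v in vals])

# The key 3 appears in two separate runs, so it yields two groups.
pairs = [(3, 'a'), (3, 'b'), (1, 'c'), (3, 'd')]
grouped = list(group_pairs(pairs))
```

This is why find_duplicates relies on the `order by size` clause and on `hashes.sort()` before grouping.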
 
 def find_duplicates(dirs, opts):
-    "Find the duplicate files in the given root directory."
+    """Find the duplicate files in the given root directory(ies).
 
+        Arguments:
+
+        dirs -- an iterable of strings containing directories to search
+        opts -- a namespace type object (e.g. from argparse) containing
+                the following arguments:
+                -- summary_only      -- do not display list of duplicate files,
+                                        only display final statistics
+                -- use_hash          -- specify which hash to use,
+                                        hash must be in hashlib
+                -- verbose           -- display progress information during
+                                        computation of file sizes and
+                                        initial hashing
+                -- prompt_for_action -- prompt for action on each set of
+                                        duplicates; logically mutually
+                                        exclusive with summary_only
+                -- extension         -- file name extensions to restrict search
+                -- invert            -- negative matching for file name extension
+                -- max_size          -- ignore files larger than this size
+                -- min_size          -- ignore files smaller than this size
+                -- size_only         -- comparisons only on size for files
+                                        larger than this (i.e. no hashes)
+                -- dupe_cost         -- calculate the cost of duplication
+                -- human_readable    -- the base to use for pretty printing
+                                        the size; 0 for no pretty printing
+    """
+
+    # the selected hash in string and function form
     hashname = opts.use_hash.upper()
     hashfnc = getattr(hashlib, opts.use_hash)
+    extra_hashfncs = [getattr(hashlib, h) for h in opts.extra_hashes]
+    extra_hashnames = opts.extra_hashes
 
     # Organize all filenames according to size.
-    count = 0
+    count = 0           # number of files examined
+
+    # initialize sqlite database
     if os.path.exists('sz.db'):
         os.remove('sz.db')
     conn = sqlite3.connect('sz.db')
     c = conn.cursor()
     c.execute('create table files_by_size (size int, fname text)')
+
     if opts.verbose:
-        print>>stderr, "Checking sizes (each '.' is 100k files):"
-    for fn in find_files(dirs, opts.extension):
+        print("Checking sizes (each '.' is 100k files):", file=stderr)
+
+    # traverse the directory tree, count the files and get their size
+    # this gets sizes for all files, even files in the ignore range
+    for fn in find_files(dirs, opts.extension, opts.invert):
         if not os.path.isfile(fn):
             continue
         count += 1
                   (sz, unicode(fn, 'utf-8')))
         if opts.verbose and count % 100000 == 0:
             stderr.write('.')
+
     conn.commit()
+
     if opts.verbose:
-        print>>stderr, "\nFound sizes on %d files..." % count
+        print("\nFound sizes on {0} files...".format(count), file=stderr)
 
+    # retrieve the files sorted by size, for min_size <= size <= max_size
     c.execute('''select size, fname from files_by_size
                  where size<=? and size>=?
                  order by size desc''', (opts.max_size, opts.min_size))
 
     if opts.verbose:
-        print>>stderr, "Grouping files by {0} (each '.' is 5k groups):".format(hashname)
+        print("Grouping files by {0} (each '.' is 5k groups):".format(hashname),
+              file=stderr)
 
-    distincts = 0
-    null_header = False
-    empties = 0
+    distincts = 0           # number of distinct sets of duplicates
+    dupe_cost = 0
+    base = opts.human_readable
+    null_header = False     # has the label for empty files been printed?
+    empties = 0             # number of empty files
 
     # the call to sqlite should have already sorted this list as required
     for sz, fnames in group_pairs(c):
         if sz == 0:
             if not null_header:
                 if not opts.summary_only:
-                    print "Size: 0 : Content: ''"
+                    print("Size: 0 : Content: ''")
                 null_header = True
             for f in fnames:
                 if not opts.summary_only:
-                    print ' ', f
+                    print(' ', f)
                 empties += 1
         else:
             # We have accumulated some dups that need to be printed
             if len(fnames) > 1:
                 hashes = []
+                # compute hashes only for files smaller than size_only
+                # otherwise go ahead and print sets of size_only matches
+
                 if sz <= opts.size_only:
                     for f in fnames:
-                        fh = open(f)
-                        content = fh.read()
-                        fh.close()
-                        hashes.append((hashfnc(content).hexdigest(), f))
+                        # Some temporary / sqlite-journalling files get caught
+                        # by the traversal but then disappear; this skips any
+                        # such missing files. Other calls to open don't skip:
+                        # they assume that a file which made it past this
+                        # check still exists, so changes to the filesystem
+                        # while it is being traversed will cause problems!
+                        try:
+                            with open(f, 'rb') as fh:
+                                content = fh.read()
+                            hashes.append((hashfnc(content).hexdigest(), f))
+                        except IOError as ioe:
+                            print(ioe)
+                            print("Skipping {}".format(f))
                     hashes.sort()
                 else:
-                    print 'Size:', sz, ': Size:', sz
-                    for f in fnames:
-                        print ' ', f
-                    print '--'
+                    if not opts.summary_only:
+                        print('Size: {size} : Size: {size}'.format(size=pretty_size(sz,base=base)))
+                        for f in fnames:
+                            print(' ', f)
+                        if opts.dupe_cost:
+                            print("Potentially duplicated space: {}".format(pretty_size(sz * (len(fnames)-1), base=base)))
+                        print('--')
                     distincts += 1
+                    dupe_cost += sz * (len(fnames)-1)
+
 
                 for idx, vals in group_pairs(hashes):
+                    # if there is more than one value per hash-set, then
+                    # that is another distinct set of duplicates
                     if len(vals) > 1:
-                        distincts +=1
+                        extra_match, hash_msgs =  additional_tests(vals,
+                                                    extra_hashfncs,
+                                                    extra_hashnames,
+                                                    opts.final_byte_check)
+                        distincts += 1
+                        dupe_cost += sz * (len(vals)-1)
                         if not opts.summary_only:
+                            # if the size of the file is non-trivial,
+                            # then print the hash, else just print the contents
                             if sz > 40:
-                                print 'Size: ', sz, ': {0}:'.format(hashname), idx
+                                print(u'Size: {size}: {hname}:{hmsg} '
+                                        '{extra}'.format(size=pretty_size(sz,base=base),
+                                        hname=hashname, hmsg=idx,
+                                        extra=hash_msgs))
                             else:
-                                fh = open(vals[0])
-                                content = fh.read()
-                                fh.close()
-                                print 'Size: ', sz, ': Content:', repr(content)
+                                with open(vals[0]) as fh:
+                                    content = fh.read()
+
+                                print('Size: {size}: Content: {con}'.format(size=pretty_size(sz,base=base),con=repr(content)))
+                            # for now, we don't print the cost of duplication
+                            # when taking action -- this seems like it would be
+                            # a complicated and dynamic affair
+                            if opts.dupe_cost:
+                                print("Duplicated space: {}".format(pretty_size(sz * (len(vals)-1), base=base)))
+
                             if opts.prompt_for_action:
                                 action_on_file_list(vals)
                             else:
                                 for fn2 in vals:
-                                    print ' ', fn2
+                                    print(' ', fn2)
+
+                            # print a dot for every 5000 sets of duplicates
                             if opts.verbose and distincts % 5000 == 0:
                                 stderr.write('.')
 
+
     if opts.verbose or opts.summary_only:
-        print>>stderr, "\nFound %d empty files"% empties
-        print>>stderr, "Found %d non-empty duplicate sets" % distincts
+        print("\nFound {0} empty files".format(empties), file=stderr)
+        print("Found {0} non-empty duplicate sets".format(distincts),
+                file=stderr)
+    if opts.dupe_cost:
+        print("\nSpace cost of duplicates: {}".format(pretty_size(dupe_cost,base=base)), file=stderr)
+
+def additional_tests(fnames, hashfncs, hashnames, ncheck_bytes):
+    if len(hashfncs) == 0:
+        hash_match = True
+        hash_msgs = ""
+    else:
+        # placeholder code!
+        hashes = dict()
+        hash_match = True
+        hash_msgs = ""
+
+        for f in fnames:
+            with open(f) as fh:
+                content = fh.read()
+            for (fnc,name) in zip(hashfncs,hashnames):
+                if name in hashes:
+                    hashes[name].append((fnc(content).hexdigest(),f))
+                else:
+                    hashes[name] = [(fnc(content).hexdigest(),f)]
+
+        for h in hashnames:
+            if len(list(group_pairs(hashes[h]))) > 1:
+                hash_msgs += "\n\t*****{} does not match*****".format(h)
+                hash_match = False
+
+        if hash_msgs != "":
+            hash_msgs = u"Extra hashes: {}".format(hash_msgs)
+
+    if ncheck_bytes == -1:
+        byte_match = True
+    else:
+        # placeholder code!
+        byte_match = True
+
+    return hash_match and byte_match, hash_msgs
 
 def hyphen_range(s):
     """ yield each integer from a complex range string like "1-9,12, 15-20,23"
             for i in xrange(start, end+1):
                 yield i
         else: # more than one hyphen
-            raise ValueError('format error in %s' % x)
+            raise ValueError('format error in {0}'.format(x))
 
-def delete_file(fname):
+def remove_file(fname):
+    """ Remove a file.
+
+        Currently, this just invokes os.remove() to delete the file.
+        Future releases will support alternative actions, e.g. moving all
+        duplicates to a single folder.
+    """
     os.remove(fname)
 
-def action_on_file_list(fnames):
+def action_on_file_list(fnames,**opts):
+    """ Prompt for action on a set of duplicates. """
     for i in range(len(fnames)):
-        print u"[{0:>{width}}] {1}".format(i, fnames[i], width=int(log10(len(fnames))))
+        print(u"[{0:>{width}}] {1}".format(
+                i, fnames[i], width=int(log10(len(fnames)))))
 
     items = list(hyphen_range(raw_input("  Entries to delete: ")))
     if len(items) > 0:
         for i in items:
-            delete_file(fnames[i])
+            remove_file(fnames[i])
     else:
-        items = hyphen_range(raw_input("  Entries to keep (all others will be deleted, enter none to keep all): "))
+        # materialize the generator: repeated "in" tests would exhaust it
+        items = set(hyphen_range(raw_input("  Entries to keep "
+                                           "(all others will be deleted, "
+                                           "enter none to keep all): ")))
         for i in range(len(fnames)):
             if i not in items:
-                delete_file(fnames[i])
-    print
+                remove_file(fnames[i])
+    print("")
+
+def pretty_size(bytes,base=2):
+    """ Pretty print the size of a file using the given base."""
+    global SIZE_BASES
+
+    if base == 0:
+        return bytes
+    elif base not in SIZE_BASES:
+        raise ValueError("Invalid metric prefix base: {}".format(base))
+    else:
+        for suffix in sorted(SIZE_BASES[base], key=SIZE_BASES[base].get, reverse=True):
+            if bytes >= SIZE_BASES[base][suffix]:
+                # float() avoids integer division under Python 2
+                return "{0:.2f}{1}".format(float(bytes) / SIZE_BASES[base][suffix], suffix)
+        else:
+            # we can always fall back to non pretty printed output
+            return bytes
+
+def size_to_int(size):
+    """ Expand the size given with metric/binary suffixes."""
+    size = size.strip()
+
+    if size.isdigit():
+        return int(size)
+
+    global SIZE_BASES
+
+    for b in SIZE_BASES:
+        for suffix in sorted(SIZE_BASES[b], key=SIZE_BASES[b].get, reverse=True):
+            if size.endswith(suffix):
+                s = int(float(size[:-(len(suffix)+1)]) * SIZE_BASES[b][suffix])
+                return s
+    else:
+        raise ValueError("Invalid Suffix on {}".format(size))
 
 if __name__ == '__main__':
     sys.exit(main())
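
The size helpers in this changeset rely on a SIZE_BASES table that is defined outside the lines shown. As a rough illustration of how the two conversions could fit together, here is a minimal standalone sketch. The table contents and suffix spellings are assumptions made for this example, not the project's actual values, and the sketch strips exactly the suffix when parsing, which may differ from the slicing used in the diff.

```python
# Hypothetical suffix table: the real SIZE_BASES is defined elsewhere in the
# script and may use different suffixes or multipliers.
SIZE_BASES = {
    2: {"KiB": 2 ** 10, "MiB": 2 ** 20, "GiB": 2 ** 30},
    10: {"kB": 10 ** 3, "MB": 10 ** 6, "GB": 10 ** 9},
}


def pretty_size(nbytes, base=2):
    """Format a byte count using the largest suffix that fits."""
    if base == 0:
        return str(nbytes)
    if base not in SIZE_BASES:
        raise ValueError("Invalid metric prefix base: {}".format(base))
    units = SIZE_BASES[base]
    # Try the largest multiplier first.
    for suffix in sorted(units, key=units.get, reverse=True):
        if nbytes >= units[suffix]:
            return "{0:.2f}{1}".format(nbytes / float(units[suffix]), suffix)
    # Smaller than every unit: fall back to the plain byte count.
    return str(nbytes)


def size_to_int(size):
    """Expand a size string such as '1.5KiB' or '2MB' into a byte count."""
    size = size.strip()
    if size.isdigit():
        return int(size)
    for units in SIZE_BASES.values():
        # Longest suffix first, so a short suffix cannot shadow a longer one.
        for suffix in sorted(units, key=len, reverse=True):
            if size.endswith(suffix):
                return int(float(size[:-len(suffix)]) * units[suffix])
    raise ValueError("Invalid suffix on {}".format(size))
```

With this assumed table, `size_to_int("1.5KiB")` expands to 1536 and `pretty_size(1536, base=2)` formats it back as `1.50KiB`. Returning a string on every path keeps the result type consistent for callers that concatenate or compare the output.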