Add image de-duplication

Issue #17 resolved
Peter Burner created an issue

I often dump my entire memory card onto my NAS. It would be great if Medio could compare the file hashes if there is a name conflict.

Comments (8)

  1. Jonathan Poland repo owner

    Unfortunately, I just use exiftool underneath to do the renames and it doesn’t support this. If you can find a way to make exiftool do it, then it’d be easy to add. Unfortunately, looking at the docs, it doesn’t seem to support doing anything with checksums.

  2. Peter Burner reporter

    First of all, let me revise my original proposal. I think using checksums is a bad idea since there can be collisions. Yes, that's not very likely, but byte-by-byte comparison is safer and faster. Python has a lib for that: https://docs.python.org/3/library/filecmp.html

    Secondly: I don’t think this can be done with exiftool. You would have to do this in your Python code.

    I can see two possible ways here. Both happen after exiftool is done moving the file:

    1. You find a way to detect if the current file has an index in its name (%%c parameter). If it does, you compare it to all other files that have the same name except for the index. Since the destination path pattern is dynamic, this would require some regex magic on the pattern itself and the result path from exiftool. If you find an identical file to the current one, you can delete the current one.
    2. You always compare the current file to all other files in the current result subdirectory. As long as there are not thousands of photos in this folder this should not take much time if you use byte-by-byte comparison. Here as well: If you find an identical file to the current one, you can delete the current one.

    Solution 2 could be optimized by checking if the destination path pattern contains variables which would result in unknown filename lengths (like %B -> full locale month name). If it does not, you could calculate the filename length from the pattern and compare it to the actual result filename. If the current file does not contain an index, you don't need to compare it with the others in the directory.
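A minimal sketch of solution 2, assuming the path of the freshly moved file is already known. The function name `find_identical` is hypothetical and not part of Medio. It uses `filecmp.cmp` with `shallow=False` to force a content comparison, and a size check first so that most non-duplicates are skipped without reading them.

```python
import filecmp
from pathlib import Path
from typing import Optional


def find_identical(current: Path) -> Optional[Path]:
    """Return another file in the same directory with identical content, or None.

    Hypothetical helper for illustration; not taken from Medio.
    """
    size = current.stat().st_size
    for other in sorted(current.parent.iterdir()):
        if other == current or not other.is_file():
            continue
        # Cheap pre-check: files of different sizes cannot be identical.
        if other.stat().st_size != size:
            continue
        # shallow=False compares file contents instead of trusting os.stat() data.
        if filecmp.cmp(current, other, shallow=False):
            return other
    return None
```

The function only reports the match and leaves the decision to delete to the caller, which keeps the destructive step separate and easy to log or make optional.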

  3. Jonathan Poland repo owner

    I do sometimes have this problem too. Thinking about it more, I think it should be the case that when there is a duplicate, it will hit the %c renaming. Right? So I’d probably lean more toward your #1. You can detect the index param because that part of the name will be more than 6 characters (as long as the user hasn’t changed the %Y/%m_%b/%Y%m%d_%H%M%S%%c.%%e format).
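A rough sketch of that detection, assuming the default naming format is unchanged and that exiftool appends the copy number directly after the six-digit time with no separator. The names `STEM_RE` and `split_copy_index` are hypothetical, for illustration only.

```python
import re
from pathlib import Path

# Default layout only: YYYYmmdd_HHMMSS followed by an optional copy number.
STEM_RE = re.compile(r"(?P<base>\d{8}_\d{6})(?P<index>\d+)?")


def split_copy_index(path: Path):
    """Return (base_stem, index) for a default-format name, else None.

    index is None when the name carries no copy number.
    """
    match = STEM_RE.fullmatch(path.stem)
    if match is None:
        return None  # Not the default format; skip de-duplication.
    index = match.group("index")
    return match.group("base"), (int(index) if index else None)
```

A file whose index is not None would then be compared against siblings that share the same base stem and extension.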

    I do parse the exiftool output for logging purposes, so I could do it at process time, I think. I could also do it as a periodic batch job across all photos.

    I am always hesitant to do any sort of deletion of files, but I’ll try to play around with this a bit.

  4. Jonathan Poland repo owner

    This may be slow if it’s checking lots of photos, as moving them and then comparing can be slow. Also worth noting: this only does the compare if you haven’t changed the default file naming scheme.
