Add image de-duplication
I often dump my entire memory card onto my NAS. It would be great if Medio could compare the file hashes if there is a name conflict.
Comments (8)
-
repo owner -
reporter First of all let me revise my original proposal. I think using checksums is a bad idea since there can be conflicts. Yes, that's not very likely but byte-by-byte comparison is safer and faster. Python has a lib for that: https://docs.python.org/3/library/filecmp.html
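For reference, a minimal sketch of a byte-by-byte check with `filecmp`. The helper name `is_same_content` is made up for illustration. Passing `shallow=False` matters here: with the default `shallow=True`, `filecmp.cmp` treats files with matching `os.stat()` signatures (type, size, modification time) as equal without reading them.

```python
import filecmp


def is_same_content(path_a, path_b):
    """Return True if both files contain exactly the same bytes.

    Hypothetical helper, not part of Medio.
    """
    # shallow=False compares the file contents instead of trusting
    # matching os.stat() signatures. Files of different size are still
    # rejected without being read.
    return filecmp.cmp(path_a, path_b, shallow=False)
```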
Secondly: I don’t think this can be done with exiftool. You would have to do this in your Python code.
I can see two possible ways here. Both happen after exiftool is done moving the file:
- You find a way to detect if the current file has an index in its name (%%c parameter). If it does, you compare it to all other files that have the same name except for the index. Since the destination path pattern is dynamic, this would require some regex magic on the pattern itself and the result path from exiftool. If you find an identical file to the current one, you can delete the current one.
- You always compare the current file to all other files in the current result subdirectory. As long as there are not thousands of photos in this folder this should not take much time if you use byte-by-byte comparison. Here as well: If you find an identical file to the current one, you can delete the current one.
Solution 2 could be optimized by checking if the destination path pattern contains variables which would result in unknown filename lengths (like %B -> full locale month name). If it does not, you could calculate the filename length from the pattern and compare it to the actual result filename. If the current file does not contain an index, you don't need to compare it with the others in the directory.
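A rough sketch of solution 2, assuming the path of the file exiftool just moved is already known. The function name `find_duplicate_in_dir` is hypothetical. The size check is an extra shortcut not mentioned above: files of different size cannot be identical, so they are skipped before any content is read.

```python
import filecmp
from pathlib import Path


def find_duplicate_in_dir(current):
    """Return another file in the same directory with identical bytes, or None.

    Hypothetical helper, not part of Medio.
    """
    current = Path(current)
    size = current.stat().st_size
    for other in sorted(current.parent.iterdir()):
        # Skip the file itself and anything that is not a regular file.
        if other == current or not other.is_file():
            continue
        # Cheap pre-filter: a different size means different content.
        if other.stat().st_size != size:
            continue
        # Full byte-by-byte comparison.
        if filecmp.cmp(current, other, shallow=False):
            return other
    return None
```

If this returns a path, the current file is the redundant copy and can be deleted.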
-
reporter - changed title to Add image de-duplication
-
repo owner I do sometimes have this problem too. Thinking about it more, I think it should be the case that when there is a duplicate, it will hit the %c renaming. Right? So I’d probably lean more toward your #1. You can detect the index param because that part of the name will be more than 6 characters (as long as the user hasn’t changed the %Y/%m_%b/%Y%m%d_%H%M%S%%c.%%e format). I do parse the exiftool output for logging purposes, so I could do it at process time, I think. I could also do it as a periodic batch job across all photos.
I am always hesitant to do any sort of deletion of files. I’ll try to play around with this a bit.
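A sketch of that length-based detection, valid only for the default `%Y%m%d_%H%M%S%%c.%%e` file name. It assumes the copy index is a run of digits appended directly after the six time digits; that suffix shape is an assumption, not something confirmed in this thread. The name `split_copy_index` is hypothetical.

```python
import re
from pathlib import Path

# Default stem: 8 date digits, "_", 6 time digits, then the copy index.
# Assumption: the index consists of digits appended with no separator.
_DEFAULT_STEM = re.compile(r"^(\d{8}_\d{6})(\d+)$")


def split_copy_index(path):
    """Return (path_without_index, index), or None if there is no index."""
    path = Path(path)
    match = _DEFAULT_STEM.match(path.stem)
    if match is None:
        return None
    original = path.with_name(match.group(1) + path.suffix)
    return original, int(match.group(2))
```

A result of `None` means the file was not renamed because of a conflict, so no comparison is needed.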
-
reporter please do
-
repo owner Good news!
This will be fixed in 1.0. There is an option to enable it during install, defaults to on. The way it works is it lets exiftool move the renamed duplicate file into place with a name like fname-A, then will remove it if it’s exactly the same as fname. It uses filecmp.
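Roughly, the move-then-compare step could look like the sketch below. This is not the code from the commit; `remove_if_duplicate` and its arguments are made-up names, and the caller is assumed to already know both paths.

```python
import filecmp
import os


def remove_if_duplicate(renamed, original):
    """Delete `renamed` if it is byte-identical to `original`.

    Returns True if the file was removed. Hypothetical helper.
    """
    if not os.path.isfile(original):
        return False
    # Never delete when both names point at the same file on disk.
    if os.path.samefile(renamed, original):
        return False
    if filecmp.cmp(renamed, original, shallow=False):
        os.remove(renamed)
        return True
    return False
```

Only the renamed copy is ever deleted, so the file that was there first stays untouched.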
In this commit:
https://bitbucket.org/polandj/medio/commits/3d4f01c240bbb0bcf465d17edbf04c16c4b3130a
-
repo owner This may be slow if it’s checking lots of photos, since moving them and then comparing takes time. Also worth noting: this only does the compare if you haven’t changed the default file naming scheme.
-
repo owner - changed status to resolved
1.0
Unfortunately, I just use exiftool underneath to do the renames and it doesn’t support this. If you can find a way to make exiftool do it, then it’d be easy to add. Unfortunately, looking at the docs, it doesn’t seem to support doing anything with checksums.