Finding duplicate files using Python
I wrote this script to find and optionally delete duplicate files in a directory tree. It uses MD5 hashes of each file's content to detect duplicates, and is based on zalew's answer on Stack Overflow. So far I have found it sufficient for accurately finding and removing duplicate files in my photograph collection.
"""Find duplicate files inside a directory tree."""

from os import walk, remove, stat
from os.path import join as joinpath
from md5 import md5


def find_duplicates( rootdir ):
    """Find duplicate files in directory tree."""
    filesizes = {}
    # Build up dict with key as filesize and value is list of filenames.
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            filesize = stat( filepath ).st_size
            filesizes.setdefault( filesize, [] ).append( filepath )

    unique = set()
    duplicates = []
    # We are only interested in lists with more than one entry.
    for files in [ flist for flist in filesizes.values() if len( flist ) > 1 ]:
        for filepath in files:
            with open( filepath ) as openfile:
                filehash = md5( openfile.read() ).hexdigest()
                if filehash not in unique:
                    unique.add( filehash )
                else:
                    duplicates.append( filepath )
    return duplicates


if __name__ == '__main__':
    from argparse import ArgumentParser

    PARSER = ArgumentParser( description = 'Finds duplicate files.' )
    PARSER.add_argument( 'root', metavar = 'R', help = 'Dir to search.' )
    PARSER.add_argument( '-remove', action = 'store_true',
                         help = 'Delete duplicate files.' )
    ARGS = PARSER.parse_args()
    DUPS = find_duplicates( ARGS.root )

    print '%d Duplicate files found.' % len( DUPS )
    for f in sorted( DUPS ):
        if ARGS.remove == True:
            remove( f )
            print '\tDeleted ' + f
        else:
            print '\t' + f
I discovered the argparse module (added in Python 2.7) in the standard library this week, and it makes command-line parameter handling nice and concise.
UPDATE: Changed the uniques list into a set and added a first pass using file sizes as a performance improvement; it's a lot faster now.
UPDATE: You can now find this script on github at github.com/endlesslycurious/Duplicate-Files.
You should use a set() instead of an array to store the unique checksums; checking for a collision then takes roughly constant time. With an array, Python does a linear search, which takes more and more time as the array grows. (A little test: storing 10000 MD5 digests and checking for collisions takes ~2 s with an array, ~0.1 s with a set.)
Then, shouldn't you use a faster 'hash' in a first pass to tell very different files apart without even opening them, like the file size?
Finally, why reinvent the wheel? Just use fdupes.
Also, before deleting, you should binary-compare the files in case of an MD5 collision.
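One way to do that final byte-for-byte check is the standard library's filecmp module. A minimal sketch, assuming you have kept track of which file counts as the original (remove_if_identical is just an illustrative helper name):

import filecmp
from os import remove

def remove_if_identical( original, duplicate ):
    """Delete `duplicate` only if its contents really match `original`."""
    # shallow=False makes filecmp compare the actual file contents,
    # not just the os.stat() signatures.
    if filecmp.cmp( original, duplicate, shallow=False ):
        remove( duplicate )
        return True
    return False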
The return value of hexdigest() is a string, so in the code 'unique' can be a set instead of a list. Making 'unique' a set makes the intent clearer, and a 'not in' check on a set is also faster.
Why not:
…
def find_duplicates( rootdir ):
    """Find duplicate files in directory tree."""
    unique = set()
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            filehash = md5( open( filepath, 'rb' ).read() ).digest()
            if filehash not in unique:
                unique.add( filehash )
            else:
                yield filepath
…
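For what it's worth, a generator like this would be consumed lazily, something along these lines ('/some/rootdir' is just a placeholder):

for duplicate in find_duplicates( '/some/rootdir' ):
    print duplicate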
Hm, but this doesn’t show which files are the same? I would rather use a dictionary that maps from md5 sum to filepaths:
from collections import defaultdict

def find_duplicates( rootdir ):
    dup = defaultdict(list)
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            try:
                filehash = md5( file( filepath ).read() ).hexdigest()
            except IOError:
                continue
            dup[filehash].append(filepath)
    return [paths for paths in dup.values() if len(paths) > 1]
The whole script can also be approximated with this simple shell pipeline:
$ find . -type f -exec md5sum {} \; | sort | uniq -d -w 32
If your files tend to be large (like video files), or file access is slow (like files on a network share), you may want to do the duplicate detection in stages.
It can be relatively expensive to read in a whole file and calculate an MD5 hash. It’s cheap to find out a file’s size, though, and files with different sizes cannot be duplicates of each other. So if you start by creating a list of files along with their sizes, you can go back and only calculate the MD5 hash for the files that have identical lengths.
Nice article. You could use a set instead of a list to speed up lookups.
If you’ve got a lot of files, the ‘filehash not in uniques’ check will get very expensive. This is because Python has to scan through the list one by one looking for filehash. So if you have N files, it will take O(N^2) time.
If you initialise unique with set() instead of [], the ‘not in’ check will be quick even with a lot of distinct files.
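A minimal, unscientific sketch of the kind of timing being described above (hashlib.md5 is used here instead of the deprecated md5 module; the absolute numbers will vary by machine):

import timeit
from hashlib import md5

# 10000 distinct digests, stored once as a list and once as a set.
digests = [ md5( str( i ) ).hexdigest() for i in range( 10000 ) ]
as_list = list( digests )
as_set = set( digests )
probe = md5( 'not-present' ).hexdigest()

# Membership tests: the list is scanned linearly, the set is a hash lookup.
print 'list:', timeit.timeit( lambda: probe in as_list, number=1000 )
print 'set: ', timeit.timeit( lambda: probe in as_set, number=1000 )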
Thank you for all the comments!
RE: Set vs Array
Regarding using a set instead of an array: I was only running over about 5000 files, so I didn't notice the speed issue too much, as my hard drive is an SSD. I'll look into making the change to a set, as it sounds like a good optimisation given the performance numbers you're giving me.
RE: Julian
I did not know about fdupes, as I’m still relatively new to the Unix/Mac command line. I inspected all the duplicate files manually before running the script again to remove them.
RE: Artur – dictionaries
I did think about a dictionary and more in-depth reporting of the location of duplicates vs the original file, but I found that duplicates and the original tended to be in the same folder in my use case. It could be a useful feature though.
RE: Jerry & Julian – File sizes test
I like the idea of a first pass using file sizes, that would be a good optimization.
I updated the post with a new version of find_duplicates which uses a first pass comparing file sizes and a second pass comparing hashes of files that are the same size. The new version is a lot faster than the original; many thanks for all the advice!
It would be fairly simple to tweak the new version to return a dictionary of duplicate files with the hash as the keys and a list of filenames as the values.
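A rough sketch of that tweak, reusing the same two-pass structure as the script above (find_duplicate_groups is just an illustrative name, and hashlib.md5 stands in for the md5 module):

from os import walk, stat
from os.path import join as joinpath
from hashlib import md5

def find_duplicate_groups( rootdir ):
    """Return a dict mapping each hash to every path that produced it."""
    # First pass: bucket file paths by size.
    filesizes = {}
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            filesizes.setdefault( stat( filepath ).st_size, [] ).append( filepath )

    # Second pass: hash only files that share a size, grouping paths by hash.
    byhash = {}
    for flist in filesizes.values():
        if len( flist ) < 2:
            continue
        for filepath in flist:
            with open( filepath, 'rb' ) as openfile:
                filehash = md5( openfile.read() ).hexdigest()
            byhash.setdefault( filehash, [] ).append( filepath )

    # Keep only hashes shared by two or more files.
    return dict( ( h, paths ) for h, paths in byhash.items() if len( paths ) > 1 )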
I wrote a tool like this, too. It compares files byte-by-byte, which can be quite a lot faster than hashing the whole contents. See http://liw.fi/dupfiles/ for my program. That page lists a few other tools too, the fastest of which seems to be the one called hardlink (http://jak-linux.org/projects/hardlink/).
Regarding the MD5 hashes: it's really not necessary to hash the whole file. When I did a similar task I indexed on file size as Jerry did, then on an MD5 hash of the first 1 kbyte, and only if there were matches on that did I do a full file hash.
Conversely, if one compares the md5(1k) hashes (and deliberately ignores file sizes) the program can find files where one is a truncated version of another.
Lastly, your program stores its results only in volatile storage, which means that if the task is big (if you're looking for duplicates among millions of files, for example), interrupting or rerunning it means all the work must be done again. If you store the data in a persistent store like PyBSDDB instead, you can resume and rerun with little additional work.
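A minimal sketch of the first-kilobyte hash mentioned above, intended purely as a cheap filter before a full hash or a byte compare (quick_hash and the 1024-byte length are illustrative):

from hashlib import md5

def quick_hash( filepath, length=1024 ):
    """MD5 of only the first `length` bytes -- a cheap filter, not a verdict."""
    with open( filepath, 'rb' ) as openfile:
        return md5( openfile.read( length ) ).hexdigest()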
RE: Lars
I like your solution of replacing duplicates with links between the two distributions – very efficient!
RE: Finlay
My script is naive in a few ways as I wrote it to deal with my photograph collection which had approximately 5500 files of which about 250 were duplicates. So I felt that hashing the whole file was not too extreme a thing to do, although as you say it wouldn’t be a great idea for very large numbers of files unless you had a correspondingly large amount of RAM available.
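For reference, reading and hashing a file in fixed-size blocks keeps memory use flat however large the file is; a minimal sketch (hash_file and the one-megabyte block size are illustrative):

from hashlib import md5

def hash_file( filepath, blocksize=1024 * 1024 ):
    """MD5 of the whole file, read one block at a time."""
    hasher = md5()
    with open( filepath, 'rb' ) as openfile:
        block = openfile.read( blocksize )
        while block:
            hasher.update( block )
            block = openfile.read( blocksize )
    return hasher.hexdigest()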
Storing results between runs in a database like BerkeleyDB or SQLite would be very useful for very large filesets. For the fileset size I am dealing with, I tend to use the csv module for outputting debug information, as I can then use Excel to inspect it.
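A minimal sketch of persisting hashes between runs using the standard library's shelve module (which can sit on top of a Berkeley DB backend); the cache filename and helper name are made up, and a real version would probably also key on size and modification time so changed files get rehashed:

import shelve
from hashlib import md5

def cached_hash( filepath, cachefile='hash_cache.db' ):
    """Return the MD5 of filepath, reusing a value stored by a previous run."""
    # In real use you would keep the shelf open for the whole run.
    cache = shelve.open( cachefile )
    try:
        if filepath not in cache:
            with open( filepath, 'rb' ) as openfile:
                cache[ filepath ] = md5( openfile.read() ).hexdigest()
        return cache[ filepath ]
    finally:
        cache.close()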
Repost with working link:
I use fdupes for this, it’s packaged in at least Ubuntu (sudo apt-get install fdupes).
For photos there’s also findimagedupes (sudo apt-get install findimagedupes) which even finds visually similar images.
Why is nobody wondering that the "open" file handles are never closed?
Open file objects should always be closed!
Using the "with" statement (Python 2.6+) makes this really simple:
with file(filepath) as open_file:
    filehash = md5(open_file.read()).hexdigest()
RE: Dag
I seem to suck at finding existing tools for these things; where is the best place to look for open source tools other than Google?
RE: Masaru
Good catch, I’ve updated the script.
Daniel, the two places I usually look at are Debian (30 kilopackages and growing), since that makes it easy for me to install if it’s there, and http://freshmeat.net/ (who would like to list everything free).
[…] enhance convenience I extended the original script from Daniel Brown by adding a graphical user interface using Tkinter, which is integrated in […]
Hey Daniel, thanks for the program! I’m new to python and your program is a great example and help to me for study as well as utility.
I am using this tool to Find Similar Documents.
Pretty impressive results – give it a try,