Finding duplicate files using Python
I wrote this script to find and optionally delete duplicate files in a directory tree. It uses MD5 hashes of each file's content to detect duplicates, and is based on zalew's answer on Stack Overflow. So far I have found it sufficient for accurately finding and removing duplicate files in my photograph collection.
"""Find duplicate files inside a directory tree."""

from os import walk, remove, stat
from os.path import join as joinpath
from md5 import md5


def find_duplicates( rootdir ):
    """Find duplicate files in directory tree."""
    filesizes = {}
    # Build up dict with key as filesize and value is list of filenames.
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            filesize = stat( filepath ).st_size
            filesizes.setdefault( filesize, [] ).append( filepath )

    unique = set()
    duplicates = []
    # We are only interested in lists with more than one entry.
    for files in [ flist for flist in filesizes.values() if len( flist ) > 1 ]:
        for filepath in files:
            with open( filepath ) as openfile:
                filehash = md5( openfile.read() ).hexdigest()
                if filehash not in unique:
                    unique.add( filehash )
                else:
                    duplicates.append( filepath )
    return duplicates


if __name__ == '__main__':
    from argparse import ArgumentParser

    PARSER = ArgumentParser( description = 'Finds duplicate files.' )
    PARSER.add_argument( 'root', metavar = 'R', help = 'Dir to search.' )
    PARSER.add_argument( '-remove', action = 'store_true',
                         help = 'Delete duplicate files.' )
    ARGS = PARSER.parse_args()
    DUPS = find_duplicates( ARGS.root )

    print '%d Duplicate files found.' % len( DUPS )
    for f in sorted( DUPS ):
        if ARGS.remove == True:
            remove( f )
            print '\tDeleted ' + f
        else:
            print '\t' + f
I discovered the argparse module (added in Python 2.7) in the standard library this week, and it makes command-line parameter handling nice and concise.
UPDATE: Changed the uniques list into a set and added a first pass using file sizes as a performance improvement; it's a lot faster now.
UPDATE: You can now find this script on github at github.com/endlesslycurious/Duplicate-Files.
You should use a set() instead of an array to store the unique checksums; checking for a collision then takes roughly constant time. With an array, Python does a linear search, which takes more and more time as the array grows. (A little test: storing 10000 MD5 digests and checking for collisions takes ~2 s with an array, ~0.1 s with a set.)
Then, shouldn't you use a faster 'hash' in a first pass to tell very different files apart without even opening them, like the file size?
Finally, why reinvent the wheel? Just use fdupes.
Also, before deleting, you should binary-compare the files in case of an MD5 collision.
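One way to do that final byte-for-byte check is the standard library's filecmp module. A minimal sketch, assuming you have kept track of which file counts as the original (remove_if_identical is just an illustrative helper name):

import filecmp
from os import remove

def remove_if_identical( original, duplicate ):
    """Delete `duplicate` only if its contents really match `original`."""
    # shallow=False makes filecmp compare the actual file contents,
    # not just the os.stat() signatures.
    if filecmp.cmp( original, duplicate, shallow=False ):
        remove( duplicate )
        return True
    return False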
The return value of hexdigest() is a string, so in the code 'unique' can be a set instead of a list. Making 'unique' a set makes the intent clearer, and a 'not in' check on a set is also faster.
Why not:
…
def find_duplicates( rootdir ):
    """Find duplicate files in directory tree."""
    unique = set()
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            filehash = md5( open( filepath, 'rb' ).read() ).digest()
            if filehash not in unique:
                unique.add( filehash )
            else:
                yield filepath
…
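For what it's worth, a generator like this would be consumed lazily, something along these lines ('/some/rootdir' is just a placeholder):

for duplicate in find_duplicates( '/some/rootdir' ):
    print duplicate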
Hm, but this doesn’t show which files are the same? I would rather use a dictionary that maps from md5 sum to filepaths:
from collections import defaultdict

def find_duplicates( rootdir ):
    dup = defaultdict(list)
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            try:
                filehash = md5( file( filepath ).read() ).hexdigest()
            except IOError:
                continue
            dup[filehash].append(filepath)
    return [paths for paths in dup.values() if len(paths) > 1]
The whole script can also be approximated with this simple shell pipeline:
$ find . -type f -exec md5sum {} \; | sort | uniq -d -w 32
If your files tend to be large (like video files), or file access is slow (like files on a network share), you may want to do the duplicate detection in stages.
It can be relatively expensive to read in a whole file and calculate an MD5 hash. It’s cheap to find out a file’s size, though, and files with different sizes cannot be duplicates of each other. So if you start by creating a list of files along with their sizes, you can go back and only calculate the MD5 hash for the files that have identical lengths.
Nice article. You could use a set instead of a list to speed up lookups.
If you’ve got a lot of files, the ‘filehash not in uniques’ check will get very expensive. This is because Python has to scan through the list one by one looking for filehash. So if you have N files, it will take O(N^2) time.
If you initialise unique with set() instead of [], the ‘not in’ check will be quick even with a lot of distinct files.
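A minimal, unscientific sketch of the kind of timing being described above (hashlib.md5 is used here instead of the deprecated md5 module; the absolute numbers will vary by machine):

import timeit
from hashlib import md5

# 10000 distinct digests, stored once as a list and once as a set.
digests = [ md5( str( i ) ).hexdigest() for i in range( 10000 ) ]
as_list = list( digests )
as_set = set( digests )
probe = md5( 'not-present' ).hexdigest()

# Membership tests: the list is scanned linearly, the set is a hash lookup.
print 'list:', timeit.timeit( lambda: probe in as_list, number=1000 )
print 'set: ', timeit.timeit( lambda: probe in as_set, number=1000 )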
Thank you for all the comments!
RE: Set vs Array
Regarding using a set instead of an array: I was only running over about 5000 files, so I didn't notice the speed issue too much, as my hard drive is an SSD. I'll look into making the change to a set, as it sounds like a good optimisation given the performance numbers you're giving me.
RE: Julian
I did not know about fdupes, as I’m still relatively new to the Unix/Mac command line. I inspected all the duplicate files manually before running the script again to remove them.
RE: Artur – dictionaries
I did think about a dictionary and more in-depth reporting of the location of duplicates vs the original file, but I found that duplicates and the original tended to be in the same folder in my use case. It could be a useful feature though.
RE: Jerry & Julian – File sizes test
I like the idea of a first pass using file sizes, that would be a good optimization.
I updated the post with a new version of find_duplicates which uses a first pass comparing file sizes and a second pass comparing hashes of files that are the same size. The new version is a lot faster than the original; many thanks for all the advice!
It would be fairly simple to tweak the new version to return a dictionary of duplicate files with the hash as the keys and a list of filenames as the values.
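A rough sketch of that tweak, reusing the same two-pass structure as the script above (find_duplicate_groups is just an illustrative name, and hashlib.md5 stands in for the md5 module):

from os import walk, stat
from os.path import join as joinpath
from hashlib import md5

def find_duplicate_groups( rootdir ):
    """Return a dict mapping each hash to every path that produced it."""
    # First pass: bucket file paths by size.
    filesizes = {}
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            filesizes.setdefault( stat( filepath ).st_size, [] ).append( filepath )

    # Second pass: hash only files that share a size, grouping paths by hash.
    byhash = {}
    for flist in filesizes.values():
        if len( flist ) < 2:
            continue
        for filepath in flist:
            with open( filepath, 'rb' ) as openfile:
                filehash = md5( openfile.read() ).hexdigest()
            byhash.setdefault( filehash, [] ).append( filepath )

    # Keep only hashes shared by two or more files.
    return dict( ( h, paths ) for h, paths in byhash.items() if len( paths ) > 1 )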
I wrote a tool like this, too. It compares files byte-by-byte, which can be quite a lot faster than hashing the whole contents. See http://liw.fi/dupfiles/ for my program. That page lists a few other tools too, the fastest of which seems to be the one called hardlink (http://jak-linux.org/projects/hardlink/).
Regarding the MD5 hashes: it's really not necessary to hash the whole file. When I did a similar task I indexed on file size as Jerry did, then on an MD5 hash of the first 1 kbyte, and only if there were matches on that did I do a full file hash.
Conversely, if one compares the md5(1k) hashes (and deliberately ignores file sizes) the program can find files where one is a truncated version of another.
Lastly, your program stores its results only in volatile storage, which means that if the task is big (if you're looking for duplicates among millions of files, for example), interrupting or rerunning it means all the work must be done again. If you store the data in a persistent store like PyBSDDB instead, you can resume and rerun with little additional work.
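A minimal sketch of the first-kilobyte hash mentioned above, intended purely as a cheap filter before a full hash or a byte compare (quick_hash and the 1024-byte length are illustrative):

from hashlib import md5

def quick_hash( filepath, length=1024 ):
    """MD5 of only the first `length` bytes -- a cheap filter, not a verdict."""
    with open( filepath, 'rb' ) as openfile:
        return md5( openfile.read( length ) ).hexdigest()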
RE: Lars
I like your solution of replacing duplicates with links between the two distributions – very efficient!
RE: Finlay
My script is naive in a few ways as I wrote it to deal with my photograph collection which had approximately 5500 files of which about 250 were duplicates. So I felt that hashing the whole file was not too extreme a thing to do, although as you say it wouldn’t be a great idea for very large numbers of files unless you had a correspondingly large amount of RAM available.
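For reference, reading and hashing a file in fixed-size blocks keeps memory use flat however large the file is; a minimal sketch (hash_file and the one-megabyte block size are illustrative):

from hashlib import md5

def hash_file( filepath, blocksize=1024 * 1024 ):
    """MD5 of the whole file, read one block at a time."""
    hasher = md5()
    with open( filepath, 'rb' ) as openfile:
        block = openfile.read( blocksize )
        while block:
            hasher.update( block )
            block = openfile.read( blocksize )
    return hasher.hexdigest()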
Storing results between runs in a database like BerkeleyDB or SQLite would be very useful for very large filesets. For the fileset size I am dealing with, I tend to use the csv module for outputting debug information, as I can then use Excel to inspect it.
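A minimal sketch of persisting hashes between runs using the standard library's shelve module (which can sit on top of a Berkeley DB backend); the cache filename and helper name are made up, and a real version would probably also key on size and modification time so changed files get rehashed:

import shelve
from hashlib import md5

def cached_hash( filepath, cachefile='hash_cache.db' ):
    """Return the MD5 of filepath, reusing a value stored by a previous run."""
    # In real use you would keep the shelf open for the whole run.
    cache = shelve.open( cachefile )
    try:
        if filepath not in cache:
            with open( filepath, 'rb' ) as openfile:
                cache[ filepath ] = md5( openfile.read() ).hexdigest()
        return cache[ filepath ]
    finally:
        cache.close()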
Repost with working link:
I use fdupes for this, it’s packaged in at least Ubuntu (sudo apt-get install fdupes).
For photos there’s also findimagedupes (sudo apt-get install findimagedupes) which even finds visually similar images.
Why is nobody wondering that the "open" file handles are never closed?
Open file objects should always be closed!
Using the "with" statement (Python 2.6+) makes this really simple:
with file(filepath) as open_file:
    filehash = md5(open_file.read()).hexdigest()
RE: Dag
I seem to suck at finding existing tools for these things; where is the best place to look for open source tools other than Google?
RE: Masaru
Good catch, I’ve updated the script.
Daniel, the two places I usually look at are Debian (30 kilopackages and growing), since that makes it easy for me to install if it’s there, and http://freshmeat.net/ (who would like to list everything free).
[…] enhance convenience I extended the original script from Daniel Brown by adding a graphical user interface using Tkinter, which is integrated in […]
Hey Daniel, thanks for the program! I’m new to python and your program is a great example and help to me for study as well as utility.
I am using this tool to Find Similar Documents.
Pretty impressive results – give it a try,