I wrote this script to find and optionally delete duplicate files in a directory tree. It detects duplicates by comparing MD5 hashes of each file's contents, and is based on zalew's answer on Stack Overflow. So far it has been sufficient for accurately finding and removing duplicates in my photograph collection.
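
The core idea is that two files with identical contents produce the same MD5 digest, so comparing digests is enough to flag duplicates. A minimal sketch of that check, using hashlib and two hypothetical paths for illustration:

from hashlib import md5

def file_md5( filepath ):
    """Return the MD5 hex digest of a file's contents."""
    # Binary mode so the digest is computed over the raw bytes.
    with open( filepath, 'rb' ) as openfile:
        return md5( openfile.read() ).hexdigest()

# Hypothetical example paths.
if file_md5( 'holiday_001.jpg' ) == file_md5( 'holiday_001 (copy).jpg' ):
    print 'the two files have identical contents'

The full script below does the same digest comparison, but only after a first pass that groups files by size.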

"""Find duplicate files inside a directory tree."""
 
from os import walk, remove, stat
from os.path import join as joinpath
from hashlib import md5
 
def find_duplicates( rootdir ):
    """Find duplicate files in directory tree."""
    filesizes = {}
    # Build a dict mapping each file size to the list of paths with that size.
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            filesize = stat( filepath ).st_size
            filesizes.setdefault( filesize, [] ).append( filepath )
    unique = set()
    duplicates = []
    # We are only interested in lists with more than one entry.
    for files in [ flist for flist in filesizes.values() if len(flist)>1 ]:
        for filepath in files:
            # Open in binary mode so the hash is computed over the raw bytes.
            with open( filepath, 'rb' ) as openfile:
                filehash = md5( openfile.read() ).hexdigest()
            if filehash not in unique:
                unique.add( filehash )
            else:
                duplicates.append( filepath )
    return duplicates
 
if __name__ == '__main__':
    from argparse import ArgumentParser
 
    PARSER = ArgumentParser( description='Finds duplicate files.' )
    PARSER.add_argument( 'root', metavar='R', help='Dir to search.' )
    PARSER.add_argument( '-remove', action='store_true',
                         help='Delete duplicate files.' )
    ARGS = PARSER.parse_args()
 
    DUPS = find_duplicates( ARGS.root )
 
    print '%d Duplicate files found.' % len(DUPS)
    for f in sorted(DUPS):
        if ARGS.remove:
            remove( f )
            print '\tDeleted '+ f
        else:
            print '\t'+ f
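
Note that find_duplicates keeps the first file it encounters with a given hash and reports the later ones as duplicates, so which copy survives a -remove run depends on the order os.walk visits them. The function can also be called directly; a small sketch with a hypothetical path:

# Hypothetical path; prints the extra copies without deleting anything.
for dup in find_duplicates( '/home/me/photos' ):
    print 'duplicate:', dup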

I discovered the argparse module (added to the standard library in Python 2.7) this week, and it makes command-line parameter handling nice and concise.
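
For example, with the script saved as (say) duplicates.py it would be run as python duplicates.py /home/me/photos -remove. A small sketch of how the two arguments defined above parse, using the fact that parse_args also accepts an explicit argument list (the path is hypothetical):

from argparse import ArgumentParser

PARSER = ArgumentParser( description='Finds duplicate files.' )
PARSER.add_argument( 'root', metavar='R', help='Dir to search.' )
PARSER.add_argument( '-remove', action='store_true',
                     help='Delete duplicate files.' )

# parse_args takes an explicit list, which is handy for trying things out.
print PARSER.parse_args( [ '/home/me/photos' ] )
# -> Namespace(remove=False, root='/home/me/photos')
print PARSER.parse_args( [ '-remove', '/home/me/photos' ] )
# -> Namespace(remove=True, root='/home/me/photos')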

UPDATE: Changed the uniques list into a set and added a first pass on file sizes as a performance improvement; it's a lot faster now.
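
The size pass pays off because a file whose size appears only once in the tree cannot have an identical twin, so it is never opened or hashed. A rough sketch that just counts how many files would actually need hashing (hypothetical path):

from os import walk, stat
from os.path import join as joinpath

def hash_candidates( rootdir ):
    """Return (total files, files that would still need hashing after the size pass)."""
    sizes = {}
    total = 0
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            total += 1
            filesize = stat( joinpath( path, filename ) ).st_size
            sizes.setdefault( filesize, [] ).append( filename )
    # Only files sharing a size with at least one other file need hashing.
    candidates = sum( len( flist ) for flist in sizes.values() if len( flist ) > 1 )
    return total, candidates

TOTAL, CANDIDATES = hash_candidates( '/home/me/photos' )
print 'would hash %d of %d files' % ( CANDIDATES, TOTAL )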

UPDATE: You can now find this script on GitHub at github.com/endlesslycurious/Duplicate-Files.