Finding Duplicate Files and Removing Them

Find and remove duplicate files using Python

Your code is just a little more complex than necessary, and you didn't use a proper way to build a file path out of a directory path and a file name (that's what os.path.join() is for). Also, I think you should not remove files which have no original (i.e. which aren't actually duplicates even though their name looks like it).

Try this:

for file_name in file_list:
    if "(1)" not in file_name:
        continue
    original_file_name = file_name.replace('(1)', '')
    if not os.path.exists(os.path.join(file_path, original_file_name)):
        continue  # do not remove files which have no original
    os.remove(os.path.join(file_path, file_name))

Mind, though, that this doesn't work properly for files which have multiple occurrences of (1) in their name, and files with (2) or higher numbers aren't handled at all. So my real proposal would be this:

  • Make a list of all files in the whole directory tree below a given start (use os.walk() to get this), then
  • sort all files by size, then
  • walk linearly through this list, identify the doubles (which are neighbours in this list) and
  • yield each such double-group (i. e. a small list of files (typically just two) which are identical).

Of course you should then check the contents of these few files to be sure that they aren't just accidentally the same size without being identical. Once you are sure you have a group of identical files, remove all but the one with the simplest name (e.g. without suffixes like (1)); see the sketch below.
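
A minimal sketch of that approach could look like the following (the function name find_duplicate_groups and the use of filecmp.cmp() for the content check are my own choices, not code from your question):

import os
import filecmp

def find_duplicate_groups(root_dir_path):
    # collect all files in the whole directory tree below root_dir_path
    all_files = []
    for dir_path, _sub_dirs, file_names in os.walk(root_dir_path):
        for file_name in file_names:
            all_files.append(os.path.join(dir_path, file_name))

    # sort all files by size, so candidates for being identical are neighbours
    all_files.sort(key=os.path.getsize)

    # walk linearly through the list and group neighbours of equal size
    index = 0
    while index < len(all_files):
        size = os.path.getsize(all_files[index])
        group = [all_files[index]]
        index += 1
        while index < len(all_files) and os.path.getsize(all_files[index]) == size:
            group.append(all_files[index])
            index += 1
        if len(group) < 2:
            continue
        # equal size is not enough -- check the contents, too
        # (for simplicity this only compares each candidate against the first file)
        first = group[0]
        identical = [first] + [f for f in group[1:]
                               if filecmp.cmp(first, f, shallow=False)]
        if len(identical) > 1:
            yield identical

Each yielded group can then be reduced by keeping the file with the simplest name and removing the others.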


By the way, I would call the file_path something like dir_path or root_dir_path (because it is a directory and a complete path to it).

Python - Find duplicate files and move to another folder

So you have two options here (as described by the comments to your question):

  1. Prompt for the target directory beforehand
  2. Prompt for the target directory afterward

The first option is probably the simplest and most efficient, and requires the smallest amount of refactoring. It does, however, require the user to input a target directory whether or not there are any duplicate files (or an error occurs while searching), so it might be worse from a user's perspective:

# prompt for directory beforehand
destination = askdirectory(title="Select the target folder")

for folder, sub_folder, files in walker:
    for file in files:
        filepath = os.path.join(folder, file)
        filehash = hashlib.md5(open(filepath, "rb").read()).hexdigest()

        if filehash in uniqueFiles:
            shutil.move(filepath, destination, copy_function=shutil.copytree)
        else:
            uniqueFiles[filehash] = filepath  # remember the first file seen with this hash

The second option would allow you to perform all the necessary checks and error handling, but is more complex and requires more refactoring:

# dictionary mapping each hash to all files with that hash
hashes = {}

for folder, sub_folder, files in walker:
    for file in files:
        filepath = os.path.join(folder, file)
        filehash = hashlib.md5(open(filepath, "rb").read()).hexdigest()

        if filehash in hashes:
            hashes[filehash].append(filepath)
        else:
            hashes[filehash] = [filepath]

# prompt for the target directory afterward
destination = askdirectory(title="Select the target folder")

for duplicates in hashes.values():
    if len(duplicates) < 2:
        continue

    # keep the first file of each group, move the rest
    for duplicate in duplicates[1:]:
        shutil.move(duplicate, destination, copy_function=shutil.copytree)

As a side note, I am not familiar with hashlib, but I suspect that you will want to be closing the files you are hashing, especially if checking a large file tree:

with open(filepath, "rb") as file:
    filehash = hashlib.md5(file.read()).hexdigest()
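
And if some of the files are large, you might not want to read each one into memory in a single call at all. A minimal sketch of hashing in chunks (the helper name hash_file and the chunk size are arbitrary choices of mine):

import hashlib

def hash_file(filepath, chunk_size=65536):
    # read the file in fixed-size chunks and feed them to the hash object
    md5 = hashlib.md5()
    with open(filepath, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()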

Find duplicate files in one folder and remove the newest one (Google Script)

This should work:

function findEntry(arr, name, size) {
  for (var i = 0; i < arr.length; i++) {
    if (arr[i][0] === name && arr[i][1] === size) return true;
  }
  return false;
}

function deleteDuplicates() {
  var files = DriveApp.getFiles(),
      list = [];
  while (files.hasNext()) {
    var file = files.next(),
        name = file.getName(),
        size = file.getSize();
    if (name.endsWith(".jpg") || name.endsWith(".jpeg") || name.endsWith(".png") || name.endsWith(".gif")) {
      if (findEntry(list, name, size)) {
        file.setTrashed(true);
      } else {
        list.push([name, size]);
      }
    }
  }
}

Delete duplicate files based on modified time, keeping the first created file

Here is my suggested solution. See the comments in the code itself for a detailed explanation, but the basic idea is that you build up a dictionary of lists of 2-element tuples, where the keys of the dictionary are the number of minutes since the start of Unix time and the 2-tuples contain the filename and the remaining seconds. You then loop over the values of the dictionary (lists of tuples for files modified within the same calendar minute), sort each list by the seconds, and delete all except the first.

The use of a defaultdict here is just a convenience to avoid the need to explicitly add new lists to the dictionary when looping over files, because these will be added automatically when needed.

import os
import glob
from collections import defaultdict

files_by_minute = defaultdict(list)

# group together all the files according to the number of minutes since the
# start of Unix time, storing the filename and the number of remaining seconds
for filename in glob.glob("C:\\Users\\xxx\\*.AVI"):
    time_mod = os.path.getmtime(filename)
    mins = time_mod // 60
    secs = time_mod % 60
    files_by_minute[mins].append((filename, secs))

# go through each of these lists of files, removing the newer ones if
# there is more than one
for fileset in files_by_minute.values():
    if len(fileset) > 1:
        # sort tuples by second element (i.e. the seconds)
        fileset.sort(key=lambda t: t[1])
        # remove all except the first
        for file_info in fileset[1:]:
            filename = file_info[0]
            print(f"removing {filename}")
            os.remove(filename)

