Techniques for Finding Near Duplicate Records


If you're just doing small batches that are relatively well formed, the compare.linkage() or compare.dedup() functions in R's RecordLinkage package are a great starting point. But if you have big batches, you might have to do some more tinkering.

I use the jarowinkler(), levenshteinSim(), and soundex() functions in RecordLinkage to write my own function with my own weighting scheme (also, as it stands, you can't use soundex() on big data sets with RecordLinkage).
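
The answer above works in R; as a rough illustration of the same weighted-blend idea, here is a hypothetical Python sketch using the jellyfish library (the match_score() name and the weights are mine, not from the answer):

import jellyfish

def match_score(a, b, w_jw=0.5, w_lev=0.3, w_snd=0.2):
    # Weighted blend of Jaro-Winkler similarity, normalized Levenshtein
    # similarity, and a Soundex-agreement flag on the first word.
    # The weights are arbitrary placeholders; assumes non-empty names.
    jw = jellyfish.jaro_winkler_similarity(a, b)
    lev = 1 - jellyfish.levenshtein_distance(a, b) / max(len(a), len(b), 1)
    snd = 1.0 if jellyfish.soundex(a.split()[0]) == jellyfish.soundex(b.split()[0]) else 0.0
    return w_jw * jw + w_lev * lev + w_snd * snd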

If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
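
In Python terms (the answer itself stages these tables in MySQL), that preprocessing might look like this hypothetical sketch; the helper names are mine:

import re

def normalize(name):
    # lower-case and strip punctuation, keeping word boundaries
    return re.sub(r'[^a-z0-9 ]+', '', name.lower()).strip()

def first_word(name):
    words = normalize(name).split()
    return words[0] if words else ''

def acronymize(name):
    # "International Business Machines" -> "ibm"
    return ''.join(w[0] for w in normalize(name).split())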

So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters of each name, or the first three letters of the name plus the first three letters of the acronym). Then I calculate match scores using the above functions.
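
A hypothetical pandas stand-in for that blocking join (the answer does this in MySQL; the sample names, and the match_score() helper from the sketch above, are mine):

import pandas as pd

left = pd.DataFrame({'norm': ['international business machines', 'acme limited']})
right = pd.DataFrame({'norm': ['intl business machines ltd', 'acme ltd']})

# Block on the first three letters so only plausible pairs get scored
left['block'] = left['norm'].str[:3]
right['block'] = right['norm'].str[:3]
candidates = left.merge(right, on='block', suffixes=('_a', '_b'))

candidates['score'] = [match_score(a, b)  # from the sketch above
                       for a, b in zip(candidates['norm_a'], candidates['norm_b'])]
print(candidates.sort_values('score', ascending=False))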

You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.

Best way or algorithm to near duplicate check against huge list of files?

A partial answer, but one obvious optimization is to only compare files of approximately the same size. Also, comparing file a with file b is the same as comparing b with a, so you only need each unordered pair: 20000 files gives 20000 * (20000 - 1) / 2 ≈ 200 million comparisons.
300 MB is not that big; you could try to read in all the files first.
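
A minimal sketch of that size-based pruning, assuming grouping on exact size is close enough (the answer says "approximately", so real code might bucket sizes into ranges):

import os
from collections import defaultdict

by_size = defaultdict(list)
for name in os.listdir('.'):
    if os.path.isfile(name):
        by_size[os.path.getsize(name)].append(name)

# Only files within the same group ever need to be compared to each other
candidate_groups = [names for names in by_size.values() if len(names) > 1]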

On indexing, it's just about describing each file with one or more numbers. Size is one. The number of non-white-space, white-space, or newline characters could be others. If the files all contain the same kind of data, you can interpret the data to create more useful numbers.
Also, completely identical files will have the same SHA-256 hash. This will only help if a significant fraction of the files are identical.

import hashlib

def file_hash(path):
    # Hash the raw file bytes; byte-identical files always share a digest
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).digest()

this_files_hash_value = file_hash('some_file.bin')  # hypothetical file name
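
And a hypothetical version of the "describe each file with numbers" indexing idea above (which features to count is up to you):

def cheap_signature(path):
    # Inexpensive descriptors: size plus white-space and newline counts.
    # Files whose signatures differ cannot be identical, so only files with
    # matching (or near-matching) signatures need a full comparison.
    data = open(path, 'rb').read()
    return (len(data), data.count(b' '), data.count(b'\n'))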

Unfortunately, I can think of no way to fully and correctly factor out (parts of) what difflib.SequenceMatcher does, since it dynamically compares all possible chunks of the input files.

For Loop in a Data Frame to find near duplicate rows

You can use df.eq(l).sum(axis=1) to compute the number of (aligned) common elements with your list:

l = [0,1,1,1,0,0,1,0,1,0]
df.eq(l).sum(axis=1)

x    6
y    4
z    4
dtype: int64

To filter with a threshold, use:

diff = 4
df[df.eq(l).sum(axis=1).ge(len(l)-diff)]

output:

   a  b  c  d  e  f  g  h  i  j
x  0  0  1  1  1  0  0  1  1  0
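
To try this end to end, here is a hypothetical frame consistent with the outputs above (only row x appears in the answer; rows y and z are made up to give 4 matches each):

import pandas as pd

l = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]
df = pd.DataFrame(
    [[0, 0, 1, 1, 1, 0, 0, 1, 1, 0],   # x: 6 elements agree with l
     [0, 0, 1, 1, 1, 1, 0, 1, 1, 1],   # y: 4 agree (invented row)
     [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]],  # z: 4 agree (invented row)
    index=['x', 'y', 'z'], columns=list('abcdefghij'))

diff = 4
print(df[df.eq(l).sum(axis=1).ge(len(l) - diff)])  # keeps only row x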

Near Duplicate Detection in Data Streams


  1. First, normalize the text to all lowercase (or uppercase) characters; replace all non-letters with a white space; compress all multiple white spaces to one; and remove leading and trailing white space. For speed I would perform all these operations in one pass over the text. Next take the MD5 hash (or something faster) of the resulting string. Do a database lookup of the MD5 hash (as two 64-bit integers) in a table; if it exists, it is an exact duplicate, and if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage. (A minimal sketch of this step appears after this list.)

  2. To find near duplicates, the normalized string needs to be converted into potential signatures (hashes of substrings); see the SpotSigs paper and the blog post by Greg Linden. Suppose the routine Sigs() does that for a given string; that is, given the normalized string x, Sigs(x) returns a small (1-5) set of 64-bit integers. You could use something like the SpotSigs algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm.

  3. Given Sigs(), the problem of efficiently finding the near duplicates is commonly called the set similarity joins problem. The SpotSigs paper outlines some heuristics to trim the number of sets a new set needs to be compared to, as does the simhash method.
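
Here is a minimal sketch of step 1 plus a toy Sigs(). The in-memory set standing in for the database table, and the min-hash-style window selection, are simplifications of mine, not the SpotSigs heuristic:

import hashlib
import re

def normalize(text):
    # lowercase, non-letters -> space, collapse runs of spaces, strip ends
    return re.sub(r'[^a-z]+', ' ', text.lower()).strip()

seen = set()  # stand-in for the database table of hashes

def is_exact_duplicate(text):
    h = hashlib.md5(normalize(text).encode()).digest()
    if h in seen:
        return True
    seen.add(h)
    return False

def sigs(text, n=4, k=5):
    # Toy Sigs(): hash every n-word window to a 64-bit int, keep the k smallest
    words = normalize(text).split()
    windows = [' '.join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
    hashes = {int.from_bytes(hashlib.md5(w.encode()).digest()[:8], 'big') for w in windows}
    return set(sorted(hashes)[:k])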

Find near duplicate and faked images

Here's a quantitative method to determine duplicate and near-duplicate images using the sentence-transformers library which provides an easy way to compute dense vector representations for images. We can use the OpenAI Contrastive Language-Image Pre-Training (CLIP) Model which is a neural network already trained on a variety of (image, text) pairs. To find image duplicates and near-duplicates, we encode all images into vector space and then find high density regions which correspond to areas where the images are fairly similar.

When two images are compared, they are given a score between 0 and 1.00. We can use a threshold parameter to identify two images as similar or different. By setting the threshold lower, you will get larger clusters whose images are less similar to one another. A duplicate image will have a score of 1.00, meaning the two images are exactly the same. To find near-duplicate images, we can set the threshold to any arbitrary value, say 0.9. For instance, if the determined score between two images is greater than 0.9, then we can conclude they are near-duplicate images.


An example:

Sample Image

This dataset has 5 images, notice how there are duplicates of cat #1 while the others are different.

Finding duplicate images

Score: 100.000%
.\cat1 copy.jpg
.\cat1.jpg

Both cat1 and its copy are the same.

Finding near-duplicate images

Score: 91.116%
.\cat1 copy.jpg
.\cat2.jpg

Score: 91.116%
.\cat1.jpg
.\cat2.jpg

Score: 91.097%
.\bear1.jpg
.\bear2.jpg

Score: 59.086%
.\bear2.jpg
.\cat2.jpg

Score: 56.025%
.\bear1.jpg
.\cat2.jpg

Score: 53.659%
.\bear1.jpg
.\cat1 copy.jpg

Score: 53.659%
.\bear1.jpg
.\cat1.jpg

Score: 53.225%
.\bear2.jpg
.\cat1.jpg

We get more interesting score comparison results between different images. The higher the score, the more similar; the lower the score, the less similar. Using a threshold of 0.9 or 90%, we can filter out near-duplicate images.

Comparison between just two images



Score: 91.097%
.\bear1.jpg
.\bear2.jpg


Score: 91.116%
.\cat1.jpg
.\cat2.jpg


Score: 93.715%
.\tower1.jpg
.\tower2.jpg
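
For a one-off pairwise comparison like the examples above, a minimal sketch (the file names are taken from the example; util.cos_sim computes the cosine similarity directly):

from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')
emb = model.encode([Image.open('bear1.jpg'), Image.open('bear2.jpg')],
                   convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print("Score: {:.3f}%".format(score * 100))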

Code

from sentence_transformers import SentenceTransformer, util
from PIL import Image
import glob

# Load the OpenAI CLIP Model
print('Loading CLIP Model...')
model = SentenceTransformer('clip-ViT-B-32')

# Next we compute the embeddings
# To encode an image, you can use the following code:
# from PIL import Image
# encoded_image = model.encode(Image.open(filepath))
image_names = list(glob.glob('./*.jpg'))
print("Images:", len(image_names))
encoded_image = model.encode([Image.open(filepath) for filepath in image_names],
                             batch_size=128, convert_to_tensor=True,
                             show_progress_bar=True)

# Now we run the clustering algorithm. This function compares each image against
# all other images and returns a list with the pairs that have the highest
# cosine similarity score
processed_images = util.paraphrase_mining_embeddings(encoded_image)
NUM_SIMILAR_IMAGES = 10

# =================
# DUPLICATES
# =================
print('Finding duplicate images...')
# Filter the list for duplicates. Results are triplets (score, image_id1, image_id2),
# sorted in decreasing order. A duplicate image will have a score of 1.00
duplicates = [image for image in processed_images if image[0] >= 1]

# Output the top X duplicate images
for score, image_id1, image_id2 in duplicates[0:NUM_SIMILAR_IMAGES]:
    print("\nScore: {:.3f}%".format(score * 100))
    print(image_names[image_id1])
    print(image_names[image_id2])

# =================
# NEAR DUPLICATES
# =================
print('Finding near duplicate images...')
# Use a threshold parameter to identify two images as similar. By setting the threshold lower,
# you will get larger clusters whose images are less similar to one another. Threshold 0 - 1.00
# A threshold of 1.00 means the two images are exactly the same. Since we are finding near
# duplicate images, we can set it at 0.99 or any number 0 < X < 1.00.
threshold = 0.99
near_duplicates = [image for image in processed_images if image[0] < threshold]

for score, image_id1, image_id2 in near_duplicates[0:NUM_SIMILAR_IMAGES]:
    print("\nScore: {:.3f}%".format(score * 100))
    print(image_names[image_id1])
    print(image_names[image_id2])

