Hamming Distance/Similarity Searches in a Database

A common approach (at least common to me) is to divide your hash bit string into several chunks and query on these chunks for an exact match. This is a "pre-filter" step. You can then perform a bitwise Hamming distance computation on the returned results, which should be only a small subset of your overall dataset. This can be done using data files or SQL tables.

So in simple terms: say you have a bunch of 32-bit hashes in a DB and you want to find every hash that is within a 4-bit Hamming distance of your "query" hash:

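1. Create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice1 to islice4.
2. Slice your query hash the same way into qslice1 to qslice4.
3. Query this table for rows where any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4 holds. By the pigeonhole principle, the result contains every DB hash that is within 3 bits (4 - 1) of the query hash, because at most 3 differing bits cannot touch all four slices. It may also contain hashes that are farther away, and a hash at exactly 4 bits can be missed when each slice holds one differing bit; use five slices if you need the guarantee at distance 4.
4. For each returned hash, compute the exact pairwise Hamming distance with your query hash (reconstructing the index-side hash from the four slices).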

The number of operations in step 4 should be much smaller than a full pairwise Hamming computation over your whole table.
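The slicing and pre-filter steps can be sketched as follows. This is only a minimal illustration using Python's built-in sqlite3 module with an in-memory table; the table name `hashes` and the helper names (`slices`, `build_index`, `search`) are assumptions made for the example, and the default threshold is 3 because four slices only guarantee an exact slice match up to that distance.

```python
import sqlite3

def slices(h):
    """Split a 32-bit hash into four 8-bit slices, most significant first."""
    return [(h >> shift) & 0xFF for shift in (24, 16, 8, 0)]

def hamming(a, b):
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")

def build_index(hashes):
    """Create the four-column slice table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hashes (islice1 INT, islice2 INT, islice3 INT, islice4 INT)"
    )
    conn.executemany(
        "INSERT INTO hashes VALUES (?, ?, ?, ?)", [slices(h) for h in hashes]
    )
    # One index per slice column so each equality test can use an index.
    for i in range(1, 5):
        conn.execute(f"CREATE INDEX idx_islice{i} ON hashes (islice{i})")
    return conn

def search(conn, query, max_distance=3):
    # Pre-filter: keep rows that share at least one slice with the query.
    rows = conn.execute(
        "SELECT islice1, islice2, islice3, islice4 FROM hashes "
        "WHERE islice1 = ? OR islice2 = ? OR islice3 = ? OR islice4 = ?",
        slices(query),
    )
    results = []
    for s1, s2, s3, s4 in rows:
        # Rebuild the stored hash from its slices, then check it exactly.
        candidate = (s1 << 24) | (s2 << 16) | (s3 << 8) | s4
        if hamming(candidate, query) <= max_distance:
            results.append(candidate)
    return results

conn = build_index([0x12345678, 0x12345679, 0x1234567F, 0xFFFFFFFF])
matches = sorted(search(conn, 0x12345678))
```

Here `matches` holds the three stored hashes that are within 3 bits of the query; `0xFFFFFFFF` shares no slice with it and is never even fetched.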

This approach was first described, AFAIK, by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:

APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE

[...]

Given bit vectors consisting of d bits each, we choose
$N = O(n^{1/(1+\epsilon)})$ random permutations of the bits. For each
random permutation σ, we maintain a sorted order $O_\sigma$ of
the bit vectors, in lexicographic order of the bits permuted
by σ. Given a query bit vector q, we find the approximate
nearest neighbor by doing the following:

For each permutation σ, we perform a binary search on $O_\sigma$ to locate the
two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of
the sorted orders $O_\sigma$ examining elements above and below
the position returned by the binary search in order of the
length of the longest prefix that matches q.
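A rough Python sketch of the permuted sorted orders quoted above may help. It is a simplification, not the paper's exact procedure: it inspects a fixed window of neighbours around the binary-search position instead of ordering candidates by longest matching prefix, and the function names and the `window` and `seed` parameters are assumptions made for illustration.

```python
import bisect
import random

def permute_bits(x, perm):
    """Rearrange the bits of x: output bit i is input bit perm[i]."""
    out = 0
    for i, src in enumerate(perm):
        out |= ((x >> src) & 1) << i
    return out

def hamming(a, b):
    return bin(a ^ b).count("1")

def build_orders(vectors, d, num_perms, seed=0):
    """Build one sorted table per random permutation of the d bit positions."""
    rng = random.Random(seed)
    orders = []
    for _ in range(num_perms):
        perm = list(range(d))
        rng.shuffle(perm)
        # Sort by permuted value; keep the original vector alongside it.
        table = sorted((permute_bits(v, perm), v) for v in vectors)
        orders.append((perm, table))
    return orders

def approx_nearest(orders, q, window=2):
    """Return the closest vector seen near q in any of the sorted orders."""
    best = None
    for perm, table in orders:
        # Binary search for q's position in this permuted order.
        pos = bisect.bisect_left(table, (permute_bits(q, perm),))
        # Simplification: examine a fixed window around that position.
        for _, v in table[max(0, pos - window): pos + window]:
            if best is None or hamming(v, q) < hamming(best, q):
                best = v
    return best

vectors = [0b10110010, 0b10110011, 0b01001100, 0b11111111]
orders = build_orders(vectors, d=8, num_perms=4)
nearest = approx_nearest(orders, 0b01001100)
```

The result is approximate by design: a true nearest neighbour can be missed if it never lands inside the window in any of the permuted orders.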

Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":

3.3 The Results for Algorithm C

We partitioned the bit string of each page into 12 non-overlapping
4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one
piece in common. This approach is guaranteed to find all
pairs of pages with difference up to 11, i.e., C-similarity 373,
but might miss some for larger differences.

This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:

THE HAMMING DISTANCE PROBLEM

Definition: Given a collection of f-bit fingerprints and a
query fingerprint F, identify whether an existing fingerprint
differs from F in at most k bits. (In the batch-mode version
of the above problem, we have a set of query fingerprints
instead of a single query fingerprint)

[...]

Intuition: Consider a sorted table of $2^d$ f-bit truly random fingerprints. Focus on just the most significant d bits
in the table. A listing of these d-bit numbers amounts to
“almost a counter” in the sense that (a) quite a few $2^d$ bit-combinations
exist, and (b) very few d-bit combinations are
duplicated. On the other hand, the least significant $f - d$
bits are “almost random”.

Now choose $d'$ such that $|d' - d|$ is a small integer. Since
the table is sorted, a single probe suffices to identify all fingerprints which match F in $d'$ most significant bit-positions.
Since $|d' - d|$ is small, the number of such matches is also
expected to be small. For each matching fingerprint, we can
easily figure out if it differs from F in at most k bit-positions
or not (these differences would naturally be restricted to the
$f - d'$ least-significant bit-positions).

The procedure described above helps us locate an existing
fingerprint that differs from F in k bit-positions, all of which
are restricted to be among the least significant $f - d'$ bits of
F. This takes care of a fair number of cases. To cover all
the cases, it suffices to build a small number of additional
sorted tables, as formally outlined in the next Section.
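The single-table probe described in this excerpt can be sketched in Python as below. This is a minimal illustration with 16-bit toy fingerprints; the function name `probe` and the parameter name `d_prime` (the number of leading bits that must match) are assumptions, and only one sorted table is built.

```python
import bisect

def hamming(a, b):
    return bin(a ^ b).count("1")

def probe(sorted_table, query, f, d_prime, k):
    """Return fingerprints that share the top d_prime bits with query
    and differ from it in at most k bits overall."""
    low_bits = f - d_prime
    prefix = query >> low_bits
    # All entries with the same prefix form one contiguous run in the table.
    lo = bisect.bisect_left(sorted_table, prefix << low_bits)
    hi = bisect.bisect_left(sorted_table, (prefix + 1) << low_bits)
    return [fp for fp in sorted_table[lo:hi] if hamming(fp, query) <= k]

table = sorted([0xA1B2, 0xA1B3, 0xA1FF, 0xB000, 0x21B2])
found = probe(table, 0xA1B2, f=16, d_prime=8, k=3)
```

Note that `0x21B2` differs from the query in a single bit but is not found, because that bit sits in the prefix. This is the case the additional sorted tables (built over permuted bit positions) are meant to cover.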

PS: FWIW, most of these fine brains are or were associated with Google at some level or at some time.

Hamming distance on binary strings in SQL

It appears that storing the data in a BINARY column is bound to perform poorly. The only way to get decent performance is to split the content of the BINARY column into multiple BIGINT columns, each containing an 8-byte substring of the original data.

In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

CREATE FUNCTION HAMMINGDISTANCE(
  A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, 
  B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(A0 ^ B0) +
  BIT_COUNT(A1 ^ B1) +
  BIT_COUNT(A2 ^ B2) +
  BIT_COUNT(A3 ^ B3);

In my testing, this approach is over 100 times faster than the BINARY approach.
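To fill four BIGINT columns, the 32-byte value has to be split on the application side. A possible way to do that in Python is sketched below; it assumes the columns are plain signed BIGINT and that the chunks are taken in big-endian order, and the names `split_hash` and `hamming_distance` are hypothetical helpers, not part of any library.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF

def split_hash(digest):
    """Split a 32-byte value into four signed 64-bit integers (big-endian),
    one per BIGINT column."""
    if len(digest) != 32:
        raise ValueError("expected exactly 32 bytes")
    return struct.unpack(">4q", digest)

def hamming_distance(a, b):
    """Client-side equivalent of summing BIT_COUNT(Ai ^ Bi) over the chunks."""
    # Mask to 64 bits so that negative (signed) values are counted correctly.
    return sum(bin((x ^ y) & MASK64).count("1") for x, y in zip(a, b))

zero = split_hash(bytes(32))
one_chunk_set = split_hash(b"\xff" * 8 + bytes(24))
distance = hamming_distance(zero, one_chunk_set)
```

Computing the distance client-side like this is mainly useful for checking that the stored chunks and the SQL function agree.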

FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
  );

Quickly check large database for edit-distance similarity

I wrote a very brief prototype of a simple locality sensitive hashing algorithm in Python. There are a few caveats, and you may want to optimize some pieces as well. I'll mention them as they come up.

Assume all your strings are stored in a list named strings.

import random
from collections import Counter

MAX_LENGTH = 500       # upper bound on the string positions we sample from
SAMPLING_LENGTH = 10   # number of positions sampled to build each hash

def bit_sampling(string, indices):
    # Concatenate the characters at the sampled positions,
    # substituting a space where the string is too short.
    return ''.join(string[i] if i < len(string) else ' ' for i in indices)

indices = random.sample(range(MAX_LENGTH), SAMPLING_LENGTH)
hashes = [bit_sampling(string, indices) for string in strings]

counter = Counter(hashes)
# most_common() yields (hash, count) pairs in decreasing order of count,
# so we can stop at the first hash that occurs only once.
for most_common, count in counter.most_common():
    if count <= 1:
        break
    dup_indices = [i for i, x in enumerate(hashes) if x == most_common]
    # You can use dup_indices to check the edit distance for original groups here.

First of all, this is a slight variant of bit sampling, which works best for plain Hamming distance. Ideally, if all your strings are of the same length, it gives a theoretical probability bound in terms of Hamming distance. When the Hamming distance between two strings is small, it is very unlikely that they will have different hashes. The trade-off is controlled by the parameter SAMPLING_LENGTH: a larger SAMPLING_LENGTH makes it more likely that similar strings hash to different values, but also reduces the probability that dissimilar strings hash to the same value. For Hamming distance, you can calculate this trade-off easily.
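As a rough illustration of that trade-off, assume two strings of equal length n that differ in exactly d positions, and k positions sampled without replacement (as random.sample does). The two hashes collide only if none of the d differing positions is sampled, which gives C(n-d, k) / C(n, k). The function name below is hypothetical.

```python
from math import comb

def collision_probability(n, d, k):
    """Probability that two length-n strings at Hamming distance d receive
    the same hash when k distinct positions are sampled uniformly."""
    # comb() returns 0 when k > n - d, so the probability is 0 in that case.
    return comb(n - d, k) / comb(n, k)

# With MAX_LENGTH = 500: strings 5 positions apart, sampling 10 vs 50 positions.
p_short = collision_probability(500, 5, 10)
p_long = collision_probability(500, 5, 50)
```

A larger sample makes `p_long` smaller than `p_short`, which matches the description above: similar strings are more likely to be split into different buckets as SAMPLING_LENGTH grows.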

Running this snippet multiple times can increase your confidence that there are no similar strings, since each run samples different positions.

To compare strings of different lengths, one possible approach is to left-pad shorter strings with spaces and hash copies of them.

Though all of the operations in this snippet are linear (O(n)), it may still consume significant memory and running time, and it might be possible to reduce this by a constant factor.

You might also want to consider more sophisticated locality sensitive hashing algorithms, such as those surveyed here: https://arxiv.org/pdf/1408.2927.pdf

LSH for fast NN similarity search based on hamming distance?

Hamming distance is equivalent to L1 (Manhattan) distance restricted to boolean vectors.
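A tiny Python check of that equivalence, with made-up example vectors: on 0/1 vectors every coordinate contributes either 0 or 1 to the L1 distance, so the sum of absolute differences equals the count of differing positions.

```python
def hamming(u, v):
    """Number of positions where the two vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def l1(u, v):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

u = [1, 0, 1, 1, 0]
v = [0, 0, 1, 0, 1]
same = hamming(u, v) == l1(u, v)
```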

Data structure for finding nearby keys with similar bitvalues

What you want is a BK-Tree. It's a tree that's ideally suited to indexing metric spaces (your problem is one), and supports both nearest-neighbour and distance queries. I wrote an article about it a while ago.

BK-Trees are generally described with reference to text and using Levenshtein distance to build the tree, but it's straightforward to write one in terms of binary strings and Hamming distance.
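A minimal BK-Tree sketch over integer keys with Hamming distance might look like this in Python. The class and method names are illustrative assumptions rather than any particular library's API, and the node layout (a tuple of value and child dictionary) is just one convenient choice.

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self, distance=hamming):
        self.distance = distance
        self.root = None  # each node is (value, {edge_distance: child_node})

    def add(self, value):
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d == 0:
                return  # value is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (value, {})
                return
            node = child

    def query(self, value, max_distance):
        """Return sorted (distance, item) pairs within max_distance of value."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            item, children = stack.pop()
            d = self.distance(value, item)
            if d <= max_distance:
                results.append((d, item))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_distance, d + max_distance] can contain matches.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)

tree = BKTree()
for key in [0b0000, 0b0001, 0b0011, 0b1111, 0b1000]:
    tree.add(key)
nearby = tree.query(0b0000, 1)
```

The pruning step is what makes this work for any metric: subtrees whose edge distance falls outside the allowed band are skipped without computing any distances inside them.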
