Generate All Sequences of Bits Within Hamming Distance T


#include <stdio.h>
#include <string.h>

/* Print every string at Hamming distance exactly `changesLeft` from str,
   considering bit positions i, i-1, ..., 0. */
void magic(char* str, int i, int changesLeft) {
    if (changesLeft == 0) {
        printf("%s\n", str);
        return;
    }
    if (i < 0) return;
    // flip current bit
    str[i] = str[i] == '0' ? '1' : '0';
    magic(str, i - 1, changesLeft - 1);
    // or don't flip it (flip it again to undo)
    str[i] = str[i] == '0' ? '1' : '0';
    magic(str, i - 1, changesLeft);
}

int main(void) {
    char str[] = "011";
    printf("%s\n", str);
    size_t len = strlen(str);
    size_t maxDistance = len;
    for (size_t i = 1; i <= maxDistance; ++i) {
        printf("Computing for distance %zu\n", i);
        magic(str, len - 1, i);
        printf("----------------\n");
    }
    return 0;
}

Output:

MacBook-Pro:hammingDist gsamaras$ nano kastrinis.cpp
MacBook-Pro:hammingDist gsamaras$ g++ -Wall kastrinis.cpp
MacBook-Pro:hammingDist gsamaras$ ./a.out
011
Computing for distance 1
010
001
111
----------------
Computing for distance 2
000
110
101
----------------
Computing for distance 3
100
----------------

Generating numbers with high Hamming distance

Here's a fairly trivial way. Find the smallest number of bits b that can express k different numbers; e.g. for k = 4, use b = 2 bits. Divide the 64 bits up into chunks of size b. For each chunk, assign each of the k numbers to be generated a different value among the 2^b >= k available.

E.g. for k = 4 the possible b-bit chunk values are 00, 01, 10, 11, and here are 4 possibilities (shown with just c = 2 chunks, i.e. 4 bits, for brevity):

0000

0101

1010

1111

If you have c chunks, then each number differs from every other number in at least one bit of each of the c chunks, so the minimum guaranteed Hamming separation is c.

You could also permute the assignment of values within each chunk, which would produce more random-looking examples such as

0011

0101

1000

1110
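Here is a minimal C sketch of the unpermuted assignment, with k = 4 and b = 2 hard-coded for illustration: each number's chunk value is simply repeated across all 32 chunks of a 64-bit word.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const int k = 4, b = 2;  /* 2^b >= k */
    const int c = 64 / b;    /* number of chunks */
    for (uint64_t num = 0; num < (uint64_t)k; ++num) {
        uint64_t code = 0;
        for (int j = 0; j < c; ++j)
            code = (code << b) | num;  /* repeat this number's chunk value */
        printf("%016llx\n", (unsigned long long)code);
    }
    return 0;
}

Printed in hex, the four codewords come out as 00...00, 55...55, aa...aa, and ff...ff, the 64-bit analogue of the 0000/0101/1010/1111 example above.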

Data structure for Hamming cube

I'm rusty with C++, so I'll keep this abstract.

The neighbors of a given point of your Hamming cube are easily computable. Given a vertex's bit sequence, flip each bit individually.

You could precompute that, though: cache the results of your neighbors() function, or save them to an array. Each vertex has its own set of neighbors, so you get one array per vertex. That is, essentially, your adjacency list.

With that adjacency list, you can search your Hamming cube using depth-limited search, a variant of DFS (or BFS, though its space complexity is worse) that only goes k levels deep.


Your data structure is a good choice, but consider that your vertices are binary strings, so they cover all points from 0 to 2^n - 1. You might as well just use an array indexed by vertex: lookup is still O(1), and it is more compact because there are no unused buckets sitting around.
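As a concrete illustration, here is a small C sketch of that precomputation; the flat 2^n-by-n array layout is my own choice for the example, not prescribed above.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 3;
    const int count = 1 << n;  /* vertices are the integers 0 .. 2^n - 1 */
    /* One row of n neighbors per vertex: a flat adjacency array. */
    int *adj = malloc((size_t)count * n * sizeof *adj);
    for (int v = 0; v < count; ++v)
        for (int i = 0; i < n; ++i)
            adj[v * n + i] = v ^ (1 << i);  /* flip bit i of v */
    /* Neighbors of vertex 3 (binary 011): */
    for (int i = 0; i < n; ++i)
        printf("%d ", adj[3 * n + i]);
    printf("\n");  /* prints: 2 1 7 */
    free(adj);
    return 0;
}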

Enumerate all binary numbers of length n and weight at most k lexicographically

You can do it recursively.

Here's an example in C# (sorry, I am not set up for doing C/C++ right now):

using System;

class MainClass
{
    // Append bits one at a time; onesLeft caps the number of 1s still allowed.
    static void Generate(int value, int onesLeft, int bitsLeft, int bits)
    {
        if (bitsLeft > 0)
        {
            Generate(value << 1, onesLeft, bitsLeft - 1, bits);  // append a 0
            if (onesLeft > 0)
                Generate((value << 1) + 1, onesLeft - 1, bitsLeft - 1, bits);  // append a 1
        }
        else
        {
            for (int i = 0; i < bits; i++)
            {
                Console.Write((value >> (bits - i - 1)) & 1);
            }
            Console.WriteLine();
        }
    }

    static void Generate(int bits, int maxOnes)
    {
        Generate(0, maxOnes, bits, bits);
    }

    public static void Main()
    {
        Generate(4, 2);
    }
}

Output for Generate(4, 2):

0000
0001
0010
0011
0100
0101
0110
1000
1001
1010
1100

Output for Generate(5, 1):

00000
00001
00010
00100
01000
10000

Trying to compare a recursive and an iterative algorithm

The recursive algorithm is O((n choose t) * n) too, by an analysis that charges to each printed combination the cost of the entire call stack at the time that it is printed. We can do this because every invocation of magic (except the two O(1) leaf calls where i < 0, which we could easily do away with) prints something.

This bound is best possible if you assign printing its true cost. Otherwise, I'm pretty sure that both analyses can be tightened to O(n choose t) excluding printing for t > 0, with details in Knuth 4A.

Efficiently find binary strings with low Hamming distance in large set

Question: What do we know about the Hamming distance d(x,y)?

Answer:

  1. It is non-negative: d(x,y) ≥ 0
  2. It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
  3. It is symmetric: d(x,y) = d(y,x)
  4. It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)

Question: Why do we care?

Answer: Because it means that the Hamming distance is a metric, so your set of strings under it forms a metric space, and there are algorithms for indexing metric spaces:

  • Metric tree (Wikipedia)
  • BK-tree (Wikipedia)
  • M-tree (Wikipedia)
  • VP-tree (Wikipedia)
  • Cover tree (Wikipedia)

You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.
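To make the indexing idea concrete, here is a hedged C sketch of a BK-tree over 32-bit words with the Hamming metric. The names (bk_node, bk_insert, bk_query) and the fixed child-array layout are illustrative choices for this sketch, not from any particular library; distance() uses GCC's __builtin_popcount, discussed in the footnote below.

#include <stdio.h>
#include <stdlib.h>

#define MAXDIST 33  /* Hamming distances of 32-bit words lie in 0..32 */

/* Node: the child[d] subtree holds all keys at Hamming distance d from key. */
typedef struct bk_node {
    unsigned key;
    struct bk_node *child[MAXDIST];
} bk_node;

static int distance(unsigned x, unsigned y)
{
    return __builtin_popcount(x ^ y);
}

static bk_node *bk_new(unsigned key)
{
    bk_node *n = calloc(1, sizeof *n);
    n->key = key;
    return n;
}

static void bk_insert(bk_node *root, unsigned key)
{
    int d = distance(root->key, key);
    if (d == 0) return;  /* key already present */
    if (root->child[d])
        bk_insert(root->child[d], key);
    else
        root->child[d] = bk_new(key);
}

/* Report every key within distance t of q. By the triangle inequality,
   a subtree at edge distance i can only contain matches if |i - d| <= t. */
static void bk_query(const bk_node *root, unsigned q, int t)
{
    int d = distance(root->key, q);
    if (d <= t)
        printf("hit: %u (distance %d)\n", root->key, d);
    int lo = d - t < 1 ? 1 : d - t;
    for (int i = lo; i <= d + t && i < MAXDIST; ++i)
        if (root->child[i])
            bk_query(root->child[i], q, t);
}

int main(void)
{
    bk_node *root = bk_new(5);
    unsigned keys[] = {7, 8, 21, 64, 1023};
    for (size_t i = 0; i < sizeof keys / sizeof *keys; ++i)
        bk_insert(root, keys[i]);
    bk_query(root, 5, 2);  /* keys within Hamming distance 2 of 5 */
    return 0;
}

The pruning in bk_query is the whole point of the structure: each node eliminates every child shell whose edge distance is incompatible with the query radius.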

Footnote: If you are comparing the Hamming distance of fixed-width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (see its manual) you can do this:

static inline int distance(unsigned x, unsigned y)
{
    return __builtin_popcount(x ^ y);
}

If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.

Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.
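For reference, the "usual mask/shift/add code" being compared against is typically a SWAR popcount like this well-known 32-bit version (it appears on the Stanford bit-twiddling page referenced later in this document); substituting it for __builtin_popcount in distance() above makes the comparison easy to reproduce:

/* SWAR popcount: count bits in pairs, then nibbles, then sum the bytes
   with a multiply. */
static inline int popcount32(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0f0f0f0fu;
    return (int)((x * 0x01010101u) >> 24);
}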

Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.

The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.

  • Database: 100M pseudorandom integers
  • Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
  • Results: Average # of query hits (very approximate)
  • Speed: Number of queries per second
  • Coverage: Average percentage of database examined per query

          -- BK Tree --     -- VP Tree --     -- Linear --
Dist  Results   Speed   Cov      Speed   Cov      Speed   Cov
   1     0.90    3800  0.048%     4200  0.048%
   2       11     300  0.68%       330  0.65%
   3      130      56  3.8%         63  3.4%
   4      970      18  12%          22  10%
   5     5700     8.5  26%          10  22%
   6    2.6e4     5.2  42%         6.0  37%
   7    1.1e5     3.7  60%         4.1  54%
   8    3.5e5     3.0  74%         3.2  70%
   9    1.0e6     2.6  85%         2.7  82%
  10    2.5e6     2.3  91%         2.4  90%
 any                                               2.2  100%

In your comment, you mentioned:

I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.

I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.

A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.

Computing Hamming distances on a large data set

This may be the fastest you can do (watch out: for your data size it would require allocating 74.5 GiB of memory):

import csv
import numpy as np

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])
nodes = np.asarray(nodes)

# For 0/1 vectors x, y of length m:
# 2*(x - 0.5)·(0.5 - y) + m/2 = sum(x) + sum(y) - 2*x·y, the Hamming distance.
dists = 2 * np.inner(nodes - 0.5, 0.5 - nodes) + nodes.shape[1] / 2

Just for fun, here is a 40X faster version in Julia:

using LoopVectorization, Tullio

function hamming!(nodes, dists)
    # Tullio reduces over k implicitly, since k appears only on the right.
    @tullio dists[i,j] = sum(nodes[i,k] ⊻ nodes[j,k])
end

n = 10^5
nodes = rand(Int8[0,1], n, 24)
dists = Matrix{Int8}(undef, n, n)
@time hamming!(nodes, dists)  # Run twice; the first run includes compilation
# 1.886367 seconds (114 allocations: 6.594 KiB)

While we're at it, I invite you to enter the world of Julia. It offers speeds similar to C++ with a pleasant syntax similar to Python's.

Generate all binary strings of length n with k bits set

This method will generate all integers with exactly N '1' bits.

From https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation

Compute the lexicographically next bit permutation

Suppose we have a pattern of N bits set to 1 in an integer and we want the next permutation of N 1 bits in a lexicographical sense. For example, if N is 3 and the bit pattern is 00010011, the next patterns would be 00010101, 00010110, 00011001, 00011010, 00011100, 00100011, and so forth. The following is a fast way to compute the next permutation.

unsigned int v; // current permutation of bits
unsigned int w; // next permutation of bits

unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));

The __builtin_ctz(v) GNU C compiler intrinsic for x86 CPUs returns the number of trailing zeros. If you are using Microsoft compilers for x86, the intrinsic is _BitScanForward. These both emit a bsf instruction, but equivalents may be available for other architectures. If not, then consider using one of the methods for counting the consecutive zero bits mentioned earlier. Here is another version that tends to be slower because of its division operator, but it does not require counting the trailing zeros.

unsigned int t = (v | (v - 1)) + 1;
w = t | ((((t & -t) / (v & -v)) >> 1) - 1);

Thanks to Dario Sneidermanis of Argentina, who provided this on November 28, 2009.
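As a sanity check, here is a minimal driver for the snippet above: the starting value (1u << k) - 1 is the smallest integer with k bits set, and the loop stops once the next permutation no longer fits in n bits.

#include <stdio.h>

int main(void)
{
    const int n = 4, k = 2;
    unsigned v = (1u << k) - 1;  /* smallest integer with k bits set */
    while (v < (1u << n)) {
        for (int i = n - 1; i >= 0; --i)
            putchar('0' + ((v >> i) & 1));
        putchar('\n');
        unsigned t = v | (v - 1);  /* set v's trailing 0s to 1 */
        v = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
    }
    return 0;
}

For n = 4, k = 2 this prints 0011, 0101, 0110, 1001, 1010, 1100: exactly the weight-2 rows of the Generate(4, 2) output earlier in this document.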


