How to Find the Closest Pairs (Hamming Distance) of a String of Binary Bins in Ruby Without O(n^2) Issues

I ended up loading all the documents into memory (only a subset of each: the id and the string).

Then, I used a BK Tree to compare the strings.
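
The answer does not show its code, so the following is only a minimal sketch of what a BK-tree over Hamming distance could look like in Ruby. The class name BKTree, its methods, and the [id, bits] layout are hypothetical and chosen for illustration; the strings are assumed to be equal-length sequences of "0" and "1".

```ruby
# Minimal BK-tree sketch keyed on Hamming distance (illustrative names only).
class BKTree
  # One stored entry plus its children, indexed by distance to this node.
  Node = Struct.new(:id, :bits, :children)

  # Hamming distance between two equal-length bit strings such as "010110".
  def self.hamming(a, b)
    raise ArgumentError, "length mismatch" unless a.length == b.length
    a.each_char.with_index.count { |char, i| char != b[i] }
  end

  def initialize
    @root = nil
  end

  # Insert an entry by walking down the edge labelled with its distance
  # to each node until a free slot is found.
  def add(id, bits)
    node = Node.new(id, bits, {})
    return @root = node if @root.nil?

    current = @root
    loop do
      d = BKTree.hamming(bits, current.bits)
      child = current.children[d]
      if child.nil?
        current.children[d] = node
        return node
      end
      current = child
    end
  end

  # Returns sorted [[distance, id], ...] for entries within max_distance.
  def search(bits, max_distance)
    results = []
    stack = @root ? [@root] : []
    until stack.empty?
      node = stack.pop
      d = BKTree.hamming(bits, node.bits)
      results << [d, node.id] if d <= max_distance
      # Triangle inequality: only subtrees whose edge label lies within
      # max_distance of d can contain a match, so the rest are skipped.
      node.children.each do |edge, child|
        stack << child if (edge - d).abs <= max_distance
      end
    end
    results.sort
  end
end

tree = BKTree.new
{ 1 => "0000", 2 => "0011", 3 => "1111", 4 => "0001" }.each do |id, bits|
  tree.add(id, bits)
end
near = tree.search("0000", 1)
```

To look for close pairs, one option is to query the tree once per stored string with a small max_distance. How much work the pruning saves depends on the data and on the threshold, so it is worth measuring on the real set.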

Retrieve closest element from a set of elements

You can store a hash table (dictionary/map) that maps from an element (in the tuple) to the tuples it appears in: hash:element->list<tuple>.

Now, when you have a new "query", iterate over hash(element) for each element of the query and count how many times each stored tuple shows up. The tuple with the maximal number of hits is the closest one.

pseudo code:

findMax(tuple):
    histogram <- empty map
    for each element in tuple:
        # assuming hash_table is the described DS from above
        for each x in hash_table[element]:
            histogram[x]++  # assuming lazy initialization to 0
    return key with highest value in histogram
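
A possible Ruby rendering of the pseudo code, as a sketch only. The method names build_index and find_max are hypothetical, tuples are assumed to be plain arrays, and repeated elements inside one tuple are counted once (an assumption the pseudo code does not state).

```ruby
# Build the inverted index: element -> list of tuples containing it.
def build_index(tuples)
  index = Hash.new { |hash, key| hash[key] = [] }
  tuples.each do |tuple|
    # uniq: a tuple is listed once per distinct element (assumption).
    tuple.uniq.each { |element| index[element] << tuple }
  end
  index
end

# Return the stored tuple sharing the most elements with the query,
# or nil when no stored tuple shares any element with it.
def find_max(index, query)
  histogram = Hash.new(0) # lazy initialization to 0
  query.uniq.each do |element|
    # fetch with a default avoids creating empty entries in the index.
    index.fetch(element, []).each { |tuple| histogram[tuple] += 1 }
  end
  best = histogram.max_by { |_tuple, hits| hits }
  best && best.first
end

index = build_index([[1, 2, 3], [2, 3, 4], [7, 8, 9]])
closest = find_max(index, [3, 4, 5])
```

The cost of a query is the total length of the lists it touches, so it stays cheap as long as no single element appears in most of the tuples.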

An alternative that does not exactly follow the metric you want is a k-d tree. The difference is that a k-d tree also takes into account the "distance" between the elements (and not only equality/inequality).

k-d trees also require the elements to be comparable.

How to calculate Hamming Distance in CosmosDB?

To solve this, I took code from long.js and ImageHash and adapted it for use in a CosmosDB UDF. All kudos to their authors.

See the gist here: https://gist.github.com/okolobaxa/55cc08a0d67bc60505bfe3ab4f8bc33c

Usage:

SELECT udf.HAMMING_DISTANCE(files.ContentId, '1279796919517872320') FROM files

But please note a few things:

  1. CosmosDB doesn't support 64-bit numbers as numbers, so you have to
    store them as strings.
  2. Using this UDF costs a lot of RUs.
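
For reference, here is a small Ruby sketch of the value such a UDF is expected to return for two 64-bit hashes stored as decimal strings. This is not the code from the gist: the method name hamming_distance and the MASK_64 constant are hypothetical. Ruby integers are arbitrary precision, so no helper like long.js is needed here.

```ruby
# Keep only the low 64 bits, so negative (signed) inputs are handled
# as their two's complement bit pattern.
MASK_64 = 0xFFFF_FFFF_FFFF_FFFF

# Hamming distance between two 64-bit hashes given as decimal strings.
def hamming_distance(a, b)
  # XOR leaves a 1 in every bit position where the two hashes differ.
  diff = (Integer(a, 10) ^ Integer(b, 10)) & MASK_64
  # Count the set bits in the binary representation.
  diff.to_s(2).count("1")
end

same = hamming_distance("1279796919517872320", "1279796919517872320")
```

Running the same computation client-side on a candidate subset is one way to avoid paying the UDF cost on every document, at the price of transferring those candidates first.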

I created feature requests on the CosmosDB Feedback forum to add built-in support for such functions. Please vote for these ideas if you're interested in them too:

  • Built-in functions for bitwise operations

  • Built-in functions for calculating distance metrics


