Most Efficient Way to Calculate Hamming Distance in Ruby

Most efficient way to calculate hamming distance in ruby?

You can make use of the optimized String functions in Ruby to do the bit counting, instead of pure arithmetic. It turns out to be about 6 times faster with some quick benchmarking.

def h2(a, b)
  (a^b).to_s(2).count("1")
end

h1 is the normal way to calculate, while h2 converts the xor into a string, and counts the number of "1"s

Benchmark:

ruby-1.9.2-p180:001:0>> def h1(a, b)
ruby-1.9.2-p180:002:1*> ret = 0
ruby-1.9.2-p180:003:1*> xor = a ^ b
ruby-1.9.2-p180:004:1*> until xor == 0
ruby-1.9.2-p180:005:2*> ret += 1
ruby-1.9.2-p180:006:2*> xor &= xor - 1
ruby-1.9.2-p180:007:2*> end
ruby-1.9.2-p180:008:1*> ret
ruby-1.9.2-p180:009:1*> end
# => nil
ruby-1.9.2-p180:010:0>> def h2(a, b)
ruby-1.9.2-p180:011:1*> (a^b).to_s(2).count("1")
ruby-1.9.2-p180:012:1*> end
# => nil
ruby-1.9.2-p180:013:0>> h1(2323409845, 1782647144)
# => 17
ruby-1.9.2-p180:014:0>> h2(2323409845, 1782647144)
# => 17
ruby-1.9.2-p180:015:0>> quickbench(10**5) { h1(2323409845, 1782647144) }
Rehearsal ------------------------------------
   2.060000   0.000000   2.060000 (  1.944690)
--------------------------- total: 2.060000sec

       user     system      total        real
   1.990000   0.000000   1.990000 (  1.958056)
# => nil
ruby-1.9.2-p180:016:0>> quickbench(10**5) { h2(2323409845, 1782647144) }
Rehearsal ------------------------------------
   0.340000   0.000000   0.340000 (  0.333673)
--------------------------- total: 0.340000sec

       user     system      total        real
   0.320000   0.000000   0.320000 (  0.326854)
# => nil
ruby-1.9.2-p180:017:0>>

(Speed Challenge) Any faster way to compute distance matrix in terms of generic Hamming distance?

methodM <- function(x) {
  xt <- t(x)
  sapply(1:nrow(x), function(y) colSums(xt != xt[, y]))
}
microbenchmark::microbenchmark(
  methodB(m), methodM(m),
  unit = "relative", check = "equivalent", times = 50
)
# Unit: relative
#       expr  min       lq     mean   median       uq      max neval cld
# methodB(m) 1.00 1.000000 1.000000 1.000000 1.000000 1.000000    50  a 
# methodM(m) 1.25 1.224827 1.359573 1.219507 1.292463 4.550159    50   b

Measure the distance between two strings with Ruby?

I found this for you:

def levenshtein_distance(s, t)
  m = s.length
  n = t.length
  return m if n == 0
  return n if m == 0
  d = Array.new(m+1) {Array.new(n+1)}

  (0..m).each {|i| d[i][0] = i}
  (0..n).each {|j| d[0][j] = j}
  (1..n).each do |j|
    (1..m).each do |i|
      d[i][j] = if s[i-1] == t[j-1]  # adjust index into string
                  d[i-1][j-1]       # no operation required
                else
                  [ d[i-1][j]+1,    # deletion
                    d[i][j-1]+1,    # insertion
                    d[i-1][j-1]+1,  # substitution
                  ].min
                end
    end
  end
  d[m][n]
end

[ ['fire','water'], ['amazing','horse'], ["bamerindos", "giromba"] ].each do |s,t|
  puts "levenshtein_distance('#{s}', '#{t}') = #{levenshtein_distance(s, t)}"
end

That's awesome output: =)

levenshtein_distance('fire', 'water') = 4
levenshtein_distance('amazing', 'horse') = 7
levenshtein_distance('bamerindos', 'giromba') = 9

Source: http://rosettacode.org/wiki/Levenshtein_distance#Ruby

Sum of Hamming Distances

You can consider the bit-positions separately. That gives you 32 (or some other number) of easier problems, where you still have to calculate the sum of all pairs of hamming distances, except now it's over 1-bit numbers.

The hamming distance between two 1-bit numbers is their XOR.

And now it has become the easiest case of this problem - it's already split per bit.

So to reiterate the answer to that question, you take a bit position, count the number of 0's and the number of 1's, multiply those to get the contribution of this bit position. Sum those for all bit positions. It's even simpler than the linked problem, because the weight of the contribution of every bit is 1 in this problem.

Modification to Hamming distance/ Edit Distance

Here is a simple solution. For sure it could be optimized for better performance.

Note: I used Python 3, since Python 2 will retire soon.

def hamming_distance(s1, s2):
    assert len(s1) == len(s2)
    # b and d are interchangeable
    s1 = s1.replace('b', 'd').replace('B', 'D')
    s2 = s2.replace('b', 'd').replace('B', 'D')
    # add 1 for each different character
    hammingdist = sum(ch1 != ch2 for ch1, ch2 in zip(s1.lower(), s2.lower()))
    # add .5 for each lower/upper case difference (without first letter)
    for i in range(1, len(s1)):
        hammingdist += 0.5 * (s1[i] >= 'a' and s1[i] <= 'z' and\
                              s2[i] >= 'A' and s2[i] <= 'Z' or\
                              s1[i] >= 'A' and s1[i] <= 'Z' and\
                              s2[i] >= 'a' and s2[i] <= 'z')
    return hammingdist

def print_hamming_distance(s1, s2):
    print("hamming distance between", s1, "and", s2, "is",
          hamming_distance(s1, s2))

if __name__ == "__main__":
    assert hamming_distance('mark', 'Make') == 2
    assert hamming_distance('Killer', 'killer') == 0
    assert hamming_distance('killer', 'KiLler') == 0.5
    assert hamming_distance('bole', 'dole') == 0
    print("all fine")
    print_hamming_distance("organized", "orGanised")
    # prints: hamming distance between organized and orGanised is 1.5

How to calculate Hemming Distance in CosmosDB?

To solve this I took code from long.js and ImageHash for using in CosmosDB UDF. All cudos to their authors.

See gist it here https://gist.github.com/okolobaxa/55cc08a0d67bc60505bfe3ab4f8bc33c

Usage:

SELECT udf.HAMMING_DISTANCE(files.ContentId, '1279796919517872320') FROM files

But please note a few things:

CosmosDB doesn't support 64-bit numbers as numbers, you have to
store them as strings.
Using this UDF costs a lot of RUs

I created a feature request on the CosmosDB Feedback forum to add built-in support of such functions. Please vote for these ideas if you're interested in it too:

Built-in functions for bitwise operations
Built-in functions for calculating distance metrics

How to find the closest pairs (Hamming Distance) of a string of binary bins in Ruby without O^2 issues?

I ended up doing a retrieval of all the documents into memory.. (subset with the id and the string).

Then, I used a BK Tree to compare the strings.

Most Efficient Way to Calculate Hamming Distance in Ruby