Cosine Similarity vs Hamming Distance

Cosine similarity vs Hamming distance

Hamming distance is defined between two strings of equal length, and it takes the position of each symbol into account.

Since your documents are almost certainly of different lengths, and if word positions do not matter, cosine similarity is better (note that, depending on your needs, better solutions exist). :)
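For instance, a minimal sketch of the equal-length constraint in Python (the function name is just for illustration):

def hamming(s1, s2):
    # number of positions at which the two strings differ
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("karolin", "kathrin"))   # 3
# hamming("cat", "cats") would raise ValueError, because the lengths differ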

Here is a cosine similarity function for two arrays of words:

function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    // Index the tokens of each document so presence can be tested with isset()
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y; // tokens present in both documents
        $b += $x;      // unique tokens in A
        $c += $y;      // unique tokens in B
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

It is fast (using isset() instead of in_array() makes a huge difference on large arrays).

As you can see, the result does not take into account the "magnitude" of each word.

I use it to detect multi-posted messages made of "almost" copy-pasted text. It works well. :)

The best link about string similarity metrics:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

For further interesting reading:

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298

Are there other useful similarity or distance metrics?

A quick search yielded "Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions", which appears excellent.

Similarity distance measures

It sounds like you need the cosine similarity measure:

similarity = cos(v1, v2) = v1 * v2 / (|v1| |v2|)

where v1 * v2 is the dot product of v1 and v2:

v1 * v2 = v1[1]*v2[1] + v1[2]*v2[2] + ... + v1[n]*v2[n]

Essentially, the dot product shows how many elements in both vectors have a 1 at the same position: if v1[k] == 1 and v2[k] == 1, then the final sum (and thus the similarity) is increased; otherwise it is unchanged.

You can use the dot product itself, but sometimes you want the final similarity to be normalized, e.g. to lie between 0 and 1. In this case you can divide the dot product of v1 and v2 by their lengths, |v1| and |v2|. Essentially, a vector's length is the square root of the dot product of the vector with itself:

|v| = sqrt(v[1]*v[1] + v[2]*v[2] + ... + v[n]*v[n])

Having all of this, it's easy to implement cosine similarity as follows (example in Python):

from math import sqrt

def dot(v1, v2):
    # sum of element-wise products
    return sum(x * y for x, y in zip(v1, v2))

def length(v):
    # Euclidean (L2) norm of the vector
    return sqrt(dot(v, v))

def sim(v1, v2):
    # cosine similarity: dot product divided by the product of the lengths
    return dot(v1, v2) / (length(v1) * length(v2))

Note that I described similarity (how close two vectors are to each other), not distance (how far apart they are). If you need a distance, you can calculate it as dist = 1 - sim.
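For instance, a quick check with two binary term vectors (values made up for illustration):

v1 = [1, 0, 1, 1]
v2 = [1, 1, 0, 1]
print(sim(v1, v2))       # ~0.667: two shared 1s, three 1s in each vector
print(1 - sim(v1, v2))   # ~0.333: the corresponding cosine distance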

Can the cosine similarity when using Locality Sensitive Hashing be -1?

Cosine similarity is the dot product of the vectors divided by the product of their magnitudes, so it's entirely possible to have a negative value for the angle's cosine. For example, if you have unit vectors pointing in opposite directions, the value should be -1. I think what's confusing you is the nature of the representation: the other post talks about angles between vectors in 2-D space, whereas it's more common to create vectors in a multidimensional space where the number of dimensions is usually much greater than 2 and the value for each dimension is non-negative (e.g., a word occurs in a document or not), which results in a 0 to 1 range.
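A small illustration in Python, reusing the sim() helper from the previous answer (vectors chosen for illustration):

print(sim([1, 0], [-1, 0]))       # -1.0: unit vectors pointing in opposite directions
print(sim([1, 0], [0, 1]))        #  0.0: orthogonal vectors
print(sim([1, 1, 0], [1, 0, 1]))  #  0.5: non-negative vectors stay in the [0, 1] range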

What's the relationship between Hamming distance and Simple Matching Coefficient?

Hamming / length = 1 - SMC

is a very strong relationship. Because of this they are equivalent in their capabilities.
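A quick numeric check of that identity in Python, on a pair of boolean vectors (values arbitrary):

x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
n = len(x)
hamming = sum(a != b for a, b in zip(x, y))   # 2 positions differ
smc = sum(a == b for a, b in zip(x, y)) / n   # 3 matches out of 5 = 0.6
print(hamming / n)   # 0.4
print(1 - smc)       # 0.4  -> Hamming / length = 1 - SMC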

Your argument about "looking at the whole data set" is wrong; each measure is defined on a pair of objects.

The point of this exercise is to practise your basic math skills, and transform equations into one another. This is a skill you will need frequently:

  1. you don't need to explore equivalent functions, one is enough
  2. of equivalent functions, one may be more efficient to compute than another
  3. of equivalent functions, one may be more precise than another due to floating point math.

LSH for fast NN similarity search based on hamming distance?

Hamming distance is equivalent to L1 (Manhattan) distance restricted to boolean vectors.
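A small sketch of that equivalence in Python (boolean vectors chosen arbitrarily):

x = [1, 0, 1, 1, 0]
y = [0, 0, 1, 0, 1]
hamming = sum(a != b for a, b in zip(x, y))        # count of differing positions
manhattan = sum(abs(a - b) for a, b in zip(x, y))  # L1 (Manhattan) distance
print(hamming, manhattan)   # 3 3 -- identical on boolean vectors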

Hamming distance measure for pvclust

Try this:

hamming <- function(x) {
    x <- as.matrix(x)
    # (1 - x) %*% t(x) counts, for each pair of rows (i, j),
    # the positions where row i is 0 and row j is 1
    y <- (1 - x) %*% t(x)
    # adding the transpose adds the positions where row i is 1 and row j is 0,
    # i.e. all positions where the two rows differ
    res <- y + t(y)
    res <- as.dist(res)
    attr(res, "method") <- "hamming"
    return(res)
}
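For comparison, a sketch of the same matrix trick in Python with NumPy (assuming a binary matrix with one object per row; the function name is illustrative):

import numpy as np

def hamming_matrix(x):
    x = np.asarray(x)
    # (1 - x) @ x.T counts positions where row i is 0 and row j is 1;
    # adding the transpose gives all positions where the two rows differ
    y = (1 - x) @ x.T
    return y + y.T

print(hamming_matrix([[1, 0, 1],
                      [0, 0, 1],
                      [1, 1, 0]]))
# [[0 1 2]
#  [1 0 3]
#  [2 3 0]]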

Cosine similarity when one of vectors is all zeros

If you have zero vectors, cosine is the wrong similarity function for your application.

Cosine distance is essentially equivalent to squared Euclidean distance on L_2-normalized data. That is, you normalize every vector to unit length, then compute the squared Euclidean distance.
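A quick numeric sketch of that equivalence in Python (for unit-length vectors, squared Euclidean distance equals 2 * (1 - cosine similarity); the vectors are arbitrary):

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])
a = a / np.linalg.norm(a)   # L2-normalize to unit length
b = b / np.linalg.norm(b)

cos_sim = a @ b
sq_euclid = np.sum((a - b) ** 2)
print(sq_euclid, 2 * (1 - cos_sim))   # both ~0.0323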

The other benefit of cosine is performance: computing it on very sparse, high-dimensional data is faster than Euclidean distance. It benefits from sparsity quadratically, not just linearly.

While you obviously can try to hack the similarity to be 0 when exactly one of the vectors is zero, and maximal when they are identical, it won't really solve the underlying problems.

Don't choose the distance by what you can easily compute.

Instead, choose the distance such that the result has a meaning on your data. If the value is undefined, you don't have a meaning...

Sometimes, it may work to discard constant-0 data as meaningless data anyway (e.g. analyzing Twitter noise, and seeing a Tweet that is all numbers, no words). Sometimes it doesn't.

How to obtain the complexity of cosine similarity in Matlab?

Better end the outer loop at 49. You should also add a second index to sim:

for j = 1:49
    x = dat(j,:);
    for i = j+1:50
        y = dat(i,:);
        c = dot(x,y);
        sim(j,i) = c/(norm(x,2)*norm(y,2));
    end
end

The complexity should be roughly O(n^2), shouldn't it?
Maybe you should have a look at correlation functions... I don't get exactly what you want to compute, but it looks like you want to do something similar. There are built-in correlation functions in Matlab.
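For comparison, the pairwise computation can also be vectorized; a sketch in Python with NumPy, assuming dat holds one observation per row as in the loops above (the data here is random, just for illustration):

import numpy as np

dat = np.random.rand(50, 10)                               # 50 observations, toy data
normed = dat / np.linalg.norm(dat, axis=1, keepdims=True)  # unit-length rows
sim = normed @ normed.T                                    # sim[j, i] = cosine of rows j and i

It is still quadratic in the number of observations, but the work is done in a single matrix product.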


