Cosine similarity vs Hamming distance
Hamming distance is defined only between two strings of equal length, and it takes the order of the characters into account.
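For concreteness, here is a minimal Python sketch of Hamming distance (the example strings are mine):

```python
def hamming(s1, s2):
    # Hamming distance: count of positions where two equal-length strings differ.
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(s1, s2))

print(hamming("karolin", "kathrin"))  # 3 (positions 2, 3, 4 differ)
```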
Since your documents are almost certainly of different lengths, and since word positions don't matter for you, cosine similarity is the better fit (note that, depending on your needs, better solutions exist). :)
Here is a cosine similarity function for two arrays of words:
function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += $x;
        $c += $y;
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}
It is fast (using isset() instead of in_array() is a killer on large arrays).
As you can see, the result does not take into account the "magnitude" (frequency) of each word.
I use it to detect multi-posted messages and "almost" copy-pasted texts. It works well. :)
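For comparison, here is a rough Python port of the same idea — binary presence/absence of unique tokens, dividing the overlap by the geometric mean of the set sizes (the function name is mine):

```python
from math import sqrt

def cosine_similarity_tokens(tokens_a, tokens_b):
    # Treat each document as a binary set of its unique tokens.
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0
    shared = len(set_a & set_b)                    # $a: co-occurring tokens
    return shared / sqrt(len(set_a) * len(set_b))  # $a / sqrt($b * $c)

print(cosine_similarity_tokens(["the", "cat", "sat"], ["the", "cat", "ran"]))
# ~0.667: 2 shared tokens / sqrt(3 * 3)
```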
The best link about string similarity metrics:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
For further interesting readings:
http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298
Are there other useful similarity or distance metrics?
A quick search yielded, "Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions", which appears excellent.
Similarity distance measures
It sounds like you need the cosine similarity measure:
similarity = cos(v1, v2) = v1 * v2 / (|v1| |v2|)
where v1 * v2 is the dot product of v1 and v2:
v1 * v2 = v1[1]*v2[1] + v1[2]*v2[2] + ... + v1[n]*v2[n]
Essentially, the dot product shows how many elements in both vectors have a 1 at the same position: if v1[k] == 1 and v2[k] == 1, the final sum (and thus the similarity) is increased; otherwise it isn't changed.
You can use the dot product itself, but sometimes you want the final similarity to be normalized, e.g. to lie between 0 and 1. In that case you divide the dot product of v1 and v2 by their lengths, |v1| and |v2|. Essentially, a vector's length is the square root of the dot product of the vector with itself:
|v| = sqrt(v[1]*v[1] + v[2]*v[2] + ... + v[n]*v[n])
Having all of this, it's easy to implement cosine similarity as follows (example in Python):
from math import sqrt

def dot(v1, v2):
    return sum(x * y for x, y in zip(v1, v2))

def length(v):
    return sqrt(dot(v, v))

def sim(v1, v2):
    return dot(v1, v2) / (length(v1) * length(v2))
Note that I described similarity (how close two vectors are), not distance (how far apart they are). If you need a distance, you can calculate it as dist = 1 - sim.
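A quick usage check of the functions above (the definitions are repeated so the snippet is self-contained; the test vectors are mine):

```python
from math import sqrt

def dot(v1, v2):
    return sum(x * y for x, y in zip(v1, v2))

def length(v):
    return sqrt(dot(v, v))

def sim(v1, v2):
    return dot(v1, v2) / (length(v1) * length(v2))

print(sim([1, 1, 0, 1], [1, 1, 0, 0]))  # overlap at positions 0 and 1 -> ~0.816
print(sim([1, 0], [1, 0]))              # identical vectors -> 1.0
```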
Can the cosine similarity when using Locality Sensitive Hashing be -1?
Cosine similarity is the dot product of the vectors divided by the product of their magnitudes, so it's entirely possible to get a negative value for the angle's cosine. For example, if you have unit vectors pointing in opposite directions, the value is -1. I think what's confusing you is the nature of the representation: the other post talks about angles between vectors in 2-D space, whereas it's more common to create vectors in a multidimensional space where the number of dimensions is customarily much greater than 2 and the value of each dimension is non-negative (e.g., a word occurs in a document or it doesn't), resulting in a 0 to 1 range.
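A tiny demonstration of both points, using a plain cosine similarity helper (the example vectors are mine):

```python
from math import sqrt

def cos_sim(a, b):
    # Dot product divided by the product of the magnitudes.
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return num / den

print(cos_sim([1, 0], [-1, 0]))       # opposite unit vectors -> -1.0
print(cos_sim([1, 1, 0], [0, 1, 1]))  # non-negative vectors -> 0.5, stays in [0, 1]
```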
What's the relationship between Hamming distance and Simple Matching Coefficient?
Hamming / length = 1 - SMC is a very strong relationship; because of it, the two measures are equivalent in their capabilities.
Your argument about "looking at the whole data set" is wrong: each is defined on a pair of objects.
The point of this exercise is to practise your basic math skills and transform equations into one another. This is a skill you will need frequently:
- you don't need to explore equivalent functions; one is enough
- of equivalent functions, one may be more efficient to compute than another
- of equivalent functions, one may be more precise than another due to floating-point math.
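As a sanity check of the Hamming/SMC identity above, a small Python sketch (the example vectors are mine):

```python
def hamming(x, y):
    # Count of positions where the two vectors differ.
    return sum(a != b for a, b in zip(x, y))

def smc(x, y):
    # Simple Matching Coefficient: fraction of positions that agree.
    return sum(a == b for a, b in zip(x, y)) / len(x)

x = [1, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]
# Hamming / length = 1 - SMC (2/5 = 1 - 3/5 here)
assert abs(hamming(x, y) / len(x) - (1 - smc(x, y))) < 1e-12
```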
LSH for fast NN similarity search based on hamming distance?
Hamming distance is equivalent to L1 (Manhattan) distance restricted to boolean vectors.
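For boolean vectors this is easy to verify directly (a minimal sketch; the example vectors are mine):

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def l1(x, y):
    # Manhattan (L1) distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = [1, 0, 1, 1], [0, 0, 1, 0]
# On 0/1 vectors each differing position contributes exactly 1 to both.
assert hamming(x, y) == l1(x, y)  # both are 2 here
```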
Hamming distance measure for pvclust
Try this
hamming <- function(x) {
    x <- as.matrix(x)
    y <- (1 - x) %*% t(x)
    res <- y + t(y)
    res <- as.dist(res)
    attr(res, "method") <- "hamming"
    return(res)
}
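The matrix trick behind this code — y[i][j] counts positions where row i is 0 and row j is 1, so y + t(y) counts all mismatches — can be checked in pure Python (the example matrix is mine):

```python
def pairwise_hamming(rows):
    # Direct definition: positions where two binary rows differ.
    n = len(rows)
    return [[sum(a != b for a, b in zip(rows[i], rows[j])) for j in range(n)]
            for i in range(n)]

def matrix_trick(rows):
    # y[i][j] = sum_k (1 - rows[i][k]) * rows[j][k]; the result is y + y^T,
    # mirroring the R code: res <- y + t(y).
    n = len(rows)
    y = [[sum((1 - rows[i][k]) * rows[j][k] for k in range(len(rows[i])))
          for j in range(n)] for i in range(n)]
    return [[y[i][j] + y[j][i] for j in range(n)] for i in range(n)]

m = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
assert pairwise_hamming(m) == matrix_trick(m)
```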
Cosine similarity when one of vectors is all zeros
If you have 0 vectors, cosine is the wrong similarity function for your application.
Cosine distance is essentially equivalent to squared Euclidean distance on L_2 normalized data. I.e. you normalize every vector to unit length 1, then compute squared Euclidean distance.
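To see the equivalence concretely: on unit-length vectors, squared Euclidean distance equals 2 * (1 - cosine similarity). A quick check (the example vectors are arbitrary):

```python
from math import sqrt

def normalize(v):
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos_sim(a, b):
    # Dot product; the vectors are already unit-length after normalize().
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 1.0, 2.0])
b = normalize([1.0, 0.0, 4.0])
sq_euclid = sum((x - y) ** 2 for x, y in zip(a, b))
assert abs(sq_euclid - 2 * (1 - cos_sim(a, b))) < 1e-12
```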
The other benefit of cosine is performance: computing it on very sparse, high-dimensional data is faster than Euclidean distance, because it benefits from sparsity quadratically, not just linearly.
While you obviously can try to hack the similarity to be 0 when exactly one is zero, and maximal when they are identical, it won't really solve the underlying problems.
Don't choose the distance by what you can easily compute.
Instead, choose the distance such that the result has a meaning on your data. If the value is undefined, you don't have a meaning...
Sometimes, it may work to discard constant-0 data as meaningless data anyway (e.g. analyzing Twitter noise, and seeing a Tweet that is all numbers, no words). Sometimes it doesn't.
How to obtain complexity cosine similarity in Matlab?
Better to end the outer loop at 49, and add a second index to sim so the inner loop doesn't overwrite the same entry on every pass:
for j = 1:49
    x = dat(j,:);
    for i = j+1:50
        y = dat(i,:);
        c = dot(x,y);
        sim(j,i) = c / (norm(x,2) * norm(y,2));  % one entry per pair (j,i)
    end
end
The complexity should then be roughly O(n^2), shouldn't it?
Maybe you should have a look at correlation functions ... I don't fully understand what you want to compute, but it looks like something similar. There are built-in correlation functions in Matlab.
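For reference, the same pairwise loop can be sketched in plain Python (the toy data stands in for dat); it visits n*(n-1)/2 pairs, hence the roughly O(n^2) cost:

```python
from math import sqrt

def cosine(x, y):
    return sum(a * b for a, b in zip(x, y)) / (
        sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

dat = [[1.0, 2.0], [2.0, 4.0], [1.0, 0.0]]  # toy stand-in for the 50-row matrix
n = len(dat)
sims = {}
for j in range(n - 1):           # mirrors "for j = 1:49"
    for i in range(j + 1, n):    # mirrors "for i = j+1:50"
        sims[(j, i)] = cosine(dat[j], dat[i])

print(sims[(0, 1)])  # rows 0 and 1 are parallel -> ~1.0
```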