Similarity Scores Based on String Comparison in R (Edit Distance)

The function adist computes the Levenshtein edit distance between two strings. This can be transformed into a similarity metric as 1 - (Levenshtein edit distance / longer string length).
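A minimal sketch of that transformation for a single pair of strings (the helper name adist_sim is ours, not part of any package):

# adist() is base R (utils) and returns a matrix of edit distances;
# similarity = 1 - distance / length of the longer string
adist_sim <- function(str1, str2) {
    1 - drop(adist(str1, str2)) / max(nchar(str1), nchar(str2))
}
> adist_sim("apple", "aaple")
[1] 0.8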

The levenshteinSim function in the RecordLinkage package also does this directly, and might be faster than adist.

> library(RecordLinkage)
> levenshteinSim("apple", "apple")
[1] 1
> levenshteinSim("apple", "aaple")
[1] 0.8
> levenshteinSim("apple", "appled")
[1] 0.8333333
> levenshteinSim("appl", "apple")
[1] 0.8

ETA: Interestingly, while levenshteinDist in the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is considerably slower than either. Using the rbenchmark package:

> benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)
                                         test replications elapsed relative user.self sys.self user.child sys.child
1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1     3.583    0.452          0         0
> benchmark(adist("applesauce", "aaplesauce"), replications=100000)
                                test replications elapsed relative user.self sys.self user.child sys.child
1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707    0.461          0         0
> benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)
                                        test replications elapsed relative user.self sys.self user.child sys.child
1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1      6.49    0.743          0         0

This overhead is due simply to the code for levenshteinSim, which is just a wrapper around levenshteinDist:

> levenshteinSim
function (str1, str2)
{
    return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1),
        nchar(str2))))
}

FYI: if you are always comparing two strings rather than vectors, you can create a new version that uses max instead of pmax and shave ~25% off the running time:

mylevsim <- function (str1, str2)
{
    return(1 - (levenshteinDist(str1, str2)/max(nchar(str1),
        nchar(str2))))
}
> benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)
                                  test replications elapsed relative user.self sys.self user.child sys.child
1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987    0.627          0         0

Long story short: there is little difference between adist and levenshteinDist in terms of performance, though the former is preferable if you don't want to add a package dependency. How you turn the distance into a similarity measure, however, does have a noticeable effect on performance.

Calculating string similarity as a percentage

You can use the levenshteinSim function from the RecordLinkage package, i.e.

#This gives the similarity
RecordLinkage::levenshteinSim('abc', 'abcd')
#[1] 0.75

#so to get the distance just subtract from 1,
1 - RecordLinkage::levenshteinSim('abc', 'abcd')
#[1] 0.25
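
If you literally want a percentage rather than a fraction, multiplying by 100 is all that remains; this last step is not part of the original answer:

#and multiply by 100 for a percentage
100 * RecordLinkage::levenshteinSim('abc', 'abcd')
#[1] 75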

How to measure similarity between strings?

This can be done based on, for example, the Levenshtein distance. There are multiple implementations of it in different packages. Some solutions and packages can be found in the answers to these questions:

  • agrep: only return best match(es)
  • In R, how do I replace a string that contains a certain pattern with another string?
  • Fast Levenshtein distance in R?

But most often agrep will do what you want:
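
(The input vector pres is not shown in the original answer; reconstructed from the output below, it would look something like this:)

# assumed input, reconstructed from the agrep output below
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.")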

> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3

$`Bush, G.W.`
[1] 2

$`Obama, B.H.`
[1] 1 3

$`Clinton, W.J.`
[1] 4

Calculate edit distance percentage

You can construct a matrix of the pairwise maximum lengths with outer and pmax, then coerce it to the dist class (the same class as edit_dist) so that the two can be divided element-wise:

library(stringdist)

# 'sequence' is the asker's character vector of gapped sequences;
# '-' characters are stripped before measuring string length
edit_dist <- stringdistmatrix(sequence)
n <- nchar(gsub('-', '', sequence))

edit_dist / as.dist(outer(n, n, pmax))
##          1        2        3
## 2 0.000000
## 3 0.812500 0.812500
## 4 1.076923 1.076923 0.687500

Compare a list of strings with each other in R

So, I think this might be what you want. The RecordLinkage package was not on CRAN at the time of writing, so I went for another package that calculates the Levenshtein distance:

library(stringdist)

sample <- c('apple', 'appeal', 'apparel', 'peel', 'peer', 'pear')

df <- expand.grid(sample, sample) # this creates a dataframe of all combinations of the sample elements

stringdist(df$Var1, df$Var2, method = "lv")

Output:

[1] 0 3 3 4 4 4 3 0 3 3 4 3 3 3 0 4 5 4 4 3 4 0 1 2 4 4 5 1 0 1 4 3 4 2 1 0
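
If a labelled matrix is easier to read than this flat vector, the same package's stringdistmatrix function arranges the identical numbers pairwise; the dimnames call below is only there for labelling and is not part of the original answer:

# same Levenshtein distances, arranged as a labelled 6 x 6 matrix
# (reuses 'sample' and library(stringdist) from above)
m <- stringdistmatrix(sample, sample, method = "lv")
dimnames(m) <- list(sample, sample)
m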

And, maybe a little more appealing, the dplyr version:

library(dplyr)

df %>%
  mutate(levenshtein = stringdist(Var1, Var2, method = "lv"))

which outputs

     Var1  Var2 levenshtein
1   apple apple           0
2  appeal apple           3
3 apparel apple           3
4    peel apple           4
5    peer apple           4
6    pear apple           4
...

