Similarity scores based on string comparison in R (edit distance)
The function adist computes the Levenshtein edit distance between two strings. This can be transformed into a similarity metric as 1 - (Levenshtein edit distance / longer string length).
The levenshteinSim function in the RecordLinkage package also does this directly, and might be faster than adist.
library(RecordLinkage)
> levenshteinSim("apple", "apple")
[1] 1
> levenshteinSim("apple", "aaple")
[1] 0.8
> levenshteinSim("apple", "appled")
[1] 0.8333333
> levenshteinSim("appl", "apple")
[1] 0.8
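For comparison, the same score can be sketched in base R with adist alone, following the formula 1 - (distance / longer string length). The helper name lev_sim is hypothetical and not part of any package; it assumes two single strings as input.

```r
# Hypothetical helper (not from any package): Levenshtein similarity
# computed as 1 - distance / length of the longer string.
lev_sim <- function(a, b) {
  d <- drop(adist(a, b))           # adist() returns a 1 x 1 matrix for two single strings
  1 - d / max(nchar(a), nchar(b))  # single strings only, so max() is enough here
}

lev_sim("apple", "apple")   # 1
lev_sim("apple", "aaple")   # 0.8
lev_sim("appl", "apple")    # 0.8
```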
ETA: Interestingly, while levenshteinDist in the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is considerably slower than either. Using the rbenchmark package:
> benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)
                                         test replications elapsed relative user.self sys.self user.child sys.child
1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1     3.583    0.452          0         0
> benchmark(adist("applesauce", "aaplesauce"), replications=100000)
                               test replications elapsed relative user.self sys.self user.child sys.child
1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707    0.461          0         0
> benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)
                                        test replications elapsed relative user.self sys.self user.child sys.child
1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1      6.49    0.743          0         0
This overhead is due simply to the code for levenshteinSim, which is just a wrapper around levenshteinDist:
> levenshteinSim
function (str1, str2)
{
return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1),
nchar(str2))))
}
FYI: if you are always comparing two single strings rather than vectors, you can create a new version that uses max instead of pmax and shave ~25% off the running time:
mylevsim = function (str1, str2)
{
return(1 - (levenshteinDist(str1, str2)/max(nchar(str1),
nchar(str2))))
}
> benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)
                                  test replications elapsed relative user.self sys.self user.child sys.child
1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987    0.627          0         0
Long story short: there is little difference between adist and levenshteinDist in terms of performance, though the former is preferable if you don't want to add package dependencies. How you turn the distance into a similarity measure does have a bit of an effect on performance.
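Note that swapping pmax for max is only safe when each argument is a single string. A small base R sketch of the difference, using made-up vectors a and b:

```r
# Made-up example vectors: two pairs of strings to compare
a <- c("apple", "kiwi")
b <- c("aaple", "kiwifruit")

# pmax() works element-wise: one "longer length" per pair
pmax(nchar(a), nchar(b))   # 5 9

# max() collapses everything to a single number, which would be the
# wrong denominator for the first pair
max(nchar(a), nchar(b))    # 9
```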
Calculating string similarity as a percentage
You can use the levenshteinSim function from the RecordLinkage package, i.e.
#This gives the similarity
RecordLinkage::levenshteinSim('abc', 'abcd')
#[1] 0.75
#so to get the distance just subtract from 1,
1 - RecordLinkage::levenshteinSim('abc', 'abcd')
#[1] 0.25
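If RecordLinkage is not available, a base R sketch along the same lines could use adist and format the result as a percentage. The variable names here are illustrative only, and the code assumes two single strings.

```r
# Base R sketch: similarity of two single strings, shown as a percentage
a <- "abc"
b <- "abcd"

sim <- 1 - drop(adist(a, b)) / max(nchar(a), nchar(b))
sim                                   # 0.75

pct <- sprintf("%.0f%%", 100 * sim)   # format as a percentage string
pct                                   # "75%"
```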
How to measure similarity between strings?
This can be done based on, e.g., the Levenshtein distance. There are multiple implementations of it in different packages. Some solutions and packages can be found in the answers to these questions:
- agrep: only return best match(es)
- In R, how do I replace a string that contains a certain pattern with another string?
- Fast Levenshtein distance in R?
But most often agrep will do what you want:
> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3
$`Bush, G.W.`
[1] 2
$`Obama, B.H.`
[1] 1 3
$`Clinton, W.J.`
[1] 4
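The pres vector itself is not shown in the answer. A self-contained sketch, assuming a character vector of names similar to those in the output above, might look like this:

```r
# Hypothetical input; the original `pres` vector is not shown in the answer
pres <- c("Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.")

# For each name, the indices of the elements of `pres` that approximately
# match it. simplify = FALSE always returns a named list.
matches <- sapply(pres, agrep, pres, simplify = FALSE)
matches

# value = TRUE returns the matching strings instead of their indices
agrep("Obama, B.", pres, value = TRUE)
```

The max.distance argument of agrep controls how many edits are tolerated, so it is worth tuning if the default matches too much or too little.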
Calculate edit distance percentage
You can construct a matrix of the pairwise maximum lengths with outer and pmax, which you can then coerce to the dist class (like edit_dist) so you can divide element by element:
library(stringdist)
# `sequence` is the character vector of sequences from the question
edit_dist <- stringdistmatrix(sequence)
n <- nchar(gsub('-', '', sequence))
edit_dist / as.dist(outer(n, n, pmax))
##          1        2        3
## 2 0.000000
## 3 0.812500 0.812500
## 4 1.076923 1.076923 0.687500
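The same idea can be sketched in base R without stringdist, using adist for the distances. The seqs vector below is made up for illustration and is not the data from the question.

```r
# Made-up sequences for illustration
seqs <- c("ACGT", "ACGA", "ACGTAC")

d <- adist(seqs)                # pairwise Levenshtein distances (3 x 3 matrix)
n <- nchar(seqs)
longest <- outer(n, n, pmax)    # longest[i, j] = max(n[i], n[j])

# Divide element by element, then keep the lower triangle as a dist object
norm_dist <- as.dist(d / longest)
norm_dist
```

Because each distance is divided by the length of the longer string in the pair, the result lies between 0 (identical) and 1 as long as the lengths are taken from the same strings that were compared.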
Compare a list of strings with each other in R
So, I think this might be what you want. The RecordLinkage package was not on CRAN when this was written, so I went for another package that calculates the Levenshtein distance:
library(stringdist)
sample <- c('apple', 'appeal', 'apparel', 'peel', 'peer', 'pear')
df <- expand.grid(sample, sample) # this creates a dataframe of all combinations of the sample elements
stringdist(df$Var1, df$Var2, method = "lv")
Output:
[1] 0 3 3 4 4 4 3 0 3 3 4 3 3 3 0 4 5 4 4 3 4 0 1 2 4 4 5 1 0 1 4 3 4 2 1 0
And maybe a little more appealing, the dplyr version:
library(dplyr)
df %>%
  mutate(levenshtein = stringdist(Var1, Var2, method = "lv"))
which outputs:
     Var1  Var2 levenshtein
1   apple apple           0
2  appeal apple           3
3 apparel apple           3
4    peel apple           4
5    peer apple           4
6    pear apple           4
...
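If a square matrix is more convenient than a long data frame, base R's adist offers a possible alternative that needs no extra packages. This is a sketch using the same words; the variable names are illustrative.

```r
words <- c("apple", "appeal", "apparel", "peel", "peer", "pear")

d <- adist(words)                    # 6 x 6 matrix of Levenshtein distances
dimnames(d) <- list(words, words)    # label rows and columns with the words
d

d["peel", "peer"]      # 1
d["apple", "appeal"]   # 3
```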