Fast Levenshtein Distance in R

Fast Levenshtein distance in R?

levenshteinDist (from the RecordLinkage package) calls compiled C code. Give it a try.
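
A minimal sketch of how that call could look, checked against base R's adist. The example words are made up for illustration, and the RecordLinkage call is guarded so the snippet still runs where the package is not installed.

```r
# Toy inputs (illustrative only)
words_a <- c("kitten", "flaw", "hello")
words_b <- c("sitting", "lawn", "hello")

# Base R reference: adist() returns the full 3 x 3 matrix,
# so the diagonal holds the element-wise distances.
base_dist <- diag(adist(words_a, words_b))
print(base_dist)

# levenshteinDist() compares the two vectors element by element.
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  rl_dist <- RecordLinkage::levenshteinDist(words_a, words_b)
  print(all(rl_dist == base_dist))
}
```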

Running levenshtein is taking more time in R

Edit: The original answer assumed both string vectors were the same length when creating the tibble.

Here's an approach that compares two vectors of 1000 strings (1M combinations). How long are the columns you are working with? If they are much longer, and assuming you need to compare every element of one to every element of the other, it may require a different approach.

library(tidyverse); library(stringdist)
set.seed(42)
Response1 <- stringi::stri_rand_strings(1000, 6)
Response2 <- stringi::stri_rand_strings(1000, 6)

# EDIT: should work for vectors of different lengths
combos <- expand.grid(Response1, Response2, stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  # Levenshtein distance divided by the average length of the two strings
  # (note the parentheses: the sum is halved, not just nchar(Var2))
  mutate(distance = stringdist(Var1, Var2, method = "lv") /
           ((nchar(Var1) + nchar(Var2)) / 2)) %>%
  filter(distance < 0.4)
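
The same normalization can be sketched in base R on a toy example small enough to check by hand. The vectors r1 and r2 are made up for illustration, and adist stands in for stringdist here so that no extra packages are needed.

```r
# Toy inputs (illustrative only)
r1 <- c("apple", "banana", "cherry")
r2 <- c("apples", "bananas", "grape", "chery")

# Pairwise Levenshtein distances: rows follow r1, columns follow r2
d <- adist(r1, r2)

# Average length of each pair: (nchar(r1[i]) + nchar(r2[j])) / 2
avg_len <- outer(nchar(r1), nchar(r2), "+") / 2

# Normalized distance, then keep only the close pairs
norm <- d / avg_len
hits <- which(norm < 0.4, arr.ind = TRUE)

result <- data.frame(
  Var1     = r1[hits[, "row"]],
  Var2     = r2[hits[, "col"]],
  distance = norm[hits],
  stringsAsFactors = FALSE
)
print(result)
```

For example, "apple" and "apples" differ by one edit and have an average length of 5.5, so their normalized distance is 1 / 5.5, which passes the 0.4 threshold.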

R - return n matches via levenshtein distance

If I understand the question, the following does what you want.

First I will rerun the line of code that creates dist.mat.ad, since your code had an error: it refers to columns named address.full when they are actually named address.

dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)

Now the results you want. (dist.mat.nm is assumed to be the name distance matrix already computed in your code.)

imat <- apply(dist.mat.nm, 1, order)[1:5, ]
top.nm <- data.frame(name = source1$name)
tmp <- apply(imat, 1, function(i) source2$name[i])
colnames(tmp) <- paste("top", 1:5, sep = ".")
top.nm <- cbind(top.nm, tmp)

imat <- apply(dist.mat.ad, 1, order)[1:5, ]
top.ad <- data.frame(address = source1$address)
tmp <- apply(imat, 1, function(i) source2$address[i])
colnames(tmp) <- paste("top", 1:5, sep = ".")
top.ad <- cbind(top.ad, tmp)

The results are in top.nm and top.ad.

Final clean up.

rm(imat, tmp)
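
The same idea can be wrapped in a small helper and tried on toy data. This is only a sketch: top_n_matches and the example names are hypothetical, not part of the original code, and the distance matrix is built with adist as above.

```r
# Hypothetical helper: for each row of dist_mat, return the n closest candidates.
top_n_matches <- function(dist_mat, candidates, n = 5) {
  n <- min(n, length(candidates))
  # One column per row of dist_mat, holding the indices of its n nearest candidates
  idx <- apply(dist_mat, 1, function(d) order(d)[seq_len(n)])
  idx <- matrix(idx, nrow = n)  # keep the matrix shape even when n == 1
  out <- matrix(candidates[t(idx)], ncol = n)
  colnames(out) <- paste("top", seq_len(n), sep = ".")
  out
}

# Toy data (illustrative only)
source1 <- data.frame(name = c("Jon Smith", "Mary Jones"),
                      stringsAsFactors = FALSE)
source2 <- data.frame(name = c("John Smith", "Jon Smyth", "Marie Jones",
                               "Mary Johns", "Bob Stone", "M. Jones"),
                      stringsAsFactors = FALSE)

dist.mat.nm <- adist(source1$name, source2$name, ignore.case = TRUE)
top.nm <- cbind(data.frame(name = source1$name, stringsAsFactors = FALSE),
                top_n_matches(dist.mat.nm, source2$name, n = 3))
print(top.nm)
```

Ties are kept in the order in which the candidates appear in source2, because order leaves tied elements in their original order.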

Very Fast string fuzzy matching in R

You could try the stringdist package.

It's written in C, uses parallel processing, and offers various distance metrics, including Levenshtein distance.

library(stringdist)

a <- c("hello", "allo", "hola")
b <- c("hello", "allo", "hola")

start_time <- Sys.time()
res <- stringdistmatrix(a, b, method = "lv")
end_time <- Sys.time()

> end_time - start_time
Time difference of 0.006981134 secs
> res
     [,1] [,2] [,3]
[1,]    0    2    3
[2,]    2    0    3
[3,]    3    3    0

# Ignore self-matches, then take the smallest distance in each row
diag(res) <- NA
apply(res, 1, FUN = min, na.rm = TRUE)
[1] 2 2 3
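
If you also need to know which string is the closest one, not only how far away it is, which.min can be applied to the same matrix. The sketch below uses base R's adist so it runs without extra packages; for these inputs it gives the same distances as the stringdistmatrix output shown above.

```r
a <- c("hello", "allo", "hola")

# Pairwise Levenshtein distances within a single vector
res <- adist(a)
diag(res) <- NA  # a string should not match itself

# Index of the nearest other string in each row (which.min skips NA)
nearest_idx <- apply(res, 1, which.min)

nearest <- data.frame(
  string   = a,
  nearest  = a[nearest_idx],
  distance = res[cbind(seq_along(a), nearest_idx)],
  stringsAsFactors = FALSE
)
print(nearest)
```

When several candidates are equally close, which.min returns the first one.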

Levenshtein / edit distance for arbitrary sequences

You can use intToUtf8 to map your integers to Unicode characters:

a2 <- intToUtf8(a)
b2 <- intToUtf8(b)

adist(a2, b2)
#      [,1]
# [1,]    1
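
A self-contained sketch of this trick with made-up integer sequences follows. The helper seq_to_string and its offset argument are hypothetical additions: the offset shifts every value away from 0 and the control characters, and the check keeps the result below the surrogate range, which intToUtf8 cannot encode.

```r
# Hypothetical helper: encode a sequence of non-negative integers as one string,
# one Unicode character per element.
seq_to_string <- function(x, offset = 0x100L) {
  stopifnot(all(x >= 0), all(x + offset < 0xD800L))
  intToUtf8(x + offset)
}

# Toy sequences (illustrative only)
s1 <- c(0, 1, 2, 3)
s2 <- c(0, 1, 9, 3)  # one substitution relative to s1
s3 <- c(0, 2, 3)     # one deletion relative to s1

d_sub <- adist(seq_to_string(s1), seq_to_string(s2))
d_del <- adist(seq_to_string(s1), seq_to_string(s3))
print(d_sub)
print(d_del)
```

Each element becomes exactly one character, so the edit distance between the strings equals the edit distance between the original sequences.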

