Fast Levenshtein distance in R?
levenshteinDist (from the RecordLinkage package) calls compiled C code. Give it a try.
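A minimal sketch of what that might look like, assuming RecordLinkage is installed and that levenshteinDist compares its two character vectors element by element. The example strings are made up, and base R's adist is used only as a reference to check against.

```r
# Toy inputs (hypothetical, not from the question)
x <- c("kitten", "flaw")
y <- c("sitting", "lawn")

# Base R reference: adist() returns the full matrix, so take the diagonal
# to get the element-wise distances x[i] vs y[i]
base_d <- diag(adist(x, y))
print(base_d)

# Only run the package call if RecordLinkage is available
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  rl_d <- RecordLinkage::levenshteinDist(x, y)
  # Should agree with the base R result if the assumption above holds
  print(all(rl_d == base_d))
}
```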
Running levenshtein is taking more time in R
Edit: Original answer assumed both string vectors were same length when creating tibble.
Here's an approach that compares two vectors of 1000 strings (1M combinations). How long are the columns you are working with? If they are much longer, and assuming you need to compare every element of one to every element of the other, it may require a different approach.
library(tidyverse); library(stringdist)
set.seed(42)
Response1 = stringi::stri_rand_strings(1000, 6)
Response2 = stringi::stri_rand_strings(1000, 6)
# EDIT, should work for different length vectors
combos <- expand.grid(Response1, Response2, stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  # Levenshtein distance divided by the average length of the two strings
  # (note the parentheses: the sum is halved, not just nchar(Var2))
  mutate(distance = stringdist(Var1, Var2, method = "lv") /
           ((nchar(Var1) + nchar(Var2)) / 2)) %>%
  filter(distance < 0.4)
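For a quick check of the length-normalized distance without any packages, here is a sketch in base R. It swaps in adist and outer in place of expand.grid and stringdist, and the input strings are made up for illustration.

```r
# Toy inputs (hypothetical); the two vectors have different lengths on purpose
r1 <- c("kitten", "apple")
r2 <- c("sitting", "apply", "mitten")

# Full distance matrix: rows follow r1, columns follow r2
d <- adist(r1, r2)

# Average length of every pair, in the same 2 x 3 layout
avg_len <- outer(nchar(r1), nchar(r2), "+") / 2

# Normalized distance, then keep the pairs under the 0.4 threshold
norm <- d / avg_len
hits <- which(norm < 0.4, arr.ind = TRUE)

close_pairs <- data.frame(
  Var1     = r1[hits[, "row"]],
  Var2     = r2[hits[, "col"]],
  distance = norm[hits],
  stringsAsFactors = FALSE
)
print(close_pairs)
```

The matrix form avoids materializing every pair as a row, but it still holds length(r1) * length(r2) numbers in memory, so it does not remove the scaling problem for very long vectors.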
R - return n matches via levenshtein distance
If I understand the question, the following does what you want.
First I will rerun the line that creates dist.mat.ad, since your code had an error: it refers to a column named address.full when the columns are actually named address.
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)
Now the results you want.
imat <- apply(dist.mat.nm, 1, order)[1:5, ]
top.nm <- data.frame(name = source1$name)
tmp <- apply(imat, 1, function(i) source2$name[i])
colnames(tmp) <- paste("top", 1:5, sep = ".")
top.nm <- cbind(top.nm, tmp)
imat <- apply(dist.mat.ad, 1, order)[1:5, ]
top.ad <- data.frame(address = source1$address)
tmp <- apply(imat, 1, function(i) source2$address[i])
colnames(tmp) <- paste("top", 1:5, sep = ".")
top.ad <- cbind(top.ad, tmp)
The results are in top.nm and top.ad.
Final clean up.
rm(imat, tmp)
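The same top-n idea can be tried end to end on toy data. This is only a sketch: the name vectors are made up, top_matches is a hypothetical helper (not part of any package), and it assumes the second vector has more than one element and at least n of them.

```r
# Hypothetical helper: for each element of x, return the n closest
# elements of y by generalized Levenshtein distance (base R adist)
top_matches <- function(x, y, n = 2) {
  d <- adist(x, y, ignore.case = TRUE)
  # order() per row of d; the result has one column per element of x
  idx <- apply(d, 1, order)[seq_len(n), , drop = FALSE]
  # Look up the matched strings and put one row per element of x
  out <- t(matrix(y[idx], nrow = n))
  colnames(out) <- paste("top", seq_len(n), sep = ".")
  data.frame(name = x, out, stringsAsFactors = FALSE)
}

# Toy data (assumed for illustration)
src1 <- c("john smith", "mary jones")
src2 <- c("jon smith", "mary jone", "peter pan")

res_top <- top_matches(src1, src2, n = 2)
print(res_top)
```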
Very Fast string fuzzy matching in R
You could try the stringdist package.
It's written in C, uses parallel processing, and offers various distance metrics, including Levenshtein distance.
library(stringdist)
a<-as.character(c("hello","allo","hola"))
b<-as.character(c("hello","allo","hola"))
start_time <- Sys.time()
res <- stringdistmatrix(a,b, method = "lv")
end_time <- Sys.time()
> end_time - start_time
Time difference of 0.006981134 secs
> res
     [,1] [,2] [,3]
[1,]    0    2    3
[2,]    2    0    3
[3,]    3    3    0
diag(res) <- NA
apply(res, 1, FUN = min, na.rm = T)
[1] 2 2 3
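The row minima give the distance to the nearest neighbour but not which string it is. A small sketch of recovering that, written with base R's adist so it runs without extra packages (an assumption for portability; the same indexing should apply to the matrix returned by stringdistmatrix):

```r
a <- c("hello", "allo", "hola")

# Pairwise Levenshtein distances; blank out the diagonal (self-matches)
d <- adist(a)
diag(d) <- NA

# Index of the closest other string for each row (which.min skips NA,
# and returns the first index when there is a tie)
nearest_idx <- unname(apply(d, 1, which.min))

nearest <- data.frame(
  string   = a,
  nearest  = a[nearest_idx],
  distance = d[cbind(seq_along(a), nearest_idx)],
  stringsAsFactors = FALSE
)
print(nearest)
```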
Levenshtein / edit distance for arbitrary sequences
You can use intToUtf8 to map your integers to Unicode characters:
a2 <- intToUtf8(a)
b2 <- intToUtf8(b)
adist(a2, b2)
# [,1]
# [1,] 1
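Since a and b are not defined in the snippet, here is a self-contained sketch with made-up integer sequences. It assumes every value is a positive integer that is a valid Unicode code point; sequences with other values would first need remapping (for example via match against the sorted unique values).

```r
# Hypothetical integer sequences that differ in one position
seq_a <- c(100L, 200L, 300L, 400L)
seq_b <- c(100L, 200L, 350L, 400L)

# Each sequence becomes one string with one character per integer
str_a <- intToUtf8(seq_a)
str_b <- intToUtf8(seq_b)

# Edit distance between the two sequences (one substitution here)
seq_dist <- drop(adist(str_a, str_b))
print(seq_dist)
```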