R fuzzy string match to return specific column based on matched string
You are 90% of the way there...
You say you want to "know with which row of data the string was matched from df2".
You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.
In other words, amatch gives you the index of the row in df2 (which is your table) that is the closest match to each address in df1 (which is your x). You are prematurely consuming this index by returning the new address instead.
Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.
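To see the index-then-lookup pattern in isolation, here is a minimal sketch of my own using base R's match() (exact matching) as a stand-in for amatch(); the mechanics of returning a position and then using it for lookup are the same:

```r
# match(), like amatch(), returns a position in the lookup table
table_vec <- c("apple", "banana", "cherry")
idx <- match("banana", table_vec)
idx             # 2
table_vec[idx]  # "banana" -- use the position to retrieve the matched row
```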
Illustration of both approaches:
library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater
# Return the position in strVec of the closest match to str
match_pos <- function(str, strVec, n) {
  amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE) # are you sure you want useBytes = TRUE?
}
# Option 1: use unique_id as a key for a left join
df1[!is.na(Address1) & nchar(Address1) > 0, # exclude not only NA but also empty strings, perhaps even strings shorter than 3 characters
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2, 3)]]
merge(df1, df2, by = 'unique_id', all.x = TRUE) # see ?merge for more options
# Option 2: use the row index
df1[!is.na(Address1) & nchar(Address1) > 0,
    df2_pos := match_pos(Address1, df2$Address2, 3)]
df1[!is.na(df2_pos), c('Address2', 'Year2', 'UniqueID') := df2[df2_pos, .(Address2, Year2, unique_id)]][]
How to fuzzy match by words (not letters) in R?
This is a tidyverse way to do the join. For each name in A, it finds the nombre_completo in B that has the highest number of common words.
library(tidyverse)
A1 <- tibble(
  nombre_completo = c("martin gallardo", "raul gimenez")
) %>%
  mutate(id_A = row_number())

B1 <- tibble(
  nombre_completo = c("martin ricardo gallardo", "gimenez raul"),
  other_data = c("A", "B")
) %>%
  mutate(id_B = row_number())

A2 <- A1 %>%
  mutate(name_words = str_split(nombre_completo, pattern = " ")) %>%
  unnest(cols = c(name_words))

B2 <- B1 %>%
  mutate(name_words = str_split(nombre_completo, pattern = " ")) %>%
  unnest(cols = c(name_words)) %>%
  select(name_words, id_B)

left_join(A2, B2, by = "name_words") %>%
  group_by(nombre_completo, id_A, id_B) %>%
  count() %>%
  ungroup() %>%
  group_by(nombre_completo, id_A) %>%
  slice_max(order_by = n) %>%
  select("nombre_completo_A" = nombre_completo, id_A, id_B) %>%
  left_join(B1, by = "id_B")
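The core of this approach is simply counting shared words between two names; here is a minimal base-R sketch of that step (my own illustration, not part of the answer's pipeline):

```r
# Count the words two names have in common
common_words <- function(x, y) {
  length(intersect(strsplit(x, " ")[[1]], strsplit(y, " ")[[1]]))
}
common_words("martin gallardo", "martin ricardo gallardo")  # 2
common_words("martin gallardo", "gimenez raul")             # 0
```

The pipeline above does the same thing at scale by unnesting one row per word and counting join hits per (id_A, id_B) pair.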
Replace string with most frequent fuzzy match
This dplyr pipe returns a data frame with 9 rows, one for each of the unique elements in your original words vector. First we group_by the raw column, which creates a group for each unique word; then we filter by your distance threshold; then we find the corresponding word in clean with the highest frequency in the original dataset. In your example all words match themselves except for the two variants of "dog."
Code
words_df %>%
  group_by(raw) %>%
  filter(dist < 0.085) %>%
  summarize(clean = clean[which.max(count)])
Output
# A tibble: 9 x 2
raw clean
<chr> <chr>
1 cat cat
2 con con
3 croak croak
4 cry cry
5 dog dog
6 dogg dog
7 dogy dog
8 don don
9 dot dot
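For readers avoiding dplyr, the same selection can be written in base R. The words_df below is a guessed miniature of the question's data, assuming one row per (raw, clean) candidate pair with a distance and the clean word's corpus frequency:

```r
# Hypothetical miniature of words_df: candidate (raw, clean) pairs
words_df <- data.frame(
  raw   = c("dogg", "dogg", "dog"),
  clean = c("dog",  "dot",  "dog"),
  dist  = c(0.05,   0.08,   0.00),
  count = c(102,    12,     102)
)
keep <- words_df[words_df$dist < 0.085, ]
# For each raw word, pick the clean word with the highest frequency
res <- do.call(rbind, lapply(split(keep, keep$raw), function(g)
  data.frame(raw = g$raw[1], clean = g$clean[which.max(g$count)])))
res  # both "dog" and "dogg" map to "dog"
```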
How can I match fuzzy match strings from two datasets?
The solution depends on the desired cardinality of your matching of a to b. If it's one-to-one, you will get the three closest matches below. If it's many-to-one, you will get six.
One-to-one case (requires assignment algorithm):
When I've had to do this before, I treat it as an assignment problem: build a distance matrix and apply an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution, you'd be better off with a dedicated assignment solver (the Hungarian algorithm, e.g. clue::solve_LSAP) than with a greedy heuristic.
I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.
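The answer assumes the question's data frames a and b, which are not shown; here is a reconstruction consistent with the outputs below, so the snippet runs standalone:

```r
# Question's data frames (reconstructed from the outputs shown below)
a <- data.frame(name = c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"))
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"))
```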
library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)
# Greedy assignment heuristic (substitute your favorite heuristic here)
greedyAssign <- function(a, b, d) {
  x <- numeric(length(a)) # assignment variable: 0 = unassigned but assignable,
                          # 1 = already assigned, -1 = unassigned and unassignable
  while (any(x == 0)) {
    min_d <- min(d[x == 0]) # identify closest pair, arbitrarily selecting the 1st if multiple pairs tie
    a_sel <- a[d == min_d & x == 0][1]
    b_sel <- b[d == min_d & a == a_sel & x == 0][1]
    x[a == a_sel & b == b_sel] <- 1
    x[x == 0 & (a == a_sel | b == b_sel)] <- -1
  }
  cbind(a = a[x == 1], b = b[x == 1], d = d[x == 1])
}
data.frame(greedyAssign(as.character(d$a_name), as.character(d$b_name), d$dist))
Produces the assignment:
a b d
1 Ace Co Ace Co. 0.04762
2 Bayes Bayes Inc. 0.16667
3 asd asdf 0.08333
I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
Many-to-one case (not an assignment problem):
do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))
Produces the result:
a_name b_name dist
1 Ace Co Ace Co. 0.04762
11 Baes Bayes Inc. 0.20000
8 Bayes Bayes Inc. 0.16667
12 Bays Bayes Inc. 0.20000
10 Bcy Bayes Inc. 0.37778
15 asd asdf 0.08333
Edit: use method = "jw" to produce the desired results. See help("stringdist-package").
Selecting columns based on matching/fuzzy matching value of two columns
The following approach uses foreach instead of your two nested for loops, which should make the computation much faster on large data frames. The foreach vignettes give a nice overview of the package.
library(foreach)
library(stringdist)
match.cond <- function(ij, df1, df2) {                        ## 1.
  i <- floor((ij - 1) / nrow(df2)) + 1
  j <- ij - (i - 1) * nrow(df2)
  if (ain(df2$Arranger[j], df1$Agent[[i]], maxDist = 0.3, method = "jw") &&
      df1$Year[i] == df2$Year[j]) {
    return(df2[j, c('Rank', 'Share', 'Issues')])
  }
}
leadrep <- foreach(ij = 1:(nrow(df1) * nrow(df2)), .combine = rbind) %do%  ## 2.
  match.cond(ij, df1, df2)
Notes:
1. match.cond is your match condition encapsulated in a kernel function that foreach applies over all pairings of rows from df1 and df2. Its inputs are ij, an index to the pairings, and the two data frames. Within match.cond:
- ij is converted to the row indices i for df1 and j for df2,
- your condition is evaluated, and if the condition is met,
- the columns from df2 for the matched row are returned.
2. This is the foreach call.
- We loop over the indices ij from 1 to nrow(df1)*nrow(df2), which enumerates all pairings of rows from df1 and df2, and %do% the match.cond function. Note that this is all one line.
- The .combine=rbind argument states that we want to gather all results from match.cond and bind them as rows.
- This returns the data frame leadrep.
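The index arithmetic in note 1 can be sanity-checked on its own in base R (here n2 plays the role of nrow(df2), using the 3-row df2 from the example):

```r
# Map a single pairing index ij back to the row pair (i, j)
n2 <- 3
ij_to_pair <- function(ij) {
  i <- floor((ij - 1) / n2) + 1
  j <- ij - (i - 1) * n2
  c(i = i, j = j)
}
ij_to_pair(1)  # i = 1, j = 1 (first row of df1 vs first row of df2)
ij_to_pair(4)  # i = 2, j = 1 (second row of df1 vs first row of df2)
ij_to_pair(6)  # i = 2, j = 3 (second row of df1 vs last row of df2)
```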
I've tested on your data, which I've dput:
df1 <- structure(list(Year = c(1999, 1999, 1998), Agent = list(c("abn-amro-nv",
"suntrust banks", "wachovia"), c("jp morgan", "abn-amro-nv"),
c("ba-corp", "boston bks", "nbd"))), .Names = c("Year", "Agent"
), row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(Rank = 1:3, Arranger = c("jp morgan", "boston-bank",
"suntrust bk"), Share = c(1.2, 1.8, 2.1), Issues = c(7L, 4L,
3L), Year = c(1999L, 1998L, 1999L)), .Names = c("Rank", "Arranger",
"Share", "Issues", "Year"), class = "data.frame", row.names = c(NA,
-3L))
which gives me the desired result:
print(leadrep)
## Rank Share Issues
##3 3 2.1 3
##2 1 1.2 7
##21 2 1.8 4
Hope this helps.
Limiting the amount of fuzzy string comparisons by comparing by subgroup
You were on the right track - just a few typos/bugs, and you need to finish changing/replacing the column names.
Also, in your first one, you will need to figure out how you want to pick the "best match" based on Municipality.dist, Province.dist, and Year.dist.
Maybe the second one works better if you get the years and provinces sorted out first.
DT1 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002), Municipality = c("Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing"), Values = c(0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99, 0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99, 0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))
DT2 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002), Municipality = c("Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None", "Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None", "Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None"), `Other Values` = c(0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))
library(fuzzyjoin); library(dplyr)
stringdist_join(DT1, DT2,
by = c("Municipality", "Year", "Province"),
mode = "left",
ignore_case = TRUE,
method = "jw",
max_dist = 10,
distance_col = "dist") %>%
group_by(Municipality.x) %>%
slice_min(Municipality.dist)
#> # A tibble: 135 x 12
#> # Groups: Municipality.x [3]
#> Province.x Year.x Municipality.x Values Province.y Year.y Municipality.y
#> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 1 2000 Anything 0.580 1 2000 Anything
#> 2 1 2000 Anything 0.580 1 2001 Anything
#> 3 1 2000 Anything 0.580 1 2002 Anything
#> 4 1 2000 Anything 0.580 2 2000 Anything
#> 5 1 2000 Anything 0.580 2 2001 Anything
#> 6 1 2000 Anything 0.580 2 2002 Anything
#> 7 1 2000 Anything 0.580 3 2000 Anything
#> 8 1 2000 Anything 0.580 3 2001 Anything
#> 9 1 2000 Anything 0.580 3 2002 Anything
#> 10 1 2001 Anything 0.94 1 2000 Anything
#> # ... with 125 more rows, and 5 more variables: `Other Values` <dbl>,
#> # Municipality.dist <dbl>, Province.dist <dbl>, Year.dist <dbl>, dist <lgl>
stringdist_join(DT1, DT2,
by = "Municipality",
mode = "left",
ignore_case = TRUE,
method = "jw",
max_dist = 10,
distance_col = "dist") %>%
group_by(Municipality.x, Year.x, Province.x) %>%
slice_min(dist)
#> # A tibble: 135 x 9
#> # Groups: Municipality.x, Year.x, Province.x [27]
#> Province.x Year.x Municipality.x Values Province.y Year.y Municipality.y
#> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 1 2000 Anything 0.580 1 2000 Anything
#> 2 1 2000 Anything 0.580 1 2001 Anything
#> 3 1 2000 Anything 0.580 1 2002 Anything
#> 4 1 2000 Anything 0.580 2 2000 Anything
#> 5 1 2000 Anything 0.580 2 2001 Anything
#> 6 1 2000 Anything 0.580 2 2002 Anything
#> 7 1 2000 Anything 0.580 3 2000 Anything
#> 8 1 2000 Anything 0.580 3 2001 Anything
#> 9 1 2000 Anything 0.580 3 2002 Anything
#> 10 2 2000 Anything 0.580 1 2000 Anything
#> # ... with 125 more rows, and 2 more variables: `Other Values` <dbl>,
#> # dist <dbl>
Created on 2020-12-07 by the reprex package (v0.3.0)