Merging Two Data Frames Using Fuzzy/Approximate String Matching in R

Merging two Data Frames using Fuzzy/Approximate String Matching in R

Approximate string matching is not a good idea here, since an incorrect match would invalidate the whole analysis. If the names from each source are the same on every run, then building indexes seems the best option to me too. This is easily done in R:

Suppose you have the data:

a<-data.frame(name=c('Ace','Bayes'),price=c(10,13))
b<-data.frame(name=c('Ace Co.','Bayes Inc.'),qty=c(9,99))

Build an index of names for each source one time, perhaps using pmatch etc. as a starting point and then validating manually.

a.idx<-data.frame(name=c('Ace','Bayes'),idx=c(1,2))
b.idx<-data.frame(name=c('Ace Co.','Bayes Inc.'), idx=c(1,2))
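As a rough sketch of the "pmatch as a starting point" step, the draft index could be generated like this and then checked by hand. The `draft` table and its column names are illustrative assumptions, not part of the original answer.

```r
a <- data.frame(name = c("Ace", "Bayes"), price = c(10, 13))
b <- data.frame(name = c("Ace Co.", "Bayes Inc."), qty = c(9, 99))

# pmatch() looks each a$name up among b$name and accepts unique prefix matches.
pos <- pmatch(a$name, b$name)

# Draft index for manual review; NA in b_name means no unique prefix match.
draft <- data.frame(a_name = a$name,
                    b_name = b$name[pos],
                    idx    = seq_along(pos))
print(draft)
```

Rows that come back as NA, or that look wrong on inspection, would be corrected manually before the index tables are saved for later runs.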

Then for each run merge using:

a.rich<-merge(a,a.idx,by="name")
b.rich<-merge(b,b.idx,by="name")
merge(a.rich,b.rich,by="idx")

Which would give us:

  idx name.x price     name.y qty
1   1    Ace    10    Ace Co.   9
2   2  Bayes    13 Bayes Inc.  99

R: fuzzy merge two data frame

This works:

library(dplyr)      # provides %>%
library(fuzzyjoin)
pqr <- pqr %>% stringdist_inner_join(abc, by = c(bin1 = "bin1"))
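To make the idea of a distance-threshold inner join concrete, here is a minimal base R sketch. It is not fuzzyjoin's implementation: it uses adist (plain Levenshtein), whereas fuzzyjoin relies on the stringdist package's metrics, which may score some pairs differently. The example data and the `max_dist` value are made up for illustration.

```r
# Hypothetical data; 'bin1' mirrors the column name used above.
pqr <- data.frame(bin1 = c("apple", "banana", "cherry"), x = 1:3,
                  stringsAsFactors = FALSE)
abc <- data.frame(bin1 = c("appel", "bananna", "kiwi"), y = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

max_dist <- 2                                   # assumed tolerance
D <- adist(pqr$bin1, abc$bin1)                  # rows: pqr, columns: abc
hits <- which(D <= max_dist, arr.ind = TRUE)    # all pairs within tolerance

joined <- data.frame(
  bin1.x = pqr$bin1[hits[, "row"]],
  bin1.y = abc$bin1[hits[, "col"]],
  x      = pqr$x[hits[, "row"]],
  y      = abc$y[hits[, "col"]],
  dist   = D[hits],
  stringsAsFactors = FALSE
)
print(joined)
```

Every pair within the tolerance is kept, so one row can match several rows on the other side, which is also how an inner join behaves.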

How can I match fuzzy match strings from two datasets?

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before, I've treated it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution you'd be better off with optim or an exact assignment solver (for example, the Hungarian algorithm).

I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (your favorite heuristic here)
greedyAssign <- function(a, b, d) {
  # Assignment status per candidate pair: 0 = unassigned but assignable,
  # 1 = assigned, -1 = unassigned and no longer assignable
  x <- numeric(length(a))
  while (any(x == 0)) {
    min_d <- min(d[x == 0])  # closest remaining pair; the first is taken on ties
    a_sel <- a[d == min_d & x == 0][1]
    b_sel <- b[d == min_d & a == a_sel & x == 0][1]
    x[a == a_sel & b == b_sel] <- 1
    x[x == 0 & (a == a_sel | b == b_sel)] <- -1
  }
  cbind(a = a[x == 1], b = b[x == 1], d = d[x == 1])
}
data.frame(greedyAssign(as.character(d$a_name), as.character(d$b_name), d$dist))

Produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
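One possibly more compact formulation works on the wide distance matrix instead of the long form. This is only a sketch under assumptions: the names are reconstructed from the outputs shown in this answer, and the distance is base R's adist (Levenshtein) scaled by the longer string's length rather than the "jw" method used above, so the distance values differ even though the pairs agree.

```r
# Names reconstructed from the outputs in this answer (an assumption).
a_names <- c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays")
b_names <- c("Ace Co.", "Bayes Inc.", "asdf")

# Levenshtein distance divided by the length of the longer string.
D <- adist(a_names, b_names) / outer(nchar(a_names), nchar(b_names), pmax)

greedy_assign <- function(D) {
  n <- min(dim(D))                 # at most one match per row and per column
  rows <- integer(n); cols <- integer(n); dist <- numeric(n)
  for (k in seq_len(n)) {
    best <- arrayInd(which.min(D), dim(D))   # first minimum wins ties
    rows[k] <- best[1]; cols[k] <- best[2]; dist[k] <- D[best]
    D[best[1], ] <- Inf            # this row is used up
    D[, best[2]] <- Inf            # and so is this column
  }
  data.frame(row = rows, col = cols, d = dist)
}

res <- greedy_assign(D)
assignment <- data.frame(a = a_names[res$row], b = b_names[res$col], d = res$d,
                         stringsAsFactors = FALSE)
print(assignment)
```

Blanking out the chosen row and column with Inf plays the same role as the -1 "unassignable" flag in the long-form version.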

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333
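For comparison, the many-to-one case can also be sketched on a wide distance matrix with a row-wise which.min. The names below are reconstructed from the table above (an assumption), and the distance is a length-scaled adist rather than "jw", so only the chosen pairs, not the distance values, should be expected to line up. Unlike the by() version, which.min keeps only the first candidate when there is a tie.

```r
a_names <- c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays")
b_names <- c("Ace Co.", "Bayes Inc.", "asdf")

# Levenshtein distance divided by the length of the longer string.
D <- adist(a_names, b_names) / outer(nchar(a_names), nchar(b_names), pmax)

# For every name in a, the column index of its closest name in b.
nearest <- apply(D, 1, which.min)

many_to_one <- data.frame(
  a_name = a_names,
  b_name = b_names[nearest],
  dist   = D[cbind(seq_along(nearest), nearest)],
  stringsAsFactors = FALSE
)
print(many_to_one)
```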

Edit: use method="jw" to produce desired results. See help("stringdist-package")

Merging two Data frames with fuzzy merge/sqldf

One way is to first create "+/- 30 day" columns in one of them, then do a standard date-range join. Using sqldf:

Prep:

library(dplyr)
library(lubridate)  # provides %m-%, %m+% and days()
df11 <- mutate(df11, Date_m30 = Date %m-% days(30), Date_p30 = Date %m+% days(30))
df11
# # A tibble: 8 x 9
#   UserID Full.Name   Info  EncounterID Date                Temp  misc   Date_m30            Date_p30
#    <int> <chr>       <chr>       <int> <dttm>              <chr> <chr>  <dttm>              <dttm>
# 1      1 John Smith  yes            13 2021-01-02 00:00:00 100   (null) 2020-12-03 00:00:00 2021-02-01 00:00:00
# 2      2 Jack Peters no             14 2021-01-05 00:00:00 103   no     2020-12-06 00:00:00 2021-02-04 00:00:00
# 3      3 Bob Brown   yes            15 2021-01-01 00:00:00 104   (null) 2020-12-02 00:00:00 2021-01-31 00:00:00
# 4      4 Jane Doe    yes            16 2021-01-05 00:00:00 103   (null) 2020-12-06 00:00:00 2021-02-04 00:00:00
# 5      5 Jackie Jane yes            17 2021-05-09 00:00:00 101   (null) 2021-04-09 00:00:00 2021-06-08 00:00:00
# 6      6 Sarah Brown yes            18 2021-05-08 00:00:00 102   (null) 2021-04-08 00:00:00 2021-06-07 00:00:00
# 7      7 Chloe Brown no             19 2021-12-12 00:00:00 103   (null) 2021-11-12 00:00:00 2022-01-11 00:00:00
# 8      1 John Smith  yes            13 2021-12-11 00:00:00 105   (null) 2021-11-11 00:00:00 2022-01-10 00:00:00

The join:

sqldf::sqldf("
select df11.*, df22.DOB, df22.EncounterDate, df22.Type, df22.responses
from df11
left join df22 on df11.UserID = df22.UserID
and df22.EncounterDate between df11.Date_m30 and df11.Date_p30") %>%
select(-Date_m30, -Date_p30)
#   UserID   Full.Name Info EncounterID                Date Temp   misc     DOB       EncounterDate  Type responses
# 1      1  John Smith  yes          13 2021-01-01 19:00:00  100 (null)  1/1/90 2020-12-31 19:00:00 Intro    (null)
# 2      2 Jack Peters   no          14 2021-01-04 19:00:00  103     no 1/10/90 2021-01-01 19:00:00 Intro        no
# 3      3   Bob Brown  yes          15 2020-12-31 19:00:00  104 (null)  1/2/90 2020-12-31 19:00:00 Intro       yes
# 4      4    Jane Doe  yes          16 2021-01-04 19:00:00  103 (null) 2/20/80 2021-01-05 19:00:00 Intro        no
# 5      5 Jackie Jane  yes          17 2021-05-08 20:00:00  101 (null)  2/2/80 2021-05-06 20:00:00  Care        no
# 6      6 Sarah Brown  yes          18 2021-05-07 20:00:00  102 (null) 12/2/80 2021-05-07 20:00:00   Out     unsat
# 7      7 Chloe Brown   no          19 2021-12-11 19:00:00  103 (null)    <NA>                <NA>  <NA>      <NA>
# 8      1  John Smith  yes          13 2021-12-10 19:00:00  105 (null)    <NA>                <NA>  <NA>      <NA>
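The same "within 30 days" window can be sketched without SQL, using only base R. The two small tables below are hypothetical stand-ins for df11 and df22, and the helper column `row_id` is introduced purely for this illustration.

```r
# Hypothetical miniature versions of the two tables.
visits <- data.frame(
  UserID = c(1, 2, 1),
  Date   = as.Date(c("2021-01-02", "2021-01-05", "2021-12-11"))
)
encounters <- data.frame(
  UserID        = c(1, 2),
  EncounterDate = as.Date(c("2021-01-01", "2021-01-02"))
)

visits$row_id <- seq_len(nrow(visits))   # remember the original rows

# Step 1: equi-join on UserID gives every same-user candidate pair.
cand <- merge(visits, encounters, by = "UserID")

# Step 2: keep the pairs whose dates are at most 30 days apart.
hits <- cand[abs(as.numeric(cand$Date - cand$EncounterDate)) <= 30, ]

# Step 3: left join back so visits without a nearby encounter get NA.
result <- merge(visits, hits[, c("row_id", "EncounterDate")],
                by = "row_id", all.x = TRUE)
print(result)
```

As with the SQL version, a visit that falls within 30 days of several encounters would appear once per matching encounter.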

R: Fuzzy merge using agrep and data.table

A possible solution using 'fuzzyjoin':

library(dplyr)      # provides %>% and filter()
library(fuzzyjoin)
f <- Vectorize(function(x, y) agrepl(x, y,
                                     ignore.case = TRUE,
                                     max.distance = 0.05, useBytes = TRUE))

dt1 %>% fuzzy_inner_join(dt2, by = "Name", match_fun = f)
#           Name.x A          Name.y B
# 1   ASML HOLDING 1 ASML HOLDING NV p
# 2 ABN AMRO GROUP 2  ABN AMRO GROUP q

NOTE: The main problem, which you encountered too, is that agrep and agrepl expect a single pattern as their first argument, not a vector. That's why I wrapped the call with Vectorize.
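A small base R illustration of the difference, using the company names from this answer (the variable names `patterns`, `targets`, `one_pattern` and `pairwise` are just for this sketch):

```r
patterns <- c("ASML HOLDING", "ABN AMRO GROUP")
targets  <- c("ASML HOLDING NV", "ABN AMRO GROUP")

# agrepl() compares ONE pattern against every element of a vector.
one_pattern <- agrepl(patterns[1], targets, max.distance = 0.05)

# Vectorize() makes the comparison element-wise: patterns[i] vs targets[i].
f <- Vectorize(function(x, y) agrepl(x, y, ignore.case = TRUE,
                                     max.distance = 0.05))
pairwise <- unname(f(patterns, targets))

print(one_pattern)
print(pairwise)
```

The element-wise form is what fuzzy_inner_join needs, since it hands match_fun two equally long vectors of candidate pairs.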

This method can be used together with an equi-join (mind the order of columns in the by!):

dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))

dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)

Merging through fuzzy matching of variables in R

The agrep function (part of base R), which does approximate string matching using the generalized Levenshtein edit distance, is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution, but here is a suggestion. It records matches in a separate list (if there are multiple equally good matches, these are recorded as well). Let's say that your data.frame is called df:

l <- vector('list', nrow(df))
matches <- list(mother = l, father = l)
for (i in seq_len(nrow(df))) {
  ## first look for an exact match
  father_id <- with(df, which(student_name[i] == father_name))
  if (length(father_id) == 1) {
    matches[['father']][[i]] <- father_id
  } else {
    old_father_id <- NULL
    ## tighten the tolerance step by step until the match is unique
    for (m in 10:1) { ## m is the maximum distance
      father_id <- with(df, agrep(student_name[i], father_name, max.distance = m))
      if (length(father_id) == 1 || m == 1) {
        ## if we find a unique match or if we are in our last round, then stop
        matches[['father']][[i]] <- father_id
        break
      } else if (length(father_id) == 0 && length(old_father_id) > 0) {
        ## if we can't do better than multiple matches, then record them anyway
        matches[['father']][[i]] <- old_father_id
        break
      } else if (length(father_id) == 0 && length(old_father_id) == 0) {
        ## if the nearest match is more than 10 different from the current pattern, then stop
        break
      }
      ## remember this round's candidates before tightening further
      old_father_id <- father_id
    }
  }
}

The code for the mother_name would be basically the same. You could even put them together in a loop, but this example is just for the purpose of illustration.
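To get a feel for how agrep reacts to max.distance before running the loop on real data, a tiny experiment can help. The names below are hypothetical and exist only for this sketch.

```r
# Hypothetical names, for illustration only.
father_name <- c("John Smith", "Jane Smyth", "Bob Brown")

# At most 2 edits allowed: "John Smith" is one insertion away from
# "Jon Smith", while "Jane Smyth" would need three edits.
tight <- agrep("Jon Smith", father_name, max.distance = 2)

# A looser tolerance can keep or add candidates, but never drops one.
loose <- agrep("Jon Smith", father_name, max.distance = 4)

print(tight)
print(loose)
```

This is the behaviour the loop relies on: lowering m shrinks the candidate set until it is unique or empty.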

R fuzzy string match to return specific column based on matched string

You are 90% of the way there...

You say you want to

know with which row of data the string was matched from df2

You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

In other words, amatch gives you the index for the row in df2 (which is your table) that is the closest match of each address in df1 (which is your x). You are prematurely wrapping this index by returning the new address instead.

Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.

Illustration of both approaches:

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos <- function(str, strVec, n) {
  amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
# I would exclude not only NA_character_ but also empty strings, perhaps strings of length < 3
df1[!is.na(Address1) & nchar(Address1) > 0,
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2, 3)]]
merge(df1, df2, by = 'unique_id', all.x = TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) & nchar(Address1) > 0,
    df2_pos := match_pos(Address1, df2$Address2, 3)]
df1[!is.na(df2_pos), c('Address2', 'Year2', 'UniqueID') := df2[df2_pos, .(Address2, Year2, unique_id)]][]
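The "keep the index, then look things up" idea can also be tried without any packages. The sketch below is an approximation under assumptions: base R's adist (plain Levenshtein) stands in for amatch with method = "dl", the cutoff of 3 mimics maxDist = 3, and the small `lookup` table and the third test string are invented for illustration.

```r
x <- c("45-B, GALI NO5, XYZ", "786, GALI NO 5, XYZ", "completely different text")
lookup <- data.frame(
  Address2  = c("abc, pqr, xyz", "786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ"),
  unique_id = sprintf("%06d", 1:3),
  stringsAsFactors = FALSE
)

D   <- adist(x, lookup$Address2)      # rows: x, columns: lookup addresses
pos <- apply(D, 1, which.min)         # row index of the closest address
pos[apply(D, 1, min) > 3] <- NA       # no match if even the closest is too far

matched <- data.frame(Address1  = x,
                      Address2  = lookup$Address2[pos],
                      unique_id = lookup$unique_id[pos],
                      stringsAsFactors = FALSE)
print(matched)
```

Because `pos` is a row index into the lookup table, any column of that table can be pulled across, which is the point the answer makes about not returning the matched address string directly.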

Is it possible to do fuzzy match merge with Python pandas?

Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]:
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]:
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

If these were columns, in the same vein you could apply to the column then merge:

import difflib
from pandas import DataFrame

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

# [0] takes the best candidate; it raises IndexError if nothing clears difflib's similarity cutoff
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)

