Merging Two Data Frames Using Fuzzy/Approximate String Matching in R

Merging two Data Frames using Fuzzy/Approximate String Matching in R

Approximate string matching is not a good idea since an incorrect match would invalidate the whole analysis. If the names from each source is the same each time, then building indexes seems the best option to me too. This is easily done in R:

Suppose you have the data:

a<-data.frame(name=c('Ace','Bayes'),price=c(10,13))
b<-data.frame(name=c('Ace Co.','Bayes Inc.'),qty=c(9,99))

Build an index of names for each source one time, perhaps using pmatch etc. as a starting point and then validating manually.

a.idx<-data.frame(name=c('Ace','Bayes'),idx=c(1,2))
b.idx<-data.frame(name=c('Ace Co.','Bayes Inc.'), idx=c(1,2))

Then for each run merge using:

a.rich<-merge(a,a.idx,by="name")
b.rich<-merge(b,b.idx,by="name")
merge(a.rich,b.rich,by="idx")

Which would give us:

  idx name.x price     name.y qty
1   1    Ace    10    Ace Co.   9
2   2  Bayes    13 Bayes Inc.  99

R: fuzzy merge two data frame

This works:

library(fuzzyjoin)
pqr <- pqr %>% stringdist_inner_join(abc, by = c(bin1 = "bin1"))

How can I match fuzzy match strings from two datasets?

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.

Not familiar with AGREP but here's example using stringdist for your distance matrix.

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable, 
  # 1 for already assigned, -1 for unassigned and unassignable
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1] 
    b_sel <- b[d==min_d & a == a_sel & x==0][1] 
    x[a==a_sel & b == b_sel] <- 1
    x[x==0 & (a==a_sel|b==b_sel)] <- -1
  }
  cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

Produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333

Edit: use method="jw" to produce desired results. See help("stringdist-package")

Merging two Data frames with fuzzy merge/sqldf

One way is to first create "+/- 30 day" columns in one of them, then do a standard date-range join. Using sqldf:

Prep:

library(dplyr)
df11 <- mutate(df11, Date_m30 = Date %m-% days(30), Date_p30 = Date %m+% days(30))
df11
# # A tibble: 8 x 9
#   UserID Full.Name   Info  EncounterID Date                Temp  misc   Date_m30            Date_p30           
#    <int> <chr>       <chr>       <int> <dttm>              <chr> <chr>  <dttm>              <dttm>             
# 1      1 John Smith  yes            13 2021-01-02 00:00:00 100   (null) 2020-12-03 00:00:00 2021-02-01 00:00:00
# 2      2 Jack Peters no             14 2021-01-05 00:00:00 103   no     2020-12-06 00:00:00 2021-02-04 00:00:00
# 3      3 Bob Brown   yes            15 2021-01-01 00:00:00 104   (null) 2020-12-02 00:00:00 2021-01-31 00:00:00
# 4      4 Jane Doe    yes            16 2021-01-05 00:00:00 103   (null) 2020-12-06 00:00:00 2021-02-04 00:00:00
# 5      5 Jackie Jane yes            17 2021-05-09 00:00:00 101   (null) 2021-04-09 00:00:00 2021-06-08 00:00:00
# 6      6 Sarah Brown yes            18 2021-05-08 00:00:00 102   (null) 2021-04-08 00:00:00 2021-06-07 00:00:00
# 7      7 Chloe Brown no             19 2021-12-12 00:00:00 103   (null) 2021-11-12 00:00:00 2022-01-11 00:00:00
# 8      1 John Smith  yes            13 2021-12-11 00:00:00 105   (null) 2021-11-11 00:00:00 2022-01-10 00:00:00

The join:

sqldf::sqldf("
    select df11.*, df22.DOB, df22.EncounterDate, df22.Type, df22.responses
    from df11
      left join df22 on df11.UserID = df22.UserID
        and df22.EncounterDate between df11.Date_m30 and df11.Date_p30") %>%
  select(-Date_m30, -Date_p30)
#   UserID   Full.Name Info EncounterID                Date Temp   misc     DOB       EncounterDate  Type responses
# 1      1  John Smith  yes          13 2021-01-01 19:00:00  100 (null)  1/1/90 2020-12-31 19:00:00 Intro    (null)
# 2      2 Jack Peters   no          14 2021-01-04 19:00:00  103     no 1/10/90 2021-01-01 19:00:00 Intro        no
# 3      3   Bob Brown  yes          15 2020-12-31 19:00:00  104 (null)  1/2/90 2020-12-31 19:00:00 Intro       yes
# 4      4    Jane Doe  yes          16 2021-01-04 19:00:00  103 (null) 2/20/80 2021-01-05 19:00:00 Intro        no
# 5      5 Jackie Jane  yes          17 2021-05-08 20:00:00  101 (null)  2/2/80 2021-05-06 20:00:00  Care        no
# 6      6 Sarah Brown  yes          18 2021-05-07 20:00:00  102 (null) 12/2/80 2021-05-07 20:00:00   Out     unsat
# 7      7 Chloe Brown   no          19 2021-12-11 19:00:00  103 (null)    <NA>                <NA>  <NA>      <NA>
# 8      1  John Smith  yes          13 2021-12-10 19:00:00  105 (null)    <NA>                <NA>  <NA>      <NA>

R: Fuzzy merge using agrep and data.table

A possible solution using 'fuzzyjoin':

library(fuzzyjoin)
f <- Vectorize(function(x,y) agrepl(x, y,
                                   ignore.case=TRUE,
                                   max.distance = 0.05, useBytes = TRUE))

dt1 %>% fuzzy_inner_join(dt2, by="Name", match_fun=f)
#          Name.x A          Name.y B
#1   ASML HOLDING 1 ASML HOLDING NV p
#2 ABN AMRO GROUP 2  ABN AMRO GROUP q

NOTE : The main problem, that you encountered too, was that agrep and agrepl don't seem to expect the first argument to be a vector. That's the reason why I wrapped the call with Vectorize.

This method can be used together with an equi-join (mind the order of columns in the by!):

dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))

dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)

Merging through fuzzy matching of variables in R

The agrep function (part of base R), which does approximate string matching using the Levenshtein edit distance is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution. But this is a suggestion... It records matches in a separate list (if there are multiple equally good matches, then these are recorded as well). Let's say that your data.frame is called df:

l <- vector('list',nrow(df))
matches <- list(mother = l,father = l)
for(i in 1:nrow(df)){
  father_id <- with(df,which(student_name[i] == father_name))
  if(length(father_id) == 1){
    matches[['father']][[i]] <- father_id
  } else {
    old_father_id <- NULL
    ## try to find the total                                                                                                                                 
    for(m in 10:1){ ## m is the maximum distance                                                                                                             
      father_id <- with(df,agrep(student_name[i],father_name,max.dist = m))
      if(length(father_id) == 1 || m == 1){
        ## if we find a unique match or if we are in our last round, then stop                                                                               
        matches[['father']][[i]] <- father_id
        break
      } else if(length(father_id) == 0 && length(old_father_id) > 0) {
        ## if we can't do better than multiple matches, then record them anyway                                                                              
        matches[['father']][[i]] <- old_father_id
        break
      } else if(length(father_id) == 0 && length(old_father_id) == 0) {
        ## if the nearest match is more than 10 different from the current pattern, then stop                                                                
        break
      }
    }
  }
}

The code for the mother_name would be basically the same. You could even put them together in a loop, but this example is just for the purpose of illustration.

R fuzzy string match to return specific column based on matched string

You are 90% of the way there...

You say you want to

know with which row of data the string was matched from df2

You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

In other words, amatch gives you the index for the row in df2 (which is your table) that is the closest match of each address in df1 (which is your x). You are prematurely wrapping this index by returning the new address instead.

Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.

Illustration of both approaches:

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
  amatch(str, strVec, method = "dl", maxDist=n,useBytes = T) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) | nchar(Address1>0), # I would exclude only on NA_character_ but also empty string, perhaps string of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2,3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) | nchar(Address1>0),
    df2_pos := match_pos(Address1, df2$Address2,3) ] 
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]

is it possible to do fuzzy match merge with python pandas?

Similar to @locojay suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]: 
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]: 
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

If these were columns, in the same vein you could apply to the column then merge:

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)

Merging Two Data Frames Using Fuzzy/Approximate String Matching in R